- tidy-select or data-masking

For every applicable dplyr function discussed in this handbook, I opened its section by stating whether it is a tidy-select or a data-masking function. That information can be found in the help page of most functions, under Arguments, for....

tidy-select and data-masking functions differ in the kind of object they accept, and knowing that difference is important to use each verb correctly. Succinctly put, tidy-select functions accept column positions, while data-masking functions accept vectors of values.

Some optional arguments also use either a tidy-select (like .by/by) or a data-masking (like wt) syntax.
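As a minimal sketch of the two kinds of optional argument, assuming dplyr 1.1.0 or later (needed for .by) and a hypothetical toy tibble called sales standing in for df:

```r
library(dplyr)

# Hypothetical toy data, not the handbook's df
sales <- tibble(
  Country  = c("UK", "UK", "FR"),
  Quantity = c(2, 3, 4),
  Price    = c(1, 2, 5)
)

# .by is tidy-select: it takes a column name, quoted or unquoted
by_name <- sales %>%
  summarise(Total = sum(Quantity), .by = Country)
by_string <- sales %>%
  summarise(Total = sum(Quantity), .by = "Country")

# wt is data-masking: it takes a vector of values,
# here computed on the fly from two columns
weighted <- sales %>%
  count(Country, wt = Quantity * Price)
```

Both summarise() calls should return the same totals per country, while count() sums the weights Quantity * Price within each country.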

- tidy-select syntax

tidy-select functions are

select()
rename()
relocate()
across()
pick()
rowwise()
c_across()
pull()

There are also the optional arguments
.before/.after (for mutate())
and .by/by (for filter(), slice() and its helpers, mutate(), summarise() and reframe()).

These functions accept interchangeably either the name (quoted or unquoted)

df %>%
  select("Invoice")

A tibble: 525461 x 1

df %>%
  select(Invoice)

A tibble: 525461 x 1

or the position of one or several columns.

df %>%
  select(1)

A tibble: 525461 x 1

It is important to note that, even when we use the name, we are always selecting by position: the name is just a way to pinpoint the position of the column.
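One way to see this is to ask tidyselect directly what a selection resolves to. The sketch below assumes the tidyselect package (which dplyr relies on for selections) is installed, and uses a hypothetical three-column data frame called toy in place of df:

```r
# Hypothetical toy data, not the handbook's df
toy <- data.frame(Invoice = "489434", StockCode = "85048", Quantity = 12)

# eval_select() returns the positions a selection resolves to,
# named after the selected columns
pos_name   <- tidyselect::eval_select(quote(Quantity), toy)   # unquoted name
pos_string <- tidyselect::eval_select(quote("Quantity"), toy) # quoted name
pos_number <- tidyselect::eval_select(quote(3), toy)          # position

pos_name
```

All three calls should resolve to the same thing: position 3, labelled Quantity.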

We cannot change the mapping between names and positions from the outside: in case of ambiguity, dplyr gives preference to the mapping inside the data frame over any external one we specified.

Invoice <- 5
df %>%
  select(Invoice)

A tibble: 525461 x 1

So if we want to use the externally specified value of Invoice, we must wrap it in a function call, for example identity(), which is evaluated outside the data frame.

df %>%
  select(Invoice, identity(Invoice))

A tibble: 525461 x 2

The external object can also hold a column name instead of a position.

InvoiceDate <- "Invoice"
df %>%
  select(identity(InvoiceDate))

A tibble: 525461 x 1

This applies to single-column selections like the aforementioned Invoice, to ranges like Invoice:InvoiceDate

df %>%
  select(Invoice:InvoiceDate)

A tibble: 525461 x 5

df %>%
  select(Invoice:identity(Invoice))

A tibble: 525461 x 5

and to c("Invoice", "InvoiceDate").

df %>%
  select(c("Invoice", "InvoiceDate"))

A tibble: 525461 x 2

df %>%
  select(c("Invoice", identity(Invoice)))

A tibble: 525461 x 2

With selection helpers we can instead freely utilise external definitions.

first_letter <- "I"
df %>%
  select(starts_with(first_letter))

A tibble: 525461 x 2

If we want to use external vectors we still need to wrap them with identity(),

sel_vars <- c("Invoice", "InvoiceDate")
df %>%
  select(identity(sel_vars))

A tibble: 525461 x 2

or use either any_of()

df %>%
  select(any_of(sel_vars))

A tibble: 525461 x 2

or all_of().

df %>%
  select(all_of(sel_vars))

A tibble: 525461 x 2

Otherwise we will get a warning.

df %>%
  select(sel_vars)

Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
  # Was:
  data %>% select(sel_vars)

  # Now:
  data %>% select(all_of(sel_vars))

See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.

A tibble: 525461 x 2
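any_of() and all_of() are not interchangeable when the external vector contains a name that is missing from the data frame. A minimal sketch, using a hypothetical tibble called toy and a hypothetical column name "Discount" that does not exist in it:

```r
library(dplyr)

# Hypothetical toy data, not the handbook's df
toy <- tibble(
  Invoice     = "489434",
  InvoiceDate = as.Date("2009-12-01"),
  Price       = 6.95
)

# "Discount" is deliberately not a column of toy
wanted <- c("Invoice", "Discount")

# any_of() silently ignores names that are not in the data frame
lenient <- toy %>%
  select(any_of(wanted))

# all_of() is strict and throws an error instead;
# tryCatch() turns that error into a marker string for the example
strict <- tryCatch(
  toy %>% select(all_of(wanted)),
  error = function(e) "error"
)
```

Here lenient keeps only the Invoice column, while strict ends up holding the marker string because all_of() refused the selection.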

- data-masking syntax

data-masking functions are

filter()
arrange()
slice()
mutate()
group_by()
summarise()
reframe()
count()
add_count()
distinct()

Then there are also the arguments
weight_by (for slice_sample()),
order_by (for slice_max() and slice_min())
and wt (for count(), add_count(), tally() and add_tally()).

data-masking lets us refer directly to a column of the data frame we are manipulating,

df %>%
  mutate(Price_Eur_dplyr = Price * 1.14, .after = "Price")

A tibble: 525461 x 9

without the need to specify that the column belongs to df, as we would do in base R with $.

df$Price_Eur_R <- df$Price * 1.14
df

A tibble: 525461 x 9

Notice as well the difference with the tidy-select argument .after, which instead needs a name (quoted or unquoted) or a position.

df %>%
  mutate(Price_Eur_dplyr = Price * 1.14, .after = 6)

A tibble: 525461 x 10

As they accept vectors of values, we can feed data-masking functions with expressions,

df %>%
  filter(as.Date(InvoiceDate) == "2010-04-01")

A tibble: 1719 x 9

without the need to initialize a new column beforehand.

df %>%
  mutate(Invoice_Day = as.Date(InvoiceDate)) %>%
  filter(Invoice_Day == "2010-04-01")

A tibble: 1719 x 10

If we feed them names or positions, these are still treated as vectors of values and therefore, if they have length one, recycled to the size of the data frame.

df %>%
  mutate(Price_Eur_dplyr = "Price_Eur_R", .after = "Price")

A tibble: 525461 x 10

df %>%
  mutate(Price_Eur_dplyr = 9, .after = "Price")

A tibble: 525461 x 10
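The recycling rule can be checked on a small scale. The sketch below uses a hypothetical three-row tibble called toy; it also shows that only length-one values are recycled, so a vector of any other mismatched length is rejected:

```r
library(dplyr)

# Hypothetical toy data, not the handbook's df
toy <- tibble(Price = c(1, 2, 3))

# Length-one values are repeated on every row,
# whether they look like a column name or a position
recycled <- toy %>%
  mutate(Label = "Price", Nine = 9)

# A length-two vector cannot be recycled to three rows, so mutate() errors;
# tryCatch() turns that error into a marker string for the example
mismatch <- tryCatch(
  toy %>% mutate(Bad = c(1, 2)),
  error = function(e) "error"
)
```

In recycled, Label holds the string "Price" three times and Nine holds the number 9 three times; neither refers to an existing column.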

We would also get outright errors if we tried to perform arithmetic operations between strings and numbers.

df %>%
  mutate(Price_Eur_Discounted = "Price_Eur_R" - 1)

Error in `mutate()`:
ℹ In argument: `Price_Eur_Discounted = "Price_Eur_R" - 1`.
Caused by error in `"Price_Eur_R" - 1`:
! non-numeric argument to binary operator

With data-masking we can use data frame columns and external vectors in the same call.

gbp_usd <- 1.27
df %>%
  mutate(Price_Usd = Price * gbp_usd, .after = "Price")

A tibble: 525461 x 10

In case an external vector and a data frame column have the same name, precedence is given to the column,

df %>%
  mutate(gbp_usd = 2,
         Price_Usd = Price * gbp_usd, .after = "Price")

A tibble: 525461 x 11

unless we give precedence to the external vector with the .env pronoun.

df %>%
  mutate(gbp_usd = 2,
         Price_Usd = Price * .env$gbp_usd, .after = "Price")

A tibble: 525461 x 11

The analogous data frame pronoun is .data, which we can use to remove any ambiguity about which vector of values we are using.

df %>%
  mutate(gbp_usd = 2,
         Price_Usd_.data = Price * .data$gbp_usd,
         Price_Usd_.env = Price * .env$gbp_usd, .after = "Price")

A tibble: 525461 x 12
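The two pronouns can be verified on a small example where the numbers are easy to follow. The sketch below uses a hypothetical two-row tibble called prices, and additionally shows the .data[[ ]] form, which looks a column up from a name stored in an external string (col_name is a hypothetical variable introduced for the illustration):

```r
library(dplyr)

# Hypothetical toy data: the column gbp_usd clashes with the external vector
prices <- tibble(Price = c(1, 2), gbp_usd = c(2, 2))
gbp_usd <- 1.27

both <- prices %>%
  mutate(
    From_Column = Price * .data$gbp_usd, # uses the column: 2, 4
    From_Env    = Price * .env$gbp_usd   # uses the external vector: 1.27, 2.54
  )

# .data[[ ]] retrieves a column whose name is held in a string
col_name <- "Price"
total <- prices %>%
  summarise(Total = sum(.data[[col_name]])) # 1 + 2 = 3
```

Writing the pronouns out explicitly is mostly useful inside functions, where we cannot know in advance which names exist in the data frame and which in the calling environment.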