data-masking

- fundamentals

mutate() is used to store the results of various calculation, either into new columns,

df %>%
  mutate(Price_Eur = Price * 1.14)
A tibble: 525461 x 9

or by updating existing ones.

df %>%
  mutate(Price = Price * 1.14)
A tibble: 525461 x 8

The just created columns can be conveniently used again inside the same mutate() call.

df %>%
  mutate(Price_Eur = Price * 1.14,
         Revenue_Eur = Quantity * Price_Eur)
A tibble: 525461 x 10

mutate() permits a high degree of columns manipulation and it can be used with a vast variety of functions. Here for example we are interested in knowing which value is higher, the squared or the doubled one, for every element of the Price column.

df %>%
  mutate(Squared_Price = Price ^ 2,
         Double_Price = Price * 2,
         Max_Price = if_else(pmax(Squared_Price, Double_Price) == Squared_Price, "Squared_Price", "Double_Price"))
A tibble: 525461 x 11

In these previous examples we used calculations that are vectorized.

Vectorized operations are performed element-wise, meaning that the calculation is performed between elements that share the same position (row index in a data frame), getting as many results as the number of them.

c(1, 2, 3) * c(1, 2, 3)
## [1] 1 4 9

If one of the two objects is of length 1 its value will be used again (recycled) along the length of the longer one.

c(1, 2, 3) * 2
## [1] 2 4 6
2 * c(1, 2, 3)
## [1] 2 4 6

Recycling takes place also with vectors of different lengths.

c(1, 2, 3, 4) * c(2, 3)
## [1]  2  6  6 12

But if the longer vector’s length is not a multiple of the smaller one’s, R will issue a warning.

c(1, 2, 3) * c(2, 3)
## Warning in c(1, 2, 3) * c(2, 3): longer object length is not a multiple of
## shorter object length
## [1] 2 6 6

That is something we must take into consideration when using mutate(), as it could produce unwanted results.

df %>%
  mutate(Price_2 = Price * c(1.14, 2))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `Price_2 = Price * c(1.14, 2)`.
Caused by warning in `Price * c(1.14, 2)`:
! longer object length is not a multiple of shorter object length
A tibble: 525461 x 9

When we use an aggregate function, like mean(), we will get the same result for all the rows,

df %>%
  mutate(Avg_Price = mean(Price))
A tibble: 525461 x 9

and with window functions a result for every row.

df %>%
  mutate(Price_Rank = dense_rank(desc(Price)))
A tibble: 525461 x 9

With mutate() we can add new homogeneous columns, where the value specified will be recycled along all the rows.

df %>%
  mutate(DistributionCentre = "London")
A tibble: 525461 x 9

When creating new columns, it is only permitted to recycle vectors of length 1 so trying to add new columns with a length equal to a divisor of the length of the data frame will not work.

df %>%
  slice(1:10) %>%
  mutate(New_Col = c(1, 2))
Error in `mutate()`:
ℹ In argument: `New_Col = c(1, 2)`.
Caused by error:
! `New_Col` must be size 10 or 1, not 2.

As the error states, besides of length 1 we can add columns if their length is equal to the length of the data frame.

df %>%
  slice(1:10) %>%
  mutate(New_Col = c(1:10))
A tibble: 10 x 9

The recycling behavior seen in these past examples is dictated by the fact that mutate() must return the same number of rows as the input.

mutate() can be used to remove existing columns as well, with the NULL operator:

df %>%
  mutate(Price = NULL)
A tibble: 525461 x 7

- .after & .before

tidy-select

By default mutate() places the newly created columns at the end of the data frame, but we can change this behavior using the .after and .before optional tidy-select arguments.

df %>%
  mutate(Price_Eur = Price * 1.14, .after = Price)
A tibble: 525461 x 9
df %>%
  mutate(Price_Eur = Price * 1.14, .before = Price)
A tibble: 525461 x 9

In case we have two new columns and we want them in specific positions we have to break down the call in two though.

df %>%
  mutate(Price_Eur = Price * 1.14,
         .after = Price) %>%
  mutate(Revenue_Eur = Quantity * Price_Eur,
         .after = `Customer ID`)
A tibble: 525461 x 10

- .keep

Using the .keep argument we have agency on which columns to preserve in the output.

all, the default, keeps all the existing columns.

df %>%
  mutate(Price_Eur = Price * 1.14, .keep = "all")
A tibble: 525461 x 9

used preserves only the columns used in the mutate() call,

df %>%
  mutate(Price_Eur = Price * 1.14, .keep = "used")
A tibble: 525461 x 2

plus the grouping columns, if any.

df %>%
  group_by(Country) %>%
  mutate(Price_Eur = Price * 1.14, .keep = "used")
A tibble: 525461 x 3
Groups: Country [40]

unused substitutes the used columns with the newly calculated ones, useful for updating their names and values.

df %>%
  mutate(Price_Eur = Price * 1.14, .keep = "unused")
A tibble: 525461 x 8

none keeps only the new columns,

df %>% 
  mutate(Price_Eur = Price * 1.14, .keep = "none")
A tibble: 525461 x 1

plus any grouping columns, if present.

df %>% 
  group_by(Country) %>%
  mutate(Price_Eur = Price * 1.14, .keep = "none")
A tibble: 525461 x 2
Groups: Country [40]

- with group_by() / .by

If we are using aggregate or window functions, a grouped data frame allows mutate() to perform group-wise calculations, meaning that only the values pertaining to each group are responsible for that group’s output.

The results can greatly differ from when they are performed on an ungrouped data frame.

df %>%
  group_by(Country) %>%
  mutate(Avg_Price = mean(Price), .keep = "none")
A tibble: 525461 x 2
Groups: Country [40]
df %>%
  mutate(Avg_Price = mean(Price), .keep = "none")
A tibble: 525461 x 1
df %>%
  group_by(Country) %>%
  mutate(Price_Rank = dense_rank(desc(Price)), .keep = "none")
A tibble: 525461 x 2
Groups: Country [40]
df %>%
  mutate(Price_Rank = dense_rank(desc(Price)), .keep = "none")
A tibble: 525461 x 1

With element-wise calculations instead the results are the same.

df %>%
  group_by(Country) %>%
  mutate(Price_Eur = Price * 1.14, .keep = "none")
A tibble: 525461 x 2
Groups: Country [40]
df %>%
  mutate(Price_Eur = Price * 1.14, .keep = "none")
A tibble: 525461 x 1

Instead of using group_by() we can specify the .by argument to perform operations on grouped data frames.

df %>%
  mutate(Avg_Price = mean(Price), .by = Country, .keep = "none")
A tibble: 525461 x 2

.by always leaves the data frame ungrouped, and it needs tidy-select syntax if we want to specify more than one column.

df %>%
  mutate(Avg_Price = mean(Price), .by = c("Country", "Customer ID"), .keep = "none")
A tibble: 525461 x 3

That allows to utilize selection helpers.

df %>%
  mutate(Avg_Price = mean(Price), .by = starts_with("C"), .keep = "none")
A tibble: 525461 x 3