data-masking
mutate()
is used to store the results of various
calculation, either into new columns,
%>%
df mutate(Price_Eur = Price * 1.14)
or by updating existing ones.
%>%
df mutate(Price = Price * 1.14)
The just created columns can be conveniently used again inside the
same mutate()
call.
%>%
df mutate(Price_Eur = Price * 1.14,
Revenue_Eur = Quantity * Price_Eur)
mutate()
permits a high degree of columns manipulation
and it can be used with a vast variety of functions. Here for example we
are interested in knowing which value is higher, the squared or the
doubled one, for every element of the Price
column.
%>%
df mutate(Squared_Price = Price ^ 2,
Double_Price = Price * 2,
Max_Price = if_else(pmax(Squared_Price, Double_Price) == Squared_Price, "Squared_Price", "Double_Price"))
In these previous examples we used calculations that are vectorized.
Vectorized operations are performed element-wise, meaning that the calculation is performed between elements that share the same position (row index in a data frame), getting as many results as the number of them.
c(1, 2, 3) * c(1, 2, 3)
## [1] 1 4 9
If one of the two objects is of length 1 its value will be used again (recycled) along the length of the longer one.
c(1, 2, 3) * 2
## [1] 2 4 6
2 * c(1, 2, 3)
## [1] 2 4 6
Recycling takes place also with vectors of different lengths.
c(1, 2, 3, 4) * c(2, 3)
## [1] 2 6 6 12
But if the longer vector’s length is not a multiple of the smaller one’s, R will issue a warning.
c(1, 2, 3) * c(2, 3)
## Warning in c(1, 2, 3) * c(2, 3): longer object length is not a multiple of
## shorter object length
## [1] 2 6 6
That is something we must take into consideration when using
mutate()
, as it could produce unwanted results.
%>%
df mutate(Price_2 = Price * c(1.14, 2))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `Price_2 = Price * c(1.14, 2)`.
Caused by warning in `Price * c(1.14, 2)`:
! longer object length is not a multiple of shorter object length
When we use an aggregate function, like mean()
, we will
get the same result for all the rows,
%>%
df mutate(Avg_Price = mean(Price))
and with window functions a result for every row.
%>%
df mutate(Price_Rank = dense_rank(desc(Price)))
With mutate()
we can add new homogeneous columns, where
the value specified will be recycled along all the rows.
%>%
df mutate(DistributionCentre = "London")
When creating new columns, it is only permitted to recycle vectors of length 1 so trying to add new columns with a length equal to a divisor of the length of the data frame will not work.
%>%
df slice(1:10) %>%
mutate(New_Col = c(1, 2))
Error in `mutate()`:
ℹ In argument: `New_Col = c(1, 2)`.
Caused by error:
! `New_Col` must be size 10 or 1, not 2.
As the error states, besides of length 1 we can add columns if their length is equal to the length of the data frame.
%>%
df slice(1:10) %>%
mutate(New_Col = c(1:10))
The recycling behavior seen in these past examples is dictated by the
fact that mutate()
must return the same number of rows as
the input.
mutate()
can be used to remove existing columns as well,
with the NULL
operator:
%>%
df mutate(Price = NULL)
tidy-select
By default mutate()
places the newly created columns at
the end of the data frame, but we can change this behavior using the
.after
and .before
optional tidy-select
arguments.
%>%
df mutate(Price_Eur = Price * 1.14, .after = Price)
%>%
df mutate(Price_Eur = Price * 1.14, .before = Price)
In case we have two new columns and we want them in specific positions we have to break down the call in two though.
%>%
df mutate(Price_Eur = Price * 1.14,
.after = Price) %>%
mutate(Revenue_Eur = Quantity * Price_Eur,
.after = `Customer ID`)
Using the .keep
argument we have agency on which columns
to preserve in the output.
all
, the default, keeps all the existing columns.
%>%
df mutate(Price_Eur = Price * 1.14, .keep = "all")
used
preserves only the columns used in the
mutate()
call,
%>%
df mutate(Price_Eur = Price * 1.14, .keep = "used")
plus the grouping columns, if any.
%>%
df group_by(Country) %>%
mutate(Price_Eur = Price * 1.14, .keep = "used")
unused
substitutes the used columns with the newly
calculated ones, useful for updating their names and values.
%>%
df mutate(Price_Eur = Price * 1.14, .keep = "unused")
none
keeps only the new columns,
%>%
df mutate(Price_Eur = Price * 1.14, .keep = "none")
plus any grouping columns, if present.
%>%
df group_by(Country) %>%
mutate(Price_Eur = Price * 1.14, .keep = "none")
If we are using aggregate or window functions, a grouped data frame
allows mutate()
to perform group-wise calculations, meaning
that only the values pertaining to each group are responsible for that
group’s output.
The results can greatly differ from when they are performed on an ungrouped data frame.
%>%
df group_by(Country) %>%
mutate(Avg_Price = mean(Price), .keep = "none")
%>%
df mutate(Avg_Price = mean(Price), .keep = "none")
%>%
df group_by(Country) %>%
mutate(Price_Rank = dense_rank(desc(Price)), .keep = "none")
%>%
df mutate(Price_Rank = dense_rank(desc(Price)), .keep = "none")
With element-wise calculations instead the results are the same.
%>%
df group_by(Country) %>%
mutate(Price_Eur = Price * 1.14, .keep = "none")
%>%
df mutate(Price_Eur = Price * 1.14, .keep = "none")
Instead of using group_by()
we can specify the
.by
argument to perform operations on grouped data
frames.
%>%
df mutate(Avg_Price = mean(Price), .by = Country, .keep = "none")
.by
always leaves the data frame ungrouped, and it needs
tidy-select syntax if we want to specify more than one column.
%>%
df mutate(Avg_Price = mean(Price), .by = c("Country", "Customer ID"), .keep = "none")
That allows to utilize selection helpers.
%>%
df mutate(Avg_Price = mean(Price), .by = starts_with("C"), .keep = "none")