data-masking
group_by()
groups together the rows of a data frame by
the unique values of the columns specified in its argument, forming as
many groups as the aforementioned unique values (when we group by only
one column),
%>%
df group_by(Country)
and as many as the unique existing combinations between their unique values (when we group by several ones).
%>%
df group_by(Country, `Customer ID`)
Besides characters’ ones, group_by()
can be used on
every class of columns: numeric
%>%
df group_by(Price)
and datetime alike.
%>%
df group_by(InvoiceDate)
It modifies the class of a data frame to grouped_df
, but
no other visible changes occur.
class(df %>%
group_by(Country))
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
When we apply a manipulation to a grouped data frame though, the manipulation is (in most of the cases) performed group-wise, meaning that only the values of the rows pertaining to a specific group are used for that group’s output. Therefore (still, generally speaking) are returned as many results as there are groups.
summarise()
might be the verb that benefits the most
from a grouped data frame, as with it we can easily construct summary
tables.
%>%
df summarise(Avg_Price = mean(Price))
%>%
df group_by(Country) %>%
summarise(Avg_Price = mean(Price))
%>%
df group_by(Country, `Customer ID`) %>%
summarise(Avg_Price = mean(Price))
`summarise()` has grouped output by 'Country'. You can override using the
`.groups` argument.
Notice how the output is ordered by the grouping columns’ (the columns we group by) values.
As can be noticed from the previous example, when we have one result
per group, like when using aggregate functions,
summarise()
, after the calculation is performed, ungroups
the data frame when there is only one grouping column
%>%
df group_by(Country) %>%
summarise(Avg_Price = mean(Price)) %>%
group_vars()
## character(0)
or removes the last one added, when there are several, informing us with a message about the resulting grouping of the output.
%>%
df group_by(Country, `Customer ID`) %>%
summarise(Avg_Price = mean(Price)) %>%
group_vars()
## `summarise()` has grouped output by 'Country'. You can override using the
## `.groups` argument.
## [1] "Country"
mutate()
preserves the original rows’ order maintaining
as well their original number, adding the newly calculated values to
each one of them.
%>%
df group_by(Country) %>%
mutate(Avg_Price = mean(Price))
There are instances though where a grouped data frame doesn’t change
the results of a mutate()
call, for example if we applied a
vectorized operation (like a product) instead of an aggregate function
(like we did previously with mean()
).
%>%
df group_by(Country) %>%
mutate(Revenue = Quantity * Price, .keep = "used")
%>%
df mutate(Revenue = Quantity * Price, .keep = "used")
This is true also for summarise()
, but harder to see for
the new rows’ order dictated by the grouping column.
%>%
df group_by(Country) %>%
summarise(Revenue = Quantity * Price)
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.
`summarise()` has grouped output by 'Country'. You can override using the
`.groups` argument.
%>%
df summarise(Revenue = Quantity * Price)
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.
Furthermore, as the warning suggests, for these kinds of
transformations reframe()
is a more correct verb.
%>%
df group_by(Country) %>%
reframe(Revenue = Quantity * Price)
%>%
df reframe(Revenue = Quantity * Price)
The other difference between summarise()
and
reframe()
is that the latter always returns an ungrouped
data frame.
%>%
df group_by(Country, `Customer ID`) %>%
reframe(Revenue = Quantity * Price)
filter()
applies to each row a condition specific to the
group it belongs to, so for instance in the next example
mean(Price)
is independently calculated for each group and
the rows are preserved accordingly.
%>%
df group_by(Country) %>%
filter(Price > mean(Price))
We can see it more clearly in the following tables, where in the first one we show which rows evaluate to TRUE in the grouped and ungrouped case and in the second one the difference in the number of rows preserved.
For United Kingdom for example the difference relies on the rows with
a Price
higher than Grouped_Avg
but lower than
Ungrouped_Avg
(row 24 is the first one where this is
shown), that are preserved with a grouped data frame and removed with an
ungrouped one.
%>%
df group_by(Country) %>%
mutate(Grouped_Avg = mean(Price),
Grouped_Filter = Price > Grouped_Avg,
Ungrouped_Avg = mean(df$Price),
Ungrouped_Filter = Price > Ungrouped_Avg, .keep = "used") %>%
ungroup() %>%
relocate(Country, Price)
%>%
df group_by(Country) %>%
mutate(Grouped_Avg = mean(Price),
Grouped_Filter = Price > Grouped_Avg,
Ungrouped_Avg = mean(df$Price),
Ungrouped_Filter = Price > Ungrouped_Avg, .keep = "used") %>%
ungroup() %>%
::pivot_longer(c( "Grouped_Filter", "Ungrouped_Filter"), names_to = "Type_of_Filter", values_to = "Row") %>%
tidyrmutate(Row = if_else(Row == TRUE, "Rows_Preserved", "Rows_Removed")) %>%
count(Type_of_Filter, Row) %>%
::pivot_wider(names_from = "Row", values_from = "n") tidyr
A grouped data frame doesn’t change the output when the filtering value (5 here, a constant) doesn’t depend on the groups.
%>%
df group_by(Country) %>%
filter(Price > 5)
%>%
df filter(Price > 5)
slice()
and its helpers preserve as many singular or
multiples of rows (depending on the selection) as there are groups. Akin
to summarise()
, also slice()
orders the output
by the grouping columns.
%>%
df group_by(Country) %>%
slice(1)
%>%
df group_by(Country) %>%
slice(1:2)
%>%
df group_by(Country) %>%
slice_head(n = 2)
%>%
df group_by(Country) %>%
slice_sample(n = 2)
%>%
df group_by(Country) %>%
slice_max(Quantity, n = 2)
select()
always adds the grouping columns to the
selection.
%>%
df group_by(Country) %>%
select(Price)
Adding missing grouping variables: `Country`
arrange()
ignores the grouping.
%>%
df group_by(Country) %>%
arrange(desc(Price))
If there are NA values in the grouping columns, they are treated as an unique grouping value.
%>%
df mutate(Country = na_if(Country, "Unspecified")) %>%
group_by(Country) %>%
summarise(n = n()) %>%
arrange(!is.na(Country))
With more than one column every unique existing combination of their unique values forms a group, and we generally obtain a larger number of groups the more columns we add.
%>%
df group_by(StockCode) %>%
n_groups()
## [1] 4631
%>%
df group_by(StockCode, Description) %>%
n_groups()
## [1] 6963
But if each unique value of a column has only one combination with
the unique values of the other columns (like in the following example
where we removed with group_by(StockCode) %>% slice(1)
the multiple descriptions for every stock code) when we group by those
the number of groups stays constant.
%>%
df group_by(StockCode) %>%
slice(1) %>%
group_by(StockCode, Description) %>%
n_groups()
## [1] 4631
We can then use group_by()
to “carry forward”
columns.
%>%
df group_by(StockCode) %>%
slice(1) %>%
group_by(StockCode, Description) %>%
summarise(Avg_Price = mean(Price))
`summarise()` has grouped output by 'StockCode'. You can override using the
`.groups` argument.
This can produce a table that is more informative than the one
without Description
.
%>%
df group_by(StockCode) %>%
slice(1) %>%
group_by(StockCode) %>%
summarise(Avg_Price = mean(Price))
Plus that column can be used in subsequent manipulations.
%>%
df group_by(StockCode) %>%
slice(1) %>%
group_by(StockCode, Description) %>%
summarise(Avg_Price = mean(Price)) %>%
filter(!is.na(Description))
`summarise()` has grouped output by 'StockCode'. You can override using the
`.groups` argument.
Another way to achieve this could have been to use the
Description
column in the summarise()
call.
%>%
df group_by(StockCode) %>%
slice(1) %>%
group_by(StockCode) %>%
summarise(Description = Description,
Avg_Price = mean(Price))
But only if we preemptively removed the multiple descriptions for
every value of StockCode
, otherwise we wouldn’t be able to
obtain a mean(Price)
for every unique value of
StockCode
as the Description
column must be
kept as is.
%>%
df group_by(StockCode) %>%
summarise(Description = Description,
Avg_Price = mean(Price))
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.
`summarise()` has grouped output by 'StockCode'. You can override using the
`.groups` argument.
While using group_by()
, we must also made sure that the
grouping doesn’t affect the calculation we want to perform.
In this example, if we group by Invoice
as well, we
can’t have the correct number of unique invoices per customer as
n_distinct(
) can only count inside each group, and with
this grouping each group has just one value of Invoice
.
%>%
df group_by(`Customer ID`, Invoice) %>%
summarise(N_Invoices_per_Customer = n_distinct(Invoice))
`summarise()` has grouped output by 'Customer ID'. You can override using the
`.groups` argument.
So we need to remove Invoice
to produce the desired
result.
%>%
df group_by(`Customer ID`) %>%
summarise(N_Invoices_per_Customer = n_distinct(Invoice))
We must always be aware then of what kind of grouping our data frame
has and in general it is good practice to use ungroup()
to
remove the grouping after the manipulations it was needed for, to have
the intended results in subsequent ones.
%>%
df group_by(Country) %>%
mutate(Avg_Price = mean(Price)) %>%
select(Avg_Price)
Adding missing grouping variables: `Country`
%>%
df group_by(Country) %>%
mutate(Avg_Price = mean(Price)) %>%
ungroup() %>%
select(Avg_Price)
ungroup()
can remove one grouping at a time as well.
%>%
df group_by(Country, `Customer ID`) %>%
mutate(Avg_Price = mean(Price)) %>%
ungroup(Country) %>%
select(Avg_Price)
Adding missing grouping variables: `Customer ID`
As group_by()
is a data-masking function, we can
conveniently group by expressions, with the
new_name = old_name
syntax if we only change the names.
%>%
df group_by(CustomerID = `Customer ID`)
And with a mutate()
syntax if we do so by
computations.
%>%
df group_by(Price_Rank = dense_rank(Price))
This is equivalent to the following, where the mutate()
command is made explicit and not “hidden” inside group_by()
so to speak.
%>%
df mutate(Price_Rank = dense_rank(Price)) %>%
group_by(Price_Rank)
With the .add
argument we can control whether or not to
add new groups to the existing ones. The default is FALSE, so normally
the latest group_by()
overrides existing groups.
%>%
df group_by(Country) %>%
group_by(`Customer ID`) %>%
group_vars()
## [1] "Customer ID"
With .add = TRUE
we append new groups instead.
%>%
df group_by(Country) %>%
group_by(`Customer ID`, .add = TRUE) %>%
group_vars()
## [1] "Country" "Customer ID"
And it will be the same as using two columns to begin with.
%>%
df group_by(Country, `Customer ID`) %>%
group_vars()
## [1] "Country" "Customer ID"
But with the benefit of being able to perform manipulations between
the two group_by()
calls.
%>%
df group_by(Country) %>%
mutate(Country_Avg_Price = mean(Price)) %>%
group_by(`Customer ID`, .add = TRUE) %>%
mutate(Country_Customer_Avg_Price = mean(Price)) %>%
select(Country_Avg_Price, Country_Customer_Avg_Price) %>%
filter(row_number() == 1)
Adding missing grouping variables: `Country`, `Customer ID`
The default .add = FALSE
behavior, overriding the
existing groups, still stands also when we group by expressions, so the
implicit, “hidden”, mutate()
command is performed on the
ungrouped data frame.
%>%
df group_by(Country) %>%
group_by(Price_Rank = dense_rank(Price))
Therefore the fact that we had beforehand grouped by
Country
is inconsequential here: the data frame
Price_Rank
is computed on is not grouped by
Country
(I will remove group_by(Country)
in
the following examples then, also to better distinguish the two cases
presented).
So in case we want to group by a newly created column
(Price_Rank
here) an already grouped data frame, we have to
make the “hidden” mutate()
command explicit.
%>%
df group_by(Country) %>%
mutate(Price_Rank = dense_rank(Price)) %>%
group_by(Price_Rank)
In this way we will group by Price_Rank
the rows inside
each Country
specific group, resulting in a different total
number of groups.
%>%
df group_by(Price_Rank = dense_rank(Price)) %>%
n_groups()
## [1] 1606
%>%
df group_by(Country) %>%
mutate(Price_Rank = dense_rank(Price)) %>%
group_by(Price_Rank) %>%
n_groups()
## [1] 1494
That is because in the first case we group by the ranks computed on
the prices of all the data frame, regardless of which
Country
they pertain to, so for example the smallest rank
for Australia here is 30,
%>%
df group_by(Price_Rank = dense_rank(Price)) %>%
ungroup() %>%
count(Country, Price, Price_Rank)
as there are lower prices in other countries.
%>%
df group_by(Price_Rank = dense_rank(Price)) %>%
ungroup() %>%
count(Country, Price, Price_Rank) %>%
arrange(Price_Rank)
In the second case we group by the ranks of the prices present inside
each Country
, so each Country
ranking starts
from rank 1.
%>%
df group_by(Country) %>%
mutate(Price_Rank = dense_rank(Price)) %>%
group_by(Price_Rank) %>%
ungroup() %>%
count(Country, Price, Price_Rank)
This results in a smaller number of groups, as the same
Price_Rank
is shared by different prices on different
countries,
%>%
df group_by(Country) %>%
mutate(Price_Rank = dense_rank(Price)) %>%
group_by(Price_Rank) %>%
count(Price_Rank, Price, Country)
Whether in the first case every unique value of the
Price
column has a dedicated rank.
%>%
df group_by(Price_Rank = dense_rank(Price)) %>%
count(Price_Rank, Price, Country)
We can then use the structure from the second case to calculate, for example, the total number of items every rank contains.
%>%
df group_by(Country) %>%
mutate(Price_Rank = dense_rank(Price)) %>%
group_by(Price_Rank) %>%
summarise(Total_Quantity_per_Rank = sum(Quantity)) %>%
arrange(Price_Rank)
This calculation would have returned a different result if we didn’t
“nest” the groups because, as we’ve seen before, for example Australia
doesn’t have a value in the Price_Rank
column that is lower
than 30.
%>%
df group_by(Price_Rank = dense_rank(Price)) %>%
summarise(Total_Quantity_per_Rank = sum(Quantity)) %>%
arrange(Price_Rank)
If we add the Country
grouping again we can also show
the top 3 countries by number of items per rank.
%>%
df group_by(Country) %>%
mutate(Price_Rank = dense_rank(Price)) %>%
group_by(Price_Rank, Country) %>%
summarise(Total_Quantity_per_Rank = sum(Quantity)) %>%
arrange(Price_Rank, desc(Total_Quantity_per_Rank)) %>%
slice(1:3)
`summarise()` has grouped output by 'Price_Rank'. You can override using the
`.groups` argument.
Using .add = TRUE
outputs a different number of groups
which is the same for both cases.
%>%
df group_by(Country) %>%
group_by(Price_Rank = dense_rank(Price), .add = TRUE) %>%
n_groups()
## [1] 3059
%>%
df group_by(Country) %>%
mutate(Price_Rank = dense_rank(Price)) %>%
group_by(Price_Rank, .add = TRUE) %>%
n_groups()
## [1] 3059
This happens because here we group by two columns at the same time, so, compared to before, the larger number of groups is determined by all the unique existing combinations between them.
Notice how it is the same as the number of rows for the tables
returned by count(Price_Rank, Price, Country)
in the
previous section: that is because count()
returns in fact
as many rows as all the unique existing combinations between the values
of the columns we feed into it.
The rankings are the same to the default .add = FALSE
as
in the first case we still rank the prices on an ungrouped data
frame,
%>%
df group_by(Country) %>%
group_by(Price_Rank = dense_rank(Price), .add = TRUE)
%>%
df group_by(Country) %>%
group_by(Price_Rank = dense_rank(Price), .add = TRUE) %>%
ungroup() %>%
count(Country, Price, Price_Rank)
while in the second case on a grouped one.
%>%
df group_by(Country) %>%
mutate(Price_Rank = dense_rank(Price)) %>%
group_by(Price_Rank, .add = TRUE)
%>%
df group_by(Country) %>%
mutate(Price_Rank = dense_rank(Price)) %>%
group_by(Price_Rank, .add = TRUE) %>%
ungroup() %>%
count(Country, Price, Price_Rank)
The first .add = TRUE
case, the implicit one, is
equivalent to using two grouping columns in the same
group_by()
call,
identical(df %>%
group_by(Country) %>%
group_by(Price_Rank = dense_rank(Price), .add = TRUE),
%>%
df group_by(Country, Price_Rank = dense_rank(Price)))
## [1] TRUE
also if we render explicit the mutate()
call.
identical(df %>%
group_by(Country) %>%
group_by(Price_Rank = dense_rank(Price), .add = TRUE),
%>%
df mutate(Price_Rank = dense_rank(Price)) %>%
group_by(Country, Price_Rank))
## [1] TRUE
We can use .add = TRUE
to obtain the same results of an
example from the previous section, but here we need to add a
supplementary group_by(Price_Rank)
as the first one is
“peeled off” by the summarise()
call.
%>%
df group_by(Country) %>%
mutate(Price_Rank = dense_rank(Price)) %>%
group_by(Price_Rank, .add = TRUE) %>%
summarise(Total_Quantity_per_Rank = sum(Quantity)) %>%
arrange(Price_Rank, desc(Total_Quantity_per_Rank)) %>%
group_by(Price_Rank) %>%
slice(1:3)
`summarise()` has grouped output by 'Country'. You can override using the
`.groups` argument.
It can happen that we group by a factor that, after a manipulation, doesn’t have anymore values in a level, like Lebanon here that only has 13 rows.
%>%
df group_by(factor(Country)) %>%
filter(row_number() == 15) %>%
n_groups()
## [1] 39
By default the Lebanon group gets removed, as it is now empty, so the number of groups has decreased.
If we wish to keep the empty groups for further manipulations, we can
do so by changing the value of the .drop
argument to FALSE
from its default TRUE.
%>%
df group_by(factor(Country), .drop = FALSE) %>%
filter(row_number() == 15) %>%
n_groups()
## [1] 40
We need to pay attention when we group by more than one factor
though, as .drop = FALSE
will accordingly return every
combination between their levels, even the non existing ones, as we see
here where it is like every customer is located in every country.
%>%
df group_by(factor(Country), factor(`Customer ID`), .drop = FALSE) %>%
summarise(Total_Quantity = sum(Quantity))
`summarise()` has grouped output by 'factor(Country)'. You can override using
the `.groups` argument.
The default (.drop = TRUE
) doesn’t return the non
existing combinations.
%>%
df group_by(factor(Country), factor(`Customer ID`)) %>%
summarise(Total_Quantity = sum(Quantity))
`summarise()` has grouped output by 'factor(Country)'. You can override using
the `.groups` argument.
If we are not using factors the .drop
argument doesn’t
change the output, as it is factor specific.
%>%
df group_by(Country, `Customer ID`, .drop = FALSE) %>%
summarise(Total_Quantity = sum(Quantity))
`summarise()` has grouped output by 'Country'. You can override using the
`.groups` argument.
%>%
df group_by(Country, `Customer ID`) %>%
summarise(Total_Quantity = sum(Quantity))
`summarise()` has grouped output by 'Country'. You can override using the
`.groups` argument.
If we mix factor and another type the output depends on the order, because if we use factors as the second grouping column we get empty combinations, so attention is required.
%>%
df group_by(factor(Country), `Customer ID`, .drop = FALSE) %>%
summarise(Total_Quantity = sum(Quantity))
`summarise()` has grouped output by 'factor(Country)'. You can override using
the `.groups` argument.
%>%
df group_by(Country, factor(`Customer ID`), .drop = FALSE) %>%
summarise(Total_Quantity = sum(Quantity))
`summarise()` has grouped output by 'Country'. You can override using the
`.groups` argument.
.drop
is similar to the .preserve
argument
of filter()
and slice()
, except that
.preserve
works with any kind of classes, not only with
factors like .drop
.
%>%
df group_by(Country) %>%
filter(row_number() == 15) %>%
n_groups()
## [1] 39
%>%
df group_by(Country) %>%
filter(row_number() == 15, .preserve = TRUE) %>%
n_groups()
## [1] 40
In case we apply group_by()
to a grouped data frame that
had .drop = FALSE
, the default changes to (keeps being)
FALSE, so we will continue to keep the empty levels going forward, even
if the second group_by()
is not applied to factors.
%>%
df group_by(factor(Country), .drop = FALSE) %>%
filter(row_number() > 15) %>%
group_by(`Customer ID`) %>%
filter(row_number() > 50) %>%
summarise(n = n())
And we need to specify .drop = TRUE
if we want to return
to the default behavior.
%>%
df group_by(factor(Country), .drop = FALSE) %>%
filter(row_number() > 15) %>%
group_by(`Customer ID`, .drop = TRUE) %>%
filter(row_number() > 50) %>%
summarise(n = n())
In case of doubt the state of .drop
can be checked with
group_by_drop_default()
.
%>%
df group_by(factor(Country), .drop = FALSE) %>%
filter(row_number() > 15) %>%
group_by(`Customer ID`) %>%
group_by_drop_default()
## [1] FALSE
tidy-select
We can perform operations on grouped data frames also by using the
.by
argument, available for the following verbs.
mutate()
summarise()
reframe()
filter()
slice()
slice()
helpers use by
instead.
slice_head()
slice_tail()
slice_min()
slice_max()
slice_sample()
So if we want to translate the following lines of code,
%>%
df group_by(Country) %>%
mutate(Avg_Price = mean(Price))
we can write
%>%
df mutate(Avg_Price = mean(Price), .by = Country)
The first difference between the two is that .by
doesn’t
return a grouped data frame so with it we can avoid subsequent
ungroup()
calls.
Then if we group by two columns we must use a tidy-select syntax to specify them.
%>%
df group_by(Country, `Customer ID`) %>%
mutate(Avg_Price = mean(Price))
%>%
df mutate(Avg_Price = mean(Price), .by = c(Country, `Customer ID` ))
The tidy-select syntax permits different kinds of selection without
the need of using across()
.
%>%
df mutate(Avg_Price = mean(Price), .by = starts_with("C"))
%>%
df group_by(across(starts_with("C"))) %>%
mutate(Avg_Price = mean(Price))
.by
has no problems with multiple manipulations in the
same mutate()
call.
%>%
df group_by(Country) %>%
mutate(Grouped_Avg = mean(Price),
Grouped_Filter = Price > Grouped_Avg,
Ungrouped_Avg = mean(df$Price),
Ungrouped_Filter = Price > Ungrouped_Avg, .keep = "used")
%>%
df mutate(Grouped_Avg = mean(Price),
Grouped_Filter = Price > Grouped_Avg,
Ungrouped_Avg = mean(df$Price),
Ungrouped_Filter = Price > Ungrouped_Avg, .keep = "used", .by = Country)
And can shorten some examples we’ve previously discussed (notice as
well how in the first reworked example the select()
call is
different, as .by
ungroups by default).
%>%
df group_by(Country) %>%
mutate(Country_Avg_Price = mean(Price)) %>%
group_by(`Customer ID`, .add = TRUE) %>%
mutate(Country_Customer_Avg_Price = mean(Price)) %>%
select(Country_Avg_Price, Country_Customer_Avg_Price)
Adding missing grouping variables: `Country`, `Customer ID`
%>%
df mutate(Country_Avg_Price = mean(Price), .by = Country) %>%
mutate(Country_Customer_Avg_Price = mean(Price), .by = c(Country, `Customer ID` )) %>%
select(Country, `Customer ID`, Country_Avg_Price, Country_Customer_Avg_Price)
%>%
df group_by(Country) %>%
mutate(Price_Rank = dense_rank(Price)) %>%
group_by(Price_Rank) %>%
summarise(Total_Quantity_per_Rank = sum(Quantity)) %>%
arrange(Price_Rank, desc(Total_Quantity_per_Rank))
%>%
df mutate(Price_Rank = dense_rank(Price), .by = Country) %>%
summarise(Total_Quantity_per_Rank = sum(Quantity), .by = Price_Rank) %>%
arrange(Price_Rank, desc(Total_Quantity_per_Rank))
But in case we need the same grouping for many subsequent
manipulations we might prefer group_by()
, with the goal to
avoid repetitions.
%>%
df group_by(Country) %>%
mutate(Price_Rank = dense_rank(Price)) %>%
filter(Price_Rank == 2) %>%
slice(1)
%>%
df mutate(Price_Rank = dense_rank(Price), .by = Country) %>%
filter(Price_Rank == 2, .by = Country) %>%
slice(1, .by = Country)
Or if we are taking advantage of automatically removing the last
grouping with a summarise()
call.
%>%
df group_by(`Customer ID`, Invoice) %>%
summarise(N_Rows_per_Invoice = n()) %>%
mutate(N_Invoice_Rows_per_Customer = sum(N_Rows_per_Invoice))
`summarise()` has grouped output by 'Customer ID'. You can override using the
`.groups` argument.
%>%
df summarise(N_Rows_per_Invoice = n(), .by = c(`Customer ID` , Invoice)) %>%
mutate(N_Invoice_Rows_per_Customer = sum(N_Rows_per_Invoice), .by = `Customer ID`)
Another thing evident from this example is that .by
maintains the original row order when used with summarise()
while with group_by()
that verb rearranges by the grouping
variables.
This behavior is shared with slice()
%>%
df group_by(Country) %>%
slice(1)
%>%
df slice(1, .by = Country)
and with reframe()
as well.
%>%
df group_by(StockCode) %>%
reframe(Price_Quantile_Value = quantile(Price, c(0.25, 0.75)), prob = c(0.25, 0.75))
%>%
df reframe(Price_Quantile_Value = quantile(Price, c(0.25, 0.75)), prob = c(0.25, 0.75), .by = StockCode)
Additionally, .by
can’t replicate the
.drop = FALSE
use case but only the default
drop = TRUE
.
%>%
df group_by(factor(Country), factor(`Customer ID`), .drop = FALSE) %>%
summarise(Total_Quantity = sum(Quantity))
`summarise()` has grouped output by 'factor(Country)'. You can override using
the `.groups` argument.
%>%
df mutate(Country = factor(Country),
`Customer ID` = factor(`Customer ID`)) %>%
summarise(Total_Quantity = sum(Quantity), .by = c(Country, `Customer ID` ))
In this last example we can also reinforce how .by
uses
a tidy-select syntax, so we had to use a mutate()
call
beforehand to have factor columns.
If we try to use .by
on a grouped data frame, we receive
an error.
%>%
df group_by(Country) %>%
mutate(Avg_Price = mean(Price), .by = `Customer ID`)
Error in `mutate()`:
! Can't supply `.by` when `.data` is a grouped data frame.