data-masking

- fundamentals

group_by() groups together the rows of a data frame by the unique values of the columns specified in its argument, forming as many groups as the aforementioned unique values (when we group by only one column),

df %>%
  group_by(Country)

A tibble: 525461 x 8

Groups: Country [40]

and as many as the unique existing combinations between their unique values (when we group by several ones).

df %>%
  group_by(Country, `Customer ID`)

A tibble: 525461 x 8

Groups: Country, Customer ID [4401]

Besides characters’ ones, group_by() can be used on every class of columns: numeric

df %>%
  group_by(Price)

A tibble: 525461 x 8

Groups: Price [1606]

and datetime alike.

df %>%
  group_by(InvoiceDate)

A tibble: 525461 x 8

Groups: InvoiceDate [25296]

It modifies the class of a data frame to grouped_df, but no other visible changes occur.

class(df %>%
        group_by(Country))

## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

When we apply a manipulation to a grouped data frame though, the manipulation is (in most of the cases) performed group-wise, meaning that only the values of the rows pertaining to a specific group are used for that group’s output. Therefore (still, generally speaking) are returned as many results as there are groups.

- with summarise(), mutate() & reframe()

summarise() might be the verb that benefits the most from a grouped data frame, as with it we can easily construct summary tables.

df %>%
  summarise(Avg_Price = mean(Price))

A tibble: 1 x 1

df %>%
  group_by(Country) %>%
  summarise(Avg_Price = mean(Price))

A tibble: 40 x 2

df %>%
  group_by(Country, `Customer ID`) %>%
  summarise(Avg_Price = mean(Price))

`summarise()` has grouped output by 'Country'. You can override using the
`.groups` argument.

A tibble: 4401 x 3

Groups: Country [40]

Notice how the output is ordered by the grouping columns’ (the columns we group by) values.

- summarise() unique “peeling off” property

As can be noticed from the previous example, when we have one result per group, like when using aggregate functions, summarise(), after the calculation is performed, ungroups the data frame when there is only one grouping column

df %>%
  group_by(Country) %>%
  summarise(Avg_Price = mean(Price)) %>%
  group_vars()

## character(0)

or removes the last one added, when there are several, informing us with a message about the resulting grouping of the output.

df %>%
  group_by(Country, `Customer ID`) %>%
  summarise(Avg_Price = mean(Price)) %>%
  group_vars()

## `summarise()` has grouped output by 'Country'. You can override using the
## `.groups` argument.

## [1] "Country"

mutate() preserves the original rows’ order maintaining as well their original number, adding the newly calculated values to each one of them.

df %>%
  group_by(Country) %>%
  mutate(Avg_Price = mean(Price))

A tibble: 525461 x 9

Groups: Country [40]

There are instances though where a grouped data frame doesn’t change the results of a mutate() call, for example if we applied a vectorized operation (like a product) instead of an aggregate function (like we did previously with mean()).

df %>%
  group_by(Country) %>%
  mutate(Revenue = Quantity * Price, .keep = "used")

A tibble: 525461 x 4

Groups: Country [40]

df %>%
  mutate(Revenue = Quantity * Price, .keep = "used")

A tibble: 525461 x 3

This is true also for summarise(), but harder to see for the new rows’ order dictated by the grouping column.

df %>%
  group_by(Country) %>%
  summarise(Revenue = Quantity * Price)

Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
  always returns an ungrouped data frame and adjust accordingly.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.

`summarise()` has grouped output by 'Country'. You can override using the
`.groups` argument.

A tibble: 525461 x 2

Groups: Country [40]

df %>%
  summarise(Revenue = Quantity * Price)

Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
  always returns an ungrouped data frame and adjust accordingly.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.

A tibble: 525461 x 1

Furthermore, as the warning suggests, for these kinds of transformations reframe() is a more correct verb.

df %>%
  group_by(Country) %>%
  reframe(Revenue = Quantity * Price)

A tibble: 525461 x 2

df %>%
  reframe(Revenue = Quantity * Price)

A tibble: 525461 x 1

The other difference between summarise() and reframe() is that the latter always returns an ungrouped data frame.

df %>%
  group_by(Country, `Customer ID`) %>%
  reframe(Revenue = Quantity * Price)

A tibble: 525461 x 3

- with filter()

filter() applies to each row a condition specific to the group it belongs to, so for instance in the next example mean(Price) is independently calculated for each group and the rows are preserved accordingly.

df %>%
  group_by(Country) %>%
  filter(Price > mean(Price))

A tibble: 117497 x 8

Groups: Country [40]

We can see it more clearly in the following tables, where in the first one we show which rows evaluate to TRUE in the grouped and ungrouped case and in the second one the difference in the number of rows preserved.

For United Kingdom for example the difference relies on the rows with a Price higher than Grouped_Avg but lower than Ungrouped_Avg (row 24 is the first one where this is shown), that are preserved with a grouped data frame and removed with an ungrouped one.

df %>%
  group_by(Country) %>%
  mutate(Grouped_Avg = mean(Price),
         Grouped_Filter = Price > Grouped_Avg,
         Ungrouped_Avg = mean(df$Price),
         Ungrouped_Filter = Price > Ungrouped_Avg, .keep = "used") %>%
  ungroup() %>%
  relocate(Country, Price)

A tibble: 525461 x 6

df %>%
  group_by(Country) %>%
  mutate(Grouped_Avg = mean(Price),
         Grouped_Filter = Price > Grouped_Avg,
         Ungrouped_Avg = mean(df$Price),
         Ungrouped_Filter = Price > Ungrouped_Avg, .keep = "used") %>%
  ungroup() %>%  
  tidyr::pivot_longer(c( "Grouped_Filter", "Ungrouped_Filter"), names_to = "Type_of_Filter", values_to = "Row") %>%
  mutate(Row = if_else(Row == TRUE, "Rows_Preserved", "Rows_Removed")) %>%
  count(Type_of_Filter, Row) %>%
  tidyr::pivot_wider(names_from = "Row", values_from = "n")

A tibble: 2 x 3

A grouped data frame doesn’t change the output when the filtering value (5 here, a constant) doesn’t depend on the groups.

df %>%
  group_by(Country) %>%
  filter(Price > 5)

A tibble: 98860 x 8

Groups: Country [40]

df %>%
  filter(Price > 5)

A tibble: 98860 x 8

- with slice() & its helpers

slice() and its helpers preserve as many singular or multiples of rows (depending on the selection) as there are groups. Akin to summarise(), also slice() orders the output by the grouping columns.

df %>%
  group_by(Country) %>%
  slice(1)

A tibble: 40 x 8

Groups: Country [40]

df %>%
  group_by(Country) %>%
  slice(1:2)

A tibble: 80 x 8

Groups: Country [40]

df %>%
  group_by(Country) %>%
  slice_head(n = 2)

A tibble: 80 x 8

Groups: Country [40]

df %>%
  group_by(Country) %>%
  slice_sample(n = 2)

A tibble: 80 x 8

Groups: Country [40]

df %>%
  group_by(Country) %>%
  slice_max(Quantity, n = 2)

A tibble: 132 x 8

Groups: Country [40]

- with select()

select() always adds the grouping columns to the selection.

df %>%
  group_by(Country) %>%
  select(Price)

Adding missing grouping variables: `Country`

A tibble: 525461 x 2

Groups: Country [40]

- with arrange()

arrange() ignores the grouping.

df %>%
  group_by(Country) %>%
  arrange(desc(Price))

A tibble: 525461 x 8

Groups: Country [40]

- with NAs

If there are NA values in the grouping columns, they are treated as an unique grouping value.

df %>%
  mutate(Country = na_if(Country, "Unspecified")) %>%
  group_by(Country) %>%
  summarise(n = n()) %>%
  arrange(!is.na(Country))

A tibble: 40 x 2

- single combinations between grouping columns

With more than one column every unique existing combination of their unique values forms a group, and we generally obtain a larger number of groups the more columns we add.

df %>%
  group_by(StockCode) %>%
  n_groups()

## [1] 4631

df %>%
  group_by(StockCode, Description) %>%
  n_groups()

## [1] 6963

But if each unique value of a column has only one combination with the unique values of the other columns (like in the following example where we removed with group_by(StockCode) %>% slice(1) the multiple descriptions for every stock code) when we group by those the number of groups stays constant.

df %>%
  group_by(StockCode) %>%
  slice(1) %>%
  group_by(StockCode, Description) %>%
  n_groups()

## [1] 4631

We can then use group_by() to “carry forward” columns.

df %>%
  group_by(StockCode) %>%
  slice(1) %>%
  group_by(StockCode, Description) %>%
  summarise(Avg_Price = mean(Price))

`summarise()` has grouped output by 'StockCode'. You can override using the
`.groups` argument.

A tibble: 4631 x 3

Groups: StockCode [4631]

This can produce a table that is more informative than the one without Description.

df %>%
  group_by(StockCode) %>%
  slice(1) %>%
  group_by(StockCode) %>%
  summarise(Avg_Price = mean(Price))

A tibble: 4631 x 2

Plus that column can be used in subsequent manipulations.

df %>%
  group_by(StockCode) %>%
  slice(1) %>%
  group_by(StockCode, Description) %>%
  summarise(Avg_Price = mean(Price)) %>%
  filter(!is.na(Description))

`summarise()` has grouped output by 'StockCode'. You can override using the
`.groups` argument.

A tibble: 4160 x 3

Groups: StockCode [4160]

Another way to achieve this could have been to use the Description column in the summarise() call.

df %>%
  group_by(StockCode) %>%
  slice(1) %>%
  group_by(StockCode) %>%
  summarise(Description = Description,
            Avg_Price = mean(Price))

A tibble: 4631 x 3

But only if we preemptively removed the multiple descriptions for every value of StockCode, otherwise we wouldn’t be able to obtain a mean(Price) for every unique value of StockCode as the Description column must be kept as is.

df %>%
  group_by(StockCode) %>%
  summarise(Description = Description,
            Avg_Price = mean(Price))

Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
  always returns an ungrouped data frame and adjust accordingly.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.

`summarise()` has grouped output by 'StockCode'. You can override using the
`.groups` argument.

A tibble: 525461 x 3

Groups: StockCode [4631]

- grouped data frames setbacks

While using group_by(), we must also made sure that the grouping doesn’t affect the calculation we want to perform.

In this example, if we group by Invoice as well, we can’t have the correct number of unique invoices per customer as n_distinct() can only count inside each group, and with this grouping each group has just one value of Invoice.

df %>%
  group_by(`Customer ID`, Invoice) %>%
  summarise(N_Invoices_per_Customer = n_distinct(Invoice))

`summarise()` has grouped output by 'Customer ID'. You can override using the
`.groups` argument.

A tibble: 28816 x 3

Groups: Customer ID [4384]

So we need to remove Invoice to produce the desired result.

df %>%
  group_by(`Customer ID`) %>%
  summarise(N_Invoices_per_Customer = n_distinct(Invoice))

A tibble: 4384 x 2

- ungroup()

We must always be aware then of what kind of grouping our data frame has and in general it is good practice to use ungroup() to remove the grouping after the manipulations it was needed for, to have the intended results in subsequent ones.

df %>%
  group_by(Country) %>%
  mutate(Avg_Price = mean(Price)) %>%
  select(Avg_Price)

Adding missing grouping variables: `Country`

A tibble: 525461 x 2

Groups: Country [40]

df %>%
  group_by(Country) %>%
  mutate(Avg_Price = mean(Price)) %>%
  ungroup() %>%
  select(Avg_Price)

A tibble: 525461 x 1

ungroup() can remove one grouping at a time as well.

df %>%
  group_by(Country, `Customer ID`) %>%
  mutate(Avg_Price = mean(Price)) %>%
  ungroup(Country) %>%
  select(Avg_Price)

Adding missing grouping variables: `Customer ID`

A tibble: 525461 x 2

Groups: Customer ID [4384]

As group_by() is a data-masking function, we can conveniently group by expressions, with the new_name = old_name syntax if we only change the names.

df %>%
  group_by(CustomerID = `Customer ID`)

A tibble: 525461 x 9

Groups: CustomerID [4384]

And with a mutate() syntax if we do so by computations.

df %>%
  group_by(Price_Rank = dense_rank(Price))

A tibble: 525461 x 9

Groups: Price_Rank [1606]

This is equivalent to the following, where the mutate() command is made explicit and not “hidden” inside group_by() so to speak.

df %>%
  mutate(Price_Rank = dense_rank(Price)) %>%
  group_by(Price_Rank)

A tibble: 525461 x 9

Groups: Price_Rank [1606]

- .add

With the .add argument we can control whether or not to add new groups to the existing ones. The default is FALSE, so normally the latest group_by() overrides existing groups.

df %>%
  group_by(Country) %>%
  group_by(`Customer ID`) %>%
  group_vars()

## [1] "Customer ID"

With .add = TRUE we append new groups instead.

df %>%
  group_by(Country) %>%
  group_by(`Customer ID`, .add = TRUE) %>%
  group_vars()

## [1] "Country"     "Customer ID"

And it will be the same as using two columns to begin with.

df %>%
  group_by(Country, `Customer ID`) %>%
  group_vars()

## [1] "Country"     "Customer ID"

But with the benefit of being able to perform manipulations between the two group_by() calls.

df %>%
  group_by(Country) %>%
  mutate(Country_Avg_Price = mean(Price)) %>%
  group_by(`Customer ID`, .add = TRUE) %>%
  mutate(Country_Customer_Avg_Price = mean(Price)) %>%
  select(Country_Avg_Price, Country_Customer_Avg_Price) %>%
  filter(row_number() == 1)

Adding missing grouping variables: `Country`, `Customer ID`

A tibble: 4401 x 4

Groups: Country, Customer ID [4401]

- implicit & explicit mutate() calls

The default .add = FALSE behavior, overriding the existing groups, still stands also when we group by expressions, so the implicit, “hidden”, mutate() command is performed on the ungrouped data frame.

df %>%
  group_by(Country) %>%
  group_by(Price_Rank = dense_rank(Price))

A tibble: 525461 x 9

Groups: Price_Rank [1606]

Therefore the fact that we had beforehand grouped by Country is inconsequential here: the data frame Price_Rank is computed on is not grouped by Country (I will remove group_by(Country) in the following examples then, also to better distinguish the two cases presented).

So in case we want to group by a newly created column (Price_Rank here) an already grouped data frame, we have to make the “hidden” mutate() command explicit.

df %>%
  group_by(Country) %>%
  mutate(Price_Rank = dense_rank(Price)) %>%
  group_by(Price_Rank)

A tibble: 525461 x 9

Groups: Price_Rank [1494]

In this way we will group by Price_Rank the rows inside each Country specific group, resulting in a different total number of groups.

df %>%
  group_by(Price_Rank = dense_rank(Price)) %>%
  n_groups()

## [1] 1606

df %>%
  group_by(Country) %>%
  mutate(Price_Rank = dense_rank(Price)) %>%
  group_by(Price_Rank) %>%
  n_groups()

## [1] 1494

That is because in the first case we group by the ranks computed on the prices of all the data frame, regardless of which Country they pertain to, so for example the smallest rank for Australia here is 30,

df %>%
  group_by(Price_Rank = dense_rank(Price)) %>%
  ungroup() %>%
  count(Country, Price, Price_Rank)

A tibble: 3059 x 4

as there are lower prices in other countries.

df %>%
  group_by(Price_Rank = dense_rank(Price)) %>%
  ungroup() %>%
  count(Country, Price, Price_Rank) %>%
  arrange(Price_Rank)

A tibble: 3059 x 4

In the second case we group by the ranks of the prices present inside each Country, so each Country ranking starts from rank 1.

df %>%
  group_by(Country) %>%
  mutate(Price_Rank = dense_rank(Price)) %>%
  group_by(Price_Rank) %>%
  ungroup() %>%
  count(Country, Price, Price_Rank)

A tibble: 3059 x 4

This results in a smaller number of groups, as the same Price_Rank is shared by different prices on different countries,

df %>%
  group_by(Country) %>%
  mutate(Price_Rank = dense_rank(Price)) %>%
  group_by(Price_Rank) %>%
  count(Price_Rank, Price, Country)

A tibble: 3059 x 4

Groups: Price_Rank [1494]

Whether in the first case every unique value of the Price column has a dedicated rank.

df %>%
  group_by(Price_Rank = dense_rank(Price)) %>%
  count(Price_Rank, Price, Country)

A tibble: 3059 x 4

Groups: Price_Rank [1606]

We can then use the structure from the second case to calculate, for example, the total number of items every rank contains.

df %>%
  group_by(Country) %>%
  mutate(Price_Rank = dense_rank(Price)) %>%
  group_by(Price_Rank) %>%
  summarise(Total_Quantity_per_Rank = sum(Quantity)) %>%
  arrange(Price_Rank)

A tibble: 1494 x 2

This calculation would have returned a different result if we didn’t “nest” the groups because, as we’ve seen before, for example Australia doesn’t have a value in the Price_Rank column that is lower than 30.

df %>%
  group_by(Price_Rank = dense_rank(Price)) %>%
  summarise(Total_Quantity_per_Rank = sum(Quantity)) %>%
  arrange(Price_Rank)

A tibble: 1606 x 2

If we add the Country grouping again we can also show the top 3 countries by number of items per rank.

df %>%
  group_by(Country) %>%
  mutate(Price_Rank = dense_rank(Price)) %>%
  group_by(Price_Rank, Country) %>%
  summarise(Total_Quantity_per_Rank = sum(Quantity)) %>%
  arrange(Price_Rank, desc(Total_Quantity_per_Rank)) %>%
  slice(1:3)

`summarise()` has grouped output by 'Price_Rank'. You can override using the
`.groups` argument.

A tibble: 1724 x 3

Groups: Price_Rank [1494]

- .add = TRUE

Using .add = TRUE outputs a different number of groups which is the same for both cases.

df %>%
  group_by(Country) %>%
  group_by(Price_Rank = dense_rank(Price), .add = TRUE) %>%
  n_groups()

## [1] 3059

df %>%
  group_by(Country) %>%
  mutate(Price_Rank = dense_rank(Price)) %>%
  group_by(Price_Rank, .add = TRUE) %>%
  n_groups()

## [1] 3059

This happens because here we group by two columns at the same time, so, compared to before, the larger number of groups is determined by all the unique existing combinations between them.

Notice how it is the same as the number of rows for the tables returned by count(Price_Rank, Price, Country) in the previous section: that is because count() returns in fact as many rows as all the unique existing combinations between the values of the columns we feed into it.

The rankings are the same to the default .add = FALSE as in the first case we still rank the prices on an ungrouped data frame,

df %>%
  group_by(Country) %>%
  group_by(Price_Rank = dense_rank(Price), .add = TRUE)

A tibble: 525461 x 9

Groups: Country, Price_Rank [3059]

df %>%
  group_by(Country) %>%
  group_by(Price_Rank = dense_rank(Price), .add = TRUE) %>%
  ungroup() %>%
  count(Country, Price, Price_Rank)

A tibble: 3059 x 4

while in the second case on a grouped one.

df %>%
  group_by(Country) %>%
  mutate(Price_Rank = dense_rank(Price)) %>%
  group_by(Price_Rank, .add = TRUE)

A tibble: 525461 x 9

Groups: Country, Price_Rank [3059]

df %>%
  group_by(Country) %>%
  mutate(Price_Rank = dense_rank(Price)) %>%
  group_by(Price_Rank, .add = TRUE) %>%
  ungroup() %>%
  count(Country, Price, Price_Rank)

A tibble: 3059 x 4

The first .add = TRUE case, the implicit one, is equivalent to using two grouping columns in the same group_by() call,

identical(df %>%
            group_by(Country) %>%
            group_by(Price_Rank = dense_rank(Price), .add = TRUE),
          df %>%
            group_by(Country, Price_Rank = dense_rank(Price)))

## [1] TRUE

also if we render explicit the mutate() call.

identical(df %>%
            group_by(Country) %>%
            group_by(Price_Rank = dense_rank(Price), .add = TRUE),
          df %>%
            mutate(Price_Rank = dense_rank(Price)) %>%
            group_by(Country, Price_Rank))

## [1] TRUE

We can use .add = TRUE to obtain the same results of an example from the previous section, but here we need to add a supplementary group_by(Price_Rank) as the first one is “peeled off” by the summarise() call.

df %>%
  group_by(Country) %>%
  mutate(Price_Rank = dense_rank(Price)) %>%
  group_by(Price_Rank, .add = TRUE) %>%
  summarise(Total_Quantity_per_Rank = sum(Quantity)) %>%
  arrange(Price_Rank, desc(Total_Quantity_per_Rank)) %>%
  group_by(Price_Rank) %>%
  slice(1:3)

`summarise()` has grouped output by 'Country'. You can override using the
`.groups` argument.

A tibble: 1724 x 3

Groups: Price_Rank [1494]

- .drop

It can happen that we group by a factor that, after a manipulation, doesn’t have anymore values in a level, like Lebanon here that only has 13 rows.

df %>%
  group_by(factor(Country)) %>%
  filter(row_number() == 15) %>%
  n_groups()

## [1] 39

By default the Lebanon group gets removed, as it is now empty, so the number of groups has decreased.

If we wish to keep the empty groups for further manipulations, we can do so by changing the value of the .drop argument to FALSE from its default TRUE.

df %>%
  group_by(factor(Country), .drop = FALSE) %>%
  filter(row_number() == 15) %>%
  n_groups()

## [1] 40

We need to pay attention when we group by more than one factor though, as .drop = FALSE will accordingly return every combination between their levels, even the non existing ones, as we see here where it is like every customer is located in every country.

df %>%
  group_by(factor(Country), factor(`Customer ID`), .drop = FALSE) %>%
  summarise(Total_Quantity = sum(Quantity))

`summarise()` has grouped output by 'factor(Country)'. You can override using
the `.groups` argument.

A tibble: 175333 x 3

Groups: factor(Country) [40]

The default (.drop = TRUE) doesn’t return the non existing combinations.

df %>%
  group_by(factor(Country), factor(`Customer ID`)) %>%
  summarise(Total_Quantity = sum(Quantity))

`summarise()` has grouped output by 'factor(Country)'. You can override using
the `.groups` argument.

A tibble: 4401 x 3

Groups: factor(Country) [40]

If we are not using factors the .drop argument doesn’t change the output, as it is factor specific.

df %>%
  group_by(Country, `Customer ID`, .drop = FALSE) %>%
  summarise(Total_Quantity = sum(Quantity))

`summarise()` has grouped output by 'Country'. You can override using the
`.groups` argument.

A tibble: 4401 x 3

Groups: Country [40]

df %>%
  group_by(Country, `Customer ID`) %>%
  summarise(Total_Quantity = sum(Quantity))

`summarise()` has grouped output by 'Country'. You can override using the
`.groups` argument.

A tibble: 4401 x 3

Groups: Country [40]

If we mix factor and another type the output depends on the order, because if we use factors as the second grouping column we get empty combinations, so attention is required.

df %>%
  group_by(factor(Country), `Customer ID`, .drop = FALSE) %>%
  summarise(Total_Quantity = sum(Quantity))

`summarise()` has grouped output by 'factor(Country)'. You can override using
the `.groups` argument.

A tibble: 4401 x 3

Groups: factor(Country) [40]

df %>%
  group_by(Country, factor(`Customer ID`), .drop = FALSE) %>%
  summarise(Total_Quantity = sum(Quantity))

`summarise()` has grouped output by 'Country'. You can override using the
`.groups` argument.

A tibble: 175333 x 3

Groups: Country [40]

.drop is similar to the .preserve argument of filter() and slice(), except that .preserve works with any kind of classes, not only with factors like .drop.

df %>%
  group_by(Country) %>%
  filter(row_number() == 15) %>%
  n_groups()

## [1] 39

df %>%
  group_by(Country) %>%
  filter(row_number() == 15, .preserve = TRUE) %>%
  n_groups()

## [1] 40

In case we apply group_by() to a grouped data frame that had .drop = FALSE, the default changes to (keeps being) FALSE, so we will continue to keep the empty levels going forward, even if the second group_by() is not applied to factors.

df %>%
  group_by(factor(Country), .drop = FALSE) %>%
  filter(row_number() > 15) %>%
  group_by(`Customer ID`) %>%
  filter(row_number() > 50) %>%
  summarise(n = n())

A tibble: 4381 x 2

And we need to specify .drop = TRUE if we want to return to the default behavior.

df %>%
  group_by(factor(Country), .drop = FALSE) %>%
  filter(row_number() > 15) %>%
  group_by(`Customer ID`, .drop = TRUE) %>%
  filter(row_number() > 50) %>%
  summarise(n = n())

A tibble: 1995 x 2

- group_by_drop_default()

In case of doubt the state of .drop can be checked with group_by_drop_default().

df %>%
  group_by(factor(Country), .drop = FALSE) %>%
  filter(row_number() > 15) %>%
  group_by(`Customer ID`) %>%
  group_by_drop_default()

## [1] FALSE

- using .by or by instead

tidy-select

We can perform operations on grouped data frames also by using the .by argument, available for the following verbs.

mutate()
summarise()
reframe()
filter()
slice()

slice() helpers use by instead.

slice_head()
slice_tail()
slice_min()
slice_max()
slice_sample()

So if we want to translate the following lines of code,

df %>%
  group_by(Country) %>%
  mutate(Avg_Price = mean(Price))

A tibble: 525461 x 9

Groups: Country [40]

we can write

df %>%
  mutate(Avg_Price = mean(Price), .by = Country)

A tibble: 525461 x 9

The first difference between the two is that .by doesn’t return a grouped data frame so with it we can avoid subsequent ungroup() calls.

Then if we group by two columns we must use a tidy-select syntax to specify them.

df %>%
  group_by(Country, `Customer ID`) %>%
  mutate(Avg_Price = mean(Price))

A tibble: 525461 x 9

Groups: Country, Customer ID [4401]

df %>%
  mutate(Avg_Price = mean(Price), .by = c(Country, `Customer ID` ))

A tibble: 525461 x 9

The tidy-select syntax permits different kinds of selection without the need of using across().

df %>%
  mutate(Avg_Price = mean(Price), .by = starts_with("C"))

A tibble: 525461 x 9

df %>%
  group_by(across(starts_with("C"))) %>%
  mutate(Avg_Price = mean(Price))

A tibble: 525461 x 9

Groups: Customer ID, Country [4401]

.by has no problems with multiple manipulations in the same mutate() call.

df %>%
  group_by(Country) %>%
  mutate(Grouped_Avg = mean(Price),
         Grouped_Filter = Price > Grouped_Avg,
         Ungrouped_Avg = mean(df$Price),
         Ungrouped_Filter = Price > Ungrouped_Avg, .keep = "used")

A tibble: 525461 x 6

Groups: Country [40]

df %>%
  mutate(Grouped_Avg = mean(Price),
         Grouped_Filter = Price > Grouped_Avg,
         Ungrouped_Avg = mean(df$Price),
         Ungrouped_Filter = Price > Ungrouped_Avg, .keep = "used", .by = Country)

A tibble: 525461 x 6

And can shorten some examples we’ve previously discussed (notice as well how in the first reworked example the select() call is different, as .by ungroups by default).

df %>%
  group_by(Country) %>%
  mutate(Country_Avg_Price = mean(Price)) %>%
  group_by(`Customer ID`, .add = TRUE) %>%
  mutate(Country_Customer_Avg_Price = mean(Price)) %>%
  select(Country_Avg_Price, Country_Customer_Avg_Price)

Adding missing grouping variables: `Country`, `Customer ID`

A tibble: 525461 x 4

Groups: Country, Customer ID [4401]

df %>%
  mutate(Country_Avg_Price = mean(Price), .by = Country) %>%
  mutate(Country_Customer_Avg_Price = mean(Price), .by = c(Country, `Customer ID` )) %>%
  select(Country, `Customer ID`, Country_Avg_Price, Country_Customer_Avg_Price)

A tibble: 525461 x 4

df %>%
  group_by(Country) %>%
  mutate(Price_Rank = dense_rank(Price)) %>%
  group_by(Price_Rank) %>%
  summarise(Total_Quantity_per_Rank = sum(Quantity)) %>%
  arrange(Price_Rank, desc(Total_Quantity_per_Rank))

A tibble: 1494 x 2

df %>%
  mutate(Price_Rank = dense_rank(Price), .by = Country) %>%
  summarise(Total_Quantity_per_Rank = sum(Quantity), .by = Price_Rank) %>%
  arrange(Price_Rank, desc(Total_Quantity_per_Rank))

A tibble: 1494 x 2

But in case we need the same grouping for many subsequent manipulations we might prefer group_by(), with the goal to avoid repetitions.

df %>%
  group_by(Country) %>%
  mutate(Price_Rank = dense_rank(Price)) %>%
  filter(Price_Rank == 2) %>%
  slice(1)

A tibble: 40 x 9

Groups: Country [40]

df %>%
  mutate(Price_Rank = dense_rank(Price), .by = Country) %>%
  filter(Price_Rank == 2, .by = Country) %>%
  slice(1, .by = Country)

A tibble: 40 x 9

Or if we are taking advantage of automatically removing the last grouping with a summarise() call.

df %>%
  group_by(`Customer ID`, Invoice) %>%
  summarise(N_Rows_per_Invoice = n()) %>%
  mutate(N_Invoice_Rows_per_Customer = sum(N_Rows_per_Invoice))

`summarise()` has grouped output by 'Customer ID'. You can override using the
`.groups` argument.

A tibble: 28816 x 4

Groups: Customer ID [4384]

df %>%
  summarise(N_Rows_per_Invoice = n(), .by = c(`Customer ID` , Invoice)) %>%
  mutate(N_Invoice_Rows_per_Customer = sum(N_Rows_per_Invoice), .by = `Customer ID`)

A tibble: 28816 x 4

Another thing evident from this example is that .by maintains the original row order when used with summarise() while with group_by() that verb rearranges by the grouping variables.

This behavior is shared with slice()

df %>%
  group_by(Country) %>%
  slice(1)

A tibble: 40 x 8

Groups: Country [40]

df %>%
  slice(1, .by = Country)

A tibble: 40 x 8

and with reframe() as well.

df %>%
  group_by(StockCode) %>%
  reframe(Price_Quantile_Value = quantile(Price, c(0.25, 0.75)), prob = c(0.25, 0.75))

A tibble: 9262 x 3

df %>%
  reframe(Price_Quantile_Value = quantile(Price, c(0.25, 0.75)), prob = c(0.25, 0.75), .by = StockCode)

A tibble: 9262 x 3

Additionally, .by can’t replicate the .drop = FALSE use case but only the default drop = TRUE.

df %>%
  group_by(factor(Country), factor(`Customer ID`), .drop = FALSE) %>%
  summarise(Total_Quantity = sum(Quantity))

`summarise()` has grouped output by 'factor(Country)'. You can override using
the `.groups` argument.

A tibble: 175333 x 3

Groups: factor(Country) [40]

df %>%
  mutate(Country = factor(Country),
         `Customer ID` = factor(`Customer ID`)) %>%
  summarise(Total_Quantity = sum(Quantity), .by = c(Country, `Customer ID` ))

A tibble: 4401 x 3

In this last example we can also reinforce how .by uses a tidy-select syntax, so we had to use a mutate() call beforehand to have factor columns.

If we try to use .by on a grouped data frame, we receive an error.

df %>%
  group_by(Country) %>%
  mutate(Avg_Price = mean(Price), .by = `Customer ID`)

Error in `mutate()`:
! Can't supply `.by` when `.data` is a grouped data frame.