data-masking
summarise()
, when used with an aggregate function
(i.e. a function that computes multiple values into a single output),
compresses all the data frame into the single value returned by the
calculation.
%>%
df summarise(Avg_Price = mean(Price))
Notice the difference with mutate()
that instead adds a
column with the same value repeated for all the rows.
%>%
df mutate(Avg_Price = mean(Price))
With vectorized operations the output is not really different from
mutate()
, besides returning only one column.
%>%
df summarise(Price_Eur = Price * 1.14)
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.
%>%
df mutate(Price_Eur = Price * 1.14)
This applies to window functions as well.
%>%
df summarise(Price_Rank = dense_rank(desc(Price)))
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.
%>%
df mutate(Price_Rank = dense_rank(desc(Price)))
Some warnings were issued because, when the manipulation returns more
than one value per group (in those examples the group is one, the whole
data frame, so it should return just one value), it is advised to use
reframe()
instead.
When used on a grouped data frame summarise()
will
return one value per group when using an aggregate function.
%>%
df group_by(StockCode) %>%
summarise(Avg_Price = mean(Price))
With vectorised operations and window functions we may have more than one row per group (hence the warnings again) and, differently from the ungrouped case, it will return the grouping columns as well.
%>%
df group_by(StockCode) %>%
summarise(Price_Eur = Price * 1.14)
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.
`summarise()` has grouped output by 'StockCode'. You can override using the
`.groups` argument.
%>%
df group_by(StockCode) %>%
summarise(Price_Rank = dense_rank(desc(Price)))
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.
`summarise()` has grouped output by 'StockCode'. You can override using the
`.groups` argument.
Notice how in the last three examples the output is ordered by the grouping columns and that the first one returns an ungrouped data frame while the others don’t.
This is another property of summarise()
: removing the
most recent grouping column when the groups in the output are of size
one (meaning that they consist of only one row).
%>%
df group_by(`Customer ID`, Invoice) %>%
summarise(N_Rows_per_Invoice = n())
`summarise()` has grouped output by 'Customer ID'. You can override using the
`.groups` argument.
This allows to easily compute with different groupings in the same
pipe, like in the following example, where we needed both the
Customer
and Invoice
column for the first
computation and then only Customer ID
for the second
one.
%>%
df group_by(`Customer ID`, Invoice) %>%
summarise(N_Rows_per_Invoice = n()) %>%
mutate(N_Rows_per_Customer = sum(N_Rows_per_Invoice))
`summarise()` has grouped output by 'Customer ID'. You can override using the
`.groups` argument.
This behavior makes sense because, after a summarise()
call, the usefulness of the most recent grouping column is not very
high, as it usually only identifies one row, and therefore keeping the
original grouping structure could impede subsequent calculations.
%>%
df group_by(`Customer ID`, Invoice) %>%
summarise(N_Rows_per_Invoice = n()) %>%
group_by(`Customer ID`, Invoice) %>%
mutate(N_Rows_per_Customer = sum(N_Rows_per_Invoice))
`summarise()` has grouped output by 'Customer ID'. You can override using the
`.groups` argument.
This behavior never constitutes a problem if we are using
summarise()
for counting or summing, like in the previous
examples.
But it can produce erroneous results in other instances, like if we are for example averaging because the mean of several means
%>%
df group_by(`Customer ID`, Invoice) %>%
summarise(Avg_Quantity_per_Invoice = mean(Quantity))
`summarise()` has grouped output by 'Customer ID'. You can override using the
`.groups` argument.
%>%
df group_by(`Customer ID`, Invoice) %>%
summarise(Avg_Quantity_per_Invoice = mean(Quantity)) %>%
summarise(Avg_Quantity_per_Customer = mean(Avg_Quantity_per_Invoice))
`summarise()` has grouped output by 'Customer ID'. You can override using the
`.groups` argument.
is not always necessarily equal to the overall unmediated mean.
%>%
df group_by(`Customer ID`) %>%
summarise(Avg_Quantity_per_Customer = mean(Quantity))
So it is better to directly calculate the overall mean without intermediate steps, like in the latest example.
There is a .groups
argument that controls the behavior
of removing the most recent grouping, and it has two defaults based on
the number of rows, in the output, for each group:
with drop_last
we remove the most recent grouping and it is
the default if all groups have only one row
%>%
df group_by(StockCode) %>%
summarise(Price_Quantile_Value = quantile(Price, 0.25), prob = 0.25)
keep
maintains the grouping and it is the default if the
groups are bigger than one row.
%>%
df group_by(StockCode) %>%
summarise(Price_Quantile_Value = quantile(Price, c(0.25, 0.75)), prob = c(0.25, 0.75))
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.
`summarise()` has grouped output by 'StockCode'. You can override using the
`.groups` argument.
Again, it is not advised to have summarise()
returning
more than one value per group but to use reframe()
instead.
Then we have:
drop
, that removes all the groups, so it can be used to not
have another line with ungroup()
afterwards
%>%
df group_by(StockCode, Country) %>%
summarise(Avg_Price = mean(Price), .groups = "drop")
and rowwise
, that creates a data frame where every row
is a group, useful when we want to apply functions to values on the same
row.
%>%
df group_by(StockCode) %>%
summarise(Avg_Price = mean(Price),
Median_Price = median(Price), .groups = "rowwise") %>%
mutate(Highest_Price = max(Avg_Price, Median_Price))
I stress that I wrote functions, because with vectorized arithmetic
operations rowwise
is not needed.
%>%
df group_by(StockCode) %>%
summarise(Avg_Price = mean(Price),
Median_Price = median(Price), .groups = "rowwise") %>%
mutate(Price_Diff = Avg_Price - Median_Price)
%>%
df group_by(StockCode) %>%
summarise(Avg_Price = mean(Price),
Median_Price = median(Price)) %>%
mutate(Price_Diff = Avg_Price - Median_Price)
It is important as well to emphasize that the .groups
argument modifies the data frame after the calculation is performed.
As it happened in some of our examples, if we don’t specify the
.groups
argument, we get messages about the current state
of grouping if one is still present after the summarise()
call.
`summarise()` has grouped output by 'StockCode'. You can override using the
`.groups` argument.
tidy-select
Instead of using group_by()
, we can use use the
.by
argument to perform manipulations on a grouped data
frame.
%>%
df summarise(Avg_Price = mean(Price), .by = Country)
The first difference is that the output is not ordered by the
grouping columns. Another one is that .by
always returns an
ungrouped data frame also when using two columns (to be specified with a
tidy-select syntax).
%>%
df summarise(Avg_Price = mean(Price), .by = c("Country", "StockCode"))
Let’s take a look now at some of the functions we can use
summarise()
with:
: mean(), median()
Center: sd(), IQR(), mad()
Spread: min(), max(), quantile()
Range: first(), last(), nth()
Position: n(), n_distinct()
Count: any(), all() Logical
For the central tendency of a distribution of values we can use the
aggregate functions mean()
and median()
.
%>%
df group_by(StockCode) %>%
summarise(Avg_Price = mean(Price),
Median_Price = median(Price))
Likewise we have functions for measures of dispersion like the
standard deviation and the range, so with summarise()
we
can easily construct custom summary tables of our liking.
%>%
df group_by(StockCode) %>%
summarise(St_Dev_Price = sd(Price),
Price_Range = max(Price) - min(Price))
We can access the first, last and nth element of a group with the
position wrappers first()
, last()
and
nth()
.
%>%
df group_by(Invoice) %>%
summarise(First_Item = first(Description),
Fifth_Item = nth(Description, 5),
Last_Item = last(Description))
They possess three optional arguments: order_by
, to
change the order by which we count positions,
%>%
df group_by(Invoice) %>%
summarise(Fifth_Item = nth(Description, 5),
Fifth_Item_New_Order = nth(Description, 5, order_by = StockCode))
default
, in case we specify an absent position (its
preset value is NA)
%>%
df group_by(Invoice) %>%
summarise(Fifth_Item = nth(Description, 5),
Fifth_Item_New_Default = nth(Description, 5, default = "missing"))
and na_rm
, to remove NAs when counting the positions (as
no invoice with an NA in Description
has more than one row,
we won’t experience any changes in this example).
%>%
df group_by(Invoice) %>%
summarise(Fifth_Item = nth(Description, 1),
Fifth_Item_No_NAs = nth(Description, 1, na_rm = TRUE))
n()
, to be used without an argument, returns the size as
in the number of rows while n_distinct()
the number of
unique values of a column.
%>%
df group_by(`Customer ID`) %>%
summarise(N_Invoice_Lines_per_Customer = n(),
N_Unique_Items_per_Customer = n_distinct(StockCode))
When using n()
NAs can make part of the grouping
columns.
%>%
df group_by(`Customer ID`) %>%
summarise(N_Invoice_Lines_per_Customer = n()) %>%
arrange(desc(N_Invoice_Lines_per_Customer))
n_distinct()
can remove them from the count with
na.rm
.
%>%
df filter(Invoice == "489521")
%>%
df filter(Invoice == "489521") %>%
summarise(N_Unique_Items_per_Customer = n_distinct(Description))
%>%
df filter(Invoice == "489521") %>%
summarise(N_Unique_Items_per_Customer = n_distinct(Description, na.rm = TRUE))
n_distinct()
supports multiple columns as well and in
this case it will return the number of unique combinations between
them.
%>%
df filter(Invoice == "489434")
%>%
df filter(Invoice == "489434") %>%
summarise(N_Unique_StockCode_Description = n_distinct(StockCode, Description))
This can be useful to spot repetitions when that number is different from the number of rows.
%>%
df filter(Invoice == "489488") %>%
arrange(StockCode, Description)
%>%
df filter(Invoice == "489488") %>%
summarise(N_Unique_StockCode_Description = n_distinct(StockCode, Description))
%>%
df group_by(Invoice) %>%
summarise(N_Rows_per_Invoice = n(),
N_Unique_StockCode_Description = n_distinct(StockCode, Description)) %>%
filter(N_Rows_per_Invoice != N_Unique_StockCode_Description)
Lastly, any()
and all()
are two functions
that evaluate logical vectors, returning one single value as the
output.
If any of the elements of the vector is TRUE, any()
returns
TRUE.
<- c(1, 2, 3)) (x
## [1] 1 2 3
> 2 x
## [1] FALSE FALSE TRUE
any(x > 2)
## [1] TRUE
If none is, it returns FALSE.
> 3 x
## [1] FALSE FALSE FALSE
any(x > 3)
## [1] FALSE
If all the elements are TRUE, all()
returns TRUE,
> 0 x
## [1] TRUE TRUE TRUE
all(x > 0)
## [1] TRUE
and if there is just one FALSE, it returns FALSE.
> 1 x
## [1] FALSE TRUE TRUE
all(x > 1)
## [1] FALSE
It is like any()
chains many OR expressions,
any(x > 2)
## [1] TRUE
1 > 2 | 2 > 2 | 3 > 2
## [1] TRUE
any(x > 3)
## [1] FALSE
1 > 3 | 2 > 3 | 3 > 3
## [1] FALSE
while all()
many AND ones.
all(x > 0)
## [1] TRUE
1 > 0 & 2 > 0 & 3 > 0
## [1] TRUE
all(x > 1)
## [1] FALSE
1 > 1 & 2 > 1 & 3 > 1
## [1] FALSE
As they compress one vector into one value, any()
and
all()
work well with the akin function
summarise()
on grouped data frames.
For example we may want to know which invoices have at least one
stock code with a Price
higher than 5,
%>%
df group_by(Invoice) %>%
summarise(One_Price_Higher_5 = any(Price > 5))
or the ones that have all of the stock codes with a
Price
higher than 5.
%>%
df group_by(Invoice) %>%
summarise(All_Prices_Higher_5 = all(Price > 5))
We remind the outputs of the interactions between NAs and the logical constants TRUE and FALSE in OR statements.
NA | TRUE
## [1] TRUE
NA | FALSE
## [1] NA
Because the vectors we evaluate can sometimes have NAs,
<- c(1, 2, 3, NA)) (xNA
## [1] 1 2 3 NA
> 3 xNA
## [1] FALSE FALSE FALSE NA
so in case we apply any()
on a vector with FALSEs and
NAs we can get an NA instead of FALSE,
any(xNA > 3)
## [1] NA
With all()
instead, an NA could prevent it to output
TRUE.
NA & TRUE
## [1] NA
NA & FALSE
## [1] FALSE
> 0 xNA
## [1] TRUE TRUE TRUE NA
all(xNA > 0)
## [1] NA
That could be a problem with summarise()
when the
columns we are evaluating have NAs in them,
%>%
df rows_append(tibble(Invoice = "489435")) %>%
arrange(Invoice, !is.na(StockCode))
as that could modify the desired output.
%>%
df rows_append(tibble(Invoice = "489435")) %>%
group_by(Invoice) %>%
summarise(One_Price_Higher_5 = any(Price > 5))
To prevent that, we can use the na.rm
argument,
available for both any()
and all()
.
%>%
df rows_append(tibble(Invoice = "489435")) %>%
group_by(Invoice) %>%
summarise(One_Price_Higher_5 = any(Price > 5, na.rm = TRUE))
Exploiting the properties of TRUE and FALSE (that evaluate to 1 and 0 in calculations),
TRUE + TRUE
## [1] 2
TRUE + FALSE
## [1] 1
we can also use any()
and all()
to create
tables with counts and proportions.
%>%
df group_by(Invoice) %>%
summarise(One_Price_Higher_100 = any(Price > 100))
%>%
df group_by(Invoice) %>%
summarise(One_Price_Higher_100 = any(Price > 100)) %>%
summarise(`Tot # Invoices` = n(),
`# Invoices with Expensive Items` = sum(One_Price_Higher_100),
`% Invoices with Expensive Items` = formattable::percent(mean(One_Price_Higher_100)))
Counts and proportions of logical values can be done also without
any()
and all()
if we feed a conditional
statement into sum()
or mean()
(pay attention
that the two examples don’t and are not meant to return the same
output).
%>%
df summarise(Tot_N_Invoices = n_distinct(Invoice),
N_Expensive_Items = sum(Price > 100),
Prop_of_Expensive_Items = formattable::percent(mean(Price > 100)))
Another use of summarise()
it to have the same
calculation performed on two different sets thank to inline
subsetting.
%>%
df summarise(Avg_Quantity = mean(Quantity),
Avg_Positive_Quantity = mean(Quantity[Quantity > 0]))
That doesn’t need to be done on the same column.
%>%
df summarise(N_Invoices = n_distinct(Invoice),
N_Invoices_Positive_Quantity = n_distinct(Invoice[Quantity > 0]))
data-masking
We’ve seen examples where summarise()
returns more than
one value per group.
%>%
df group_by(StockCode) %>%
summarise(Price_Quantile_Value = quantile(Price, c(0.25, 0.75)), prob = c(0.25, 0.75))
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.
`summarise()` has grouped output by 'StockCode'. You can override using the
`.groups` argument.
We received a warning because for these kinds of operations it is
advised to use reframe()
.
%>%
df group_by(StockCode) %>%
reframe(Price_Quantile_Value = quantile(Price, c(0.25, 0.75)), prob = c(0.25, 0.75))
reframe()
and summarise()
function very
similarly, one difference is that the former always returns an ungrouped
data frame, even if the grouping columns are more than one (it doesn’t
have a .groups
argument then).
%>%
df group_by(StockCode, Country) %>%
reframe(Price_Quantile_Value = quantile(Price, c(0.25, 0.75)), prob = c(0.25, 0.75))
So we might want to use it also with manipulations that return only one row if we want an ungrouped data frame as the output.
%>%
df group_by(StockCode, Country) %>%
reframe(Price_Quantile_Value = quantile(Price, 0.25, prob = 0.25))
%>%
df group_by(StockCode, Country) %>%
summarise(Price_Quantile_Value = quantile(Price, 0.25, prob = 0.25))
`summarise()` has grouped output by 'StockCode'. You can override using the
`.groups` argument.
tidy-select
Using .by
will keep the original rows order.
%>%
df reframe(Price_Quantile_Value = quantile(Price, c(0.25, 0.75)), prob = c(0.25, 0.75), .by = c("StockCode", "Country"))