any()
and all()
, two base R
functions, can be used with several dplyr
verbs. Besides
that, they have as well some variants with very specific employments, so
I thought it useful to retread all their applications in one page, for
an easier consulting when the need arises.
Their purpose is to evaluate logical vectors returning one single value as the output.
If any of the elements of the vector is TRUE, any()
returns TRUE.
<- c(1, 2, 3)) (x
## [1] 1 2 3
> 2 x
## [1] FALSE FALSE TRUE
any(x > 2)
## [1] TRUE
If none is, it returns FALSE.
> 3 x
## [1] FALSE FALSE FALSE
any(x > 3)
## [1] FALSE
If all the elements are TRUE, all()
returns TRUE,
> 0 x
## [1] TRUE TRUE TRUE
all(x > 0)
## [1] TRUE
and if there is just one FALSE, it returns FALSE.
> 1 x
## [1] FALSE TRUE TRUE
all(x > 1)
## [1] FALSE
It is like any()
chains many OR expressions,
any(x > 2)
## [1] TRUE
1 > 2 | 2 > 2 | 3 > 2
## [1] TRUE
any(x > 3)
## [1] FALSE
1 > 3 | 2 > 3 | 3 > 3
## [1] FALSE
while all()
many AND ones.
all(x > 0)
## [1] TRUE
1 > 0 & 2 > 0 & 3 > 0
## [1] TRUE
all(x > 1)
## [1] FALSE
1 > 1 & 2 > 1 & 3 > 1
## [1] FALSE
With these behaviors, we can use them in filter()
calls
on grouped data frames to preserve all the rows of groups that either at
times or all the times satisfy a condition.
For example we might want to use any()
to preserve all
the rows of the stock codes that had a negative value in the
Quantity
column at least once.
%>%
df group_by(StockCode) %>%
filter(any(Quantity < 0))
So, for every stock code, if there is at least one row with
Quantity < 0
(which will therefore return TRUE), all
that stock code’s rows will be preserved, even if the other values in
the Quantity
column are positive.
%>%
df group_by(StockCode) %>%
filter(any(Quantity < 0)) %>%
filter(StockCode == "85048") %>%
select(StockCode, Quantity)
With all()
instead we would preserve the stock codes
that always had negative values, as all of their rows will return
TRUE.
%>%
df group_by(StockCode) %>%
filter(all(Quantity < 0))
%>%
df group_by(StockCode) %>%
filter(all(Quantity < 0)) %>%
ungroup() %>%
summarise(`# of StockCodes` = n_distinct(StockCode),
`% of StockCodes with Negative Quantities` = formattable::percent(mean(Quantity < 0)))
Without any()
or all()
we would just
preserve the rows with negative quantities, regardless of what group
they belong to.
%>%
df group_by(StockCode) %>%
filter(Quantity < 0)
Additionally, this operation yields the same results on grouped and ungrouped data frames.
%>%
df filter(Quantity < 0)
Another example could be when we use an aggregate function like
mean()
.
Here for instance we preserve the rows of the stock codes that had at
least once a Price
higher than the mean of their
prices.
%>%
df group_by(StockCode) %>%
filter(any(Price > mean(Price)))
We remind the outputs of the interactions between NAs and the logical constants TRUE and FALSE in OR statements.
NA | TRUE
## [1] TRUE
NA | FALSE
## [1] NA
Because the vectors we evaluate can sometimes have NAs,
<- c(1, 2, 3, NA)) (xNA
## [1] 1 2 3 NA
> 3 xNA
## [1] FALSE FALSE FALSE NA
so in case we apply any()
on a vector with FALSEs and
NAs we can get an NA instead of FALSE.
any(xNA > 3)
## [1] NA
To prevent that we can use the na.rm
argument, even if
filter()
treats NAs as FALSE so that should not constitute
a problem.
any(xNA > 3, na.rm = TRUE)
## [1] FALSE
With all()
instead, an NA could prevent it to output
TRUE.
NA & TRUE
## [1] NA
NA & FALSE
## [1] FALSE
> 0 xNA
## [1] TRUE TRUE TRUE NA
all(xNA > 0)
## [1] NA
So here the na.rm
argument becomes necessary.
all(xNA > 0, na.rm = TRUE)
## [1] TRUE
As they compress one vector into one value, any()
and
all()
work well with the akin function
summarise()
on grouped data frames.
For example we may want to know which invoices have at least one
stock code with a Price
higher than 5,
%>%
df group_by(Invoice) %>%
summarise(One_Price_Higher_5 = any(Price > 5))
or the ones that have all of the stock codes with a
Price
higher than 5.
%>%
df group_by(Invoice) %>%
summarise(All_Prices_Higher_5 = all(Price > 5))
Exploiting the properties of TRUE and FALSE (that evaluate to 1 and 0 in calculations),
TRUE + TRUE
## [1] 2
TRUE + FALSE
## [1] 1
we can also use any()
and all()
to create
tables with counts and proportions.
%>%
df group_by(Invoice) %>%
summarise(One_Price_Higher_100 = any(Price > 100))
%>%
df group_by(Invoice) %>%
summarise(One_Price_Higher_100 = any(Price > 100)) %>%
summarise(`Tot # Invoices` = n(),
`# Invoices with Expensive Items` = sum(One_Price_Higher_100),
`% Invoices with Expensive Items` = formattable::percent(mean(One_Price_Higher_100)))
As seen in the
interactions between NAs & TRUE or FALSE
sub-section,
NAs can modify the output of any()
and all()
,
returning NA instead of a logical value.
xNA
## [1] 1 2 3 NA
> 3 xNA
## [1] FALSE FALSE FALSE NA
any(xNA > 3)
## [1] NA
> 0 xNA
## [1] TRUE TRUE TRUE NA
all(xNA > 0)
## [1] NA
That could be a problem with summarise()
when the
columns we are evaluating have NAs in them,
%>%
df rows_append(tibble(Invoice = "489435")) %>%
arrange(Invoice, !is.na(StockCode))
as that could modify the desired output.
%>%
df rows_append(tibble(Invoice = "489435")) %>%
group_by(Invoice) %>%
summarise(One_Price_Higher_5 = any(Price > 5))
To prevent that, we can use the na.rm
argument,
available for both any()
and all()
.
%>%
df rows_append(tibble(Invoice = "489435")) %>%
group_by(Invoice) %>%
summarise(One_Price_Higher_5 = any(Price > 5, na.rm = TRUE))
where()
is a select()
helper that allows to
use predicate functions (i.e. functions that return logical values, like
any()
and all()
), for example to only preserve
numerical columns.
%>%
df select(where(is.numeric))
The way it works is by testing the condition “is the column
numerical?” on every column, returning a logical vector as the output.
select()
will then preserve the columns whose positions
match the ones with a TRUE value in the output returned by
where()
.
With that in mind, it is easy to see how we can utilize
any()
or all()
inside of it, for example to
preserve the columns with some NAs (we utilize the ~
formula-like syntax here),
%>%
df select(where(~ any(is.na(.x))))
or the ones where all the values are some specific ones.
%>%
df mutate(Country = Country,
IT_Country = rep("Italy"),
FR_Country = rep("France"),
UK_Country = rep("United Kingdom"), .keep = "used")
%>%
df mutate(Country = Country,
IT_Country = rep("Italy"),
FR_Country = rep("France"),
UK_Country = rep("United Kingdom"), .keep = "used") %>%
select(where(~ all(.x %in% c("Italy", "France"))))
We can also find situations where the na.rm
argument can
be useful, for example here where we only want to preserve the columns,
amongst the numerical ones, with at least one value higher than the mean
of them.
%>%
df select(where(is.numeric)) %>%
select(where(~ any(.x > mean(.x))))
Error in `where()`:
! Predicate must return `TRUE` or `FALSE`, not `NA`.
Without na.rm = TRUE
we received an error as the column
Customer ID
has NAs in it, so its mean is NA and a test
with NAs always returns NA.
13085 > NA
## [1] NA
Evaluating that column any()
will then return one NA and
those are not accepted by where()
.
With na.rm = TRUE
we can remove that value and allow the
operation to return an output without errors.
%>%
df select(where(is.numeric)) %>%
select(where(~ any(.x > mean(.x), na.rm = TRUE)))
We could have as well removed the NAs from the
Customer ID
column by applying na.rm = TRUE
to
mean()
. The result is different as here we will preserve
also the aforementioned column, as its average is not NA anymore.
%>%
df select(where(is.numeric)) %>%
select(where(~ any(.x > mean(.x, na.rm = TRUE))))
Using a unique select()
call would have issued some
warnings, per the preemptive evaluation of the mean()
function also to non numerical columns. We didn’t receive an error as
mean()
, thanks to the &
operator, is only
effectively applied to numerical columns.
%>%
df select(where(~ is.numeric(.x) &
any(.x > mean(.x), na.rm = TRUE)))
Warning in mean.default(.x): argument is not numeric or logical: returning NA
Warning in mean.default(.x): argument is not numeric or logical: returning NA
Warning in mean.default(.x): argument is not numeric or logical: returning NA
Warning in mean.default(.x): argument is not numeric or logical: returning NA
any_of()
and all_of()
are two
select()
helpers that must be used when we have columns’
names stored in an external vector,
<- c("Invoice", "InvoiceDate")
cols_vec %>%
df select(any_of(cols_vec))
%>%
df select(all_of(cols_vec))
as without them we would receive a warning.
%>%
df select(cols_vec)
Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
# Was:
data %>% select(cols_vec)
# Now:
data %>% select(all_of(cols_vec))
See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.
Their difference comes off when a column from the vector is not
present in the data frame, as all_of()
throws an error,
<- c("Invoice", "InvoiceDate", "Shipment")
cols_vec %>%
df select(all_of(cols_vec))
Error in `all_of()`:
! Can't subset columns that don't exist.
✖ Column `Shipment` doesn't exist.
where any_of()
returns just the columns of the data
frame that match the ones in the vector.
%>%
df select(any_of(cols_vec))
The same behavior is present in case of de-selections.
%>%
df select(-all_of(cols_vec))
Error in `all_of()`:
! Can't subset columns that don't exist.
✖ Column `Shipment` doesn't exist.
%>%
df select(-any_of(cols_vec))
any_of()
and all_of()
can be used with
rename()
as well, when we have a vector with the
modifications we want to apply.
<- c(CUSTOMER_ID = "Customer ID", COUNTRY = "Country", SHIPMENT = "Shipment")
any_new_names %>%
df rename(any_of(any_new_names))
<- c(CUSTOMER_ID = "Customer ID", COUNTRY = "Country")
all_new_names %>%
df rename(all_of(all_new_names))
By the same token as with select()
, all of the columns
to be modified must be present in the data frame when we use
all_of()
.
%>%
df rename(all_of(any_new_names))
Error in `all_of()`:
! Can't rename columns that don't exist.
✖ Column `Shipment` doesn't exist.
Two other variants are cumany()
and
cumall()
, two cumulative aggregate window functions, both
of which evaluate logical vectors.
<- c(1, 2, 3, 2)) (x
## [1] 1 2 3 2
> 2 x
## [1] FALSE FALSE TRUE FALSE
cumany()
returns TRUE values after the first TRUE, even
if there are, in the logical vector, some FALSE ones after that.
cumany(x > 2)
## [1] FALSE FALSE TRUE TRUE
cumall()
instead returns FALSE values after the first
FALSE, even if there are some TRUE ones after it.
< 3 x
## [1] TRUE TRUE FALSE TRUE
cumall(x < 3)
## [1] TRUE TRUE FALSE FALSE
With these behaviors, cumany()
and cumall()
can be used inside filter()
calls to preserve rows
hereafter or up to a particular one.
cumany()
is used to preserve rows after a certain
condition is TRUE for the first time, the row where this happens
included, regardless of the condition becoming FALSE afterwards.
Like for example the rows after a stock code had a price higher than 2 for the first time.
%>%
df filter(StockCode == 21232)
%>%
df filter(StockCode == 21232) %>%
filter(cumany(Price > 2))
cumall()
preserves the rows before a certain condition
is no longer TRUE, regardless of it becoming TRUE again, like the rows
before the price of a stock code was no longer less than 2.
%>%
df filter(StockCode == 21232) %>%
filter(cumall(Price < 2))
With that in mind, we can’t use cumall()
with a
condition that returns FALSE as the first outputted value, as that will
preserve no rows.
%>%
df filter(StockCode == 21232) %>%
filter(cumall(Price > 2))
With NAs, ||
(the non-cumulative version of
cumany()
) returns TRUE with TRUE
NA || TRUE
## [1] TRUE
and NA with FALSE.
NA || FALSE
## [1] NA
So, if there are NAs, cumany()
will return NA instead of
FALSE.
<- c(1, 2, 3, NA, 2)) (xNA
## [1] 1 2 3 NA 2
> 3 xNA
## [1] FALSE FALSE FALSE NA FALSE
cumany(xNA > 3)
## [1] FALSE FALSE FALSE NA NA
filter()
treats the rows with NAs like FALSE so with
these manipulations that is not a problem, beside the fact that
cumany()
returns TRUE after the first TRUE, so subsequent
NAs are ininfluential.
%>%
df filter(StockCode == "10123G")
%>%
df filter(StockCode == "10123G") %>%
filter(cumany(`Customer ID` < 17968))
&&
(the non-cumulative version of
cumall()
) instead returns NA with TRUE,
NA && TRUE
## [1] NA
NA && FALSE
## [1] FALSE
and that could pose a problem for cumall()
,
xNA
## [1] 1 2 3 NA 2
> 0 xNA
## [1] TRUE TRUE TRUE NA TRUE
cumall(xNA > 0)
## [1] TRUE TRUE TRUE NA NA
as rows with NAs will not be preserved by filter()
.
%>%
df filter(StockCode == "10123G") %>%
slice(4:n())
%>%
df filter(StockCode == "10123G") %>%
slice(4:n()) %>%
filter(cumall(`Customer ID` < 17920))
The only solution is to remove NAs beforehand.
%>%
df filter(StockCode == "10123G" &
!is.na(`Customer ID`)) %>%
slice(3:n())
%>%
df filter(StockCode == "10123G" &
!is.na(`Customer ID`)) %>%
slice(3:n()) %>%
filter(cumall(`Customer ID` < 17920))
if_any()
and if_all()
are two substitutes
of across()
to be used in filter()
call, as
its use is deprecated.
%>%
df filter(Quantity > 5 &
> 5) Price
%>%
df filter(across(c(Quantity, Price), ~ .x > 5))
Warning: Using `across()` in `filter()` was deprecated in dplyr 1.0.8.
ℹ Please use `if_any()` or `if_all()` instead.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.
Their purpose is the same, to apply the same manipulation to
different columns. We use if_all()
when we want to preserve
the rows that have, for example, a value greater than 5 in all of the
columns selected,
%>%
df filter(if_all(c(Quantity, Price), ~ .x > 5))
and if_any()
in case we want to preserve the ones that
have a value greater than 5 in any of them.
%>%
df filter(Quantity > 5 |
> 5) Price
%>%
df filter(if_any(c(Quantity, Price), ~ .x > 5))
Note that, while the first example could be written with
across()
instead of if_all()
, for the second
one we can only use if_any()
.
We can also use them to filter for the same interval on several columns, either on all of them at the same time,
%>%
df filter(between(Quantity, 10, 20) &
between(Price, 10, 20))
%>%
df filter(if_all(c(Quantity, Price), ~ between(.x, 10, 20)))
%>%
df filter(if_all(c(Quantity, Price), ~ .x >= 10 & .x <= 20))
or just on any of them.
%>%
df filter(between(Quantity, 10, 20) |
between(Price, 10, 20))
%>%
df filter(if_any(c(Quantity, Price), ~ between(.x, 10, 20)))
%>%
df filter(if_any(c(Quantity, Price), ~ .x >= 10 & .x <= 20))
By design though they don’t understand cur_column()
(an
across()
specific function) so if we want to filter each
column for a different value,
%>%
df filter(Quantity > 2 &
> 10) Price
we need to use across()
(thus an &
statement) therefore we can only preserve the rows where both conditions
are TRUE at the same time.
<- list(Quantity = 2,
fltr Price = 10)
%>%
df filter(across(c(Quantity, Price), ~ .x > fltr[[cur_column()]]))
Warning: Using `across()` in `filter()` was deprecated in dplyr 1.0.8.
ℹ Please use `if_any()` or `if_all()` instead.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.
Another example with between()
, where we filter on two
columns for different ranges.
%>%
df filter(between(Quantity, 10, 20) &
between(Price, 20, 30))
<- list(Quantity = c(10, 20),
fltr2 Price = c(20, 30))
%>%
df filter(across(c(Quantity, Price), ~ between(.x, fltr2[[cur_column()]][1], fltr2[[cur_column()]][2])))
Warning: Using `across()` in `filter()` was deprecated in dplyr 1.0.8.
ℹ Please use `if_any()` or `if_all()` instead.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.
If the data frame already has logical columns,
%>%
df mutate(Boolean_TFT = rep(c(TRUE, FALSE, TRUE), length.out = nrow(df)),
Boolean_TF = rep(c(TRUE, FALSE), length.out = nrow(df)), .before = "Invoice")
if_any()
and if_all()
can be used also
without a function.
%>%
df mutate(Boolean_TFT = rep(c(TRUE, FALSE, TRUE), length.out = nrow(df)),
Boolean_TF = rep(c(TRUE, FALSE), length.out = nrow(df)), .before = "Invoice") %>%
filter(if_any(where(is.logical)))
%>%
df mutate(Boolean_TFT = rep(c(TRUE, FALSE, TRUE), length.out = nrow(df)),
Boolean_TF = rep(c(TRUE, FALSE), length.out = nrow(df)), .before = "Invoice") %>%
filter(if_all(where(is.logical)))
if_any()
and if_all()
can be used in
mutate()
or summarise()
calls as well, as long
as we exploit their property of returning logical vectors.
%>%
df mutate(High = case_when(Quantity > 50 & Price > 50 ~ "volume_and_price",
> 50 | Price > 50 ~ "volume_or_price",
Quantity < 50 & Price < 50 ~ "neither"), .keep = "used") Quantity
%>%
df mutate(High = case_when(if_all(c(Quantity, Price), ~ .x > 50) ~ "volume_and_price",
if_any(c(Quantity, Price), ~ .x > 50) ~ "volume_or_price",
if_all(c(Quantity, Price), ~ .x <= 50) ~ "neither"))