any() and all(), two base R functions, can be used with several dplyr verbs. Besides that, they have as well some variants with very specific employments, so I thought it useful to retread all their applications in one page, for an easier consulting when the need arises.

- fundamentals

Their purpose is to evaluate logical vectors returning one single value as the output.

If any of the elements of the vector is TRUE, any() returns TRUE.

(x <- c(1, 2, 3))
## [1] 1 2 3
x > 2
## [1] FALSE FALSE  TRUE
any(x > 2)
## [1] TRUE

If none is, it returns FALSE.

x > 3
## [1] FALSE FALSE FALSE
any(x > 3)
## [1] FALSE

If all the elements are TRUE, all() returns TRUE,

x > 0
## [1] TRUE TRUE TRUE
all(x > 0)
## [1] TRUE

and if there is just one FALSE, it returns FALSE.

x > 1
## [1] FALSE  TRUE  TRUE
all(x > 1)
## [1] FALSE

It is like any() chains many OR expressions,

any(x > 2)
## [1] TRUE
1 > 2 | 2 > 2 | 3 > 2
## [1] TRUE
any(x > 3)
## [1] FALSE
1 > 3 | 2 > 3 | 3 > 3
## [1] FALSE

while all() many AND ones.

all(x > 0)
## [1] TRUE
1 > 0 & 2 > 0 & 3 > 0
## [1] TRUE
all(x > 1)
## [1] FALSE
1 > 1 & 2 > 1 & 3 > 1
## [1] FALSE

- with filter() on grouped_dfs

With these behaviors, we can use them in filter() calls on grouped data frames to preserve all the rows of groups that either at times or all the times satisfy a condition.

For example we might want to use any() to preserve all the rows of the stock codes that had a negative value in the Quantity column at least once.

df %>%
  group_by(StockCode) %>%
  filter(any(Quantity < 0))
A tibble: 461492 x 8
Groups: StockCode [3117]

So, for every stock code, if there is at least one row with Quantity < 0 (which will therefore return TRUE), all that stock code’s rows will be preserved, even if the other values in the Quantity column are positive.

df %>%
  group_by(StockCode) %>%
  filter(any(Quantity < 0)) %>%  
  filter(StockCode == "85048") %>%
  select(StockCode, Quantity)
A tibble: 319 x 2
Groups: StockCode [1]

With all() instead we would preserve the stock codes that always had negative values, as all of their rows will return TRUE.

df %>%
  group_by(StockCode) %>%
  filter(all(Quantity < 0))
A tibble: 334 x 8
Groups: StockCode [315]
df %>%
  group_by(StockCode) %>%
  filter(all(Quantity < 0)) %>%
  ungroup() %>%
  summarise(`# of StockCodes` = n_distinct(StockCode),
            `% of StockCodes with Negative Quantities` = formattable::percent(mean(Quantity < 0)))
A tibble: 1 x 2

Without any() or all() we would just preserve the rows with negative quantities, regardless of what group they belong to.

df %>%
  group_by(StockCode) %>%
  filter(Quantity < 0)
A tibble: 12326 x 8
Groups: StockCode [3117]

Additionally, this operation yields the same results on grouped and ungrouped data frames.

df %>%
  filter(Quantity < 0)
A tibble: 12326 x 8

Another example could be when we use an aggregate function like mean().

Here for instance we preserve the rows of the stock codes that had at least once a Price higher than the mean of their prices.

df %>%
  group_by(StockCode) %>%
  filter(any(Price > mean(Price)))
A tibble: 522649 x 8
Groups: StockCode [3965]

- interactions between NAs & TRUE or FALSE

We remind the outputs of the interactions between NAs and the logical constants TRUE and FALSE in OR statements.

NA | TRUE 
## [1] TRUE
NA | FALSE
## [1] NA

Because the vectors we evaluate can sometimes have NAs,

(xNA <- c(1, 2, 3, NA))
## [1]  1  2  3 NA
xNA > 3
## [1] FALSE FALSE FALSE    NA

so in case we apply any() on a vector with FALSEs and NAs we can get an NA instead of FALSE.

any(xNA > 3)
## [1] NA

To prevent that we can use the na.rm argument, even if filter() treats NAs as FALSE so that should not constitute a problem.

any(xNA > 3, na.rm = TRUE)
## [1] FALSE

With all() instead, an NA could prevent it to output TRUE.

NA & TRUE
## [1] NA
NA & FALSE
## [1] FALSE
xNA > 0
## [1] TRUE TRUE TRUE   NA
all(xNA > 0)
## [1] NA

So here the na.rm argument becomes necessary.

all(xNA > 0, na.rm = TRUE)
## [1] TRUE

- with summarise() on grouped_dfs

As they compress one vector into one value, any() and all() work well with the akin function summarise() on grouped data frames.

For example we may want to know which invoices have at least one stock code with a Price higher than 5,

df %>%
  group_by(Invoice) %>%
  summarise(One_Price_Higher_5 = any(Price > 5))
A tibble: 28816 x 2

or the ones that have all of the stock codes with a Price higher than 5.

df %>%
  group_by(Invoice) %>%
  summarise(All_Prices_Higher_5 = all(Price > 5))
A tibble: 28816 x 2

Exploiting the properties of TRUE and FALSE (that evaluate to 1 and 0 in calculations),

TRUE + TRUE
## [1] 2
TRUE + FALSE
## [1] 1

we can also use any() and all() to create tables with counts and proportions.

df %>%
  group_by(Invoice) %>%
  summarise(One_Price_Higher_100 = any(Price > 100))
A tibble: 28816 x 2
df %>%
  group_by(Invoice) %>%
  summarise(One_Price_Higher_100 = any(Price > 100)) %>%
  summarise(`Tot # Invoices` = n(),
            `# Invoices with Expensive Items` = sum(One_Price_Higher_100),
            `% Invoices with Expensive Items` = formattable::percent(mean(One_Price_Higher_100)))
A tibble: 1 x 3

As seen in the interactions between NAs & TRUE or FALSE sub-section, NAs can modify the output of any() and all(), returning NA instead of a logical value.

xNA
## [1]  1  2  3 NA
xNA > 3
## [1] FALSE FALSE FALSE    NA
any(xNA > 3)
## [1] NA
xNA > 0
## [1] TRUE TRUE TRUE   NA
all(xNA > 0)
## [1] NA

That could be a problem with summarise() when the columns we are evaluating have NAs in them,

df %>%
  rows_append(tibble(Invoice = "489435")) %>%
  arrange(Invoice, !is.na(StockCode))
A tibble: 525462 x 8

as that could modify the desired output.

df %>%
  rows_append(tibble(Invoice = "489435")) %>%
  group_by(Invoice) %>%
  summarise(One_Price_Higher_5 = any(Price > 5))
A tibble: 28816 x 2

To prevent that, we can use the na.rm argument, available for both any() and all().

df %>%
  rows_append(tibble(Invoice = "489435")) %>%
  group_by(Invoice) %>%
  summarise(One_Price_Higher_5 = any(Price > 5, na.rm = TRUE))
A tibble: 28816 x 2

- with where() inside select() calls

where() is a select() helper that allows to use predicate functions (i.e. functions that return logical values, like any() and all()), for example to only preserve numerical columns.

df %>%
  select(where(is.numeric))
A tibble: 525461 x 3

The way it works is by testing the condition “is the column numerical?” on every column, returning a logical vector as the output. select() will then preserve the columns whose positions match the ones with a TRUE value in the output returned by where().

With that in mind, it is easy to see how we can utilize any() or all() inside of it, for example to preserve the columns with some NAs (we utilize the ~ formula-like syntax here),

df %>%
  select(where(~ any(is.na(.x))))
A tibble: 525461 x 2

or the ones where all the values are some specific ones.

df %>%
  mutate(Country = Country,
         IT_Country = rep("Italy"),
         FR_Country = rep("France"),
         UK_Country = rep("United Kingdom"), .keep = "used")
A tibble: 525461 x 4
df %>%
  mutate(Country = Country,
         IT_Country = rep("Italy"),
         FR_Country = rep("France"),
         UK_Country = rep("United Kingdom"), .keep = "used") %>%
  select(where(~ all(.x %in% c("Italy", "France"))))
A tibble: 525461 x 2

We can also find situations where the na.rm argument can be useful, for example here where we only want to preserve the columns, amongst the numerical ones, with at least one value higher than the mean of them.

df %>%
  select(where(is.numeric)) %>%
  select(where(~ any(.x > mean(.x))))
Error in `where()`:
! Predicate must return `TRUE` or `FALSE`, not `NA`.

Without na.rm = TRUE we received an error as the column Customer ID has NAs in it, so its mean is NA and a test with NAs always returns NA.

13085 > NA
## [1] NA

Evaluating that column any() will then return one NA and those are not accepted by where().

With na.rm = TRUE we can remove that value and allow the operation to return an output without errors.

df %>%
  select(where(is.numeric)) %>%
  select(where(~ any(.x > mean(.x), na.rm = TRUE)))
A tibble: 525461 x 2

We could have as well removed the NAs from the Customer ID column by applying na.rm = TRUE to mean(). The result is different as here we will preserve also the aforementioned column, as its average is not NA anymore.

df %>%
  select(where(is.numeric)) %>%
  select(where(~ any(.x > mean(.x, na.rm = TRUE))))
A tibble: 525461 x 3

Using a unique select() call would have issued some warnings, per the preemptive evaluation of the mean() function also to non numerical columns. We didn’t receive an error as mean(), thanks to the & operator, is only effectively applied to numerical columns.

df %>%
  select(where(~ is.numeric(.x) &
                 any(.x > mean(.x), na.rm = TRUE)))
Warning in mean.default(.x): argument is not numeric or logical: returning NA

Warning in mean.default(.x): argument is not numeric or logical: returning NA

Warning in mean.default(.x): argument is not numeric or logical: returning NA

Warning in mean.default(.x): argument is not numeric or logical: returning NA
A tibble: 525461 x 2

- variants

- any_of() & all_of() with select() or rename()

any_of() and all_of() are two select() helpers that must be used when we have columns’ names stored in an external vector,

cols_vec <- c("Invoice", "InvoiceDate")
df %>%
  select(any_of(cols_vec))
A tibble: 525461 x 2
df %>%
  select(all_of(cols_vec))
A tibble: 525461 x 2

as without them we would receive a warning.

df %>%
  select(cols_vec)
Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
  # Was:
  data %>% select(cols_vec)

  # Now:
  data %>% select(all_of(cols_vec))

See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.
A tibble: 525461 x 2

Their difference comes off when a column from the vector is not present in the data frame, as all_of()throws an error,

cols_vec <- c("Invoice", "InvoiceDate", "Shipment")
df %>%
  select(all_of(cols_vec))
Error in `all_of()`:
! Can't subset columns that don't exist.
✖ Column `Shipment` doesn't exist.

where any_of() returns just the columns of the data frame that match the ones in the vector.

df %>%
  select(any_of(cols_vec))
A tibble: 525461 x 2

The same behavior is present in case of de-selections.

df %>%
  select(-all_of(cols_vec))
Error in `all_of()`:
! Can't subset columns that don't exist.
✖ Column `Shipment` doesn't exist.
df %>%
  select(-any_of(cols_vec))
A tibble: 525461 x 6

any_of() and all_of() can be used with rename() as well, when we have a vector with the modifications we want to apply.

any_new_names <- c(CUSTOMER_ID = "Customer ID", COUNTRY = "Country", SHIPMENT = "Shipment")
df %>%
  rename(any_of(any_new_names))
A tibble: 525461 x 8
all_new_names <- c(CUSTOMER_ID = "Customer ID", COUNTRY = "Country")
df %>%
  rename(all_of(all_new_names))
A tibble: 525461 x 8

By the same token as with select(), all of the columns to be modified must be present in the data frame when we use all_of().

df %>%
  rename(all_of(any_new_names))
Error in `all_of()`:
! Can't rename columns that don't exist.
✖ Column `Shipment` doesn't exist.

- cumany() & cumall() with filter()

Two other variants are cumany() and cumall(), two cumulative aggregate window functions, both of which evaluate logical vectors.

(x <- c(1, 2, 3, 2))
## [1] 1 2 3 2
x > 2
## [1] FALSE FALSE  TRUE FALSE

cumany() returns TRUE values after the first TRUE, even if there are, in the logical vector, some FALSE ones after that.

cumany(x > 2)
## [1] FALSE FALSE  TRUE  TRUE

cumall() instead returns FALSE values after the first FALSE, even if there are some TRUE ones after it.

x < 3
## [1]  TRUE  TRUE FALSE  TRUE
cumall(x < 3)
## [1]  TRUE  TRUE FALSE FALSE

With these behaviors, cumany() and cumall() can be used inside filter() calls to preserve rows hereafter or up to a particular one.

cumany() is used to preserve rows after a certain condition is TRUE for the first time, the row where this happens included, regardless of the condition becoming FALSE afterwards.

Like for example the rows after a stock code had a price higher than 2 for the first time.

df %>%
  filter(StockCode == 21232)
A tibble: 1843 x 8
df %>%
  filter(StockCode == 21232) %>%
  filter(cumany(Price > 2))
A tibble: 1834 x 8

cumall() preserves the rows before a certain condition is no longer TRUE, regardless of it becoming TRUE again, like the rows before the price of a stock code was no longer less than 2.

df %>%
  filter(StockCode == 21232) %>%
  filter(cumall(Price < 2))
A tibble: 9 x 8

With that in mind, we can’t use cumall() with a condition that returns FALSE as the first outputted value, as that will preserve no rows.

df %>%
  filter(StockCode == 21232) %>%
  filter(cumall(Price > 2))
A tibble: 0 x 8

With NAs, || (the non-cumulative version of cumany()) returns TRUE with TRUE

NA || TRUE
## [1] TRUE

and NA with FALSE.

NA || FALSE
## [1] NA

So, if there are NAs, cumany() will return NA instead of FALSE.

(xNA <- c(1, 2, 3, NA, 2))
## [1]  1  2  3 NA  2
xNA > 3
## [1] FALSE FALSE FALSE    NA FALSE
cumany(xNA > 3)
## [1] FALSE FALSE FALSE    NA    NA

filter() treats the rows with NAs like FALSE so with these manipulations that is not a problem, beside the fact that cumany() returns TRUE after the first TRUE, so subsequent NAs are ininfluential.

df %>%
  filter(StockCode == "10123G")
A tibble: 20 x 8
df %>%
  filter(StockCode == "10123G") %>%
  filter(cumany(`Customer ID` < 17968))
A tibble: 17 x 8

&& (the non-cumulative version of cumall()) instead returns NA with TRUE,

NA && TRUE
## [1] NA
NA && FALSE
## [1] FALSE

and that could pose a problem for cumall(),

xNA
## [1]  1  2  3 NA  2
xNA > 0
## [1] TRUE TRUE TRUE   NA TRUE
cumall(xNA > 0)
## [1] TRUE TRUE TRUE   NA   NA

as rows with NAs will not be preserved by filter().

df %>%
  filter(StockCode == "10123G") %>%
  slice(4:n())
A tibble: 17 x 8
df %>%
  filter(StockCode == "10123G") %>%
  slice(4:n()) %>%
  filter(cumall(`Customer ID` < 17920))
A tibble: 3 x 8

The only solution is to remove NAs beforehand.

df %>%
  filter(StockCode == "10123G" &
           !is.na(`Customer ID`)) %>%
  slice(3:n())
A tibble: 12 x 8
df %>%
  filter(StockCode == "10123G" &
           !is.na(`Customer ID`)) %>%
  slice(3:n()) %>%
  filter(cumall(`Customer ID` < 17920))
A tibble: 10 x 8

- if_any() & if_all() with filter() in place of across()

if_any() and if_all() are two substitutes of across() to be used in filter() call, as its use is deprecated.

df %>%
  filter(Quantity > 5 &
           Price > 5)
A tibble: 10305 x 8
df %>%
  filter(across(c(Quantity, Price), ~ .x > 5))
Warning: Using `across()` in `filter()` was deprecated in dplyr 1.0.8.
ℹ Please use `if_any()` or `if_all()` instead.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.
A tibble: 10305 x 8

Their purpose is the same, to apply the same manipulation to different columns. We use if_all() when we want to preserve the rows that have, for example, a value greater than 5 in all of the columns selected,

df %>%
  filter(if_all(c(Quantity, Price), ~ .x > 5))
A tibble: 10305 x 8

and if_any() in case we want to preserve the ones that have a value greater than 5 in any of them.

df %>%
  filter(Quantity > 5 |
           Price > 5)
A tibble: 296618 x 8
df %>%
  filter(if_any(c(Quantity, Price), ~ .x > 5))
A tibble: 296618 x 8

Note that, while the first example could be written with across() instead of if_all(), for the second one we can only use if_any().

We can also use them to filter for the same interval on several columns, either on all of them at the same time,

df %>%
  filter(between(Quantity, 10, 20) &
           between(Price, 10, 20))
A tibble: 512 x 8
df %>%
  filter(if_all(c(Quantity, Price), ~ between(.x, 10, 20)))
A tibble: 512 x 8
df %>%
  filter(if_all(c(Quantity, Price), ~ .x >= 10 & .x <= 20))
A tibble: 512 x 8

or just on any of them.

df %>%
  filter(between(Quantity, 10, 20) |
           between(Price, 10, 20))
A tibble: 112209 x 8
df %>%
  filter(if_any(c(Quantity, Price), ~ between(.x, 10, 20)))
A tibble: 112209 x 8
df %>%
  filter(if_any(c(Quantity, Price), ~ .x >= 10 & .x <= 20))
A tibble: 112209 x 8

By design though they don’t understand cur_column() (an across() specific function) so if we want to filter each column for a different value,

df %>%
  filter(Quantity > 2 &
           Price > 10)
A tibble: 4217 x 8

we need to use across() (thus an & statement) therefore we can only preserve the rows where both conditions are TRUE at the same time.

fltr <- list(Quantity = 2, 
             Price = 10)
df %>%
  filter(across(c(Quantity, Price), ~ .x > fltr[[cur_column()]]))
Warning: Using `across()` in `filter()` was deprecated in dplyr 1.0.8.
ℹ Please use `if_any()` or `if_all()` instead.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.
A tibble: 4217 x 8

Another example with between(), where we filter on two columns for different ranges.

df %>%
  filter(between(Quantity, 10, 20) &
           between(Price, 20, 30))
A tibble: 32 x 8
fltr2 <- list(Quantity = c(10, 20),
               Price = c(20, 30))
df %>%
  filter(across(c(Quantity, Price), ~ between(.x, fltr2[[cur_column()]][1], fltr2[[cur_column()]][2])))
Warning: Using `across()` in `filter()` was deprecated in dplyr 1.0.8.
ℹ Please use `if_any()` or `if_all()` instead.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.
A tibble: 32 x 8

If the data frame already has logical columns,

df %>%
  mutate(Boolean_TFT = rep(c(TRUE, FALSE, TRUE), length.out = nrow(df)),
         Boolean_TF = rep(c(TRUE, FALSE), length.out = nrow(df)), .before = "Invoice")
A tibble: 525461 x 10

if_any() and if_all() can be used also without a function.

df %>%
  mutate(Boolean_TFT = rep(c(TRUE, FALSE, TRUE), length.out = nrow(df)),
         Boolean_TF = rep(c(TRUE, FALSE), length.out = nrow(df)), .before = "Invoice") %>%
  filter(if_any(where(is.logical)))
A tibble: 437884 x 10
df %>%
  mutate(Boolean_TFT = rep(c(TRUE, FALSE, TRUE), length.out = nrow(df)),
         Boolean_TF = rep(c(TRUE, FALSE), length.out = nrow(df)), .before = "Invoice") %>%
  filter(if_all(where(is.logical)))
A tibble: 175154 x 10

- if_any() & if_all() with mutate() or summarise()

if_any() and if_all() can be used in mutate() or summarise() calls as well, as long as we exploit their property of returning logical vectors.

df %>%
  mutate(High = case_when(Quantity > 50 & Price > 50  ~ "volume_and_price",
                          Quantity > 50 | Price > 50  ~ "volume_or_price",
                          Quantity < 50 & Price < 50  ~ "neither"), .keep = "used")
A tibble: 525461 x 3
df %>%
  mutate(High = case_when(if_all(c(Quantity, Price), ~ .x > 50) ~ "volume_and_price",
                          if_any(c(Quantity, Price), ~ .x > 50) ~ "volume_or_price",
                          if_all(c(Quantity, Price), ~ .x <= 50) ~ "neither"))
A tibble: 525461 x 9