- case_when()

We can nest several if_else()s if we want a column with more than two classes in the output.

EU <-  c("Austria", "Belgium", "Channel Islands", "Cyprus", "Denmark", "EIRE", "Finland", "France", "Germany", 
         "Greece", "Iceland", "Italy", "Lithuania", "Malta", "Netherlands", "Norway", "Poland", 
         "Portugal", "Spain", "Sweden", "Switzerland")

df %>%
  mutate(in_EU = if_else(Country == "Unspecified", Country, 
                         if_else(Country %in% EU, "Yes", "No")), .keep = "used")

A tibble: 525461 x 2

But it is best to use case_when(), which is built for this purpose. It uses a formula like syntax, where on the right of ~ we specify which value to assign when the test on the left evaluates to TRUE.

df %>%
  mutate(in_EU = case_when(Country == "Unspecified" ~ Country,
                           Country %in% EU ~ "Yes",
                           !Country %in% EU ~ "No"), .keep = "used")

A tibble: 525461 x 2

We can’t specify any value for when a test evaluates to FALSE so, when an element doesn’t evaluate to TRUE to any test, the output is NA.

df %>%
  mutate(in_EU = case_when(Country == "Unspecified" ~ Country,
                           Country %in% EU ~ "Yes"), .keep = "used")

A tibble: 525461 x 2

This means that, in order to not produce NAs, each element needs a test that evaluates to TRUE.

- .default

Sometimes that can be bothersome to do, like if we only wanted to modify the name of 2 elements, and we might also miss on some values.

df %>%
  mutate(New_Country = case_when(Country == "United Kingdom" ~ "UK",
                                 Country == "EIRE" ~ "Ireland",
                                 Country == "Austria" ~ Country,
                                 Country == "Belgium" ~ Country,
                                 Country == "Channel Islands" ~ Country,
                                 Country == "Cyprus" ~ Country,
                                 Country == "Denmark" ~ Country,
                                 Country == "Finland" ~ Country,
                                 Country == "France" ~ Country,
                                 Country == "Germany" ~ Country,
                                 Country == "Greece" ~ Country,
                                 Country == "Iceland" ~ Country,
                                 Country == "Italy" ~ Country,
                                 Country == "Lithuania" ~ Country,
                                 Country == "Malta" ~ Country,
                                 Country == "Netherlands" ~ Country,
                                 Country == "Norway" ~ Country,
                                 Country == "Poland" ~ Country,
                                 Country == "Portugal" ~ Country,
                                 Country == "Spain" ~ Country,
                                 Country == "Sweden" ~ Country,
                                 Country == "Switzerland" ~ Country), .keep = "used") %>%
  arrange(Country != "Bahrain")

A tibble: 525461 x 2

So we can use the .default argument for assigning a value of choice instead of NA.

df %>%
  mutate(New_Country = case_when(Country == "United Kingdom" ~ "UK",
                                 Country == "EIRE" ~ "Ireland", .default = Country), .keep = "used") %>%
  arrange(Country != "Bahrain")

A tibble: 525461 x 2

NAs in the input are treated as FALSE, therefore they will return NAs (i.e. row 12).

df %>%
  filter(StockCode == "85166B") %>%
  mutate(Country = na_if(Country, "Unspecified"),
         is_UK = case_when(!Country == "United Kingdom" ~ "No",
                           Country == "United Kingdom" ~ "Yes"), .keep = "used")

A tibble: 39 x 2

But we can again use the .default argument to change them to a value of choice,

df %>%
  filter(StockCode == "85166B") %>%
  mutate(Country = na_if(Country, "Unspecified"),
         is_UK = case_when(!Country == "United Kingdom" ~ "No",
                           Country == "United Kingdom" ~ "Yes", .default = "not available"), .keep = "used")

A tibble: 39 x 2

or alternatively write a test just for them.

df %>%
  filter(StockCode == "85166B") %>%
  mutate(Country = na_if(Country, "Unspecified"),
         is_UK = case_when(is.na(Country) ~ "not available",
                           !Country == "United Kingdom" ~ "No",
                           Country == "United Kingdom" ~ "Yes"), .keep = "used")

A tibble: 39 x 2

- order of evaluation

As the tests are evaluated in order, it is important to start from the more particular one and then generalize, to avoid misclassifications when an element can pertain to more than one class, like “Unspecified” here.

df %>%
  mutate(is_EU = case_when(Country == "Unspecified" ~ Country,
                           Country %in% EU ~ "Yes",
                           !Country %in% EU ~ "No")) %>% 
  count(is_EU, sort = TRUE)

A tibble: 3 x 2

Because, if we had put the “No” condition before the “Unspecified” one, we would have had a column without the “Unspecified” value, as “Unspecified” doesn’t appear in EU so it would have been classified as “No”.

df %>%
  mutate(is_EU = case_when(Country %in% EU ~ "Yes",
                           !Country %in% EU ~ "No",
                           Country == "Unspecified" ~ Country)) %>% 
  count(is_EU)

A tibble: 2 x 2

When present, .default is evaluated last.

As with if_else(), the length of the classes must be consistent across all tests and equal to the number of rows of the data frame or to 1.

It is also important that the length of the logical vectors outputted by the tests are all of the same size (in the second test of the following example that is not the case).

df %>%
  mutate(is_EU = case_when(Country == "Unspecified" ~ Country,
                           EU %in% Country ~ "Yes",
                           !Country %in% EU ~ "No"))

Error in `mutate()`:
ℹ In argument: `is_EU = case_when(...)`.
Caused by error in `case_when()`:
! Can't recycle `..1 (left)` (size 525461) to match `..2 (left)` (size 21).

All the elements of the new column must be of the same type as well.

df %>%
  mutate(is_EU = case_when(Country == "Unspecified" ~ Country,
                           Country %in% EU ~ TRUE,
                           !Country %in% EU ~ FALSE))

Error in `mutate()`:
ℹ In argument: `is_EU = case_when(...)`.
Caused by error in `case_when()`:
! Can't combine `..1 (right)` <character> and `..2 (right)` <logical>.

As with if_else(), NAs can be evaluated to one of the classes.

df %>%
  mutate(Country = na_if(Country, "Unspecified")) %>%
  mutate(is_EU = case_when(Country %in% EU ~ "Yes",
                           !Country %in% EU ~ "No"), .keep = "used") %>%
  arrange(!is.na(Country))

A tibble: 525461 x 2

If this goes against our wishes, we need to add another test just for them.

df %>%
  mutate(Country = na_if(Country, "Unspecified")) %>%
  mutate(is_EU = case_when(is.na(Country) ~ NA,
                           Country %in% EU ~ "Yes",
                           !Country %in% EU ~ "No"), .keep = "used") %>%
  arrange(!is.na(Country))

A tibble: 525461 x 2

As case_when() first evaluates all of the classes for all of the elements to be tested, we might get messages that don’t seem to apply to our situation, like here where the evaluation of sqrt(Quantity) produces NaNs for negative values, NaNs that won’t be in the final output though as for those case_when() will assign a 0.

df %>%
  mutate(New_Quantity = case_when(Quantity > 0 ~ sqrt(Quantity),
                                  Quantity <= 0 ~ 0), .keep = "used") %>%
  arrange(Quantity)

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `New_Quantity = case_when(Quantity > 0 ~
  sqrt(Quantity), Quantity <= 0 ~ 0)`.
Caused by warning in `sqrt()`:
! NaNs produced

A tibble: 525461 x 2

We can use expressions not only for the classes but also for the tests.

df %>%
  mutate(New_Quantity = case_when(Quantity / 2 > 0 ~ sqrt(Quantity),
                                  Quantity <= 0 ~ 0), .keep = "used")

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `New_Quantity = case_when(Quantity/2 > 0 ~
  sqrt(Quantity), Quantity <= 0 ~ 0)`.
Caused by warning in `sqrt()`:
! NaNs produced

A tibble: 525461 x 2

- .ptype

With the .ptype argument we can change the type of the output.

df %>%
  filter(StockCode == "72781") %>%
  mutate(Country = na_if(Country, "Unspecified")) %>%
  mutate(is_EU = case_when(is.na(Country) ~ NA,
                           Country %in% EU ~ TRUE,
                           !Country %in% EU ~ FALSE, .ptype = integer()), .keep = "used")

A tibble: 17 x 2

- .size

And with the .size one we can override its length, for example to recycle an unitary test output to the length of the data frame.

df %>%
  mutate(New_Col = case_when(1 == 1 ~ TRUE, .size = nrow(df)), .keep = "used")

A tibble: 525461 x 1

- case_match()

Thanks to case_match() we can simplify the following example,

df %>%
  mutate(Language = case_when(Country %in% c("Austria", "Germany") ~ "Germanic",
                              Country %in% c("France", "Italy", "Portugal", "Spain") ~ "Romance",
                              Country %in% c("Australia", "EIRE", "United Kingdom", "USA") ~ "English",.default = Country), .keep = "used")

A tibble: 525461 x 2

into this, basically substituting the tests with the values used for matchmaking. Notice how the first argument of the function is the column.

df %>%
  mutate(Language = case_match(Country,
                               c("Austria", "Germany") ~ "Germanic",
                               c("France", "Italy", "Portugal", "Spain") ~ "Romance",
                               c("Australia", "EIRE", "United Kingdom", "USA") ~ "English", .default = Country), .keep = "used")

A tibble: 525461 x 2

- .default / .ptype

As seen in the previous example, we can use the .default argument with case_match().
Furthermore, it has a .ptype argument as well.

df %>%
  mutate(English_Language = case_match(Country,
                                       c("Australia", "EIRE", "United Kingdom", "USA") ~ TRUE, .default = FALSE, .ptype = integer()), .keep = "used")

A tibble: 525461 x 2

- with group_by()

When the data frame is grouped the tests are performed group-wise, so with specific ones the conditions may vary between groups.

For instance a Price of 4.25 is in the following example labeled as “Expensive” for Germany (row 32) but not for United Kingdom.

df %>%
  group_by(Country) %>%
  mutate(Local_Avg_Price = mean(Price),
         Exp_Item = case_when(Price > mean(Price) ~ "Expensive", 
                              Price == mean(Price) ~ "Average",
                              Price < mean(Price) ~ "Not Expensive"), .keep = "used")  %>%
  filter(between(Price, 4.189418, 4.543470))

A tibble: 20476 x 4

Groups: Country [35]