We can nest several if_else()
s if we want a column with
more than two classes in the output.
<- c("Austria", "Belgium", "Channel Islands", "Cyprus", "Denmark", "EIRE", "Finland", "France", "Germany",
EU "Greece", "Iceland", "Italy", "Lithuania", "Malta", "Netherlands", "Norway", "Poland",
"Portugal", "Spain", "Sweden", "Switzerland")
%>%
df mutate(in_EU = if_else(Country == "Unspecified", Country,
if_else(Country %in% EU, "Yes", "No")), .keep = "used")
But it is best to use case_when()
, which is built for
this purpose. It uses a formula like syntax, where on the right of
~
we specify which value to assign when the test on the
left evaluates to TRUE.
%>%
df mutate(in_EU = case_when(Country == "Unspecified" ~ Country,
%in% EU ~ "Yes",
Country !Country %in% EU ~ "No"), .keep = "used")
We can’t specify any value for when a test evaluates to FALSE so, when an element doesn’t evaluate to TRUE to any test, the output is NA.
%>%
df mutate(in_EU = case_when(Country == "Unspecified" ~ Country,
%in% EU ~ "Yes"), .keep = "used") Country
This means that, in order to not produce NAs, each element needs a test that evaluates to TRUE.
Sometimes that can be bothersome to do, like if we only wanted to modify the name of 2 elements, and we might also miss on some values.
%>%
df mutate(New_Country = case_when(Country == "United Kingdom" ~ "UK",
== "EIRE" ~ "Ireland",
Country == "Austria" ~ Country,
Country == "Belgium" ~ Country,
Country == "Channel Islands" ~ Country,
Country == "Cyprus" ~ Country,
Country == "Denmark" ~ Country,
Country == "Finland" ~ Country,
Country == "France" ~ Country,
Country == "Germany" ~ Country,
Country == "Greece" ~ Country,
Country == "Iceland" ~ Country,
Country == "Italy" ~ Country,
Country == "Lithuania" ~ Country,
Country == "Malta" ~ Country,
Country == "Netherlands" ~ Country,
Country == "Norway" ~ Country,
Country == "Poland" ~ Country,
Country == "Portugal" ~ Country,
Country == "Spain" ~ Country,
Country == "Sweden" ~ Country,
Country == "Switzerland" ~ Country), .keep = "used") %>%
Country arrange(Country != "Bahrain")
So we can use the .default
argument for assigning a
value of choice instead of NA.
%>%
df mutate(New_Country = case_when(Country == "United Kingdom" ~ "UK",
== "EIRE" ~ "Ireland", .default = Country), .keep = "used") %>%
Country arrange(Country != "Bahrain")
NAs in the input are treated as FALSE, therefore they will return NAs (i.e. row 12).
%>%
df filter(StockCode == "85166B") %>%
mutate(Country = na_if(Country, "Unspecified"),
is_UK = case_when(!Country == "United Kingdom" ~ "No",
== "United Kingdom" ~ "Yes"), .keep = "used") Country
But we can again use the .default
argument to change
them to a value of choice,
%>%
df filter(StockCode == "85166B") %>%
mutate(Country = na_if(Country, "Unspecified"),
is_UK = case_when(!Country == "United Kingdom" ~ "No",
== "United Kingdom" ~ "Yes", .default = "not available"), .keep = "used") Country
or alternatively write a test just for them.
%>%
df filter(StockCode == "85166B") %>%
mutate(Country = na_if(Country, "Unspecified"),
is_UK = case_when(is.na(Country) ~ "not available",
!Country == "United Kingdom" ~ "No",
== "United Kingdom" ~ "Yes"), .keep = "used") Country
As the tests are evaluated in order, it is important to start from the more particular one and then generalize, to avoid misclassifications when an element can pertain to more than one class, like “Unspecified” here.
%>%
df mutate(is_EU = case_when(Country == "Unspecified" ~ Country,
%in% EU ~ "Yes",
Country !Country %in% EU ~ "No")) %>%
count(is_EU, sort = TRUE)
Because, if we had put the “No” condition before the “Unspecified”
one, we would have had a column without the “Unspecified” value, as
“Unspecified” doesn’t appear in EU
so it would have been
classified as “No”.
%>%
df mutate(is_EU = case_when(Country %in% EU ~ "Yes",
!Country %in% EU ~ "No",
== "Unspecified" ~ Country)) %>%
Country count(is_EU)
When present, .default
is evaluated last.
As with if_else()
, the length of the classes must be
consistent across all tests and equal to the number of rows of the data
frame or to 1.
It is also important that the length of the logical vectors outputted by the tests are all of the same size (in the second test of the following example that is not the case).
%>%
df mutate(is_EU = case_when(Country == "Unspecified" ~ Country,
%in% Country ~ "Yes",
EU !Country %in% EU ~ "No"))
Error in `mutate()`:
ℹ In argument: `is_EU = case_when(...)`.
Caused by error in `case_when()`:
! Can't recycle `..1 (left)` (size 525461) to match `..2 (left)` (size 21).
All the elements of the new column must be of the same type as well.
%>%
df mutate(is_EU = case_when(Country == "Unspecified" ~ Country,
%in% EU ~ TRUE,
Country !Country %in% EU ~ FALSE))
Error in `mutate()`:
ℹ In argument: `is_EU = case_when(...)`.
Caused by error in `case_when()`:
! Can't combine `..1 (right)` <character> and `..2 (right)` <logical>.
As with if_else()
, NAs can be evaluated to one of the
classes.
%>%
df mutate(Country = na_if(Country, "Unspecified")) %>%
mutate(is_EU = case_when(Country %in% EU ~ "Yes",
!Country %in% EU ~ "No"), .keep = "used") %>%
arrange(!is.na(Country))
If this goes against our wishes, we need to add another test just for them.
%>%
df mutate(Country = na_if(Country, "Unspecified")) %>%
mutate(is_EU = case_when(is.na(Country) ~ NA,
%in% EU ~ "Yes",
Country !Country %in% EU ~ "No"), .keep = "used") %>%
arrange(!is.na(Country))
As case_when()
first evaluates all of the classes for
all of the elements to be tested, we might get messages that don’t seem
to apply to our situation, like here where the evaluation of
sqrt(Quantity)
produces NaNs for negative values, NaNs that
won’t be in the final output though as for those
case_when()
will assign a 0.
%>%
df mutate(New_Quantity = case_when(Quantity > 0 ~ sqrt(Quantity),
<= 0 ~ 0), .keep = "used") %>%
Quantity arrange(Quantity)
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `New_Quantity = case_when(Quantity > 0 ~
sqrt(Quantity), Quantity <= 0 ~ 0)`.
Caused by warning in `sqrt()`:
! NaNs produced
We can use expressions not only for the classes but also for the tests.
%>%
df mutate(New_Quantity = case_when(Quantity / 2 > 0 ~ sqrt(Quantity),
<= 0 ~ 0), .keep = "used") Quantity
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `New_Quantity = case_when(Quantity/2 > 0 ~
sqrt(Quantity), Quantity <= 0 ~ 0)`.
Caused by warning in `sqrt()`:
! NaNs produced
With the .ptype
argument we can change the type of the
output.
%>%
df filter(StockCode == "72781") %>%
mutate(Country = na_if(Country, "Unspecified")) %>%
mutate(is_EU = case_when(is.na(Country) ~ NA,
%in% EU ~ TRUE,
Country !Country %in% EU ~ FALSE, .ptype = integer()), .keep = "used")
And with the .size
one we can override its length, for
example to recycle an unitary test output to the length of the data
frame.
%>%
df mutate(New_Col = case_when(1 == 1 ~ TRUE, .size = nrow(df)), .keep = "used")
Thanks to case_match()
we can simplify the following
example,
%>%
df mutate(Language = case_when(Country %in% c("Austria", "Germany") ~ "Germanic",
%in% c("France", "Italy", "Portugal", "Spain") ~ "Romance",
Country %in% c("Australia", "EIRE", "United Kingdom", "USA") ~ "English",.default = Country), .keep = "used") Country
into this, basically substituting the tests with the values used for matchmaking. Notice how the first argument of the function is the column.
%>%
df mutate(Language = case_match(Country,
c("Austria", "Germany") ~ "Germanic",
c("France", "Italy", "Portugal", "Spain") ~ "Romance",
c("Australia", "EIRE", "United Kingdom", "USA") ~ "English", .default = Country), .keep = "used")
As seen in the previous example, we can use the .default
argument with case_match()
.
Furthermore, it has a .ptype
argument as well.
%>%
df mutate(English_Language = case_match(Country,
c("Australia", "EIRE", "United Kingdom", "USA") ~ TRUE, .default = FALSE, .ptype = integer()), .keep = "used")
When the data frame is grouped the tests are performed group-wise, so with specific ones the conditions may vary between groups.
For instance a Price
of 4.25 is in the following example
labeled as “Expensive” for Germany (row 32) but not for United
Kingdom.
%>%
df group_by(Country) %>%
mutate(Local_Avg_Price = mean(Price),
Exp_Item = case_when(Price > mean(Price) ~ "Expensive",
== mean(Price) ~ "Average",
Price < mean(Price) ~ "Not Expensive"), .keep = "used") %>%
Price filter(between(Price, 4.189418, 4.543470))