Missing values in a data frame are coded as NA (not available) and are elements that can prove themselves troublesome, especially as, when involved, they can alter the results of calculations.
<- c(1, 2, 3)) (x
## [1] 1 2 3
mean(x)
## [1] 2
sum(x)
## [1] 6
<- c(x, NA)) (xNA
## [1] 1 2 3 NA
mean(xNA)
## [1] NA
sum(xNA)
## [1] NA
mean()
and sum()
have an na.rm
argument that helps deal with them
mean(xNA, na.rm = TRUE)
## [1] 2
sum(xNA, na.rm = TRUE)
## [1] 6
but other functions might not, so it is important to know how to detect and remove or substitute them, hence why I decided to collect and organize in this page procedures possibly already presented in other ones.
At first it is useful to know their location, marking with TRUE the columns where they are present and with FALSE otherwise.
%>%
df summarise(across(everything(), ~ any(is.na(.x))))
Additionally, we may want to know how many of them are there per column,
%>%
df summarise(across(everything(), ~ sum(is.na(.x))))
and as a percentage over the number of rows.
%>%
df summarise(across(everything(), ~ formattable::percent(mean(is.na(.x)))))
We can combine those two information in just one output.
%>%
df summarise(across(everything(), ~ sum(is.na(.x)))) %>%
::pivot_longer(everything(), names_to = "Column", values_to = "#_of_NAs") %>%
tidyrbind_cols(df %>%
summarise(across(everything(), ~ formattable::percent(mean(is.na(.x))))) %>%
::pivot_longer(everything(), names_to = "Column", values_to = "%_of_NAs") %>%
tidyrselect(2))
Knowing about their presence and number for each row can also be useful.
%>%
df mutate(row_with_NAs = as.logical(rowSums(is.na(df))), .before = 1)
%>%
df mutate(`#_NAs` = rowSums(is.na(df)), .before = 1)
For a grouped data frame we can show, for every column and for groups with NAs, both their total number
%>%
df group_by(Country) %>%
summarise(across(everything(), ~ sum(is.na(.x)))) %>%
filter(rowSums(across(Invoice:`Customer ID`)) != 0)
and the percentage over the total number of rows for each group.
%>%
df group_by(Country) %>%
summarise(across(everything(), ~ formattable::percent(mean(is.na(.x))))) %>%
filter(rowSums(across(Invoice:`Customer ID`)) != 0)
We can also concentrate on only one column,
%>%
df group_by(Country) %>%
summarise(NAs_in_Customer_ID = any(is.na(`Customer ID`)))
%>%
df group_by(Country) %>%
summarise(`#_of_NAs_in_Customer_ID` = sum(is.na(`Customer ID`)))
merging the two information into one table, only for groups with NAs.
%>%
df count(`NAs_in_Customer_ID` = is.na(`Customer ID`), Country, name = "#_of_rows") %>%
arrange(Country) %>%
group_by(Country) %>%
filter(n() > 1)
If we wish to remove the columns with NAs we can select the ones where all their values are not NAs,
%>%
df select(where(~ all(!is.na(.x))))
or deselect the ones where any of their values are NAs; these two commands are equivalent.
%>%
df select(where(~ !any(is.na(.x))))
Of course we can also decide to select the columns with NAs, with either one of the following lines of code.
%>%
df select(where(~ any(is.na(.x))))
%>%
df select(where(~ !all(!is.na(.x))))
A similar call can be useful to investigate if there are columns with only NAs.
%>%
df select(where(~ all(is.na(.x))))
If we wish to preserve the rows without NAs in a particular column we can write
%>%
df filter(!is.na(Description))
Using the correct Boolean operator, we can also decide to preserve the rows without NAs in either one of the specified columns,
%>%
df filter(!is.na(Description) |
!is.na(`Customer ID`))
or in all of them.
%>%
df filter(!is.na(Description) &
!is.na(`Customer ID`))
If we yet don’t know which columns have NAs, we can preserve the rows without them like this:
%>%
df filter(if_all(everything(), ~ !is.na(.x)))
Using if_any()
will instead be equivalent to preserving
the rows without NAs in any given column, so only rows with all NAs will
be removed in this case.
%>%
df filter(if_any(everything(), ~ !is.na(.x)))
Of course we can also decide to preserve rows with NAs; we can do so
by removing the negations from the previous five examples. With the
first one we will only preserve the rows with NAs in the
Description
column.
%>%
df filter(is.na(Description))
With the second one we will preserve rows with NAs in either one of the columns specified,
%>%
df filter(is.na(Description) |
is.na(`Customer ID`))
which is here equivalent to the last one without negations, as all
the rows with NAs in Description
have NAs in
Customer ID
.
%>%
df filter(if_any(everything(), ~ is.na(.x)))
With the third one the rows with NAs in all of the columns specified, again equivalent to a previous call given the NAs distribution of this data frame.
%>%
df filter(is.na(Description) &
is.na(`Customer ID`))
And with the fourth example the rows that only have NAs (so none with this data frame).
%>%
df filter(if_all(everything(), ~ is.na(.x)))
Another approach is to substitute NAs with another value; the following will work as long as the columns that contain them are of the same type of the value we want to input.
%>%
df mutate(across(where(~ any(is.na(.x)) & is.character(.x)), ~ coalesce(.x, "missing")))
%>%
df mutate(across(where(~ any(is.na(.x)) & is.numeric(.x)), ~ coalesce(.x, 000000)))
The previous two calls can be combined into a single one.
%>%
df mutate(across(where(~ any(is.na(.x)) & is.character(.x)), ~ coalesce(.x, "missing")),
across(where(~ any(is.na(.x)) & is.numeric(.x)), ~ coalesce(.x, 000000)))
If instead we want to recode a specific value to NA, we can proceed like this.
%>%
df mutate(across(where(is.character), ~ na_if(., "Unspecified")),
across(where(is.numeric), ~ na_if(., 12)))
NA is a special coded value, useful to identify missing values in data frames, but we can apply the previous procedures also to a value of choice, like for example “CHRISTMAS”.
%>%
df summarise(across(everything(), ~ any(stringr::str_detect(.x, "CHRISTMAS"), na.rm = TRUE)))
%>%
df summarise(across(everything(), ~ sum(stringr::str_detect(.x, "CHRISTMAS"), na.rm = TRUE)))
%>%
df summarise(across(everything(), ~ formattable::percent(mean(stringr::str_detect(.x, "CHRISTMAS"), na.rm = TRUE))))
We had to use a function from the stringr
package to act
on behalf of is.na()
, and also to add
na.rm = TRUE
to not have NAs unfavourably modify the output
(for any()
/all()
the reasons are explained here
in the any(), all() & their variants page, but briefly
any(FALSE, NA)
returns NA
).
%>%
df summarise(across(everything(), ~ any(stringr::str_detect(.x, "CHRISTMAS"))))
To substitute, we can use str_replace_all()
, always from
the stringr
package.
%>%
df mutate(across(where(is.character), ~ stringr::str_replace_all(.x, "CHRISTMAS", "EASTER")))