- introduction

Missing values in a data frame are coded as NA (not available) and are elements that can prove themselves troublesome, especially as, when involved, they can alter the results of calculations.

(x <- c(1, 2, 3))

## [1] 1 2 3

mean(x)

## [1] 2

sum(x)

## [1] 6

(xNA <- c(x, NA))

## [1]  1  2  3 NA

mean(xNA)

## [1] NA

sum(xNA)

## [1] NA

mean() and sum() have an na.rm argument that helps deal with them

mean(xNA, na.rm = TRUE)

## [1] 2

sum(xNA, na.rm = TRUE)

## [1] 6

but other functions might not, so it is important to know how to detect and remove or substitute them, hence why I decided to collect and organize in this page procedures possibly already presented in other ones.

- column-wise detection

At first it is useful to know their location, marking with TRUE the columns where they are present and with FALSE otherwise.

df %>%
  summarise(across(everything(), ~ any(is.na(.x))))

A tibble: 1 x 8

Additionally, we may want to know how many of them are there per column,

df %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

A tibble: 1 x 8

and as a percentage over the number of rows.

df %>%
  summarise(across(everything(), ~ formattable::percent(mean(is.na(.x)))))

A tibble: 1 x 8

We can combine those two information in just one output.

df %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  tidyr::pivot_longer(everything(), names_to = "Column", values_to = "#_of_NAs") %>%
  bind_cols(df %>%
              summarise(across(everything(), ~ formattable::percent(mean(is.na(.x))))) %>%
              tidyr::pivot_longer(everything(), names_to = "Column", values_to = "%_of_NAs") %>%
              select(2))

A tibble: 8 x 3

- row-wise detection

Knowing about their presence and number for each row can also be useful.

df %>%
  mutate(row_with_NAs = as.logical(rowSums(is.na(df))), .before = 1)

A tibble: 525461 x 9

df %>%
  mutate(`#_NAs` = rowSums(is.na(df)), .before = 1)

A tibble: 525461 x 9

- group-wise detection

For a grouped data frame we can show, for every column and for groups with NAs, both their total number

df %>%
  group_by(Country) %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  filter(rowSums(across(Invoice:`Customer ID`)) != 0)

A tibble: 13 x 8

and the percentage over the total number of rows for each group.

df %>%
  group_by(Country) %>%
  summarise(across(everything(), ~ formattable::percent(mean(is.na(.x))))) %>%
  filter(rowSums(across(Invoice:`Customer ID`)) != 0)

A tibble: 13 x 8

We can also concentrate on only one column,

df %>%
  group_by(Country) %>%
  summarise(NAs_in_Customer_ID = any(is.na(`Customer ID`)))

A tibble: 40 x 2

df %>%
  group_by(Country) %>%
  summarise(`#_of_NAs_in_Customer_ID` = sum(is.na(`Customer ID`)))

A tibble: 40 x 2

merging the two information into one table, only for groups with NAs.

df %>%
  count(`NAs_in_Customer_ID` = is.na(`Customer ID`), Country, name = "#_of_rows") %>%
  arrange(Country) %>%
  group_by(Country) %>%
  filter(n() > 1)

A tibble: 20 x 3

Groups: Country [10]

- column-wise removal

If we wish to remove the columns with NAs we can select the ones where all their values are not NAs,

df %>%
  select(where(~ all(!is.na(.x))))

A tibble: 525461 x 6

or deselect the ones where any of their values are NAs; these two commands are equivalent.

df %>%
  select(where(~ !any(is.na(.x))))

A tibble: 525461 x 6

Of course we can also decide to select the columns with NAs, with either one of the following lines of code.

df %>%
  select(where(~ any(is.na(.x))))

A tibble: 525461 x 2

df %>%
  select(where(~ !all(!is.na(.x))))

A tibble: 525461 x 2

A similar call can be useful to investigate if there are columns with only NAs.

df %>%
  select(where(~ all(is.na(.x))))

A tibble: 525461 x 0

- row-wise removal

If we wish to preserve the rows without NAs in a particular column we can write

df %>%
  filter(!is.na(Description))

A tibble: 522533 x 8

Using the correct Boolean operator, we can also decide to preserve the rows without NAs in either one of the specified columns,

df %>%
  filter(!is.na(Description) |
           !is.na(`Customer ID`))

A tibble: 522533 x 8

or in all of them.

df %>%
  filter(!is.na(Description) &
           !is.na(`Customer ID`))

A tibble: 417534 x 8

If we yet don’t know which columns have NAs, we can preserve the rows without them like this:

df %>%
  filter(if_all(everything(), ~ !is.na(.x)))

A tibble: 417534 x 8

Using if_any() will instead be equivalent to preserving the rows without NAs in any given column, so only rows with all NAs will be removed in this case.

df %>%
  filter(if_any(everything(), ~ !is.na(.x)))

A tibble: 525461 x 8

Of course we can also decide to preserve rows with NAs; we can do so by removing the negations from the previous five examples. With the first one we will only preserve the rows with NAs in the Description column.

df %>%
  filter(is.na(Description))

A tibble: 2928 x 8

With the second one we will preserve rows with NAs in either one of the columns specified,

df %>%
  filter(is.na(Description) |
           is.na(`Customer ID`))

A tibble: 107927 x 8

which is here equivalent to the last one without negations, as all the rows with NAs in Description have NAs in Customer ID.

df %>%
  filter(if_any(everything(), ~ is.na(.x)))

A tibble: 107927 x 8

With the third one the rows with NAs in all of the columns specified, again equivalent to a previous call given the NAs distribution of this data frame.

df %>%
  filter(is.na(Description) &
           is.na(`Customer ID`))

A tibble: 2928 x 8

And with the fourth example the rows that only have NAs (so none with this data frame).

df %>%
  filter(if_all(everything(), ~ is.na(.x)))

A tibble: 0 x 8

- substitution

Another approach is to substitute NAs with another value; the following will work as long as the columns that contain them are of the same type of the value we want to input.

df %>%
  mutate(across(where(~ any(is.na(.x)) & is.character(.x)), ~ coalesce(.x, "missing")))

A tibble: 525461 x 8

df %>%
  mutate(across(where(~ any(is.na(.x)) & is.numeric(.x)), ~ coalesce(.x, 000000)))

A tibble: 525461 x 8

The previous two calls can be combined into a single one.

df %>%
  mutate(across(where(~ any(is.na(.x)) & is.character(.x)), ~ coalesce(.x, "missing")),
         across(where(~ any(is.na(.x)) & is.numeric(.x)), ~ coalesce(.x, 000000)))

A tibble: 525461 x 8

If instead we want to recode a specific value to NA, we can proceed like this.

df %>%
  mutate(across(where(is.character), ~ na_if(., "Unspecified")),
         across(where(is.numeric), ~ na_if(., 12)))

A tibble: 525461 x 8

- other values besides NA

NA is a special coded value, useful to identify missing values in data frames, but we can apply the previous procedures also to a value of choice, like for example “CHRISTMAS”.

df %>%
  summarise(across(everything(), ~ any(stringr::str_detect(.x, "CHRISTMAS"), na.rm = TRUE)))

A tibble: 1 x 8

df %>%
  summarise(across(everything(), ~ sum(stringr::str_detect(.x, "CHRISTMAS"), na.rm = TRUE)))

A tibble: 1 x 8

df %>%
  summarise(across(everything(), ~ formattable::percent(mean(stringr::str_detect(.x, "CHRISTMAS"), na.rm = TRUE))))

A tibble: 1 x 8

We had to use a function from the stringr package to act on behalf of is.na(), and also to add na.rm = TRUE to not have NAs unfavourably modify the output (for any()/all() the reasons are explained here in the any(), all() & their variants page, but briefly any(FALSE, NA) returns NA).

df %>%
  summarise(across(everything(), ~ any(stringr::str_detect(.x, "CHRISTMAS"))))

A tibble: 1 x 8

To substitute, we can use str_replace_all(), always from the stringr package.

df %>%
  mutate(across(where(is.character), ~ stringr::str_replace_all(.x, "CHRISTMAS", "EASTER")))

A tibble: 525461 x 8