Building on the previous document, where we removed the rows not pertaining to actual transactions, we continue to clean the data frame, concentrating here on missing values and possibly duplicated rows.


- missing values (NAs)

In R missing values are coded as NA (Not Available), and they signify that, for whatever reason, a value in a cell has not been recorded, as here for the Description and CustomerID columns.

library(dplyr)
df %>%
  filter(if_any(everything(), ~ is.na(.x))) %>%
  slice(4)
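
As a quick aside, an NA cannot be detected with the equality operator, because any comparison involving NA returns NA itself; this is why the filter above relies on is.na(). A minimal illustration:

c(1, NA, 3) == NA    # NA NA NA: not a logical test we can filter on
is.na(c(1, NA, 3))   # FALSE TRUE FALSE: the correct test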

Our original data frame had this distribution of missing values.

df %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  tidyr::pivot_longer(cols = everything()) %>%
  left_join(df %>%
              summarise(across(everything(), ~ formattable::percent(mean(is.na(.x))))) %>%
              tidyr::pivot_longer(cols = everything()), by = "name") %>%
  rename(Column = name, "Total Number" = value.x, Percentage = value.y) %>%
  arrange(desc(`Total Number`))

The data frame resulting from the cleaning procedure of the previous document shows instead this distribution,

df_cleaned %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  tidyr::pivot_longer(cols = everything()) %>%
  left_join(df_cleaned %>%
              summarise(across(everything(), ~ formattable::percent(mean(is.na(.x))))) %>%
              tidyr::pivot_longer(cols = everything()), by = "name") %>%
  rename(Column = name, "Total Number" = value.x, Percentage = value.y) %>%
  arrange(desc(`Total Number`))

from which we can see that the manipulations we applied to the original data frame removed some NAs in the CustomerID column and all of them in the Description one, while adding 308 in the Country column, as we changed the Unspecified value to NA.
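
That last conversion is the kind of step dplyr’s na_if() expresses concisely; a minimal sketch, not necessarily the exact code used in the previous document:

df %>%
  mutate(Country = na_if(Country, "Unspecified"))   # turn the literal "Unspecified" into a proper NA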

We also have to mention that removing NAs can modify the tables seen in the last section of the previous document: more specifically this one where, for some countries, the number of customers decreases by one if we remove the missing values in the CustomerID column.

df_cleaned %>%
  count(Country, wt = n_distinct(CustomerID), sort = TRUE, name = "Number of Customers") %>%
  full_join(df_cleaned %>%
              filter(!is.na(CustomerID)) %>%
              count(Country, wt = n_distinct(CustomerID), name = "Number of Customers after Removing NAs"), by = "Country") 

That is because NAs are counted as one value, as if they were an actual CustomerID value, as we can see in the following table showing their distribution for EIRE.

df_cleaned %>%
  filter(Country == "EIRE") %>%
  count(CustomerID, name = "Number of Occurrences")
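
This behaviour stems from n_distinct() treating NA as one distinct value; a small illustration with made-up IDs (its na.rm argument, set to TRUE, drops NAs before counting and could replace the explicit filter used above):

n_distinct(c(14911, 17850, NA))                 # 3: the NA counts as one value
n_distinct(c(14911, 17850, NA), na.rm = TRUE)   # 2: NAs are discarded before counting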


From the table comparing the number of customers per country, we also notice how we lost three countries by removing NAs (Bermuda, Hong Kong and Lebanon), as evidently these three countries only have NAs in the CustomerID column.


- CustomerID

Getting back to the table with the distribution of NAs, CustomerID is obviously very concerning, as its missing values amount to 20% of the whole data frame. Removing them would cause a big loss in information and modeling power, so we will have to decide on a case-by-case basis depending on the type of analysis.

Let’s see if there are some patterns for these missing values, as a way to find causes or common factors.

Every Invoice value has its CustomerID either present or missing for all of its rows, so we can exclude a data entry error such as the CustomerID having been recorded, for example, only on the first row of each invoice and not on the subsequent ones.

df_cleaned %>%
  group_by(Invoice) %>%
  summarize("Percentage of NAs" = formattable::percent(mean(is.na(CustomerID)))) %>%
  count(`Percentage of NAs`, name = "Number of Invoices")
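
We can turn this check into an explicit assertion by looking for invoices whose percentage of missing CustomerID values falls strictly between 0 and 1; an empty result confirms the all-or-nothing pattern:

df_cleaned %>%
  group_by(Invoice) %>%
  summarize(p = mean(is.na(CustomerID))) %>%
  filter(p > 0 & p < 1)   # zero rows expected if the pattern holds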


Let’s see if there are some values in the other character columns for which the CustomerID value is always missing.

We start with StockCode (together with Description, to give some context).

df_cleaned %>%
  group_by(StockCode, Description) %>%
  summarize("Percentage of NAs" = formattable::percent(mean(is.na(CustomerID))),
            "Number of Occurrences" = n(), .groups = "drop") %>%
  filter(`Percentage of NAs` == 1) %>%
  arrange(desc(`Number of Occurrences`))

Then Description alone.

df_cleaned %>%
  group_by(Description) %>%
  summarize("Percentage of NAs" = formattable::percent(mean(is.na(CustomerID))),
            "Number of Occurrences" = n()) %>%
  filter(`Percentage of NAs` == 1) %>%
  arrange(desc(`Number of Occurrences`))

And finally Country, where we rediscover the three countries mentioned earlier,

df_cleaned %>%
  group_by(Country) %>%
  summarize("Percentage of NAs" = formattable::percent(mean(is.na(CustomerID))),
            "Number of Occurrences" = n()) %>%
  filter(`Percentage of NAs` == 1) %>%
  arrange(desc(`Number of Occurrences`))

three countries that don’t contribute much, though, to the overall number of NAs in the CustomerID column (only 118 rows out of 103183).

df_cleaned %>%
  filter(Country %in% c("Bermuda", "Hong Kong", "Lebanon"))


The countries with a percentage of NAs in the CustomerID column below 100% don’t reveal much either, besides the fact that the highest percentages belong to countries outside of Europe.

df_cleaned %>%
  group_by(Country) %>%
  summarize("Percentage of NAs" = formattable::percent(mean(is.na(CustomerID))),
            "Number of Occurrences" = n()) %>%
  filter(`Percentage of NAs` > 0 &
           `Percentage of NAs` < 1) %>%
  arrange(desc(`Percentage of NAs`))

But we also have countries outside of Europe with no NAs in the CustomerID column.

df_cleaned %>%
  group_by(Country) %>%
  summarize("Percentage of NAs" = formattable::percent(mean(is.na(CustomerID))),
            "Number of Occurrences" = n()) %>%
  filter(`Percentage of NAs` == 0) %>%
  arrange(desc(`Number of Occurrences`))


Likewise, there doesn’t seem to be a suspicious value of Quantity

df_cleaned %>%
  group_by(Quantity) %>%
  summarize("Percentage of NAs" = formattable::percent(mean(is.na(CustomerID))),
            "Number of Occurrences" = n()) %>%
  filter(`Percentage of NAs` == 1) %>%
  arrange(desc(`Number of Occurrences`))

or Price

df_cleaned %>%
  group_by(Price) %>%
  summarize("Percentage of NAs" = formattable::percent(mean(is.na(CustomerID))),
            "Number of Occurrences" = n()) %>%
  filter(`Percentage of NAs` == 1) %>%
  arrange(desc(`Number of Occurrences`))

for which a 100% share of NAs in the CustomerID column comes with a markedly higher number of occurrences than for the other values.

For a unitary value of Quantity though we see a large share of NAs, 44.67% of 144125 rows, which is curious for a retailer that caters to wholesalers.

df_cleaned %>%
  group_by(Quantity) %>%
  summarize("Percentage of NAs" = formattable::percent(mean(is.na(CustomerID))),
            "Number of Occurrences" = n()) %>%
  arrange(desc(`Number of Occurrences`))

Let’s see what items these unitary purchases contain and whether some are more prominent than others.

df_cleaned %>%
  filter(is.na(CustomerID) &
           Quantity == 1) %>%
  count(across(c(StockCode, Description)), sort = TRUE, name = "Number of Occurrences")


Maybe there were some specific days on which the CustomerID was not recorded?

library(ggplot2)
df_cleaned %>%
  group_by(InvoiceDay = as.Date(InvoiceDate)) %>%
  summarize(perc = mean(is.na(CustomerID))) %>%
  ggplot(aes(InvoiceDay, perc)) +
  geom_line() +
  scale_y_continuous(labels = scales::label_percent(), limits = c(0, 1)) +
  labs(x = NULL,
       y = NULL,
       title = "% of CustomerID values missing per single InvoiceDay")

The graph shows that the missing CustomerID values are spread across the whole temporal span of our data frame, with spikes on specific days. With 20% of the values missing overall, this was to be expected.


- Country

The missing values in the Country column are much less widespread, as they amount to only 308 rows,

df_cleaned %>%
  filter(is.na(Country))

they are present for just 4 customers,

df_cleaned %>%
  filter(is.na(Country)) %>%
  count(CustomerID, name = "Number of Occurrences")

and 13 invoices, most of which are confirmed ones.

df_cleaned %>%
  filter(is.na(Country)) %>%
  count(Invoice, sort = TRUE, name = "Number of Purchases")

These customers never had a Country value in this data frame, so we can’t impute them.

df_cleaned %>%
  filter(is.na(Country) &
           !is.na(CustomerID)) %>%
  count(CustomerID, Country, name = "Number of Occurrences")
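
Had these customers carried a Country value on at least one other row, we could have imputed the missing ones within each CustomerID, for instance with tidyr::fill(); a minimal sketch of that approach, of no use here precisely because there is nothing to fill from:

df_cleaned %>%
  group_by(CustomerID) %>%
  tidyr::fill(Country, .direction = "downup") %>%   # copy the known Country up and down within each customer
  ungroup()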

There doesn’t seem to be a particular item

df_cleaned %>%
  group_by(StockCode, Description) %>%
  summarize("Percentage of NAs" = formattable::percent(mean(is.na(Country))),
            "Number of Occurrences" = n(), .groups = "drop") %>%
  arrange(desc(`Percentage of NAs`))

or a Quantity

df_cleaned %>%
  group_by(Quantity) %>%
  summarize("Percentage of NAs" = formattable::percent(mean(is.na(Country))),
            "Number of Occurrences" = n(), .groups = "drop") %>%
  arrange(desc(`Percentage of NAs`))

or Price

df_cleaned %>%
  group_by(Price) %>%
  summarize("Percentage of NAs" = formattable::percent(mean(is.na(Country))),
            "Number of Occurrences" = n(), .groups = "drop") %>%
  arrange(desc(`Percentage of NAs`))

value for which the NAs in the Country column are markedly more frequent.

On the time scale, this is where NAs are located.

df_cleaned %>%
  filter(is.na(Country)) %>%
  group_by(Invoice, "Invoice Day" = as.Date(InvoiceDate)) %>%
  summarise("Number of Purchases" = n(), .groups = "drop") %>%
  arrange(`Invoice Day`)

df_cleaned %>%
  group_by(InvoiceDay = as.Date(InvoiceDate)) %>%
  summarize(N = sum(is.na(Country))) %>%
  ggplot(aes(InvoiceDay, N)) +
  geom_line() +
  labs(x = NULL,
       y = NULL,
       title = "Number of Country values missing per single InvoiceDay")


As a final observation, there are 30 rows where we have NAs in both the CustomerID and Country columns.

df_cleaned %>%
  filter(is.na(CustomerID) & 
           is.na(Country))


- duplicated rows

Sometimes we might have rows that are duplicated; as an example we show the repetitions for the stock codes 21491 and 21912, occurring in invoice 489517.

df_cleaned %>%
  filter(Invoice == "489517" &
           StockCode %in% c("21491", "21912"))


We can easily remove the duplicated rows after the first,

df_cleaned %>%
  filter(Invoice == "489517" &
           StockCode %in% c("21491", "21912")) %>%
  distinct()

and doing that on the whole data frame we notice that we lose 6853 rows, roughly 1%.

tibble("with Duplicated Rows" = nrow(df_cleaned), 
       "w/o Duplicated Rows" = df_cleaned %>%
         distinct() %>%
         nrow(),
       "Difference" = `with Duplicated Rows` - `w/o Duplicated Rows`,
       "Percentage" = formattable::percent(Difference / `with Duplicated Rows`))


We can inspect the rows that appear more than once,

df_cleaned %>%
  count(across(everything()), name = "Number of Occurrences") %>%
  filter(`Number of Occurrences` > 1) %>%
  arrange(InvoiceDate)

to see if they occur under certain conditions, usually around specific time stamps, and we see that at the end of 2010 they are slightly more frequent, as shown by the smoothed line that represents the trend (do recall though that we have a significant upward trend in the number of invoices per day).

library(ggplot2)
df_cleaned %>%
  count(across(everything()), name = "Number of Occurrences") %>%
  mutate(Status = if_else(`Number of Occurrences` > 1, "Repetition", "Not A Repetition"),
         InvoiceDay = as.Date(InvoiceDate)) %>%
  group_by(InvoiceDay, Status) %>%
  summarise(n = sum(`Number of Occurrences`), .groups = "drop_last") %>%
  mutate(Percentage = n / sum(n)) %>%
  ungroup() %>%
  filter(Status == "Repetition") %>%
  ggplot(aes(InvoiceDay, Percentage)) +
  geom_line() +
  geom_smooth(formula = y ~ x, method = "loess", color = "black", se = FALSE) +  
  scale_y_continuous(labels = scales::label_percent()) +
  labs(x = NULL,
       y = NULL,
       title = "Percentage of Duplicated Rows per Day")


To advance a hypothesis, we recall invoice 489517, an invoice with 38 rows, 8 of which are duplicated one or several times.

df_cleaned %>%
  filter(Invoice == "489517") %>%
  group_by(across(everything())) %>%
  mutate("Number of Occurrences" = n()) %>%
  ungroup() %>%
  arrange(StockCode)


We suggest that it could also be possible that, given the nature of this data frame, those are not duplicated rows but simply purchases that were entered several times without being aggregated. Here for example stock code 21491 might have had just one row with a Quantity of 2.

df_cleaned %>%
  filter(Invoice == "489517" &
           StockCode == "21491")

In the same way, stock code 21790 has one row with a Quantity of 4 and another with 1.

df_cleaned %>%
  filter(Invoice == "489517" &
           StockCode == "21790")

So erasing them might not be a good option as we would lose information about Quantity.

If we were to also remove rows like the ones for stock code 21790, where the only difference lies in the Quantity column, we would lose 5985 additional rows.

tibble("w/o Duplicated Rows" = df_cleaned %>%
         distinct() %>%
         nrow(),
       "w/o Different Quantity Rows" = df_cleaned %>%
         group_by(pick(!Quantity)) %>%
         slice(1) %>%
         ungroup() %>%
         nrow(),
       "Difference" = `w/o Duplicated Rows` - `w/o Different Quantity Rows`)


We could therefore merge these rows, in case we consider them not to be duplicates.

df_cleaned %>%
  group_by(across(c(-Quantity))) %>%
  summarise(Quantity = sum(Quantity), .groups = "drop") %>%
  filter(Invoice == "489517") %>%
  relocate(Quantity, .after = Description)


The data frame presents various cases like these, where we have the same stock code with a different quantity in the same invoice,

df %>%
  count(Invoice, StockCode, wt = n_distinct(Quantity), name = "Number of Occurrences") %>%
  filter(`Number of Occurrences` > 1)


but also stock codes with different prices in the same invoice, like here

df %>%
  filter(Invoice == "489560" &
           StockCode == "35955")


albeit this phenomenon is less frequent,

df %>%
  count(Invoice, StockCode, wt = n_distinct(Price), name = "Number of Occurrences") %>%
  filter(`Number of Occurrences` > 1)

and can be more easily explained by different unit prices for different volumes of purchases, although the fact that it happens within the same invoice is rather odd.
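
If we wanted to quantify these price discrepancies, we could compute the price spread per invoice and stock code; a small sketch along the same lines as the code above:

df %>%
  group_by(Invoice, StockCode) %>%
  summarise("Price Spread" = max(Price) - min(Price), .groups = "drop") %>%
  filter(`Price Spread` > 0) %>%
  arrange(desc(`Price Spread`))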


- main takeaways

  • we could not find an explanation for the high number of missing values in the CustomerID column
  • the missing values in the Country column cannot be imputed, as they pertain to customers for which we don’t have that information
  • duplicated rows might be actual purchases and not repetitions caused by faulty events
  • the same invoice can present different prices for the same stock code


- actions performed

No definitive manipulations were performed here; we will decide on a case-by-case basis depending on the analysis at hand.