We will continue here, with a snappier exposition, in the same vein as in the previous document, focusing this time on the Price column, whose values have this distribution.

df %>%
  reframe(Value = summary(Price))  %>%
  mutate(Statistic = c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max"), .before = Value)


- preliminary inspection

We will here inspect the highest

df %>%
  arrange(desc(Price)) %>%
  relocate(InvoiceDate, .after = Price)

and lowest values of the column, to see if we spot some bizarre entries.

df %>%
  arrange(Price)  %>%
  relocate(InvoiceDate, .after = Price)

From the most expensive purchases, the item FLAG OF ST GEORGE CAR FLAG, purchases 65 times, seems to have a high variability in price,

df %>%
  filter(StockCode == "84016")

with 0.42 being the most regular one but we can also encounter values of 1157.15.

df %>%
  filter(StockCode == "84016") %>%
  count(Price, sort = TRUE, name = "Number of Occurrences")

These high values are somehow concentrated in the middle of the year.

library(ggplot2)
df %>%
  filter(StockCode == "84016") %>%
  ggplot(aes(InvoiceDate, Price)) +
  geom_point() +
  labs(x = NULL,
       y = NULL,
       title = "Prices of Purchases for StockCode 84016 (FLAG OF ST GEORGE CAR FLAG) \n along the Time Period")

And, interestingly enough, they are not associated to any CustomerID, while the lower ones, except 5 instances, all are.

df %>%
  filter(StockCode == "84016") %>%
  count(Price, CustomerID, name = "Number of Occurrences") %>%
  arrange(desc(Price))

They don’t seem typos, but transactions we don’t have enough information about. As their presence doesn’t modify the values of the upper and lower bounds returned by the formulas, we will not remove them, but it is important that we are aware of their existence going forward.

But we will still eliminate the rows of cancelled invoices, even if they also don’t change the values of the bounds, as their substantial number can modify many percentages, plus, as we discussed in the previous document, we find more correct to consider cancelled invoices as they never existed for the goals of this analysis.


- upper outliers

The number returned by the formula is then 8.65,

df_outlier <- df %>%
  filter(Quantity >= 0)
"Price Upper Bound" <- unname(quantile(df_outlier$Price, probs = 0.75) + 1.5 * (quantile(df_outlier$Price, probs = 0.75) - quantile(df_outlier$Price, probs = 0.25)))
library(knitr)
kable(tibble(`Price Upper Bound`), align = "l")
Price Upper Bound
8.65

so, mirroring the analysis we did for the Quantity column, we can start by evaluating how many purchases reached higher prices.

df_outlier %>%
  mutate("Upper Outlier" = if_else(Price > `Price Upper Bound`, TRUE, FALSE)) %>%
  summarise("Total of Purchases" = n(),
            "Upper Outliers" = sum(`Upper Outlier`),
            "Percentage of Upper Outliers" = formattable::percent(mean(`Upper Outlier`)))


We continue then with this table of items sorted by the most frequently highly priced,

df_outlier %>%
  mutate("Upper Outlier" = if_else(Price > `Price Upper Bound`, TRUE, FALSE)) %>%
  group_by(StockCode, Description) %>%
  summarise("Rounded Median Price" = round(median(Price), 0),
            "Number of Purchases" = n(),
            "Percentage of Upper Outliers" = formattable::percent(mean(`Upper Outlier`)), .groups = "drop") %>%
  arrange(desc(`Percentage of Upper Outliers`))

then showing its distribution of Percentage of Upper Outliers, first with a table

df_outlier %>%
  mutate("Upper Outlier" = if_else(Price > `Price Upper Bound`, TRUE, FALSE)) %>%
  group_by(StockCode, Description) %>%
  summarise("Percentage of Upper Outliers" = formattable::percent(mean(`Upper Outlier`)), .groups = "drop") %>%
  count(`Percentage of Upper Outliers`, name = "Number of Distinct Items", sort = TRUE) %>%
  mutate("Percentage of Items" = formattable::percent(`Number of Distinct Items` / sum(`Number of Distinct Items`)))

and then with a graph, where we binned the percentages into 100 different buckets.

df_outlier %>%
  mutate("Upper Outlier" = if_else(Price > `Price Upper Bound`, TRUE, FALSE)) %>%
  group_by(StockCode, Description) %>%
  summarise("Percentage of Upper Outliers" = mean(`Upper Outlier`), .groups = "drop") %>%
  ggplot(aes(`Percentage of Upper Outliers`)) +
  geom_histogram(bins = 100) +
  scale_x_continuous(labels = scales::label_percent()) +
  labs(x = NULL,
       y = NULL,
       title = "Number of Distinct Items for Integer Percentages of Upper Outliers Purchases")

We notice that highly priced items are sold in small quantities.

df_outlier %>%
  mutate("Upper Outlier" = if_else(Price > `Price Upper Bound`, TRUE, FALSE)) %>%
  group_by(StockCode, Description) %>%
  mutate("Percentage of Upper Outliers" = mean(`Upper Outlier`),
         "100% Upper Outlier" = if_else(`Percentage of Upper Outliers` == 1, TRUE, FALSE)) %>%
  ungroup() %>%
  count(`100% Upper Outlier`, wt = round(median(Quantity), 2), name = "Rounded Median Quantity")


We proceed to determine if there are some days where the purchase of highly priced item is more common.

df_outlier %>%
  mutate("Upper Outlier" = if_else(Price > `Price Upper Bound`, TRUE, FALSE)) %>%
  group_by(InvoiceDay = as.Date(InvoiceDate)) %>%
  summarise("Rounded Median Price" = round(median(Price), 0),
            "Number of Purchases" = n(),
            "Percentage of Upper Outliers" = formattable::percent(mean(`Upper Outlier`))) %>%
  arrange(desc(`Percentage of Upper Outliers`))

Now with a graph, to better show that they happen more frequently before the Christmas break.

df_outlier %>%
  mutate("Upper Outlier" = if_else(Price > `Price Upper Bound`, TRUE, FALSE)) %>%
  group_by(InvoiceDay = as.Date(InvoiceDate)) %>%
  summarise("Percentage of High Outliers" = mean(`Upper Outlier`)) %>%
  ggplot(aes(InvoiceDay, `Percentage of High Outliers`)) +
  geom_col() +
  scale_y_continuous(labels = scales::label_percent(), limits = c(0, 1)) +
  labs(x = NULL,
       y = NULL,
       title = "Percentages of High Ouliers Purchases along the Time Period")

But it is also possible that items are priced higher before Christmas,

df_outlier %>%
  mutate(Christmas = if_else(as.Date(InvoiceDate) < "2009-12-22", "Median Price Before Christmas", "Median Price After Christmas")) %>%
  group_by(StockCode, Description, Christmas) %>%
  summarise("Median Price" = median(Price), .groups = "drop") %>% 
  tidyr::pivot_wider(names_from = Christmas, values_from = `Median Price`) %>%
  mutate("More Expensive before Christmas?" = if_else(`Median Price Before Christmas` > `Median Price After Christmas`, "Yes", "No"), .after = Description,
         "More Expensive before Christmas?" = coalesce(`More Expensive before Christmas?`, "missing information"),
         across(where(is.numeric), ~ coalesce(as.character(.x), "never sold")))

and that is true for 794 stock codes out of 4691.

df_outlier %>%
  mutate(Christmas = if_else(as.Date(InvoiceDate) < "2009-12-22", "Median Price Before Christmas", "Median Price Following Christmas")) %>%
  group_by(StockCode, Description, Christmas) %>%
  summarise("Median Price" = median(Price), .groups = "drop") %>% 
  tidyr::pivot_wider(names_from = Christmas, values_from = `Median Price`) %>%
  mutate("More Expensive before Christmas?" = if_else(`Median Price Before Christmas` > `Median Price Following Christmas`, "Yes", "No"), .after = Description,
         "More Expensive before Christmas?" = coalesce(`More Expensive before Christmas?`, "missing information"),
         across(where(is.numeric), ~ coalesce(as.character(.x), "never sold"))) %>%
  count(`More Expensive before Christmas?`, name = "Number of Distinct Items")


This table shows how frequently each customer bought highly priced items,

df_outlier %>%
  mutate("Upper Outlier" = if_else(Price > `Price Upper Bound`, TRUE, FALSE)) %>%
  group_by(CustomerID) %>%
  summarise("Rounded Median Price" = round(median(Price), 0),
            "Number of Purchases" = n(),
            "Percentage of Upper Outliers" = formattable::percent(mean(`Upper Outlier`))) %>%
  arrange(desc(`Percentage of Upper Outliers`), desc(`Number of Purchases`))

while this one how many of them there are for every percentage.

df %>%
  mutate("Upper Outlier" = if_else(Price > `Price Upper Bound`, TRUE, FALSE)) %>%
  group_by(CustomerID) %>%
  summarise("Percentage of Upper Outliers" = formattable::percent(mean(`Upper Outlier`))) %>%
  count(`Percentage of Upper Outliers`, name = "Number of Customers", sort = TRUE) %>%
  mutate("Percentage of Customers" = formattable::percent(`Number of Customers` / sum(`Number of Customers`)))

In this graph we binned every aforementioned percentage into 100 buckets, to show the number of customer for each of them.

df_outlier %>%
  mutate("Upper Outlier" = if_else(Price > `Price Upper Bound`, TRUE, FALSE)) %>%
  group_by(CustomerID) %>%
  summarise("Percentage of Upper Outliers" = mean(`Upper Outlier`)) %>%
  ggplot(aes(`Percentage of Upper Outliers`)) +
  geom_histogram(bins = 100) +
  scale_x_continuous(labels = scales::label_percent()) +
  labs(x = NULL,
       y = NULL,
       title = "Number of Customers for Integer Percentages of Upper Outliers Purchases")


About the countries, we didn’t find any that mainly buys highly priced items.

df %>%
  mutate("Upper Outlier" = if_else(Price > `Price Upper Bound`, TRUE, FALSE)) %>%
  group_by(Country) %>%
  summarise("Number of Purchases" = n(),
            "Percentage of Upper Outliers" = formattable::percent(mean(`Upper Outlier`))) %>%
  arrange(desc(`Percentage of Upper Outliers`))


- lower outliers

for the lower outliers, as their bound is a negative number,

`Price Lower Bound` <- unname(quantile(df_outlier$Price, probs = 0.25) - 1.5 * (quantile(df_outlier$Price, probs = 0.75) - quantile(df_outlier$Price, probs = 0.25)))
kable(tibble(`Price Lower Bound`), align = "l")
Price Lower Bound
-3.19

we picked an arbitrary value of 1.25 (the first quartile).


Let’s proceed then as we did with the upper outliers, first looking at how many of these low priced purchases there are,

df_outlier %>%
  mutate("Low Priced" = if_else(Price < 1.25, TRUE, FALSE)) %>%
  summarise("Total of Purchases" = n(),
            "Low Priced Purchases" = sum(`Low Priced`),
            "Percentage of Low Priced Purchases" = formattable::percent(mean(`Low Priced`)))


and then at what items are most commonly sold at these prices.

df_outlier %>%
  mutate("Low Priced" = if_else(Price < 1.25, TRUE, FALSE)) %>%
  group_by(StockCode, Description) %>%
  summarise("Rounded Median Price" = round(median(Price), 2),
            "Number of Purchases" = n(),
            "Percentage of Low Priced Purchases" = formattable::percent(mean(`Low Priced`)), .groups = "drop") %>%
  arrange(desc(`Percentage of Low Priced Purchases`))

The distribution, like for the upper outliers, has two values that are much more common,

df_outlier %>%
  mutate("Low Priced" = if_else(Price < 1.25, TRUE, FALSE)) %>%
  group_by(StockCode, Description) %>%
  summarise("Percentage of Low Priced Purchases" = formattable::percent(mean(`Low Priced`)), .groups = "drop") %>%
  count(`Percentage of Low Priced Purchases`, name = "Number of Distinct Items", sort = TRUE) %>%
  mutate("Percentage of Distinct Items" = formattable::percent(`Number of Distinct Items` / sum(`Number of Distinct Items`)))

as it is more easily seen by the following graph, with the percentages binned into 100 values.

df_outlier %>%
  mutate("Low Priced" = if_else(Price < 1.25, TRUE, FALSE)) %>%
  group_by(StockCode, Description) %>%
  summarise("Percentage of Low Priced Purchases" = mean(`Low Priced`), .groups = "drop") %>%
  ggplot(aes(`Percentage of Low Priced Purchases`)) +
  geom_histogram(bins = 100) +
  scale_x_continuous(labels = scales::label_percent()) +
  labs(x = NULL,
       y = NULL,
       title = "Number of Distinct Items for Integer Percentages of Low Priced Purchases")

As before, their median quantity is different from the other items, this time being higher.

df_outlier %>%
  mutate("Low Priced" = if_else(Price < 1.25, TRUE, FALSE)) %>%
  group_by(StockCode, Description) %>%
  mutate("Percentage of Low Priced Purchases" = formattable::percent(mean(`Low Priced`)),
         "100% Low Priced Purchase" = if_else(`Percentage of Low Priced Purchases` == 1, TRUE, FALSE)) %>%
  ungroup() %>%
  count(`100% Low Priced Purchase`, wt = round(median(Quantity), 2), name = "Rounded Median Quantity")

For this subset we are examining, two values of Price are more common (0.4 and 0.8).

df_outlier %>%
  filter(Price < 1.25) %>%
  ggplot(aes(Price)) +
  geom_histogram(binwidth = 0.1) +
  scale_x_continuous(labels = scales::label_dollar(prefix = "£")) +
  labs(x = NULL,
       y = NULL,
       title = "Distribution of Prices for Low Priced Purchases")


On a temporal scale, it seems that low priced items are bought evenly during the year, without any particular spikes (reminder that our retailer caters to wholesalers).

df_outlier %>%
  mutate("Low Priced" = if_else(Price < 1.25, TRUE, FALSE)) %>%
  group_by(InvoiceDay = as.Date(InvoiceDate)) %>%
  summarise("Rounded Median Price" = round(median(Price), 2),
            "Number of Purchases" = n(),
            "Percentage of Low Priced Purchases" = formattable::percent(mean(`Low Priced`))) %>%
  arrange(desc(`Percentage of Low Priced Purchases`))
df_outlier %>%
  mutate("Low Priced" = if_else(Price < 1.25, TRUE, FALSE)) %>%
  group_by(InvoiceDay = as.Date(InvoiceDate)) %>%
  summarise("Percentage of Low Priced Purchases" = mean(`Low Priced`)) %>%
  ggplot(aes(InvoiceDay, `Percentage of Low Priced Purchases`)) +
  geom_col() +
  scale_y_continuous(labels = scales::label_percent(), limits = c(0, 1)) +
  labs(x = NULL,
       y = NULL,
       title = "Percentages of Low Priced Purchases along the Time Period")


About the customers, we can here identify the ones that buy low priced items more often,

df_outlier %>%
  mutate("Low Priced" = if_else(Price < 1.25, TRUE, FALSE)) %>%
  group_by(CustomerID) %>%
  summarise("Rounded Median Price" = round(median(Price), 2),
            "Number of Purchases" = n(),
            "Percentage of Low Priced Purchases" = formattable::percent(mean(`Low Priced`))) %>%
  arrange(desc(`Percentage of Low Priced Purchases`))

and hereafter show how many of them there are for every percentage of low priced purchases,

df_outlier %>%
  mutate("Low Priced" = if_else(Price < 1.25, TRUE, FALSE)) %>%
  group_by(CustomerID) %>%
  summarise("Percentage of Low Priced Purchases" = formattable::percent(mean(`Low Priced`))) %>%
  count(`Percentage of Low Priced Purchases`, name = "Number of Customers", sort = TRUE)

with the usual graph of binned percentages.

df_outlier %>%
  mutate("Low Priced" = if_else(Price < 1.25, TRUE, FALSE)) %>%
  group_by(CustomerID) %>%
  summarise("Percentage of Low Priced Purchases" = mean(`Low Priced`)) %>%
  ggplot(aes(`Percentage of Low Priced Purchases`)) +
  geom_histogram(bins = 100) +
  scale_x_continuous(labels = scales::label_percent()) +
  labs(x = NULL,
       y = NULL,
       title = "Number of Customers for Integer Percentages of Low Priced Purchases")


Focusing on the countries, we don’t notice any that buys low priced items the majority of times.

df_outlier %>%
  mutate("Low Priced" = if_else(Price < 1.25, TRUE, FALSE)) %>%
  group_by(Country) %>%
  summarise("Number of Purchases" = n(),
            "Percentage of Low Priced Purchases" = formattable::percent(mean(`Low Priced`))) %>%
  arrange(desc(`Percentage of Low Priced Purchases`))


- main takeaways

FLAG OF ST GEORGE CAR FLAG is an item with a strange price pattern, about the rest, as we did with the Quantity column, we built an understanding of the most common occurrences for both the highly and low priced purchases, returning many tables that can be used to direct business decisions, but we didn’t discover any additional very peculiar instances worth investigating further.