After having manipulated our data frame in the previous section of
the site, removing rows that are not pertinent transactions and
exploring missing values, duplication and limit cases, we will here
start to interrogate the data, using standard data analysis techniques,
with the goal to extract information that can be used for business
purposes.
- focus of this document
In this document we will investigate the invoices, with a focus on
the phenomenon of the cancelled ones. Cancelled invoices can be very
concerning as they represent lost sales so we will try to understand how
widespread they are and to quantify the lost revenues.
We decided not to remove duplicated
rows, as we prefer not to lose information about quantity even if
that might inflate the number of purchases.
- basic breakdown
We can begin by reminding their basic figures: how many there are and
how many of them have been cancelled.
df %>%
summarise("Total Number of Invoices" = n_distinct(Invoice),
"Number of Cancelled Invoices" = n_distinct(Invoice[str_starts(Invoice, "C")]),
Percentage = formattable::percent(`Number of Cancelled Invoices` / `Total Number of Invoices`))
We can then continue by showing the number of invoices per day,
differentiating by status.
df %>%
mutate(Status = if_else(str_detect(Invoice, "C"), "Cancelled", "Confirmed"),
InvoiceDay = as.Date(InvoiceDate)) %>%
group_by(InvoiceDay, Status) %>%
summarise("Number of Invoices per Day" = n_distinct(Invoice), .groups = "drop") %>%
tidyr::pivot_wider(names_from = Status, values_from = `Number of Invoices per Day`, values_fill = 0) %>%
relocate(Confirmed, .before = Cancelled)
library(ggplot2)
df %>%
mutate(Status = forcats::fct_rev(if_else(str_detect(Invoice, "C"), "Cancelled", "Confirmed")),
InvoiceDay = as.Date(InvoiceDate)) %>%
group_by(InvoiceDay, Status) %>%
summarise("Number of Invoices per Day" = n_distinct(Invoice), .groups = "drop") %>%
ggplot(aes(InvoiceDay, `Number of Invoices per Day`, linetype = Status)) +
geom_line() +
labs(x = NULL,
y = NULL,
title = "Number of Invoices per Day, differentiating by Status") +
theme(legend.position = "bottom",
legend.title = element_blank())
From the graph it seems like the cancelled invoices are staying
constant while the confirmed ones are on an upward trend.
We complete this section with the distribution of the number of
invoices per day, both with a table and with a graph, to show the
difference in magnitude between them.
df %>%
mutate(Status = if_else(str_detect(Invoice, "C"), "Cancelled", "Confirmed"),
InvoiceDay = as.Date(InvoiceDate)) %>%
group_by(InvoiceDay, Status) %>%
summarise("Number of Invoices per Day" = n_distinct(Invoice), .groups = "drop") %>%
tidyr::pivot_wider(names_from = "Status", values_from = "Number of Invoices per Day", values_fill = 0) %>%
reframe(across(where(is.numeric), ~ summary(.x))) %>%
mutate(Statistic = c("Min." , "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")) %>%
relocate(Statistic, Confirmed, .before = Cancelled)
df %>%
mutate(Status = forcats::fct_rev(if_else(str_detect(Invoice, "C"), "Cancelled", "Confirmed")),
InvoiceDay = as.Date(InvoiceDate)) %>%
group_by(InvoiceDay, Status) %>%
summarise("Number of Invoices per Day" = n_distinct(Invoice), .groups = "drop") %>%
ggplot(aes(`Number of Invoices per Day`, linetype = Status)) +
geom_density() +
labs(x = NULL,
y = NULL,
title = "Probability Distributions of Invoices per Day, differentiating by Status") +
theme(legend.position = "bottom",
legend.title = element_blank())
- number of distinct items
Then we want to investigate the differences in homogeneity and we can
see that cancelled invoices generally contain less distinct items.
df %>%
mutate(Status = if_else(str_detect(Invoice, "C"), "Cancelled", "Confirmed")) %>%
group_by(Invoice, Status) %>%
summarise("Number of Distinct Items" = n_distinct(StockCode), .groups = "drop") %>%
tidyr::pivot_wider(names_from = Status, values_from = `Number of Distinct Items`) %>%
#we didn't substitute NAs with O with `values_fill = 0` as it would incorrectly modify the summary table, adding values that don't exist
reframe(across(where(is.numeric), ~ summary(.x))) %>%
slice(-n()) %>%
#to remove the last line from the summary table, the one with NAs
mutate(Statistic = c("Min." , "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")) %>%
relocate(Statistic, Confirmed, .before = Cancelled)
df %>%
mutate(Status = forcats::fct_rev(if_else(str_detect(Invoice, "C"), "Cancelled", "Confirmed"))) %>%
group_by(Invoice, Status) %>%
summarise("Number of Distinct Items" = n_distinct(StockCode), .groups = "drop") %>%
ggplot(aes(Status, `Number of Distinct Items`)) +
geom_boxplot() +
scale_y_log10() +
labs(x = NULL,
y = NULL,
title = "Distributions of the Number of Distinct Items per Invoice, differentiating by Status, \n Logarithmic Scale")
- differences in quantity
We proceed analogously for the Quantity
column,
calculating the total quantities for each invoice, in absolute values
(as cancelled invoices have negative ones).
df %>%
mutate(Status = if_else(str_detect(Invoice, "C"), "Cancelled", "Confirmed")) %>%
group_by(Invoice, Status) %>%
summarise("Total Quantity" = sum(abs(Quantity)), .groups = "drop") %>%
tidyr::pivot_wider(names_from = "Status", values_from = "Total Quantity") %>%
#we didn't substitute NAs with O with `values_fill = 0` as it would incorrectly modify the summary table, adding values that don't exist
reframe(across(where(is.numeric), ~ summary(.x))) %>%
slice(-n()) %>%
#to remove the last line from the summary table, the one with NAs
mutate(Statistic = c("Min." , "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")) %>%
relocate(Statistic, Confirmed, .before = Cancelled)
df %>%
mutate(Status = forcats::fct_rev(if_else(str_detect(Invoice, "C"), "Cancelled", "Confirmed"))) %>%
group_by(Invoice, Status) %>%
summarise("Total Quantity" = sum(abs(Quantity)), .groups = "drop") %>%
ggplot(aes(Status, `Total Quantity`)) +
geom_boxplot() +
scale_y_log10(labels = scales::label_number()) +
labs(x = NULL,
y = NULL,
title = "Distributions of Quantity per Invoice, differentiating by Status, \n Logarithmic Scale")
And we can see that cancelled invoices are, again, generally
smaller.
The same Max
. value for both statuses piqued our
interest and we found out that three invoices share it.
df %>%
filter(Invoice %in% c("C524235", "518505", "524174")) %>%
count(Invoice, wt = abs(Quantity), name = "Total Quantity")
It is not by chance that it happened, as all three contain the same
items in the same quantities and two of them are made by the same
customer, very curious.
df %>%
filter(Invoice %in% c("C524235", "518505", "524174")) %>%
arrange(StockCode)
- differences in revenues
Besides being comprised of less distinct items and in smaller
quantities, cancelled invoices are of smaller absolute monetary value as
well.
df %>%
mutate(Status = if_else(str_detect(Invoice, "C"), "Cancelled", "Confirmed")) %>%
group_by(Invoice, Status) %>%
summarise("Total Value" = sum(abs(Quantity * Price)), .groups = "drop") %>%
tidyr::pivot_wider(names_from = "Status", values_from = "Total Value") %>%
#we didn't substitute NAs with O with `values_fill = 0` as it would incorrectly modify the summary table, adding values that don't exist
reframe(across(where(is.numeric), ~ summary(.x))) %>%
slice(-n()) %>%
#to remove the last line from the summary table, the one with NAs
mutate(Statistic = c("Min." , "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")) %>%
relocate(Statistic, Confirmed, .before = Cancelled)
df %>%
mutate(Status = forcats::fct_rev(if_else(str_detect(Invoice, "C"), "Cancelled", "Confirmed"))) %>%
group_by(Invoice, Status) %>%
summarise("Total Value" = sum(abs(Quantity * Price)), .groups = "drop") %>%
ggplot(aes(Status, `Total Value`)) +
geom_boxplot() +
scale_y_log10(labels = scales::label_number()) +
labs(x = NULL,
y = NULL,
title = "Distributions of Revenues per Invoice, differentiating by Status, \n Logarithmic Scale")
The total loss in revenues can be quantified to
247901.4 £
, 2.46%
over the hypothetical
revenues those invoices had not been cancelled.
df %>%
summarise("Total Revenues" = sum(abs(Quantity) * Price),
"Lost Revenues" = sum(abs(Quantity[str_starts(Invoice, "C")]) * Price[str_starts(Invoice, "C")]),
Percentage = formattable::percent(`Lost Revenues` / `Total Revenues`))
- country breakdown
Let’s look now at the number of invoices per country, where we see
that more than 9
invoices out of 10
pertain to
United Kingdom
.
df %>%
count(Country, wt = n_distinct(Invoice), sort = TRUE, name = "Number of Invoices") %>%
mutate(Percentage = formattable::percent(`Number of Invoices` / sum(`Number of Invoices`)))
df %>%
count(Country, wt = n_distinct(Invoice)) %>%
ggplot(aes(reorder(Country, n), n)) +
geom_col() +
coord_flip() +
labs(x = NULL,
y = NULL,
title = "Number of Invoices per Country")
It is more interesting to see if some countries cancel invoices more
often than others,
df %>%
mutate(Status = if_else(str_detect(Invoice, "C"), "Cancelled", "Confirmed")) %>%
count(Country, Status, wt = n_distinct(Invoice), name = "Number of Invoices") %>%
group_by(Country) %>%
mutate(Percentage = formattable::percent(`Number of Invoices` / sum(`Number of Invoices`)),
Arranging_Column = sum(`Number of Invoices`)) %>%
#`Arranging_Column` is used to have the Countries with the highest total number of invoices on top
ungroup() %>%
arrange(desc(Arranging_Column), desc(Status)) %>%
select(-Arranging_Column)
and, among the 26
countries with cancellations, we can
spot Nigeria
and Japan
with fairly high rates,
but we have to consider that these two countries don’t have a elevate
number of invoices (just 2
for the former and
30
for the latter).
df %>%
mutate(Status = if_else(str_detect(Invoice, "C"), "Cancelled", "Confirmed")) %>%
count(Country, Status, wt = n_distinct(Invoice), name = "Number of Occurrences") %>%
group_by(Country) %>%
mutate("Number of Invoices" = sum(`Number of Occurrences`),
Percentage = formattable::percent(`Number of Occurrences` / sum(`Number of Occurrences`))) %>%
ungroup() %>%
filter(Status == "Cancelled") %>%
arrange(desc(Percentage)) %>%
select(Country, `Number of Invoices`, "Number of Cancellations" = `Number of Occurrences`, Percentage)
This graph reflects that, as the y axis is ordered, after
NA
, by the number of invoices.
df %>%
mutate(Status = forcats::fct_rev(if_else(str_detect(Invoice, "C"), "Cancelled", "Confirmed"))) %>%
count(Country, Status, wt = n_distinct(Invoice)) %>%
group_by(Country) %>%
mutate(perc = formattable::percent(n / sum(n))) %>%
ungroup() %>%
ggplot(aes(reorder(Country, n), perc, fill = Status)) +
geom_col() +
scale_y_continuous(labels = scales::label_percent()) +
coord_flip() +
scale_fill_grey() +
labs(x = NULL,
y = NULL,
title = "Cancellation Rates per Country") +
theme(legend.position = "bottom",
legend.title = element_blank())
- main takeaways
- Cancelled invoices, being
16.64%
of the total, are a
not negligible phenomenon
- Their daily occurrence stayed constant during last year while
confirmed ones are on a rising trend
- Compared to the confirmed ones, they are generally smaller in terms
of distinct items, total quantity and revenue
- The loss in revenue can be quantified to
247901.4 £
,
the 2.46%
- The market is mainly based in the
United Kingdom
, among
other countries Nigeria
and Japan
have high
cancellation rate, but are not very active customers