So after invoices and items, we will here investigate the customers,
in the same vein as we previously did. And, as in the previous analyses,
we will not remove duplicated
rows, to not lose information about Quantity
.
- addressing the missing values
As discussed in a previous
document the CustomerID
column suffers from a large
number of missing values, roughly the 20%
.
df %>%
summarise("Number of Distinct Customers" = n_distinct(CustomerID[!is.na(CustomerID)]),
"Number of Missing Values" = sum(is.na(CustomerID)),
"Percentage of Missing Values" = formattable::percent(mean(is.na(CustomerID))))
That 19.89%
of missing values could pertain to a single
customer (the most extreme case) but more likely to many different ones
as they are located in 13
different countries.
df %>%
filter(is.na(CustomerID)) %>%
count(Country, name = "Number of Missing Values", sort = TRUE)
As we have no way to distinguish between them, for the scope of this
analysis we will remove them altogether.
We must address that, removing those rows, we will also remove their
information placed on other columns and that are only present in
conjunction with a missing value in the CustomerID
column,
modifying the general characteristics of the data frame.
For example the number of countries will decrease to 37
from 40
(we will lose Bermuda
,
Hong Kong
and Lebanon
) and the number of stock
codes of roughly 200
items.
tibble(CustomerID = "Original Data Frame",
"Number of Countries" = n_distinct(df$Country),
"Number of Stock Codes" = n_distinct(df$StockCode)) %>%
bind_rows(df %>%
mutate(CustomerID = if_else(!is.na(CustomerID), "Removing NAs", "Keeping Only NAs")) %>%
group_by(CustomerID) %>%
summarise("Number of Countries" = n_distinct(Country),
"Number of Stock Codes" = n_distinct(StockCode)) %>%
arrange(desc(CustomerID)))
df <- df %>%
filter(!is.na(CustomerID))
- breakdown by country
We can start by identifying where our customers are located.
df %>%
count(Country, wt = n_distinct(CustomerID), sort = TRUE, name = "Number of Customers") %>%
mutate("In Percentage" = formattable::percent(`Number of Customers` / sum(`Number of Customers`)))
specifying that 4
of them changed location, as we’ve
already seen in a previous document.
We notice that their CustomerID
s are curiously close to
each others.
df %>%
count(CustomerID, Country, name = "Number of Occurrences") %>%
group_by(CustomerID) %>%
filter(n() > 1) %>%
ungroup()
Given the international nature of our UK
based business,
maybe we want to know how many clients are located in
Europe
or not.
EU <- c("Austria", "Belgium", "Channel Islands", "Cyprus", "Denmark", "EIRE", "Finland", "France", "Germany", "Greece", "Iceland", "Italy", "Lithuania", "Malta", "Netherlands", "Norway", "Poland", "Portugal", "Spain", "Sweden", "Switzerland", "United Kingdom")
df %>%
distinct(Country, CustomerID) %>%
mutate("In EU" = case_when(is.na(Country) ~ NA,
!Country %in% EU ~ "No",
Country %in% EU ~ "Yes")) %>%
count(`In EU`, name = "Number of Customers", sort = TRUE) %>%
mutate("In Percentage" = formattable::percent(`Number of Customers` / sum(`Number of Customers`)))
Our clients are mostly located in Europe
, but where in
Europe
?
df %>%
distinct(Country, CustomerID) %>%
mutate("In EU" = case_when(is.na(Country) ~ NA,
!Country %in% EU ~ "No",
Country %in% EU ~ "Yes")) %>%
filter(`In EU` == "Yes") %>%
count(Country, name = "Number of Customers", sort = TRUE) %>%
mutate("In Percentage" = formattable::percent(`Number of Customers` / sum(`Number of Customers`)))
And what about the rest of the world?
df %>%
distinct(Country, CustomerID) %>%
mutate("In EU" = case_when(is.na(Country) ~ NA,
!Country %in% EU ~ "No",
Country %in% EU ~ "Yes")) %>%
filter(`In EU` == "No") %>%
count(Country, name = "Number of Customers", sort = TRUE) %>%
mutate("In Percentage" = formattable::percent(`Number of Customers` / sum(`Number of Customers`)))
- basic breakdown
Let’s now provide some basics information, first the number of total
invoices for every customer, without differentiating between confirmed
and cancelled one, adding as well the number of distinct items for each
of them, to assess how diverse are their purchases.
df %>%
group_by(CustomerID) %>%
summarise("Number of Invoices" = n_distinct(Invoice),
"Number of Distinct Items Invoiced" = n_distinct(StockCode)) %>%
arrange(desc(`Number of Invoices`))
Then the median quantity and revenues per invoice, again ordered by
the largest but here concentrating only on the confirmed ones.
df %>%
filter(!str_starts(Invoice, "C")) %>%
group_by(CustomerID, Invoice) %>%
summarise("Total Quantity per Invoice" = sum(Quantity),
"Total Revenue per Invoice" = sum(Quantity * Price), .groups = "drop_last") %>%
summarise("Rounded Median Quantity per Invoice" = round(median(`Total Quantity per Invoice`)),
"Rounded Median Revenues per Invoice" = round(median(`Total Revenue per Invoice`), 2)) %>%
arrange(desc(`Rounded Median Quantity per Invoice`))
- cancelling customers
The phenomenon of cancelling invoices is strictly related to the
customers, as they are the ones responsible for that, so it is important
to study it in regards to them. Let’s start with a breakdown
differentiating by status and ordered by the clients that invoiced the
most.
df %>%
mutate("Invoice Status" = if_else(str_starts(Invoice, "C"), "Cancelled", "Confirmed")) %>%
count(CustomerID, `Invoice Status`, wt = n_distinct(Invoice), name = "Number of Occurrences") %>%
group_by(CustomerID) %>%
mutate(Percentage = formattable::percent(`Number of Occurrences` / sum(`Number of Occurrences`)),
"Total Number Invoices" = sum(`Number of Occurrences`)) %>%
ungroup() %>%
arrange(desc(`Total Number Invoices`), CustomerID, desc(`Invoice Status`))
Then we can isolate which ones never cancelled an order
(2626
out of 4314
),
df %>%
mutate("Invoice Status" = if_else(str_starts(Invoice, "C"), "Cancelled", "Confirmed")) %>%
count(CustomerID, `Invoice Status`, wt = n_distinct(Invoice), name = "Number of Occurrences") %>%
group_by(CustomerID) %>%
mutate(Percentage = formattable::percent(`Number of Occurrences` / sum(`Number of Occurrences`))) %>%
ungroup() %>%
arrange(desc(`Number of Occurrences`)) %>%
filter(`Invoice Status` == "Confirmed" &
Percentage == 1)
the ones that always cancel orders (29
out of
4314
)
df %>%
mutate("Invoice Status" = if_else(str_starts(Invoice, "C"), "Cancelled", "Confirmed")) %>%
count(CustomerID, `Invoice Status`, wt = n_distinct(Invoice), name = "Number of Occurrences") %>%
group_by(CustomerID) %>%
mutate(Percentage = formattable::percent(`Number of Occurrences` / sum(`Number of Occurrences`))) %>%
ungroup() %>%
arrange(desc(`Number of Occurrences`)) %>%
filter(`Invoice Status` == "Cancelled" &
Percentage == 1)
and the third group, that orders and cancels both (1659
out of 4314
), that can be filtered further by determining a
threshold for the rate of cancellation for example.
df %>%
mutate("Invoice Status" = if_else(str_starts(Invoice, "C"), "Cancelled", "Confirmed")) %>%
count(CustomerID, `Invoice Status`, wt = n_distinct(Invoice), name = "Number of Occurrences") %>%
group_by(CustomerID) %>%
mutate(Percentage = formattable::percent(`Number of Occurrences` / sum(`Number of Occurrences`)),
"Total Number Invoices" = sum(`Number of Occurrences`)) %>%
ungroup() %>%
arrange(desc(`Total Number Invoices`), CustomerID, desc(`Invoice Status`)) %>%
filter(Percentage > 0 &
Percentage < 1)
- ranking the customers
Let’s continue with some more interesting information about their
spending capabilities (excluding the cancelled invoices from these
tables), like the clients that purchased the most in term of
quantity
df %>%
filter(!str_starts(Invoice, "C")) %>%
group_by(CustomerID) %>%
summarise("Number of Invoices" = n_distinct(Invoice),
"Total Quantity Purchased" = sum(Quantity)) %>%
arrange(desc(`Total Quantity Purchased`))
and their total expenditure in British pounds, as both of these
information can be useful to segment the clientele.
df %>%
filter(!str_starts(Invoice, "C")) %>%
mutate(Expense = Quantity * Price) %>%
group_by(CustomerID) %>%
summarise("Number of Invoices" = n_distinct(Invoice),
"Total of Expenses" = round(sum(Expense), 2)) %>%
arrange(desc(`Total of Expenses`))
We can merge the two former information into just one table, where we
added a ranking for each metric and a total ranking (based on the
average of the two, meaning we give them equal importance) by which we
ordered the results. We notice than that we have recurring customers in
both top10
.
df %>%
filter(!str_starts(Invoice, "C")) %>%
mutate(Expense = Quantity * Price) %>%
group_by(CustomerID) %>%
summarise("Number of Invoices" = n_distinct(Invoice),
"Total Quantity Purchased" = sum(Quantity),
"Total of Expenses" = round(sum(Expense), 2)) %>%
mutate("Quantity Rank" = dense_rank(desc(`Total Quantity Purchased`)),
"Expenses Rank" = dense_rank(desc(`Total of Expenses`)),
"Total Rank" = dense_rank((`Quantity Rank` + `Expenses Rank`) / 2)) %>%
arrange(`Total Rank`) %>%
relocate(`Quantity Rank`, .after = `Total Quantity Purchased`)
If we want to consider cancelled invoices as well, we could build a
table with only those
df %>%
filter(str_starts(Invoice, "C")) %>%
mutate(Expense = abs(Quantity) * Price) %>%
group_by(CustomerID) %>%
summarise("Number of Cancelled Invoices" = n_distinct(Invoice),
"Total Quantity Cancelled" = sum(abs(Quantity)),
"Total of Lost Expenses" = round(sum(Expense), 2)) %>%
mutate("Quantity Rank" = dense_rank(desc(`Total Quantity Cancelled`)),
"Expenses Rank" = dense_rank(desc(`Total of Lost Expenses`)),
"Total Rank" = dense_rank((`Quantity Rank` + `Expenses Rank`) / 2)) %>%
arrange(`Total Rank`) %>%
relocate(`Quantity Rank`, .after = `Total Quantity Cancelled`)
or we could subtract the cancelled values from the confirmed
ones,
df %>%
mutate(Expense = Quantity * Price) %>%
group_by(CustomerID) %>%
summarise("Number of Invoices" = n_distinct(Invoice),
"Number of Cancelled Invoices" = n_distinct(Invoice[Quantity < 0]),
"Total Quantity Purchased" = sum(Quantity[Quantity > 0]),
"Total Quantity Cancelled" = abs(sum(Quantity[Quantity < 0])),
"Total of Expenses" = round(sum(Expense[Quantity > 0]), 2),
"Total of Lost Expenses" = round(abs(sum(Expense[Quantity < 0])), 2)) %>%
arrange(desc(`Number of Invoices`))
to show the net values.
df %>%
mutate(Expense = Quantity * Price) %>%
group_by(CustomerID) %>%
summarise("Number of Invoices" = n_distinct(Invoice),
"Number of Cancelled Invoices" = n_distinct(Invoice[Quantity < 0]),
"Total Quantity Purchased" = sum(Quantity[Quantity > 0]),
"Total Quantity Cancelled" = abs(sum(Quantity[Quantity < 0])),
"Total of Expenses" = sum(Expense[Quantity > 0]),
"Total of Lost Expenses" = abs(sum(Expense[Quantity < 0]))) %>%
mutate(CustomerID = CustomerID,
"Number of Invoices" = `Number of Invoices`,
"Number of Cancelled Invoices" = `Number of Cancelled Invoices`,
"Total Net Quantity" = `Total Quantity Purchased` - `Total Quantity Cancelled`,
"Quantity Rank" = dense_rank(desc(`Total Net Quantity`)),
"Total Net Expenses" = round(`Total of Expenses` - `Total of Lost Expenses`, 2),
"Expenses Rank" = dense_rank(desc(`Total Net Expenses`)),
"Total Rank" = dense_rank((`Quantity Rank` + `Expenses Rank`) / 2), .keep = "none") %>%
arrange(`Total Rank`)
- monthly expenses and invoices
We can then combine how much a client spends and how frequently by
constructing a table that shows, for each client, the monthly expenses
together with some summary metrics like their total, the percentage over
the global, the monthly average and the number of months without a
purchase.
df %>%
filter(!str_starts(Invoice, "C")) %>%
mutate(Expense = Quantity * Price,
Month = format(InvoiceDate, "%y/%m")) %>%
group_by(CustomerID, Month) %>%
summarise("Total Expenses" = sum(Expense), .groups = "drop") %>%
tidyr::pivot_wider(names_from = Month, values_from = `Total Expenses`, names_sort = TRUE) %>%
mutate("Total Expenses" = round(rowSums(across(where(is.numeric)), na.rm = TRUE), 2),
"Percentage over Global" = formattable::percent(`Total Expenses` / sum(`Total Expenses`)),
"Rounded Monthly Average" = round(rowMeans(across(where(is.numeric)), na.rm = TRUE), 2),
"Number of Missing Months" = rowSums(is.na(pick(everything()))), .after = "CustomerID") %>%
arrange(desc(`Total Expenses`))
We can also build a similar table also for the number of
invoices.
df %>%
filter(!str_starts(Invoice, "C")) %>%
mutate(Month = format(InvoiceDate, "%y/%m")) %>%
group_by(CustomerID, Month) %>%
summarise("Number of Invoices" = n_distinct(Invoice), .groups = "drop") %>%
tidyr::pivot_wider(names_from = Month, values_from = `Number of Invoices`, names_sort = TRUE) %>%
mutate("Number of Invoices" = rowSums(across(where(is.numeric)), na.rm = TRUE),
"Rounded Monthly Average" = round(rowMeans(across(where(is.numeric)), na.rm = TRUE), 2),
"Number of Missing Months" = rowSums(is.na(pick(everything()))), .after = "CustomerID") %>%
arrange(desc(`Number of Invoices`))
- main takeaways
In this document customers are analyzed and ranked following
different criteria; we singled out the ones that cancels the most, where
most of them are located and the best performing ones in regard to,
among other metrics, number of items purchased or total revenues
generated.