Cancellations of invoices is a practice that we want to limit, as it
can impact revenues and can produce operational difficulties.
In this document we will try to understand when a cancellation takes
place, using a machine learning technique called decision tree.
- pre-validation of the analysis
As previously
seen, the cancellation rate for our cleaned data frame is
16.64%
, meaning that almost 1
out of
5
invoices is being cancelled.
df %>%
mutate(Status = if_else(str_starts(Invoice, "C"), "Cancelled", "Confirmed")) %>%
group_by(Status) %>%
summarize("Number of Distinct Invoices" = n_distinct(Invoice)) %>%
mutate(Percentage = formattable::percent(`Number of Distinct Invoices` / sum(`Number of Distinct Invoices`))) %>%
arrange(desc(`Number of Distinct Invoices`))
From a sales and operations point of view, it being a big or small
number depends on various internal factors, but for the sake of this
analysis it can cause some problems as the number of cancellations can
be so small, compared to the confirmations, that the algorithm is not
able to “learn” their characteristics, thus providing erroneous
results.
To assess that we can use some graphical methods, but first let’s
present the data frame we will feed into the algorithm:
(treedf <- df %>%
group_by(Invoice) %>%
summarise("Number of Distinct Stock Codes" = n_distinct(StockCode),
"Median Quantity" = median(abs(Quantity)),
"Median Price" = median(Price),
"Total Quantity" = sum(abs(Quantity)),
"Total Revenues" = sum(abs(Quantity) * Price)) %>%
left_join(df %>%
mutate(Cancellation = factor(if_else(str_starts(Invoice, "C"), TRUE, FALSE))) %>%
distinct(Invoice, Cancellation), by = "Invoice"))
So, for every Invoice
, we have quantitative information
plus a binary column that deemed if it has been cancelled or not.
From this next series of graphics we can assess on whether the
metrics we chose are appropriate: a metric is appropriate when the
distribution of values between the statuses
Cancellation FALSE
and Cancellation TRUE
is
distinct, in this way the algorithm can use it to distinguish the
two.
library(sm)
"Number of Distinct Stock Codes" <- treedf$`Number of Distinct Stock Codes`
sm.density.compare(`Number of Distinct Stock Codes`, treedf$Cancellation, col = c("black", "darkgrey"))
legend("topright", levels(treedf$Cancellation), lty = c(1, 2), col = c("black", "darkgrey"))
"Median Quantity" <- treedf$`Median Quantity`
sm.density.compare(`Median Quantity`, treedf$Cancellation, col = c("black", "darkgrey"))
legend("topright", levels(treedf$Cancellation), lty = c(1, 2), col = c("black", "darkgrey"))
"Median Price" <- treedf$`Median Price`
sm.density.compare(`Median Price`, treedf$Cancellation, col = c("black", "darkgrey"))
legend("topright", levels(treedf$Cancellation), lty = c(1, 2), col = c("black", "darkgrey"))
"Total Quantity" <- treedf$`Total Quantity`
sm.density.compare(`Total Quantity`, treedf$Cancellation, col = c("black", "darkgrey"))
legend("topright", levels(treedf$Cancellation), lty = c(1, 2), col = c("black", "darkgrey"))
"Total Revenues" <- treedf$`Total Revenues`
sm.density.compare(`Total Revenues`, treedf$Cancellation, col = c("black", "darkgrey"))
legend("topright", levels(treedf$Cancellation), lty = c(1, 2), col = c("black", "darkgrey"))
From the metrics we chose, it seems that only
Number of Distinct Items
and Total Revenues
provide two sufficiently distinct curves.
- running the model
After those preliminary considerations, let’s run the model.
library(rpart)
model<- rpart(Cancellation ~ ., data = treedf[-1])
printcp(model)
Classification tree:
rpart(formula = Cancellation ~ ., data = treedf[-1])
Variables actually used in tree construction:
[1] Number of Distinct Stock Codes Total Quantity
Root node error: 4107/24680 = 0.16641
n= 24680
CP nsplit rel error xerror xstd
1 0.371317 0 1.00000 1.00000 0.014247
2 0.036766 1 0.62868 0.63672 0.011773
3 0.010957 2 0.59192 0.59484 0.011424
4 0.010000 3 0.58096 0.59021 0.011384
The model returned two metrics that can influence the cancellations:
Number of Distinct Stock Codes
and
Total Quantity
.
We can investigate how these metrics affect cancellations in the
following table: for example 70%
of the invoices with a
Total Quantity
lower than 10
and a
Number of Distinct Stock Codes
less than 6
have been cancelled.
library(rpart.plot)
rpart.rules(model, style = "tall")
Cancellation is 0.04 when
Total Quantity >= 16
Cancellation is 0.29 when
Total Quantity < 10
Number of Distinct Stock Codes >= 6
Cancellation is 0.42 when
Total Quantity is 10 to 16
Cancellation is 0.70 when
Total Quantity < 10
Number of Distinct Stock Codes < 6
One of the perks of this method is the possibility to plot the
decision tree itself,
rpart.plot(model, box.palette = 0)
and that really helps in communicating the results.
In the plotted tree, its every node (meaning where the separations
happen) and leaf (the terminal nodes, with no separations) contain
information about:
- the majority of statuses present (
FALSE
for the first
node, meaning the whole data set)
- the percentage of
Cancellation TRUE
(0.17
,
rounded, for the first node, as noted at the beginning of this
document)
- the percentage of invoices present in the node (thus
100%
for the first one).
So the algorithm didn’t find perfect separations between the two
cases as there are, for example in the leftmost leaf, 4%
of
invoices still being cancelled despite having a
Total Quantity
higher than 16
.
- main takeaways and further developments
This basic demonstration of a decision tree yielded the results that
small invoices are more likely to be cancelled.
To progress this analysis we can start by modifying the parameters of
the algorithm from the standard ones, then changing the metrics (for
example by adding chronological and/or geographical ones) or running
more models on the different subsets of customers
and countries
we discovered in previous documents.
When we are satisfied with the model, we can test its reliability by
applying it on future invoices, to estimate their probability of being
cancelled.