Market Basket Analysis allows us to find groups of items that are
frequently bought together, an useful information for a variety of
reasons, both commercially and operationally.
- data preparation and application of the algorithm
Given that its algorithmic implementation is very resource-hungry, we
will here concentrate on non duplicated purchases made in
Belgium
, considering both confirmed and cancelled
invoices,
df %>%
filter(Country == "Belgium") %>%
distinct()
having access then to 56
of them for 479
distinct stock codes.
df %>%
filter(Country == "Belgium") %>%
distinct() %>%
summarise("Number of Invoices" = n_distinct(Invoice),
"Number of Distinct Stock Codes" = n_distinct(StockCode))
In order to feed it into the algorithm, we need to change the data
frame into a transaction format, where we have a row for every invoice
and as many additional columns as the number of distinct stock codes.
The values in the cell are logical values that specify on whether the
invoice contains (TRUE
) or doesn’t contain
(FALSE
) that particular stock code.
(dftr <- df %>%
filter(Country == "Belgium") %>%
distinct() %>%
count(Invoice, StockCode) %>%
mutate(n = TRUE) %>%
tidyr::pivot_wider(id_cols = Invoice, names_from = StockCode, values_from = n, values_fill = FALSE))
library(arules)
dftr <- dftr %>%
select(-Invoice)
tr <- as(dftr, "transactions")
#we transform the data frame into a "transactions" class
rules <- apriori(tr, conf = 0, minlen = 2)
#conf = 0 to not filter out any rules, minlen = 2 to not have rules with one side empty
- rules’ filtering
After its application, the algorithm returns the following
21
“rules”, that must be read as
“if StockCode1
is being bought, then
(=>
) there is a chance that StockCode2
will
be bought as well (in the same invoice)”
tibble(lhs = labels(lhs(rules)),
rhs = labels(rhs(rules)),
rules@quality) %>%
mutate(" " = "=>", .after = "lhs") %>%
rename(StockCode1 = lhs, StockCode2 = rhs)
This “chance” is expressed by a number of metrics
(support
, confidence
, coverage
and lift
), plus we also have the number of times that
particular association of stock codes exists, count
. Notice
how we have many one-to-one associations (22551 => 22554
and then 22554 => 22551
) but for some of them the values
in the metrics are different (like for 22630 => 22629
/
22629 => 22630
in confidence
and
coverage
).
The metrics are what we use to filter the rules, and we will here
explain their meanings, using the first one
(22551 => 22554
)
tibble(lhs = labels(lhs(rules)),
rhs = labels(rhs(rules)),
rules@quality) %>%
mutate(" " = "=>", .after = "lhs") %>%
rename(StockCode1 = lhs, StockCode2 = rhs) %>%
slice(1)
as an example:
support
is the percentage of times all stock codes of
the rule occur together in the same invoice over the totality of
invoices, so we get 0.1785714
which is 10
/
56
df %>%
filter(Country == "Belgium") %>%
distinct() %>%
count(Invoice, StockCode) %>%
mutate(n = TRUE) %>%
tidyr::pivot_wider(id_cols = Invoice, names_from = StockCode, values_from = n) %>%
select(Invoice, `22551`, `22554`) %>%
filter(if_all(everything(), ~ !is.na(.x)))
confidence
is the percentage of times the association
happens, that is with StockCode1
in an invoice there is
also StockCode2
, and that happens for 10
invoices out of 12
, the 0.8333333
df %>%
filter(Country == "Belgium") %>%
distinct() %>%
count(Invoice, StockCode) %>%
mutate(n = TRUE) %>%
tidyr::pivot_wider(id_cols = Invoice, names_from = StockCode, values_from = n, values_fill = FALSE) %>%
select(Invoice, `22551`, `22554`) %>%
filter(`22551`)
coverage
is the percentage of times
StockCode1
is present in the data set, so here its value is
0.2142857
(12
out of 56
)
df %>%
filter(Country == "Belgium") %>%
distinct() %>%
count(Invoice, StockCode) %>%
mutate(n = TRUE) %>%
tidyr::pivot_wider(id_cols = Invoice, names_from = StockCode, values_from = n) %>%
select(Invoice, `22551`) %>%
filter(`22551`)
lift
is calculated by dividing the
confidence
of the rule by the percentage of times
StockCode2
occurs in the data set, so
0.8333333
/ (12
/ 56
), that
equals 3.888889
df %>%
filter(Country == "Belgium") %>%
distinct() %>%
count(Invoice, StockCode) %>%
mutate(n = TRUE) %>%
tidyr::pivot_wider(id_cols = Invoice, names_from = StockCode, values_from = n) %>%
select(Invoice, `22554`) %>%
filter(`22554`)
It measures the “strength” of the rule, so, higher the
lift
, higher the number of times StockCode1
and StockCode2
occur together in a non fortuitous way,
meaning that the purchase of StockCode1
really improves the
chances of StockCode2
being bought. For this reasons, we
want it to be higher than 1.5
, given that having it under
1
means that the two stock codes are substitutes of one
another, while it equal to 1
means that the two stock codes
are independently bought.
Given that all of our rules have a strong lift
, the
factors by which we filter are confidence
(to have
trustworthy rules)
tibble(lhs = labels(lhs(rules)),
rhs = labels(rhs(rules)),
rules@quality) %>%
mutate(" " = "=>", .after = "lhs") %>%
rename(StockCode1 = lhs, StockCode2 = rhs) %>%
arrange(desc(confidence))
and coverage
(to have rules that happen frequently).
tibble(lhs = labels(lhs(rules)),
rhs = labels(rhs(rules)),
rules@quality) %>%
mutate(" " = "=>", .after = "lhs") %>%
rename(StockCode1 = lhs, StockCode2 = rhs) %>%
arrange(desc(coverage))
Cutoff values could be 0.75
for confidence
,
meaning that the rule is true 3
times out of
4
, and 0.20
for coverage
meaning
that StockCode1
is present every 5
stock
codes.
With these values, we preserve 3
rules,
tibble(lhs = labels(lhs(rules)),
rhs = labels(rhs(rules)),
rules@quality) %>%
mutate(" " = "=>", .after = "lhs") %>%
rename(StockCode1 = lhs, StockCode2 = rhs) %>%
filter(confidence >= 0.75 &
coverage >= 0.2)
for 4
distinct stock codes:
df %>%
filter(StockCode %in% c("22551", "22554", "22629", "22630")) %>%
distinct(StockCode, Description)
- analysis of the results
The 4
stock codes have, in the Belgian subset, these
characteristics,
df %>%
filter(Country == "Belgium" &
StockCode %in% c("22551", "22554", "22629", "22630")) %>%
group_by(StockCode) %>%
summarise("Number of Invoices" = n_distinct(Invoice),
"Median Quantity" = median(abs(Quantity)),
"Median Price" = median(Price))
that we can compare to the rest of the inventory, where we can see
that they are generally cheaper
library(ggplot2)
df %>%
filter(Country == "Belgium") %>%
mutate(Status = if_else(StockCode %in% c("22551", "22554", "22629", "22630"), "Selected Rules' Stock Codes", "Non Selected Rules' Stock Codes")) %>%
ggplot(aes(Status, Price)) +
geom_boxplot() +
labs(x = NULL,
y = NULL,
title ="Distribution of the Price column, differentiating by Status")
and that they are bought in roughly the same quantities.
df %>%
filter(Country == "Belgium") %>%
mutate(Status = if_else(StockCode %in% c("22551", "22554", "22629", "22630"), "Selected Rules' Stock Codes", "Non Selected Rules' Stock Codes")) %>%
ggplot(aes(Status, Quantity)) +
geom_boxplot() +
labs(x = NULL,
y = NULL,
title ="Distribution of the Quantity column, differentiating by Status")
Finally, let’s compare the revenues they generate with the overall
country revenues.
df %>%
filter(Country == "Belgium" &
!str_starts(Invoice, "C")) %>%
summarise("Belgian Revenues" = sum(Quantity * Price)) %>%
bind_cols(df %>%
filter(Country == "Belgium" &
StockCode %in% c("22551", "22554", "22629", "22630") &
!str_starts(Invoice, "C")) %>%
summarise("Selected Rules' Revenues" = sum(Quantity * Price))) %>%
mutate("In Percentage" = formattable::percent(`Selected Rules' Revenues` / sum(`Selected Rules' Revenues`, `Belgian Revenues`)))
- main takeaways and further developments
We here applied an association rules algorithm to a subset of our
data, finding, after some filtering, 4
stock codes that, in
pairs of 2
, are frequently bought together.
Different subsets can be examined afterwards or, with more processing
power, we can run the same analysis on the whole data frame.