In this document we will present the data frame used throughout this site, with the mean to have a first understanding of its structure and characteristics.


- source and structure

The data frame was first retrieved, on the 1st of September 2022, from https://archive.ics.uci.edu/ml/datasets/Online+Retail+II and its description states that

“This Online Retail II data set contains all the transactions occurring for a UK-based and registered, non-store online retail between 01/12/2009 and 09/12/2011.The company mainly sells unique all-occasion gift-ware. Many customers of the company are wholesalers.”


The data frame consists of 525461 rows and of these 8 columns, whose class is correct,

library(dplyr)
df %>%
  summarise(across(everything(), ~ class(.x)[1]))


except for Customer ID, that we will transform into character, removing the empty space from the name as well.

df <- df %>%
  mutate(CustomerID = as.character(`Customer ID`), .keep = "unused", .after = Price)

The data frame looks like this:

df

Where we can see that each row contains information about the purchase of an item.

The rows are ordered by InvoiceDate, so all the purchases pertaining to a single Invoice are adjacent.


- checking for tidiness

We see how the data frame respects the three rules of tidy data (every column is a variable, every row is an observation, every cell is a single value). This avoids the need to manipulate the rows and columns before proceeding into successive steps.


- respect of the definitions

In this section we will examine whether or not the data frame respects the definitions stated in the web page it was retrieved from.


- Invoice

Invoice is defined as

“Invoice number. Nominal. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter ‘c’, it indicates a cancellation.”


There are 28816 distinct invoices.

library(knitr)
kable(df %>%
        distinct(Invoice) %>%
        tally(name = "Number of Distinct Invoices"), align = "l")
Number of Distinct Invoices
28816


4592 of which have been cancelled, the 15.94%.

library(stringr)
df %>%
  mutate(Status = if_else(str_starts(Invoice, "C"), "Cancelled", "Confirmed")) %>%
  group_by(Status) %>%
  summarize("Number of Distinct Invoices" = n_distinct(Invoice)) %>%
  mutate(Percentage = formattable::percent(`Number of Distinct Invoices` / sum(`Number of Distinct Invoices`))) %>%
  arrange(desc(`Number of Distinct Invoices`))


We confirm that every Invoice is related to just one customer.

kable(df %>%
        count(Invoice, wt = n_distinct(CustomerID), name = "Number of Customers") %>%
        filter(`Number of Customers` > 1) %>%
        tally(`Number of Customers`, name = "Number of Invoices with more than one Customer"), align = "l")
Number of Invoices with more than one Customer
0


There are some invoices that don’t start with C though.

df %>%
  filter(str_length(Invoice) != 6 &
           !str_starts(Invoice, "C"))


- StockCode

StockCode is defined as

“Product (item) code. Nominal. A 5-digit integral number uniquely assigned to each distinct product.”


In the time frame of the data, 4631 different items have been invoiced.

kable(df %>%
        distinct(StockCode) %>%
        tally(name = "Number of Distinct Stock Codes"), align = "l")
Number of Distinct Stock Codes
4631


Not all the stock codes are numbers of length 5 though.

df %>%
  filter(str_length(StockCode) != 5)


Nor they seem perfectly uniquely assigned.

df %>%
  count(StockCode, Description, name = "Number of Occurrences") %>%
  group_by(StockCode) %>%
  filter(n() > 1) %>%
  ungroup()


- Description

Description is defined as

“Product (item) name. Nominal.”


and there are 4644 distinct ones, a number different from the distinct stock codes (when instead they should be equal), owing to multiple descriptions for some of them, as we’ve seen a little above.

kable(df %>%
        distinct(Description) %>%
        tally(name = "Number of Distinct Descriptions"), align = "l")
Number of Distinct Descriptions
4644


- Quantity

Quantity is defined as

“The quantities of each product (item) per transaction. Numeric.”


Looking at its distribution, we can see that most items are sold in small quantities, but there are larger values as well, both negative and positive.

df %>%
  reframe(Value = summary(Quantity))  %>%
  mutate(Statistic = c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max"), .before = Value)
library(ggplot2)
ggplot(df, aes(Quantity, "")) +
  geom_boxplot() +
  scale_x_continuous(trans = scales::modulus_trans(0.5), breaks = scales::breaks_extended(n = 9)) + 
  labs(x = NULL,
       y = NULL,
       title ="Distribution of the Quantity column, non linear x-axis")


- InvoiceDate

InvoiceDate is defined as

“Invoice date and time. Numeric. The day and time when a transaction was generated.”


Here we can take a look at how wide our time frame is and at how many invoices per day there are.

df %>%
  group_by("Invoice Day" = as.Date(InvoiceDate)) %>%
  mutate("Number of Invoices" = n_distinct(Invoice)) %>%
  ungroup() %>%
  ggplot(aes(`Invoice Day`, `Number of Invoices`)) +
  geom_line() + 
  labs(x = "",
       y = "", 
       title = "Number of Invoices per Day")

We notice that the last day is not 09/12/2011 as stated in the data frame definition but one year prior (09/12/2010),

kable(tibble("Last Day Present in the Data Frame" = max(as.Date(df$InvoiceDate))), align = "l")
Last Day Present in the Data Frame
2010-12-09


for a total of 373 days.

kable(tibble("Time Span" = max(as.Date(df$InvoiceDate)) - min(as.Date(df$InvoiceDate))), align = "l")
Time Span
373 days


- Price

Price is defined as

“Unit price. Numeric. Product price per unit in sterling (£).”


As per Quantity, we will take a look at its distribution.

df %>%
  reframe(Value = summary(Price))  %>%
  mutate(Statistic = c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max"), .before = Value)
ggplot(df, aes(Price, "")) +
  geom_boxplot() +
  scale_x_continuous(trans = scales::modulus_trans(0.25)) + 
  labs(x = NULL,
       y = NULL,
       title ="Distribution of the Price column, non linear x-axis")


Here we gather that the items we sell are largely cheap with some outliers, both positive and negative.


- Customer ID

Customer ID (previously changed to CustomerID) is defined as

“Customer number. Nominal. A 5-digit integral number uniquely assigned to each customer.”


There are 4384 different customers in our data frame.

kable(df %>%
        distinct(CustomerID) %>%
        tally(name = "Number of Distinct Customers"), align = "l")
Number of Distinct Customers
4384


- Country

Country is defined as

“Country name. Nominal. The name of the country where a customer resides.”


The customers are from 40 different countries, mostly European ones.
We notice as well an Unspecified value, occurring for 6 customers.

df %>%
  count(Country, wt = n_distinct(CustomerID), sort = TRUE, name = "Number of Customers")


- actions performed

We changed, together with its class, from numeric to character, Customer ID to CustomerID, for consistency with the other columns’ names and to facilitate typing.


- main takeaways

Besides gaining a general knowledge about the data frame, that we can sum up in the following table,

df %>%
  summarise(across(where(is.character), n_distinct),
            across(where(is.numeric), median))


where we showed the number of distinct values for the characters columns and the median value for the numeric ones, we have assembled some points of interest we will investigate, among other things, during the Data Wrangling sections.
Those are:

  • the Invoice column contains values that don’t start with C
  • the Stockcode column’s values don’t respect the definition
  • the non univocal relation between StockCode and Description
  • the time frame is one year shorter than stated
  • Unspecified values in the Country column