consecutive_id()
is a function that augments a counter
every time there is a change in the sequence of elements, in the form of
a vector, we feed into it. The counter starts from 1.
consecutive_id(c(1, 2, 1, 3, 4, 3))
## [1] 1 2 3 4 5 6
As long there is a change, the value of the elements is irrelevant.
consecutive_id(c(0, 1, 0, 1, 0, 1))
## [1] 1 2 3 4 5 6
NAs are not ignored
consecutive_id(c(0, 1, 0, NA, 1))
## [1] 1 2 3 4 5
and, when consecutive, they are treated as the same value.
consecutive_id(c(0, 1, 0, NA, NA, 1))
## [1] 1 2 3 4 4 5
consecutive_id()
comes useful when we want to
group_by()
on distinct sequences made by the same group,
like the consecutive purchases made by a customer.
%>%
df mutate(Invoice = Invoice,
id = consecutive_id(`Customer ID`), .after = "Invoice", .keep = "used")
In this way id
will only augment when the value in the
Customer ID
column changes, so two consecutive invoices,
made by the same customer, will have the same id
(like
invoice 489434 and 489435 for customer 13085).
We can then use id
as a grouping column to calculate for
example the total quantity of these consecutive purchases.
%>%
df mutate(id = consecutive_id(`Customer ID`), .after = "Invoice") %>%
group_by(id, `Customer ID`) %>%
summarise(Cons_Purch_Total_Quantity = sum(Quantity))
`summarise()` has grouped output by 'id'. You can override using the
`.groups` argument.
If we didn’t group_by(id)
as well, we would have
obtained the global total quantity for each customer.
%>%
df group_by(`Customer ID`) %>%
summarise(Total_Quantity = sum(Quantity))
Notice the difference with customer 13085 for example.
%>%
df group_by(`Customer ID`) %>%
summarise(Total_Quantity = sum(Quantity)) %>%
filter(`Customer ID` == 13085)
With consecutive_id()
we can isolate consecutive runs of
unique values in a column then.
consecutive_id()
accepts multiple sequences as well, as
long as they are of the same length. As they are inspected in parallel
and not one after the other, to increase the counter it is sufficient
the change in value in one of them.
consecutive_id(c(0, 0, 1, 1),
c(0, 1, 0, 1))
## [1] 1 2 3 4
We can take advantage of this feature by imagining that our data
frame didn’t come with the Invoice
column, and somehow
recreate it by feeding InvoiceDate
and
Customer ID
to consecutive_id()
.
%>%
df filter(!is.na(`Customer ID`)) %>%
mutate(Invoice = Invoice,
id = consecutive_id(InvoiceDate, `Customer ID`), .after = "Invoice", .keep = "used")
As long as every Invoice
has a different time stamp (and
there are no consecutive NAs in the columns processed), id
will have the same increments as Invoice
.
%>%
df filter(!is.na(`Customer ID`)) %>%
mutate(id = consecutive_id(InvoiceDate, `Customer ID`),
id_Invoice = consecutive_id(Invoice), .after = "Invoice", .keep = "used") %>%
filter(id != id_Invoice)
On a grouped data frame consecutive_id()
creates a
counter for every group.
%>%
df group_by(Country) %>%
mutate(id = consecutive_id(Invoice), .after = "Invoice", .keep = "used")
%>%
df group_by(Country) %>%
mutate(id = consecutive_id(Invoice), .after = "Invoice") %>%
distinct(Country, id)