- fundamentals

consecutive_id() is a function that augments a counter every time there is a change in the sequence of elements, in the form of a vector, we feed into it. The counter starts from 1.

consecutive_id(c(1, 2, 1, 3, 4, 3))

## [1] 1 2 3 4 5 6

As long there is a change, the value of the elements is irrelevant.

consecutive_id(c(0, 1, 0, 1, 0, 1))

## [1] 1 2 3 4 5 6

NAs are not ignored

consecutive_id(c(0, 1, 0, NA, 1))

## [1] 1 2 3 4 5

and, when consecutive, they are treated as the same value.

consecutive_id(c(0, 1, 0, NA, NA, 1))

## [1] 1 2 3 4 4 5

- usage with a data frame

consecutive_id() comes useful when we want to group_by() on distinct sequences made by the same group, like the consecutive purchases made by a customer.

df %>%
  mutate(Invoice = Invoice,
         id = consecutive_id(`Customer ID`), .after = "Invoice", .keep = "used")

A tibble: 525461 x 3

In this way id will only augment when the value in the Customer ID column changes, so two consecutive invoices, made by the same customer, will have the same id (like invoice 489434 and 489435 for customer 13085).

We can then use id as a grouping column to calculate for example the total quantity of these consecutive purchases.

df %>%
  mutate(id = consecutive_id(`Customer ID`), .after = "Invoice") %>%
  group_by(id, `Customer ID`) %>%
  summarise(Cons_Purch_Total_Quantity = sum(Quantity))

`summarise()` has grouped output by 'id'. You can override using the
`.groups` argument.

A tibble: 24937 x 3

Groups: id [24937]

If we didn’t group_by(id) as well, we would have obtained the global total quantity for each customer.

df %>%
  group_by(`Customer ID`) %>%
  summarise(Total_Quantity = sum(Quantity))

A tibble: 4384 x 2

Notice the difference with customer 13085 for example.

df %>%
  group_by(`Customer ID`) %>%
  summarise(Total_Quantity = sum(Quantity)) %>%
  filter(`Customer ID` == 13085)

A tibble: 1 x 2

With consecutive_id() we can isolate consecutive runs of unique values in a column then.

- multiple sequences

consecutive_id() accepts multiple sequences as well, as long as they are of the same length. As they are inspected in parallel and not one after the other, to increase the counter it is sufficient the change in value in one of them.

consecutive_id(c(0, 0, 1, 1), 
               c(0, 1, 0, 1))

## [1] 1 2 3 4

We can take advantage of this feature by imagining that our data frame didn’t come with the Invoice column, and somehow recreate it by feeding InvoiceDate and Customer ID to consecutive_id().

df %>%
  filter(!is.na(`Customer ID`)) %>%
  mutate(Invoice = Invoice,
         id = consecutive_id(InvoiceDate, `Customer ID`), .after = "Invoice", .keep = "used")

A tibble: 417534 x 4

As long as every Invoice has a different time stamp (and there are no consecutive NAs in the columns processed), id will have the same increments as Invoice.

df %>%
  filter(!is.na(`Customer ID`)) %>%
  mutate(id = consecutive_id(InvoiceDate, `Customer ID`),
         id_Invoice = consecutive_id(Invoice), .after = "Invoice", .keep = "used") %>%
  filter(id != id_Invoice)

A tibble: 415328 x 5

- with group_by()

On a grouped data frame consecutive_id() creates a counter for every group.

df %>%
  group_by(Country) %>%
  mutate(id = consecutive_id(Invoice), .after = "Invoice", .keep = "used")

A tibble: 525461 x 3

Groups: Country [40]

df %>%
  group_by(Country) %>%
  mutate(id = consecutive_id(Invoice), .after = "Invoice") %>%
  distinct(Country, id)

A tibble: 29401 x 2

Groups: Country [40]