- fundamentals

Another family of window functions is the offset one, comprised of lag() and lead(), whose functionality is to shift the values of one vector in one direction.

(x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
##  [1]  1  2  3  4  5  6  7  8  9 10
lag(x)
##  [1] NA  1  2  3  4  5  6  7  8  9
lead(x)
##  [1]  2  3  4  5  6  7  8  9 10 NA

More specifically, lag() “pushes” the values forward while lead() “pulls” them back. Notice how lag() creates an NA at the beginning and removes the last value 10 while lead() creates it at the end and removes the first value 1.

About NAs and +/- Inf, they are not ignored.

lag(c(x, NA, 24, Inf, 77, 66))
##  [1]  NA   1   2   3   4   5   6   7   8   9  10  NA  24 Inf  77

- default

We can use the default argument to change what values, instead of NAs, created elements will have.

lag(x, default = -Inf)
##  [1] -Inf    1    2    3    4    5    6    7    8    9
lead(x, default = Inf)
##  [1]   2   3   4   5   6   7   8   9  10 Inf

The value inserted must be of the same class of the vector though.

lag(x, default = "missing")
## Error in `lag()`:
## ! Can't convert `default` <character> to match type of `x` <double>.

- n

In case we want to shift by an amount greater than 1 we can use the n argument. n must always be a positive integer.

lag(x, n = 2)
##  [1] NA NA  1  2  3  4  5  6  7  8

- order_by

The last argument is order_by, which offsets by an order different than the one of the rows indexes.

Here an example with the order reversed, which is the same as if we used lead() because we just changed the direction of the shift.

rev_order <- sort(x, decreasing = TRUE)
tibble(x, 
       lag(x), 
       rev_order, 
       lag(x, order_by = rev_order), 
       lead(x))
A tibble: 10 x 5

And here a more mind bending example,

scrambled_order <- sample(x, 10)
tibble(x, 
       lag(x), 
       scrambled_order, 
       lag(x, order_by = scrambled_order))
A tibble: 10 x 4

that becomes clearer if we sort the vector by the new order, showing that using order_by we first rearrange by the vector provided and then we offset.

tibble(x, 
       scrambled_order, 
       lag(x, order_by = scrambled_order)) %>%
  arrange(scrambled_order)
A tibble: 10 x 3

- usage with a data frame

With a data frame, offsets functions are particularly useful if we have variations of values through time.

(df_offset <- df %>%
   mutate(Invoice_Day = as.Date(df$InvoiceDate)) %>%
   count(Invoice_Day, wt = Quantity, name = "Total_Quantity"))
A tibble: 307 x 2

We can then extract information like the amount of variation

df_offset %>%
  mutate(Previous_Total_Quantity = lag(Total_Quantity),
         Delta_Quantity = Total_Quantity - Previous_Total_Quantity)
A tibble: 307 x 4

or preserving only the rows where Total_Quantity increased from the previous time stamp.

df_offset %>%
  mutate(Previous_Total_Quantity = lag(Total_Quantity)) %>%
  filter(Total_Quantity > lag(Total_Quantity))
A tibble: 148 x 3

So in general using offsets functions allows us to compare the values in one row with the values of previous or following ones.

We can also use them to know when a stock code has changed Price for example.

df %>%
  filter(StockCode == "10002")
A tibble: 327 x 8
df %>%
  filter(StockCode == "10002") %>%
  filter(Price != lag(Price))
A tibble: 102 x 8

Using lag() in the previous example we only preserve the rows where Price has changed from the one prior. Using lead() instead we show the rows before Price changes.

df %>%
  filter(StockCode == "10002") %>%
  filter(Price != lead(Price))
A tibble: 102 x 8

In this way we keep the starting price as a reference but I feel that using lag() is more correct.

- with group_by()

In case we want to generalize to all stock codes we can create a grouped data frame to have lag() working independently for each group.

df %>%
  group_by(StockCode) %>%
  mutate(Previous_Price = lag(Price), .keep = "used") %>%
  filter(Price != Previous_Price)
A tibble: 171576 x 3
Groups: StockCode [3965]
df %>%
  group_by(StockCode) %>%
  mutate(Previous_Price = lag(Price), .keep = "used") %>%
  filter(Price != Previous_Price) %>%
  arrange(StockCode)
A tibble: 171576 x 3
Groups: StockCode [3965]

Not grouping the data frame and using order_by or arranging before filtering is not correct as lag() will shift the last Price value of a stock code as the first Price value of the following one, so filter() will compare the price of two different items.

df %>%
  filter(Price != lag(Price, order_by = StockCode))
A tibble: 175205 x 8
df %>%
  arrange(StockCode) %>%
  filter(Price != lag(Price))
A tibble: 175205 x 8

Instead of filter() we can use mutate(), to have a column with the previous values, with NAs for the first occurrence of all the unique values of the grouping column.

df %>%
  group_by(StockCode) %>%
  mutate(Lag_Price = lag(Price), .keep = "used")
A tibble: 525461 x 3
Groups: StockCode [4631]

In case of more than one grouping column we will have NAs for every first occurrence of all the existing combinations of the columns.

df %>%
  group_by(StockCode, Country) %>%
  mutate(Lag_Price = lag(Price), .keep = "used")
A tibble: 525461 x 4
Groups: StockCode, Country [18735]

Here row 94 has an NA where in the previous example it didn’t, that is because it is the first occurrence for the combination of stock code 22138 with United Kingdom.