Another family of window functions is the offset functions, comprising lag() and lead(), which shift the values of a vector in one direction.
(x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
## [1] 1 2 3 4 5 6 7 8 9 10
lag(x)
## [1] NA 1 2 3 4 5 6 7 8 9
lead(x)
## [1] 2 3 4 5 6 7 8 9 10 NA
More specifically, lag() “pushes” the values forward while lead() “pulls” them back. Notice how lag() creates an NA at the beginning and removes the last value, 10, while lead() creates it at the end and removes the first value, 1.
NAs and +/-Inf are not ignored: they are shifted like any other value.
lag(c(x, NA, 24, Inf, 77, 66))
## [1] NA 1 2 3 4 5 6 7 8 9 10 NA 24 Inf 77
We can use the default argument to change the value, instead of NA, that the created elements will take.
lag(x, default = -Inf)
## [1] -Inf 1 2 3 4 5 6 7 8 9
lead(x, default = Inf)
## [1] 2 3 4 5 6 7 8 9 10 Inf
The value inserted must be of the same class as the vector though.
lag(x, default = "missing")
## Error in `lag()`:
## ! Can't convert `default` <character> to match type of `x` <double>.
In case we want to shift by an amount greater than 1 we can use the n argument. n must always be a positive integer.
lag(x, n = 2)
## [1] NA NA 1 2 3 4 5 6 7 8
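The n argument works the same way for lead(); as an added illustration (not part of the original output), shifting by 3 positions:
lead(x, n = 3)
## [1] 4 5 6 7 8 9 10 NA NA NA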
The last argument is order_by, which performs the offset following an order different from that of the row indexes. Here is an example with the order reversed, which is the same as if we used lead() because we just changed the direction of the shift.
rev_order <- sort(x, decreasing = TRUE)
tibble(x,
       lag(x),
       rev_order,
       lag(x, order_by = rev_order),
       lead(x))
And here is a more mind-bending example,
scrambled_order <- sample(x, 10)
tibble(x,
       lag(x),
       scrambled_order,
       lag(x, order_by = scrambled_order))
which becomes clearer if we sort the vector by the new order, showing that with order_by we first rearrange by the vector provided and then we offset.
tibble(x,
       scrambled_order,
       lag(x, order_by = scrambled_order)) %>%
  arrange(scrambled_order)
With a data frame, offset functions are particularly useful when values vary through time.
(df_offset <- df %>%
   mutate(Invoice_Day = as.Date(df$InvoiceDate)) %>%
   count(Invoice_Day, wt = Quantity, name = "Total_Quantity"))
We can then extract information like the amount of variation
df_offset %>%
  mutate(Previous_Total_Quantity = lag(Total_Quantity),
         Delta_Quantity = Total_Quantity - Previous_Total_Quantity)
or preserve only the rows where Total_Quantity increased from the previous time stamp.
df_offset %>%
  mutate(Previous_Total_Quantity = lag(Total_Quantity)) %>%
  filter(Total_Quantity > Previous_Total_Quantity)
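As an additional sketch (building on the df_offset created above; the Pct_Change column is just an illustration, not part of the original analysis), we could also express the variation as a relative change:
df_offset %>%
  mutate(Previous_Total_Quantity = lag(Total_Quantity),
         Pct_Change = (Total_Quantity - Previous_Total_Quantity) / Previous_Total_Quantity)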
So in general, using offset functions allows us to compare the values in one row with those of previous or following ones.
We can also use them to know, for example, when the Price of a stock code has changed.
df %>%
  filter(StockCode == "10002")
df %>%
  filter(StockCode == "10002") %>%
  filter(Price != lag(Price))
Using lag() in the previous example we only preserve the rows where Price has changed from the one prior. Using lead() instead we show the rows before Price changes.
df %>%
  filter(StockCode == "10002") %>%
  filter(Price != lead(Price))
In this way we keep the starting price as a reference, but I feel that using lag() is more correct.
If we want to generalize to all stock codes, we can create a grouped data frame so that lag() works independently within each group.
df %>%
  group_by(StockCode) %>%
  mutate(Previous_Price = lag(Price), .keep = "used") %>%
  filter(Price != Previous_Price)
df %>%
  group_by(StockCode) %>%
  mutate(Previous_Price = lag(Price), .keep = "used") %>%
  filter(Price != Previous_Price) %>%
  arrange(StockCode)
Not grouping the data frame and just using order_by, or arranging before filtering, is not correct, as lag() will shift the last Price value of one stock code into the first row of the following one, so filter() will compare the prices of two different items.
df %>%
  filter(Price != lag(Price, order_by = StockCode))
df %>%
  arrange(StockCode) %>%
  filter(Price != lag(Price))
Instead of filter() we can use mutate() to have a column with the previous values, with an NA for the first occurrence of each unique value of the grouping column.
df %>%
  group_by(StockCode) %>%
  mutate(Lag_Price = lag(Price), .keep = "used")
With more than one grouping column, we will have an NA for the first occurrence of every existing combination of those columns.
df %>%
  group_by(StockCode, Country) %>%
  mutate(Lag_Price = lag(Price), .keep = "used")
Here row 94 has an NA where in the previous example it didn't, because it is the first occurrence of the combination of stock code 22138 with the United Kingdom.
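To make this pattern easier to see, here is a small made-up example (the toy tibble and its values are hypothetical, just for illustration):
toy <- tibble(StockCode = c("A", "A", "A", "B", "B"),
              Country = c("UK", "UK", "France", "France", "France"),
              Price = c(1, 2, 3, 4, 5))
toy %>%
  group_by(StockCode, Country) %>%
  mutate(Lag_Price = lag(Price))
Lag_Price here would be NA, 1, NA, NA and 4: an NA at the first row of each StockCode/Country combination and the previous Price within that combination elsewhere.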