This family of functions, offering cumulative variations on the base R aggregate functions sum(), mean(), max(), min() and prod(), consists of cumsum(), cummean(), cummax(), cummin() and cumprod(). cummean() is a dplyr function, while the others can be found in base R.
The variation lies in the fact that, while the original functions output a scalar with just the final result of their calculation, the cumulative versions return a vector with as many elements as the input, showing step by step how the final result is reached.
Starting with the variation of sum(), we have cumsum():
(x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
## [1] 1 2 3 4 5 6 7 8 9 10
sum(x)
## [1] 55
cumsum(x)
## [1] 1 3 6 10 15 21 28 36 45 55
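The step-by-step behavior can also be expressed in base R with Reduce(), whose accumulate = TRUE argument keeps every intermediate result of the fold instead of just the last one; a minimal sketch for the sum (not how cumsum() is implemented internally):

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# Fold `+` over x, keeping every intermediate result
step_by_step <- Reduce(`+`, x, accumulate = TRUE)
step_by_step
## [1] 1 3 6 10 15 21 28 36 45 55

# The last element of the cumulative version is the plain aggregate
step_by_step[length(step_by_step)] == sum(x)
## [1] TRUE
```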
Then we have cummean(), cummax(), cummin() and cumprod(), which all behave as expected:
mean(x)
## [1] 5.5
cummean(x)
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
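cummean() has no base R counterpart, but an equivalent (a sketch of the logic, not dplyr's implementation) divides the running sum by the running count:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# Running mean = running sum / running count
cumsum(x) / seq_along(x)
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
```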
max(x)
## [1] 10
cummax(x)
## [1] 1 2 3 4 5 6 7 8 9 10
min(x)
## [1] 1
cummin(x)
## [1] 1 1 1 1 1 1 1 1 1 1
prod(x)
## [1] 3628800
cumprod(x)
## [1] 1 2 6 24 120 720 5040 40320 362880
## [10] 3628800
The presence of NAs can be a problem, as these functions don't have a na.rm argument like their non-cumulative counterparts, so after the first NA we will get only NAs.
(xNA <- c(1, 2, 3, 4, 5, NA, 6, 7, 8, 9, 10))
## [1] 1 2 3 4 5 NA 6 7 8 9 10
cumsum(xNA)
## [1] 1 3 6 10 15 NA NA NA NA NA NA
cummean(xNA)
## [1] 1.0 1.5 2.0 2.5 3.0 NA NA NA NA NA NA
cummax(xNA)
## [1] 1 2 3 4 5 NA NA NA NA NA NA
cummin(xNA)
## [1] 1 1 1 1 1 NA NA NA NA NA NA
cumprod(xNA)
## [1] 1 2 6 24 120 NA NA NA NA NA NA
If we want to change the direction of accumulation we can do so by ordering by another vector,
x
## [1] 1 2 3 4 5 6 7 8 9 10
cumsum(x)
## [1] 1 3 6 10 15 21 28 36 45 55
(rev_order <- sort(x, decreasing = TRUE))
## [1] 10 9 8 7 6 5 4 3 2 1
wrapping our functions with order_by().
order_by(rev_order, cumsum(x))
## [1] 55 54 52 49 45 40 34 27 19 10
order_by() takes a vector of indexes as its first argument and a call to an accumulation function as its second. The elements are then accumulated following the order of the indexes specified in the first argument, in this case reversing the direction of the accumulation. The output still respects the order of the elements of the input vector though (notice how it starts with the last value calculated (55) and not with the first one (10)).
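For this particular case, a plain reversal, the same output can also be obtained in base R (an equivalent idiom, not what order_by() does internally) by reversing the input, accumulating and reversing back:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# Accumulate from the right: reverse, cumsum, then restore the original order
rev(cumsum(rev(x)))
## [1] 55 54 52 49 45 40 34 27 19 10
```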
The output of a scrambled order
scrambled_order <- sample(x, 10)

tibble(x,
       cumsum(x),
       scrambled_order,
       "scrambled_order_cumsum(x)" = order_by(scrambled_order, cumsum(x)))
can be made clearer if we sort the vector to be accumulated by the order itself.
tibble(x,
       cumsum(x),
       scrambled_order,
       "scrambled_order_cumsum(x)" = order_by(scrambled_order, cumsum(x))) %>%
  select(-"cumsum(x)") %>%
  arrange(scrambled_order)
Inside another function (like tibble()) it is advised to use with_order() instead, which takes the ordering vector as its first argument, the accumulation function as its second and that function's arguments as its third.
tibble(x,
       cumsum(x),
       scrambled_order,
       "scrambled_order_cumsum(x)" = with_order(scrambled_order, cumsum, x))
Let’s see the same example but applied to a data frame.
df %>%
  slice(1:10) %>%
  mutate(cumsum(Quantity),
         Scrambled_Order = scrambled_order,
         Scrambled_Order_Cum_Sum = with_order(scrambled_order, cumsum, Quantity),
         .keep = "used")
More generally, applying these functions to a data frame is quite straightforward: usually we filter by the values generated by the accumulation. For example, we could store the cumulative sum of the quantity progressively ordered by a client
df %>%
  filter(`Customer ID` == 15055) %>%
  mutate(Cumulative_Quantity = cumsum(Quantity), .keep = "used")
and then for example preserve the rows after a certain threshold has been reached.
df %>%
  filter(`Customer ID` == 15055) %>%
  mutate(Cumulative_Quantity = cumsum(Quantity), .keep = "used") %>%
  filter(Cumulative_Quantity >= 100)
Or maybe we are interested in seeing how the average daily quantity ordered by a client changed over time.
df %>%
  filter(`Customer ID` == 15055) %>%
  count(Invoice_Day = as.Date(InvoiceDate), wt = Quantity, name = "Total_Daily_Quantity") %>%
  mutate(Cumulative_Avg_Quantity = cummean(Total_Daily_Quantity))
If there are NAs, besides removing them we can recode them to 0 when using cumsum() or cummax(), to prevent them from stopping the accumulation.
tibble(xNA) %>%
  mutate(Cum_Sum_NA = cumsum(xNA),
         Cum_Max_NA = cummax(xNA),
         x0 = coalesce(xNA, 0),
         Cum_Sum = cumsum(x0),
         Cum_Max = cummax(x0))
With cummean(), cummin() and cumprod(), recoding the NAs to 0 might not give the desired output, and the best course of action depends on the use case (with cummean() we could, for instance, replace each NA with the cumulative average up to that point).
tibble(xNA) %>%
  mutate(Cum_Mean_NA = cummean(xNA),
         x_Cum_Mean_Lag = coalesce(xNA, lag(Cum_Mean_NA)),
         Cum_Mean = cummean(x_Cum_Mean_Lag))
On a grouped data frame these functions work group-wise as expected, so we can generalize an example seen before to all the clients, preserving the rows after a quantity threshold has been reached.
df %>%
  group_by(`Customer ID`) %>%
  mutate(Cumulative_Quantity = cumsum(Quantity), .keep = "used") %>%
  filter(Cumulative_Quantity >= 100)
dplyr adds two other functions to this family, cumany() and cumall(), both of which evaluate logical vectors.
(x <- c(1, 2, 3, 2))
## [1] 1 2 3 2
x > 2
## [1] FALSE FALSE TRUE FALSE
cumany() returns TRUE for every element after the first TRUE, even if the logical vector contains some FALSE values after it.
cumany(x > 2)
## [1] FALSE FALSE TRUE TRUE
cumall() instead returns FALSE for every element after the first FALSE, even if there are some TRUE values after it.
x < 3
## [1] TRUE TRUE FALSE TRUE
cumall(x < 3)
## [1] TRUE TRUE FALSE FALSE
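On NA-free logical vectors, both helpers can be emulated in base R with cumsum() (a sketch of the logic, not dplyr's implementation): a condition has been TRUE at least once when its running sum is positive, and has always been TRUE while the running count of FALSEs is still zero.

```r
p <- c(FALSE, FALSE, TRUE, FALSE)
q <- c(TRUE, TRUE, FALSE, TRUE)

# cumany(p): TRUE from the first TRUE onwards
cumsum(p) > 0
## [1] FALSE FALSE TRUE TRUE

# cumall(q): TRUE until the first FALSE
cumsum(!q) == 0
## [1] TRUE TRUE FALSE FALSE
```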
With these behaviors, cumany() and cumall() can be used inside filter() calls to preserve the rows from, or up to, a particular one.
cumany() is used to preserve the rows after a certain condition is TRUE for the first time (the row where this happens included), regardless of the condition becoming FALSE afterwards; for example, the rows after a stock code had a price higher than 2 for the first time.
df %>%
  filter(StockCode == 21232)

df %>%
  filter(StockCode == 21232) %>%
  filter(cumany(Price > 2))
cumall() preserves the rows before a certain condition is FALSE for the first time, regardless of it becoming TRUE again; for example, the rows before the price of a stock code stopped being less than 2.
df %>%
  filter(StockCode == 21232) %>%
  filter(cumall(Price < 2))
With that in mind, we can't use cumall() with a condition whose first output value is FALSE, as that will preserve no rows.
df %>%
  filter(StockCode == 21232) %>%
  filter(cumall(Price > 2))
With NAs, || (the non-cumulative version of cumany()) returns TRUE when combined with TRUE
NA || TRUE
## [1] TRUE
and NA with FALSE.
NA || FALSE
## [1] NA
So, if there are NAs, cumany() will return NA instead of FALSE.
(xNA <- c(1, 2, 3, NA, 2))
## [1] 1 2 3 NA 2
xNA > 3
## [1] FALSE FALSE FALSE NA FALSE
cumany(xNA > 3)
## [1] FALSE FALSE FALSE NA NA
filter() treats the rows with NAs like FALSE, so in these manipulations that is not a problem; besides, cumany() returns TRUE after the first TRUE, so subsequent NAs are inconsequential.
df %>%
  filter(StockCode == "10123G")

df %>%
  filter(StockCode == "10123G") %>%
  filter(cumany(`Customer ID` < 17968))
&& (the non-cumulative version of cumall()) instead returns NA when combined with TRUE
NA && TRUE
## [1] NA
and FALSE when combined with FALSE.
NA && FALSE
## [1] FALSE
The first behavior could pose a problem for cumall(),
,
xNA
## [1] 1 2 3 NA 2
xNA > 0
## [1] TRUE TRUE TRUE NA TRUE
cumall(xNA > 0)
## [1] TRUE TRUE TRUE NA NA
as rows with NAs will not be preserved by filter().
df %>%
  filter(StockCode == "10123G") %>%
  slice(4:n())

df %>%
  filter(StockCode == "10123G") %>%
  slice(4:n()) %>%
  filter(cumall(`Customer ID` < 17920))
The only solution is to remove NAs beforehand.
df %>%
  filter(StockCode == "10123G" &
           !is.na(`Customer ID`)) %>%
  slice(3:n())

df %>%
  filter(StockCode == "10123G" &
           !is.na(`Customer ID`)) %>%
  slice(3:n()) %>%
  filter(cumall(`Customer ID` < 17920))
With a grouped data frame the results are as expected with the computation working group-wise.
df %>%
  group_by(`Customer ID`)

df %>%
  group_by(`Customer ID`) %>%
  filter(cumall(Quantity < 9))