- cumsum(), cummean(), cummax(), cummin() & cumprod()

This family of functions, the cumulative counterparts of the base R aggregate functions sum(), mean(), max(), min() and prod(), consists of

cumsum()
cummean()
cummax()
cummin()
cumprod()

cummean() is a dplyr function while the others can be found in base R.
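
Since cummean() comes from dplyr (as do order_by(), with_order(), cumany(), cumall() and coalesce(), all used later on), we need to load the package first:

library(dplyr)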

The difference is that, while the original functions output a scalar with just the final result of their calculation, the cumulative versions return a vector with as many elements as the input, showing step by step how the final result is reached.

Starting with the variation of sum() we have cumsum():

(x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
##  [1]  1  2  3  4  5  6  7  8  9 10
sum(x)
## [1] 55
cumsum(x)
##  [1]  1  3  6 10 15 21 28 36 45 55
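
Incidentally, cumsum() can be thought of as a running reduction with the + operator; a quick way to see this in base R, just to build intuition (it is not how cumsum() is actually implemented):

Reduce(`+`, x, accumulate = TRUE)
##  [1]  1  3  6 10 15 21 28 36 45 55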

Then we have cummean(), cummax(), cummin() and cumprod(), all of which behave as expected:

mean(x)
## [1] 5.5
cummean(x)
##  [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
max(x)
## [1] 10
cummax(x)
##  [1]  1  2  3  4  5  6  7  8  9 10
min(x)
## [1] 1
cummin(x)
##  [1] 1 1 1 1 1 1 1 1 1 1
prod(x)
## [1] 3628800
cumprod(x)
##  [1]       1       2       6      24     120     720    5040   40320  362880
## [10] 3628800

The presence of NAs can be a problem, as these functions don't have an na.rm argument like their aggregate counterparts, so from the first NA onwards we will get only NAs.

(xNA <- c(1, 2, 3, 4, 5, NA, 6, 7, 8, 9, 10))
##  [1]  1  2  3  4  5 NA  6  7  8  9 10
cumsum(xNA) 
##  [1]  1  3  6 10 15 NA NA NA NA NA NA
cummean(xNA)
##  [1] 1.0 1.5 2.0 2.5 3.0  NA  NA  NA  NA  NA  NA
cummax(xNA)
##  [1]  1  2  3  4  5 NA NA NA NA NA NA
cummin(xNA)
##  [1]  1  1  1  1  1 NA NA NA NA NA NA
cumprod(xNA)
##  [1]   1   2   6  24 120  NA  NA  NA  NA  NA  NA

If we want to change the direction of the accumulation, we can do so by ordering along another vector,

x
##  [1]  1  2  3  4  5  6  7  8  9 10
cumsum(x)
##  [1]  1  3  6 10 15 21 28 36 45 55
(rev_order <- sort(x, decreasing = TRUE))
##  [1] 10  9  8  7  6  5  4  3  2  1

wrapping our functions with order_by().

order_by(rev_order, cumsum(x))
##  [1] 55 54 52 49 45 40 34 27 19 10

order_by() takes the ordering vector as the first argument and a call to the accumulation function as the second.

The elements are then accumulated following the order specified by the first argument, in this case reversing the direction of the accumulation.

The output still respects the original order of the vector being accumulated though (notice how it starts with the last value calculated, 55, and not with the first one, 10).
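
Conceptually, this is the same as reordering the vector, accumulating and then restoring the original order; a manual sketch of the idea (not dplyr's actual code, though with_order() works along these lines):

ord <- order(rev_order)
cumsum(x[ord])[order(ord)]
##  [1] 55 54 52 49 45 40 34 27 19 10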

The output of a scrambled order

scrambled_order <- sample(x, 10)
tibble(x, 
       cumsum(x), 
       scrambled_order, 
       "scrambled_order_cumsum(x)" = order_by(scrambled_order, cumsum(x)))
## # A tibble: 10 x 4

can be made clearer if we sort the vector to be accumulated by the order itself.

tibble(x, 
       cumsum(x), 
       scrambled_order, 
       "scrambled_order_cumsum(x)" = order_by(scrambled_order, cumsum(x))) %>%
  select(-"cumsum(x)") %>%
  arrange(scrambled_order)
## # A tibble: 10 x 3

Inside another function (like tibble()) it is advised to use with_order() instead, which takes the ordering vector as the first argument, the accumulation function (without parentheses) as the second and the accumulation function's arguments from the third onwards.

tibble(x, 
       cumsum(x), 
       scrambled_order, 
       "scrambled_order_cumsum(x)" = with_order(scrambled_order, cumsum, x))
## # A tibble: 10 x 4

Let’s see the same example but applied to a data frame.

df %>%
  slice(1:10) %>%
  mutate(cumsum(Quantity), 
         Scrambled_Order = scrambled_order,
         Scrambled_Order_Cum_Sum = with_order(scrambled_order, cumsum, Quantity), .keep = "used")
## # A tibble: 10 x 4

- usage with a data frame

More generally, applying these functions to a data frame is quite straightforward: usually we filter on the values generated by the accumulation.

For example, we could store the cumulative sum of the quantities progressively ordered by a customer

df %>%
  filter(`Customer ID` == 15055) %>%
  mutate(Cumulative_Quantity = cumsum(Quantity), .keep = "used")
## # A tibble: 85 x 2

and then, for example, preserve the rows after a certain threshold has been reached.

df %>%
  filter(`Customer ID` == 15055) %>%
  mutate(Cumulative_Quantity = cumsum(Quantity), .keep = "used") %>%
  filter(Cumulative_Quantity >= 100)
## # A tibble: 30 x 2
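
As a variation (a hypothetical extension of the example above), if we wanted to keep only the row where the threshold is first crossed, we could also require the previous cumulative value, retrieved with lag(), to be below it:

df %>%
  filter(`Customer ID` == 15055) %>%
  mutate(Cumulative_Quantity = cumsum(Quantity), .keep = "used") %>%
  filter(Cumulative_Quantity >= 100 &
           lag(Cumulative_Quantity, default = 0) < 100)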

Or maybe we are interested in seeing how the average daily quantity ordered by a customer changed over time.

df %>%
  filter(`Customer ID` == 15055) %>%
  count(Invoice_Day = as.Date(InvoiceDate), wt = Quantity, name = "Total_Daily_Quantity") %>%
  mutate(Cumulative_Avg_Quantity = cummean(Total_Daily_Quantity))
## # A tibble: 3 x 3

If there are NAs, besides removing them we can recode them to 0 when using cumsum() or cummax() (for cummax() this works as long as the values are non-negative), so they don't halt the accumulation.

tibble(xNA) %>%
  mutate(Cum_Sum_NA = cumsum(xNA),
         Cum_Max_NA = cummax(xNA),
         x0 = coalesce(xNA, 0),
         Cum_Sum = cumsum(x0),
         Cum_Max = cummax(x0))
## # A tibble: 11 x 6

With cummean(), cummin() and cumprod(), recoding the NAs to 0 might not give the desired output and the best course of action depends on the use case (with cummean(), for instance, we could replace each NA with the cumulative average up to that point).

tibble(xNA) %>%
  mutate(Cum_Mean_NA = cummean(xNA),
         x_Cum_Mean_Lag = coalesce(xNA, lag(Cum_Mean_NA)),
         Cum_Mean = cummean(x_Cum_Mean_Lag))
## # A tibble: 11 x 4
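
Following the same logic, with cumprod() a sensible recoding (assuming we simply want the NAs to be skipped) is to 1, the multiplicative identity, so the running product is left unchanged:

tibble(xNA) %>%
  mutate(Cum_Prod_NA = cumprod(xNA),
         x1 = coalesce(xNA, 1),
         Cum_Prod = cumprod(x1))
## # A tibble: 11 x 4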

- with group_by()

On a grouped data frame these functions work as expected, so we can generalize to all customers an example seen before, preserving the rows after a quantity threshold has been reached.

df %>%
  group_by(`Customer ID`) %>%
  mutate(Cumulative_Quantity = cumsum(Quantity), .keep = "used") %>%
  filter(Cumulative_Quantity >= 100)
## # A tibble: 462747 x 3
## # Groups: Customer ID [3661]

- cumany() & cumall()

dplyr adds two other functions to this family, cumany() and cumall(), both of which operate on logical vectors.

(x <- c(1, 2, 3, 2))
## [1] 1 2 3 2
x > 2
## [1] FALSE FALSE  TRUE FALSE

cumany() returns TRUE for every element from the first TRUE onwards, even if the logical vector contains some FALSE values after it.

cumany(x > 2)
## [1] FALSE FALSE  TRUE  TRUE

cumall() instead returns FALSE for every element from the first FALSE onwards, even if there are some TRUE values after it.

x < 3
## [1]  TRUE  TRUE FALSE  TRUE
cumall(x < 3)
## [1]  TRUE  TRUE FALSE FALSE
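
As a side note, without NAs these two functions can be seen as the logical counterparts of cumsum() and cumprod() (an equivalence shown here just for intuition):

as.logical(cumsum(x > 2))
## [1] FALSE FALSE  TRUE  TRUE
as.logical(cumprod(x < 3))
## [1]  TRUE  TRUE FALSE FALSE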

- usage with a data frame

Thanks to these behaviors, cumany() and cumall() can be used inside filter() calls to preserve the rows from a particular one onwards, or up to it.

cumany() preserves the rows from the first time a certain condition is TRUE (that row included), regardless of the condition becoming FALSE afterwards.

For example, the rows from the first time a stock code had a price higher than 2.

df %>%
  filter(StockCode == 21232)
## # A tibble: 1843 x 8
df %>%
  filter(StockCode == 21232) %>%
  filter(cumany(Price > 2))
## # A tibble: 1834 x 8

cumall() preserves the rows up to the point where a certain condition stops being TRUE, regardless of it becoming TRUE again; for example, the rows before the price of a stock code was no longer less than 2.

df %>%
  filter(StockCode == 21232) %>%
  filter(cumall(Price < 2))
## # A tibble: 9 x 8

With that in mind, we can't use cumall() with a condition that returns FALSE as its first value, as that will preserve no rows.

df %>%
  filter(StockCode == 21232) %>%
  filter(cumall(Price > 2))
## # A tibble: 0 x 8

When NAs are involved, || (the non-cumulative version of cumany()) returns TRUE when NA is paired with TRUE

NA || TRUE
## [1] TRUE

and NA when it is paired with FALSE.

NA || FALSE
## [1] NA

So, if an NA appears before any TRUE, cumany() will return NA instead of FALSE from that point onwards.

(xNA <- c(1, 2, 3, NA, 2))
## [1]  1  2  3 NA  2
xNA > 3
## [1] FALSE FALSE FALSE    NA FALSE
cumany(xNA > 3)
## [1] FALSE FALSE FALSE    NA    NA

filter() treats rows with NAs like FALSE ones, so with these manipulations that is not a problem; furthermore, cumany() returns TRUE after the first TRUE, so subsequent NAs are inconsequential.

df %>%
  filter(StockCode == "10123G")
## # A tibble: 20 x 8
df %>%
  filter(StockCode == "10123G") %>%
  filter(cumany(`Customer ID` < 17968))
## # A tibble: 17 x 8

&& (the non-cumulative version of cumall()) instead returns NA when NA is paired with TRUE (and FALSE when paired with FALSE),

NA && TRUE
## [1] NA
NA && FALSE
## [1] FALSE

and that could pose a problem for cumall(),

xNA
## [1]  1  2  3 NA  2
xNA > 0
## [1] TRUE TRUE TRUE   NA TRUE
cumall(xNA > 0)
## [1] TRUE TRUE TRUE   NA   NA

as the rows from the first NA onwards will not be preserved by filter().

df %>%
  filter(StockCode == "10123G") %>%
  slice(4:n())
## # A tibble: 17 x 8
df %>%
  filter(StockCode == "10123G") %>%
  slice(4:n()) %>%
  filter(cumall(`Customer ID` < 17920))
## # A tibble: 3 x 8

One solution is to remove the NAs beforehand.

df %>%
  filter(StockCode == "10123G" &
           !is.na(`Customer ID`)) %>%
  slice(3:n())
## # A tibble: 12 x 8
df %>%
  filter(StockCode == "10123G" &
           !is.na(`Customer ID`)) %>%
  slice(3:n()) %>%
  filter(cumall(`Customer ID` < 17920))
## # A tibble: 10 x 8
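
Alternatively (an approach not used above), we could recode the NAs inside the condition itself with coalesce(), explicitly deciding whether an NA should count as TRUE or FALSE:

cumall(coalesce(xNA > 0, TRUE))
## [1] TRUE TRUE TRUE TRUE TRUE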

- with group_by()

With a grouped data frame the results are as expected, with the computation working group-wise.

df %>%
  group_by(`Customer ID`)
## # A tibble: 525461 x 8
## # Groups: Customer ID [4384]
df %>%
  group_by(`Customer ID`) %>%
  filter(cumall(Quantity < 9))
## # A tibble: 27058 x 8
## # Groups: Customer ID [2562]
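
The same goes for cumany(); for example (a hypothetical variation along the same lines), we could keep each customer's rows from their first sizable order onwards:

df %>%
  group_by(`Customer ID`) %>%
  filter(cumany(Quantity >= 100))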