data-masking

- fundamentals

with slice() we can specify one or several row indexes and the function will preserve the relative rows.

df %>%
  slice(5)
A tibble: 1 x 8
df %>%
  slice(5, 6)
A tibble: 2 x 8

If we want to preserve an interval of rows we can use a colon (:).

df %>%
  slice(5:10)
A tibble: 6 x 8

If the interval specified exceeds the total number of rows, slice() will not throw an error.

df %>%
  slice(500000:1000000)
A tibble: 25462 x 8

slice() is equivalent to the following expressions (which must be used in case of relational databases).

df %>%
  filter(row_number() == 5)
A tibble: 1 x 8
df %>%
  filter(row_number() %in% c(5, 6))
A tibble: 2 x 8
df %>%
  filter(row_number() %in% 5:10)
A tibble: 6 x 8
df %>%
  filter(row_number() %in% 500000:1000000)
A tibble: 25462 x 8

slice() is very versatile and can perform several additional actions:

  • remove one or several rows with the use of a minus (-)
df %>%
  slice(-5)
A tibble: 525460 x 8
df %>%
  slice(-5, -6)
A tibble: 525459 x 8
df %>%
  slice(-(5:10))
A tibble: 525455 x 8
  • change the order of the output
df %>%
  slice(10:5)
A tibble: 6 x 8

or duplicate rows,

df %>%
  slice(7, 7, 7, 7)
A tibble: 4 x 8

even with complex expressions,

df %>%
  slice(rep(c(1, 2, 3), c(1, 2, 4)))
A tibble: 7 x 8

as long as all the input indexes have all the same sign though.

df %>%
  slice(7, 7, 7, -7)
Error in `slice()`:
! Can't compute indices.
Caused by error:
! Can't subset elements with `7`.
✖ Negative and positive locations can't be mixed.
ℹ Subscript `7` has 3 positive values at locations 1, 2, and 3.
  • by using n() (or nrow(df), it is indifferent here), we can preserve or remove the last row or a number of rows from the bottom.
df %>%
  slice(n())
A tibble: 1 x 8
df %>%
  slice(-n())
A tibble: 525460 x 8
df %>%
  slice(525000:n())
A tibble: 462 x 8
df %>%
  slice(-(525000:n()))
A tibble: 524999 x 8
df %>% 
  slice(1:(n() - 462))
A tibble: 524999 x 8

As we’ve seen in all of these examples, if a row index is present in the expression, that row index is preserved (or removed when we used a -).

df %>%
  slice(5:10)
A tibble: 6 x 8

We can think of it as if slice() uses >= / <= like between().

df %>%
  filter(between(row_number(), 5, 10))
A tibble: 6 x 8

Furthermore, as slice() is a data-masking function, it is possible to provide the indexes with calculations or with the output of a function, like we did in some of the previous examples.

df %>%
  slice(c(2, 4, 6, 8) / 2)
A tibble: 4 x 8
df %>%
  slice(seq(1, 8, by = 2))
A tibble: 4 x 8

With no argument, slice() returns an empty data frame.

df %>%
  slice()
A tibble: 0 x 8

slice() has many helpers that can streamline the code in particular cases:

- slice_head() & slice_tail()

slice_head() and slice_tail() respectively preserve the first and last n rows. n must be an integer.

df %>%
  slice_head(n = 5)
A tibble: 5 x 8
df %>%
  slice_tail(n = 5)
A tibble: 5 x 8

Keep in mind that n must always be explicit with the helpers.

df %>%
  slice_head(5)
Error in `slice_head()`:
! `n` must be explicitly named.
ℹ Did you mean `slice_head(n = 5)`?

While slice() would throw an error if we used it.

df %>%
  slice(n = 5)
Error in `slice()`:
! Arguments in `...` must be passed by position, not name.
✖ Problematic argument:
• n = 5

The default is n = 1.

df %>%
  slice_head()
A tibble: 1 x 8

The example with slice_tail() can be rewritten like this, but it is surely more convoluted, so the presence of this helper is appreciated.

n <- 5
df %>%
  slice((n() - (n - 1)):n())
A tibble: 5 x 8

A similar expression, not possible with slice_tail(), could be used though to select a range of row indexes n to n + n counting from the bottom.

df %>% 
  slice(((n() - (n - 1)) - n):(n() - (n - 1))) 
A tibble: 6 x 8

With slice_head() and slice_tail() we can also use the prop argument (that needs to be explicit, like n) to preserve a user defined proportion of rows.

df %>%
  slice_head(prop = 0.1)
A tibble: 52546 x 8
df %>%
  slice_tail(prop = 0.1)
A tibble: 52546 x 8

The number of rows is rounded down in case of decimals in the output.

df %>%
  slice_head(prop = 0.6)
A tibble: 315276 x 8

We can use a minus (-), like we do in slice(), but the result is counter intuitive in my opinion as slice_head() removes the rows at the bottom and slice_tail() the ones at the top.

df %>%
  slice_head(n = -5)
A tibble: 525456 x 8
df %>%
  slice_tail(n = -5)
A tibble: 525456 x 8
df %>%
  slice_head(prop = -0.1)
A tibble: 472915 x 8
df %>%
  slice_tail(prop = -0.1)
A tibble: 472915 x 8

- slice_sample()

slice_sample() samples a number or a percentage of rows. Like with the previous helpers, n and prop must always be explicit, n an integer and the number of rows is rounded down if the output contains decimals when using prop.

df %>%
  slice_sample(n = 5)
A tibble: 5 x 8
df %>%
  slice_sample(prop = 0.1)
A tibble: 52546 x 8

The default is again n = 1.

df %>%
  slice_sample()
A tibble: 1 x 8

In case we want to remove random rows, we can provide negative integers to n and prop.

df %>%
  slice_sample(n = -5)
A tibble: 525456 x 8
df %>%
  slice_sample(prop = -0.1)
A tibble: 472915 x 8

- replace

The replace argument controls whether the sampling occurs without (the default FALSE) or with replacements (which means that a row can be picked more than once).

df %>%
  slice(1:5)
A tibble: 5 x 8
df %>%
  slice(1:5) %>%
  slice_sample(n = 5, replace = TRUE)
A tibble: 5 x 8

- weight_by

data-masking

If we want to give more representation to a certain segment of the data, we can use the weight_by argument, which increases the probability of preserving rows with higher values in the column specified in it.

In this example we weight by the Price column, so expensive items will be more present in the output compared to a non weighted sample. We can check it with the Price_Rank column, whose lower values indicates a more expensive item.

df %>%
  filter(Price >= 0) %>%
  mutate(Price_Rank = dense_rank(desc(Price))) %>%
  slice_sample(n = 10, weight_by = Price)
A tibble: 10 x 9
df %>%
  filter(Price >= 0) %>%
  mutate(Price_Rank = dense_rank(desc(Price))) %>%
  slice_sample(n = 10)
A tibble: 10 x 9

Notice the filter() call on the column we provide to weight_by, as that must be non-negative. We can also use an external vector, as long as it is of the same size as the number of rows of the data frame fed into slice_sample().

We can’t in fact use inline subsetting on weight_by because it will produce a different one.

df %>%
  mutate(Price_Rank = dense_rank(desc(Price))) %>%
  slice_sample(n = 10, weight_by = Price[Price > 0])
Error in `slice_sample()`:
! Can't compute indices.
Caused by error:
! `weight_by` must have size 525461, not size 521771.

We can use expressions inside weight_by.

df %>%
  filter(Price >= 0) %>%
  mutate(Price_Rank = dense_rank(desc(Price))) %>%
  slice_sample(n = 10, weight_by = Price / 2)
A tibble: 10 x 9

- slice_max() & slice_min()

Last helpers, slice_max() and slice_min() preserve a number of rows with the maximum or minimum values of the column specified in the order_by argument, sorting the output as well.

df %>%
  slice_max(order_by = Quantity, n = 10)
A tibble: 10 x 8
df %>%
  slice_min(order_by = Quantity, n = 10)
A tibble: 10 x 8

Being a data-masking argument, order_by accepts expressions,

df %>%
  slice_max(order_by = Quantity / 2, n = 10)
A tibble: 10 x 8

and can be implicit.

df %>%
  slice_max(Quantity, n = 10)
A tibble: 10 x 8

But n cannot,

df %>%
  slice_max(Quantity, 10)
Error in `slice_max()`:
! `n` must be explicitly named.
ℹ Did you mean `slice_max(n = 10)`?

and, as with other helpers, its default is 1 and it accepts only integer values.

df %>%
  slice_max(Quantity)
A tibble: 1 x 8

In the presence of ties all of the rows are preserved and we assume, as with arrange(), that the rows are ordered by their original row indexes.

df %>%
  slice_min(Quantity, n = 4)
A tibble: 5 x 8

To break the ties we can either specify two columns, that must be in a data frame or tibble format, so for example we can use pick() to do that,

df %>%
  slice_min(pick(Quantity, StockCode), n = 4)
A tibble: 4 x 8

- with_ties

or we can set the with_ties argument to FALSE, so the function will remove the rows with an output index superior to n.

df %>%
  slice_min(Quantity, n = 4, with_ties = FALSE)
A tibble: 4 x 8

We remind that, within ties, the rows of the output are ordered by their original row index so if we are not satisfied by which row gets removed we can use arrange() before slicing, to preserve invoice 530348 instead.

df %>%
  arrange(desc(InvoiceDate)) %>%
  slice_min(Quantity, n = 4, with_ties = FALSE)
A tibble: 4 x 8

NAs are included if they are necessary to reach the number specified in n or prop.

df %>%
  filter(StockCode == "90081B")
A tibble: 3 x 8
df %>%
  filter(StockCode == "90081B") %>%
  slice_max(`Customer ID`, n = 3)
A tibble: 3 x 8

- na_rm

With na_rm = TRUE we can exclude them, it might mean returning less rows than specified in n though.

df %>%
  filter(StockCode == "90081B") %>%
  slice_max(`Customer ID`, n = 3, na_rm = TRUE)
A tibble: 2 x 8

We can obtain the same results provided by slice_min() or slice_max() with a combination of arrange() plus slice().

df %>%
  slice_min(Quantity, n = 10)
A tibble: 10 x 8
df %>%
  arrange(Quantity) %>%
  slice(1:10)
A tibble: 10 x 8

Or by using filter() with a ranking function. The order will be different here as filter() preserves the original row index’s one.

df %>%
  filter(row_number(Quantity) <= 10)
A tibble: 10 x 8

If we wrap the variable with desc(), we can invert the behavior.

df %>%
  slice_max(desc(Quantity), n = 10)
A tibble: 10 x 8
df %>%
  slice_min(desc(Quantity), n = 10)
A tibble: 10 x 8

As with the other helpers, a proportion of rows can be used as well and they are rounded down in case of decimals.

df %>%
  slice_max(Quantity, prop = 0.1)
A tibble: 55395 x 8
df %>%
  slice_min(Quantity, prop = 0.1)
A tibble: 158444 x 8

Just like n, prop must be explicit.

df %>%
  slice_min(Quantity, 0.1)
Error in `slice_min()`:
! `n` must be explicitly named.
ℹ Did you mean `slice_min(n = 0.1)`?

slice_max() and slice_min() can also be used with other classes of columns:

  • with character ones we will rank by the alphabet, as it is defined in your locale.
df %>%
  slice_min(Description, n = 10)
A tibble: 302 x 8
  • with factors by the order of the levels.
df %>%
  mutate(Description_Factor = factor(Description, levels = sample(unique(df$Description), 
                                                                  length(unique(df$Description))), ordered = TRUE)) %>%
  slice_min(Description_Factor, n = 1000)
A tibble: 1033 x 9
  • with datetime objects it depends on the class, with for example POSIXct will be by the number of seconds since 01/01/1970 (in the UTC time zone) as a numeric vector. This is quite resource heavy so I computed on only the first 100 rows.
df %>%
  slice(1:100) %>%
  slice_min(InvoiceDate, n = 10)
A tibble: 12 x 8

A minus before n doesn’t seem to affect the output.

df %>%
  slice(21:30)
A tibble: 10 x 8
df %>%
  slice(21:30) %>%
  slice_min(Quantity, n = 5)
A tibble: 6 x 8
df %>%
  slice(21:30) %>%  
  slice_min(Quantity, n = -5)
A tibble: 6 x 8

- with group_by() / .by

slice() and its helpers work as expected with grouped data frames, preserving as many specified rows indexes, n or prop as there are groups. It is important to notice that they rearrange the output by the grouping column, like summarise() and unlike filter().

df %>%
  group_by(Country) %>%
  slice(2)
A tibble: 40 x 8
Groups: Country [40]
df %>%
  group_by(Country) %>%
  slice_head(n = 5)
A tibble: 200 x 8
Groups: Country [40]

As with the ungrouped case, when we use a proportion the number of rows we preserve is rounded down.

df %>%
  group_by(Country) %>%
  slice(1:5)
A tibble: 200 x 8
Groups: Country [40]
df %>%
  group_by(Country) %>%
  slice(1:5) %>%
  slice_head(prop = 0.5)
A tibble: 80 x 8
Groups: Country [40]

And as already seen at the beginning, slice() doesn’t throw an error if the interval specified crosses over the number of rows.

df %>%
  group_by(Country) %>%
  slice(11:15)
A tibble: 198 x 8
Groups: Country [40]

- .preserve

slice() (but not its helpers) has a .preserve argument that, when set to TRUE, doesn’t recalculate the number of groups when the output doesn’t have any rows from a particular one (Lebanon here).

df %>%
  group_by(Country) %>%
  slice(15) %>%
  n_groups()
## [1] 39
df %>%
  group_by(Country) %>%
  slice(15, .preserve = TRUE) %>%
  n_groups()
## [1] 40

This is useful if further operations need the original number of groups (notice how Lebanon is present in the output even if it has no rows).

df %>%
  group_by(Country) %>%
  slice(15, .preserve = TRUE) %>%
  summarise(n = n())
A tibble: 40 x 2

Besided group_by(), we can also use the .by argument to use slice() on grouped data frames.

df %>%
  group_by(Country) %>%
  slice(2)
A tibble: 40 x 8
Groups: Country [40]
df %>%
  slice(2, .by = Country)
A tibble: 40 x 8

You can see that using .by the output maintains the original row order plus the data frame is returned ungrouped.

Another difference is that .by uses a tidy-select syntax, to keep in mind if we want to select more than one column.

df %>%
  slice(2, .by = c("Country", "Customer ID"))
A tibble: 4266 x 8

That is an useful feature to employ selection helpers.

df %>%
  slice(2, .by = starts_with("C"))
A tibble: 4266 x 8

The helpers can use this argument as well, but with them it is called by.

df %>%
  slice_head(n = 5, by = Country)
A tibble: 200 x 8