data-masking

- fundamentals

count() is a wrapper for group_by() %>% summarise(n = n()) and, like its counterpart, it returns the size, in terms of number of rows, for every group of the data frame.

df %>%
  count(Country)
A tibble: 40 x 2
df %>%
  group_by(Country) %>%
  summarise(n = n())
A tibble: 40 x 2

As such, it can also be seen as a wrapper for group_by() %>% tally() (as tally() is a wrapper for summarise(n = n())).

df %>%
  group_by(Country) %>%
  tally()
A tibble: 40 x 2

But they differ when there is more than one grouping column, as tally(), like summarise(), removes only the most recent grouping,

df %>%
  group_by(Country, `Customer ID`) %>%
  tally() %>%
  group_vars()
## [1] "Country"
df %>%
  group_by(Country, `Customer ID`) %>%
  summarise(n = n()) %>%
  group_vars()
## `summarise()` has grouped output by 'Country'. You can override using
## the `.groups` argument.
## [1] "Country"

while count() ungroups the output even with several columns, saving us from piping an additional ungroup().

df %>%
  count(Country, `Customer ID`) %>%
  group_vars()
## character(0)

When we specify more columns, count() returns the number of rows for all the unique existing combinations between their values.

df %>%
  count(Country, `Customer ID`)
A tibble: 4401 x 3

Like if we specified two columns in group_by().

df %>%
  group_by(Country, `Customer ID`) %>%
  summarise(n = n())
`summarise()` has grouped output by 'Country'. You can override using
the `.groups` argument.
A tibble: 4401 x 3
Groups: Country [40]

As with group_by() %>% summarise(n = n()), NAs are counted as one value.

df %>%
  count(`Customer ID`, sort = TRUE)
A tibble: 4384 x 2
df %>%
  group_by(`Customer ID`) %>% 
  summarise(n = n()) %>%
  arrange(desc(n))
A tibble: 4384 x 2

So with more than one column we can have every existing combination of NAs with other values.

df %>%
  count(`Customer ID`, Description, sort = TRUE)
A tibble: 285350 x 3

Like if we wrote

df %>%
  group_by(`Customer ID`, Description) %>%
  summarise(n = n()) %>%
  arrange(desc(n))
`summarise()` has grouped output by 'Customer ID'. You can override
using the `.groups` argument.
A tibble: 285350 x 3
Groups: Customer ID [4384]

Being a data-masking function, we can use it with expressions.

df %>%
  count(CustomerID = `Customer ID`)
A tibble: 4384 x 2
df %>%
  count(Price_Rank = dense_rank(Price))
A tibble: 1606 x 2

This is equivalent to using those expressions in the group_by() calls.

df %>%
  group_by(CustomerID = `Customer ID`) %>%
  summarise(n = n())
A tibble: 4384 x 2
df %>%
  group_by(Price_Rank = dense_rank(Price)) %>%
  summarise(n = n())
A tibble: 1606 x 2

- sort

The sort argument can be employed if we want the rows with the largest n on top, without having to pipe an arrange() call afterwards.

df %>%
  count(Country, sort = TRUE)
A tibble: 40 x 2
df %>% 
  count(Country, `Customer ID`, sort = TRUE)
A tibble: 4401 x 3

- wt

data-masking

Another useful argument is wt (shorthand for “weight”), which allows us to rank one column by the values of another, like here where we return for each country the summation of their values in the Quantity column and not just how many rows they have.

df %>%
  count(Country, wt = Quantity)
A tibble: 40 x 2

Internally it modifies the summarise() call from n() to sum(wt).

df %>%
  group_by(Country) %>%
  summarise(Total_Quantity = sum(Quantity))
A tibble: 40 x 2

So, instead of counting each row as 1, it counts each row by the values of the column used in wt.

Using columns with values that can’t be summed results with an error.

df %>%
  count(Country, wt = Description)
Error in `count()`:
ℹ In argument: `n = sum(Description, na.rm = TRUE)`.
ℹ In group 1: `Country = "Australia"`.
Caused by error in `sum()`:
! invalid 'type' (character) of argument

If our data has already been aggregated,

df %>%
  count(Country, `Customer ID`)
A tibble: 4401 x 3

wt can be employed to retrieve the original number of rows pertaining to a group.

df %>%
  count(Country, `Customer ID`) %>%
  count(Country, wt = n)
A tibble: 40 x 2

wt also offers the possibility of being used with a function.

df %>%
  count(Country, wt = range(Quantity))
A tibble: 40 x 2

That example translates to this,

df %>%
  group_by(Country) %>%
  summarise(n = sum(range(Quantity)))
A tibble: 40 x 2

where it’s easier to understand that, for every value of the Country column, we sum the two values returned by range(Quantity).

df %>%
  group_by(Country) %>%
  reframe(n = range(Quantity))
A tibble: 80 x 2

We can also use functions that only return one value per group,

df %>%
  count(Country, wt = n_distinct(Invoice))
A tibble: 40 x 2

but in this case the summation is inconsequential as there is only one addend

df %>%
  group_by(Country) %>%
  summarise(n = sum(n_distinct(Invoice)))
A tibble: 40 x 2
df %>%
  group_by(Country) %>%
  summarise(n = n_distinct(Invoice))
A tibble: 40 x 2

and it is mainly just a way to use another function beside n() in the, wrapped, summarise() call of count().

Using a function that doesn’t output a number clearly throws an error.

df %>%
  count(Country, wt = stringr:::str_replace(`Customer ID`, "2", "3"))
Error in `count()`:
ℹ In argument: `n = sum(stringr:::str_replace(`Customer ID`,
  "2", "3"), na.rm = TRUE)`.
ℹ In group 1: `Country = "Australia"`.
Caused by error in `sum()`:
! invalid 'type' (character) of argument

wt can be used with more than one column as well.

df %>%
  count(Country, wt = c(Quantity, Price))
A tibble: 40 x 2

And it would be like summing together, for every value of the grouping column, every value of the columns specified.

df %>%
  group_by(Country) %>%
  summarise(n = sum(Quantity, Price))
A tibble: 40 x 2

Another example, here we return 1 (TRUE) for the countries that have more than 15 rows and 0 (FALSE) for the ones that don’t (only Lebanon here).

df %>%
  count(Country, wt = n() > 15)
A tibble: 40 x 2

As in one of the previous examples, the summation is inconsequential, apart from transforming logical values into integers.

df %>%
  group_by(Country) %>%
  summarise(n = sum(n() > 15))
A tibble: 40 x 2
df %>%
  group_by(Country) %>%
  summarise(n = n() > 15)
A tibble: 40 x 2

- name

The name argument, which modifies the name of the count() output column from the default n,

df %>%
  count(Country, name = "N_Rows")
A tibble: 40 x 2

comes in handy with wt to have more apt denominations, like we can do with summarise().

df %>%
  count(Country, wt = Quantity, name = "Total_Quantity")
A tibble: 40 x 2
df %>%
  group_by(Country) %>%
  summarize(Total_Quantity = sum(Quantity))
A tibble: 40 x 2

Using more than one column is not a problem.

df %>%
  count(Country, `Customer ID`, wt = n_distinct(Invoice), name = "N_Invoices")
A tibble: 4401 x 3

- .drop

The last argument is .drop, which, when we set to FALSE (default is TRUE), has the same utility as in group_by(), alas it doesn’t drop the levels of a factor when their count is equal to 0 (the Unspecified country in this example, that has 0 rows after we modify its name to NA if at the same time we retain it as a level).

df %>%
  mutate(Country = na_if(Country, "Unspecified"),
         Country = factor(Country, levels = unique(df$Country))) %>%
  count(Country, .drop = FALSE)
A tibble: 41 x 2
df %>%
  mutate(Country = na_if(Country, "Unspecified"),
         Country = factor(Country, levels = unique(df$Country))) %>%
  count(Country)
A tibble: 40 x 2

.drop = FALSE will keep the empty levels also when they have already been removed by a previous manipulation (filter(n() > 15) here, which removes the Lebanon group).

df %>%
  group_by(Country = factor(Country)) %>%
  filter(n() > 15) %>%
  count(Country, .drop = FALSE)
A tibble: 40 x 2
Groups: Country [40]
df %>%
  group_by(Country = factor(Country)) %>%
  filter(n() > 15) %>%
  count(Country)
A tibble: 39 x 2
Groups: Country [39]

As with group_by(), we must be careful when using several factor columns as it returns combinations not present in the data frame, like every customer located in every country.

df %>%
  count(factor(Country), factor(`Customer ID`), .drop = FALSE)
A tibble: 175333 x 3
df %>%
  count(factor(Country), factor(`Customer ID`))
A tibble: 4401 x 3

- repetitions spotter

As already seen, with more than one column count() returns all the unique existing combinations between their values, counting with n the number of rows pertaining to each group.

df %>%
  count(Invoice, StockCode)
A tibble: 512126 x 3

If n is higher than 1, like for invoice 489488 with stock code 22125,

df %>%
  count(Invoice, StockCode) %>%
  filter(Invoice == "489488")
A tibble: 29 x 3

it means that there are several rows with the same values in, at least, those two columns.

df %>%
  filter(Invoice == "489488" &
           StockCode == "22125")
A tibble: 2 x 8

Filtering for n > 1 can then be used as a way to spot repetitions in the columns specified in count(), which can be errors if those two columns should have had a single unique combination.

df %>%
  count(Invoice, StockCode) %>%
  filter(n > 1)
A tibble: 12064 x 3

If desired, we can extend this procedure to all the columns to spot duplicated rows.

df %>%
  count(pick(everything()), name = "# of repetitions") %>%
  filter(`# of repetitions` > 1)
A tibble: 6418 x 9

If instead we are interested in spotting the non duplicated rows, we can just use distinct().

df %>%
  distinct()
A tibble: 518596 x 8

- number of unique combinations

The procedure can be extended if we want to know how many unique existing combinations of two columns there are for each value of one of them. In the following example the one column is Invoice and we are looking for the unique existing combinations with StockCode for every value of it.

df %>%
  count(Invoice, StockCode) %>%
  count(Invoice)
A tibble: 28816 x 2

So for instance for invoice 489488 (the invoice from a previous example) there are 29 unique combinations with the StockCode column.

df %>%
  count(Invoice, StockCode) %>%
  count(Invoice) %>%
  filter(Invoice == "489488")
A tibble: 1 x 2

That obviously amounts to the number of rows from the first count() call.

df %>%
  count(Invoice, StockCode) %>%
  filter(Invoice == "489488")
A tibble: 29 x 3

If we add a third column as well (to look for the unique existing combinations of three columns for every value of one of them), Price for example, n stays the same,

df %>%
  count(Invoice, StockCode, Price) %>%
  count(Invoice) %>%
  filter(Invoice == "489488")
A tibble: 1 x 2

as Price doesn’t vary for any stock code, so the number of combinations is constant.

df %>%
  count(Invoice, StockCode, Price) %>%
  filter(Invoice == "489488") %>%
  arrange(StockCode)
A tibble: 29 x 4

This can more easily checked with a grouped filter() on the number of rows per stock code: if no stock code has more than one row per invoice, it means that only one value of Price is assigned to it.

df %>%
  count(Invoice, StockCode, Price) %>%
  filter(Invoice == "489488") %>%
  group_by(StockCode) %>%
  filter(n() > 1)
A tibble: 0 x 4
Groups: StockCode [0]

If we use Quantityinstead of Price though, n increases of one unit,

df %>%
  count(Invoice, StockCode, Quantity) %>%
  count(Invoice) %>%
  filter(Invoice == "489488")
A tibble: 1 x 2

because that column changes values between two rows for stock code 22125, as we’ve seen in a previous example, resulting therefore in an additional combination,

df %>%
  count(Invoice, StockCode, Quantity) %>%
  filter(Invoice == "489488") %>%
  arrange(StockCode)
A tibble: 30 x 4

that, as before, can be more easily spotted when we group filter for n() > 1 on stock codes.

df %>%
  count(Invoice, StockCode, Quantity) %>%
  filter(Invoice == "489488") %>%
  group_by(StockCode) %>%
  filter(n() > 1)
A tibble: 2 x 4
Groups: StockCode [1]

If we wish to generalize the grouped filter() call, we need to add the Invoice column as a grouping one,

df %>%
  count(Invoice, StockCode, Quantity) %>%
  group_by(Invoice, StockCode) %>%
  filter(n() > 1)
A tibble: 12493 x 4
Groups: Invoice, StockCode [6138]

otherwise we would just filter on stock codes regardless of the invoice they pertain to.

df %>%
  count(Invoice, StockCode, Quantity) %>%
  group_by(StockCode) %>%
  filter(n() > 1) %>%
  arrange(StockCode)
A tibble: 518047 x 4
Groups: StockCode [4197]

Instead of piping two count()s in succession we can also use the wt argument with n_distinct() to attain the same results.

df %>%
  count(Invoice, StockCode) %>%
  count(Invoice)
A tibble: 28816 x 2
df %>%
  count(Invoice, wt = n_distinct(StockCode))
A tibble: 28816 x 2

That translates to

df %>%
  group_by(Invoice) %>%
  summarise(n = n_distinct(StockCode))
A tibble: 28816 x 2

We can as well use more than one columns with n_distinct(), always to return the number of all the unique existing combinations of three columns for every value of one of them.

df %>%
  count(Invoice, StockCode, Quantity) %>%
  count(Invoice) %>%
  filter(Invoice == "489488")
A tibble: 1 x 2
df %>%
  count(Invoice, wt = n_distinct(StockCode, Quantity)) %>%
  filter(Invoice == "489488")
A tibble: 1 x 2

So we need to use the one column in count() and the other two in n_distinct().

Without count() we would otherwise write like this.

df %>%
  group_by(Invoice) %>%
  summarise(n = n_distinct(StockCode, Quantity)) %>%
  filter(Invoice == "489488")
A tibble: 1 x 2

But if we instead write like this, with only one column in n_distinct(),

df %>%
  count(Invoice, StockCode, wt = n_distinct(Quantity)) %>% 
  filter(Invoice == "489488")
A tibble: 29 x 3

that translates to this,

df %>%
  group_by(Invoice, StockCode) %>%
  summarise(n = n_distinct(Quantity)) %>%
  filter(Invoice == "489488")
`summarise()` has grouped output by 'Invoice'. You can override using
the `.groups` argument.
A tibble: 29 x 3
Groups: Invoice [1]

we would obtain the number of unique values of the Quantity column for every unique existing combination of Invoice and StockCode (something very different from before then, as shown by the output).

As before, a subsequent grouped filter on n can be useful to more easily pinpoint the stock codes with different quantities in the same invoice.

df %>%
  count(Invoice, StockCode, wt = n_distinct(Quantity)) %>% 
  group_by(Invoice, StockCode) %>%
  filter(n > 1)
A tibble: 6138 x 3
Groups: Invoice, StockCode [6138]

- with group_by()

When used on a grouped data frame, count() appends the groups created by its argument, performs the count and then ungroups while retaining the original grouping.

df %>%
  group_by(Invoice) %>%
  count(StockCode, Quantity)
A tibble: 518481 x 4
Groups: Invoice [28816]

It is equivalent to

df %>%
  group_by(Invoice) %>%
  group_by(StockCode, Quantity, .add = TRUE) %>%
  summarise(n = n()) %>%
  ungroup(StockCode, Quantity)
`summarise()` has grouped output by 'Invoice', 'StockCode'. You can
override using the `.groups` argument.
A tibble: 518481 x 4
Groups: Invoice [28816]

If we add the column from group_by() to count(), as a way to modify the columns’ order of the output, it is important to notice that count() doesn’t overwrite it, as the column from group_by() has the precedence, being the group key of the output.

df %>%
  group_by(Invoice) %>%
  count(StockCode, Invoice, Quantity)
A tibble: 518481 x 4
Groups: Invoice [28816]

So we have to resort to pipe an additional select() call.

df %>%
  group_by(Invoice) %>%
  count(StockCode, Quantity) %>%
  select(StockCode, Invoice, Quantity, n)
A tibble: 518481 x 4
Groups: Invoice [28816]

With sort we can rearrange the rows though, as it overrides the group keys from group_by().

df %>%
  group_by(Invoice) %>%
  count(StockCode, Quantity, sort = TRUE)
A tibble: 518481 x 4
Groups: Invoice [28816]

- add_count()

data-masking

In case we don’t want to summarize our data frame, we can use add_count() which is equivalent to group_by() %>% mutate(n = n()).

df %>%
  add_count(Country)
A tibble: 525461 x 9
df %>%
  group_by(Country) %>% 
  mutate(n = n())
A tibble: 525461 x 9
Groups: Country [40]

Like count(), it ungroups the output also with several columns,

df %>%
  add_count(Country,`Customer ID`)
A tibble: 525461 x 9

while mutate() keeps the grouping as is.

df %>%
  group_by(Country, `Customer ID`) %>% 
  mutate(n = n())
A tibble: 525461 x 9
Groups: Country, Customer ID [4401]

add_count() can be used with expressions

df %>%
  add_count(CustomerID = `Customer ID`)
A tibble: 525461 x 10
df %>%
  add_count(Price_Rank = dense_rank(Price))
A tibble: 525461 x 10

and is consistent with the optional arguments of count(),

df %>%
  add_count(Country, wt = Quantity, name = "Total_Quantity", sort = TRUE)
A tibble: 525461 x 9

except for .drop, deprecated, as it can’t modify the output.

df %>%
  mutate(Country = na_if(Country, "Unspecified"),
         Country = factor(Country, levels = unique(df$Country))) %>%
  add_count(Country, .drop = FALSE)
Warning: The `.drop` argument of `add_count()` is deprecated as of dplyr
1.0.0.
ℹ The deprecated feature was likely used in the dplyr package.
  Please report the issue at
  <https://github.com/tidyverse/dplyr/issues>.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning
was generated.
A tibble: 525461 x 9
df %>%
  mutate(Country = na_if(Country, "Unspecified"),
         Country = factor(Country, levels = unique(df$Country))) %>%
  add_count(Country)
A tibble: 525461 x 9

For instance in the following example it can’t reconstruct previously filtered rows.

df %>%
  group_by(Country = factor(Country)) %>%
  filter(n() > 15) %>%
  add_count(Country, .drop = FALSE)
A tibble: 525448 x 9
Groups: Country [39]

- with group_by()

add_count() on a grouped data frame has the same behavior as count() (appending the groups created by its argument, performing the count and then ungrouping while keeping the original grouping).

df %>%
  group_by(Invoice) %>%
  add_count(StockCode, Quantity)
A tibble: 525461 x 9
Groups: Invoice [28816]

Similarly as before, it is equivalent to this.

df %>%
  group_by(Invoice) %>%
  group_by(StockCode, Quantity, .add = TRUE) %>%
  mutate(n = n()) %>%
  ungroup(StockCode, Quantity)
A tibble: 525461 x 9
Groups: Invoice [28816]

Likewise it can’t modify the columns’ order of the output,

df %>%
  group_by(Invoice) %>%
  add_count(StockCode, Invoice, Quantity)
A tibble: 525461 x 9
Groups: Invoice [28816]

but sort can modify the rows’ order.

df %>%
  group_by(Invoice) %>%
  add_count(StockCode, Quantity, sort = TRUE)
A tibble: 525461 x 9
Groups: Invoice [28816]