data-masking

- fundamentals

When there are rows that are repeated,

df %>%
  filter(Invoice == "496431" & 
           StockCode == "84826")

A tibble: 6 x 8

distinct() removes the ones after the first.

df %>%
  filter(Invoice == "496431" & 
           StockCode == "84826") %>%
  distinct()

A tibble: 1 x 8

So it is useful in data wrangling when we want to remove duplicated rows from a data frame.

df %>%
  distinct()

A tibble: 518596 x 8

It is equivalent to these lines of code,

df %>%
  group_by(pick(everything())) %>%
  summarise(n = n()) %>%
  select(-n) %>%
  ungroup()

`summarise()` has grouped output by 'Invoice', 'StockCode',
'Description', 'Quantity', 'InvoiceDate', 'Price', 'Customer ID'. You
can override using the `.groups` argument.

A tibble: 518596 x 8

df %>%
  group_by(pick(everything())) %>%
  slice(1) %>%
  ungroup()

A tibble: 518596 x 8

but it returns the output much faster.

If we are interested in returning the duplicated rows instead, we can use this procedure with count() and filter().

df %>%
  count(pick(everything()), name = "# of repetitions") %>%
  filter(`# of repetitions` > 1)

A tibble: 6418 x 9

We can also use it on a subset of columns and it will return the unique values, in case of one, and the existing combinations of their values with more than one.

df %>%
  distinct(Country)

A tibble: 40 x 1

df %>%
  distinct(Country, `Customer ID`)

A tibble: 4401 x 2

- .keep_all

With .keep_all = TRUE, we keep all other columns as well in the output.

df %>%
  distinct(Country, .keep_all = TRUE)

A tibble: 40 x 8

df %>%
  distinct(Country, `Customer ID`, .keep_all = TRUE)

A tibble: 4401 x 8

If we don’t specify columns, the outputs are equal.

df %>%
  distinct()

A tibble: 518596 x 8

df %>%
  distinct(.keep_all = TRUE)

A tibble: 518596 x 8

distinct() is very similar, beside the added count and the row order of the output, to what count() does.

df %>%
  count(Country)

A tibble: 40 x 2

df %>%
  count(Country, `Customer ID`)

A tibble: 4401 x 3

NAs are treated as one value.

df %>%
  distinct(`Customer ID`) %>%
  arrange(!is.na(`Customer ID`))

A tibble: 4384 x 1

So with more than one column they form a combination with the other values.

df %>%
  mutate(Country = na_if(Country, "Unspecified")) %>%
  distinct(Country, `Customer ID`) %>%
  arrange(!is.na(Country),!is.na(`Customer ID`))

A tibble: 4401 x 2

We can use pick() to simplify the selection of more than one column.

df %>%
  distinct(pick(starts_with("I")))

A tibble: 28857 x 2

It being a data-masking function, we can also use expressions.

df %>%
  distinct(Invoice_Day = as.Date(InvoiceDate))

A tibble: 307 x 1

distinct() only works with data frames (a single column one included) but not with vectors.

df %>%
  pull(Country) %>%
  distinct()

## Error in UseMethod("distinct"): no applicable method for 'distinct' applied to an object of class "character"

distinct(df$Country)

## Error in UseMethod("distinct"): no applicable method for 'distinct' applied to an object of class "character"

So in cases like these we have to rely on unique() from base R.

df %>%
  pull(Country) %>%
  unique()

##  [1] "United Kingdom"       "France"               "USA"                 
##  [4] "Belgium"              "Australia"            "EIRE"                
##  [7] "Germany"              "Portugal"             "Japan"               
## [10] "Denmark"              "Nigeria"              "Netherlands"         
## [13] "Poland"               "Spain"                "Channel Islands"     
## [16] "Italy"                "Cyprus"               "Greece"              
## [19] "Norway"               "Austria"              "Sweden"              
## [22] "United Arab Emirates" "Finland"              "Switzerland"         
## [25] "Unspecified"          "Malta"                "Bahrain"             
## [28] "RSA"                  "Bermuda"              "Hong Kong"           
## [31] "Singapore"            "Thailand"             "Israel"              
## [34] "Lithuania"            "West Indies"          "Lebanon"             
## [37] "Korea"                "Brazil"               "Canada"              
## [40] "Iceland"

- with group_by()

With a grouped data frame, the grouping column is processed as well, as if it was specified first. The rows order is kept from df and not changed following the grouping columns.

df %>%
  group_by(Country) %>%
  distinct(`Customer ID`)

A tibble: 4401 x 2

Groups: Country [40]