data-masking
When there are repeated rows,
df %>%
  filter(Invoice == "496431" &
           StockCode == "84826")
distinct() removes all but the first occurrence.
df %>%
  filter(Invoice == "496431" &
           StockCode == "84826") %>%
  distinct()
So it is useful in data wrangling when we want to remove duplicated rows from a data frame.
df %>%
  distinct()
It is equivalent to these lines of code,
df %>%
  group_by(pick(everything())) %>%
  summarise(n = n()) %>%
  select(-n) %>%
  ungroup()
`summarise()` has grouped output by 'Invoice', 'StockCode',
'Description', 'Quantity', 'InvoiceDate', 'Price', 'Customer ID'. You
can override using the `.groups` argument.
df %>%
  group_by(pick(everything())) %>%
  slice(1) %>%
  ungroup()
but distinct() returns the output much faster.
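To check the equivalence on something small, here is a minimal sketch with a made-up tibble (the name toy and its values are illustrative, not taken from df). It compares distinct() with the group_by() plus slice() version as sets of rows, because the grouped version appears to return rows ordered by the grouping keys, while distinct() keeps the order of first appearance.

```r
library(dplyr)

# Hypothetical toy data: the row ("b", 2) appears twice
toy <- tibble(
  id    = c("b", "a", "b", "c"),
  value = c(2, 1, 2, 3)
)

# distinct() keeps the first occurrence of each row
via_distinct <- toy %>%
  distinct()

# Grouping by every column and keeping one row per group
via_slice <- toy %>%
  group_by(pick(everything())) %>%
  slice(1) %>%
  ungroup()

# Compare the two results as sets of rows, ignoring row order
setequal(via_distinct, via_slice)
```

Comparing with setequal() rather than identical() is a deliberate choice here, since only the row order is expected to differ between the two results.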
If we are interested in returning the duplicated rows instead, we can
combine count() and filter().
df %>%
  count(pick(everything()), name = "# of repetitions") %>%
  filter(`# of repetitions` > 1)
We can also use it on a subset of columns: with one column it returns its unique values, and with more than one it returns the existing combinations of their values.
df %>%
  distinct(Country)
df %>%
  distinct(Country, `Customer ID`)
With .keep_all = TRUE, we keep all the other columns in the output as
well; their values come from the first row of each distinct combination.
df %>%
  distinct(Country, .keep_all = TRUE)
df %>%
  distinct(Country, `Customer ID`, .keep_all = TRUE)
If we don’t specify any columns, the two outputs are equal.
df %>%
  distinct()
df %>%
  distinct(.keep_all = TRUE)
Apart from the added count column and the row order of the output,
distinct() is very similar to count().
df %>%
  count(Country)
df %>%
  count(Country, `Customer ID`)
NAs are treated as a single value.
df %>%
  distinct(`Customer ID`) %>%
  arrange(!is.na(`Customer ID`))
So with more than one column, an NA forms combinations with the values of the other columns like any other value.
df %>%
  mutate(Country = na_if(Country, "Unspecified")) %>%
  distinct(Country, `Customer ID`) %>%
  arrange(!is.na(Country), !is.na(`Customer ID`))
We can use pick() to simplify the selection of more than
one column.
df %>%
  distinct(pick(starts_with("I")))
Since distinct() is a data-masking function, we can also use expressions.
df %>%
  distinct(Invoice_Day = as.Date(InvoiceDate))
distinct() only works with data frames (single-column ones
included), not with vectors.
df %>%
  pull(Country) %>%
  distinct()
## Error in UseMethod("distinct"): no applicable method for 'distinct' applied to an object of class "character"
distinct(df$Country)
## Error in UseMethod("distinct"): no applicable method for 'distinct' applied to an object of class "character"
So in cases like these we have to rely on unique() from base R.
df %>%
  pull(Country) %>%
  unique()
## [1] "United Kingdom" "France" "USA"
## [4] "Belgium" "Australia" "EIRE"
## [7] "Germany" "Portugal" "Japan"
## [10] "Denmark" "Nigeria" "Netherlands"
## [13] "Poland" "Spain" "Channel Islands"
## [16] "Italy" "Cyprus" "Greece"
## [19] "Norway" "Austria" "Sweden"
## [22] "United Arab Emirates" "Finland" "Switzerland"
## [25] "Unspecified" "Malta" "Bahrain"
## [28] "RSA" "Bermuda" "Hong Kong"
## [31] "Singapore" "Thailand" "Israel"
## [34] "Lithuania" "West Indies" "Lebanon"
## [37] "Korea" "Brazil" "Canada"
## [40] "Iceland"
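As a small illustration of the difference, the sketch below uses a made-up character vector (countries is a hypothetical name, not part of df). unique() handles the vector directly, and wrapping the vector in a one-column tibble is one way to make it acceptable to distinct().

```r
library(dplyr)

# Hypothetical vector with repeated values
countries <- c("France", "USA", "France", "Japan", "USA")

# Base R works directly on the vector
from_unique <- unique(countries)

# distinct() needs a data frame, so wrap the vector in a one-column tibble
from_distinct <- tibble(Country = countries) %>%
  distinct(Country)

from_unique
from_distinct
```

Both approaches keep the values in order of first appearance; the only difference is that the second one returns a tibble instead of a vector.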
With a grouped data frame, the grouping column is included as well,
as if it were specified first. The row order of df is preserved;
rows are not rearranged to follow the grouping columns.
df %>%
  group_by(Country) %>%
  distinct(`Customer ID`)
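The same behaviour can be seen on a small made-up tibble (toy and its values are assumptions for illustration; only the column names mirror df). This is a sketch of what to expect, not output taken from the original data.

```r
library(dplyr)

# Hypothetical toy data; the values are made up
toy <- tibble(
  Country       = c("USA", "France", "USA", "France", "USA"),
  `Customer ID` = c(2, 1, 2, 1, 3)
)

grouped <- toy %>%
  group_by(Country) %>%
  distinct(`Customer ID`)

# Country is part of the output even though it was not listed in distinct()
names(grouped)

# Rows follow their first appearance in toy, not the order of the groups
grouped
```

The result is still grouped by Country, so an ungroup() may be needed before further steps that should not operate per group.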