data-masking
When there are rows that are repeated,
%>%
df filter(Invoice == "496431" &
== "84826") StockCode
distinct()
removes the ones after the first.
%>%
df filter(Invoice == "496431" &
== "84826") %>%
StockCode distinct()
So it is useful in data wrangling when we want to remove duplicated rows from a data frame.
%>%
df distinct()
It is equivalent to these lines of code,
%>%
df group_by(pick(everything())) %>%
summarise(n = n()) %>%
select(-n) %>%
ungroup()
`summarise()` has grouped output by 'Invoice', 'StockCode',
'Description', 'Quantity', 'InvoiceDate', 'Price', 'Customer ID'. You
can override using the `.groups` argument.
%>%
df group_by(pick(everything())) %>%
slice(1) %>%
ungroup()
but it returns the output much faster.
If we are interested in returning the duplicated rows instead, we can
use this procedure with count()
and
filter()
.
%>%
df count(pick(everything()), name = "# of repetitions") %>%
filter(`# of repetitions` > 1)
We can also use it on a subset of columns and it will return the unique values, in case of one, and the existing combinations of their values with more than one.
%>%
df distinct(Country)
%>%
df distinct(Country, `Customer ID`)
With .keep_all = TRUE
, we keep all other columns as well
in the output.
%>%
df distinct(Country, .keep_all = TRUE)
%>%
df distinct(Country, `Customer ID`, .keep_all = TRUE)
If we don’t specify columns, the outputs are equal.
%>%
df distinct()
%>%
df distinct(.keep_all = TRUE)
distinct()
is very similar, beside the added count and
the row order of the output, to what count()
does.
%>%
df count(Country)
%>%
df count(Country, `Customer ID`)
NAs are treated as one value.
%>%
df distinct(`Customer ID`) %>%
arrange(!is.na(`Customer ID`))
So with more than one column they form a combination with the other values.
%>%
df mutate(Country = na_if(Country, "Unspecified")) %>%
distinct(Country, `Customer ID`) %>%
arrange(!is.na(Country),!is.na(`Customer ID`))
We can use pick()
to simplify the selection of more than
one column.
%>%
df distinct(pick(starts_with("I")))
It being a data-masking function, we can also use expressions.
%>%
df distinct(Invoice_Day = as.Date(InvoiceDate))
distinct()
only works with data frames (a single column
one included) but not with vectors.
%>%
df pull(Country) %>%
distinct()
## Error in UseMethod("distinct"): no applicable method for 'distinct' applied to an object of class "character"
distinct(df$Country)
## Error in UseMethod("distinct"): no applicable method for 'distinct' applied to an object of class "character"
So in cases like these we have to rely on unique()
from
base R
.
%>%
df pull(Country) %>%
unique()
## [1] "United Kingdom" "France" "USA"
## [4] "Belgium" "Australia" "EIRE"
## [7] "Germany" "Portugal" "Japan"
## [10] "Denmark" "Nigeria" "Netherlands"
## [13] "Poland" "Spain" "Channel Islands"
## [16] "Italy" "Cyprus" "Greece"
## [19] "Norway" "Austria" "Sweden"
## [22] "United Arab Emirates" "Finland" "Switzerland"
## [25] "Unspecified" "Malta" "Bahrain"
## [28] "RSA" "Bermuda" "Hong Kong"
## [31] "Singapore" "Thailand" "Israel"
## [34] "Lithuania" "West Indies" "Lebanon"
## [37] "Korea" "Brazil" "Canada"
## [40] "Iceland"
With a grouped data frame, the grouping column is processed as well,
as if it was specified first. The rows order is kept from
df
and not changed following the grouping columns.
%>%
df group_by(Country) %>%
distinct(`Customer ID`)