In this handbook we extensively used the pipe operator %>%, which lets us avoid intermediate objects and nested function calls like the one in the following example.
slice_head(filter(df, Country == "United Kingdom"), n = 5)
With the pipe, the same code becomes easier to decipher and to extend.
df %>%
  filter(Country == "United Kingdom") %>%
  slice_head(n = 5)
The pipe operator works by feeding the object on its left as the first argument of the function on its right. Many dplyr functions have .data as their first argument, which is why the pipe is so useful and widely adopted: with it we can “carry” the data frame through different transformations in one single “piece” of code.
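As a minimal sketch of this equivalence, the snippet below checks that the nested and the piped forms return the same result. It uses a small made-up data frame; toy and its values are hypothetical and are not the handbook's df.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  Country = c("United Kingdom", "France", "United Kingdom", "Italy"),
  Price   = c(2.5, 1.0, 4.0, 3.2)
)

# Nested call: read from the inside out
nested <- slice_head(filter(toy, Country == "United Kingdom"), n = 1)

# Piped call: read from top to bottom
piped <- toy %>%
  filter(Country == "United Kingdom") %>%
  slice_head(n = 1)

# Both forms should produce the same one-row tibble
identical(nested, piped)
```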
The pipe operator can also be nested inside functions, usually to modify the value of an argument.
df %>%
  filter(Country == "United Kingdom") %>%
  slice_head(n = 5.4 %>% floor)
Notice how we wrote floor instead of floor(). This is another property of the pipe: when the function on the right needs no argument other than the piped object, we can omit the empty parentheses.
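A quick way to convince ourselves is to compare the two spellings on a single number. This is only a small sketch; the variable names are made up for the example.

```r
library(magrittr)

# With and without the empty parentheses on the right-hand side
with_parens    <- 5.4 %>% floor()
without_parens <- 5.4 %>% floor

# Both calls are floor(5.4)
identical(with_parens, without_parens)
```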
When instead we want to feed the object on the left to an argument that is not the first, we can use a dot (.) as a placeholder.
df %>%
  pull(Price) %>%
  slice_min(df, order_by = ., n = 10, with_ties = FALSE)
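The same mechanism can be sketched with base R's seq(), which makes it easy to see where the piped value ends up. The variable names below are hypothetical.

```r
library(magrittr)

# No dot: 2 becomes the first argument, i.e. seq(2, 10)
first_arg <- 2 %>% seq(10)

# Dot passed directly as `by`: 2 is not inserted as the first argument,
# so the call is seq(1, 10, by = 2)
as_by <- 2 %>% seq(1, 10, by = .)

first_arg
as_by
```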
In other cases we might need to use the object on the left several times, usually as .data and for one of its properties, like one of its dimensions,
df %>%
  filter(row_number() < nrow(df) / 2)
but it is not a problem to use the dot as many times as we need.
df %>%
  filter(., row_number() < nrow(.) / 2)
The dot is also necessary when we have performed some transformations on the object, as df still refers to the unmodified one.
df %>%
  slice(1:5) %>%
  filter(., row_number() < nrow(df) / 2)

df %>%
  slice(1:5) %>%
  filter(., row_number() < nrow(.) / 2)
In the last examples we also specified the dot as the first argument, but it is not really necessary: when we use the placeholder only inside a nested function (nrow() in our case), the default is to still use the object on the left as the first argument of the enclosing function. So we usually remove it.
df %>%
  slice(1:5) %>%
  filter(row_number() < nrow(.) / 2)
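To make the difference between the original object's name and the dot concrete, here is a small sketch on a hypothetical ten-row tibble (toy is not the handbook's df).

```r
library(dplyr)

toy <- tibble(id = 1:10)  # hypothetical data

# nrow(toy) is 10, so rows with index below 5 are kept
using_name <- toy %>%
  slice(1:6) %>%
  filter(row_number() < nrow(toy) / 2)

# nrow(.) is 6 (the sliced object), so rows with index below 3 are kept
using_dot <- toy %>%
  slice(1:6) %>%
  filter(row_number() < nrow(.) / 2)

c(nrow(using_name), nrow(using_dot))
```

The two results differ only because the dot follows the object through the chain, while the name keeps pointing at the untouched data frame.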
This default can be problematic when we use functions that don’t need .data as the first argument, but we can override this behavior by wrapping the function on the right in curly braces ({}).
df %>%
  slice(1:5) %>%
  pull(Price) %>%
  {c(mean(.), sd(.))}
## [1] 4.760000 2.833373
Without the curly braces the pipe will use the object on the left as the first argument, which in this case concatenates it with the desired output.
df %>%
  slice(1:5) %>%
  pull(Price) %>%
  c(mean(.), sd(.))
## [1] 6.950000 6.750000 6.750000 2.100000 1.250000 4.760000 2.833373
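The effect of the braces is easy to reproduce on a short numeric vector. The sketch below uses made-up values rather than the Price column.

```r
library(magrittr)

x <- c(1, 2, 3)  # hypothetical values

# Braces: only the expression inside is evaluated, i.e. c(mean(x), sd(x))
with_braces <- x %>% {c(mean(.), sd(.))}

# No braces: x is also inserted as first argument, i.e. c(x, mean(x), sd(x))
without_braces <- x %>% c(mean(.), sd(.))

length(with_braces)
length(without_braces)
```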
Also, we can freely pipe functions like lm() and use the dot notation in their formulas, as it will not be mistaken for the placeholder (which we used here for the data argument).
df %>%
  select(Quantity, Price) %>%
  lm(Price ~ ., .)
##
## Call:
## lm(formula = Price ~ ., data = .)
##
## Coefficients:
## (Intercept) Quantity
## 4.715993 -0.002627
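As a rough check that the two dots play different roles, the sketch below fits the same model twice on the built-in mtcars data set (chosen here only because it ships with R), once through the pipe and once directly.

```r
library(dplyr)

# The dot in the formula means "all other columns";
# the dot passed to `data` is the pipe placeholder
piped_fit <- mtcars %>%
  select(mpg, wt) %>%
  lm(mpg ~ ., data = .)

# Equivalent model written without the pipe
direct_fit <- lm(mpg ~ wt, data = mtcars)

all.equal(coef(piped_fit), coef(direct_fit))
```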
We must be careful with grouped data frames though, as the dot placeholder refers to the whole data frame rather than to each specific group. In the following example, the value divided by 2 is therefore the total number of rows of the data frame, not the number of rows of each group.
df %>%
  group_by(Country) %>%
  filter(., row_number() < nrow(.) / 2)
This is just as if we had used df instead of . inside nrow().
df %>%
  group_by(Country) %>%
  filter(row_number() < nrow(df) / 2)
If we want to keep, for each group, the rows whose index is less than half the group's number of rows, we can use n(), which instead refers to the number of rows of the current group.
df %>%
  group_by(Country) %>%
  filter(row_number() < n() / 2)
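A small sketch on a hypothetical grouped tibble shows how the two thresholds differ. Here group "A" has four rows and group "B" has two; the data and names are invented for the example.

```r
library(dplyr)

toy <- tibble(
  Country = c("A", "A", "A", "A", "B", "B"),  # hypothetical groups
  Price   = 1:6
)

# nrow(.) is 6 for every group, so the threshold is 3 everywhere
whole <- toy %>%
  group_by(Country) %>%
  filter(row_number() < nrow(.) / 2)

# n() is 4 for "A" and 2 for "B", so the threshold is 2 and 1 respectively
per_group <- toy %>%
  group_by(Country) %>%
  filter(row_number() < n() / 2)

c(nrow(whole), nrow(per_group))
```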
The pipe has other features but, as they are not strictly related to dplyr, they will not be discussed here. The reference manual (https://cran.r-project.org/web/packages/magrittr/magrittr.pdf) is a good place to start investigating them.
The pipe operator %>% comes from the magrittr package and is made available when loading dplyr. magrittr has other operators though, and when we want to use them we need to load the package itself.
library(magrittr)
The “tee” pipe %T>% lets us “bypass” a function in the chain while still outputting its results, essentially “carrying” the object on its left to the function after the next one. This can be useful when we want the output of a function but that output is not usable by the following one. For example, we might want to output both a graph and a summary table from a data frame.
df %>%
  filter(Country == "Korea") %>%
  select(Quantity) %T>%
  plot() %>%
  table()
## Quantity
## -48 -12  -8  -6  -5  -4  -3   3   4   5   6   8   9  10  12  24  36  48
##   1   1   1   3   1   2   1   1   2   2  17   3   1   4  11   8   1   3
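The behavior can also be sketched without a plot, using str(), which prints a description of its input and returns NULL. The example compares the tee pipe with the regular pipe; the variable names are hypothetical.

```r
library(magrittr)

# Tee pipe: str() prints its summary, then c(1, 2, 3) is passed on to sum()
with_tee <- c(1, 2, 3) %T>% str() %>% sum()

# Regular pipe: sum() receives the NULL returned by str()
without_tee <- c(1, 2, 3) %>% str() %>% sum()

c(with_tee, without_tee)
```

With the regular pipe the original vector is lost after str(), which is exactly the situation the tee pipe is meant to avoid.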
If we want to pipe functions that, like many base R ones, work with vectors and don’t have .data as their first argument, we can use the “exposition” pipe %$%, which “exposes” the column names of the data frame to make them usable, for example, by a function like table().
df %>%
  table(Country)
## Error: object 'Country' not found
df %$%
  table(Country)
## Country
## Australia Austria Bahrain
## 654 537 107
## Belgium Bermuda Brazil
## 1054 34 62
## Canada Channel Islands Cyprus
## 77 906 554
## Denmark EIRE Finland
## 428 9670 354
## France Germany Greece
## 5772 8129 517
## Hong Kong Iceland Israel
## 76 71 74
## Italy Japan Korea
## 731 224 63
## Lebanon Lithuania Malta
## 13 154 172
## Netherlands Nigeria Norway
## 2769 32 369
## Poland Portugal RSA
## 194 1101 111
## Singapore Spain Sweden
## 117 1278 902
## Switzerland Thailand United Arab Emirates
## 1187 76 432
## United Kingdom Unspecified USA
## 485852 310 244
## West Indies
## 54
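As a minimal sketch of what “exposing” the columns means, the snippet below computes a correlation on a made-up data frame, once with %$% and once by extracting the columns explicitly. toy, x and y are hypothetical names.

```r
library(magrittr)

toy <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 6, 8))  # hypothetical data

# Columns are looked up by name inside the right-hand side call
exposed <- toy %$% cor(x, y)

# Equivalent call with explicit column extraction
explicit <- cor(toy$x, toy$y)

all.equal(exposed, explicit)
```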
Another way to circumvent the issue is to extract the column first, either as a vector with pull() or as a one-column data frame with select().
df %>%
  pull(Country) %>%
  table()
## .
## Australia Austria Bahrain
## 654 537 107
## Belgium Bermuda Brazil
## 1054 34 62
## Canada Channel Islands Cyprus
## 77 906 554
## Denmark EIRE Finland
## 428 9670 354
## France Germany Greece
## 5772 8129 517
## Hong Kong Iceland Israel
## 76 71 74
## Italy Japan Korea
## 731 224 63
## Lebanon Lithuania Malta
## 13 154 172
## Netherlands Nigeria Norway
## 2769 32 369
## Poland Portugal RSA
## 194 1101 111
## Singapore Spain Sweden
## 117 1278 902
## Switzerland Thailand United Arab Emirates
## 1187 76 432
## United Kingdom Unspecified USA
## 485852 310 244
## West Indies
## 54
df %>%
  select(Country) %>%
  table()
## Country
## Australia Austria Bahrain
## 654 537 107
## Belgium Bermuda Brazil
## 1054 34 62
## Canada Channel Islands Cyprus
## 77 906 554
## Denmark EIRE Finland
## 428 9670 354
## France Germany Greece
## 5772 8129 517
## Hong Kong Iceland Israel
## 76 71 74
## Italy Japan Korea
## 731 224 63
## Lebanon Lithuania Malta
## 13 154 172
## Netherlands Nigeria Norway
## 2769 32 369
## Poland Portugal RSA
## 194 1101 111
## Singapore Spain Sweden
## 117 1278 902
## Switzerland Thailand United Arab Emirates
## 1187 76 432
## United Kingdom Unspecified USA
## 485852 310 244
## West Indies
## 54
The assignment pipe %<>% is used to assign the output of a chain to its first element, besides outputting it. It must be used as the first pipe in the chain.
library(ggplot2)
UK_clients_plot <- df
UK_clients_plot %<>%
  filter(Country == "United Kingdom") %>%
  ggplot(aes(`Customer ID`)) +
  geom_bar()
## Warning: Removed 106429 rows containing non-finite values (`stat_count()`).
Without it we would have written
UK_clients_plot <- df %>%
  filter(Country == "United Kingdom") %>%
  ggplot(aes(`Customer ID`)) +
  geom_bar()
but that doesn’t show the graph, unless we wrap everything with parentheses.
(UK_clients_plot <- df %>%
  filter(Country == "United Kingdom") %>%
  ggplot(aes(`Customer ID`)) +
  geom_bar())
## Warning: Removed 106429 rows containing non-finite values (`stat_count()`).
Be aware that it can be quite dangerous to use, as it overwrites the first element of the chain (that is why we first copied df to UK_clients_plot, so as not to overwrite our data frame with a plot).
It can be useful to quickly update a column though,
df %$%
  table(Country, useNA = "ifany")
## Country
## Australia Austria Bahrain
## 654 537 107
## Belgium Bermuda Brazil
## 1054 34 62
## Canada Channel Islands Cyprus
## 77 906 554
## Denmark EIRE Finland
## 428 9670 354
## France Germany Greece
## 5772 8129 517
## Hong Kong Iceland Israel
## 76 71 74
## Italy Japan Korea
## 731 224 63
## Lebanon Lithuania Malta
## 13 154 172
## Netherlands Nigeria Norway
## 2769 32 369
## Poland Portugal RSA
## 194 1101 111
## Singapore Spain Sweden
## 117 1278 902
## Switzerland Thailand United Arab Emirates
## 1187 76 432
## United Kingdom Unspecified USA
## 485852 310 244
## West Indies
## 54
df$Country %<>% na_if("Unspecified")
df %$%
  table(Country, useNA = "ifany")
## Country
## Australia Austria Bahrain
## 654 537 107
## Belgium Bermuda Brazil
## 1054 34 62
## Canada Channel Islands Cyprus
## 77 906 554
## Denmark EIRE Finland
## 428 9670 354
## France Germany Greece
## 5772 8129 517
## Hong Kong Iceland Israel
## 76 71 74
## Italy Japan Korea
## 731 224 63
## Lebanon Lithuania Malta
## 13 154 172
## Netherlands Nigeria Norway
## 2769 32 369
## Poland Portugal RSA
## 194 1101 111
## Singapore Spain Sweden
## 117 1278 902
## Switzerland Thailand United Arab Emirates
## 1187 76 432
## United Kingdom USA West Indies
## 485852 244 54
## <NA>
## 310
instead of writing
df <- df %>%
  mutate(Country = na_if(Country, "Unspecified"))
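The two spellings can be compared side by side on a tiny hypothetical tibble (toy is invented for the example), which also avoids touching the real df while experimenting.

```r
library(dplyr)
library(magrittr)

toy <- tibble(Country = c("Italy", "Unspecified", "Spain"))  # hypothetical data

# Explicit version: the result is stored in a new object
with_mutate <- toy %>%
  mutate(Country = na_if(Country, "Unspecified"))

# Assignment pipe: the column of toy is overwritten in place
toy$Country %<>% na_if("Unspecified")

identical(toy$Country, with_mutate$Country)
```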
Since version 4.1, base R also has its own pipe operator, |>. Its purpose is the same as that of the magrittr pipe %>%, but it has fewer features (for example, the dot cannot be used as a placeholder). For those interested, further information is available at this link: https://www.tidyverse.org/blog/2023/04/base-vs-magrittr-pipe/
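A brief sketch of the base pipe, assuming R 4.1 or later: the first line mirrors the basic %>% usage, while the second shows one possible workaround for the missing dot, an anonymous function that is called immediately. The variable names are hypothetical.

```r
# Requires R >= 4.1; no packages needed

# The left-hand side becomes the first argument, as with %>%
total <- c(1, 4, 9) |> sqrt() |> sum()

# No dot placeholder: wrap the call in an anonymous function instead
by_two <- 2 |> (\(step) seq(1, 10, by = step))()

total
by_two
```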