With the bind family of functions, consisting of
bind_rows()
and bind_cols()
, we can merge data
frames.
bind_rows()
stacks data frames vertically, by the order
they are specified in the function.
<- slice_head(df, n = 5)) (df_first_rows
<- slice_tail(df, n = 5)) (df_second_rows
bind_rows(df_first_rows,
df_second_rows)
bind_rows()
uses their names to bind the columns with
each other, so if there are columns’ names that are not shared, like
Invoice_copy
in df_first_rows
for example,
that column will get NAs in the rows of the data frame where it is not
present.
bind_rows(df_first_rows %>%
mutate(Invoice_Copy = Invoice, .after = "Invoice"),
df_second_rows)
One frequent occasion of this event is when one data frame has less columns.
bind_rows(df_first_rows %>%
select(2:8),
df_second_rows)
bind_rows(df_first_rows,
%>%
df_second_rows select(2:8))
Notice how in the first example the missing column has been moved last: this is because the output’s columns’ order is taken from the first data frame.
In the most extreme case, where there are no shared columns’ names, we will still get an output even if the data frames don’t share any column.
bind_rows(df_first_rows %>%
select(Country),
%>%
df_second_rows select(Invoice))
The type of columns must be the same to do the merging.
bind_rows(df_first_rows %>%
mutate(`Customer ID` = as.character(`Customer ID`)),
df_second_rows)
Error in `bind_rows()`:
! Can't combine `..1$Customer ID` <character> and `..2$Customer ID` <double>.
As already seen, the order of the columns is inherited from the first data frame.
bind_rows(df_first_rows %>%
select(8:1),
df_second_rows)
bind_rows(df_first_rows,
%>%
df_second_rows select(8:1))
bind_rows()
can also be used to merge together a list of
data frames.
group_split(df, Country) %>%
bind_rows()
We can also merge data frames with vectors
bind_rows(df_first_rows,
c(Country = "222", StockCode = "222222"))
or vectors with vectors,
bind_rows(c(Country = "111", StockCode = "111111"),
c(Country = "222", StockCode = "222222"))
just as long as they are named though.
bind_rows(df_first_rows,
c("222", "222222"))
Error in `bind_rows()`:
! Argument 2 must be a data frame or a named atomic vector.
bind_rows(c("111", "111111"),
c("222", "222222"))
Error in `bind_rows()`:
! Argument 1 must be a data frame or a named atomic vector.
If they don’t share names we will get NAs for the non-matching columns, as with data frames.
bind_rows(df_first_rows,
c(Country = "222", StockPosition = "222222"))
bind_rows(df_first_rows,
c(StockPosition = "222222"))
bind_rows(c(Country = "111", StockPosition = "111111"),
c(Country = "222", StockCode = "222222"))
We can identify from which data frame every row comes from with the
.id
argument, which will create a new column with a name of
our choice.
bind_rows(df_first_rows, df_second_rows, .id = "N_of_df")
The default way to differentiate the data frames is a progressive
number but if the arguments of bind_rows()
are named those
will be used instead.
bind_rows("number_one" = df_first_rows,
"number_two" = df_second_rows, .id = "N_of_df")
bind_cols()
merges data frames horizontally.
<- select(df, 1:4)) (df_first_cols
<- select(df, 5:8)) (df_second_cols
bind_cols(df_first_cols,
df_second_cols)
With this function the matching happens by row indexes, meaning that
row index 1 of df_first_cols
will bind with row index 1 of
df_second_cols
.
Consequently, the number of rows must either be the same or at least one data frame’s number of rows must be equal to one, in which case that row will be recycled.
bind_cols(df_first_cols %>%
slice(1),
df_second_cols)
Divisors are not recycled.
bind_cols(df_first_cols %>%
slice(1:2),
%>%
df_second_cols slice(1:4))
Error in `bind_cols()`:
! Can't recycle `..1` (size 2) to match `..2` (size 4).
If we want to bind_cols()
two data frames that have the
same columns’ names, the function will add numeric suffixes to
differentiate.
bind_cols(df_first_rows %>%
slice(1:4),
%>%
df_second_rows slice(1:4))
New names:
• `Invoice` -> `Invoice...1`
• `StockCode` -> `StockCode...2`
• `Description` -> `Description...3`
• `Quantity` -> `Quantity...4`
• `InvoiceDate` -> `InvoiceDate...5`
• `Price` -> `Price...6`
• `Customer ID` -> `Customer ID...7`
• `Country` -> `Country...8`
• `Invoice` -> `Invoice...9`
• `StockCode` -> `StockCode...10`
• `Description` -> `Description...11`
• `Quantity` -> `Quantity...12`
• `InvoiceDate` -> `InvoiceDate...13`
• `Price` -> `Price...14`
• `Customer ID` -> `Customer ID...15`
• `Country` -> `Country...16`
the .name_repair
argument can change how this works
(check https://tibble.tidyverse.org/reference/as_tibble.html
and https://vctrs.r-lib.org/reference/vec_as_names.html for
further details).
We can merge a data frame with a column, but that has to be named.
bind_cols(df_first_cols,
"Customer ID" = df$`Customer ID`)
Otherwise the column name will be dictated by the
.name_repair
default.
bind_cols(df_first_cols,
$`Customer ID`) df
New names:
• `` -> `...5`
In case of grouped data frames, the output will inherit the groups of the first one.
bind_rows(df_first_rows %>%
group_by(Country),
%>%
df_second_rows group_by(`Customer ID`))
bind_cols(df_first_cols %>%
group_by(StockCode),
%>%
df_second_cols group_by(Country))