The rows_* family of functions performs manipulations on the rows of our data frame by using the possible intersections with the rows of another one.

It consists of

rows_insert()
rows_append()
rows_update()
rows_patch()
rows_upsert()
rows_delete()

The intersections are determined by the by argument, with which we define one or several columns present in both data frames.

One shared property of these functions is that our data frame remains unchanged for what it concerns the order of the rows and of the columns.

- rows_insert()

With rows_insert() we can add the rows of another data frame to the tail of our own.

df %>%
  rows_insert(tibble(Invoice = "xxxxxx",
                     StockCode = "xxxxxx",
                     Description = "xxxxxx",
                     Quantity = Inf,
                     InvoiceDate = as.POSIXct("9999-01-01 00:00:00", tz = "UTC"),
                     Price = Inf,
                     `Customer ID` = Inf,
                     Country = "xxx"), by = "Invoice") %>%
  slice_tail(n = 10)

A tibble: 10 x 8

These rows are not meant to be already present in our data frame, so if they have a value in the shared column specified in by that is not extraneous, the function will throw an error.

df %>%
  rows_insert(tibble(Invoice = "538171",
                     StockCode = "xxxxxx",
                     Description = "xxxxxx",
                     Quantity = Inf,
                     InvoiceDate = as.POSIXct("9999-01-01 00:00:00", tz = "UTC"),
                     Price = Inf,
                     `Customer ID` = Inf,
                     Country = "xxx"), by = "Invoice") %>%
  slice_tail(n = 10)

Error in `rows_insert()`:
! `y` can't contain keys that already exist in `x`.
ℹ The following rows in `y` have keys that already exist in `x`:
  `c(1)`.
ℹ Use `conflict = "ignore"` if you want to ignore these `y` rows.

The values in the other columns added can be already present instead.

df %>%
  rows_insert(tibble(Invoice = "xxxxxx",
                     StockCode = "21931",
                     Description = "JUMBO STORAGE BAG SUKI",
                     Quantity = 2,
                     InvoiceDate = as.POSIXct("2010-12-09 20:01:00", tz = "UTC"),
                     Price = 1.95,
                     `Customer ID` = 17530,
                     Country = "United Kingdom"), by = "Invoice") %>%
  slice_tail(n = 10)

A tibble: 10 x 8

If we don’t specify a column in by, the function will use the first column that is shared between the two data frames, informing us of which one with a message.

df %>%
  rows_insert(tibble(StockCode = "xxxxxx",
                     Description = "xxxxxx",
                     Quantity = Inf,
                     InvoiceDate = as.POSIXct("9999-01-01 00:00:00", tz = "UTC"),
                     Price = Inf,
                     `Customer ID` = Inf,
                     Country = "xxx")) %>%
  slice_tail(n = 10)

Matching, by = "StockCode"

A tibble: 10 x 8

As the Invoice column is absent in the data frame we add rows from, we got an NAs in the Invoice column. This occurrence can be more evident in the following more extreme example.

df %>%
  rows_insert(tibble(Invoice = "xxxxxx"), by = "Invoice") %>%
  slice_tail(n = 10)

A tibble: 10 x 8

But it happens every time we add rows from a data frame that has a subset of columns of our own.

df %>%
  rows_insert(tibble(Invoice = "xxxxxx",
                     StockCode = "xxxxxx",
                     Description = "xxxxxx"), by = "Invoice") %>%
  slice_tail(n = 10)

A tibble: 10 x 8

rows_insert() can add duplicated rows.

df %>%
  rows_insert(tibble(Invoice = "xxxxxx", Description = c("xxx", "xxx")), by = "Invoice") %>% 
  slice_tail(n = 10)

A tibble: 10 x 8

As said in the introduction, the column we specify in by must be one that is present in both data frames, otherwise we will get an error.

df %>%
  rows_insert(tibble(Invoice = "xxxxxx",
                     StockPosition = "xxxxxx"), by = "StockPosition") %>%
  slice_tail(n = 10)

Error in `rows_insert()`:
! All columns in `y` must exist in `x`.
ℹ The following columns only exist in `y`: `StockPosition`.

Likewise we can’t have, in the data frame we add rows from, columns not present in our own.

df %>%
  rows_insert(tibble(Invoice = "xxxxxx",
                     StockPosition = "xxxxxx"), by = "Invoice") %>%
  slice_tail(n = 10)

Error in `rows_insert()`:
! All columns in `y` must exist in `x`.
ℹ The following columns only exist in `y`: `StockPosition`.

The function will return an error as well if the types of the columns are different.

df %>%
  rows_insert(tibble(Invoice = 999999), by = "Invoice")

Error in `rows_insert()`:
! Can't convert `y$Invoice` <double> to match type of `x$Invoice` <character>.

- conflict

As said in the beginning, rows_insert() is meant to add new rows, so we can’t add rows that have a value in the column specified in by already present in our data frame.

df %>%
  rows_insert(tibble(Invoice = c("538171", "xxxxxx")), by = "Invoice")

Error in `rows_insert()`:
! `y` can't contain keys that already exist in `x`.
ℹ The following rows in `y` have keys that already exist in `x`:
  `c(1)`.
ℹ Use `conflict = "ignore"` if you want to ignore these `y` rows.

We can either remove them beforehand.

df %>%
  rows_insert(tibble(Invoice = c("538171", "xxxxxx")) %>%
                filter(Invoice != "538171"), by = "Invoice") %>%
  slice_tail(n = 10)

A tibble: 10 x 8

Or use conflict = "ignore" for the rows with the same value to not be added.

df %>%
  rows_insert(tibble(Invoice = c("538171", "xxxxxx")), by = "Invoice", conflict = "ignore") %>%
  slice_tail(n = 10)

A tibble: 10 x 8

- rows_append()

Another solution, to add rows with the same value, is to use rows_append(), which works like rows_insert() but it doesn’t have the by argument.

df %>%
  rows_append(tibble(Invoice = c("538171", "xxxxxx"))) %>%
  slice_tail(n = 10)

A tibble: 10 x 8

In the most extreme case, it adds rows even if they are duplicates of the rows of our data frame.

df %>%
  rows_append(tibble(Invoice = "538171",
                     StockCode = "21931",
                     Description = "JUMBO STORAGE BAG SUKI",
                     Quantity = 2,
                     InvoiceDate = as.POSIXct("2010-12-09 20:01:00", tz = "UTC"),
                     Price = 1.95,
                     `Customer ID` = 17530,
                     Country = "United Kingdom")) %>%
  slice_tail(n = 10)

A tibble: 10 x 8

- rows_update()

rows_update() permits us to modify the values of the rows of our data frame, by substituting them with values from another one.

The rows with the values to be substituted are identified by the values in the column specified in by, so at a minimum we need a tibble with a column to identify the rows and a column with the substituting values.

df %>%
  rows_update(tibble(Invoice = "489434", 
                     Description = "xxxxxx"), by = "Invoice")

A tibble: 525461 x 8

Without the latter nothing will be substituted (the function doesn’t add NAs).

df %>%
  rows_update(tibble(Invoice = "489434"), by = "Invoice")

A tibble: 525461 x 8

And without the former, if we use it in by, we miss the requirement of it existing in both data frames.

df %>%
  rows_update(tibble(Description = "xxxxxx"), by = "Invoice")

Error in `rows_update()`:
! All columns specified through `by` must exist in `x` and
  `y`.
ℹ The following columns are missing from `y`: `Invoice`.

If we use the latter in by it exists in both data frames but in our own it does without the value that identifies which rows to update.

df %>%
  rows_update(tibble(Description = "xxxxxx"), by = "Description")

Error in `rows_update()`:
! `y` must contain keys that already exist in `x`.
ℹ The following rows in `y` have keys that don't exist in `x`:
  `c(1)`.
ℹ Use `unmatched = "ignore"` if you want to ignore these `y` rows.

- unmatched

That was the most extreme case, but there can be instances where the data frame we retrieve the updating values from has values in the column specified in by that are both present and absent in our own.

tibble(Invoice = c("489434", "xxxxxx"),
       Description = "xxxxxx")

A tibble: 2 x 2

The absent values, as before, will trigger an error so in these cases we can set the unmatched argument to “ignore” for the updating to go through successfully.

df %>%
  rows_update(tibble(Invoice = c("489434", "xxxxxx"), 
                     Description = "xxxxxx"), by = "Invoice", unmatched = "ignore")

A tibble: 525461 x 8

We can’t provide a data frame with two values in the column to be modified, even if the values are the same, as here Invoice doesn’t uniquely identify a row of Description.

df %>%
  rows_update(tibble(Invoice = "489434", 
                     Description = c("xxxxxx", "yyyyyy")), by = "Invoice")

Error in `rows_update()`:
! `y` key values must be unique.
ℹ The following rows contain duplicate key values: `c(1, 2)`.

df %>%
  rows_update(tibble(Invoice = "489434", 
                     Description = c("xxxxxx", "xxxxxx")), by = "Invoice")

Error in `rows_update()`:
! `y` key values must be unique.
ℹ The following rows contain duplicate key values: `c(1, 2)`.

We can modify the values of several rows though if we provide a data frame with several values to substitute with.

df %>%
  rows_update(tibble(Invoice = c("489434", "489435"),
                     Description = c("xxxxxx", "yyyyyy")), by = "Invoice")

A tibble: 525461 x 8

by accepts more than one column, in this way we can be more precise with what values we want to substitute.

df %>%
  rows_update(tibble(Invoice = c("489434", "489435"),
                     StockCode = c("85048", "22350"),
                     Description = c("xxxxxx", "yyyyyy")), by = c("Invoice", "StockCode"))

A tibble: 525461 x 8

Like rows_insert(), in case we don’t specify by, the function will use the first shared column.

df %>%
  rows_update(tibble(Invoice = "489434", 
                     Description = "xxxxxx"))

Matching, by = "Invoice"

A tibble: 525461 x 8

So in case of two columns by is mandatory otherwise the function will use only the first shared one, replacing also the StockCode values.

df %>%
  rows_update(tibble(Invoice = c("489434", "489435"),
                     StockCode = c("85048", "22350"),
                     Description = c("xxxxxx", "yyyyyy")))

Matching, by = "Invoice"

A tibble: 525461 x 8

As with rows_insert(), the function will return an error if the types of the columns are different.

df %>%
  rows_update(tibble(Invoice = 489434, 
                     Country = "xxx"), by = "Invoice")

Error in `rows_update()`:
! Can't combine `x$Invoice` <character> and `y$Invoice` <double>.

- rows_patch()

If we wish to only modify NA values, we can use rows_patch(), that works like rows_update(), therefore the data frame we provide needs a column to identify the rows with NAs (the one specified in by) and one or several columns with the values to substitute the NAs with.

df %>%
  filter(Invoice == "489521")

A tibble: 1 x 8

df %>%
  filter(Invoice == "489521") %>%
  rows_patch(tibble(Invoice = "489521",
                    `Customer ID` =  Inf), by = "Invoice")

A tibble: 1 x 8

df %>%
  filter(Invoice == "489521") %>%
  rows_patch(tibble(Invoice = "489521",
                    Description = "xxxxxx",
                    `Customer ID` =  Inf), by = "Invoice")

A tibble: 1 x 8

So we can see it like a safer version of rows_update(), to avoid to accidentally modify non NAs values.

df %>%
  filter(Invoice %in% c("489521", "489522"))

A tibble: 105 x 8

df %>%
  filter(Invoice %in% c("489521", "489522")) %>%
  rows_patch(tibble(Invoice = c("489521", "489522"),
                    Description = "xxxxxx",
                    `Customer ID` =  Inf), by = "Invoice")

A tibble: 105 x 8

df %>%
  filter(Invoice %in% c("489521", "489522")) %>%
  rows_update(tibble(Invoice = c("489521", "489522"),
                     Description = "xxxxxx",
                     `Customer ID` =  Inf), by = "Invoice")

A tibble: 105 x 8

Like rows_update() we can omit the by argument, and the function will use the first shared column.

df %>%
  filter(Invoice == "489521") %>%
  rows_patch(tibble(Invoice = "489521",
                    `Customer ID` =  Inf))

Matching, by = "Invoice"

A tibble: 1 x 8

Likewise we can use two columns to be more precise (and as with rows_update() here by is logically necessary).

df %>%
  filter(Invoice == "537434") %>%
  rows_patch(tibble(Invoice = "537434",
                    StockCode = "20725",
                    `Customer ID` =  Inf), by = c("Invoice", "StockCode"))

A tibble: 675 x 8

We can set unmatched to “ignore” in case the data frame that contains the values that will replace the NAs in our own has rows not present in that.

df %>%
  filter(Invoice == "489521") %>%
  rows_patch(tibble(Invoice = c("489521", "xxxxxx"),
                    `Customer ID` =  Inf), by = "Invoice", unmatched = "ignore")

A tibble: 1 x 8

- rows_upsert()

rows_upsert() is a function that has two different applications.

It works like rows_insert() when the value in the columns specified in by is not present, effectively adding rows.

df %>%
  rows_upsert(tibble(Invoice = "xxxxxx"), by = "Invoice") %>%
  slice_tail(n = 10)

A tibble: 10 x 8

And like rows_update() if it is, updating values.

df %>%
  rows_upsert(tibble(Invoice = "489434", 
                     Description = "xxxxxx"), by = "Invoice")

A tibble: 525461 x 8

So it can be useful to perform two manipulations at the same time.

df %>%
  rows_upsert(tibble(Invoice = c("538171", "xxxxxx"),
                     Description = c("xxxxxx", "yyyyyy")), by = "Invoice") %>%
  slice_tail(n = 10)

A tibble: 10 x 8

- rows_delete()

rows_delete() removes the rows that share the same values in the column specified in by.

df %>%
  rows_delete(tibble(Invoice = "489434"), by = "Invoice")

A tibble: 525453 x 8

df %>%
  rows_delete(tibble(Invoice = c("489434", "489435")), by = "Invoice")

A tibble: 525449 x 8

To use this function, a one column data frame is sufficient, as other columns will be regarded as extra and ignored.

df %>%
  rows_delete(tibble(Invoice = "489434", StockCode = "85048"), by = "Invoice")

Ignoring extra `y` columns: StockCode

A tibble: 525453 x 8

Like the other functions, when we don’t specify a column in by rows_delete() will use the first shared column.

df %>%
  rows_delete(tibble(StockCode = "85048"))

Matching, by = "StockCode"

A tibble: 525142 x 8

And the function will return an error if the types of the columns are different.

df %>%
  rows_delete(tibble(Invoice = 489434), by = "Invoice")

Error in `rows_delete()`:
! Can't combine `x$Invoice` <character> and `y$Invoice` <double>.

Like rows_update() and rows_patch(), also rows_delete() has the unmatched argument that can be set to ignore when we want to use a data frame that has values non present in our own in the column specified in by.

df %>%
  rows_delete(tibble(Invoice = c("489434", "999999")), by = "Invoice", unmatched = "ignore")

A tibble: 525453 x 8

- with group_by()

These functions don’t present specific behaviors when applied to a grouped data frame.

df %>%
  group_by(Invoice) %>%
  rows_insert(tibble(Invoice = "xxxxxx"), by = "Invoice") %>%
  ungroup() %>%
  slice_tail(n = 10)

A tibble: 10 x 8

df %>%
  group_by(Invoice) %>%
  rows_update(tibble(Invoice = "489434", 
                     Description = "xxxxxx"), by = "Invoice")

A tibble: 525461 x 8

Groups: Invoice [28816]

df %>%
  group_by(Invoice) %>%
  filter(Invoice == "489521") %>%
  rows_patch(tibble(Invoice = "489521",
                    `Customer ID` =  Inf), by = "Invoice")

A tibble: 1 x 8

Groups: Invoice [1]

df %>%
  group_by(Invoice) %>%
  rows_upsert(tibble(Invoice = c("538171", "xxxxxx"),
                     Description = "xxxxxx"), by = "Invoice") %>%
  ungroup() %>%
  slice_tail(n = 10)

A tibble: 10 x 8

df %>%
  group_by(Invoice) %>%
  rows_delete(tibble(Invoice = "489434"), by = "Invoice")

A tibble: 525453 x 8

Groups: Invoice [28815]