Search code examples
rtidyrdata-cleaning

Is there an R function to select specific row numbers?


I'm a beginner in the process of learning R and I have a dataset containing book names and authors that I'm using as practice for cleaning data. In part of the process, I want to seperate the column "author" into "first_name" and "last_name".

When I perform this task, I get the following message:

library(tidyr)

#   - Seperate author first and last name
separate(df, col="author", into = c("first_name","last_name"), sep=" ")

Warning message:   
Expected 2 pieces. Additional pieces discarded in
18 rows [4, 12, 16, 17, 21, 23, 28, 34, 41, 43, 46, 60, 65, 67, 73,
79, 82, 84].

Looking back at the data, I can see that the error mostly occurs with Authors who have a middle name or middle initial (ex. Louisa May Alcot). I want to look up the rows stated in the warning message to remove additional spaces before cleaning. Is there a way to do this?

I tried using

df[4,12,16,17,21,23,28,34,41,43,36,60,65,67,73,79,82,84]

but I got an error message because R was trying to pull those up as columns and the dataset does not have that many columns. I've looked through some search boards for the code I would need to perform this task, but I'm being met with solutions that don't quite apply to the problem and seem more involved than they need to be.

Is there a simple way to resolve this problem? Or a different way I should be approaching the data?


Solution

  • Expanding on my comment, here is an example of what I mean:

    library(dplyr)
    
    df <- data.frame(name = c("Jane Doe", "Louisa May Alcot")) 
    
    df |>
      mutate(last_name = sub(".*\\s", "", name),
             first_name = sub("(.*)\\s(.*)$", "\\1", name))
    
    #               name last_name first_name
    # 1         Jane Doe       Doe       Jane
    # 2 Louisa May Alcot     Alcot Louisa May
    

    Using regular expressions we can capture everything after the last space (last_name) and everything before the last space (first name). In some cases this will not work, for example "Vincent van Gogh". Here there is a space in the last name -- just something to be aware of if using this.

    I think separate works nicely when you have consistent delimiters, but here some folks have three spaces while others only have two. Still you could force it using some logic like so:

    library(dplyr)
    library(tidyr)
    
    df |>
      separate(name, into = c("first", "middle", "last")) |>
      mutate(last = coalesce(last, middle),
             middle = ifelse(last == middle, "", middle),
             first = trimws(paste(first, middle))) |>
      select(-middle)
    

    Lastly, if as you mentioned in your post it makes sense to remove the space in this circumstance you could do that like so:

    library(stringr)
    library(tidyr)
    
    df |>
      mutate(name = ifelse(str_count(name, "\\s") > 1, str_remove(name, "\\s"), name)) |>
      separate(name, into = c("first", "last"))
    #       first  last
    # 1      Jane   Doe
    # 2 LouisaMay Alcot