Search code examples
rtidyrspread

Separate value in field by character, create multiple columns to the right based on the number of splits possible


I've asked a series of questions.

Separate contents of field

Separate variable in field by character.

Which I think contained multiple questions around the same topic.

I've had excellent answers on how to use separate_rows and then a great answer on how to separate the first and last authors from a character vector.

What I'd like to know now is the final bit:

In this answer Splitting column by separator from right to left in R

the number of columns is known. How do I say "split this string at commas, and throw them into an unknown number of columns based on the number of names in the author list to the right of the original field"?

So that each author becomes a value in a separate field. Initially, I thought it would be cast/spread. BUT!

While this is the example I've used: Author

Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.

in many cases the number of authors on the paper (not including et. al) will be >1 and could be somewhere around 30 at the most.

So. Final question on this three part saga... How do I separate out all authors to a new field and if I could title the new fields something like First author, second author, etc and so on up to Last Author.

Is that sensible/clear?

appreciate there's two or three people who are helping very quickly.


Solution

  • You can split your author column into a list with str_split and then use unnest to get long format dataframe with a new author on each line. Then you use spread and an ID column to get the data into wide format.

    library(dplyr)
    library(tidyr)
    df <- data.frame(publication = c("pub1","pub2","pub3"),author = c("Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P","test author","test arthur, another author"))
    df
    #  publication                                                   author
    #1        pub1 Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P
    #2        pub2                                              test author
    #3        pub3                              test arthur, another author
    
    
    df %>% group_by(publication) %>% mutate(author = str_split(author,", ")) %>% unnest %>% mutate(ID = paste0("author_",row_number())) %>% spread(ID,author)
    # A tibble: 3 x 6
    # Groups:   publication [3]
    #  publication author_1    author_2       author_3     author_4 author_5
    #  <fct>       <chr>       <chr>          <chr>        <chr>    <chr>   
    #1 pub1        Drijgers RL Verhey FR      Leentjens AF Kahler S Aalten P
    #2 pub2        test author NA             NA           NA       NA      
    #3 pub3        test arthur another author NA           NA       NA