r, regex, tidyverse, stringr

Recombine character strings and separate initials with a period using R and regex


I have a list of authors that are all formatted slightly differently. My goal is to extract the different components of every character string. The components are:

  • initials (usually all capital letters, each followed by a period)

  • last name (all words that start with a capital letter)

  • If present, prefixes of last name (all words that consist only of lowercase characters)

Next I want to put everything back together so that every author and list of authors are formatted in the exact same way. I prefer a solution using stringr.

Below is a small sample of my dataset.

library(tidyverse)

sample_list <- structure(
  list(
    Id = c(22667, 33807, 3625, 35531, 32051, 27721),
    auteur = c(
      "van Bar, A.J.M.",
      "de Vogel, J.J., R. Robbing, M.J. Smit & H.L.T. Bergsma-John",
      "Stark, M. ten",
      "Eeden, F.W. van",
      "Bouman, F. & F.D. Parker",
      "Sullock Enzlin, R.A.F. & J. Hoenselaar"
    ),
    jaartal = c(1938, 2016, 2002, 1889, 1997, 1991)
  ),
  row.names = c(NA, -6L),
  class = c("tbl_df", "tbl", "data.frame")
)

Because a single character value can have multiple authors, I first split the data so that every row refers to a single name. Then, using several regex statements, I extract the different components of each name. See below for my code so far.

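sample_list |>
  # Drop the first comma so "Last, Initials" stays together when splitting
  mutate(auteur = str_remove(auteur, pattern = ",")) |>
  # One row per author: split on ", " or " &"
  separate_longer_delim(cols = auteur, delim = regex("\\, | &")) |>
  mutate(
    auteur = str_squish(auteur),
    # Initials: first run of capitals and periods, stored without the periods
    initials = auteur |>
      str_extract(pattern = "(\\b[A-Z\\.]+\\b)") |>
      str_remove_all(pattern = "\\.") |>
      str_squish(),
    # Last name: all capitalised words, joined by a space
    last_name = sapply(
      str_extract_all(string = auteur, pattern = "\\b\\p{Lu}(?:\\p{Lu}*\\p{lu})?\\w+"),
      paste, collapse = " "
    ),
    # Prefix: first run of lowercase-only words (NA when absent)
    prefix = auteur |>
      str_extract(pattern = "\\b[a-z](?:[a-z ]*[a-z])?\\b") |>
      str_squish()
  )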

This works for me (also taking into account special characters, as is common in Eastern or Northern Europe), but I have trouble putting everything back together. I first need to standardise all the names. I want to do this by separating the initials with periods, followed by the prefix (if present) and finally the last name. For example: "A.J.M. van Bar" "J.J. de Vogel" "R. Robbing"

Expected output

In the end every row needs to represent all the authors of a single publication (by using the Id-column) where all the authors are separated by a comma or '&' symbol depending on the number of authors. I prefer a solution using stringr. The final expected output should be something like:

"A.J.M. van Bar"

"J.J. de Vogel, R. Robbing, M.J. Smit & H.L.T. Bergsma John"

"M. ten Stark"

"F.W. van Eeden"

"F. Bouman & F.D. Parker"

"R.A.F. Sullock Enzlin & J. Hoenselaar"
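To make the target format concrete, here is a minimal sketch of the recombination step using stringr. It assumes the components have already been extracted as in the code above (initials without periods, prefix possibly NA). The helper names `dot_initials`, `format_author` and `join_authors` are hypothetical and only serve as illustration.

```r
library(stringr)

# Hypothetical helper: put a period after every capital letter, "AJM" -> "A.J.M."
dot_initials <- function(x) {
  str_replace_all(x, "(\\p{Lu})", "\\1.")
}

# Hypothetical helper: initials, optional prefix, last name, in that order
format_author <- function(initials, prefix, last_name) {
  str_squish(
    str_c(dot_initials(initials), " ", str_replace_na(prefix, ""), " ", last_name)
  )
}

# Hypothetical helper: join authors with ", " and put " & " before the last one
join_authors <- function(x) {
  n <- length(x)
  if (n <= 1) return(x)
  str_c(str_c(x[-n], collapse = ", "), " & ", x[n])
}

format_author("AJM", "van", "Bar")
# → "A.J.M. van Bar"
format_author("R", NA_character_, "Robbing")
# → "R. Robbing"
join_authors(c("F. Bouman", "F.D. Parker"))
# → "F. Bouman & F.D. Parker"
```

In the pipeline, `format_author()` could be called inside `mutate()`, and `join_authors()` inside `group_by(Id) |> summarise(...)` to get one row per publication again.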


Solution

  • You can do it in a single line (albeit using a complex regex):

    sample_list %>%
      mutate(auteur = str_replace_all(auteur, "([^,]+),\\s([A-Z]\\.[A-Z]\\.([A-Z]\\.)?(?=$|,\\s?))", "\\2 \\1"))
    # A tibble: 2 × 3
         Id auteur                                                     jaartal
      <dbl> <chr>                                                        <dbl>
    1 22667 A.J.M. van Bar                                                1938
    2 33807 J.J. de Vogel, R. Robbing, M.J. Smit & H.L.T. Bergsma-John    2016
    

    How this works:

    Essentially, you divide each string into two capture groups: the first for the family name plus prefix, the second for the initials, with the constraint that the initials must be followed by either a comma or the end of the string. The backreferences \\1 and \\2 are then used to flip the two components:

    • ([^,]+): 1st capture group, matching one or more characters other than a comma (this is called a negated character class), to capture the family name plus prefix,

    • ,\\s: the comma and whitespace that follow the family name (both are dropped from the result because they are not enclosed in a capturing group)

    • ([A-Z]\\.[A-Z]\\.([A-Z]\\.)?(?=$|,\\s?)): 2nd capture group for the initials; this expression breaks down into the following components:

    - [A-Z]\\.[A-Z]\\.([A-Z]\\.)?: two or three capital letters, each followed by a dot, iff ...

    - (?=$|,\\s?): ... they are followed by either a comma plus optional whitespace or the end of the string (this constraint is imposed by a positive lookahead (?=...))
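To see what these constraints mean in practice, the sketch below applies the same pattern to three strings from the question's sample. Only the first has the "Last, I.I.I." layout the pattern expects. The other two are left unchanged: in one the initials are followed by a prefix rather than a comma or the end of the string, and in the other there is only a single initial. The variable names are just for illustration.

```r
library(stringr)

# Pattern copied from the answer above
pattern <- "([^,]+),\\s([A-Z]\\.[A-Z]\\.([A-Z]\\.)?(?=$|,\\s?))"

authors <- c(
  "van Bar, A.J.M.",           # initials at end of string: matched
  "Eeden, F.W. van",           # initials followed by a prefix: lookahead fails
  "Bouman, F. & F.D. Parker"   # single initial: needs at least two
)

flipped <- str_replace_all(authors, pattern, "\\2 \\1")
print(flipped)
# → "A.J.M. van Bar" "Eeden, F.W. van" "Bouman, F. & F.D. Parker"
```

So the one-liner is a good fit when every name follows the same layout. For mixed layouts like the full sample, extracting the components first is probably the more robust route.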