I have a list of authors that are all formatted slightly differently. My goal is to extract the different components of every character string. The different components are:
initials (usually all capital letters, each followed by a period)
last name (all words that start with a capital letter)
If present, prefixes of the last name (all words that consist only of lowercase characters)
Next I want to put everything back together so that every author and list of authors are formatted in the exact same way. I prefer a solution using stringr.
Below is a small sample of my dataset.
library(tidyverse)
sample_list <- structure(list(Id = c(22667, 33807, 3625, 35531, 32051, 27721), auteur = c("van Bar, A.J.M.", "de Vogel, J.J., R. Robbing, M.J. Smit & H.L.T. Bergsma-John", "Stark, M. ten", "Eeden, F.W. van", "Bouman, F. & F.D. Parker", "Sullock Enzlin, R.A.F. & J. Hoenselaar"), jaartal = c(1938, 2016, 2002, 1889, 1997, 1991)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
Because a single character value can have multiple authors, I first split the data so that every row refers to a single name. Then, using several regex statements, I extract the different components of each name. See below for my code so far.
sample_list |>
  mutate(
    auteur = str_remove(auteur, pattern = ",")) |>
  separate_longer_delim(cols = auteur, delim = regex("\\, | &")) |>
  mutate(
    auteur = str_squish(auteur),
    initials = auteur |>
      str_extract(pattern = "(\\b[A-Z\\.]+\\b)") |>
      str_remove_all(pattern = "\\.") |>
      str_squish(),
    last_name = sapply(
      str_extract_all(string = auteur, pattern = "\\b\\p{Lu}(?:\\p{Lu}*\\p{lu})?\\w+"),
      paste, collapse = ' '),
    prefix = auteur |>
      str_extract(pattern = "\\b[a-z](?:[a-z ]*[a-z])?\\b") |>
      str_squish()
  )
This works for me (also taking into account special characters, as are common in Eastern or Northern Europe), but I have trouble putting everything back together. I first need to standardise all the names. I want to do this by writing the initials, each followed by a period, then the prefix (if present), and lastly the last name. For example: "A.J.M. van Bar" "J.J. de Vogel" "R. Robbing"
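As a minimal sketch of that standardised format, the three components could be glued back together with stringr along these lines. The helper name `format_author` is hypothetical, and it assumes that `initials` holds bare capitals without periods (as produced by the `str_remove_all()` step above) and that `prefix` is `NA` when absent.

```r
library(stringr)

# Hypothetical helper: rebuild "A.J.M. van Bar" from its components.
# Assumes `initials` holds bare capitals (e.g. "AJM") and `prefix` may be NA.
format_author <- function(initials, prefix, last_name) {
  # Put a period after every capital letter: "AJM" -> "A.J.M."
  dotted <- str_replace_all(initials, "(\\p{Lu})", "\\1.")
  # Treat a missing prefix as an empty string so it drops out cleanly
  prefix <- str_replace_na(prefix, "")
  # Join the parts and collapse the doubled space left by an empty prefix
  str_squish(str_c(dotted, prefix, last_name, sep = " "))
}

format_author(
  initials  = c("AJM", "JJ", "R"),
  prefix    = c("van", "de", NA),
  last_name = c("Bar", "Vogel", "Robbing")
)
# "A.J.M. van Bar" "J.J. de Vogel"  "R. Robbing"
```

Because the helper is vectorised, it could be called inside `mutate()` on the `initials`, `prefix` and `last_name` columns.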
Expected output
In the end every row needs to represent all the authors of a single publication (by using the Id-column) where all the authors are separated by a comma or '&' symbol depending on the number of authors. I prefer a solution using stringr. The final expected output should be something like:
"A.J.M. van Bar"
"J.J. de Vogel, R. Robbing, M.J. Smit & H.L.T. Bergsma John"
"M. ten Stark"
"F.W. van Eeden"
"F. Bouman & F.D. Parker"
"R.A.F. Sullock Enzlin & J. Hoenselaar"
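The joining rule described above (commas between authors, with '&' before the last one) might be sketched as follows. The helper name `join_authors` is hypothetical, and it assumes its input is a character vector of names for one publication that are already in the standardised format.

```r
library(stringr)

# Hypothetical helper: join already formatted names for one publication.
join_authors <- function(names) {
  n <- length(names)
  # A single author needs no separator
  if (n <= 1) return(names)
  # All but the last author are comma-separated; the last is attached with "&"
  str_c(str_c(names[-n], collapse = ", "), " & ", names[n])
}

join_authors("A.J.M. van Bar")
# "A.J.M. van Bar"
join_authors(c("F. Bouman", "F.D. Parker"))
# "F. Bouman & F.D. Parker"
join_authors(c("J.J. de Vogel", "R. Robbing", "M.J. Smit", "H.L.T. Bergsma John"))
# "J.J. de Vogel, R. Robbing, M.J. Smit & H.L.T. Bergsma John"
```

Called once per `Id` inside a grouped `summarise()`, a helper like this would return one string per publication.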
You can do it in a single line (albeit using a complex regex):
sample_list %>%
  mutate(auteur = str_replace_all(auteur, "([^,]+),\\s([A-Z]\\.[A-Z]\\.([A-Z]\\.)?(?=$|,\\s?))", "\\2 \\1"))
# A tibble: 2 × 3
     Id auteur                                                     jaartal
  <dbl> <chr>                                                        <dbl>
1 22667 A.J.M. van Bar                                                1938
2 33807 J.J. de Vogel, R. Robbing, M.J. Smit & H.L.T. Bergsma-John    2016
How this works:
Essentially, you divide each string into two capture groups: the first for the family name plus prefix, the second for the initials, under the constraint that the initials must be followed by either a comma or the end of the string. The backreferences \\1 and \\2 then flip the two components:
([^,]+): 1st capture group, matching one or more characters other than a comma (this is called a negated character class), to capture the family name plus prefix
,\\s: the intervening comma and whitespace (dropped from the result because it is not enclosed in a capturing group)
([A-Z]\\.[A-Z]\\.([A-Z]\\.)?(?=$|,\\s?)): 2nd capture group, for the initials; this expression has the following components:
- [A-Z]\\.[A-Z]\\.([A-Z]\\.)?: two or three capital letters, each followed by a dot, iff ...
- (?=$|,\\s?): ... they are followed by either a comma plus optional whitespace or the end of the string (this constraint is imposed by a positive lookahead, (?=...))
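To see the two capture groups and the lookahead at work, the pattern can be tried on a few plain strings. This is only an illustrative sketch: the variable names `swap_pattern` and `authors` are made up here, and the strings are taken from the sample data (the second one shortened).

```r
library(stringr)

# The pattern from the answer, stored once for reuse
swap_pattern <- "([^,]+),\\s([A-Z]\\.[A-Z]\\.([A-Z]\\.)?(?=$|,\\s?))"

authors <- c(
  "van Bar, A.J.M.",            # initials at end of string: lookahead matches `$`
  "de Vogel, J.J., R. Robbing", # initials followed by a comma: lookahead matches `,`
  "Eeden, F.W. van"             # initials followed by a prefix: lookahead fails
)

# \\2 (initials) is written first, then \\1 (family name plus prefix)
str_replace_all(authors, swap_pattern, "\\2 \\1")
# "A.J.M. van Bar"  "J.J. de Vogel, R. Robbing"  "Eeden, F.W. van"
```

The third string comes back unchanged, which shows that the pattern only swaps names written as family name, comma, then two or three initials; names with a trailing prefix or a single initial would need separate handling.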