I've been using the extremely useful rentrez package in R to get the author, article ID and author affiliation from the PubMed database. This works fine, but now I would like to extract information from the affiliation field. Unfortunately, the affiliation field is a largely unstructured, non-standardized string containing various kinds of information, such as the university name, department name, address and more, delimited by commas. A text mining approach is therefore necessary to get any useful information out of this field.
I tried the easyPubmed package in combination with rentrez. Although easyPubmed can extract some information from the affiliation field (e.g. the email address, which is very useful), to my knowledge it cannot extract the university name. I also tried the pubmed.mineR package, but it does not provide university name extraction either. I started to experiment with grep and regex functions, but as I am no R expert I could not make them work.
I was able to find very similar threads that solve the issue with Python:
Regex for extracting names of colleges, universities, and institutes?
How to extract university/school/college name from string in python using regular expression?
Unfortunately, I do not know how to convert the Python regex function to an R regex function, as I am not familiar with Python.
Here is some example data:
PMID = c(121,122,123,124,125)
author=c("author1","author2","author3","author4","author5")
Affiliation = c("blabla,University Ghent,blablabla", "University Washington, blabla, blablabla, blablabalbalba","blabla,University of Florence,blabla", "University Chicago, Harvard University", "Oxford University")
df = as.data.frame(cbind(PMID,author,Affiliation))
df
   PMID  author   Affiliation
1  121   author1  blabla,University Ghent,blablabla
2  122   author2  University Washington, blabla, blablabla, blablabalbalba
3  123   author3  blabla,University of Florence,blabla
4  124   author4  University Chicago, Harvard University
5  125   author5  Oxford University
What I would like to get:
   PMID  author   Affiliation                                               University
1  121   author1  blabla,University Ghent,blablabla                         University Ghent
2  122   author2  University Washington, blabla, blablabla, blablabalbalba  University Washington
3  123   author3  blabla,University of Florence,blabla                      University of Florence
4  124   author4  University Chicago, Harvard University                    University Chicago, Harvard University
5  125   author5  Oxford University                                         Oxford University
Apologies if a solution already exists online; I searched extensively and did not find a clear one for R. I would be very grateful for any hints or solutions.
In general, regular expressions can be ported to R with a few changes. For example, the regex from the PHP link you included works in R if you only change the escape character ("\\" instead of "\"). The code below creates a new variable with the extracted text, using the dplyr and stringr packages:
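library(dplyr)
library(stringr)
# str_extract() returns only the first match in each string.
# "|$" in the lookahead lets a name at the very end of the string match too (e.g. "Oxford University").
df <- df %>%
  mutate(Organization = str_extract(Affiliation,
    "([A-Z][^\\s,.]+[.]?\\s[(]?)*(College|University|Institute|Law School|School of|Academy)[^,\\d]*(?=,|\\d|$)"))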