
Extracting university names from affiliation in PubMed data with R


I've been using the extremely useful rentrez package in R to get author, article ID and author affiliation information from the PubMed database. This works fine, but now I would like to extract information from the affiliation field. Unfortunately, the affiliation field is a largely unstructured, non-standardized string containing various types of information, such as the name of the university, the name of the department, the address and more, delimited by commas. A text-mining approach is therefore necessary to get any useful information out of this field.

I tried the easyPubMed package in combination with rentrez. Although easyPubMed can extract some information from the affiliation field (e.g. the email address, which is very useful), to my knowledge it cannot extract the university name. I also tried the pubmed.mineR package, but it does not provide university name extraction either. I started to experiment with grep and regex functions, but as I am no R expert I could not make this work.

I was able to find very similar threads that solve the issue with Python:

Regex for extracting names of colleges, universities, and institutes?

How to extract university/school/college name from string in python using regular expression?

Unfortunately, I do not know how to convert the Python regex to an R regex, as I am not familiar with Python.

Here is some example data:

PMID = c(121,122,123,124,125)
author=c("author1","author2","author3","author4","author5")
Affiliation = c("blabla,University Ghent,blablabla", "University Washington, blabla, blablabla, blablabalbalba","blabla,University of Florence,blabla", "University Chicago, Harvard University", "Oxford University")
df = data.frame(PMID, author, Affiliation, stringsAsFactors = FALSE)

df
  PMID  author                                              Affiliation
1  121 author1                        blabla,University Ghent,blablabla
2  122 author2 University Washington, blabla, blablabla, blablabalbalba
3  123 author3                     blabla,University of Florence,blabla
4  124 author4                   University Chicago, Harvard University
5  125 author5                                        Oxford University

What I would like to get:

  PMID  author  Affiliation                                               University
1  121 author1  blabla,University Ghent,blablabla                         University Ghent
2  122 author2  University Washington, blabla, blablabla, blablabalbalba  University Washington
3  123 author3  blabla,University of Florence,blabla                      University of Florence
4  124 author4  University Chicago, Harvard University                    University Chicago, Harvard University
5  125 author5  Oxford University                                         Oxford University

Apologies if there is already a solution online, but I searched extensively and did not find a clear solution for R. I would be very thankful for any hints or solutions to this task.


Solution

  • In general, regular expressions can be ported to R with a few changes. For example, taking the regex from the PHP link you included, you can create a new variable holding the extracted text. The only change needed is the escape character ("\\" instead of "\"), because a single backslash is itself an escape character inside an R string literal. So, using the dplyr and stringr packages:

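    library(dplyr)
    library(stringr)

    # str_extract() returns only the first match per row (NA if there is none).
    # "$" in the lookahead lets a name at the very end of the string match too.
    df <- df %>%
      mutate(Organization = str_extract(Affiliation,
          "([A-Z][^\\s,.]+[.]?\\s[(]?)*(College|University|Institute|Law School|School of|Academy)[^,\\d]*(?=,|\\d|$)"))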
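The str_extract() call keeps only the first match per row, whereas the desired output for PMID 124 lists two universities. As a rough sketch of one way to collect every match, the example below uses base R's gregexpr() and regmatches() with perl = TRUE, so it needs no extra packages. The pattern is the one from the answer, with "$" among the lookahead alternatives so that a name at the very end of a string also matches. The variable names pattern, matches and University are only illustrative, and the approach has been checked against nothing beyond the sample data from the question.

```r
# Sample affiliations from the question.
Affiliation <- c(
  "blabla,University Ghent,blablabla",
  "University Washington, blabla, blablabla, blablabalbalba",
  "blabla,University of Florence,blabla",
  "University Chicago, Harvard University",
  "Oxford University"
)

# Pattern from the answer, split over three pieces for readability.
# The lookahead accepts a comma, a digit, or the end of the string.
pattern <- paste0(
  "([A-Z][^\\s,.]+[.]?\\s[(]?)*",
  "(College|University|Institute|Law School|School of|Academy)",
  "[^,\\d]*(?=,|\\d|$)"
)

# gregexpr() locates every match in each string; perl = TRUE is required
# because the default regex engine does not support lookaheads.
matches <- regmatches(Affiliation, gregexpr(pattern, Affiliation, perl = TRUE))

# Collapse multiple hits into one comma-separated string per row;
# rows without any match become NA.
University <- vapply(
  matches,
  function(m) {
    if (length(m) == 0) NA_character_ else paste(trimws(m), collapse = ", ")
  },
  character(1)
)

print(University)
```

The result can be attached to the data frame with df$University <- University. With stringr, str_extract_all() should play the same role as the gregexpr()/regmatches() pair. Real affiliation strings are far messier than this sample, so the keyword list in the pattern will likely need extending.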