Tags: r, tm, pubmed, wordcloud2

Using R to analyse PubMed articles: creating a wordcloud and associating word frequencies with year of publication


MOST RECENT EDIT:

I have successfully created my required data frame containing pmid, year and abstract as columns from a literature search on PubMed. I then split this data frame into many separate ones by year, so I have multiple data frames, each with three columns: pmid, year and abstract. In total there are 4000 rows across all data frames.

Now I need to use the tm package to clean up my abstract columns, removing words I don't need, punctuation and so on. But I don't know how to do this on a data frame; I understand how it works on a text file.
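
For reference, this is roughly the tm + wordcloud2 workflow I already have working for a single text file (the file path and exact cleanup steps below are only an illustration):

library(tm)
library(wordcloud2)

# Read the abstracts exported from PubMed as plain text (illustrative path)
abstract_text <- readLines("pubmed_abstracts.txt")

# Build a corpus and clean it up
corp <- VCorpus(VectorSource(abstract_text))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removeWords, stopwords("english"))
corp <- tm_map(corp, stripWhitespace)

# Word frequencies and the wordcloud itself
tdm  <- TermDocumentMatrix(corp)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordcloud2(data.frame(word = names(freq), freq = unname(freq)))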

I want to output the frequencies of words appearing in the text, so that I can create a graph of words by year. I then want to create a wordcloud using wordcloud2.

I am happy to use any other suggested packages.

Here is my code so far for fetching the abstracts and splitting them by year:

library(easyPubMed)
library(dplyr)
library(kableExtra)

# Query PubMed
qr1 <- get_pubmed_ids("platinum resistant AND cancer")

# How many records are there?
print(qr1$Count)

# Query pubmed and fetch many results
my_query <- 'platinum resistant AND cancer' 
my_query <- get_pubmed_ids(my_query)

# Fetch data (retmax must be supplied; 7000 covers all records, as higher values returned errors)
my_abstracts_xml <- fetch_pubmed_data(my_query, retstart = 0, retmax = 7000)  

# Store Pubmed Records as elements of a list
all_xml <- articles_to_list(my_abstracts_xml)

# Record the starting time
t.start <- Sys.time()

# Convert each record to a data frame and bind the rows together (plain lapply, no further parameters)
final_df <- do.call(rbind, lapply(all_xml, article_to_df, 
                                  max_chars = -1, getAuthors = FALSE))

# Record the finishing time
t.stop <- Sys.time()

# How long did it take?
print(t.stop - t.start)

# Show an excerpt of the results
final_df[,c("pmid", "year", "abstract")]  %>%
  head() %>% kable() %>% kable_styling(bootstrap_options = 'striped')

# Reduce columns to those required for the overall wordcloud
wordcloud_df <- final_df[,c('pmid','year','abstract')]

# Split the data frame by year for analysis by year
year_dfs <- split(wordcloud_df, wordcloud_df$year)

# Expose each year as its own data frame (df1976, ..., df2022), as before
list2env(setNames(year_dfs, paste0("df", names(year_dfs))), envir = .GlobalEnv)

ORIGINAL POST: I am very new to programming and R in general. As part of my project, I would like to create a wordcloud, which I have managed to test and get working (it still needs proper cleaning). But now I want to do something different.

If I search for my terms on PubMed, I get roughly 7000 articles. I can download all the abstracts to my computer, put them in a txt file and then (just about) make my wordcloud.

However, I now want to track how frequently the terms I find appear over the years, so I can see how the research focus is changing over time. This is where I am stuck.

Whilst I can get the abstracts, how do I associate each abstract with a year and then get term frequencies per year?

I found the easyPubMed package, but I don't think I can do what I want with it. Any suggestions?

Thank you!

(I'm currently using wordcloud2 + tm.)

I have tried running easyPubMed, but I'm not quite sure how to get it to do what I want; it may not even be the right package. I have also tried downloading directly from PubMed, but I cannot download both the abstract and the year together in a separate file. There is an option to download an Excel file, but it only contains the year, author, PubMed ID and a couple of other fields, not the abstract. Otherwise I probably could have used that file.


Solution

  • Here is a possible implementation using regular expressions to extract the PMID from the text, for later joining with the other csv file:

    library(tidyverse)
    
    #Fill below in
    txtpath <- "/Users/davidcsuka/Downloads/abstract-coaptite-set.txt"
    
    textdf <- read_file(txtpath) %>%
      str_split("(?=\\n\\d\\. )") %>%
      unlist() %>%
      setNames(str_extract(., "(?<=PMID: )\\d+")) %>%
      enframe(name = "PMID")
    
    #Fill below in
    csvpath <- "/Users/davidcsuka/Downloads/csv-coaptite-set.csv"
    
    df <- read_csv(csvpath) %>%
      mutate(PMID = as.character(PMID)) %>%
      left_join(textdf, by = "PMID")
    
    view(df)
    

    You could later run the text-mining functions on the text column of the data frame, or you could process the texts as a vector first and turn them into a data frame later. Let me know if this works.

    EDIT:

    dfnew <- df %>%
      group_by(year) %>%
      summarise(newtext = paste(abstract, collapse = " "))  # collapse so each year yields one combined text
    
    
    
    textmine <- function(onetext) {
      # Write a text-mining function for one year's combined abstracts (see the sketch below)
    }
    
    #verify function with:
    textmine(dfnew$newtext[1])
    
    #get all results with:
    
    results <- lapply(dfnew$newtext, textmine)
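
    As a rough sketch (assuming the tm package, and that textmine() should return a word-frequency table for one year's combined abstracts), the function could look something like this:

    library(tm)

    textmine <- function(onetext) {
      # Clean one year's combined abstracts and count word frequencies
      corp <- VCorpus(VectorSource(onetext))
      corp <- tm_map(corp, content_transformer(tolower))
      corp <- tm_map(corp, removePunctuation)
      corp <- tm_map(corp, removeNumbers)
      corp <- tm_map(corp, removeWords, stopwords("english"))
      corp <- tm_map(corp, stripWhitespace)
      tdm  <- TermDocumentMatrix(corp)
      freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
      data.frame(word = names(freq), freq = unname(freq))
    }

    # Name the per-year results and stack them for a "words by year" graph;
    # a single year's table can also be passed straight to wordcloud2::wordcloud2()
    names(results) <- dfnew$year
    freq_by_year <- bind_rows(results, .id = "year")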