
Parsing text for analysis in R


I have a .txt file containing short articles, and I want to use R to parse it into a data frame with one row per line of article text, along with the date, author, journal, title, and line number for each line. Every article follows the same structure, in this format:

This is a Title  
December 15, 2005 | Publisher  
Author: JANE DOE  
Section: Movies and More  
2554 Words
Page: C3  
OpenURL  
Link  

Text Text Text Text   
Another line of text  
One more thing  
End of article.   

Citation (asa Style)  
DOE, JANE. 2005. "This is a Title," Publisher, December 15, pp.C3.

Different Title  
December 18, 2005 | Publisher  
Author: JOHN DOE  
Section: News 
662 Words
Page: C8  
OpenURL  
Link  

Here is more text   
It is still text
But also shorter.  

Citation (asa Style)  
DOE, JOHN. 2005. "Different Title," Publisher, December 18, pp.C8.

For each article, I want to extract the author, the date published, the journal, the title, and each line of text to create a data frame that looks like this:

Date           Journal       Title             Author            Line              Text
15-Dec-2005    Publication   This is a title   Doe, Jane         1                 Text Text Text Text
15-Dec-2005    Publication   This is a title   Doe, Jane         2                 Another line of text
15-Dec-2005    Publication   This is a title   Doe, Jane         3                 One more thing
15-Dec-2005    Publication   This is a title   Doe, Jane         4                 End of article.
18-Dec-2005    Publication   Different Title   Doe, John         1                 Here is more text 
18-Dec-2005    Publication   Different Title   Doe, John         2                 It is still text
18-Dec-2005    Publication   Different Title   Doe, John         3                 But also shorter.

I then want to use the code below to transform the data frame above (let's call it text_df) into tidy text format, with one token per row:

library(dplyr)     # provides the %>% pipe
library(tidytext)
tidy_dat <- text_df %>%
  unnest_tokens(word, text)

I understand this is a big ask. Any help would be greatly appreciated.


Solution

  • Because there are many fields to extract, I've only worked out the core idea here (the text and the author); the remaining fields can be extracted the same way :)

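    First, load the tidyverse and read in the articles:

    library(tidyverse)
    # read.delim drops blank lines by default, which the steps below rely on.
    # quote = "" stops quotation marks in the text from being read as field delimiters.
    articles <- read.delim("C:/Your_Path/temp.txt",
                          stringsAsFactors = FALSE, header = FALSE,
                          quote = "")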
    

    We can use grep to get the positions of the "Link" and "Citation" lines. Anchoring the pattern with ^ avoids matching article text that merely contains one of those words.

    positions <- grep(pattern = "^Link|^Citation", 
                     x = articles$V1)
    

    Because Link always comes before Citation, we can split the positions into a list of pairs, one (Link, Citation) pair per article.

    positions <- split(positions, ceiling(seq_along(positions)/2))
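
To see what that split does, here is a minimal sketch on a hand-written vector. The indices are hypothetical: they are what grep would return for the two sample articles, assuming blank lines were dropped on read.

```r
# Hypothetical grep() result: Link, Citation, Link, Citation
positions <- c(8L, 13L, 22L, 26L)

# ceiling(i / 2) maps elements 1,2 to group 1 and elements 3,4 to group 2
group_id <- ceiling(seq_along(positions) / 2)

# split() then returns one (Link, Citation) pair per article
pairs <- split(positions, group_id)
pairs
```

Each element of `pairs` holds the line just before the article text starts and the line just after it ends.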
    

    Now we can extract the author positions using the same idea (grep).

    authors <- grep(pattern = "^Author:", 
                   x = articles$V1)
    

    It's always good to check that your vectors/lists have the same length. That way, you can see whether you extracted more authors than Link/Citation pairs (or fewer).

    length(authors) == length(positions)
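
A length check alone will not catch markers that are out of order. The sketch below adds a few ordering checks; `check_structure` is a hypothetical helper written for this answer (not part of any package), and the indices are the ones expected for the sample file.

```r
# Hypothetical helper: TRUE only if every article has exactly one
# Author line, followed by a Link line, followed by a Citation line.
check_structure <- function(positions, authors) {
  if (length(authors) != length(positions)) return(FALSE)
  if (!all(lengths(positions) == 2)) return(FALSE)
  starts <- vapply(positions, function(p) p[1], integer(1))  # Link lines
  ends   <- vapply(positions, function(p) p[2], integer(1))  # Citation lines
  all(authors < starts) &&
    all(starts < ends) &&
    all(head(ends, -1) < authors[-1])  # next article starts after previous ends
}

positions <- list(c(8L, 13L), c(22L, 26L))
authors   <- c(3L, 17L)
check_structure(positions, authors)
```

If this returns FALSE on real data, it usually means a marker word also appears somewhere it was not expected, so the grep patterns need tightening.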
    

    I'm assuming you will have more than two inputs to iterate over (text, authors, publisher, year, etc.), so I use purrr::pmap. Like map/map2, pmap iterates over one or more lists/vectors in parallel and applies a function. Here, on each iteration the function receives one pair of positions and the matching author index, and pulls the corresponding lines out of articles.

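    books <- purrr::pmap(
      list(positions, authors),
      function(position, author) {
        data.frame(
          # Text lines sit strictly between "Link" and "Citation"
          text = articles$V1[seq(position[1] + 1, position[2] - 1, 1)],
          # The single author line is recycled across all text rows
          author = articles$V1[author],
          stringsAsFactors = FALSE
        )
      })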
    

    As pmap returns a list, we can bind the data.frames within the list into one data.frame.

    do.call(rbind.data.frame, books)
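
Because the list is named, rbind builds row names such as 1.1 and 2.3 from the list names. As a small sketch (using a hand-made stand-in for `books`, and a new `article` column that is my own addition), those row names can be turned into an article id and the "Author: " prefix stripped:

```r
# Hand-made stand-in for the list returned by pmap
books <- list(
  `1` = data.frame(text = c("Text Text Text Text", "Another line of text"),
                   author = "Author: JANE DOE  ", stringsAsFactors = FALSE),
  `2` = data.frame(text = c("Here is more text", "It is still text"),
                   author = "Author: JOHN DOE  ", stringsAsFactors = FALSE)
)

combined <- do.call(rbind.data.frame, books)

# Row names look like "1.1", "1.2", "2.1", ...; keep the part before the dot
combined$article <- as.integer(sub("\\..*$", "", rownames(combined)))
rownames(combined) <- NULL

# Drop the "Author: " label and any trailing whitespace
combined$author <- trimws(sub("^Author:\\s*", "", combined$author))
combined
```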
    

    Result:

                          text             author
    1.1 Text Text Text Text    Author: JANE DOE  
    1.2 Another line of text   Author: JANE DOE  
    1.3       One more thing   Author: JANE DOE  
    1.4     End of article.    Author: JANE DOE  
    2.1   Here is more text    Author: JOHN DOE  
    2.2       It is still text Author: JOHN DOE  
    2.3    But also shorter.   Author: JOHN DOE  
    

    Now you can run whatever tidytext analyses you want.
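
For the remaining fields, here is one possible base R sketch of the same idea, not a definitive implementation. It assumes the layout in the question holds for every article: the title sits two lines above the "Author:" line, the "date | journal" line sits one line above it, and author names are written "FIRST LAST". The helpers `parse_date`, `format_author`, and `parse_article` are hypothetical names of my own, and the lowercase column names are chosen to match unnest_tokens(word, text).

```r
# Sample input with blank lines already dropped (as read.delim does by default)
lines <- trimws(c(
  "This is a Title", "December 15, 2005 | Publisher", "Author: JANE DOE",
  "Section: Movies and More", "2554 Words", "Page: C3", "OpenURL", "Link",
  "Text Text Text Text", "Another line of text", "One more thing",
  "End of article.", "Citation (asa Style)",
  "DOE, JANE. 2005. \"This is a Title,\" Publisher, December 15, pp.C3.",
  "Different Title", "December 18, 2005 | Publisher", "Author: JOHN DOE",
  "Section: News", "662 Words", "Page: C8", "OpenURL", "Link",
  "Here is more text", "It is still text", "But also shorter.",
  "Citation (asa Style)",
  "DOE, JOHN. 2005. \"Different Title,\" Publisher, December 18, pp.C8."
))

# "December 15, 2005" -> Date, using month.name so the locale does not matter
parse_date <- function(x) {
  parts <- regmatches(x, regexec("^([A-Za-z]+) ([0-9]{1,2}), ([0-9]{4})$", x))[[1]]
  as.Date(sprintf("%s-%02d-%02d", parts[4],
                  match(parts[2], month.name), as.integer(parts[3])))
}

# "JANE DOE" -> "Doe, Jane" (assumes the last word is the surname)
format_author <- function(x) {
  words <- strsplit(x, "\\s+")[[1]]
  words <- paste0(toupper(substring(words, 1, 1)), tolower(substring(words, 2)))
  n <- length(words)
  if (n < 2) return(words)
  paste0(words[n], ", ", paste(words[-n], collapse = " "))
}

author_idx   <- grep("^Author:", lines)
link_idx     <- grep("^Link$", lines)
citation_idx <- grep("^Citation", lines)
stopifnot(length(author_idx) == length(link_idx),
          length(link_idx) == length(citation_idx))

# Build the rows for one article from its three marker positions
parse_article <- function(author_i, link_i, cite_i) {
  header <- trimws(strsplit(lines[author_i - 1], "|", fixed = TRUE)[[1]])
  body <- lines[link_i + seq_len(cite_i - link_i - 1)]  # lines between markers
  n <- length(body)
  data.frame(
    date    = rep(parse_date(header[1]), n),
    journal = rep(header[2], n),
    title   = rep(lines[author_i - 2], n),
    author  = rep(format_author(sub("^Author:\\s*", "", lines[author_i])), n),
    line    = seq_len(n),
    text    = body,
    stringsAsFactors = FALSE
  )
}

text_df <- do.call(rbind, Map(parse_article, author_idx, link_idx, citation_idx))
text_df
```

With tidytext loaded, `text_df %>% unnest_tokens(word, text)` should then give the one-token-per-row format asked for in the question. On a real file, replace the hand-written `lines` vector with the column read from disk.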