I have a .txt file containing short articles, and I want to use R to parse each article and build a data frame with the date, author, journal, title, line number, and text for every line of text in every article. Each article repeats the same structure and takes the following format:
This is a Title
December 15, 2005 | Publisher
Author: JANE DOE
Section: Movies and More
2554 Words
Page: C3
OpenURL
Link
Text Text Text Text
Another line of text
One more thing
End of article.
Citation (asa Style)
DOE, JANE. 2005. "This is a Title," Publisher, December 15, pp.C3.
Different Title
December 18, 2005 | Publisher
Author: JOHN DOE
Section: News
662 Words
Page: C8
OpenURL
Link
Here is more text
It is still text
But also shorter.
Citation (asa Style)
DOE, JOHN. 2005. "Different Title," Publisher, December 18, pp.C8.
For each article, I want to extract the author, the date published, the journal, and each line of text to create a data frame that looks like this:
Date Journal Title Author Line Text
15-Dec-2005 Publication This is a title Doe, Jane 1 Text Text Text Text
15-Dec-2005 Publication This is a title Doe, Jane 2 Another line of text
15-Dec-2005 Publication This is a title Doe, Jane 3 One more thing
15-Dec-2005 Publication This is a title Doe, Jane 4 End of article.
18-Dec-2005 Publication Different Title Doe, John 1 Here is more text
18-Dec-2005 Publication Different Title Doe, John 2 It is still text
18-Dec-2005 Publication Different Title Doe, John 3 But also shorter.
I then want to use the code below to transform that data frame (let's call it text_df) into tidy text format, i.e. restructured into a one-token-per-row format:
library(dplyr)
library(tidytext)
tidy_dat <- text_df %>%
  unnest_tokens(word, text)
I understand this is a big ask. Any help would be greatly appreciated.
Because you have many fields that you want to extract, I'll lay out the main idea - from there you can take it the rest of the way :)
First, load the tidyverse and read in the article file:
library(tidyverse)
articles <- read.delim("C:/Your_Path/temp.txt",
                       stringsAsFactors = FALSE, header = FALSE)
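If read.delim happens to trip over embedded quotes or tabs in the text lines, a simpler alternative (my suggestion, not part of the original approach) is readLines, which takes every line verbatim; wrapping it in a data frame keeps the rest of the code unchanged:
articles <- data.frame(V1 = readLines("C:/Your_Path/temp.txt"),
                       stringsAsFactors = FALSE)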
We can use grep to get the positions of "Link" and "Citation" in the text.
positions <- grep(pattern = "Link|Citation",
                  x = articles$V1)
Because "Link" always comes before "Citation", we can split the positions into a list of pairs.
positions <- split(positions, ceiling(seq_along(positions)/2))
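With the two sample articles above, positions should come out as a list of two start/end pairs, roughly like this (the exact numbers depend on your file, and this assumes no body line itself contains "Link" or "Citation"):
positions
# $`1`
# [1]  8 13
# $`2`
# [1] 22 26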
Now, we can extract the author positions using the same idea (grep).
authors <- grep(pattern = "Author",
                x = articles$V1)
It's always good to check that your vectors/lists are the same length. That way, you can see whether you extracted more authors than Link/Citation pairs.
length(authors) == length(positions)
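Note that the grep above keeps the whole "Author: JANE DOE" line. If you only want the name, a sub() call strips the prefix (a small sketch, assuming every author line starts with "Author: "):
author_names <- sub("^Author:\\s*", "", articles$V1[authors])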
As I'm assuming you have more than two arguments to run with (text, authors, publication, year, etc.), I use purrr::pmap. pmap, similarly to map/map2, runs over one or more lists/vectors and applies a function. In this case, the function takes (each time) the lines of articles that correspond to positions and authors:
books <- purrr::pmap(
  list(positions, authors),
  function(position, author) {
    cbind(
      # the article text sits between the "Link" line and the "Citation" line
      data.frame(text = articles[seq(position[1] + 1, position[2] - 1, 1), ]),
      # repeat the matching author line alongside every text line
      data.frame(author = articles[author, ]))})
As pmap returns a list, we can bind the data.frames within the list into one data.frame.
do.call(rbind.data.frame, books)
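Since the tidyverse is already loaded, dplyr::bind_rows() does the same job and can additionally tag each row with the article it came from, which helps with the per-article line numbers you want later (the column name "article" is just my choice here):
text_df <- bind_rows(books, .id = "article")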
Result:
text author
1.1 Text Text Text Text Author: JANE DOE
1.2 Another line of text Author: JANE DOE
1.3 One more thing Author: JANE DOE
1.4 End of article. Author: JANE DOE
2.1 Here is more text Author: JOHN DOE
2.2 It is still text Author: JOHN DOE
2.3 But also shorter. Author: JOHN DOE
Now, you can do whatever tidytext analyses you want.
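For example, picking up the unnest_tokens() call from your question: if you used the bind_rows() variant above (so a text_df with an article column exists), you can add a per-article line number and then tokenise. This is only a sketch of that last step, not the full date/title/journal extraction:
library(tidytext)
tidy_dat <- text_df %>%
  group_by(article) %>%
  mutate(line = row_number()) %>%  # line number within each article
  ungroup() %>%
  unnest_tokens(word, text)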