I have journal data overview
with year, issue and title information. I have scraped some_content
and am looking for a possible merge. There are two problems for me:
overview
can be considered complete. But my scrape has several missings. Whenever there is information on a title in an issue in a year, it should be merged.The data pretty much looks like this:
# A title like "A" can re-occur in different issues, in different years.
# A title is unique within one year-issue.
overview <- data.frame(Year = c(rep(2018,4), rep(2019,4)),
Issue = c(1,1,1,2,1,2,3,3),
Title = c("A", "B", "F", "A", "F", "L", "A", "F"))
Year Issue Title
1 2018 1 A
2 2018 1 B
3 2018 1 F
4 2018 2 A
5 2019 1 F
6 2019 2 L
7 2019 3 A
8 2019 3 F
# The scraped titles include punctuation, like . ! ?
some_content <- data.frame(Year = c(2018, 2018, 2019, 2019, 2019),
Issue = c(1,1,2,3,3),
Title = c("A.", "B!", "L?", "A.", "F"),
Content = c("helloworld", NA, "match", "lorem", NA))
Year Issue Title Content
1 2018 1 A. helloworld
2 2018 1 B! <NA>
3 2019 2 L? match
4 2019 3 A. lorem
5 2019 3 F <NA>
Let me tell you the story of "helloworld". In year 2018, issue 1 there are multiple title. Title A
in overview
definitely corresponds to A.
in some_content
even though it is not completely identical. Whenever a title per year-issue combination from overview
can be string detected in the year-issue combination from some_content
the Content
from some_content
should be merged to the overview
dataframe. The result should look like this:
merge_data <- data.frame(Year = c(rep(2018,4), rep(2019,4)),
Issue = c(1,1,1,2,1,2,3,3),
Title = c("A", "B", "F", "A", "F", "L", "A", "F"),
Content = c("helloworld", NA, NA, NA, NA, "match", "lorem", NA))
Year Issue Title Content
1 2018 1 A helloworld
2 2018 1 B <NA>
3 2018 1 F <NA>
4 2018 2 A <NA>
5 2019 1 F <NA>
6 2019 2 L match
7 2019 3 A lorem
8 2019 3 F <NA>
First I would suggest removing the punctuations using this:
some_content$Title <- gsub("[[:punct:]]", "", some_content$Title)
After that you can do a simple left_join
like this:
library(dplyr)
left_join(overview, some_content, by = c("Year", "Issue", "Title"))
Output:
Year Issue Title Content
1 2018 1 A helloworld
2 2018 1 B <NA>
3 2018 1 F <NA>
4 2018 2 A <NA>
5 2019 1 F <NA>
6 2019 2 L match
7 2019 3 A lorem
8 2019 3 F <NA>