Search code examples
rmergeleft-join

Merge data by group if available ignore otherwise


I have journal data overview with year, issue and title information. I have scraped some_content and am looking for a possible merge. There are two problems for me:

  • Missing data. The overview can be considered complete. But my scrape has several missings. Whenever there is information on a title in an issue in a year, it should be merged.
  • No perfect title match. Titles are shorter in my complete overview in the sense, they don't have punctuation (in one version I removed punctuation, but I feel it's not save since punctuation is not always at the end of a title).

The data pretty much looks like this:

# A title like "A" can re-occur in different issues, in different years.
# A title is unique within one year-issue.
overview <- data.frame(Year = c(rep(2018,4), rep(2019,4)), 
                        Issue = c(1,1,1,2,1,2,3,3), 
                        Title = c("A", "B", "F", "A", "F", "L", "A", "F"))

  Year Issue Title
1 2018     1     A
2 2018     1     B
3 2018     1     F
4 2018     2     A
5 2019     1     F
6 2019     2     L
7 2019     3     A
8 2019     3     F

# The scraped titles include punctuation, like .  !  ?
some_content <- data.frame(Year = c(2018, 2018, 2019, 2019, 2019), 
                             Issue = c(1,1,2,3,3), 
                             Title = c("A.", "B!", "L?", "A.", "F"),
                             Content = c("helloworld", NA, "match", "lorem", NA))

  Year Issue Title    Content
1 2018     1    A. helloworld
2 2018     1    B!       <NA>
3 2019     2    L?      match
4 2019     3    A.      lorem
5 2019     3     F       <NA>

Let me tell you the story of "helloworld". In year 2018, issue 1 there are multiple title. Title A in overview definitely corresponds to A. in some_content even though it is not completely identical. Whenever a title per year-issue combination from overview can be string detected in the year-issue combination from some_content the Content from some_content should be merged to the overview dataframe. The result should look like this:

merge_data <- data.frame(Year = c(rep(2018,4), rep(2019,4)), 
                         Issue = c(1,1,1,2,1,2,3,3), 
                         Title = c("A", "B", "F", "A", "F", "L", "A", "F"),
                         Content = c("helloworld", NA, NA, NA, NA, "match", "lorem", NA))

  Year Issue Title    Content
1 2018     1     A helloworld
2 2018     1     B       <NA>
3 2018     1     F       <NA>
4 2018     2     A       <NA>
5 2019     1     F       <NA>
6 2019     2     L      match
7 2019     3     A      lorem
8 2019     3     F       <NA>

Solution

  • First I would suggest removing the punctuations using this:

    some_content$Title <- gsub("[[:punct:]]", "", some_content$Title)
    

    After that you can do a simple left_join like this:

    library(dplyr)
    left_join(overview, some_content, by = c("Year", "Issue", "Title"))
    

    Output:

      Year Issue Title    Content
    1 2018     1     A helloworld
    2 2018     1     B       <NA>
    3 2018     1     F       <NA>
    4 2018     2     A       <NA>
    5 2019     1     F       <NA>
    6 2019     2     L      match
    7 2019     3     A      lorem
    8 2019     3     F       <NA>