Search code examples
rdataframemerging-data

Matching up responses of two different data sets and adding parts of one to the other


I have two unrelated datasets in R.

Dataset A contains hundreds of words given as responses in an experiment. Some of the words come up several times, there are 25 responses per participant, with 112 participants. It is formatted like this:

ID  Duration  Gender  Age  response1  response1_rt  response2  response2_rt  response3  response3_rt ...
1     2910      2     26     word        3.55          apple      2.89          house       4.21     ...
2     1873      1     20     paper       2.53          hungry     3.25          tent        2.55     ...
...   ...         ...           ...        ...           ...         ...      ...
112   3420      2     18     pen         1.83          salad      4.26          family      3.22     ...

Dataset B contains thousands of words which have been rated on various dimensions. It is formatted like this:

        Word        measure1  measure2  measure3 ...
1       aardvark    6.26       2.21      19.00   ...
2       abalone     5.3        1.59      20.00   ...
...     ...         ...        ...       ...     ...
8690    paper       5.42       1.62      40      ...
...     ...         ...        ...       ...     ... 
13915   zucchini    6.3        2.36      20      ...

I want to add new columns to Dataset A beside each response, showing the value given to the word in Dataset B measure1, or NA if the word doesn't appear in the dataset...

So for example:

ID   Duration  Gender  Age  response1   measure1.r1   response1_rt  response2  measure1.r2  response2_rt  response3...
1     2910       2     26    word         5.05         3.55          apple        NA        2.89          house
2     1873       1     20    paper        5.42         2.53          hungry       3.54      3.25          tent
...
112   3420       2     18    pen          5.63         1.83          salad        6.35      4.26          family

Any advice on how I can do this?

EDIT: Example data

dataset_a <- structure(list(Duration = c("2910", "3515", "5096", 
"2702", "2175", "2162"), Gender = c("2", "2", "2", "2", "2", "3"
), Age = c("19", "18", "38", "27", "20", "20"), response1 = c("Clock", 
"clock", "clock", "space", "clock", "clock"), response1_rt = c("4.781", 
"4.24", "12.351", "3.841", "5.912", "6.658"), response2 = c("Dress", 
"human", "cat", "woman", "stick figure", "hood"), response2_rt = c("4.77", 
"4.476", "4.214", "4.494", "3.735", "4.998"), response3 = c("2002", 
"date", "party", "2022", "calendar", "month"), response3_rt = c("3.557", 
"2.82", "7.517", "4.371", "4.868", "4.219")), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))


dataset_b <- structure(list(Word = c("abandon", "alarm_clock", "calendar", "chip", 
"doll", "dress", "fig", "hotel", "human", "kayak", "light", "part", 
"string", "woman", "zone"), measure1 = c("6.26", 
"5.3", "2.84", "2.63", "5.85", "5.43", "4.48", "2.42", "2.05", 
"5.52", "5.57", "4", "5.15", "3.53", "3.05"), measure2 = c(2.21, 
1.59, 1.54, 1.74, 1.69, 1.75, 1.59, 1.61, 1.31, 1.75, 1.75, 1.29, 
1.79, 1.22, 1.81), measure3 = c(19, 20, 19, 30, 20, 21, 23, 
19, 36, 21, 23, 19, 20, 40, 19)), row.names = c(NA, -15L), class = c("tbl_df", 
"tbl", "data.frame"), problems = structure(list(expected = c("a double", "1/0/T/F/TRUE/FALSE", "a double", 
"1/0/T/F/TRUE/FALSE", "a double", "1/0/T/F/TRUE/FALSE", "a double", 
"1/0/T/F/TRUE/FALSE"), actual = c("entering", "12", "auto", "10", 
"B", "8", "State", "13")), row.names = c(NA, -8L
), class = c("tbl_df", "tbl", "data.frame")))

Solution

  • Here's a solution in the tidyverse. To avoid some tortuous parsing, I have left the columns with slightly different names and in a different order from what you depicted. However, the fundamental structure and content of the data is as requested.

    Solution

    library(tidyverse)
    
    
    # ...
    # Code to generate your 'responses' and 'words' datasets.
    # ...
    
    
    # Filter the desired measures.
    words <- words %>%
      # Pivot the 'measure*' columns into the two columns:
      # - 'm': which measure for that word
      # - 'measure': the value measured
      pivot_longer(
        cols = matches("measure\\d+"),
        names_to = c(".value", "m"),
        names_pattern = "(measure)(\\d+)",
        # Homogenize everything within the new columns:
        values_transform = list(
          # Decimals within 'measure'
          measure = as.double,
          # Integers within 'm'
          m = as.integer
        )
      ) %>%
      # Filter down to desired measures by number.
      filter(m %in% c(
        # You want only measure #1.
        1 #, 2, 3, ...
      )) %>%
      # Pivot back to the original 'measure*' format, now with only the desired
      # columns.
      pivot_wider(
        names_from = m,
        values_from = measure,
        names_glue = "{.value}{.name}"
      )
    
    
    # Match the measures into the responses.
    measured_responses <- responses %>%
      # Pivot the 'response*' columns into two columns:
      # - 'n': which response for that ID
      # - 'response': the word itself
      pivot_longer(
        cols = matches("^response\\d+$"),
        names_to = c(".value", "n"),
        names_pattern = "^(response)(\\d+)$",
        # Homogenize everything within the new columns:
        values_transform = list(
          # Strings within 'response'
          response = as.character,
          # Integers within 'n'
          n = as.integer
        )
      ) %>%
      # Match by word the responses with their measures; and retain (via LEFT join)
      # the responses that lack matches.
      left_join(
        words,
        by = c("response" = "Word")
      ) %>%
      # Pivot the dataset back into its original 'response*' format, now with the
      # new 'measure*.*' columns included.
      pivot_wider(
        names_from = n,
        values_from = c(response, matches("measure\\d+")),
        names_sep = "."
      )
    
    
    # View the results.
    measured_responses
    

    Results

    Given a responses dataset like this

    responses <- tibble::tribble(
      ~ID, ~Duration, ~Gender, ~Age, ~response1, ~response1_rt, ~response2, ~response2_rt, ~response3, ~response3_rt, # ...
      1,   2910,      2,       26,   "word",     3.55,          "apple",    2.89,          "house",    4.21,          # ...
      2,   1873,      1,       20,   "paper",    2.53,          "hungry",   3.25,          "tent",     2.55,          # ...
      112, 3420,      2,       18,   "pen",      1.83,          "salad",    4.26,          "family",   3.22           # ...
    )
    

    and a words dataset like this

    words <- tibble::tribble(
      ~Word,      ~measure1, ~measure2, ~measure3, # ...
      "apple",    5.19,      1.84,      25.00,     # ...
      "paper",    5.42,      1.62,      40,        # ...
      "pen",      5.63,      1.71,      20,        # ...
      "salad",    6.07,      2.19,      40,        # ...
      "tent",     4.96,      1.55,      30,        # ...
      "word",     5.05,      2.26,      19.00,     # ...
    )
    

    this solution should produce a measured_responses dataset like this:

    # A tibble: 3 x 13
         ID Duration Gender   Age response1_rt response2_rt response3_rt response.1 response.2 response.3 measure1.1 measure1.2 measure1.3
      <dbl>    <dbl>  <dbl> <dbl>        <dbl>        <dbl>        <dbl> <chr>      <chr>      <chr>           <dbl>      <dbl>      <dbl>
    1     1     2910      2    26         3.55         2.89         4.21 word       apple      house            5.05       5.19      NA   
    2     2     1873      1    20         2.53         3.25         2.55 paper      hungry     tent             5.42      NA          4.96
    3   112     3420      2    18         1.83         4.26         3.22 pen        salad      family           5.63       6.07      NA