Search code examples
rloopsfor-loopimdb

How to efficiently replace these imdb movie title IDs with the actual title based on a column match?


I've been working in R with some IMDB data that they publish, and have been stuck on this for embarrassingly long today.

primaryName     tconst                 primaryTitle                          knownForTitles
1          Aaron Lim  tt2317744            My Friend Bernard tt0268228,tt0891369,tt2317744,tt3709694
2      Aaron Woodley  tt3228088          Spark: A Space Tail tt0326065,tt1650535,tt4426464,tt3228088
3 Abdelkader Belhedi tt11069302       The Carthage Castaways         tt11698758,tt11069302,tt0485746

I am struggling to come up with a way to match the knownForTitles IDs with the ID in the tconst column. After matching, I would like to replace the IDs in knownForTitles with the actual title names from primaryTitle, like below.

primaryName     tconst                 primaryTitle                          knownForTitles
1          Aaron Lim  tt2317744            My Friend Bernard Movie Title,Movie Title,Movie Title,Movie Title
2      Aaron Woodley  tt3228088          Spark: A Space Tail Movie Title,Movie Title,Movie Title,Movie Title
3 Abdelkader Belhedi tt11069302       The Carthage Castaways         Movie Title,Movie Title,Movie Title

I can only think of using a bunch of for loops which may be very inefficient for thousands of rows. If anyone can point me in a better direction that'd be awesome.


Solution

  • The code is this. The explanation is below.

    Code.

    df = data.frame(primaryName = c("Aaron Lim", "Aaron Woodley"), tconst = c("tt2317744", "tt3228088"), primaryTitle = c("My friend Ron", "Spark: Some Title"), knownForTitles = c("tt0268228,tt0891369,tt2317744,tt3709694", "tt0326065,tt1650535,tt4426464,tt3228088"))
    df$tconst = as.character(df$tconst)
    Names = df %>%  
      mutate(V2 = strsplit(as.character(knownForTitles), ",")) %>% 
      tidyr::unnest(V2) %>% 
      select(-knownForTitles) %>% 
      as.data.frame(.) 
    Movies = df[,2:3]
    Modi = left_join(Names, Movies, by = c("V2" = "tconst")) 
    Modi$primaryTitle.y = as.character(Modi$primaryTitle.y)
    Modi[is.na(Modi$primaryTitle.y), "primaryTitle.y"] = "Test"
    
    Modi %>% 
      group_by(tconst) %>%  
      summarise(primNew = stringr::str_c(primaryTitle.y, collapse = ", ")) %>% 
      inner_join(df, .)
    

    Output.

        primaryName    tconst      primaryTitle                          knownForTitles
    1     Aaron Lim tt2317744     My friend Ron tt0268228,tt0891369,tt2317744,tt3709694
    2 Aaron Woodley tt3228088 Spark: Some Title tt0326065,tt1650535,tt4426464,tt3228088
                                  primNew
    1     Test, Test, My friend Ron, Test
    2 Test, Test, Test, Spark: Some Title
    

    Explanation.

    Let's define some toy data.

        df = data.frame(primaryName = c("Aaron Lim", "Aaron Woodley"), 
                        tconst = c("tt2317744", "tt3228088"), 
                        primaryTitle = c("My friend", "Spark"), 
                        knownForTitles = c("tt0268228,tt0891369,tt2317744,tt3709694", "tt0326065,tt1650535,tt4426464,tt3228088"))
        df$tconst = as.character(df$tconst)
    

    Then you can take tidyr's unnest function to split all the column strings into rows, like this

    Names = df %>%  
      mutate(V2 = strsplit(as.character(knownForTitles), ",")) %>% 
      tidyr::unnest(V2) %>% 
      select(-knownForTitles) %>% 
      as.data.frame(.) 
    

    with the result

    > Names
        primaryName    tconst      primaryTitle        V2
    1     Aaron Lim tt2317744     My friend Ron tt0268228
    2     Aaron Lim tt2317744     My friend Ron tt0891369
    3     Aaron Lim tt2317744     My friend Ron tt2317744
    4     Aaron Lim tt2317744     My friend Ron tt3709694
    5 Aaron Woodley tt3228088 Spark: Some Title tt0326065
    6 Aaron Woodley tt3228088 Spark: Some Title tt1650535
    7 Aaron Woodley tt3228088 Spark: Some Title tt4426464
    8 Aaron Woodley tt3228088 Spark: Some Title tt3228088
    

    Then you get the movie names for all tconstants with

    Movies = df[,2:3]
    Modi = left_join(Names, Movies, by = c("V2" = "tconst")) 
    

    and the result

        primaryName    tconst    primaryTitle.x        V2    primaryTitle.y
    1     Aaron Lim tt2317744     My friend Ron tt0268228              <NA>
    2     Aaron Lim tt2317744     My friend Ron tt0891369              <NA>
    3     Aaron Lim tt2317744     My friend Ron tt2317744     My friend Ron
    4     Aaron Lim tt2317744     My friend Ron tt3709694              <NA>
    5 Aaron Woodley tt3228088 Spark: Some Title tt0326065              <NA>
    6 Aaron Woodley tt3228088 Spark: Some Title tt1650535              <NA>
    7 Aaron Woodley tt3228088 Spark: Some Title tt4426464              <NA>
    8 Aaron Woodley tt3228088 Spark: Some Title tt3228088 Spark: Some Title
    

    As this is toy data there are NA values which cause some trouble, so we do

    Modi$primaryTitle.y = as.character(Modi$primaryTitle.y)
    Modi[is.na(Modi$primaryTitle.y), "primaryTitle.y"] = "Test"
    

    to cope with that.

    Finally, you modify the matched movies and collapse them in one row with

    Modi %>% 
      group_by(tconst) %>%  
      summarise(primNew = stringr::str_c(primaryTitle.y, collapse = ", ")) %>% 
      inner_join(df, .)
    

    and the result

        primaryName    tconst      primaryTitle                          knownForTitles
    1     Aaron Lim tt2317744     My friend Ron tt0268228,tt0891369,tt2317744,tt3709694
    2 Aaron Woodley tt3228088 Spark: Some Title tt0326065,tt1650535,tt4426464,tt3228088
                                  primNew
    1     Test, Test, My friend Ron, Test
    2 Test, Test, Test, Spark: Some Title