r dplyr data.table stringr readr

Finding the cause of an unwanted deletion within an lapply function


I read a .txt file into R as follows: Election_Parties <- readr::read_lines("Election_Parties.txt"). The full text of the file is available via the pastebin link.

The text looks more or less as follows (please use the actual file for a solution!):

BOLIVIA
P1-Nationalist Revolutionary Movement-Free Bolivia Movement (Movimiento 
Nacionalista Revolucionario [MNR])
P19-Liberty and Justice (Libertad y Justicia [LJ])
P20-Tupak Katari Revolutionary Movement (Movimiento Revolucionario Tupak Katari [MRTK])

COLOMBIA
P1-Democratic Aliance M-19 (Alianza Democratica M-19 [AD-M19])
P2-National Popular Alliance (Alianza Nacional Popular [ANAPO])
P3-Indigenous Authorities of Colombia (Autoridades Indígenas 
de Colombia)

I would like to have all information about a party on one line, no matter how long it is.

DESIRED OUTPUT:

BOLIVIA
P1-Nationalist Revolutionary Movement-Free Bolivia Movement (Movimiento Nacionalista Revolucionario [MNR])
P19-Liberty and Justice (Libertad y Justicia [LJ])
P20-Tupak Katari Revolutionary Movement (Movimiento Revolucionario Tupak Katari [MRTK])

COLOMBIA
P1-Democratic Aliance M-19 (Alianza Democratica M-19 [AD-M19])
P2-National Popular Alliance (Alianza Nacional Popular [ANAPO])
P3-Indigenous Authorities of Colombia (Autoridades Indígenas de Colombia)

I have a solution by @JBGruber that almost completely does the trick:

lines <- readr::read_lines("https://pastebin.com/raw/jSrvTa7G")
head(lines)
entries <- split(lines, cumsum(grepl("^$|^ $", lines)))

library(stringr)
library(dplyr)
df <- lapply(entries, function(entry) {
  entry <- entry[!grepl("^$|^ $", entry)] # remove empty elements
  header <- entry[1] # first non empty is the header
  entry <- tail(entry, -1)  # remove header from entry
  desc <- str_extract(entry, "^P\\d+-")  # extract description

  for (l in which(is.na(desc))) { # collapse lines that go over 2 elements
    entry[l - 1] <- paste(entry[l - 1], entry[l], sep = " ")
  }

  entry <- entry[!is.na(desc)]
  desc <- desc[!is.na(desc)]

  # turn into nice format
  df <- tibble::tibble(
    header,
    desc,
    entry
  )
  df$entry <- str_replace_all(df$entry, fixed(df$desc), "") # remove description from entry
  return(df)
}) %>% 
  bind_rows() # turn list into one data.frame

But it somehow deletes information. For example, the following entries go missing from the result:

P1-Movement for a Prosperous Czechoslovakia (Hnutie za prosperujúce Česko + Slovensko
[HZPČS])
P2-Social Democracy (Sociálna demokracia [SD])
P3-Association for Workers in Slovakia (Združenie robotníkov Slovenska [ZRS])

I don't understand the code well enough to see where this deletion might occur, or how to check step by step where it occurs (as everything happens within lapply). Can anyone help?

Please note that solutions using data.table are just as welcome.


Solution

  • The reason the answer doesn't work properly anymore is that the file has changed slightly. The original answer was based on the fact that entries were separated by an empty line. These lines are gone. But entries are now separated by a line that only contains "P00-". We can use this as the separator instead.

    lines <- readr::read_lines("https://pastebin.com/raw/KKu9FmF6")
    
    entries <- split(lines, cumsum(grepl("P00-$", lines)))
    
    library(stringr)
    library(dplyr)
    
    df <- lapply(entries, function(entry) {
      entry <- entry[!grepl("P00-$", entry)] # remove the "P00-" separator lines
      header <- entry[1] # first remaining line is the header
      entry <- tail(entry, -1)  # remove header from entry
      desc <- str_extract(entry, "^P\\d+-")  # extract description
    
      for (l in which(is.na(desc))) { # collapse lines that go over 2 elements
        entry[l - 1] <- paste(entry[l - 1], entry[l], sep = " ")
      }
    
      entry <- entry[!is.na(desc)]
      desc <- desc[!is.na(desc)]
    
      # turn into nice format
      df <- tibble::tibble(
        header,
        desc,
        entry
      )
      df$entry <- str_replace_all(df$entry, fixed(df$desc), "") # remove description from entry
      return(df)
    }) %>% 
      bind_rows() # turn list into one data.frame
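
    To see why the separator matters, here is a minimal sketch of the split() and cumsum() grouping on a few made-up lines. The toy lines are an assumption for illustration and are not taken from the real file. With the old blank-line pattern nothing matches, so every line lands in a single group; with the "P00-" pattern each separator starts a new group.

    ```r
    # Made-up lines standing in for the file; the real file will differ.
    lines <- c("P00-", "P00-BOLIVIA", "P1-Party A",
               "P00-", "P00-COLOMBIA", "P1-Party C", "P2-Party D")

    # Old separator: blank lines. None exist here, so the counter never increases.
    old_groups <- cumsum(grepl("^$|^ $", lines))
    old_entries <- split(lines, old_groups)
    length(old_entries)  # a single group holding all lines

    # New separator: lines ending in "P00-". Each match bumps the counter by one.
    new_groups <- cumsum(grepl("P00-$", lines))
    entries <- split(lines, new_groups)
    str(entries)         # one list element per separator
    ```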
    

    I checked if the information you listed above is still missing and this is not the case:

    df %>% 
      filter(str_detect(entry, "Movement for a Prosperous Czechoslovakia|Sociálna demokraci|Association for Workers in Slovakia"))
    #> # A tibble: 3 x 3
    #>   header      desc  entry                                                       
    #>   <chr>       <chr> <chr>                                                       
    #> 1 P00-SLOVAK… P1-   Movement for a Prosperous Czechoslovakia (Hnutie za prosper…
    #> 2 P00-SLOVAK… P2-   Social Democracy (Sociálna demokracia [SD])                 
    #> 3 P00-SLOVAK… P3-   Association for Workers in Slovakia (Združenie robotníkov S…
    

    Created on 2019-12-16 by the reprex package (v0.3.0)

    I tried to make the answer as clear as possible, but I understand that it is often hard to wrap your head around other people's code. One thing that always helps me is to run the solution line by line and check how the objects change. Since most of the important steps are hidden inside the function passed to lapply, you can simulate one run of lapply by creating an example entry like this: entry <- entries[[1]]. Now you can run the lines inside lapply one at a time and inspect entry, header and desc after each step.
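
As a rough sketch of that line-by-line approach, the snippet below walks through the body of the function on one made-up entry. The toy entry is an assumption and not the real entries[[1]], and the ifelse()/sub() line is a base-R stand-in for str_extract() so the snippet runs without extra packages. The final check compares the text before and after collapsing, which is one way to notice whether anything was dropped.

```r
# Made-up entry standing in for entries[[1]]; replace with the real one.
entry <- c("P00-", "P00-BOLIVIA",
           "P1-Party A (Partido", "A [PA])",
           "P2-Party B (Partido B [PB])")

entry <- entry[!grepl("P00-$", entry)]  # drop the separator line
header <- entry[1]                      # first remaining line is the header
entry <- tail(entry, -1)                # everything after the header

# Base-R stand-in for str_extract(entry, "^P\\d+-"): NA where no prefix is found
desc <- ifelse(grepl("^P\\d+-", entry),
               sub("^(P\\d+-).*$", "\\1", entry),
               NA_character_)
print(desc)                             # NA marks continuation lines

before <- entry                         # keep a copy for the comparison below
for (l in which(is.na(desc))) {
  entry[l - 1] <- paste(entry[l - 1], entry[l], sep = " ")
}
entry <- entry[!is.na(desc)]
print(entry)

# Did any text get lost while collapsing? TRUE means nothing was dropped.
identical(paste(before, collapse = " "), paste(entry, collapse = " "))
```

Running the same comparison on a real entry, for example the one containing the Slovak parties, should show at which step text disappears, if it does.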