I uploaded a .txt file in to R as follows: Election_Parties <- readr::read_lines("Election_Parties.txt")
The following text is in the file: pastebin link.
The text more or less looks as follows (Please use actual file for solution!):
BOLIVIA
P1-Nationalist Revolutionary Movement-Free Bolivia Movement (Movimiento
Nacionalista Revolucionario [MNR])
P19-Liberty and Justice (Libertad y Justicia [LJ])
P20-Tupak Katari Revolutionary Movement (Movimiento Revolucionario Tupak Katari [MRTK])
COLOMBIA
P1-Democratic Aliance M-19 (Alianza Democratica M-19 [AD-M19])
P2-National Popular Alliance (Alianza Nacional Popular [ANAPO])
P3-Indigenous Authorities of Colombia (Autoridades Indígenas
de Colombia)
I would like to have all information about a party on one line, no matter how long it is.
DESIRED OUTPUT:
BOLIVIA
P1-Nationalist Revolutionary Movement-Free Bolivia Movement (Movimiento Nacionalista Revolucionario
P19-Liberty and Justice (Libertad y Justicia [LJ])
P20-Tupak Katari Revolutionary Movement (Movimiento Revolucionario Tupak Katari [MRTK])
COLOMBIA
P1-Democratic Aliance M-19 (Alianza Democratica M-19 [AD-M19])
P2-National Popular Alliance (Alianza Nacional Popular [ANAPO])
P3-Indigenous Authorities of Colombia (Autoridades Indígenas de Colombia)
I have a solution that almost completely does the trick by @JBGruber, which can be found here:
lines <- readr::read_lines("https://pastebin.com/raw/jSrvTa7G")
head(lines)
entries <- split(lines, cumsum(grepl("^$|^ $", lines)))
library(stringr)
library(dplyr)
df <- lapply(entries, function(entry) {
entry <- entry[!grepl("^$|^ $", entry)] # remove empty elements
header <- entry[1] # first non empty is the header
entry <- tail(entry, -1) # remove header from entry
desc <- str_extract(entry, "^P\\d+-") # extract description
for (l in which(is.na(desc))) { # collapse lines that go over 2 elements
entry[l - 1] <- paste(entry[l - 1], entry[l], sep = " ")
}
entry <- entry[!is.na(desc)]
desc <- desc[!is.na(desc)]
# turn into nice format
df <- tibble::tibble(
header,
desc,
entry
)
df$entry <- str_replace_all(df$entry, fixed(df$desc), "") # remove description from entry
return(df)
}) %>%
bind_rows() # turn list into one data.frame
But it somehow deletes information. For example, this information:
P1-Movement for a Prosperous Czechoslovakia (Hnutie za prosperujúce Česko + Slovensko
[HZPČS])
P2-Social Democracy (Sociálna demokracia [SD])
P3-Association for Workers in Slovakia (Združenie robotníkov Slovenska [ZRS])
I don't understand the code well enough to see where this deletion might occur, or how to check step by step where it occurs (as everything happens within lapply
). Can anyone help?
Please note that solutions using data.table
are just as welcome.
EDIT:
The reason the answer doesn't work properly anymore is that the file has changed slightly. The original answer was based on the fact that entries were separated by an empty line. These lines are gone. But entries are now separated by a line that only contains "P00-". We can use this as the separator instead.
lines <- readr::read_lines("https://pastebin.com/raw/KKu9FmF6")
entries <- split(lines, cumsum(grepl("P00-$", lines)))
library(stringr)
library(dplyr)
df <- lapply(entries, function(entry) {
entry <- entry[!grepl("P00-$", entry)] # remove empty elements
header <- entry[1] # first non empty is the header
entry <- tail(entry, -1) # remove header from entry
desc <- str_extract(entry, "^P\\d+-") # extract description
for (l in which(is.na(desc))) { # collapse lines that go over 2 elements
entry[l - 1] <- paste(entry[l - 1], entry[l], sep = " ")
}
entry <- entry[!is.na(desc)]
desc <- desc[!is.na(desc)]
# turn into nice format
df <- tibble::tibble(
header,
desc,
entry
)
df$entry <- str_replace_all(df$entry, fixed(df$desc), "") # remove description from entry
return(df)
}) %>%
bind_rows() # turn list into one data.frame
I checked if the information you listed above is still missing and this is not the case:
df %>%
filter(str_detect(entry, "Movement for a Prosperous Czechoslovakia|Sociálna demokraci|Association for Workers in Slovakia"))
#> # A tibble: 3 x 3
#> header desc entry
#> <chr> <chr> <chr>
#> 1 P00-SLOVAK… P1- Movement for a Prosperous Czechoslovakia (Hnutie za prosper…
#> 2 P00-SLOVAK… P2- Social Democracy (Sociálna demokracia [SD])
#> 3 P00-SLOVAK… P3- Association for Workers in Slovakia (Združenie robotníkov S…
Created on 2019-12-16 by the reprex package (v0.3.0)
I tried to make the answer as clear as possible, but I understand that it is often hard to wrap your head around other people's code. One thing that always helps me is to run the solution line by line and check how the objects change. Since most of the important stuff is hidden in the loop, you can simulate one run of lapply
by creating an example entry like this: entry <- entries[[1]]
. Now you can the lines inside lapply
.