I am reading a file that contains data about amino acid sequences for approx. 600000 proteins. for whomever this might be of interest, here the source
I am using data.table::fread
due to the file size and for convenience.
The "problem" is that the file contains a new entry only every 2nd line, introduced with a ">". It's not a biggie, because I can just do some minor wrangling and I have it as I want. (see desired output, or even "ideal output").
I wondered if there is a direct way to read in a file that has this kind of structure. Any other package also welcome of course, but it should handle that type of size well.
library(tidyverse)
# "text = ..." contains a shortened version of the first two entries of the downloaded txt file
prot <- data.table::fread(text =
">101m_A mol:protein length:154 MYOGLOBIN
QGAMNKALEL
>102l_A mol:protein length:165 T4 LYSOZYME
RAKRVITTFR",
header = FALSE
)
prot <- as.data.frame(prot)
# expected output
exp_out <- bind_cols(prot = prot[c(T, F), ], aminoseq = prot[c(F, T), ] )
exp_out
#> # A tibble: 2 x 2
#> prot aminoseq
#> <chr> <chr>
#> 1 >101m_A mol:protein length:154 MYOGLOBIN QGAMNKALEL
#> 2 >102l_A mol:protein length:165 T4 LYSOZYME RAKRVITTFR
# ideal output
exp_out %>%
separate(prot, c("mol", "length"), sep = ":protein length:") %>%
separate(length, c("length", "name"), sep = "\\s{2}+")
#> # A tibble: 2 x 4
#> mol length name aminoseq
#> <chr> <chr> <chr> <chr>
#> 1 >101m_A mol 154 MYOGLOBIN QGAMNKALEL
#> 2 >102l_A mol 165 T4 LYSOZYME RAKRVITTFR
Created on 2021-01-07 by the reprex package (v0.3.0)
Read odd and even rows separately, using sed, then fread with column bind, this will get you to "expected output", it is pretty fast, too, around 2 seconds with unzipped input:
# get the data
# wget ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt.gz
library(data.table)
# unzip on the fly
started.at = proc.time()
d <- cbind(
fread(cmd = "zcat pdb_seqres.txt.gz | sed -n 'p;n'", sep = "|"),
fread(cmd = "zcat pdb_seqres.txt.gz | sed -n 'n;p'"))
cat("Finished in", timetaken(started.at), "\n")
# Finished in 4.585s elapsed (1.788s cpu)
# read unzipped input
started.at = proc.time()
d <- cbind(
fread(cmd = "sed -n 'p;n' pdb_seqres.txt", sep = "|"),
fread(cmd = "sed -n 'n;p' pdb_seqres.txt"))
cat("Finished in", timetaken(started.at), "\n")
# Finished in 1.796s elapsed (1.111s cpu)
In theory below should work, i.e. we are column binding using bash paste before freading, but it keeps giving me errors about tempfile permissions, might work on your set up.
fread(cmd = "paste -d'|' <(sed -n 'p;n' pdb_seqres.txt) <(sed -n 'n;p' pdb_seqres.txt)",
sep = "|")