Search code examples
rdataframereadline

Create a dataframe from a file with different delimiters


I need to create a table based on different delimiters and keywords. I have the following file:

    >>>ENST00000370225_4
    >7E7I_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4 
    [Homo sapiens]
     7E7O_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4 
    [Homo sapiens]
    Length=2317
    
     Score = 4711 bits (12220),  Expect = 0.0, Method: Compositional matrix adjust.
     Identities = 2273/2273 (100%), Positives = 2273/2273 (100%), Gaps = 0/2273 (0%)
    
    >NP_000341.2 retinal-specific phospholipid-transporting ATPase ABCA4 [Homo 
    sapiens]
     P78363.3 RecName: Full=Retinal-specific phospholipid-transporting ATPase 
    ABCA4; AltName: Full=ATP-binding cassette sub-family A member 
    4; AltName: Full=RIM ABC transporter; Short=RIM proteinv; 
    Short=RmP; AltName: Full=Retinal-specific ATP-binding cassette 
    transporter; AltName: Full=Stargardt disease protein [Homo 
    sapiens]
     7LKP_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4 
    [Homo sapiens]
     7M1P_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4 
    [Homo sapiens]
     7M1Q_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4 
    [Homo sapiens]
     EAW73056.1 ATP-binding cassette, sub-family A (ABC1), member 4, isoform 
    CRA_a [Homo sapiens]
    Length=2273
    
     Score = 4711 bits (12219),  Expect = 0.0, Method: Compositional matrix adjust.
     Identities = 2273/2273 (100%), Positives = 2273/2273 (100%), Gaps = 0/2273 (0%)
    
    >>>ENST00000460514_1
    >CAH10486.1 hypothetical protein [Homo sapiens]
    Length=1065
    
     Score = 301 bits (772),  Expect = 2e-96, Method: Compositional matrix adjust.
     Identities = 146/146 (100%), Positives = 146/146 (100%), Gaps = 0/146 (0%)
    
    >CAA75729.1 ABCR [Homo sapiens]
    Length=2273
    
     Score = 300 bits (769),  Expect = 2e-94, Method: Compositional matrix adjust.
     Identities = 146/146 (100%), Positives = 146/146 (100%), Gaps = 0/146 (0%)

And the desired output is:

Transcript        Protein                                                                 Length  Score             Identity  Percent
1 ENST00000370225_4 7E7I_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4   2317 4711 bits (12220) 2273/2273       100%
2 ENST00000370225_4 NP_000341.2 retinal-specific phospholipid-transporting ATPase ABCA4       2273 4711 bits (12219) 2273/2273       100%
3 ENST00000460514_1 CAH10486.1 hypothetical protein                                           1065 301 bits (772)    146/146         100%
4 ENST00000460514_1 CAA75729.1 ABCR                                                           2273 300 bits (769)    146/146         100%

Each desired column is delimited in the original file by keywords like length, identities, scores, ">" and ">>>" I tried the following script, but I'm missing to add of the transcript (the first column), which is the first column delimited by ">>>".

my_txt <- readLines(con = "gene_filt_perc.txt")

transcript <-my_txt[grepl("^\\s*>>>", my_txt)]
lengths <-my_txt[grepl("^\\s*Length", my_txt)]
lengths <- gsub("Length=", "", lengths)
scores <-my_txt[grepl("^\\s*Score", my_txt)]
scores <- gsub(" Score = ", "", scores)
scores <- gsub("\\,  Expect = ..*", "", scores)
identities <-my_txt[grepl("^\\s*Identities", my_txt)]
identities <- gsub(" Identities = ", "", identities)
identities <- gsub("\\, Positives = ..*", "", identities)
protein <-my_txt[grepl("^\\s*>[[:alnum:]]", my_txt)]
   
result <-data.frame("protein"=protein, "identities"=identities, "scores"=scores, "lengths"=lengths)
result
                                                                          protein       identities            scores lengths
1        7E7I_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4  2273/2273 (100%) 4711 bits (12220)    2317
2        7LKZ_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4   2271/2273 (99%) 4711 bits (12220)    2273
3      NP_000341.2 retinal-specific phospholipid-transporting ATPase ABCA4 [Homo  2273/2273 (100%) 4711 bits (12219)    2273
4                                 BAE06122.2 ABCA4 variant protein [Homo sapiens]  2272/2273 (99%) 4710 bits (12218)    2273
5        7E7Q_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4   2271/2273 (99%) 4709 bits (12214)    2317

Is there an easier way to construct the data.frame?


Solution

  • Doing gsub on all data will destroy the hierarchical nesting. Keep your data together. Start with one big string stored in data.txt and split them to get one element per transcript. Fields can be extracted using relative expressions. Lists can be converted into multiple rows in a table using unnest. This automatically repeats transcript IDs for each corresponding protein.

    library(tidyverse)
    
    read_file("data.txt") %>%
      str_split(">>>") %>%
      simplify() %>%
      discard(~ .x == "") %>%
      # parse all transcripts
      map(function(Transcript) {
        list(
          Transcript = Transcript %>% str_extract("ENST.*"),
          Protein = {
            Transcript %>%
            str_split(">") %>%
            simplify() %>%
            discard(~ .x %>% str_detect("^ENST"))
          }
        )
      }) %>%
      enframe() %>%
      select(value) %>%
      unnest_wider(value) %>%
      unnest(Protein) %>%
      mutate(
        # parse all proteins
        Protein = Protein %>% map(function(Protein) {
          list(
            # first line
            Protein = Protein %>% str_split("\n") %>% simplify() %>% first() %>% str_trim(),
            # numbers after pattern 'Length='
            Length = Protein %>% str_extract("(?<=Length=)[0-9 ]+") %>% as.numeric(),
            # numbers after pattern 'Score = '
            Score = Protein %>% str_extract("(?<=Score = )[0-9 ]+") %>% as.numeric()
          )
        })
      ) %>%
      unnest_wider(Protein)
    
    #>  Transcript        Protein                                                                   Length Score
    #>  <chr>             <chr>                                                                      <dbl> <dbl>
    #>1 ENST00000370225_4 7E7I_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4     2317  4711
    #>2 ENST00000370225_4 NP_000341.2 retinal-specific phospholipid-transporting ATPase ABCA4 [Homo   2273  4711
    #>3 ENST00000460514_1 CAH10486.1 hypothetical protein [Homo sapiens]                              1065   301
    #>4 ENST00000460514_1 CAA75729.1 ABCR [Homo sapiens]                                              2273   300