Search code examples
rsequencefastalongest-substring

Extracting the longest sequence from the tab delim file


I have tab delim file file which contains which contains the following information

>fasta 
    >ss_23_122_0_1
    MJSDHWTEZTZEWUIASUDUAISDUASADIASDIAUSIDAUSIDCASDAS
    >ss_23_167_0_1
    WEIURIOWERWKLEJDSAJFASDGASZDTTQZWTEZQWTEZUQWEZQWTEZQTWEZTQW
    >ss_23_167_0_1
    MAASDASDWEPWERIWERIWER
    >ss_23_167_0_1
    QWEKCKLSDOIEOWIOWEUWWEUWEZURZEWURZUWEUZUQZUWZUE
    >ss_45_201_0_1
    HZTMKSKDIUWZUWEZTZWERWUEOIRUOEROOWEWERSDFSDFRRRETERTER
    >ss_45_201_0_1
    ZTTRASOIIDIFOSDIOFISDOFSDFQAWTZETQWE
    >ss_89_10_0_2
    NJZTIWEIOIOIPIEPWIQPOEIQWIEPOQWIEPOQWIEPQIWEP

For the ids like ss_45_201_0_1 and ss_23_167_0_1 there were multiple entries,I would like to retain only those entry which has maximum length of all. I would like to get output like following:

>fasta
    >ss_23_122_0_1
    MJSDHWTEZTZEWUIASUDUAISDUASADIASDIAUSIDAUSIDCASDAS
    >ss_23_167_0_1
    WEIURIOWERWKLEJDSAJFASDGASZDTTQZWTEZQWTEZUQWEZQWTEZQTWEZTQW
    >ss_45_201_0_1
    HZTMKSKDIUWZUWEZTZWERWUEOIRUOEROOWEWERSDFSDFRRRETERTER
    >ss_89_10_0_2
    NJZTIWEIOIOIPIEPWIQPOEIQWIEPOQWIEPOQWIEPQIWEP

I tried the following code in R but it fails

Unique(fasta)

Can anyone guide me. How can I get only the longest sequence for those same ids which has multiple entries with different length.


Solution

  • Here are three options to consider.

    Option 1: Base R

    Unlist the list, use nchar on that, and use ave to figure out the values to keep.

    x <- nchar(unlist(l))
    l[as.logical(ave(x, names(x), FUN = function(x) x == max(x)))]
    # $ss_23_122_0_1
    # [1] "MJSDHWTEZTZEWUIASUDUAISDUASADIASDIAUSIDAUSIDCASDAS"
    # 
    # $ss_23_167_0_1
    # [1] "WEIURIOWERWKLEJDSAJFASDGASZDTTQZWTEZQWTEZUQWEZQWTEZQTWEZTQW"
    # 
    # $ss_45_201_0_1
    # [1] "HZTMKSKDIUWZUWEZTZWERWUEOIRUOEROOWEWERSDFSDFRRRETERTER"
    # 
    # $ss_89_10_0_2
    # [1] "NJZTIWEIOIOIPIEPWIQPOEIQWIEPOQWIEPOQWIEPQIWEP"
    

    Option 2: "data.table"

    Use melt from "reshape2" to create a data.frame. Use rank along with nchar to subset. (I used rank instead of == so that I didn't have to use nchar twice--haven't checked for comparative efficiency.)

    library(data.table)
    library(reshape2)
    as.data.table(melt(l))[, Rnk := rank(nchar(as.character(value))), 
                           by = L1][Rnk == 1]
    #                                                 value            L1 Rnk
    # 1: MJSDHWTEZTZEWUIASUDUAISDUASADIASDIAUSIDAUSIDCASDAS ss_23_122_0_1   1
    # 2:                             MAASDASDWEPWERIWERIWER ss_23_167_0_1   1
    # 3:               ZTTRASOIIDIFOSDIOFISDOFSDFQAWTZETQWE ss_45_201_0_1   1
    # 4:      NJZTIWEIOIOIPIEPWIQPOEIQWIEPOQWIEPOQWIEPQIWEP  ss_89_10_0_2   1
    

    Option 3: "dplyr"

    Similar approach to "data.table".

    library(dplyr)
    library(reshape2)
    melt(l) %>%
      group_by(L1) %>%
      mutate(Rnk = dense_rank(nchar(as.character(value)))) %>%
      filter(Rnk == 1)
    # Source: local data frame [4 x 3]
    # Groups: L1
    # 
    #                                                value            L1 Rnk
    # 1 MJSDHWTEZTZEWUIASUDUAISDUASADIASDIAUSIDAUSIDCASDAS ss_23_122_0_1   1
    # 2                             MAASDASDWEPWERIWERIWER ss_23_167_0_1   1
    # 3               ZTTRASOIIDIFOSDIOFISDOFSDFQAWTZETQWE ss_45_201_0_1   1
    # 4      NJZTIWEIOIOIPIEPWIQPOEIQWIEPOQWIEPOQWIEPQIWEP  ss_89_10_0_2   1