I have tab delim file file which contains which contains the following information
>fasta
>ss_23_122_0_1
MJSDHWTEZTZEWUIASUDUAISDUASADIASDIAUSIDAUSIDCASDAS
>ss_23_167_0_1
WEIURIOWERWKLEJDSAJFASDGASZDTTQZWTEZQWTEZUQWEZQWTEZQTWEZTQW
>ss_23_167_0_1
MAASDASDWEPWERIWERIWER
>ss_23_167_0_1
QWEKCKLSDOIEOWIOWEUWWEUWEZURZEWURZUWEUZUQZUWZUE
>ss_45_201_0_1
HZTMKSKDIUWZUWEZTZWERWUEOIRUOEROOWEWERSDFSDFRRRETERTER
>ss_45_201_0_1
ZTTRASOIIDIFOSDIOFISDOFSDFQAWTZETQWE
>ss_89_10_0_2
NJZTIWEIOIOIPIEPWIQPOEIQWIEPOQWIEPOQWIEPQIWEP
For the ids like ss_45_201_0_1
and ss_23_167_0_1
there were multiple entries,I would like to retain only those entry which has maximum length of all. I would like to get output like following:
>fasta
>ss_23_122_0_1
MJSDHWTEZTZEWUIASUDUAISDUASADIASDIAUSIDAUSIDCASDAS
>ss_23_167_0_1
WEIURIOWERWKLEJDSAJFASDGASZDTTQZWTEZQWTEZUQWEZQWTEZQTWEZTQW
>ss_45_201_0_1
HZTMKSKDIUWZUWEZTZWERWUEOIRUOEROOWEWERSDFSDFRRRETERTER
>ss_89_10_0_2
NJZTIWEIOIOIPIEPWIQPOEIQWIEPOQWIEPOQWIEPQIWEP
I tried the following code in R but it fails
Unique(fasta)
Can anyone guide me. How can I get only the longest sequence for those same ids which has multiple entries with different length.
Here are three options to consider.
Unlist the list, use nchar
on that, and use ave
to figure out the values to keep.
x <- nchar(unlist(l))
l[as.logical(ave(x, names(x), FUN = function(x) x == max(x)))]
# $ss_23_122_0_1
# [1] "MJSDHWTEZTZEWUIASUDUAISDUASADIASDIAUSIDAUSIDCASDAS"
#
# $ss_23_167_0_1
# [1] "WEIURIOWERWKLEJDSAJFASDGASZDTTQZWTEZQWTEZUQWEZQWTEZQTWEZTQW"
#
# $ss_45_201_0_1
# [1] "HZTMKSKDIUWZUWEZTZWERWUEOIRUOEROOWEWERSDFSDFRRRETERTER"
#
# $ss_89_10_0_2
# [1] "NJZTIWEIOIOIPIEPWIQPOEIQWIEPOQWIEPOQWIEPQIWEP"
Use melt
from "reshape2" to create a data.frame
. Use rank
along with nchar
to subset. (I used rank instead of ==
so that I didn't have to use nchar
twice--haven't checked for comparative efficiency.)
library(data.table)
library(reshape2)
as.data.table(melt(l))[, Rnk := rank(nchar(as.character(value))),
by = L1][Rnk == 1]
# value L1 Rnk
# 1: MJSDHWTEZTZEWUIASUDUAISDUASADIASDIAUSIDAUSIDCASDAS ss_23_122_0_1 1
# 2: MAASDASDWEPWERIWERIWER ss_23_167_0_1 1
# 3: ZTTRASOIIDIFOSDIOFISDOFSDFQAWTZETQWE ss_45_201_0_1 1
# 4: NJZTIWEIOIOIPIEPWIQPOEIQWIEPOQWIEPOQWIEPQIWEP ss_89_10_0_2 1
Similar approach to "data.table".
library(dplyr)
library(reshape2)
melt(l) %>%
group_by(L1) %>%
mutate(Rnk = dense_rank(nchar(as.character(value)))) %>%
filter(Rnk == 1)
# Source: local data frame [4 x 3]
# Groups: L1
#
# value L1 Rnk
# 1 MJSDHWTEZTZEWUIASUDUAISDUASADIASDIAUSIDAUSIDCASDAS ss_23_122_0_1 1
# 2 MAASDASDWEPWERIWERIWER ss_23_167_0_1 1
# 3 ZTTRASOIIDIFOSDIOFISDOFSDFQAWTZETQWE ss_45_201_0_1 1
# 4 NJZTIWEIOIOIPIEPWIQPOEIQWIEPOQWIEPOQWIEPQIWEP ss_89_10_0_2 1