What sort of algorithm should be used to rearrange the FASTA sequences into length order (shortest first)? It needs to sort the sequences into length order, but with all the information displayed, not just the lengths.
I can get the length of each sequence with Bio::FastaFormat#length, put the lengths into an array, then sort:
require 'rubygems'
require 'bio'
file = Bio::FastaFormat.open(ARGV.shift)
seqarray = []
file.each do |seq|
  a = seq.length
  seqarray.push a
end
puts seqarray.sort
This displays the sequence lengths in order, but what I need to be able to see is the original FASTA format, in length order.
I can't add seq.length (the length of each sequence) to seq.entry (the entire FASTA entry) and then sort, because seq.length is an integer and seq.entry is a string. I tried converting the length with seq.length.to_s, adding this to seq.entry, then sorting. This is the closest I've got; unfortunately the lengths are now strings, so they sort as 1, 11, 111 instead of 1, 2, 3, etc.:
require 'rubygems'
require 'bio'
file = Bio::FastaFormat.open(ARGV.shift)
seqarray = []
file.each do |seq|
  a = (seq.length).to_s + ' = length' + seq.entry
  seqarray.push a
end
puts seqarray.sort
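The 1, 11, 111 ordering in the attempt above comes from comparing the lengths as text rather than as numbers. A minimal sketch with made-up lengths (no BioRuby needed, and the values are purely illustrative) shows the difference between sorting strings and sorting integers:

```ruby
# Hypothetical sequence lengths, for illustration only.
lengths = [111, 2, 11, 1, 3]

# Sorting the string forms compares character by character,
# so "11" lands before "2".
as_strings = lengths.map(&:to_s).sort

# Sorting the integers compares numeric values.
as_integers = lengths.sort

p as_strings
p as_integers
```

So any approach that glues the length onto the front of the entry text will sort in the string order, however the rest of the entry looks.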
After this I tried the same approach using the sequence_id instead of the entire entry, without converting the length to a string, but the id has letters in it, so I can't add it to the length integer without getting an error message.
So yeah, any suggestions?
I think you can use the approach from "How to sort a Ruby array of strings by length".
Put the whole entries into an array, then sort it by length with a block like the one described in the link.
Like this:
require 'rubygems'
require 'bio'
file = Bio::FastaFormat.open(ARGV.shift)
seqarray = []
file.each do |seq|
  seqarray.push seq
end
puts seqarray.sort_by {|x| x.length}
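To see why sort_by avoids the integer/string problem, here is a self-contained sketch that does not depend on BioRuby. The Record struct and the sample data are hypothetical stand-ins for the FASTA entries, not part of any library; the point is only that sort_by orders whole objects by an integer key, so nothing has to be added to the entry text.

```ruby
# Hypothetical stand-in for a FASTA entry: an id plus its sequence.
Record = Struct.new(:id, :sequence) do
  # Length of the sequence only, not of the header line.
  def length
    sequence.length
  end

  # Render the record back to FASTA text.
  def to_s
    ">#{id}\n#{sequence}\n"
  end
end

# Made-up sample data.
records = [
  Record.new("seqA", "ACGTACGTACGT"),
  Record.new("seqB", "AC"),
  Record.new("seqC", "ACGTA")
]

# sort_by computes the integer key for each record and sorts on it,
# returning the records themselves in shortest-first order.
sorted = records.sort_by(&:length)

puts sorted.map(&:to_s)
```

For longest first, a key of `-r.length` in the block (or calling `reverse` on the result) should give the opposite order.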