My text file contains protein-protein interaction data and looks like this:
transcription_factor protein
Myc Rilpl1
Mycn Rilpl1
Mycn "Wdhd1,Socs4"
Sox2 Rilpl1
Sox2 "Wdhd1,Socs4"
Nanog "Wdhd1,Socs4"
I want it to look like this, so I can see which transcription factors interact with each protein:
protein transcription_factor
Rilpl1 Myc, Mycn, Sox2
Wdhd1 Mycn, Sox2, Nanog
Socs4 Mycn, Sox2, Nanog
After running my code, this is what I get. How can I get rid of the quotation marks and put the two proteins on separate lines?
protein transcription_factor
Rilpl1 Myc, Mycn, Sox2
"Wdhd1,Socs4" Mycn, Nanog, Sox2
Here is my code:
input_file = ARGV[0]
hash = {}
File.readlines(input_file, "\r").each do |line|
  transcription_factor, protein = line.chomp.split("\t")
  if hash.has_key? protein
    hash[protein] << transcription_factor
  else
    hash[protein] = [transcription_factor]
  end
end
hash.each do |key, value|
  if value.count > 2
    string = value.join(', ')
    puts "#{key}\t#{string}"
  end
end
Here is a quick way to fix your problem:
...
  transcription_factor, proteins = line.chomp.split("\t")
  proteins.to_s.gsub(/"/,'').split(',').each do |protein|
    if hash.has_key? protein
      hash[protein] << transcription_factor
    else
      hash[protein] = [transcription_factor]
    end
  end
...
The snippet above removes any quotes from the protein field, splits the field on commas, and then does for each resulting protein exactly what your code already did.
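To see what the quote-stripping and splitting step does in isolation, here is a small sketch. The variable names fields and split_fields are only for illustration, and the two sample values are copied from the data in the question:

```ruby
# Two protein fields as they appear in the question's data:
# one plain name, one quoted comma-separated pair.
fields = ['Rilpl1', '"Wdhd1,Socs4"']

# Remove any double quotes, then split on commas.
# A field without a comma simply becomes a one-element array.
split_fields = fields.map { |field| field.to_s.gsub(/"/, '').split(',') }

p split_fields
# prints [["Rilpl1"], ["Wdhd1", "Socs4"]]
```

The to_s call is there so that a line with a missing protein column (where the field would be nil) yields an empty array instead of raising an error.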
Also, if you would like to eliminate the if/else, you can define the hash with a default block like this:
hash = Hash.new { |h, key| h[key] = [] }
With this block, looking up a key that does not exist yet stores a new empty array under that key and returns it. So now you can skip the if/else and just write:
hash[protein] << transcription_factor
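
Putting the pieces together, a minimal self-contained sketch could look like the following. The method name group_by_protein and the sample array are assumptions made for illustration (your script reads the lines from a file instead), and the value.count > 2 filter from your output loop is left out to keep the example short:

```ruby
# Hypothetical helper: builds a hash of protein => [transcription factors]
# from tab-separated "transcription_factor<TAB>protein(s)" lines.
def group_by_protein(lines)
  # Each missing key gets its own fresh array on first access.
  groups = Hash.new { |h, key| h[key] = [] }
  lines.each do |line|
    transcription_factor, proteins = line.chomp.split("\t")
    # Strip quotes, then treat each comma-separated protein separately.
    proteins.to_s.gsub(/"/, '').split(',').each do |protein|
      groups[protein] << transcription_factor
    end
  end
  groups
end

# Sample rows from the question (header line omitted), with tab separators.
sample = [
  "Myc\tRilpl1",
  "Mycn\tRilpl1",
  "Mycn\t\"Wdhd1,Socs4\"",
  "Sox2\tRilpl1",
  "Sox2\t\"Wdhd1,Socs4\"",
  "Nanog\t\"Wdhd1,Socs4\""
]

result = group_by_protein(sample)
result.each do |protein, factors|
  puts "#{protein}\t#{factors.join(', ')}"
end
```

With the sample rows above, each of Rilpl1, Wdhd1 and Socs4 ends up with its own entry, and the transcription factors are listed in the order they appear in the input.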