I'm trying to parse a MEDLINE file into a 0/1 table so I can perform some downstream statistical analysis: PCA, GWAS, etc. I formatted it using a Python module called Bio.Medline plus some additional shell commands. Now I don't know how to continue.
I need to transform File 1, a key-value file with one paper per line and tab-separated keywords, into a file with collapsed keywords and the presence/absence of each keyword shown as 1 or 0.
I would like to do this with Perl, but other solutions are welcome.
Thanks, Bernardo
File 1:
19801464 Animals Biodiversity Computational Biology/methods DNA
19696045 Environmental Microbiology Computational Biology/methods Software
Desired output:
Animals Biodiversity Computational Biology/methods DNA Environmental Microbiology Software
19801464 1 1 1 0 0
19696045 0 1 0 1 1
This Perl script will build a hash that you should be able to work with. For convenience I used List::MoreUtils for uniq and Data::Printer for dumping the data structure:
#!/usr/bin/env perl
use strict;
use warnings;
use List::MoreUtils qw(uniq);
use DDP;    # Data::Printer, provides p()

my %paper;         # PMID => { keyword => 1 or 0 }
my @categories;    # every keyword seen, in order of first appearance

# Fields in the DATA section below are tab-separated.
while (<DATA>) {
    chomp;
    my @record = split /\t/;
    $paper{ $record[0] } = { map { $_ => 1 } @record[ 1 .. $#record ] };
    push @categories, @record[ 1 .. $#record ];
}
@categories = uniq @categories;

# Mark every keyword a paper lacks with 0.
foreach my $id (keys %paper) {
    foreach my $category (@categories) {
        $paper{$id}{$category} //= 0;
    }
}

p %paper;

__DATA__
19801464	Animals Biodiversity	Computational Biology/methods	DNA
19696045	Environmental Microbiology	Computational Biology/methods	Software
Output:
{
    19696045 {
        'Animals Biodiversity' 0,
        'Computational Biology/methods' 1,
        DNA 0,
        'Environmental Microbiology' 1,
        Software 1
    },
    19801464 {
        'Animals Biodiversity' 1,
        'Computational Biology/methods' 1,
        DNA 1,
        'Environmental Microbiology' 0,
        Software 0
    }
}
Getting from there to the output you want may require printf to format the lines properly. The following might be enough for your purposes:
print "\t", (join " ", @categories);
for (keys %paper) {
    print "\n", $_, "\t\t";
    for my $category (@categories) {
        print $paper{$_}{$category}, " " x 17;
    }
}
Edit
A few alternatives for formatting your output (we use x to multiply the format sections by the number of elements in the @categories array so they match):
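Using format:
my $format_line = 'format STDOUT =' . "\n"
    . '@# ' x scalar(@categories) . "\n"
    # a hash slice keeps the values in the same order as @categories
    . '@{ $paper{$num} }{@categories}' . "\n"
    . '.' . "\n";
for my $num (keys %paper) {
    print $num;
    no warnings 'redefine';
    eval $format_line;    # (re)define the format so it sees the current $num
    write;
}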
Using printf:
print " " x 9, join(" ", @categories), "\n";
for my $num (keys %paper) {
    print $num;
    # a hash slice keeps the values in the same order as @categories
    printf "%19d", $_ for @{ $paper{$num} }{@categories};
    print "\n";
}
Using form:
use Perl6::Form;
for my $num (keys %paper) {
    print form
        "{<<<<<<<<}" . "{>}" x scalar(@categories),
        $num, @{ $paper{$num} }{@categories};
}
Depending on what you plan on doing with the data, you may be able to do the rest of your analysis in Perl, so precise formatting for printing might not be a priority until a later stage in your workflow. See BioPerl for ideas.
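Since the question welcomes non-Perl solutions and the file was already prepared with Python's Bio.Medline, here is a rough Python sketch of the same presence/absence table, written as tab-separated text that most statistics tools can read directly. It is only a sketch under some assumptions: the input really is tab-separated with the PMID in the first field, the keyword grouping follows the dump shown above, and the names build_matrix and to_tsv, as well as the "PMID" header label, are made up for illustration.

```python
import io


def build_matrix(lines):
    """Build a 0/1 keyword matrix from 'PMID<TAB>kw<TAB>kw...' lines.

    Returns (pmids, columns, rows); columns are in first-seen order.
    """
    papers = {}    # PMID -> set of its keywords (dict keeps input order)
    keywords = {}  # dict used as an ordered set of all keywords seen
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        pmid = fields[0]
        kws = [kw for kw in fields[1:] if kw]  # ignore empty fields
        papers.setdefault(pmid, set()).update(kws)
        for kw in kws:
            keywords.setdefault(kw, None)
    columns = list(keywords)
    rows = [[int(kw in papers[pmid]) for kw in columns] for pmid in papers]
    return list(papers), columns, rows


def to_tsv(pmids, columns, rows):
    """Render the matrix as tab-separated text with a header row."""
    out = ["\t".join(["PMID"] + columns)]  # "PMID" label is an assumption
    for pmid, row in zip(pmids, rows):
        out.append("\t".join([pmid] + [str(v) for v in row]))
    return "\n".join(out) + "\n"


# Sample input mirroring File 1 (keyword grouping as in the dump above).
sample = io.StringIO(
    "19801464\tAnimals Biodiversity\tComputational Biology/methods\tDNA\n"
    "19696045\tEnvironmental Microbiology\tComputational Biology/methods\tSoftware\n"
)
pmids, columns, rows = build_matrix(sample)
print(to_tsv(pmids, columns, rows), end="")
```

Using tabs rather than fixed-width padding avoids the alignment problems that come with keywords containing spaces, and building each row by iterating over the same column list keeps the values lined up with the header.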