Parse MEDLINE file for GWAS mining


I'm trying to parse a MEDLINE file into a 0/1 table so that I can run some downstream statistical analyses (PCA, GWAS, etc.). I formatted it using the Python module Bio.Medline plus some additional shell commands, but now I don't know how to continue.

I need to transform File 1 (a key-value file with one paper per line and tab-separated keywords) into a table with one column per distinct keyword, where the presence or absence of each keyword is shown as 1 or 0.

I would like to do this with Perl but other solutions are welcome.

Thanks, Bernardo

File 1:

19801464    Animals Biodiversity    Computational Biology/methods   DNA
19696045    Environmental Microbiology  Computational Biology/methods   Software

Desired output:

            Animals Biodiversity    Computational Biology/methods    DNA    Environmental Microbiology    Software
19801464    1                       1                                1      0                             0
19696045    0                       1                                0      1                             1

Solution

  • This Perl script builds a hash of hashes (paper ID => keyword => 1 or 0) that you should be able to work with. For convenience I used List::MoreUtils for uniq and Data::Printer (loaded through its DDP alias) for dumping the data structure:

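    #!/usr/bin/env perl
    use strict;
    use warnings;
    use List::MoreUtils qw(uniq);
    use DDP;    # Data::Printer, provides p()
    
    my %paper;         # paper ID => { keyword => 1 or 0 }
    my @categories;    # every keyword seen, deduplicated below
    
    # Fields in the DATA section must be separated by literal tab characters.
    while (<DATA>) {
      chomp;
      next unless /\S/;    # skip blank lines
      my @record = split /\t/;
      $paper{$record[0]} = { map { $_ => 1 } @record[1..$#record] };
      push @categories, @record[1..$#record];
    }
    
    @categories = uniq @categories;
    
    # Fill in 0 for every keyword a paper does not have (//= needs Perl 5.10+).
    foreach (keys %paper) {
      foreach my $category (@categories) {
        $paper{$_}{$category} //= 0;
      }
    }
    
    p %paper;
    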
    __DATA__
    19801464   Animals Biodiversity  Computational Biology/methods  DNA     
    19696045   Environmental Microbiology   Computational Biology/methods Software
    

    Output

    {
        19696045   {
            'Animals Biodiversity'            0,
            'Computational Biology/methods'   1,
            DNA                               0,
            'Environmental Microbiology'      1,
            Software                          1
        },
        19801464   {
            'Animals Biodiversity'            1,
            'Computational Biology/methods'   1,
            DNA                               1,
            'Environmental Microbiology'      0,
            Software                          0
        }
    }
    

    Getting from there to the output you want may require printf to align the columns properly. The following might be enough for your purposes:

    # Header row, then one row per paper with values in @categories order.
    print "\t", (join "  ", @categories);
    for (sort keys %paper) {
      print "\n", $_, "\t\t";
      for my $category (@categories) {
        print $paper{$_}{$category}, " " x 17;
      }
    }
    print "\n";
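
    If the goal is a file that downstream tools can read directly, a tab-separated table may be easier to work with than padded columns. The following is a minimal sketch rather than part of the script above: presence_table is a hypothetical helper name, and it assumes each input line is a paper ID followed by tab-separated keywords. It keeps papers and keywords in first-seen order, so the result lines up with the desired output in the question:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): turn lines of the form
# "id<TAB>keyword<TAB>keyword..." into the rows of a tab-separated 0/1 table.
sub presence_table {
    my @lines = @_;
    my (%paper, @ids, %seen_id, @categories, %seen_keyword);

    for my $line (@lines) {
        chomp $line;
        next unless $line =~ /\S/;                  # skip blank lines
        my ($id, @keywords) = split /\t/, $line;
        push @ids, $id unless $seen_id{$id}++;      # keep papers in input order
        $paper{$id}{$_} = 1 for @keywords;
        # keep keywords in the order they are first seen
        push @categories, grep { !$seen_keyword{$_}++ } @keywords;
    }

    # Header row starts with an empty cell above the ID column.
    my @rows = (join "\t", '', @categories);
    for my $id (@ids) {
        my @flags = map { exists $paper{$id}{$_} ? 1 : 0 } @categories;
        push @rows, join "\t", $id, @flags;
    }
    return @rows;
}

# Sample records from the question, with explicit tab separators.
my @input = (
    "19801464\tAnimals Biodiversity\tComputational Biology/methods\tDNA",
    "19696045\tEnvironmental Microbiology\tComputational Biology/methods\tSoftware",
);
print "$_\n" for presence_table(@input);
```

    Redirecting the output to a file gives a table that most statistics packages can import as tab-separated values.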
    

    Edit

    A few alternatives for formatting your output. Where a format string is built, the x operator repeats the field once per element of the @categories array, so the number of fields matches the number of columns:

    Using format

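    our $num;    # package variable, so the eval'd format can see it under strict
    my $format_line = 'format STDOUT =' . "\n"
                    . '@# ' x scalar(@categories) . "\n"
                    . '@{ $paper{$num} }{@categories}' . "\n"    # slice keeps column order
                    . '.' . "\n";
    for $num (keys %paper) {
      print $num;
      no warnings 'redefine';    # the format is redefined on every pass
      eval $format_line;
      write;
    }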
    

    Using printf:

    print " " x 9, join("  ", @categories), "\n";
    for my $num (keys %paper) {
      print $num;
      # hash slice returns the values in @categories order
      printf "%19d", $_ for @{ $paper{$num} }{@categories};
      print "\n";
    }
    

    Using form:

    use Perl6::Form;
    for my $num (keys %paper) {
      print form
        "{<<<<<<<<}" . "{>}" x scalar(@categories),
        $num, @{ $paper{$num} }{@categories};    # values in @categories order
    }
    

    Depending on what you plan to do with the data, you may be able to do the rest of your analysis in Perl, so precise formatting for printing might not be a priority until a later stage in your workflow. See BioPerl for ideas.
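
    As one small illustration of staying in Perl, the sketch below counts how many papers carry each keyword, working directly on a hash of hashes shaped like %paper above. keyword_counts is a hypothetical helper name, and the sample data is simply the two records from the question:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): given a reference to a
# hash shaped like %paper (id => { keyword => 0 or 1 }), return the number of
# papers in which each keyword is present.
sub keyword_counts {
    my ($paper) = @_;
    my %count;
    for my $id (keys %$paper) {
        for my $keyword (keys %{ $paper->{$id} }) {
            $count{$keyword} += $paper->{$id}{$keyword};
        }
    }
    return %count;
}

# The two records from the question, already expanded to 0/1 flags.
my %paper = (
    19801464 => {
        'Animals Biodiversity'          => 1,
        'Computational Biology/methods' => 1,
        'DNA'                           => 1,
        'Environmental Microbiology'    => 0,
        'Software'                      => 0,
    },
    19696045 => {
        'Animals Biodiversity'          => 0,
        'Computational Biology/methods' => 1,
        'DNA'                           => 0,
        'Environmental Microbiology'    => 1,
        'Software'                      => 1,
    },
);

my %count = keyword_counts(\%paper);

# Most frequent keywords first, ties broken alphabetically.
for my $keyword (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
    print "$keyword\t$count{$keyword}\n";
}
```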