Extract journal title from Genbank file using perl without using $1, $2 etc


This is a part of my input Genbank file:

LOCUS       AC_000005              34125 bp    DNA     linear   VRL 03-OCT-2005
DEFINITION  Human adenovirus type 12, complete genome.
ACCESSION   AC_000005 BK000405
VERSION     AC_000005.1  GI:56160436
KEYWORDS    .
SOURCE      Human adenovirus type 12
  ORGANISM  Human adenovirus type 12
            Viruses; dsDNA viruses, no RNA stage; Adenoviridae; Mastadenovirus.
REFERENCE   1  (bases 1 to 34125)
  AUTHORS   Davison,A.J., Benko,M. and Harrach,B.
  TITLE     Genetic content and evolution of adenoviruses
  JOURNAL   J. Gen. Virol. 84 (Pt 11), 2895-2908 (2003)
   PUBMED   14573794

I want to extract the journal title, for example J. Gen. Virol. (not including the issue number and pages).

This is my code. It doesn't give any result, so I am wondering what goes wrong. I did use parentheses with $1, $2 etc., and that worked, but my tutor told me to try without that method and use substr instead.

foreach my $line (@lines) {
    if ( $line =~ m/JOURNAL/g ) {
        $journal_line = $line;
        $character = substr( $line, $index, 2 );
        if ( $character =~ m/\s\d/ ) {
            print substr( $line, 12, $index - 13 );
            print "\n";
        }
        $index++;
    }
}

Solution

  • Rather than matching and then using substr, it is much easier to use a single regex that matches the whole JOURNAL line and uses parentheses to capture the text representing the journal information:

    foreach my $line (@lines) {
        if ($line =~ /JOURNAL\s+(.+)/) {
            print "Journal information: $1\n";
        }
    }
    

    The regular expression looks for JOURNAL followed by one or more whitespace characters, and (.+) captures the rest of the characters in the line.
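
    If the aim is only to avoid naming the numbered variables, a match in list context returns its captures directly. This is a minimal sketch, assuming the sample JOURNAL line from the question; $info is a name chosen for illustration, and since it still relies on capturing parentheses it may not be what your tutor has in mind:

    ```perl
    use strict;
    use warnings;

    # Sample line copied from the question's GenBank record
    my $line = "  JOURNAL   J. Gen. Virol. 84 (Pt 11), 2895-2908 (2003)";

    # In list context the match returns its captures, so $1 is never named
    my ($info) = $line =~ /JOURNAL\s+(.+)/;

    print "Journal information: $info\n" if defined $info;
    ```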

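    Your loop looks at only one two-character window per JOURNAL line, because $index is incremented once per matching line rather than once per character. To get the text without using $1, I think you're trying to do something like this: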

    if ($line =~ /JOURNAL/) {
        # start just past the keyword; index() allows for the leading indentation
        my $ix = index($line, 'JOURNAL') + length('JOURNAL');
        # variable containing the journal name
        my $j_name;
        # while the journal name is not defined and we are still inside the line...
        while (! defined $j_name && $ix < length $line) {
            # get character $ix in the string
            if (substr($line, $ix, 1) =~ /\s/) {
                # if it is whitespace, increase $ix by one
                $ix++;
            }
            else {
                # if it isn't whitespace, we've found the text!
                $j_name = substr($line, $ix);
            }
        }
    }
    

    If you already know how many characters there are in the left-hand column, you can just do substr($line, 12) (or whatever) to retrieve the substring of $line starting at offset 12 (offsets count from 0):

    foreach my $line (@lines) {
        if ($line =~ /JOURNAL/) {
            print "Journal information: " . substr($line, 12) . "\n";
        }
    }
    

    You can combine the two techniques to drop the volume, issue, pages and year from the journal data:

    if ($line =~ /JOURNAL/) {
        my $j_name;
        my $indent = 12; # the width of the left-hand column
        my $ix = $indent; # we'll use this to track the characters in our loop
        # stop at the first digit, or at the end of the line if there isn't one
        while (! defined $j_name && $ix < length $line) {
            # get character $ix in the string
            if (substr($line, $ix, 1) =~ /\d/) {
                # if it is a digit, we've found the volume number
                # we can stop looping now. Whew!
                # get a substring of $line starting at $indent going to $ix
                # (i.e. of length $ix - $indent)
                $j_name = substr($line, $indent, $ix - $indent);
                # drop the space left between the title and the number
                $j_name =~ s/\s+$//;
            }
            $ix++;
        }
        print "Journal information: $j_name\n" if defined $j_name;
    }
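
    The steps above can also be wrapped in one small subroutine that uses only index and substr, with no captures at all. This is a sketch under stated assumptions: journal_title is a hypothetical helper name, and the first digit after the keyword is assumed to begin the volume number, so a journal abbreviation that itself contains a digit would be cut short.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: returns the journal title from a JOURNAL line,
    # or nothing (undef in scalar context) if the keyword is absent.
    sub journal_title {
        my ($line) = @_;

        # index() returns -1 when the keyword is not found
        my $key = index($line, 'JOURNAL');
        return if $key < 0;

        # Skip the keyword, then any whitespace after it
        my $start = $key + length('JOURNAL');
        $start++ while $start < length($line) && substr($line, $start, 1) =~ /\s/;

        # Advance to the first digit (assumed to begin the volume number)
        my $end = $start;
        $end++ while $end < length($line) && substr($line, $end, 1) !~ /\d/;

        # Take the text in between and drop trailing whitespace
        my $title = substr($line, $start, $end - $start);
        $title =~ s/\s+$//;
        return $title;
    }

    my $line = "  JOURNAL   J. Gen. Virol. 84 (Pt 11), 2895-2908 (2003)";
    print journal_title($line), "\n";
    ```

    Searching for the keyword with index, rather than hard-coding the column width, keeps the helper working whether or not the line carries its leading indentation.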
    

    I think it would have been easier just to get the data from the Pubmed API! ;)