Tags: python, hadoop, hadoop-streaming, mrjob

ValueError: Can't specify both mapper_raw and mapper in Python


I am trying to read an .fna (FASTA) file with mrjob in Python.

This is my load_read.py program. The parsing logic works correctly when I run it without mrjob.

from mrjob.job import MRJob
from Bio import SeqIO
from Bio.Seq import Seq
import re
from operator import itemgetter 
import sys

def format_read(read):
    z = re.split('[|={,]+', read.description)
    return read.seq, z[3]

class LoadMetaRead(MRJob):

    def mapper_raw(self, file_path, file_uri):
        from Bio import SeqIO
        from Bio.Seq import Seq

        seqs = list(SeqIO.parse(file_path, type='fna'))

        is_paired_end = False
        if len(seqs) > 2 and seqs[0].id[-1:] != seqs[1].id[-1:]:
            is_paired_end = True

        label_list = dict()
        label_index = 0
        
        for i in range(0, len(seqs), 2 if is_paired_end else 1):
            read, label = format_read(seqs[i])
            if is_paired_end:
                read2, _ = format_read(seqs[i + 1])
                read += read2
            
            if label not in label_list:
                label_list[label] = label_index
                label_index += 1

            yield str(i), str(read), str(label_list[label])            

    def mapper(self, _, line):
        yield 'read', line

    def reducer(self, key, values):
        yield key, values

    combiner = reducer

if __name__ == '__main__':
    LoadMetaRead.run()

Example of the data file R4.fna:

>r1.1 |SOURCES={GI=15668172,fw,1146130-1146958}|ERRORS={52_1:A,78_1:G,78_2:G,78_3:G,641_1:G}|SOURCE_1="Methanocaldococcus jannaschii DSM 2661 chromosome" (392b1054a4bf536ea1cc349545ace50120973c3a)
AAACCCTCTTCCACGAACCCTCTTGAAAATCCCCCACATCCACAAAATAAATCAAATAAATTTCA
ACATTATCACCAAAAGGGTAAAAGGTTATTTAAAAAATAAAATAAATTTAAAAATTTAAATTAAA
TACCAAAAAAGCCAAATAACTTATTGTGATTCTTGAGCTTTCTTTAACTCTGCCTTCATATCTTG
ATAGACTTTAGTCCATTTTAATTTTCTTGGATTTCTTCCCATTCTGTAGCTTTTCTCACATTTGG
ATGAGCAGAAATATAATACAGTCCCATCTTTTTCTACGACCATTTTTCCTTTTCCTGGCTCAATT
TCATAACCACAAAAGCTGCATGTTCTCCATTCTGGCATAGCTATCCCCCTTTAATAGTGTTTCAG
TGATTTTAAAATAATTTAAGATTAAATTATTTATCTTCTTCTGTCTAATGGTCTTGCTTCTCTCT
CTGTTTCTCTTAACATAATAATGTCTCCAACTTTAACTGGACCTTTAACGTTTCTAACTAAAACT
CTTCCAGTATCTTTTCCACCTAAGATTTTACATCTAACTTGTATAATTCCTCCAGTAACCCCTGT
TCTACCAATGACTTCAATAACTTCAGCAGCTACTGCTTCCTTATAAACAAATTCATCTTCCGATC
CTCATCACCTAATATTAATGAAGGTTTAAAATTTATAAAAAAGTTAGTAGTAGTGTTTCATAATT
TATATAATAATAACTATATACTATTGATTGATGGTTAAATAGCGTTCTAATAATTTACTGCTTCA
AAACATTTACCTTTTCAATTAATACCTTTAACTCTTCAGCATCTCCTTCGTTG
>r2.1 |SOURCES={GI=15668172,bw,239211-239971}|ERRORS={113:-,217_1:C,281_1:G,627_1:G,717_1:T}|SOURCE_1="Methanocaldococcus jannaschii DSM 2661 chromosome" (392b1054a4bf536ea1cc349545ace50120973c3a)
TAGCATGTAAATCCCTTATTTCTTAATTTCTCCCAGAATTATTTCTATTGCTTTATCAACTGCCT
TGGCAACCTCTTCAGACAACCCTGGTTTTATGTCTGGCATTGTAAATTTTTACCTTGACAACCAA
TAACCACGACTTCTATGCCTTTATTATGTAAATCTTTGAGAAATGGGGCTAATGGAACGTTATGG
GCATCGAAAGAATATTTTTTAACTATTCGGTAATTCATCAACATCTATCTTTTTTATTGTTCCAG
GTTCTAAATCAAAATCAATGGCGATCAACAACAATAATCTTTTTTATATCTTCATCAACCAACGT
CATTAAATAGTATGCTCCACTTGCCCCAGCATCTATAACTTCAACGTTATCTGGCAAGTTCATTT
TTTCTAATTTGCTAACAACCTCACATCCAAAGCCATCATCTCCAAACAACAGATTTCCACAACCA
ACAATTAATATATCCTTCTTTTTCATTTTATCACTTATTTAGCATTTCTTTATATTTTTTAGCCT
CTTCTTTAGGATTTTGTGATTGATAGATTGCCCTTCCAACAATGACGTAATCATTCTCATCTAAA
ATATTTAAAATATCCTCAATCTTCCCTCCCTGAGCTCCGACTCCGTGGTGTTATTACTGGCAATT
CTGCAATTTCTTTAATTTCTTTAAGCCTTTCAGGCCTTGTTGATGGAGCAACTATAGCATCAACT
TTTAGTTTTTTTAGCCATCTCTGACAATTTATCTGCTATTGGCTGTAG
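
For reference, the label that format_read() takes from such a header can be checked without Biopython. The sketch below applies the same regular expression to the first header above. It assumes that Biopython's read.description holds the full header line without the leading ">"; the names HEADER and extract_label are made up for this illustration.

```python
import re

# First header line of R4.fna, without the leading ">".
# Assumption: this is what Biopython exposes as read.description.
HEADER = ('r1.1 |SOURCES={GI=15668172,fw,1146130-1146958}|'
          'ERRORS={52_1:A,78_1:G,78_2:G,78_3:G,641_1:G}|'
          'SOURCE_1="Methanocaldococcus jannaschii DSM 2661 chromosome" '
          '(392b1054a4bf536ea1cc349545ace50120973c3a)')


def extract_label(description):
    """Split on runs of | = { , and return the fourth field, as format_read() does."""
    fields = re.split('[|={,]+', description)
    return fields[3]


print(extract_label(HEADER))  # prints 15668172 (the GI number)
```

So reads are labelled by the GI number of their source sequence, and both sample reads above would get the same label.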

When I run the program with this command:

python load_read.py R4.fna

It raises this error:

ValueError: Can't specify both mapper_raw and mapper

Do you know how to fix this?


Solution

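  • It turns out that an MRJob class cannot define both mapper_raw() and mapper(); only one of them may be defined. I kept mapper_raw() because I need to read the whole file at once rather than line by line. I also changed the format passed to SeqIO.parse() to 'fasta' (.fna is only the file extension) and made the mapper yield a (key, value) pair instead of three values.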

    class LoadMetaRead(MRJob):
    
        def mapper_raw(self, file_path, file_uri):
            from Bio import SeqIO
            from Bio.Seq import Seq
    
            seqs = list(SeqIO.parse(file_path, 'fasta'))
    
            # Paired-end if the first two read IDs end in different characters.
            is_paired_end = len(seqs) > 2 and seqs[0].id[-1:] != seqs[1].id[-1:]
    
            label_list = dict()
            label_index = 0
            
            for i in range(0, len(seqs), 2 if is_paired_end else 1):
                read, label = format_read(seqs[i])
                if is_paired_end:
                    read2, _ = format_read(seqs[i + 1])
                    read += read2
                
                if label not in label_list:
                    label_list[label] = label_index
                    label_index += 1
    
                yield None, (str(read), str(label_list[label]))
    
        def reducer(self, key, values):
            for value in values:
                yield key, str(value)
    

    This code works as expected.
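
To illustrate the either/or rule, here is a minimal sketch that uses only the standard library. TinyJob, WholeFile, PerLine, Both, and demo are hypothetical names invented for this example, and the check is a simplified model, not mrjob's actual implementation. It shows the difference between the two styles: a line mapper is called once per input line, a raw mapper is called once per file with its path, and defining both is rejected.

```python
import os
import tempfile


class TinyJob:
    """Hypothetical stand-in for a job runner (not mrjob's real code)."""

    def mapper(self, key, line):
        return iter(())

    def mapper_raw(self, path, uri):
        return iter(())

    def run_mapper(self, path):
        cls = type(self)
        has_line = cls.mapper is not TinyJob.mapper
        has_raw = cls.mapper_raw is not TinyJob.mapper_raw
        if has_line and has_raw:
            # Same message as the error in the question.
            raise ValueError("Can't specify both mapper_raw and mapper")
        if has_raw:
            # Whole-file mode: one call per input file.
            return list(self.mapper_raw(path, path))
        out = []
        with open(path) as handle:
            # Line mode: one call per input line.
            for line in handle:
                out.extend(self.mapper(None, line.rstrip("\n")))
        return out


class WholeFile(TinyJob):
    def mapper_raw(self, path, uri):
        # Sees the whole file, so it can count FASTA records.
        with open(path) as handle:
            yield None, sum(1 for line in handle if line.startswith(">"))


class PerLine(TinyJob):
    def mapper(self, key, line):
        yield "read", line


class Both(WholeFile):
    def mapper(self, key, line):
        yield "read", line


def demo():
    """Run all three jobs on a tiny two-record file and collect the results."""
    with tempfile.NamedTemporaryFile("w", suffix=".fna", delete=False) as tmp:
        tmp.write(">r1.1\nAAAC\nCCTC\n>r2.1\nTAGC\n")
    try:
        whole = WholeFile().run_mapper(tmp.name)
        lines = PerLine().run_mapper(tmp.name)
        try:
            Both().run_mapper(tmp.name)
            error = None
        except ValueError as exc:
            error = str(exc)
    finally:
        os.remove(tmp.name)
    return whole, lines, error
```

Calling demo() returns the whole-file result, the per-line results, and the error message raised by Both. A multi-line FASTA record cannot be reassembled from isolated lines, which is why the whole-file style fits this job.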