
Using the sed transliterate command in Python


There is a sed command that turns the ASCII quality codes of a FASTQ file into bar symbols:

sed -e 'n;n;n;y/!"#$%&'\''()*+,-.\/0123456789:;<=>?@ABCDEFGHIJKL/▁▁▁▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇██████/' myfile.fastq

I have been looking for a way to do the same in Python, but I have not found a solution I can use. Maybe pysed or re.sub would work, but I do not even know how to write these ASCII characters in a string without Python mixing them up.


Solution

  • So, you want to transliterate the characters in the 4th line (the quality line) of your FASTQ file?

    You can use str.translate with a translation table built by str.maketrans:

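    #!/usr/bin/env python3
    # Map each quality character ('!' through 'L') to a bar symbol.
    lut = str.maketrans('''!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKL''',
                        '''▁▁▁▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇██████''')
    
    with open('/path/to/fastq') as f:
        # Index 3 is the 4th line: the quality string of the first record.
        quality = f.readlines()[3].strip()
    
    print(quality.translate(lut))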
    

    For a sample file from Wikipedia:

    @SEQ_ID
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
    

    the Python script above will produce:

    ▁▁▁▂▁▁▁▁▂▂▂▂▂▂▁▁▁▂▂▂▁▁▁▁▁▂▃▃▂▂▂▂▂▂▁▁▂▂▂▂▄▄▇▇▇▆▆▆▆▆▆▇▇▇▇▇▇▇▄▄
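
    The script at the top of this answer only handles the first record, whereas the sed command transliterates the 4th line of every record (n;n;n; steps over three lines, then y/.../ runs on the next one). Below is a minimal sketch of that behaviour, assuming strict 4-line records with no wrapped sequence lines. The helper name translate_quality_lines and the in-memory sample are hypothetical stand-ins, not part of the original answer:

```python
import io

# Same 44-character table as in the question ('!' through 'L').
SRC = '''!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKL'''
DST = '▁▁▁▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇██████'
LUT = str.maketrans(SRC, DST)


def translate_quality_lines(lines, lut=LUT):
    """Yield every line, translating only the 4th line of each record.

    Hypothetical helper: assumes strict 4-line FASTQ records.
    """
    for index, line in enumerate(lines):
        text = line.rstrip('\n')
        # Lines 4, 8, 12, ... (index 3, 7, 11, ...) hold the quality string.
        yield text.translate(lut) if index % 4 == 3 else text


# Stand-in for open('/path/to/fastq'): two short made-up records.
sample = io.StringIO('@r1\nGATT\n+\n!5CL\n@r2\nACGT\n+\n+.>F\n')
for out in translate_quality_lines(sample):
    print(out)
```

    For the two made-up records, the header, sequence and + lines come out unchanged, and the quality lines !5CL and +.>F become ▁▄▇█ and ▂▃▆▇. To run it on a real file, pass the open file object in place of sample.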
    

    However, note that according to the FASTQ format description on Wikipedia, your translation table does not cover the full range. The character ! represents the lowest quality while ~ is the highest (not L, where your table stops).
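
    To see what the cutoff at L means in practice, here is a small check using the 44-character table from the question. The quality string is made up for illustration; str.translate leaves any character that is missing from the table unchanged:

```python
# The 44-character table from the question ('!' through 'L').
src = '''!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKL'''
dst = '▁▁▁▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇██████'
lut = str.maketrans(src, dst)

# Made-up quality string mixing covered and uncovered characters.
quality = '!5LM~'
translated = quality.translate(lut)
print(translated)

# Characters that str.translate left untouched because the table lacks them.
unmapped = sorted(set(quality) - set(src))
print(unmapped)
```

    Here M and ~ pass through as plain ASCII, so a read with qualities above L would show a mix of bars and letters.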

    Also note that the quality characters map the ASCII range ! to ~ directly, and in order, onto quality values. In other words, we can build the translation table programmatically:

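    # Number of distinct bar symbols, '▁' (U+2581) through '█' (U+2588).
    span = ord('█') - ord('▁') + 1
    # Every valid quality character, '!' through '~'.
    src = ''.join(chr(c) for c in range(ord('!'), ord('~')+1))
    # Scale each character's position in src evenly onto the bar range.
    dst = ''.join(chr(ord('▁') + span*(ord(c)-ord('!'))//len(src)) for c in src)
    lut = str.maketrans(src, dst)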
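
    As a usage sketch, the same construction can be wrapped in a function and checked at the ends of the range. The names build_lut and phred_scores are hypothetical, and the offset of 33 assumes the Sanger (Phred+33) encoding, which is what a range starting at ! implies:

```python
def build_lut(low='!', high='~', first_bar='▁', last_bar='█'):
    """Map the characters low..high evenly onto the bars first_bar..last_bar."""
    span = ord(last_bar) - ord(first_bar) + 1
    src = ''.join(chr(c) for c in range(ord(low), ord(high) + 1))
    dst = ''.join(
        chr(ord(first_bar) + span * (ord(c) - ord(low)) // len(src)) for c in src
    )
    return str.maketrans(src, dst)


def phred_scores(quality, offset=33):
    """Decode quality characters to integers, assuming Phred+33 encoding."""
    return [ord(c) - offset for c in quality]


lut = build_lut()

# Lowest, former top ('L') and highest quality characters.
print('!L~'.translate(lut))   # ▁▄█
print(phred_scores('!5L~'))   # [0, 20, 43, 93]
```

    With the full range, L no longer maps to the tallest bar but lands mid-scale at ▄, so the bars look lower than with the 44-character table for the same data.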