Tags: python, python-3.x, output, python-re

How to show only the first 20 entries


The Python code below produces the output I want. However, I need help limiting the result to the first 20 lines of output.

An example of the input is shown below:

>gi|170079688|ref|YP_001729008.1| bifunctional riboflavin kinase/FMN adenylyltransferase [Escherichia coli str. K-12 substr. DH10B]
MKLIRGIHNLSQAPQEGCVLTIGNFDGVHRGHRALLQGLQEEGRKRNLPVMVMLFEPQPLELFATDKAPA
RLTRLREKLRYLAECGVDYVLCVRFDRRFAALTAQNFISDLLVKHLRVKFLAVGDDFRFGAGREGDFLLL
QKAGMEYGFDITSTQTFCEGGVRISSTAVRQALADDNLALAESLLGHPFAISGRVVHGDELGRTIGFPTA
NVPLRRQVSPVKGVYAVEVLGLGEKPLPGVANIGTRPTVAGIRQQLEVHLLDVAMDLYGRHIQVVLRKKI
RNEQRFASLDELKAQIARDELTAREFFGLTKPA
>gi|170079689|ref|YP_001729009.1| isoleucyl-tRNA synthetase [Escherichia coli str. K-12 substr. DH10B]
MSDYKSTLNLPETGFPMRGDLAKREPGMLARWTDDDLYGIIRAAKKGKKTFILHDGPPYANGSIHIGHSV
NKILKDIIVKSKGLSGYDSPYVPGWDCHGLPIELKVEQEYGKPGEKFTAAEFRAKCREYAATQVDGQRKD
FIRLGVLGDWSHPYLTMDFKTEANIIRALGKIIGNGHLHKGAKPVHWCVDCRSALAEAEVEYYDKTSPSI
DVAFQAVDQDALKAKFAVSNVNGPISLVIWTTTPWTLPANRAISIAPDFDYALVQIDGQAVILAKDLVES
VMQRIGVTDYTILGTVKGAELELLRFTHPFMGFDVPAILGDHVTLDAGTGAVHTAPGHGPDDYVIGQKYG
LETANPVGPDGTYLPGTYPTLDGVNVFKANDIVVALLQEKGALLHVEKMQHSYPCCWRHKTPIIFRATPQ
WFVSMDQKGLRAQSLKEIKGVQWIPDWGQARIESMVANRPDWCISRQRTWGVPMSLFVHKDTEELHPRTL
ELMEEVAKRVEVDGIQAWWDLDAKEILGDEADQYVKVPDTLDVWFDSGSTHSSVVDVRPEFAGHAADMYL
EGSDQHRGWFMSSLMISTAMKGKAPYRQVLTHGFTVDGQGRKMSKSIGNTVSPQDVMNKLGADILRLWVA
STDYTGEMAVSDEILKRAADSYRRIRNTARFLLANLNGFDPAKDMVKPEEMVVLDRWAVGCAKAAQEDIL
KAYEAYDFHEVVQRLMRFCSVEMGSFYLDIIKDRQYTAKADSVARRSCQTALYHIAEALVRWMAPILSFT
ADEVWGYLPGEREKYVFTGEWYEGLFGLADSEAMNDAFWDELLKVRGEVNKVIEQARADKKVGGSLEAAV
TLYAEPELSAKLTALGDELRFVLLTSGATVADYNDAPADAQQSEVLKGLKVALSKAEGEKCPRCWHYTQD
VGKVAEHAEICGRCVSNVAGDGEKRKFA
>gi|170079690|ref|YP_001729010.1| lipoprotein signal peptidase [Escherichia coli str. K-12 substr. DH10B]
MSQSICSTGLRWLWLVVVVLIIDLGSKYLILQNFALGDTVPLFPSLNLHYARNYGAAFSFLADSGGWQRW
FFAGIAIGISVILAVMMYRSKATQKLNNIAYALIIGGALGNLFDRLWHGFVVDMIDFYVGDWHFATFNLA
DTAICVGAALIVLEGFLPSRAKKQ

import re

id = None
header = None
seq = ''

a_file = open('e_coli.faa')

for line in a_file:
    m = re.match(r">(\S+)\s+(.+)", line.rstrip())
    if m:
        if id is not None:

            print("{0} length:{1} {2}".format(id, len(seq),header))

        id, header = m.groups()
        seq = ''
    else:
        seq += line.rstrip()

Solution

  • At the very top, add c = 0. Then change

            print("{0} length:{1} {2}".format(id, len(seq),header))
    

    to

            if c < 20:
                print("{0} length:{1} {2}".format(id, len(seq),header))
                c += 1
    

    Result with a few adjustments:

    import re

    id = None
    header = None
    seq = ''
    c = 0  # number of entries printed so far

    with open('e_coli.faa') as a_file:
        for line in a_file:
            # A header line looks like ">ID description"
            m = re.match(r">(\S+)\s+(.+)", line.rstrip())
            if m:
                # Print the previous entry, but only for the first 20
                if id and c < 20:
                    print("{0} length:{1} {2}".format(id, len(seq),header))
                    c += 1

                id, header = m.groups()
                seq = ''
            else:
                seq += line.rstrip()
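
An alternative to a manual counter is to wrap the parsing in a generator and take the first 20 items with itertools.islice. The following is only a sketch under some assumptions: the names iter_records and first_records are made up for illustration, the header pattern is the same one used above, and the in-memory demo data stands in for e_coli.faa. Unlike the loop above, it also emits the final record of the file, which has no following header line to trigger a print.

```python
import io
import re
from itertools import islice

HEADER_RE = re.compile(r">(\S+)\s+(.+)")


def iter_records(lines):
    """Yield (id, header, sequence_length) for each FASTA record in `lines`."""
    rec_id = header = None
    seq_len = 0
    for line in lines:
        line = line.rstrip()
        m = HEADER_RE.match(line)
        if m:
            # A new header closes the previous record, if there was one.
            if rec_id is not None:
                yield rec_id, header, seq_len
            rec_id, header = m.groups()
            seq_len = 0
        else:
            seq_len += len(line)
    # The last record has no following header, so emit it here.
    if rec_id is not None:
        yield rec_id, header, seq_len


def first_records(lines, n=20):
    """Return at most the first n records; parsing stops once n are found."""
    return list(islice(iter_records(lines), n))


# Made-up in-memory data standing in for open('e_coli.faa').
demo = io.StringIO(
    ">id1 first protein\nMKL\nIRG\n"
    ">id2 second protein\nMSDYK\n"
    ">id3 third protein\nMS\n"
)
for rec_id, header, length in first_records(demo, n=2):
    print("{0} length:{1} {2}".format(rec_id, length, header))
```

With a real file, the same call would be first_records(a_file) inside a with open('e_coli.faa') as a_file: block.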
    

    To read only the first 20 lines of the file (rather than the first 20 entries), you can slice the result of readlines():

    Instead of:

    for line in a_file:
    

    use:

    for line in a_file.readlines()[:20]:
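
Note that readlines() loads the whole file into memory before the slice is taken. For a large file, a possible variant is itertools.islice, which stops reading after the requested number of lines. This is a minimal sketch: the helper name head and the in-memory demo data are assumptions made for illustration.

```python
import io
from itertools import islice


def head(lines, n=20):
    """Return at most the first n lines, reading no further than needed."""
    return list(islice(lines, n))


# Made-up in-memory data standing in for an open file object.
demo = io.StringIO("".join("line {}\n".format(i) for i in range(100)))
for line in head(demo, n=3):
    print(line.rstrip())
```

In the loop from the question, this would correspond to writing for line in islice(a_file, 20): in place of for line in a_file:.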