How can I parse this text into a table in Python?


I have data in a file called text.txt, and my code is below. I want to extract the values on each line and build a table from them. I would also like to know whether there is a better way to do this. Thanks.

text.txt

Counting********************File:  bbduk_trimmed_Ago2_SsHV2L_1_CATGGC_L003_R1_001
Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT: 
0
Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC: 
73764
Counting********************File:  bbduk_trimmed_Ago2_SsHV2L_2_CATTTT_L003_R1_001
Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT: 
0
Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC: 
78640
Counting********************File:  bbduk_trimmed_Ago2_VF_1_CAACTA_L003_R1_001.fastq
Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT: 
0
Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC: 
26267

result I want:

  File Name                                 Seq_132582_1  Seq_483974_49238
0  bbduk_trimmed_Ago2_SsHV2L_1_CATGGC_L003_R1_001     0      73764
1  bbduk_trimmed_Ago2_SsHV2L_2_CATTTT_L003_R1_001     0      78640
2  bbduk_trimmed_Ago2_VF_1_CAACTA_L003_R1_001.fastq   0      26267

code I tried:

import sys

if sys.version_info[0] < 3:
    raise Exception("Python 3 or a more recent version is required.")
import re
import pandas as pd
text = open("text.txt",'r').read()
print(type(text))
results = re.findall(r'(bbduk_trimmed.*.fastq)\nSeq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT: \n(\d)\nSeq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC: \n(\d*)',text)
df=pd.DataFrame(results)
# df.columns=['FileName','Seq_132582_1','Seq_483974_49238'] #This doesn't work
print(df)

Solution

  • Replace your regex with the following line:

    re.findall(r'Counting[*]+File:[ ]*([\w.]+)[ \n]*[ :\w]+[\n]*(\w+)[\n]*[ :\w]+[\n]*(\w+)', text)
    

    Explanation:

    • [*]+ - match one or more * characters
    • [ ]* - match zero or more space characters
    • ([\w.]+) - match the filename and capture it as the first group
    • [ \n]* - match zero or more space or newline characters
    • [ :\w]+ - match the whole line that starts with Seq
    • (\w+) - capture the value on the following line

    The core logic for getting the sequence data in the regex is:

    ([\w.]+)[ \n]*[ \w]+:[ :\w]+[\n]*(\w+)

    • After matching the filename with ([\w.]+), we match any spaces and newlines with [ \n]*.
    • If you also want to capture the name of the sequence being parsed, keep [ \w]+:[ :\w]+ as a separate part and write it as ([ \w]+):[ :\w]+, so that the parentheses capture the sequence name (Seq_132582_1 or Seq_483974_49238). If the order of the sequences is not a concern, simply replace it with [ :\w]+[\n]* to match the whole line, then capture the value you need on the next line with (\w+).
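
To check the pattern against the data, here is a minimal sketch that runs the regex from this answer on two of the records from the question. The sample text is embedded as a string only so the snippet is self-contained, and the names `pattern` and `rows` are just illustrative.

```python
import re

# Two records copied from the question, embedded so the snippet runs on its own.
text = (
    "Counting********************File:  bbduk_trimmed_Ago2_SsHV2L_1_CATGGC_L003_R1_001\n"
    "Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT: \n"
    "0\n"
    "Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC: \n"
    "73764\n"
    "Counting********************File:  bbduk_trimmed_Ago2_VF_1_CAACTA_L003_R1_001.fastq\n"
    "Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT: \n"
    "0\n"
    "Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC: \n"
    "26267\n"
)

# Same pattern as above: group 1 is the file name, groups 2 and 3 are the counts.
pattern = re.compile(
    r'Counting[*]+File:[ ]*([\w.]+)[ \n]*[ :\w]+[\n]*(\w+)[\n]*[ :\w]+[\n]*(\w+)'
)

rows = pattern.findall(text)  # list of (file_name, count_1, count_2) tuples
for row in rows:
    print(row)
```

Once `rows` is non-empty, passing it to `pd.DataFrame(rows, columns=['File Name', 'Seq_132582_1', 'Seq_483974_49238'])` should give the labelled table from the question. Note that the counts come back as strings, so you may want to convert those columns with `astype(int)`.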

    Another, simpler way to extract the data, without using the re module, is shown below:

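    results = []
    # "text.txt" is the file name used in the question
    with open("text.txt") as f:
        while True:
            line = f.readline()
            if not line:  # end of file
                break
            if not line.strip():  # ignore blank lines between records
                continue
            # header line: the file name follows the last colon
            file_name = line.split(":")[-1].strip()
            f.readline()  # skip the first "Seq_..." label line
            data_seq1 = f.readline().strip()  # count for the first sequence
            f.readline()  # skip the second "Seq_..." label line
            data_seq2 = f.readline().strip()  # count for the second sequence
            results.append((file_name, data_seq1, data_seq2))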
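
The loop above assumes exactly two sequences per file, always in the same order. If that might not hold, a slightly more general sketch is to key each count by the sequence name on the line before it. The helper `parse_counts` and the names `sample` and `records` are hypothetical, not part of the original code, and the sketch assumes every header line contains "File:" and every count line is a plain integer.

```python
def parse_counts(lines):
    """Group 'sequence name -> count' pairs under each 'Counting...File:' header."""
    records = []
    current = None      # dict for the file currently being read
    pending_seq = None  # sequence name waiting for its count line
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith("Counting"):
            # Header line: everything after "File:" is the file name.
            current = {"File Name": line.split("File:", 1)[1].strip()}
            records.append(current)
            pending_seq = None
        elif line.startswith("Seq_"):
            # Label line: the sequence name is the text before the first colon.
            pending_seq = line.split(":", 1)[0]
        elif current is not None and pending_seq is not None:
            # Count line belonging to the most recent label line.
            current[pending_seq] = int(line)
            pending_seq = None
    return records


# One record from the question, as a list of lines.
sample = [
    "Counting********************File:  bbduk_trimmed_Ago2_SsHV2L_2_CATTTT_L003_R1_001\n",
    "Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT: \n",
    "0\n",
    "Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC: \n",
    "78640\n",
]
records = parse_counts(sample)
print(records)
```

An open file object can be passed in place of `sample`, since iterating over a file yields its lines. The resulting list of dicts can then be handed to `pd.DataFrame(records)`, which uses the dict keys as column names.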