I have a data file called text.txt and the code below. I want to extract the values from each line and build a table from them. I would also like to know whether there is a better way to do it. Thanks.
text.txt:
Counting********************File: bbduk_trimmed_Ago2_SsHV2L_1_CATGGC_L003_R1_001
Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT:
0
Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC:
73764
Counting********************File: bbduk_trimmed_Ago2_SsHV2L_2_CATTTT_L003_R1_001
Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT:
0
Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC:
78640
Counting********************File: bbduk_trimmed_Ago2_VF_1_CAACTA_L003_R1_001.fastq
Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT:
0
Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC:
26267
Result I want:
   File Name                                         Seq_132582_1  Seq_483974_49238
0  bbduk_trimmed_Ago2_SsHV2L_1_CATGGC_L003_R1_001    0             73764
1  bbduk_trimmed_Ago2_SsHV2L_2_CATTTT_L003_R1_001    0             78640
2  bbduk_trimmed_Ago2_VF_1_CAACTA_L003_R1_001.fastq  0             26267
Code I tried:
import sys
if sys.version_info[0] < 3:
    raise Exception("Python 3 or a more recent version is required.")
import re
import pandas as pd
text = open("text.txt",'r').read()
print(type(text))
results = re.findall(r'(bbduk_trimmed.*.fastq)\nSeq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT: \n(\d)\nSeq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC: \n(\d*)',text)
df=pd.DataFrame(results)
# df.columns=['FileName','Seq_132582_1','Seq_483974_49238'] #This doesn't work
print(df)
Replace your regex with the line below:
results = re.findall(r'Counting[*]+File:[ ]*([\w.]+)[ \n]*[ :\w]+[\n]*(\w+)[\n]*[ :\w]+[\n]*(\w+)', text)
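Explanation:
- Counting[*]+File: matches the literal text Counting, one or more * characters, and then File:
- [ ]* matches zero or more space characters
- ([\w.]+) matches the file name and captures it as the first parenthesized group
- [ \n]* matches zero or more space or newline characters
- [ :\w]+ matches the whole line that starts with Seq
- [\n]*(\w+) skips the newline and captures the value on the next line; these last two parts are repeated for the second sequence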
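As a quick check, the regex can be run against a shortened copy of the sample data. This is only a sketch: the SAMPLE string, the PATTERN name and the rows list are illustrative names, and the conversion of the counts to int is an assumption about what the table should hold.

```python
import re

# Shortened copy of the sample data from the question (two of the three files).
SAMPLE = """Counting********************File: bbduk_trimmed_Ago2_SsHV2L_1_CATGGC_L003_R1_001
Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT:
0
Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC:
73764
Counting********************File: bbduk_trimmed_Ago2_VF_1_CAACTA_L003_R1_001.fastq
Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT:
0
Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC:
26267
"""

# Same regex as above, compiled once for reuse.
PATTERN = re.compile(
    r'Counting[*]+File:[ ]*([\w.]+)[ \n]*[ :\w]+[\n]*(\w+)[\n]*[ :\w]+[\n]*(\w+)'
)

# findall returns one (file name, count 1, count 2) tuple of strings per block.
results = PATTERN.findall(SAMPLE)

# Convert the two counts to integers so they behave as numbers in a table.
rows = [(name, int(count1), int(count2)) for name, count1, count2 in results]

for row in rows:
    print(row)
```

If pandas is installed, passing the rows with explicit names, as in pd.DataFrame(rows, columns=["File Name", "Seq_132582_1", "Seq_483974_49238"]), should give the table from the question. Note that when a regex returns no matches the resulting DataFrame has no columns, so assigning three column names afterwards fails; that may be what happened with the commented-out line in the question.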
If you also want to capture the sequence name, the core of the regex is:
([\w.]+)[ \n]*[ \w]+:[ :\w]+[\n]*(\w+)
Here ([\w.]+) matches the file name and [ \n]* matches the space(s) and newline(s) after it. The Seq line is matched in two parts, [ \w]+: and [ :\w]+. Wrap the first part in parentheses, ([ \w]+):[ :\w]+, and that group captures the sequence name, which can be Seq_132582_1 or Seq_483974_49238. If the sequence names are not needed (for example, because the lines always appear in the same order), simply use [ :\w]+[\n]* to match the whole line and capture the value on the next line with (\w+).
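A minimal sketch of the name-capturing variant follows. It uses (\w+): for the sequence name, a slightly stricter choice than the character class discussed above, and the SAMPLE, PATTERN and rows names are illustrative assumptions rather than part of the original code.

```python
import re

# One block of the sample data plus a second file, shortened for the example.
SAMPLE = """Counting********************File: bbduk_trimmed_Ago2_SsHV2L_1_CATGGC_L003_R1_001
Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT:
0
Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC:
73764
Counting********************File: bbduk_trimmed_Ago2_SsHV2L_2_CATTTT_L003_R1_001
Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT:
0
Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC:
78640
"""

# Groups: file name, then (sequence name, count) twice.
PATTERN = re.compile(
    r'Counting[*]+File:[ ]*([\w.]+)[ \n]*'
    r'(\w+):[ :\w]+[\n]*(\w+)[\n]*'
    r'(\w+):[ :\w]+[\n]*(\w+)'
)

# Build one dict per file, keyed by the captured sequence names.
rows = []
for file_name, name1, count1, name2, count2 in PATTERN.findall(SAMPLE):
    rows.append({"File Name": file_name, name1: int(count1), name2: int(count2)})

for row in rows:
    print(row)
```

Because the column names come from the data itself, nothing has to be hard-coded apart from the number of sequences per block.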
An easier way to extract the data and prepare the result, without using the re module, is shown below:
results = []
with open("text.txt", 'r') as f:
    while True:
        line = f.readline()
        if not line:
            break
        # Header line: the file name follows the last colon
        file_name = line.split(":")[-1].strip()
        f.readline()  # skip the first Seq line
        data_seq1 = f.readline().strip()
        f.readline()  # skip the second Seq line
        data_seq2 = f.readline().strip()
        results.append((file_name, data_seq1, data_seq2))
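The loop above assumes exactly two sequences per file. As a hedged sketch of a more general version, the hypothetical helper parse_counts below groups lines by their prefix instead of by position, so it copes with any number of Seq entries per block. The function name and the assumption that headers start with Counting and sequence lines start with Seq_ are taken from the sample data, not from any specification of the file format.

```python
import io


def parse_counts(lines):
    """Return one dict per 'Counting' block: file name plus one count per sequence."""
    rows = []
    current = None        # dict for the block being read
    pending_name = None   # sequence name waiting for its count line
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith("Counting"):
            # Header line: the file name follows the last colon.
            current = {"File Name": line.split(":")[-1].strip()}
            rows.append(current)
        elif line.startswith("Seq_"):
            # Sequence line: the name is everything before the first colon.
            pending_name = line.split(":")[0]
        elif current is not None and pending_name is not None:
            # Count line belonging to the most recent sequence name.
            current[pending_name] = int(line)
            pending_name = None
    return rows


# Usage with an in-memory copy of one block; an open file object works the same way.
sample = io.StringIO(
    "Counting********************File: bbduk_trimmed_Ago2_VF_1_CAACTA_L003_R1_001.fastq\n"
    "Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT:\n"
    "0\n"
    "Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC:\n"
    "26267\n"
)
table_rows = parse_counts(sample)
print(table_rows)
```

A list of dicts like this can be passed directly to pd.DataFrame(table_rows) if pandas is installed, which yields one column per key.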