Hello, I would like to clean a text file that holds a transcript.
I have copied and pasted a small section:
*CHI: and when he went to sleep one night , somehow the frog escaped from
the jar while he was sleeping .
%mor: coord|and conj|when pro:sub|he v|go&PAST prep|to n|sleep
pro:indef|one n|night cm|cm adv|somehow det:art|the n|frog
v|escape-PAST prep|from det:art|the n|jar conj|while pro:sub|he
aux|be&PAST&13S part|sleep-PRESP .
%gra: 1|4|LINK 2|4|LINK 3|4|SUBJ 4|0|ROOT 5|4|JCT 6|5|POBJ 7|13|LINK
8|13|SUBJ 9|8|LP 10|13|JCT 11|12|DET 12|13|SUBJ 13|6|CMOD 14|13|JCT 15|16|DET
16|14|POBJ 17|20|LINK 18|20|SUBJ 19|20|AUX 20|13|CJCT 21|4|PUNCT
*INV: 0 [=! gasps] .
*CHI: when the boy woke up he noticed that the frog had disappeared .
%mor: conj|when det:art|the n|boy v|wake&PAST adv|up pro:sub|he
v|notice-PAST pro:rel|that det:art|the n|frog aux|have&PAST
dis#part|appear-PASTP .
Essentially, I would like to read only the utterances with the prefix *CHI:, but read all the lines that belong to each of them. This is my code so far:
def read_file(name):
    file = open(name,"r",encoding = "UTF-8")
    content = file.readlines()
    file.close()
    return content

def extract_file(text):
    clean = []
    for line in text:
        if line.startswith("*CHI:"):
            line = line.replace('\t','')
            clean.append(line)
    return clean
But this only reads the line with the prefix, not the utterance up to its end. It stops after the \n.
So when I run this I get
and when he went to sleep one night , somehow the frog escaped from\n
instead of
and when he went to sleep one night , somehow the frog escaped from the jar while he was sleeping .
You are trying to process a multi-line format line by line. You can of course do that, for example by setting a flag in your if statement and clearing it when the utterance ends:
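def extract_file(text):
    clean = []
    append = False  # True while inside a *CHI: utterance
    for line in text:
        if line.startswith("*CHI:"):
            append = True
        elif not line.startswith('\t'):
            # any other line that is not a tab-indented continuation ends it
            append = False
        if append:
            line = line.replace('\t','')
            clean.append(line)
    return clean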
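If each utterance should come out as one string, as in the expected output in the question, the same flag idea can be extended to join the continuation lines. This is only a minimal sketch: it assumes continuation lines start with a tab (as the code above does), and extract_utterances is a hypothetical name, not something from the original post.

```python
def extract_utterances(lines, prefix="*CHI:"):
    """Return one string per utterance whose main line starts with prefix.

    Assumption: continuation lines start with a tab character.
    """
    utterances = []
    collecting = False
    for line in lines:
        if line.startswith(prefix):
            # start a new utterance, dropping the speaker prefix
            utterances.append(line[len(prefix):].strip())
            collecting = True
        elif collecting and line.startswith("\t"):
            # continuation line: glue it onto the current utterance
            utterances[-1] += " " + line.strip()
        else:
            # any other tier (%mor, %gra, *INV, ...) ends the utterance
            collecting = False
    return utterances
```

It can be fed the list returned by read_file from the question, e.g. extract_utterances(read_file(name)).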
Another approach would be to read the whole file into a variable data
(or alternatively, you could use mmap), then just extract the data of interest with a regex:
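import re

def extract_file(name):
    with open(name,"r",encoding = "UTF-8") as file:
        data = file.read()
    # lazily match from "*CHI:" up to the next line that does not start with a tab;
    # re.search returns only the first such utterance
    r = re.search(r"^(\*CHI:.*?)^[^\t]", data, re.M | re.S)
    return r.group(1).replace('\t','').split('\n')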
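Note that re.search stops at the first match, so the function above yields a single utterance, and the pattern needs a following non-tab line to terminate. A sketch that collects every *CHI: utterance with re.findall is given here, again assuming continuation lines start with a tab. CHI_BLOCK and extract_all_chi are hypothetical names introduced for illustration.

```python
import re

# A "*CHI:" line followed by any number of tab-indented continuation lines.
# Assumption: continuation lines always start with a tab.
CHI_BLOCK = re.compile(r"^\*CHI:.*(?:\n\t.*)*", re.M)

def extract_all_chi(data):
    """Return every *CHI: utterance in data as a single-line string."""
    result = []
    for block in CHI_BLOCK.findall(data):
        # drop the prefix, then collapse tabs and newlines into single spaces
        text = block[len("*CHI:"):]
        result.append(" ".join(text.split()))
    return result
```

Here data is the whole file content as one string, as read with file.read() above.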