python, python-3.x, readfile

Reading File with prefix that spans multiple lines


Hello, I would like to clean a text file that holds a transcript.

I have copied and pasted a small section:

*CHI:   and when he went to sleep one night , somehow the frog escaped from
    the jar while he was sleeping .
%mor:   coord|and conj|when pro:sub|he v|go&PAST prep|to n|sleep
    pro:indef|one n|night cm|cm adv|somehow det:art|the n|frog
    v|escape-PAST prep|from det:art|the n|jar conj|while pro:sub|he
    aux|be&PAST&13S part|sleep-PRESP .
%gra:   1|4|LINK 2|4|LINK 3|4|SUBJ 4|0|ROOT 5|4|JCT 6|5|POBJ 7|13|LINK
    8|13|SUBJ 9|8|LP 10|13|JCT 11|12|DET 12|13|SUBJ 13|6|CMOD 14|13|JCT 15|16|DET
    16|14|POBJ 17|20|LINK 18|20|SUBJ 19|20|AUX 20|13|CJCT 21|4|PUNCT
*INV:   0 [=! gasps] .
*CHI:   when the boy woke up he noticed that the frog had disappeared .
%mor:   conj|when det:art|the n|boy v|wake&PAST adv|up pro:sub|he
    v|notice-PAST pro:rel|that det:art|the n|frog aux|have&PAST
    dis#part|appear-PASTP .

Essentially, I would like to read only the utterances with the prefix *CHI:, including every line each utterance spans. This is my code so far:

def read_file(name):
    file = open(name,"r",encoding = "UTF-8")

    content = file.readlines()

    file.close()

    return content


def extract_file(text):
    clean = []
    for line in text:
        if line.startswith("*CHI:"):
            line = line.replace('\t','')
            clean.append(line)
    return clean

But this only reads the line that carries the prefix, not the rest of the utterance; it stops after the \n.

So when I run this I get

and when he went to sleep one night , somehow the frog escaped from\n

instead of

and when he went to sleep one night , somehow the frog escaped from the jar while he was sleeping .


Solution

  • You are trying to process a multi-line format line by line. You can do that by setting a flag in your if statement when a *CHI: line starts, and clearing it when a line appears that is not a continuation (that is, one that does not start with a tab):

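    def extract_file(text):
      clean = []
      append = False  # True while we are inside a *CHI: utterance
      for line in text:
        if line.startswith("*CHI:"):
          append = True
        elif not line.startswith('\t'):
          # any other line that is not a continuation ends the utterance
          append = False
        if append:
          line = line.replace('\t','')
          clean.append(line)
      return clean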
    

    Another approach is to read the whole file into a variable, data (alternatively, you could use mmap), and then extract the data of interest with a regex:

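    import re

    def extract_file(name):
      with open(name,"r",encoding = "UTF-8") as file:
        data = file.read()
      # each match runs from "*CHI:" up to the next line that does not
      # start with a tab, or to the end of the file
      blocks = re.findall(r"^\*CHI:.*?(?=^[^\t]|\Z)", data, re.M | re.S)
      return [block.replace('\t','') for block in blocks]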
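
If what you want in the end is one string per utterance, as in the expected output in the question, the flag idea can be extended to join the continuation lines. The following is only a sketch under a few assumptions: the function name extract_utterances and its speaker parameter are made up for illustration, continuation lines are assumed to start with a tab or a space, and the pieces are joined with a single space.

```python
def extract_utterances(lines, speaker="*CHI:"):
    """Return one string per utterance of the given speaker (hypothetical helper)."""
    utterances = []
    current = None  # pieces of the utterance being collected, or None
    for line in lines:
        if line.startswith(speaker):
            # a new utterance starts; store the previous one first
            if current is not None:
                utterances.append(" ".join(current))
            current = [line[len(speaker):].strip()]
        elif line[:1] in (" ", "\t") and line.strip() and current is not None:
            # indented line (assumption: tab or space) continues the utterance
            current.append(line.strip())
        else:
            # any other line (%mor:, *INV:, blank, ...) ends the utterance
            if current is not None:
                utterances.append(" ".join(current))
            current = None
    if current is not None:
        utterances.append(" ".join(current))
    return utterances
```

You would call it on the list returned by read_file from the question, for example extract_utterances(read_file(name)). Continuation lines that belong to a %mor: or %gra: tier are skipped because the tier line itself has already ended the current utterance.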