python data-analysis

Filtering a text file by given conditions and symbols


I have multiple text files. When I read them in a loop, each one looks like this:

@UTF8
@PID:   11312/c-00036109-1
@Begin
@Languages: eng
@Participants:  CHI Target_Child, EXA Investigator
@ID:    eng|ENNI|CHI|4;11.16|male|SLI||Target_Child|||
@ID:    eng|ENNI|EXA|||||Investigator|||
@Comment:   Birth of CHI is 9-MAY-1995
@Date:  25-APR-2000
@Tape Location: Disk L10 Track 3
@Bg:    A1
*CHI:   I saw a giraffe and a elephant .
%mor:   pro:sub|I v|see&PAST det:art|a n|giraffe coord|and det:art|a
    n|elephant .
%gra:   1|2|SUBJ 2|0|ROOT 3|4|DET 4|2|OBJ 5|4|CONJ 6|7|DET 7|5|COORD 8|2|PUNCT
*CHI:   <that> [/] (.) that (i)s it . [+ bch]
%mor:   pro:dem|that cop|be&3S pro:per|it .
%gra:   1|2|SUBJ 2|0|ROOT 3|2|PRED 4|2|PUNCT
*CHI:   I saw an elephant go swimming .
%mor:   pro:sub|I v|see&PAST det:art|a n|elephant v|go part|swim-PRESP .
%gra:   1|2|SUBJ 2|0|ROOT 3|4|DET 4|5|SUBJ 5|2|COMP 6|5|OBJ 7|2|PUNCT
*CHI:   <I saw eleph> [//] I saw the <g> [/] giraffe and the elephant <s>
    [//] drop ball in the pool .
%mor:   pro:sub|I v|see&PAST det:art|the n|giraffe coord|and det:art|the
    n|elephant n|drop n|ball prep|in det:art|the n|pool .
  • Suppose I have files like SLI-1.txt, SLI2.txt ... SLI-10.txt. The first task is to read all the files into one file and then perform the actions below on it.

  • From this data, I have to extract only the statements that begin with *CHI:. (Note that some statements extend onto the next line, so those must be taken into account.) Below is the list of symbols that should be filtered out of each extracted *CHI: statement.

  • Remove words that have [ as a prefix or ] as a suffix, but retain these three symbols: [//], [/], and [*].

  • Retain words that have < as a prefix or > as a suffix, but remove the two symbols themselves.

  • Remove words that are prefixed with & or +.

  • Retain words that have ( as a prefix or ) as a suffix, but remove the two symbols themselves.

Regular expressions can be used.


Solution

  • To do the filtering, use regular expressions, as the hint in the question suggests. In Python they are provided by the re module. You will first need to learn what a regular expression is and how to work with one.

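    For example, you may extract only statements which are ... with the expression ^(?P<start>\*CHI:)(?P<target>.*)(?P<end>(?P<end_type_1> \.$)|(?P<end_type_2>$\n%mor)) (flags gmsU), where the group target contains what you want to extract from the files. You can try it online at https://regex101.com/r/tLdj7t/3/. Note that gmsU are regex101 (PCRE) flags: Python's re module has no ungreedy flag, so in Python you would write .*? and pass re.MULTILINE | re.DOTALL to re.finditer.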
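As a rough illustration of the whole task, here is a minimal sketch in Python. It is one possible reading of the rules in the question, not a definitive solution. The names read_files, extract_chi, clean, CHI_RE and KEEP are hypothetical, and so is the glob pattern SLI*.txt. It uses a simpler pattern than the one above: a *CHI: line plus any indented lines directly after it. It also assumes that <, >, ( and ) should be removed wherever they occur in a word, so that (i)s becomes is, and that the & and + check applies after those symbols are removed.

```python
import re
from pathlib import Path

# A "*CHI:" line plus any indented continuation lines directly after it.
CHI_RE = re.compile(r"^\*CHI:[ \t]*(.*(?:\n[ \t]+\S.*)*)", re.MULTILINE)

# Bracketed markers that the question says to retain.
KEEP = {"[//]", "[/]", "[*]"}


def read_files(folder, pattern="SLI*.txt"):
    """Concatenate all matching files (folder and pattern are assumptions)."""
    paths = sorted(Path(folder).glob(pattern))
    return "\n".join(p.read_text(encoding="utf-8") for p in paths)


def extract_chi(text):
    """Return each *CHI: utterance as one line, continuation lines joined."""
    return [" ".join(m.group(1).split()) for m in CHI_RE.finditer(text)]


def clean(utterance):
    """Apply the four filtering rules, one whitespace-separated word at a time."""
    kept = []
    for token in utterance.split():
        # Assumption: strip <, >, ( and ) anywhere in the word.
        token = re.sub(r"[<>()]", "", token)
        if token in KEEP:
            kept.append(token)
        elif token.startswith("[") or token.endswith("]"):
            continue  # drops e.g. "[+" and "bch]"
        elif token.startswith(("&", "+")):
            continue
        elif token:
            kept.append(token)
    return " ".join(kept)


sample = (
    "*CHI:\t<that> [/] (.) that (i)s it . [+ bch]\n"
    "%mor:\tpro:dem|that cop|be&3S pro:per|it .\n"
)
for utterance in extract_chi(sample):
    print(clean(utterance))  # that [/] . that is it .
```

To run it on real data, you would pass the result of read_files to extract_chi, for example extract_chi(read_files("data")), where "data" stands for whatever folder holds the transcripts.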