Tags: python, regex, dataframe, csv, python-re

How to write regex pattern.finditer results into a dataframe


I am trying to write a regular expression to select the text I want from a corpus, and then write the extracted text into a dataframe in CSV format.

Here is the code that I used:

import re

import pandas as pd

def main():

    pattern = re.compile(r'(case).(reason)(.+)(})')

    with open('/Users/cleantext.txt', 'r') as f:
        content = f.read()
        matches = pattern.finditer(content)
        for match in matches:
            print(tuple(match.groups()))


    # Create a DF for the expenses
    df = pd.DataFrame(data=[tuple(match.groups())])

    df.to_csv("judgement.csv", index=True)

if __name__ == '__main__':
    main()

However, the resulting CSV contains only one row of output:

,0,1,2,3
0,xxx,yyy,zzz,}

I was expecting multiple rows, since the corpus contains at least 100 judicial judgements.

The original corpus looks something like this:

{mID a9d50454f624         case xxx reason yyy judgement zzz}
{mID a9d5049e34e934bff9b  case xxx reason yyy judgement zzz}
{mID a67c9e34e934bff9b    case xxx reason yyy judgement zzz}

Thank you so much for your help.


Solution

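  • The question's code builds the DataFrame once, after the loop has finished, so only the last match ends up in the CSV. You probably need to collect the two substrings denoting case and reason from every match. You can use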

    pattern = re.compile(r'\bcase\s*(?P<Case>.*?)\s*reason\s*(?P<Reason>.*?)\s*judgement')
    matches = [x.groupdict() for x in pattern.finditer(content)]
    df = pd.DataFrame(matches)
    

    Note that the named capturing groups automatically provide the column names: x.groupdict() returns a dictionary mapping each group name to its value, so [x.groupdict() for x in pattern.finditer(content)] is a list of dictionaries that can be used to populate the dataframe.
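To make the shape of that list concrete, here is a minimal sketch that runs the pattern on two lines copied from the corpus excerpt in the question. The inline `content` string is only a stand-in for the real file contents.

```python
import re

# Stand-in for the file contents: two lines from the question's corpus excerpt.
content = (
    "{mID a9d50454f624         case xxx reason yyy judgement zzz}\n"
    "{mID a9d5049e34e934bff9b  case xxx reason yyy judgement zzz}\n"
)

pattern = re.compile(
    r'\bcase\s*(?P<Case>.*?)\s*reason\s*(?P<Reason>.*?)\s*judgement'
)

# One dictionary per match, keyed by the group names "Case" and "Reason".
rows = [m.groupdict() for m in pattern.finditer(content)]
print(rows)
```

Each dictionary becomes one dataframe row, and the keys become the column names.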

    You can also use

    matches = pattern.findall(content)
    df = pd.DataFrame(matches, columns=['Case', 'Reason'])
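With findall the group names are not carried over, which is why the column names have to be passed explicitly and in group order. A small sketch of what findall returns, using a made-up record (the values are illustrative, not from the corpus) so the two captured fields are easy to tell apart:

```python
import re

# Made-up record, chosen so the two captured values differ visibly.
content = "{mID 01 case A v B reason breach of contract judgement allowed}"

pattern = re.compile(
    r'\bcase\s*(?P<Case>.*?)\s*reason\s*(?P<Reason>.*?)\s*judgement'
)

# With two groups in the pattern, findall yields one 2-tuple per match,
# ordered by group position (Case first, then Reason).
matches = pattern.findall(content)
print(matches)
```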
    

    Regex details:

    • \bcase - a whole word case
    • \s* - zero or more whitespaces
    • (?P<Case>.*?) - Group "Case": zero or more chars other than line break chars, as few as possible
    • \s*reason\s* - reason word enclosed with optional whitespaces
    • (?P<Reason>.*?) - Group "Reason": zero or more chars other than line break chars, as few as possible
    • \s*judgement - zero or more whitespaces and then judgement string.
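Putting the pieces together, the following is one possible end-to-end sketch rather than a definitive implementation. It assumes pandas is installed. The helper name `extract_judgements` is hypothetical, and the third group, `Judgement`, is an addition that is not part of the answer above: it assumes every record ends with `}` and captures the text between `judgement` and that brace.

```python
import re

import pandas as pd

# Pattern from the answer, extended with a hypothetical "Judgement" group
# that assumes each record is closed by a "}" character.
PATTERN = re.compile(
    r'\bcase\s*(?P<Case>.*?)'
    r'\s*reason\s*(?P<Reason>.*?)'
    r'\s*judgement\s*(?P<Judgement>.*?)\s*\}'
)


def extract_judgements(content: str) -> pd.DataFrame:
    """Return one dataframe row per record matched in `content`."""
    # Collect every match before building the dataframe, instead of
    # building it from the last match after the loop has ended.
    rows = [m.groupdict() for m in PATTERN.finditer(content)]
    return pd.DataFrame(rows)


# Stand-in for f.read(): the three sample lines from the question.
content = (
    "{mID a9d50454f624         case xxx reason yyy judgement zzz}\n"
    "{mID a9d5049e34e934bff9b  case xxx reason yyy judgement zzz}\n"
    "{mID a67c9e34e934bff9b    case xxx reason yyy judgement zzz}\n"
)

df = extract_judgements(content)

# With no path argument, to_csv returns the CSV text as a string.
csv_text = df.to_csv(index=False)
print(csv_text)
```

To write a file instead, pass a path, for example `df.to_csv("judgement.csv", index=False)`. Because `.` does not match line breaks by default, this sketch also assumes each record sits on a single line, as in the excerpt.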