I am trying to write a regular expression to select the text I want from a corpus, and then write the extracted text into a dataframe in CSV format.
Here is the code that I used:
import re
import pandas as pd

def main():
    pattern = re.compile(r'(case).(reason)(.+)(})')
    with open('/Users/cleantext.txt', 'r') as f:
        content = f.read()
    matches = pattern.finditer(content)
    for match in matches:
        print(tuple(match.groups()))
    # Create a DF for the expenses
    df = pd.DataFrame(data=[tuple(match.groups())])
    df.to_csv("judgement.csv", index=True)

if __name__ == '__main__':
    main()
However, the resulting CSV contains only one row of output:
,0,1,2,3
0,xxx,yyy,zzz,}
where I was expecting multiple lines since the corpus contained at least 100 judicial judgements.
The original corpus looks something like this:
{mID a9d50454f624 case xxx reason yyy judgement zzz}
{mID a9d5049e34e934bff9b case xxx reason yyy judgement zzz}
{mID a67c9e34e934bff9b case xxx reason yyy judgement zzz}
Thank you so much for your help.
You probably need to get the two substrings denoting case and reason from each match.
You can use
pattern = re.compile(r'\bcase\s*(?P<Case>.*?)\s*reason\s*(?P<Reason>.*?)\s*judgement')
matches = [x.groupdict() for x in pattern.finditer(content)]
df = pd.DataFrame(matches)
Note that the named capturing groups are used to create the column names automatically: x.groupdict() returns a dictionary that maps each group name to its matched value.
The [x.groupdict() for x in pattern.finditer(content)] comprehension therefore produces a list of dictionaries that can be used to populate the dataframe, one row per match.
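As a minimal end-to-end sketch of this approach, the snippet below runs the pattern over the three sample lines from the question instead of reading /Users/cleantext.txt, and renders the CSV as a string rather than writing a file. The variable names rows and csv_text are only illustrative.

```python
import re
import pandas as pd

# Sample text copied from the question; in practice this would be f.read().
content = """{mID a9d50454f624 case xxx reason yyy judgement zzz}
{mID a9d5049e34e934bff9b case xxx reason yyy judgement zzz}
{mID a67c9e34e934bff9b case xxx reason yyy judgement zzz}"""

pattern = re.compile(
    r'\bcase\s*(?P<Case>.*?)\s*reason\s*(?P<Reason>.*?)\s*judgement'
)

# Collect ALL matches first: one dict per match, keyed by group name.
rows = [m.groupdict() for m in pattern.finditer(content)]

# Build the dataframe once, from the whole list, so every match becomes a row.
df = pd.DataFrame(rows)

# With no path argument, to_csv returns the CSV text;
# df.to_csv("judgement.csv", index=False) would write it to disk instead.
csv_text = df.to_csv(index=False)
print(csv_text)
```

The key difference from the code in the question is that the dataframe is built from the full list of matches rather than from a single match object, which is why only one row ended up in the CSV there.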
You can also use
matches = pattern.findall(content)
df = pd.DataFrame(matches, columns=['Case', 'Reason'])
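A short sketch of the findall variant follows. When a pattern contains more than one group, findall returns a list of tuples with one item per group, so the column names have to be passed explicitly. The two records below are invented for illustration (they are not from the real corpus) and use multi-word values to show that the lazy groups stop at the next keyword.

```python
import re
import pandas as pd

# Hypothetical records in the same layout as the corpus in the question.
content = (
    "{mID 001 case theft reason lack of evidence judgement dismissed}\n"
    "{mID 002 case fraud reason forged invoice judgement upheld}\n"
)

pattern = re.compile(
    r'\bcase\s*(?P<Case>.*?)\s*reason\s*(?P<Reason>.*?)\s*judgement'
)

# Each element is a (Case, Reason) tuple, in group order.
matches = pattern.findall(content)

# Tuples carry no names, so the columns are supplied here.
df = pd.DataFrame(matches, columns=['Case', 'Reason'])
print(df)
```

Compared with groupdict(), this is slightly shorter, but the column list must be kept in sync with the order of the groups in the pattern by hand.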
See the regex demo. Details:
\bcase - the string case at a word boundary (there is no trailing \b, so it would also match the start of a longer word such as cases)
\s* - zero or more whitespaces
(?P<Case>.*?) - Group "Case": zero or more chars other than line break chars, as few as possible
\s*reason\s* - the reason word enclosed with optional whitespaces
(?P<Reason>.*?) - Group "Reason": zero or more chars other than line break chars, as few as possible
\s*judgement - zero or more whitespaces and then the judgement string.
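The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace inside the pattern and allows # comments. This is only a sketch of an alternative layout; the names verbose_pattern and compact_pattern are illustrative, and the test line is the first sample record from the question.

```python
import re

# Same pattern as above, split into commented parts with re.VERBOSE.
verbose_pattern = re.compile(r"""
    \bcase            # the string "case" at a word boundary
    \s*               # zero or more whitespace chars
    (?P<Case>.*?)     # group "Case": as few chars as possible
    \s*reason\s*      # "reason" with optional surrounding whitespace
    (?P<Reason>.*?)   # group "Reason": as few chars as possible
    \s*judgement      # optional whitespace, then "judgement"
""", re.VERBOSE)

compact_pattern = re.compile(
    r'\bcase\s*(?P<Case>.*?)\s*reason\s*(?P<Reason>.*?)\s*judgement'
)

line = "{mID a9d50454f624 case xxx reason yyy judgement zzz}"
m = verbose_pattern.search(line)
print(m.groupdict())
```

Because . does not match line break chars by default, a record whose case or reason text spans several lines would not match either version; passing re.DOTALL as an extra flag is one way to allow that, assuming the keywords case, reason and judgement do not occur inside the values themselves.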