pandas: text analysis: Transfer raw data to dataframe


I need to read lines from a text file and extract the person's name and the quoted text from each line.

The lines look like this:

"Am I ever!", Homer Simpson responded.

Remarks:

Hint: Use the object returned by the 'open' function as the file handle. Each line you read is expected to end with a newline character. Remove the newline as follows: line_cln = line.strip()

Each line is one of three options:

  • The first set of patterns, in which the person name appears before the quoted text.
  • The second set of patterns, in which the quoted text appears before the person name.
  • Empty lines.

Complete the transfer_raw_text_to_dataframe function to return a dataframe with the extracted person name and text as explained above. The information is expected to be extracted from the lines of the given 'filename' file.

The returned dataframe should include two columns:

  • person_name - containing the extracted person name for each line.
  • extracted_text - containing the extracted quoted text for each line.

The returned values:

  • dataframe - The dataframe with the extracted information as described above.
  • Important Note: if a line does not contain any quotation pattern, no information should be saved in the corresponding row in the dataframe.

What I have so far (edited):

def transfer_raw_text_to_dataframe(filename):

    data = open(filename)
    
    quote_pattern ='"(.*)"'
    name_pattern = "\w+\s\w+"
    
    df = open(filename, encoding='utf8')
    lines = df.readlines()
    df.close()
    dataframe = pd.DataFrame(columns=('person_name', 'extracted_text'))
    i = 0  

    for line in lines:
        quote = re.search(quote_pattern,line)
        extracted_quotation = quote.group(1)

        name = re.search(name_pattern,line)
        extracted_person_name = name.group(0)
        
        df2 = {'person_name': extracted_person_name, 'extracted_text': extracted_quotation}
        dataframe = dataframe.append(df2, ignore_index = True)

        dataframe.loc[i] = [person_name, extracted_text]
        i =i+1
            
    return dataframe

The dataframe is created with the correct shape. The problem is that the person name in every row is 'Oh man' and the quote is 'Oh man, that guy's tough to love.', which is strange because that text is not even in the txt file.

Can anyone help me fix this?

Edit: I need to extract from a simple txt file that contains these lines only:

"Am I ever!", Homer Simpson responded.
"Hmmm. So... is it okay if I go to the women's conference with Chloe?", Lisa Simpson answered.
"Really? Uh, sure.", Bart Simpson answered.
"Sounds great.", Bart Simpson replied.
Homer Simpson responded: "Danica Patrick in my thoughts!"
C. Montgomery Burns: "Trust me, he'll say it, or I'll bust him down to Thursday night vespers."
"Gimme that torch." Lisa Simpson said.
"No! No, I've got a lot more mothering left in me!", Marge Simpson said.
"Oh, Homie, I don't care if you're a billionaire. I love you just because you're..." Marge Simpson said.
"Damn you, e-Bay!" Homer Simpson answered.

Solution

  • One possible approach:

    import pandas as pd
    import re
    
    # read the whole file into one string
    with open("12.txt", "r") as f:
        data = f.read()
    
    # find all text in quotes ('.' does not match a newline, so each match stays on one line)
    quotes = re.findall(r'"(.+)"', data)
    print("RESULT: \n", quotes)
    df = pd.DataFrame({'rep': quotes})
    print(df)
    
    # remove the quoted text, leaving only the rest of each line
    rest = re.sub(r'"(.+)"', '', data)
    
    # get first name and last name from the rest of each line
    regex = re.compile(r"([A-Z][a-z]+ [A-Z][a-z]+)")
    names = regex.findall(rest)
    df1 = pd.DataFrame({'author': names})
    print(df1)
    
    # join the two dataframes side by side
    fin = pd.concat([df, df1], axis=1)
    print(fin)
    

    All the print calls are just for checking (remove them for cleaner code). Only "C. Montgomery Burns" loses his first initial...
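
As a complement to the approach above, here is a minimal sketch that parses the file line by line, so each name stays paired with its own quote. It assumes that empty lines and lines without a quotation pattern should simply be skipped. The two regular expressions and the helper parse_line are assumptions built only from the ten sample lines in the question: they treat the single word after the name (or the lowercase word before the colon) as the speech verb, and they are not a general-purpose parser. Note also that the loop in the question assigns [person_name, extracted_text] to dataframe.loc[i], and neither name is defined inside the function, so they are probably leftover global variables from an earlier session, which would explain the identical 'Oh man' rows.

```python
import re

import pandas as pd

# Assumed pattern 1: "quoted text"[,] Person Name verb.
QUOTE_FIRST = re.compile(r'^"(?P<text>.+)",?\s+(?P<name>.+?)\s+\w+\.?$')
# Assumed pattern 2: Person Name [verb]: "quoted text"
NAME_FIRST = re.compile(r'^(?P<name>.+?)(?:\s+[a-z]\w*)?:\s*"(?P<text>.+)"$')


def parse_line(line):
    """Return (person_name, extracted_text), or None if no pattern matches."""
    line = line.strip()  # drops the trailing newline and surrounding spaces
    for pattern in (QUOTE_FIRST, NAME_FIRST):
        match = pattern.match(line)
        if match:
            return match.group("name"), match.group("text")
    return None  # empty line or no quotation pattern


def transfer_raw_text_to_dataframe(filename):
    rows = []
    with open(filename, encoding="utf8") as handle:
        for line in handle:
            parsed = parse_line(line)
            if parsed is not None:
                rows.append(parsed)
    # Build the dataframe once, instead of growing it row by row.
    return pd.DataFrame(rows, columns=["person_name", "extracted_text"])
```

Collecting the rows in a list and building the dataframe at the end also avoids DataFrame.append, which was removed in pandas 2.0. With the sample file from the question, this sketch should keep "C. Montgomery Burns" intact, because the name is taken as everything before the colon rather than matched as two capitalized words.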