
Python - Extract Information from a Docx File into a Pandas DataFrame


I have a Word document containing the transcript of an interview and want to store every question and answer in a pandas DataFrame. The Word document looks like this:

[Image: screenshot of the Word document]

In the end I want a pandas DataFrame like this:

participant | Question_Number | question | answer | timestamps_difference
VP01        |       1         | SOME Q.. | SOME A.| 00:00:02

I first used the "textract" package to read in the docx file. After reading, the whole document is stored in a single object, but its type is bytes rather than str, so it has to be decoded:

import textract
text = textract.process("Transkript VP01_test.docx")
text
text = text.decode("utf-8")  # convert bytes to str


b'Lehrstuhl f\xc3\xbcr Kinder- und Jugendpsychiatrie\n\n\t\tund -psychotherapie\n\n\t\n\n\n\nInterview-Transkription\n\nVP: 01\t\t\t\t\t\t Interview-Dauer: 00:05:55\t\n\n\n\nTranskriptionsregeln: Es wird w\xc3\xb6rtlich transkribiert, nicht lautsprachlich. Vorhandene Dialekte werden m\xc3\xb6glichst wortgenau ins Hochdeutsche \xc3\xbcbersetzt. Satzabbr\xc3\xbcche, Stottern und Wortdoppelungen werden ausgelassen. Die Interpunktion wird zugunsten der Lesbarkeit nachtr\xc3\xa4glich gesetzt.\n\nGespr\xc3\xa4chsteilnehmer: Interviewer (RB); Participant (VP01)\n\nInterview-Transkription:\n\n#00:00:00#: STARTING\n\nRB: SOME QUESTION HERE #00:00:14#\n\nVP01: SOME ANSWER HERE. #00:00:16#\n\nRB: SOME TEXT HERE #00:00:17#\n\nVP01: SOME ANSWER HERE. #00:00:40# \n\nRB: SOME QUESTION HERE #00:00:41#\n\nVP01: SOME ANSWER HERE
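To illustrate the bytes-versus-string point, here is a minimal sketch that decodes a shortened copy of the output above and splits it into non-empty lines. The variable names `raw` and `lines` are only illustrative, and the sample is abbreviated, so this is not the full document.

```python
# Shortened copy of the raw bytes shown above (textract.process returns bytes).
raw = (
    b"Lehrstuhl f\xc3\xbcr Kinder- und Jugendpsychiatrie\n\n"
    b"VP: 01\t\t\t\t\t\t Interview-Dauer: 00:05:55\t\n\n"
    b"#00:00:00#: STARTING\n\n"
    b"RB: SOME QUESTION HERE #00:00:14#\n\n"
    b"VP01: SOME ANSWER HERE. #00:00:16#\n"
)

# bytes -> str; the UTF-8 sequence \xc3\xbc becomes the character "ü"
text = raw.decode("utf-8")

# Keep only non-empty lines and strip surrounding tabs and spaces
lines = [line.strip() for line in text.splitlines() if line.strip()]

for line in lines:
    print(line)
```

Once the content is a `str`, string regular expressions can be applied to it directly. Applying a `str` pattern to the undecoded bytes raises a `TypeError`.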


Solution

  • I managed to solve the problem. Here is the code:

    import pandas as pd
    import re
    import textract
    from datetime import datetime
    
    # textract.process returns bytes; decode so the str regex patterns below can be applied
    input_string = textract.process("Transkript VP01_test.docx").decode("utf-8")
    
    # Extract questions and timestamps using regular expression
    pattern = r'(?P<Participant>RB|VP01):\s(?P<Question>.*?)\s#(?P<Timestamp>\d{2}:\d{2}:\d{2})#'
    matches = re.findall(pattern, input_string)
    
    # Create a dictionary to store the extracted data
    data_dict = {'Participant': [], 'Question': [], 'Timestamp': [], 'Answer': [], 'Time_Difference_in_Sec': []}
    
    # Iterate through the matches and store the data in the dictionary
    for i in range(len(matches)):
        participant = matches[i][0]
        question = matches[i][1]
        question_timestamp = datetime.strptime(matches[i][2], '%H:%M:%S')
        
        if i < len(matches) - 1:
            answer = matches[i+1][1]
            answer_timestamp = datetime.strptime(matches[i+1][2], '%H:%M:%S')
            time_difference = (answer_timestamp - question_timestamp).seconds
        else:
            answer = ''
            time_difference = ''
        
        data_dict['Participant'].append(participant)
        data_dict['Question'].append(question)
        data_dict['Timestamp'].append(matches[i][2])
        data_dict['Answer'].append(answer)
        data_dict['Time_Difference_in_Sec'].append(time_difference)
    
    # Extract interview duration using regular expression
    duration_pattern = r'Interview-Dauer:\s(\d{2}:\d{2}:\d{2})'
    duration_match = re.search(duration_pattern, input_string)
    if duration_match:
        interview_duration = duration_match.group(1)
    else:
        interview_duration = ''
    
    # Create a pandas DataFrame from the dictionary
    df = pd.DataFrame(data_dict)
    
    # Add interview duration as a column in the DataFrame
    df['Interview Duration'] = interview_duration
    
    # Display the DataFrame
    df
    
      Participant            Question Timestamp              Answer  \
    0          RB  SOME QUESTION HERE  00:00:14   SOME ANSWER HERE.   
    1        VP01   SOME ANSWER HERE.  00:00:16      SOME TEXT HERE   
    2          RB      SOME TEXT HERE  00:00:17   SOME ANSWER HERE.   
    3        VP01   SOME ANSWER HERE.  00:00:40  SOME QUESTION HERE   
    4          RB  SOME QUESTION HERE  00:00:41    SOME ANSWER HERE   
    
      Time_Difference_in_Sec Interview Duration  
    0                      2           00:05:55  
    1                      1           00:05:55  
    2                     23           00:05:55  
    3                      1           00:05:55  
    4                      8           00:05:55
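Note that the output above pairs every turn with the turn that follows it, so the participant's answers also show up in the Question column. If one row per question/answer pair is wanted, as in the table sketched in the question, one possible variation is to keep only interviewer turns that are directly followed by a participant turn. The following is a rough sketch under that assumption. The sample `text` is a hypothetical stand-in for the decoded transcript, and `to_time` and `format_hms` are illustrative helpers, not part of the original solution.

```python
import re
from datetime import datetime

# Hypothetical sample in the same shape as the decoded transcript.
text = (
    "#00:00:00#: STARTING\n\n"
    "RB: SOME QUESTION HERE #00:00:14#\n\n"
    "VP01: SOME ANSWER HERE. #00:00:16#\n\n"
    "RB: SOME TEXT HERE #00:00:17#\n\n"
    "VP01: SOME ANSWER HERE. #00:00:40#\n"
)

# Same idea as the pattern in the solution: speaker, utterance, timestamp
turn_pattern = re.compile(r"(RB|VP01):\s(.*?)\s#(\d{2}:\d{2}:\d{2})#")
turns = turn_pattern.findall(text)


def to_time(stamp):
    """Parse an HH:MM:SS timestamp."""
    return datetime.strptime(stamp, "%H:%M:%S")


def format_hms(total_seconds):
    """Format a number of seconds as zero-padded HH:MM:SS."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


rows = []
# Look at each pair of consecutive turns; keep interviewer -> participant pairs
for (speaker, q_text, q_stamp), (next_speaker, a_text, a_stamp) in zip(turns, turns[1:]):
    if speaker == "RB" and next_speaker == "VP01":
        delta = to_time(a_stamp) - to_time(q_stamp)
        rows.append({
            "participant": next_speaker,
            "Question_Number": len(rows) + 1,
            "question": q_text,
            "answer": a_text,
            "timestamps_difference": format_hms(int(delta.total_seconds())),
        })

for row in rows:
    print(row)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(rows)` to get one row per question. Turns that do not fit the interviewer-then-participant pattern are silently skipped here, which may or may not be the desired behaviour for a real transcript.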