python pandas dataframe docx text-extraction

Python - Extract Information from a Docx File into a Pandas DataFrame

I have a Word document containing an interview transcript and want to store every question and answer in a pandas DataFrame. The document's content can be seen in the raw extraction output further down.

In the end I want a pandas DataFrame like this:

participant | Question_Number | question | answer | timestamps_difference
VP01        |       1         | SOME Q.. | SOME A.| 00:00:02

I first used the textract package to read the docx file. After reading the document, all of its content is stored in a single object (of type bytes, so it still has to be decoded to a string):

import textract
text = textract.process("Transkript VP01_test.docx")
text
text = text.decode("utf-8")  # convert bytes to str

b'Lehrstuhl f\xc3\xbcr Kinder- und Jugendpsychiatrie\n\n\t\tund -psychotherapie\n\n\t\n\n\n\nInterview-Transkription\n\nVP: 01\t\t\t\t\t\t Interview-Dauer: 00:05:55\t\n\n\n\nTranskriptionsregeln: Es wird w\xc3\xb6rtlich transkribiert, nicht lautsprachlich. Vorhandene Dialekte werden m\xc3\xb6glichst wortgenau ins Hochdeutsche \xc3\xbcbersetzt. Satzabbr\xc3\xbcche, Stottern und Wortdoppelungen werden ausgelassen. Die Interpunktion wird zugunsten der Lesbarkeit nachtr\xc3\xa4glich gesetzt.\n\nGespr\xc3\xa4chsteilnehmer: Interviewer (RB); Participant (VP01)\n\nInterview-Transkription:\n\n#00:00:00#: STARTING\n\nRB: SOME QUESTION HERE #00:00:14#\n\nVP01: SOME ANSWER HERE. #00:00:16#\n\nRB: SOME TEXT HERE #00:00:17#\n\nVP01: SOME ANSWER HERE. #00:00:40# \n\nRB: SOME QUESTION HERE #00:00:41#\n\nVP01: SOME ANSWER HERE
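
The decode step matters for any later regex work: Python's re module will not apply a str pattern to a bytes object. A minimal sketch of that behaviour, using an abbreviated sample shaped like the output above (the names `raw`, `pattern`, and `mixed_types_ok` are only illustrative):

```python
import re

# Abbreviated bytes sample, shaped like the textract output above (assumed, not the full file).
raw = b"VP: 01\t\t\t\t\t\t Interview-Dauer: 00:05:55\t\n\nRB: SOME QUESTION HERE #00:00:14#\n\n"

# A str pattern that matches timestamps of the form #HH:MM:SS#.
pattern = r"#(\d{2}:\d{2}:\d{2})#"

try:
    re.findall(pattern, raw)  # str pattern applied to bytes
    mixed_types_ok = True
except TypeError:
    # re refuses to mix str patterns with bytes input.
    mixed_types_ok = False

# After decoding, the same pattern works as expected.
text = raw.decode("utf-8")
timestamps = re.findall(pattern, text)
```

Decoding once right after reading the file keeps every later pattern a plain str pattern.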

Solution

I managed to solve the problem. Here is the code:

import pandas as pd
import re
import textract
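from datetime import datetime

# textract.process returns bytes; decode to str so the str regex patterns below can be applied
input_string = textract.process("Transkript VP01_test.docx").decode("utf-8")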

# Extract questions and timestamps using regular expression
pattern = r'(?P<Participant>RB|VP01):\s(?P<Question>.*?)\s#(?P<Timestamp>\d{2}:\d{2}:\d{2})#'
matches = re.findall(pattern, input_string)

# Create a dictionary to store the extracted data
data_dict = {'Participant': [], 'Question': [], 'Timestamp': [], 'Answer': [], 'Time_Difference_in_Sec': []}

# Iterate through the matches and store the data in the dictionary
for i in range(len(matches)):
    participant = matches[i][0]
    question = matches[i][1]
    question_timestamp = datetime.strptime(matches[i][2], '%H:%M:%S')
    
    if i < len(matches) - 1:
        answer = matches[i+1][1]
        answer_timestamp = datetime.strptime(matches[i+1][2], '%H:%M:%S')
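        # total_seconds() covers the whole difference; .seconds ignores the days part of the timedelta
        time_difference = int((answer_timestamp - question_timestamp).total_seconds())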
    else:
        answer = ''
        time_difference = ''
    
    data_dict['Participant'].append(participant)
    data_dict['Question'].append(question)
    data_dict['Timestamp'].append(matches[i][2])
    data_dict['Answer'].append(answer)
    data_dict['Time_Difference_in_Sec'].append(time_difference)

# Extract interview duration using regular expression
duration_pattern = r'Interview-Dauer:\s(\d{2}:\d{2}:\d{2})'
duration_match = re.search(duration_pattern, input_string)
if duration_match:
    interview_duration = duration_match.group(1)
else:
    interview_duration = ''

# Create a pandas DataFrame from the dictionary
df = pd.DataFrame(data_dict)

# Add interview duration as a column in the DataFrame
df['Interview Duration'] = interview_duration

# Display the DataFrame
df

  Participant            Question Timestamp              Answer  \
0          RB  SOME QUESTION HERE  00:00:14   SOME ANSWER HERE.   
1        VP01   SOME ANSWER HERE.  00:00:16      SOME TEXT HERE   
2          RB      SOME TEXT HERE  00:00:17   SOME ANSWER HERE.   
3        VP01   SOME ANSWER HERE.  00:00:40  SOME QUESTION HERE   
4          RB  SOME QUESTION HERE  00:00:41    SOME ANSWER HERE   

  Time_Difference_in_Sec Interview Duration  
0                      2           00:05:55  
1                      1           00:05:55  
2                     23           00:05:55  
3                      1           00:05:55  
4                      8           00:05:55
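
Note that the table above pairs every turn with the turn that follows it, so answers also show up in the Question column. If only interviewer-to-participant pairs are wanted, as in the target layout from the question, something along these lines might work. This is a minimal sketch, not a tested solution for the full transcript: `pair_turns`, `TURN`, and `sample` are hypothetical names, the sample is an abbreviated stand-in for the decoded text, and it assumes every turn ends with a `#HH:MM:SS#` timestamp and that the participant label has the form `VP` plus digits.

```python
import re
from datetime import datetime

# Abbreviated stand-in for the decoded transcript text (assumed shape).
sample = (
    "#00:00:00#: STARTING\n\n"
    "RB: SOME QUESTION HERE #00:00:14#\n\n"
    "VP01: SOME ANSWER HERE. #00:00:16#\n\n"
    "RB: SOME TEXT HERE #00:00:17#\n\n"
    "VP01: SOME ANSWER HERE. #00:00:40#\n\n"
)

# One turn: speaker label, text, closing timestamp.
TURN = re.compile(r"(?P<speaker>RB|VP\d+):\s(?P<text>.*?)\s#(?P<ts>\d{2}:\d{2}:\d{2})#")


def pair_turns(text, interviewer="RB"):
    """Pair each interviewer turn with the participant turn directly after it."""
    turns = [m.groupdict() for m in TURN.finditer(text)]
    rows = []
    for question, answer in zip(turns, turns[1:]):
        # Keep interviewer -> participant pairs only; skip answer -> question pairs.
        if question["speaker"] != interviewer or answer["speaker"] == interviewer:
            continue
        asked = datetime.strptime(question["ts"], "%H:%M:%S")
        answered = datetime.strptime(answer["ts"], "%H:%M:%S")
        seconds = int((answered - asked).total_seconds())
        hours, rest = divmod(seconds, 3600)
        minutes, secs = divmod(rest, 60)
        rows.append({
            "participant": answer["speaker"],
            "Question_Number": len(rows) + 1,
            "question": question["text"],
            "answer": answer["text"],
            "timestamps_difference": f"{hours:02d}:{minutes:02d}:{secs:02d}",
        })
    return rows


rows = pair_turns(sample)
```

Passing `rows` to `pd.DataFrame(rows)` should then give a frame with the columns participant, Question_Number, question, answer, and timestamps_difference.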