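I have a Word document containing the transcript of an interview, and I want to store every question and answer in a pandas DataFrame. After a short header, the document consists of alternating turns by the interviewer (RB) and the participant (VP01), each ending with a timestamp such as #00:00:14#.
In the end I want a pandas DataFrame like this: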
participant | Question_Number | question | answer | timestamps_difference
VP01 | 1 | SOME Q.. | SOME A.| 00:00:02
I first used the textract package to read the .docx file. After reading the document in, all of its content is stored in a single object, but its type is bytes rather than str:
import textract
text = textract.process("Transkript VP01_test.docx")
text  # the raw result is a bytes object
b'Lehrstuhl f\xc3\xbcr Kinder- und Jugendpsychiatrie\n\n\t\tund -psychotherapie\n\n\t\n\n\n\nInterview-Transkription\n\nVP: 01\t\t\t\t\t\t Interview-Dauer: 00:05:55\t\n\n\n\nTranskriptionsregeln: Es wird w\xc3\xb6rtlich transkribiert, nicht lautsprachlich. Vorhandene Dialekte werden m\xc3\xb6glichst wortgenau ins Hochdeutsche \xc3\xbcbersetzt. Satzabbr\xc3\xbcche, Stottern und Wortdoppelungen werden ausgelassen. Die Interpunktion wird zugunsten der Lesbarkeit nachtr\xc3\xa4glich gesetzt.\n\nGespr\xc3\xa4chsteilnehmer: Interviewer (RB); Participant (VP01)\n\nInterview-Transkription:\n\n#00:00:00#: STARTING\n\nRB: SOME QUESTION HERE #00:00:14#\n\nVP01: SOME ANSWER HERE. #00:00:16#\n\nRB: SOME TEXT HERE #00:00:17#\n\nVP01: SOME ANSWER HERE. #00:00:40# \n\nRB: SOME QUESTION HERE #00:00:41#\n\nVP01: SOME ANSWER HERE
text = text.decode("utf-8")  # convert bytes to str
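As a quick check of the decoding step, the sketch below uses a short hand-made bytes literal (an assumption standing in for the real textract output, not the actual file) to show that the decoded value is a str with the umlauts restored. It also shows why decoding matters for any later regex work: a str pattern cannot be applied to a bytes object.

```python
import re

# Hypothetical stand-in for the bytes returned by textract.process(...).
raw = (
    b"Lehrstuhl f\xc3\xbcr Kinder- und Jugendpsychiatrie\n\n"
    b"RB: SOME QUESTION HERE #00:00:14#\n\n"
    b"VP01: SOME ANSWER HERE. #00:00:16#\n"
)

decoded = raw.decode("utf-8")  # bytes -> str

timestamp_pattern = r"#(\d{2}:\d{2}:\d{2})#"

# Applying a str pattern to bytes raises TypeError.
try:
    re.findall(timestamp_pattern, raw)
    bytes_search_failed = False
except TypeError:
    bytes_search_failed = True

# The same pattern works on the decoded str.
timestamps = re.findall(timestamp_pattern, decoded)
print(type(decoded).__name__, bytes_search_failed, timestamps)
```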
I managed to solve the problem. Here is the code:
import pandas as pd
import re
import textract
from datetime import datetime
# textract returns bytes; decode to str so the str regex patterns below can be applied
input_string = textract.process("Transkript VP01_test.docx").decode("utf-8")
# Extract questions and timestamps using regular expression
pattern = r'(?P<Participant>RB|VP01):\s(?P<Question>.*?)\s#(?P<Timestamp>\d{2}:\d{2}:\d{2})#'
matches = re.findall(pattern, input_string)
# Create a dictionary to store the extracted data
data_dict = {'Participant': [], 'Question': [], 'Timestamp': [], 'Answer': [], 'Time_Difference_in_Sec': []}
# Iterate through the matches and store the data in the dictionary
for i in range(len(matches)):
    participant = matches[i][0]
    question = matches[i][1]
    question_timestamp = datetime.strptime(matches[i][2], '%H:%M:%S')
    if i < len(matches) - 1:
        # The next turn is treated as the answer to the current one
        answer = matches[i+1][1]
        answer_timestamp = datetime.strptime(matches[i+1][2], '%H:%M:%S')
        time_difference = (answer_timestamp - question_timestamp).seconds
    else:
        # The last turn has no following turn
        answer = ''
        time_difference = ''
    data_dict['Participant'].append(participant)
    data_dict['Question'].append(question)
    data_dict['Timestamp'].append(matches[i][2])
    data_dict['Answer'].append(answer)
    data_dict['Time_Difference_in_Sec'].append(time_difference)
# Extract interview duration using regular expression
duration_pattern = r'Interview-Dauer:\s(\d{2}:\d{2}:\d{2})'
duration_match = re.search(duration_pattern, input_string)
if duration_match:
    interview_duration = duration_match.group(1)
else:
    interview_duration = ''
# Create a pandas DataFrame from the dictionary
df = pd.DataFrame(data_dict)
# Add interview duration as a column in the DataFrame
df['Interview Duration'] = interview_duration
# Display the DataFrame
df
   Participant            Question  Timestamp              Answer  \
0           RB  SOME QUESTION HERE   00:00:14   SOME ANSWER HERE.
1         VP01   SOME ANSWER HERE.   00:00:16      SOME TEXT HERE
2           RB      SOME TEXT HERE   00:00:17   SOME ANSWER HERE.
3         VP01   SOME ANSWER HERE.   00:00:40  SOME QUESTION HERE
4           RB  SOME QUESTION HERE   00:00:41    SOME ANSWER HERE

   Time_Difference_in_Sec  Interview Duration
0                       2            00:05:55
1                       1            00:05:55
2                      23            00:05:55
3                       1            00:05:55
4                       8            00:05:55
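Note that the table above pairs every turn with the turn that follows it, so the participant's answers also show up in the Question column. If only interviewer-to-participant pairs are wanted, as in the layout from the question, one possible variation is sketched below. It rests on several assumptions: the sample string is hand-made from the excerpt in the question, the helper name pair_turns is hypothetical, every RB turn is treated as a question, and the speaker labels RB and VP01 are hard-coded.

```python
import re
from datetime import datetime

# Hand-made sample based on the excerpt in the question (not the real file).
sample = (
    "VP: 01\t\t Interview-Dauer: 00:05:55\t\n\n"
    "#00:00:00#: STARTING\n\n"
    "RB: SOME QUESTION HERE #00:00:14#\n\n"
    "VP01: SOME ANSWER HERE. #00:00:16#\n\n"
    "RB: SOME TEXT HERE #00:00:17#\n\n"
    "VP01: SOME ANSWER HERE. #00:00:40# \n\n"
)

# Same idea as the pattern in the answer, with neutral group names.
TURN = re.compile(
    r"(?P<speaker>RB|VP01):\s(?P<text>.*?)\s#(?P<ts>\d{2}:\d{2}:\d{2})#"
)


def pair_turns(text, interviewer="RB", participant="VP01"):
    """Pair each interviewer turn with the participant turn directly after it."""
    turns = [m.groupdict() for m in TURN.finditer(text)]
    rows = []
    for current, following in zip(turns, turns[1:]):
        # Skip answer -> question transitions; keep question -> answer only.
        if current["speaker"] != interviewer or following["speaker"] != participant:
            continue
        start = datetime.strptime(current["ts"], "%H:%M:%S")
        end = datetime.strptime(following["ts"], "%H:%M:%S")
        rows.append(
            {
                "participant": participant,
                "Question_Number": len(rows) + 1,
                "question": current["text"],
                "answer": following["text"],
                "timestamps_difference": end - start,  # a datetime.timedelta
            }
        )
    return rows


rows = pair_turns(sample)
for row in rows:
    print(row)
```

Passing the resulting list of dictionaries to pd.DataFrame(rows) gives one row per question/answer pair. Keeping the difference as a timedelta rather than an integer leaves the choice of display format open.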