I have a file in the format turn_index \t sentence \t metadata.
It looks like this, where the number of turns per dialogue varies:
0 hello metadata1
1 hi! metadata2
0 hi there metadata3
1 how are you? metadata4
2 very well meta5
3 I'm so busy today meta6
I would like to group every two consecutive turns into a list, and collect all the lists from the same dialogue into one bigger list:
[["hello", "hi!"]]
[["hi there", "how are you?"], ["how are you?", "very well"], ["very well", "I'm so busy today"]]
My attempt at windowing the sentences two at a time is not working, and I can't even begin to figure out how to group them per dialogue. My code is the following:
turns = data.readlines()
window_size = 2
i = 0
j = 0
dialogue = []
while i < len(turns) - window_size + 1:
    restart = False
    dialogue=[]
    for turn in turns:
        sec = turn.rstrip().split("\t")
        double_sent = [sec[0], sec[1]]
    i += 1
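Here is a solution that produces the edited output. dialogues will hold all the lists of lists you mentioned.
dialogues = []
double_sent = []
# Walk over every pair of adjacent lines.
for line1, line2 in zip(turns[:-1], turns[1:]):
    if int(line2.split('\t')[0]) - int(line1.split('\t')[0]) == 1:
        # Same dialogue: store the two sentences as one pair.
        double_sent.append([line1.split('\t')[1], line2.split('\t')[1]])
    else:
        # The turn index did not increase by 1: the dialogue has ended.
        dialogues.append(double_sent)
        double_sent = []
# Store the pairs of the last dialogue.
dialogues.append(double_sent.copy())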
Here
zip(turns[:-1], turns[1:])
is a neat expression for iterating over every pair of consecutive elements in a sequence. It is definitely worth remembering.
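As a small illustration of that pairing idiom, here is a sketch on a made-up list (items is only example data, not something from the question):

```python
# Hypothetical sample data, only to show the pairing behaviour.
items = ["a", "b", "c", "d"]

# items[:-1] drops the last element and items[1:] drops the first, so
# zipping the two slices yields every pair of adjacent elements.
pairs = list(zip(items[:-1], items[1:]))
print(pairs)  # [('a', 'b'), ('b', 'c'), ('c', 'd')]
```

A list of n elements gives n - 1 pairs, and a list with fewer than two elements gives none.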
The next line
if int(line2.split('\t')[0])-int(line1.split('\t')[0]) == 1:
checks whether the turn indices of the two selected lines are consecutive. With correct numbering, this condition fails only when the index switches back to 0, which indicates that a dialogue is finished and its pairs can be appended to the dialogues list. If there is an error in the numbering (a skipped or repeated index, for example), the condition also fails and the output will be wrong.
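If skipped indices are a concern, one possible variation is to start a new dialogue only when the turn index stops increasing, rather than requiring a difference of exactly 1. The following is a rough sketch under that assumption; the function name pair_turns and the sample lines are hypothetical and not part of the question.

```python
def pair_turns(lines):
    """Return one list of [sentence, next_sentence] pairs per dialogue.

    Assumption: a new dialogue starts wherever the turn index does not
    increase compared with the previous line.
    """
    dialogues = []
    current = []          # pairs collected for the dialogue in progress
    prev_index = None
    prev_sentence = None
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        index, sentence = int(fields[0]), fields[1]
        if prev_index is not None:
            if index > prev_index:
                # Still inside the same dialogue: pair with the previous turn.
                current.append([prev_sentence, sentence])
            else:
                # Index went back (or repeated): close the dialogue.
                dialogues.append(current)
                current = []
        prev_index, prev_sentence = index, sentence
    if prev_index is not None:
        # Store the pairs of the last dialogue.
        dialogues.append(current)
    return dialogues


# Hypothetical sample with a gap in the numbering (index 1 is missing).
sample = [
    "0\thello\tmetadata1\n",
    "1\thi!\tmetadata2\n",
    "0\thi there\tmetadata3\n",
    "2\thow are you?\tmetadata4\n",
]
result = pair_turns(sample)
```

Here the gap between 0 and 2 does not split the second dialogue. Whether that is the desired behaviour depends on how the numbering errors arise in the real data.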
# Output
>>> dialogues
[[['hello', 'hi!']], [['hi there', 'how are you?'], ['how are you?', 'very well'], ['very well', "I'm so busy today"]]]