Tags: python, for-loop, nlp, linguistics

Python - How to loop through each index position in a list?


Given a list [[["source1"], ["target1"], ["alignment1"]], [["source2"], ["target2"], ["alignment2"]], ...], I want to extract the words in the source that align with the words in the target. For example, for the English-German sentence pair "The hat is on the table ." - "Der Hut liegt auf dem Tisch .", I want to print the following:

The - Der
hat - Hut
is - liegt
on - auf
the - dem
table - Tisch
. - . 

So I have written the following:

en_de = [
[['The', 'hat', 'is', 'on', 'the', 'table', '.'], ['Der', 'Hut', 'liegt', 'auf', 'dem', 'Tisch', '.'], '0-0 1-1 2-2 3-3 4-4 5-5 6-6'], 
[['The', 'picture', 'is', 'on', 'the', 'wall', '.'], ['Das', 'Bild', 'hängt', 'an', 'der', 'Wand', '.'], '0-0 1-1 2-2 3-3 4-4 5-5 6-6'], 
[['The', 'bottle', 'is', 'under', 'the', 'sink', '.'], ['Die', 'Flasche', 'ist', 'unter', 'dem', 'Waschbecken', '.'], '0-0 1-1 2-2 3-3 4-4 5-5 6-6']
]

for group in en_de:
    src_sent = group[0]
    tgt_sent = group[1]
    aligns = group[2]

    split_aligns = aligns.split()

    hyphen_split = [align.split("-") for align in split_aligns]

    align_index = hyphen_split[0]

    print(src_sent[int(align_index[0])],"-", tgt_sent[int(align_index[1])])

This prints, as expected, the words in index position 0 of src_sent and tgt_sent:

The - Der
The - Das
The - Die

Now, I don't know how to print the words at every index position of src_sent and tgt_sent. I could manually update align_index for each position in the sentence pair, but in the full dataset some sentences have up to 25 index positions. Is there a way to loop through each index position? When I try:

align_index = hyphen_split[0:]
print(src_sent[int(align_index[0])],"-", tgt_sent[int(align_index[1])])

I get a TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'. It's clear that align_index can't be a list, but I'm not sure how to convert it into something that does what I want. Any advice or help would be greatly appreciated. Thank you in advance.


Solution

  • You are forgetting to loop over your hyphen_split list:

    for group in en_de:
        src_sent = group[0]
        tgt_sent = group[1]
        aligns = group[2]
    
        split_aligns = aligns.split()
    
        hyphen_split = [align.split("-") for align in split_aligns]
    
        for align_index in hyphen_split:
            print(src_sent[int(align_index[0])],"-", tgt_sent[int(align_index[1])])
    

    See the last two lines, updated from your code.
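
As for the TypeError: hyphen_split[0:] is a slice that returns the whole list of pairs, so align_index[0] is itself a list such as ['0', '0'], and int() cannot convert a list. If it helps, the loop above can also be wrapped in a small helper. This is only a sketch, not the only way to do it: the name aligned_pairs is made up for illustration, and it assumes every alignment token has the form "i-j" with both indices in range for their sentences.

```python
def aligned_pairs(src_sent, tgt_sent, aligns):
    """Return (source_word, target_word) tuples for an alignment string like '0-0 1-1'.

    Hypothetical helper: assumes each token is 'i-j' and both indices are valid.
    """
    pairs = []
    for align in aligns.split():
        # Each token looks like "2-3": source index, hyphen, target index.
        src_idx, tgt_idx = (int(i) for i in align.split("-"))
        pairs.append((src_sent[src_idx], tgt_sent[tgt_idx]))
    return pairs


en_de = [
    [['The', 'hat', 'is', 'on', 'the', 'table', '.'],
     ['Der', 'Hut', 'liegt', 'auf', 'dem', 'Tisch', '.'],
     '0-0 1-1 2-2 3-3 4-4 5-5 6-6'],
]

# Unpack each group directly instead of indexing group[0], group[1], group[2].
for src_sent, tgt_sent, aligns in en_de:
    for src_word, tgt_word in aligned_pairs(src_sent, tgt_sent, aligns):
        print(src_word, "-", tgt_word)
```

Because the helper reads both indices from each token, it should also cope with alignments that are not one-to-one in order, for example '0-1 1-0'.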