Tags: python, string, translation, text-alignment, defaultdict

How to extract matching strings into a defaultdict(set)? Python


I have a text file with lines like the one below, where an English sentence is followed by a Spanish sentence and the corresponding translation table, each delimited by " {##} ". (If you know it, it's the output of giza-pp.)

you have requested a debate on this subject in the course of the next few days , during this part-session . {##} sus señorías han solicitado un debate sobre el tema para los próximos días , en el curso de este período de sesiones . {##} 0-0 0-1 1-2 2-3 3-4 4-5 5-6 6-7 7-8 8-9 12-10 13-11 14-11 15-12 16-13 17-14 9-15 10-16 11-17 18-18 17-19 19-21 20-22

The translation table is read as follows: 0-0 0-1 means that the 0th word in English (i.e. you) matches the 0th and 1st words in Spanish (i.e. sus señorías).

Let's say I want to know the Spanish translation of course in this sentence. Normally I'd do it this way:

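from collections import defaultdict

# x is one line of the file, as shown above
eng, spa, trans = x.split(" {##} ")
tt = defaultdict(set)
for s, t in [i.split("-") for i in trans.split(" ")]:
    tt[int(s)].add(int(t))  # integer keys, so word positions can be looked up

query = 'course'
spa_words = spa.split(" ")
# index() on the word list (not the raw string) gives the word position
for i in tt[eng.split(" ").index(query)]:
    print(spa_words[i])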

Is there a simpler way to do the above? Maybe a regex? line.find()?

After some tries, I had to do this in order to cover other issues such as MWEs (multi-word expressions) and missing translations:

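from collections import defaultdict

def getTranslation(gizaline, query):
    src, trg, trans = gizaline.split(" {##} ")
    # Map each source word position to the set of aligned target positions.
    tt = defaultdict(set)
    for s, t in [i.split("-") for i in trans.split(" ")]:
        tt[int(s)].add(int(t))
    try:
        query_translated = [trg.split(" ")[i] for i in tt[src.split(" ").index(query)]]
    except ValueError:
        # Query is not a standalone word: look for a hyphenated word containing it.
        for i in src.split(" "):
            # Test both sides explicitly; `"-"+query or ... in i` is always true.
            if "-" + query in i or query + "-" in i:
                query = i
                break
        query_translated = [trg.split(" ")[i] for i in tt[src.split(" ").index(query)]]

    if len(query_translated) > 0:
        return ":".join(query_translated)
    else:
        return "#NULL"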

Solution

  • That way works fine, but I'd do it slightly differently, using a list instead of a set so the words stay in the order they were added (a set has no guaranteed order, which is not quite what we want):

    File: q_15125575.py

    #-*- encoding: utf8 -*-
    from collections import defaultdict
    
    INPUT = """you have requested a debate on this subject in the course of the next few days , during this part-session . {##} sus señorías han solicitado un debate sobre el tema para los próximos días , en el curso de este período de sesiones . {##} 0-0 0-1 1-2 2-3 3-4 4-5 5-6 6-7 7-8 8-9 12-10 13-11 14-11 15-12 16-13 17-14 9-15 10-16 11-17 18-18 17-19 19-21 20-22"""
    
    if __name__ == "__main__":
        english, spanish, trans = INPUT.split(" {##} ")
        eng_words = english.split(' ')
        spa_words = spanish.split(' ')
        transtable = defaultdict(list)
        for e, s in [i.split('-') for i in trans.split(' ')]:
            transtable[eng_words[int(e)]].append(spa_words[int(s)])
    
        print(transtable['course'])
        print(transtable['you'])
        print(" ".join(transtable['course']))
        print(" ".join(transtable['you']))
    

    Output:
    ['curso']
    ['sus', 'se\xc3\xb1or\xc3\xadas']
    curso
    sus señorías

    The code is slightly longer because I'm using the actual words instead of the indexes, but this lets you look words up directly in transtable.

    However, your method and mine both fail on the same issue: repeated words.
    print(" ".join(transtable['this']))
    gives:
    el este
    At least the words come out in the order in which they appear, so it's workable. Want the translation of the first occurrence of 'this'?
    transtable['this'][0] would give you the first word.

    And using your code:

    tt = defaultdict(set)
    for e, s in [i.split('-') for i in trans.split(' ')]:
        tt[int(e)].add(int(s))
    
    query = 'this'
    for i in tt[eng_words.index(query)]:
        print(i)
    

    Gives:
    7

    Your code will only print the index of the first occurrence of a word.
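
One possible way around the repeated-word issue is to key the table on word positions and then visit every position where the query occurs. The following is only a sketch under that assumption: the function name translations_by_occurrence and the constant LINE are made up for illustration, and the code expects a single well-formed line in the format shown in the question.

```python
# -*- coding: utf-8 -*-
from collections import defaultdict

# The example line from the question, split across string literals for readability.
LINE = (
    "you have requested a debate on this subject in the course of the next "
    "few days , during this part-session . {##} "
    "sus señorías han solicitado un debate sobre el tema para los próximos "
    "días , en el curso de este período de sesiones . {##} "
    "0-0 0-1 1-2 2-3 3-4 4-5 5-6 6-7 7-8 8-9 12-10 13-11 14-11 15-12 16-13 "
    "17-14 9-15 10-16 11-17 18-18 17-19 19-21 20-22"
)


def translations_by_occurrence(gizaline, query):
    """Return one list of target words per occurrence of `query` (hypothetical helper)."""
    src, trg, trans = gizaline.split(" {##} ")
    src_words = src.split(" ")
    trg_words = trg.split(" ")

    # Map each source position to the target positions aligned with it.
    table = defaultdict(list)
    for pair in trans.split(" "):
        s, t = pair.split("-")
        table[int(s)].append(int(t))

    # Visit every position where the query occurs, not just the first one.
    # An occurrence with no alignment yields an empty list.
    return [
        [trg_words[t] for t in sorted(table.get(i, []))]
        for i, word in enumerate(src_words)
        if word == query
    ]


if __name__ == "__main__":
    print(translations_by_occurrence(LINE, "this"))
    print(translations_by_occurrence(LINE, "course"))
```

With the example line, the two occurrences of 'this' come back separately (one list containing el, one containing este), so the caller can pick the occurrence it cares about. A word that is absent from the sentence gives an empty result rather than raising ValueError.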