
RegEx for retrieving words from a dictionary


Here is my code. It looks up the words of one dictionary in another one and computes a score from the values found, for each key of the first dictionary.

d_score = {k: [0, 0, 0] for k in d_filtered_words.keys()}
for k, v in d_filtered_words.items():
    for word in v:
        if word in dico_lexique:
            print(word, dico_lexique[word])
            d_score[k] = [a + b for a, b in zip(map(int, dico_lexique[word]), d_score[k])]
d_score = {k: list(map(str, v)) for k, v in d_score.items()}

The result of the print is :

avoir ['101', '3772', '110']
serrer ['175', '936', '252']
main ['251', '2166', '49']
avoir ['101', '3772', '110']
serrer ['175', '936', '252']
enfant ['928', '1274', '19']
aérien ['354', '769', '39']
affable ['486', '45', '32']
affaire ['46', '496', '104']
agent ['265', '510', '18']
connaître ['448', '293', '29']
rien ['24', '185', '818']
trouver ['387', '198', '31']
être ['225', '328', '44']
emmerder ['0', '23', '493']
rien ['24', '185', '818']
suffire ['420', '35', '56']
mettre ['86', '1299', '67']
multiprise ['314', '71', '0']
abasourdir ['0', '43', '393']
ablation ['75', '99', '353']
abominable ['0', '24', '1170']
être ['225', '328', '44']
seul ['65', '97', '540']
ami ['492', '72', '31']
aimer ['1140', '49', '35']

Just to clarify: dico_lexique also contains keys like:

sabot de Vénus>orchidée;294;76;0
imbuvable>boisson;0;0;509
imbuvable>insupportable;0;0;416
accentuer>intensifier;255;89;4
accentuer>mettre un accent;50;29;30

These are the keys I would also like to take into account when looking up words in dico_lexique.

The result of d_score is :

{'15': ['1731', '12856', '792'], '44': ['3079', '4437', '2549'], '45': ['75', '166', '1916'], '47': ['7721', '3854', '7259']}

Hello, just to clarify: the keys containing '>' are also part of dico_lexique, they do not come from another file. dico_lexique holds the different senses of a word, and to tell them apart some keys are followed by '>' and the sense. I am only looking at dico_lexique and d_filtered_words, and I want to take the keys followed by '>' into account, so that when I see 'serrer' in d_filtered_words, the code retrieves the values of 'serrer' and also the values of every key made of 'serrer' followed by '>'.

d_score = {k: [0, 0, 0] for k in d_filtered_words.keys()}
for k, v in d_filtered_words.items():
    for word in v:
        regex =????
        if word in dico_lexique and if word = re.findall(regex, word)

Solution

  • EDIT: new version after you updated the problem.

    Sample data is now:

    >>> d_filtered_words = {
    ...    '1': ['avoir', 'main'],
    ...    '2': ['main', 'serrer', 'posséder'],
    ... }
    
    >>> dico_lexique = {
    ...     'avoir': ('101', '3772', '110'),
    ...     'avoir>posséder': ('91', '2724', '108'),
    ...     'serrer': ('175', '936', '252'),
    ...     'main': ('251', '2166', '49'),
    ... }
    

    You have to process dico_lexique first to remove the parts after the > and group the values by main word:

    >>> values_by_word = {}
    >>> for word, values in dico_lexique.items():
    ...     main, *_ = word.split(">")
    ...     values_by_word.setdefault(main, []).append(values)
    >>> values_by_word
    {'avoir': [('101', '3772', '110'), ('91', '2724', '108')], 'serrer': [('175', '936', '252')], 'main': [('251', '2166', '49')]}
    

    Explanation:

    • main, *_ = word.split(">") keeps everything before an optional > and discards the rest (see destructuring assignment, also called extended iterable unpacking).
    • setdefault creates a new list associated with the main word if there is none yet, returns the list, and append then adds the values to it.
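
The two idioms explained above can be tried in isolation. This is only a small sketch: the name `groups` and the sample strings are illustrative, not taken from your data.

```python
# Star-unpacking keeps the part before the first ">" and discards the rest.
main, *_ = "accentuer>mettre un accent".split(">")
plain, *_ = "serrer".split(">")  # no ">" at all: split() returns one item

# setdefault returns the list already stored under the key, or stores and
# returns a new empty list, so .append() works in both cases.
groups = {}
groups.setdefault("avoir", []).append(("101", "3772", "110"))
groups.setdefault("avoir", []).append(("91", "2724", "108"))

print(main, plain)
print(groups)
```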

    Now, same logic as below:

    >>> def merge_values(tuples):
    ...     """Sums columns (with a str->int->str conversion)"""
    ...     return tuple(str(sum(int(v) for v in vs)) for vs in zip(*tuples))
    
    >>> merged_values_by_word = {code:merge_values(tuples) for code, tuples in values_by_word.items()}
    >>> merged_values_by_word
    {'avoir': ('192', '6496', '218'), 'serrer': ('175', '936', '252'), 'main': ('251', '2166', '49')}
    

    (I renamed get_values to merge_values but it is the same function.) You can use the code below with merged_values_by_word instead of dico_lexique.
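
For reference, here is one way the pieces might fit together from start to finish. It is a sketch under the assumption that the sample data above is representative; the final comprehension that builds `d_score` is my own shortcut and simply reuses merge_values to sum over the words of each code.

```python
d_filtered_words = {
    "1": ["avoir", "main"],
    "2": ["main", "serrer", "posséder"],
}
dico_lexique = {
    "avoir": ("101", "3772", "110"),
    "avoir>posséder": ("91", "2724", "108"),
    "serrer": ("175", "936", "252"),
    "main": ("251", "2166", "49"),
}

def merge_values(tuples):
    """Sums columns (with a str->int->str conversion)"""
    return tuple(str(sum(int(v) for v in vs)) for vs in zip(*tuples))

# Group the values of "word" and every "word>sense" under "word".
values_by_word = {}
for word, values in dico_lexique.items():
    main, *_ = word.split(">")
    values_by_word.setdefault(main, []).append(values)
merged_values_by_word = {w: merge_values(t) for w, t in values_by_word.items()}

# Score each code by summing the merged values of its known words.
d_score = {
    code: merge_values(
        [merged_values_by_word[w] for w in words if w in merged_values_by_word]
    )
    for code, words in d_filtered_words.items()
}
print(d_score)
```

Note that 'posséder' is ignored here: it only appears after a > in dico_lexique, so it is not a main word.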

    End of edit: old version below, for the record

    You're mixing two problems: what your code does (summing the values associated with words or families of words) and parsing a file or a string.

    Some code review

    Let me summarize:

    • you have dico_lexique, which maps a word to three values (strings containing integers);
    • you have d_filtered_words, which maps a code ('15', '44', ...) to a list of words;
    • you create a dict that maps each code to [sum of value1, sum of value2, sum of value3] over every word that is mapped to the code and is present in dico_lexique.

    First, if you always have three values, use a tuple, not a list. I'll use this custom sample:

    >>> d_filtered_words = {
    ...    '1': ['avoir', 'main'],
    ...    '2': ['main', 'serrer', 'posséder'],
    ... }
    
    >>> dico_lexique = {
    ...     'avoir': ('101', '3772', '110'),
    ...     'serrer': ('175', '936', '252'),
    ...     'main': ('251', '2166', '49'),
    ...     # no posséder here
    ... }
    

    Second, build a dict that maps each code to the list of value tuples of its words:

    >>> def get_tuples(words):
    ...     """return the tuples of values for every word in dico_lexique"""
    ...     return [dico_lexique[word] for word in words if word in dico_lexique]
    
    >>> tuples_by_code = {code:get_tuples(words) for code, words in  d_filtered_words.items()}
    >>> tuples_by_code
    {'1': [('101', '3772', '110'), ('251', '2166', '49')], '2': [('251', '2166', '49'), ('175', '936', '252')]}
    

    Third, sum the values "by column". There is an easy way to do it:

    >>> tuples = [(1,2,3), (4,5,6)]
    >>> tuple(zip(*tuples))
    ((1, 4), (2, 5), (3, 6))
    >>> tuple(map(sum, zip(*tuples)))
    (5, 7, 9)
    

    The zip function will group the first element of every tuple, then the second element of every tuple, then...: you get the "columns" and just have to sum them. In your case:

    >>> def get_values(tuples):
    ...     """Sums columns (with a str->int->str conversion)"""
    ...     return tuple(str(sum(int(v) for v in vs)) for vs in zip(*tuples))
    
    >>> values_by_code = {code:get_values(tuples) for code, tuples in tuples_by_code.items()}
    >>> values_by_code
    {'1': ('352', '5938', '159'), '2': ('426', '3102', '301')}
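
One edge case worth knowing: if none of the words of a code is in dico_lexique, zip(*tuples) has nothing to iterate over and get_values returns an empty tuple, whereas your original loop would leave [0, 0, 0]. A possible variant is sketched below; the name `get_values_or_zeros` and the `width` parameter are hypothetical, added only for this illustration.

```python
def get_values_or_zeros(tuples, width=3):
    """Sums columns; falls back to a tuple of '0' when there is nothing to sum.

    `width` (hypothetical parameter) is the number of values per word.
    """
    if not tuples:
        return ("0",) * width
    return tuple(str(sum(int(v) for v in vs)) for vs in zip(*tuples))

print(get_values_or_zeros([]))
print(get_values_or_zeros([("1", "2", "3"), ("4", "5", "6")]))
```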
    

    Your question

    Now your question. Imagine I have a text file with the alternative forms:

    >>> text = """avoir>posséder
    ... voilé>dissimulé
    ... voilé>caché"""
    

    You have to parse that file and to split every line on > to build a dict alternative -> main:

    >>> main_by_alternative = {a: m for line in text.split("\n") for m, a in [line.split(">")]}
    >>> main_by_alternative
    {'posséder': 'avoir', 'dissimulé': 'voilé', 'caché': 'voilé'}
    

    The key idea is to split the line on the char > to get the main form and the alternative form in a list. for m, a in [line.split(">")] is a trick to have m, a = line.split(">") in a dict comprehension. Now, back to get_tuples:

    >>> def get_tuples(words):
    ...     """return the tuples of values for every word in dico_lexique"""
    ...     return [dico_lexique[main_by_alternative.get(word, word)] for word in words if main_by_alternative.get(word, word) in dico_lexique]
    

    What's new? Look at main_by_alternative.get(word, word): it returns the main form if the word is a known alternative, and the word itself otherwise.

    >>> {code:get_tuples(words) for code, words in  d_filtered_words.items()}
    {'1': [('101', '3772', '110'), ('251', '2166', '49')], '2': [('251', '2166', '49'), ('175', '936', '252'), ('101', '3772', '110')]}
    

    The code 2 is now mapped to the three words: 'main', 'serrer', 'avoir' (via 'posséder').

    Hope it helps. I used a lot of dict/list comprehensions to make it short, but if you need, do not hesitate to expand the code into regular loops.
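
As an illustration of that last remark, the parsing step and get_tuples might be expanded into regular loops roughly as follows. This is a sketch using the same sample data as above; `text.splitlines()` is used instead of splitting on the newline character, which should be equivalent for this input.

```python
text = """avoir>posséder
voilé>dissimulé
voilé>caché"""

dico_lexique = {
    "avoir": ("101", "3772", "110"),
    "serrer": ("175", "936", "252"),
    "main": ("251", "2166", "49"),
}

# Build the alternative -> main mapping, one line at a time.
main_by_alternative = {}
for line in text.splitlines():
    main, alternative = line.split(">")
    main_by_alternative[alternative] = main

def get_tuples(words):
    """return the tuples of values for every word in dico_lexique"""
    result = []
    for word in words:
        # Replace an alternative form by its main form, if there is one.
        main = main_by_alternative.get(word, word)
        if main in dico_lexique:
            result.append(dico_lexique[main])
    return result

print(get_tuples(["main", "serrer", "posséder"]))
```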