First time here asking for help, I hope everything is clear! FACT: I'm building an app for a role-playing game (GURPS) that tracks the damage players deal to enemies. The app itself is mostly done, and I used PySimpleGUI for the graphical interface. The next step is to integrate voice commands, so input can come from speech instead of the keyboard (there are several inputs, so why not?). I used the SpeechRecognition library to capture the voice input, creating a string variable that stores what the user said. Now I'm working on the second part: extracting the inputs from that string. The last part will be to store those inputs in a dictionary and use it as input for my functions.
WHAT I'M TRYING TO ACHIEVE: I'm having a lot of problems designing the matches with spaCy. Since I don't think there are datasets to train a neural network or an ML model for my task, I'm using rule-based matching. This way, every sentence has to be structured in a certain way so I can extract the tokens I want. An example sentence is this one: "You hit the enemy zombie one, that has vulnerability 2, to the head, with a large piercing attack, dealing 8 damage".
The inputs I have to extract are the following:
PROBLEMS: I'm currently using DependencyMatcher. The main problems are:
import spacy
from spacy.matcher import DependencyMatcher

string = "You hit the enemy zombie one, that has vulnerability 2, to the head, with a large piercing attack, dealing 8 damage."
nlp = spacy.load("en_core_web_sm")  # NOTE: "en_core_news_sm" does not exist; the small English model is en_core_web_sm
doc = nlp(string)
# for token in doc:
#     print(token.text, token.dep_)
# Two lists with all the body locations and attack types, so the words can be found via the
# "LOWER" or "LEMMA" attributes (the first part of each list is in English, the second in Italian).
# NOTE: "LOWER" compares against the lowercased token, so the values must be lowercase too.
body_list_words = ["body", "head", "arm_right", "arm_left", "leg_right", "leg_left", "hand_right", "hand_left", "foot_right", "foot_left",
                   "groin", "skull", "vitals", "neck", "corpo", "testa", "braccio destro", "braccio sinistro", "gamba destra", "gamba sinistra",
                   "mano destra", "mano sinistra", "piede destro", "piede sinistro", "testicoli", "cranio", "vitali", "collo"]
attack_type_words = ["cutting", "impaling", "crushing", "small penetration", "penetration", "big penetration", "huge penetration",
                     "burning", "explosive", "tagliente", "impalamento", "schiacciamento", "penetrazione minore", "piccola penetrazione",
                     "penetrazione", "penetrazione maggiore", "enorme penetrazione", "infuocati", "esplosivi"]
###############
# Find the matches
###############
matcher = DependencyMatcher(nlp.vocab)
# Start by anchoring on the verb
patterns = [
    {"RIGHT_ID": "anchor_verbo",
     "RIGHT_ATTRS": {"POS": "VERB"}},
    # Looking for the object of the verb (word: enemy)
    {"LEFT_ID": "anchor_verbo",
     "REL_OP": ">",
     "RIGHT_ID": "obj_verbo",
     "RIGHT_ATTRS": {"DEP": "dobj"}},  # the English models label the direct object "dobj", not "obj"
    # Looking for the name of the enemy: zombie1
    {"LEFT_ID": "obj_verbo",
     "REL_OP": ">",
     "RIGHT_ID": "type_enemy",
     "RIGHT_ATTRS": {"DEP": "nmod"}},
    # Looking for the word: vulnerability
    {"LEFT_ID": "anchor_verbo",
     "REL_OP": ">",
     "RIGHT_ID": "vulnerability",
     "RIGHT_ATTRS": {"LEMMA": "vulnerability"}},
    # Looking for the number associated with vulnerability
    {"LEFT_ID": "vulnerability",
     "REL_OP": ">",
     "RIGHT_ID": "num_vulnerability",
     "RIGHT_ATTRS": {"DEP": "nummod"}},
    # Location of the body part hit
    {"LEFT_ID": "anchor_verbo",
     "REL_OP": ">>",
     "RIGHT_ID": "location",
     "RIGHT_ATTRS": {"LOWER": {"IN": body_list_words}}},
    # Looking for the word: attack, in order to find the type of attack
    {"LEFT_ID": "anchor_verbo",
     "REL_OP": ">>",
     "RIGHT_ID": "attack",
     "RIGHT_ATTRS": {"POS": "NOUN"}},
    # Looking for the type of attack
    {"LEFT_ID": "attack",
     "REL_OP": ">>",
     "RIGHT_ID": "type_attack",
     "RIGHT_ATTRS": {"LEMMA": {"IN": attack_type_words}}},
    # Looking for the word: damage, in order to extract the number
    {"LEFT_ID": "attack",
     "REL_OP": ">>",
     "RIGHT_ID": "word_damage",
     "RIGHT_ATTRS": {"DEP": "nmod"}},
    # Looking for the number
    {"LEFT_ID": "word_damage",
     "REL_OP": ">>",
     "RIGHT_ID": "num_damage",
     "RIGHT_ATTRS": {"DEP": "nummod"}}
]
matcher.add("Inputs1", [patterns])
matches = matcher(doc)
match_id, token_ids = matches[0]  # NOTE: raises IndexError if nothing matched
matched_words = []
for i in range(len(token_ids)):
    # print(patterns[i]["RIGHT_ID"] + ":", doc[token_ids[i]].text)
    matched_words.append(doc[token_ids[i]].text)
#########
# Now build the dictionary, dropping the first element (the anchor verb)
#########
del matched_words[0]
print(matched_words)
input_dict = {matched_words[0]: matched_words[1], "location": matched_words[4],
              matched_words[5]: matched_words[6], matched_words[7]: matched_words[8],
              matched_words[2]: matched_words[3]}
# print(input_dict)
return input_dict  # this snippet lives inside a function in the real app
General problem to solve: any compound phrase whose words should be grouped together (such as "right arm", "left leg", "large penetration") can't be extracted this way (only "arm", "leg" or "penetration" would be returned).
Can you help me? Thanks!
To summarize your problem, you are getting single words, but you want to capture multiple words that are a single unit, like "right arm".
You can do this with the dependency matcher but it'll take a little work. Basically you want to match the whole subtree of the single word you're getting now. In the phrase "right arm", "arm" is the head noun, and "right" will depend on "arm". All the words that depend on "arm", directly or indirectly (through other words), are called the "subtree".
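To make this concrete, here is a minimal sketch of subtree extraction. To keep it runnable without downloading a trained model, it hand-annotates a tiny parse of the hypothetical phrase "hit the right arm" via the `Doc` constructor; in your real code you would instead call `.subtree` on `doc[token_id]` for the location token the matcher returned.

```python
import spacy
from spacy.tokens import Doc

# Hand-annotated parse so the example runs on a blank pipeline.
# With a real model, nlp(text) produces these heads/deps for you.
nlp = spacy.blank("en")
words = ["hit", "the", "right", "arm"]
heads = [0, 3, 3, 0]                    # "the" and "right" attach to "arm"; "arm" attaches to "hit"
deps = ["ROOT", "det", "amod", "dobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

arm = doc[3]
# .subtree yields the token plus everything that depends on it, in document order
phrase = " ".join(t.text for t in arm.subtree)
print(phrase)  # "the right arm"
```

In practice you might also skip determiners (tokens with `dep_ == "det"`) when joining, so you get "right arm" rather than "the right arm".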
Understanding the dependencies is a little complicated but very powerful. I recommend you read Chapter 14 in the Jurafsky and Martin book, which is a straightforward guide to dependency parsing. Feel free to skim lots of it.
That said, for the kind of phrases you want, there is a simpler method you can try in spaCy: the built-in merge_noun_chunks pipeline component, which turns noun chunks into single tokens that are easier to work with.
A noun chunk is kind of hard to define, and the way it works in spaCy may not be exactly what you want, but you can also look at the source for it to write your own definition if you want. To do that, though, you will have to understand dependency parsing.
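Here is a minimal sketch of merge_noun_chunks, assuming spaCy v3 where it is a registered pipeline component you add with `nlp.add_pipe("merge_noun_chunks")`. Again a hand-annotated parse (of the hypothetical "You hit the right arm") keeps it runnable without a model; with a real model you would just add the component after `spacy.load(...)` and call `nlp(text)` as usual.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
merge = nlp.add_pipe("merge_noun_chunks")   # built-in component factory in spaCy v3

# Hand-annotated parse; noun chunking needs POS tags as well as dependencies
words = ["You", "hit", "the", "right", "arm"]
heads = [1, 1, 4, 4, 1]
deps = ["nsubj", "ROOT", "det", "amod", "dobj"]
pos = ["PRON", "VERB", "DET", "ADJ", "NOUN"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps, pos=pos)

# Each noun chunk is retokenized into a single token
doc = merge(doc)
print([t.text for t in doc])  # ['You', 'hit', 'the right arm']
```

After merging, `"the right arm"` is one token, so a single DependencyMatcher pattern (or a simple membership test against your word lists) can pick up the whole phrase.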