python, list, dictionary, sorting

How can I sort different elements based on keywords?


I'm trying to sort sentences from a text file according to the part of speech of the marked word in each sentence. For example, given "the big [house]" and "the {red} flower", I want to create two dictionaries, such as dict1

{"house": ["the big house", "substantive"]}

and dict2

{"red": ["the red flower", "adjective"]}

My idea is to merge them later into a single dictionary whose keys are the marked words and whose values are lists containing the sentence each word came from and its part of speech.

I've tried multiple approaches, but they always end up mixing everything up with almost no order. This is my latest attempt, and although I know it could be better formatted and is not the cleanest solution, it's the closest I've gotten to working code so far.

This is a sample of the sentences I'm working with:

Es (duftete) nach Erde und Pilze
die [Wände] waren mit Moos überzogen.
Ihr zerrissenes [Gewand] war wieder wie neu
Er saß da wie verzaubert und schaute sie an und konnte seine Augen nicht {mehr} von ihr abwenden
Da sie durchscheinend waren, sahen sie aus wie aus rosa [Glas], das von innen erleuchtet ist.

and this is what I wrote to sort them:

def getWordsSelected (sentence):
    #the parameter sentence gets a list with the previous sentence sample showed
    global WordsDictionary
    WordsDictionary = {}

    verbDict = {}
    subsDict = {}
    adjDict = {}
    
    for wordSentenceToSearch in sentence :
        #SUBSTANTIVE 

        startSubstantive = wordSentenceToSearch.find("[")
        endSubstantive = wordSentenceToSearch.find("]")
        substringSubstantive = wordSentenceToSearch[startSubstantive:endSubstantive]
        wordToSearchSubstantive = substringSubstantive.strip("[]")

        
        subsDict [wordToSearchSubstantive] = [wordSentenceToSearch]
        subsDict.setdefault(wordToSearchSubstantive, []).append("substantive")

    for wordSentenceToSearch in sentence :

        #VERB
        startVerb = wordSentenceToSearch.find("(")
        endVerb = wordSentenceToSearch.find(")")
        substringVerb = wordSentenceToSearch[startVerb:endVerb]
        wordToSearchVerb = substringVerb.strip("()")

       
        verbDict [wordToSearchVerb] = [wordSentenceToSearch]
        verbDict.setdefault(wordToSearchVerb, []).append("Verb")
        
    for wordSentenceToSearch in sentence :

        #ADJ

        startADJ = wordSentenceToSearch.find("{")
        endADJ = wordSentenceToSearch.find("}")
        substringADJ = wordSentenceToSearch[startADJ:endADJ]
        wordToSearchADJ = substringADJ.strip(r"{}")

       
        adjDict [wordToSearchADJ] = [wordSentenceToSearch]
        adjDict.setdefault(wordToSearchADJ, []).append("ADJ")

    print(subsDict)
    print(verbDict)
    print(adjDict)

This almost works; however, this is the result:

{'': ['Er saß da wie verzaubert und schaute sie an und konnte seine Augen nicht {mehr} von ihr abwenden', 'substantive'], 'Wände': ['die [Wände] waren mit Moos überzogen.', 'substantive'], 'Gewand': ['Ihr zerrissenes [Gewand] war wieder wie neu', 'substantive'], 'Glas': ['Da sie durchscheinend waren, sahen sie aus wie aus rosa [Glas], das von innen erleuchtet ist.', 'substantive']}

The dictionary above should contain only substantives, and it almost does, except for the first element: it adds the sentence with the highlighted word "mehr", which is not a substantive. (That is why the key is empty: nothing in that sentence qualifies as a substantive, yet the sentence still gets added for some reason.)

{'duftete': ['Es (duftete) nach Erde und Pilze', 'Verb'], '': ['Da sie durchscheinend waren, sahen sie aus wie aus rosa [Glas], das von innen erleuchtet ist.', 'Verb']}

Here is the verb dictionary. It gets duftete right (the only verb in the sample), but again it crams in another sentence without rhyme or reason.

{'': ['Da sie durchscheinend waren, sahen sie aus wie aus rosa [Glas], das von innen erleuchtet ist.', 'ADJ'], 'mehr': ['Er saß da wie verzaubert und schaute sie an und konnte seine Augen nicht {mehr} von ihr abwenden', 'ADJ']}

Finally, the adjective and adverb category (they must be in the same list) also adds the sentence for Glas, which is a substantive and shouldn't be there, since nothing in that sentence should match this category.

So, what is happening here? Why does it add sentences without any (apparent) logical explanation? And most importantly, what can I do to fix this so that the sentences are sorted appropriately?


Solution

  • Here's a working solution. As I said in my comment, using regular expressions makes it far easier to retrieve the "highlighted" word. Note that it would be quite easy to make the code more flexible (so that new categories can be added) while avoiding the repeated if ... statements: store the word category delimiters in a dictionary and replace the three dictionaries with one dictionary of dictionaries.

    import re
    
    sentences = [
        "Es (duftete) nach Erde und Pilze",
        "die [Wände] waren mit Moos überzogen.",
        "Ihr zerrissenes [Gewand] war wieder wie neu",
        "Er saß da wie verzaubert und schaute sie an und konnte seine Augen nicht {mehr} von ihr abwenden",
        "Da sie durchscheinend waren, sahen sie aus wie aus rosa [Glas], das von innen erleuchtet ist.",
    ]
    
    
    def getWordsSelected(sentences):
        # sentences is a list of sentences like the sample shown above
    
        verbDict = {}
        subsDict = {}
        adjDict = {}
    
        for wordSentenceToSearch in sentences:
            # SUBSTANTIVE
            if (substantive := re.findall(r'\[([^]]*)', wordSentenceToSearch)):
                subsDict.setdefault(substantive[0], []).append((wordSentenceToSearch, "substantive"))
    
            # VERB
            if (verb := re.findall(r'\(([^)]*)', wordSentenceToSearch)):
                verbDict.setdefault(verb[0], []).append((wordSentenceToSearch, "verb"))
    
            # ADJ
            if (adj := re.findall(r'\{([^}]*)', wordSentenceToSearch)):
                adjDict.setdefault(adj[0], []).append((wordSentenceToSearch, "adjective"))
    
        print(subsDict)
        print(verbDict)
        print(adjDict)
    

    OUTPUT:

    getWordsSelected(sentences)
    {'Wände': [('die [Wände] waren mit Moos überzogen.', 'substantive')], 'Gewand': [('Ihr zerrissenes [Gewand] war wieder wie neu', 'substantive')], 'Glas': [('Da sie durchscheinend waren, sahen sie aus wie aus rosa [Glas], das von innen erleuchtet ist.', 'substantive')]}
    {'duftete': [('Es (duftete) nach Erde und Pilze', 'verb')]}
    {'mehr': [('Er saß da wie verzaubert und schaute sie an und konnte seine Augen nicht {mehr} von ihr abwenden', 'adjective')]}
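
As for why the original code added unrelated sentences under an empty key: str.find returns -1 when the character is not present, so for a sentence without the marker the slice becomes sentence[-1:-1], which is the empty string. Because every sentence is stored unconditionally and the plain assignment overwrites the previous value, the '' key ends up holding the last sentence that lacked the marker. The snippet below is only a small sketch of that behaviour; the helper name extract is hypothetical and not part of the original code.

```python
sentence = "Es (duftete) nach Erde und Pilze"  # contains no [ ... ] marker

start = sentence.find("[")  # -1, because "[" does not occur
end = sentence.find("]")    # -1 as well
word = sentence[start:end].strip("[]")  # sentence[-1:-1] is the empty string

print(start, end, repr(word))  # -1 -1 ''


def extract(sentence, opening, closing):
    """Hypothetical helper: return the marked word, or None if there is no marker."""
    start = sentence.find(opening)
    end = sentence.find(closing, start + 1)
    if start == -1 or end == -1:
        return None  # skip sentences that lack this marker
    return sentence[start + 1:end]


print(extract("die [Wände] waren mit Moos überzogen.", "[", "]"))  # Wände
print(extract(sentence, "[", "]"))  # None
```

Checking the result of find (or using the if around re.findall as in the code above) is what keeps sentences without the marker out of each dictionary.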
    

    Edit: here's an improved version following what I wrote earlier:

    import re
    
    
    def getWordsSelected(sentences):
        # sentences is a list of sentences like the sample shown above
    
        word_categories = {
            'verb': '()',
            'substantive': '[]',
            'adjective': '{}'
        }
    
        word_dict = {category: {} for category in word_categories}
    
        for wordSentenceToSearch in sentences:
            for category, delimiters in word_categories.items():
                if word := re.findall(
                        fr'{re.escape(delimiters[0])}([^{re.escape(delimiters[1])}]*)',
                        wordSentenceToSearch
                ):
                    word_dict[category].setdefault(word[0], []).append((wordSentenceToSearch, category))
    
        print(word_dict)
    

    OUTPUT:

    {
    'verb': {'duftete': [('Es (duftete) nach Erde und Pilze', 'verb')]},
    'substantive': {'Wände': [('die [Wände] waren mit Moos überzogen.', 'substantive')], 'Gewand': [('Ihr zerrissenes [Gewand] war wieder wie neu', 'substantive')], 'Glas': [('Da sie durchscheinend waren, sahen sie aus wie aus rosa [Glas], das von innen erleuchtet ist.', 'substantive')]},
    'adjective': {'mehr': [('Er saß da wie verzaubert und schaute sie an und konnte seine Augen nicht {mehr} von ihr abwenden', 'adjective')]}
    }
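
The question also mentions merging everything into a single dictionary that maps each keyword to a list with its sentence and part of speech. A minimal sketch of that, under some assumptions: the names DELIMITERS and merge_marked_words are made up for this example, each keyword is assumed to appear in only one sentence (a repeated keyword would overwrite the earlier entry), and the markers are removed from the stored sentence as in the "the big house" example from the question.

```python
import re

# Hypothetical mapping from part of speech to its (opening, closing) delimiters.
DELIMITERS = {
    "verb": ("(", ")"),
    "substantive": ("[", "]"),
    "adjective": ("{", "}"),
}


def merge_marked_words(sentences):
    """Return {keyword: [sentence without its marker, part of speech]}."""
    merged = {}
    for sentence in sentences:
        for category, (opening, closing) in DELIMITERS.items():
            # Require both delimiters and capture the text between them.
            pattern = re.escape(opening) + r"(.+?)" + re.escape(closing)
            match = re.search(pattern, sentence)
            if match:
                keyword = match.group(1)
                # Rebuild the sentence with the delimiters of this match removed.
                clean = sentence[:match.start()] + keyword + sentence[match.end():]
                merged[keyword] = [clean, category]
    return merged


sample = [
    "Es (duftete) nach Erde und Pilze",
    "die [Wände] waren mit Moos überzogen.",
]
print(merge_marked_words(sample))
# {'duftete': ['Es duftete nach Erde und Pilze', 'verb'], 'Wände': ['die Wände waren mit Moos überzogen.', 'substantive']}
```

If the same keyword can occur in several sentences, storing a list of [sentence, category] pairs per keyword (as the setdefault(...).append(...) calls above do) would avoid losing entries.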