Split sentences, process words, and put sentence back together?


I have a function that scores words. My texts range from single sentences to documents several pages long. I'm stuck on how to score the words and then return the text in something close to its original form.

Here's an example sentence:

"My body lies over the ocean, my body lies over the sea."

What I want to produce is the following:

"My body (2) lies over the ocean (3), my body (2) lies over the sea."

I've figured out how to take text, tear it apart, and score it. However, I'm stuck on how to put it back together in the format I need.

Here's a dummy version of my function:

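from nltk.stem import WordNetLemmatizer  # assumed source of `lemmatizer`
from textblob import TextBlob

lemmatizer = WordNetLemmatizer()

def word_score(text):
    words_to_work_with = []
    words_to_return = []
    passed_text = TextBlob(text)
    # Normalize each word: singular, lowercase, lemmatized.
    for word in passed_text.words:
        word = word.singularize().lower()
        word = str(word)
        e_word_lemma = lemmatizer.lemmatize(word)
        words_to_work_with.append(e_word_lemma)
    # Pair each normalized word with its score (None if unscored).
    for word in words_to_work_with:
        if word == 'body':
            score = 2
        elif word == 'ocean':
            score = 3
        else:
            score = None
        words_to_return.append((word, score))
    return words_to_return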

I'm a relative newbie so I have two questions:

  1. How can I put the text back together, and
  2. Should that logic be put into the function or outside of it?

I'd really like to be able to feed entire segments (i.e. sentences, documents) into the function and have it return them.

Thank you for helping me!


Solution

  • So basically, you want to assign a score to each word. The function can be simplified by using a dictionary instead of several if statements. Also make sure it returns the scores for all the words in words_to_work_with, not just the score of the first one. So the new function would be:

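    from nltk.stem import WordNetLemmatizer  # assumption: source of `lemmatizer`
    from textblob import TextBlob

    lemmatizer = WordNetLemmatizer()

    def word_score(text):
        words_to_work_with = []
        passed_text = TextBlob(text)
        # Normalize every word of the input text before scoring it.
        for word in passed_text.words:
            word = word.singularize().lower()
            word = str(word)  # Is this line really needed?
            e_word_lemma = lemmatizer.lemmatize(word)
            words_to_work_with.append(e_word_lemma)

        dict_scores = {'body': 2, 'ocean': 3}  # etc.
        # One score per word; None if the word is not recognized.
        return [dict_scores.get(word, None) for word in words_to_work_with]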
    

    For the second part, which is reconstructing the string, I would actually do this in the same function (so this answers your second question):

    # Assumes TextBlob and `lemmatizer` are already set up, as for word_score.
    def word_score_and_reconstruct(text):
        passed_text = TextBlob(text)

        dict_scores = {'body': 2, 'ocean': 3}
        dict_strings = {'body': ' (2)', 'ocean': ' (3)'}

        word_scores = []
        reconstructed_words = []

        # Note: .words excludes punctuation, so commas and periods are not restored.
        for original_word in passed_text.words:
            word = original_word.singularize().lower()
            word = str(word)  # Is this line really needed?
            e_word_lemma = lemmatizer.lemmatize(word)

            word_scores.append(dict_scores.get(e_word_lemma, None))  # we still build the scores list here

            # we add 'word' + '(word's score)' only if the word has a score;
            # if not, the default value '' means nothing is appended
            reconstructed_words.append(str(original_word) + dict_strings.get(e_word_lemma, ''))

        return ' '.join(reconstructed_words), word_scores
    

    I can't guarantee this code will work on the first try, since I can't test it, but it should give you the main idea.
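
If the punctuation and spacing of the original text need to survive, one option is to annotate the words in place instead of splitting the text and joining it back. The following is only a sketch under some assumptions: `annotate_scores` and its `normalize` parameter are hypothetical names, it uses the standard `re` module rather than TextBlob, and plain lowercasing stands in for the singularize/lemmatize step (you could pass your own normalizing function instead).

```python
import re

def annotate_scores(text, scores, normalize=str.lower):
    """Append ' (score)' after every scored word, leaving all else untouched."""
    def annotate(match):
        word = match.group(0)
        # Look up the normalized form, but keep the original spelling in the output.
        score = scores.get(normalize(word))
        return word if score is None else f"{word} ({score})"

    # re.sub only rewrites the matched words, so punctuation and
    # whitespace between them are preserved exactly.
    return re.sub(r"\w+", annotate, text)

sentence = "My body lies over the ocean, my body lies over the sea."
print(annotate_scores(sentence, {"body": 2, "ocean": 3}))
# My body (2) lies over the ocean (3), my body (2) lies over the sea.
```

Because the substitution runs over the whole string, the same call should work for a single sentence or a multi-page document.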