I have a function that scores words. I have lots of text from sentences to several page documents. I'm stuck on how to score the words and return the text near its original state.
Here's an example sentence:
"My body lies over the ocean, my body lies over the sea."
What I want to produce is the following:
"My body (2) lies over the ocean (3), my body (2) lies over the sea."
Below is a dummy version of my scoring algorithm. I've figured out how to take text, tear it apart and score it.
However, I'm stuck on how to put it back together into the format I need it in.
Here's a dummy version of my function:
def word_score(text):
words_to_work_with = []
words_to_return = []
passed_text = TextBlob(passed_text)
for word in words_to_work_with:
word = word.singularize().lower()
word = str(word)
e_word_lemma = lemmatizer.lemmatize(word)
words_to_work_with.append(e_word_lemma)
for word in words to work with:
if word == 'body':
score = 2
if word == 'ocean':
score = 3
else:
score = None
words_to_return.append((word,score))
return words_to_return
I'm a relative newbie so I have two questions:
I'd really like to be able to feed entire segments (i.e. sentences, documents) into the function and have it return them.
Thank you for helping me!
So basically, you want to attribute a score for each word. The function you give may be improved using a dictionary instead of several if
statements.
Also you have to return all scores, instead of just the score of the first word
in words_to_work_with
which is the current behavior of the function since it will return an integer on the first iteration.
So the new function would be :
def word_score(text)
words_to_work_with = []
passed_text = TextBlob(text)
for word in words_to_work_with:
word = word.singularize().lower()
word = str(word) # Is this line really useful ?
e_word_lemma = lemmatizer.lemmatize(word)
words_to_work_with.append(e_word_lemma)
dict_scores = {'body' : 2, 'ocean' : 3, etc ...}
return [dict_scores.get(word, None)] # if word is not recognized, score is None
For the second part, which is reconstructing the string, I would actually do this in the same function (so this answers your second question) :
def word_score_and_reconstruct(text):
words_to_work_with = []
passed_text = TextBlob(text)
reconstructed_text = ''
for word in words_to_work_with:
word = word.singularize().lower()
word = str(word) # Is this line really useful ?
e_word_lemma = lemmatizer.lemmatize(word)
words_to_work_with.append(e_word_lemma)
dict_scores = {'body': 2, 'ocean': 3}
dict_strings = {'body': ' (2)', 'ocean': ' (3)'}
word_scores = []
for word in words_to_work_with:
word_scores.append(dict_scores.get(word, None)) # we still construct the scores list here
# we add 'word'+'(word's score)', only if the word has a score
# if not, we add the default value '' meaning we don't add anything
reconstructed_text += word + dict_strings.get(word, '')
return reconstructed_text, word_scores
I'm not guaranteeing this code will work at first try, I can't test it but it'll give you the main idea