
Text Preprocessing for NLP but from List of Dictionaries


I'm attempting to do an NLP project with a Goodreads data set. My data set is a list of dictionaries, and each dictionary looks like this (the list is called 'reviews'):

>>> reviews[0]
{'user_id': '8842281e1d1347389f2ab93d60773d4d',
'book_id': '23310161',
'review_id': 'f4b4b050f4be00e9283c92a814af2670',
'rating': 4,
'review_text': 'Fun sequel to the original.',
'date_added': 'Tue Nov 17 11:37:35 -0800 2015',
'date_updated': 'Tue Nov 17 11:38:05 -0800 2015',
'read_at': '',
'started_at': '',
'n_votes': 7,
'n_comments': 0}

There are 700k+ of these dictionaries in my dataset.

First question: I am only interested in the elements 'rating' and 'review_text'. I know I can delete elements from each dictionary, but how do I do it for all of the dictionaries?

Second question: I am able to do sentence and word tokenization on an individual review by indexing into the list and then selecting the 'review_text' element, like so:

paragraph = reviews[0]['review_text']

And then applying sent_tokenize and word_tokenize like so:

print(sent_tokenize(paragraph))
print(word_tokenize(paragraph))

But how do I apply these methods to the entire data set? I am stuck here, and cannot even attempt any of the text preprocessing (lowercasing, removing punctuation, lemmatizing, etc.).

TIA


Solution

  • To answer the first question, you can simply load the list of dictionaries into a DataFrame, keeping only the columns you are interested in (i.e. rating and review_text). This avoids looping over the records one by one and makes all further processing easier to manage. (A plain-Python alternative that stays with a list of dicts is sketched just below.)
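
    If you would rather keep the plain list of dicts for the first question instead of moving to pandas, a list comprehension that keeps only the two keys of interest is a minimal sketch (key names taken from your example):

    # build a new list of dicts holding only 'rating' and 'review_text'
    slim_reviews = [{'rating': r['rating'], 'review_text': r['review_text']}
                    for r in reviews]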

    Once you have the DataFrame, use apply on the text column to preprocess it (e.g. lowercase, tokenize, remove punctuation, lemmatize, and stem) and generate a new column named tokens that stores the preprocessed text. This takes care of the second question.
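
    Note that sent_tokenize, word_tokenize, and WordNetLemmatizer rely on NLTK data packages that have to be downloaded once; if your environment does not already have them (an assumption about your setup), something like this should do it:

    import nltk

    # one-time downloads for the Punkt tokenizer models and the WordNet data
    nltk.download('punkt')
    nltk.download('wordnet')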

    from nltk import sent_tokenize, word_tokenize
    from nltk.stem import WordNetLemmatizer
    from nltk.stem import PorterStemmer
    import string
    import pandas as pd
    
    punc_list = list(string.punctuation)
    porter = PorterStemmer()    
    lemmatizer = WordNetLemmatizer()
    
    def text_processing(row):
        all_words = list()
        # sentence tokenize
        for sent in sent_tokenize(row['review_text']):
            # lower words and tokenize
            words = word_tokenize(sent.lower())
            # lemmatize
            words_lem = [lemmatizer.lemmatize(w) for w in words]
            # remove punctuation
            used_words = [w for w in words_lem if w not in punc_list]
            # stem
            words_stem = [porter.stem(w) for w in used_words]
            all_words += words_stem
        return all_words
    
    # create dataframe from list of dicts (select only interesting columns)
    df = pd.DataFrame(reviews, columns=['user_id', 'rating', 'review_text'])
    
    df['tokens'] = df.apply(lambda x: text_processing(x), axis=1)
    print(df.head())
    

    Example output:

      user_id  rating              review_text                        tokens
    0       1       4        Fun sequel to the        [fun, sequel, to, the]
    1       2       2  It was a slippery slope  [it, wa, a, slipperi, slope]
    2       3       3     The trick to getting         [the, trick, to, get]
    3       4       3           The bird had a           [the, bird, had, a]
    4       5       5      That dog likes cats        [that, dog, like, cat]
    

    Finally, if you prefer not to keep the DataFrame, you can export it to other formats such as CSV (to_csv), JSON (to_json), or a list of dicts (to_dict('records')).
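
    For example, any of these should work (the file names here are just placeholders):

    df.to_csv('reviews_tokens.csv', index=False)           # CSV
    df.to_json('reviews_tokens.json', orient='records')    # JSON
    records = df.to_dict('records')                        # back to a list of dicts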

    Hope this helps!