Tags: python, deep-learning, gensim, word2vec

Tokenizing a corpus composed of articles into sentences in Python


I would like to build my first deep learning model using Python, and in order to do so I first have to split my corpus (8807 articles) into sentences. My corpus is built as follows:

## Libraries to download
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim

import json
import nltk
import re
import pandas


appended_data = []  # one DataFrame per source file, concatenated below


#for i in range(20014,2016):
#    df0 = pandas.DataFrame([json.loads(l) for l in open('SDM_%d.json' % i)])
#    appended_data.append(df0)

# Load the yearly JSON dumps (one JSON object per line) into DataFrames
for i in range(2005,2016):
    if i > 2013:  # SDM files only exist for 2014 and 2015
        df0 = pandas.DataFrame([json.loads(l) for l in open('SDM_%d.json' % i)])
        appended_data.append(df0)
    df1 = pandas.DataFrame([json.loads(l) for l in open('Scot_%d.json' % i)])
    df2 = pandas.DataFrame([json.loads(l) for l in open('APJ_%d.json' % i)])
    df3 = pandas.DataFrame([json.loads(l) for l in open('TH500_%d.json' % i)])
    df4 = pandas.DataFrame([json.loads(l) for l in open('DRSM_%d.json' % i)])
    appended_data.append(df1)
    appended_data.append(df2)
    appended_data.append(df3)
    appended_data.append(df4)


appended_data = pandas.concat(appended_data)  # one DataFrame with all articles
# doc_set = df1.body

doc_set = appended_data.body  # Series holding the raw text of each article

I am trying to use the function Word2Vec.load_word2vec_format from gensim.models, but I first have to split my corpus (doc_set) into sentences.

from gensim.models import Word2Vec
model = Word2Vec.load_word2vec_format(doc_set, binary=False)

Any recommendations?

cheers


Solution

  • Gensim's Word2Vec expects its training input as a list of tokenized sentences, i.e. sentences = [['first', 'sentence'], ['second', 'sentence']].

    I assume your documents contain more than one sentence each. You should first split them into sentences, which you can do with NLTK (you might need to download the Punkt model first, as shown below). Then tokenize each sentence and put everything together in one list.
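
    If the Punkt sentence splitter is not installed yet, you can fetch it once through NLTK's downloader (a one-off setup step):

    import nltk
    nltk.download('punkt')  # fetches the Punkt models used for sentence splitting

    With that in place, the whole pipeline could look like this: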

    import itertools

    # Punkt sentence splitter shipped with NLTK
    sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

    # doc_set is already the Series of article bodies, so apply the splitter to it directly
    sentenized = doc_set.apply(sent_detector.tokenize)
    sentences = itertools.chain.from_iterable(sentenized.tolist())  # flatten into one stream of sentences

    result = []
    for sent in sentences:
        result += [nltk.word_tokenize(sent)]  # each sentence becomes a list of tokens

    model = gensim.models.Word2Vec(result)
    

    Unfortunately I am not good enough with Pandas to perform all the operations in a "pandastic" way.
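
    For reference, a more pandas-native version might look roughly like this (a sketch that assumes a pandas version providing Series.explode, i.e. 0.25 or newer):

    # article -> list of sentences, then one sentence per row, then sentence -> list of tokens
    tokenized = (doc_set
                 .apply(sent_detector.tokenize)
                 .explode()
                 .dropna()
                 .apply(nltk.word_tokenize))
    result = tokenized.tolist()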

    Pay a lot of attention to the parameters of Word2Vec; picking them right can make a huge difference.
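
    For illustration only (parameter names follow the older gensim API that also provides load_word2vec_format; gensim 4.x renames size to vector_size), a call with the most commonly tuned parameters could look like:

    model = gensim.models.Word2Vec(
        result,
        size=100,     # dimensionality of the word vectors
        window=5,     # context window around each target word
        min_count=5,  # drop words that occur fewer than 5 times
        workers=4,    # training threads
        sg=1,         # 1 = skip-gram, 0 = CBOW
    )

    Afterwards you can sanity-check the embeddings with queries such as model.most_similar(...) on a word that actually occurs in your corpus (model.wv.most_similar in newer gensim versions).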