I would like to analyze my first deep learning model using Python, and in order to do so I first have to split my corpus (8,807 articles) into sentences. My corpus is built as follows:
## Libraries to import
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
import json
import nltk
import re
import pandas
appended_data = []
for i in range(2005, 2016):
    if i > 2013:
        df0 = pandas.DataFrame([json.loads(l) for l in open('SDM_%d.json' % i)])
        appended_data.append(df0)
    df1 = pandas.DataFrame([json.loads(l) for l in open('Scot_%d.json' % i)])
    df2 = pandas.DataFrame([json.loads(l) for l in open('APJ_%d.json' % i)])
    df3 = pandas.DataFrame([json.loads(l) for l in open('TH500_%d.json' % i)])
    df4 = pandas.DataFrame([json.loads(l) for l in open('DRSM_%d.json' % i)])
    appended_data.append(df1)
    appended_data.append(df2)
    appended_data.append(df3)
    appended_data.append(df4)
appended_data = pandas.concat(appended_data)
# doc_set = df1.body
doc_set = appended_data.body
I am trying to use the function Word2Vec.load_word2vec_format from gensim.models, but I first have to split my corpus (doc_set) into sentences.
from gensim.models import Word2Vec
model = Word2Vec.load_word2vec_format(doc_set, binary=False)
Any recommendations?
cheers
So, Gensim's Word2Vec requires this format for its training input: sentences = [['first', 'sentence'], ['second', 'sentence']]. (Note that load_word2vec_format is for loading pre-trained vectors in the word2vec C format, not for training a model on raw text.)
I assume your documents contain more than one sentence each. You should first split them into sentences; you can do that with NLTK (you might need to download the sentence tokenizer model first). Then tokenize each sentence and put everything together in a list.
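If the Punkt model isn't installed yet, a one-time download should do it (a minimal sketch using NLTK's standard downloader):

import nltk
nltk.download('punkt')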
import itertools
# Load NLTK's pre-trained Punkt sentence splitter
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
# doc_set is already the 'body' column, so apply the tokenizer to it directly
sentenized = doc_set.apply(sent_detector.tokenize)
sentences = itertools.chain.from_iterable(sentenized.tolist())  # just to flatten
result = []
for sent in sentences:
    result += [nltk.word_tokenize(sent)]  # word-tokenize each sentence
model = gensim.models.Word2Vec(result)
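Once training finishes you can sanity-check the model, for example by asking for nearest neighbours by cosine similarity ('government' below is just a placeholder; substitute any word frequent enough in your articles to be in the vocabulary):

print(model.most_similar('government'))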
Unfortunately I am not good enough with Pandas to perform all the operations in a "pandastic" way.
Pay close attention to the parameters of Word2Vec; picking them right can make a huge difference.
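For instance, a call with the main knobs spelled out might look like this (the values are only illustrative starting points, not recommendations; size, window, min_count, workers and sg are the parameter names in gensim versions of this era):

model = gensim.models.Word2Vec(
    result,
    size=100,      # dimensionality of the word vectors
    window=5,      # max distance between target and context word
    min_count=5,   # ignore words occurring fewer times than this
    workers=4,     # parallel training threads
    sg=0,          # 0 = CBOW, 1 = skip-gram
)

min_count matters in particular: words below the threshold are dropped entirely, and rare words that survive tend to get noisy vectors.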