I am new to Stack Overflow and Python, so please bear with me. I am trying to run a Latent Dirichlet Allocation (LDA) on a text corpus with the gensim package in Python, using the PyCharm editor. I prepared the corpus in R and exported it to a CSV file with this R command:
write.csv(testdf, "C://...//test.csv", fileEncoding = "utf-8")
This creates the following CSV structure (though with much longer, already preprocessed texts):
1,"1960-01-01","id_1","Newspaper1","Test text one"
2,"1960-01-02","id_2","Newspaper1","Another text"
3,"1960-01-03","id_3","Newspaper1","Yet another text"
4,"1960-01-04","id_4","Newspaper2","Four Five Six"
5,"1960-01-05","id_5","Newspaper2","Alpha Bravo Charly"
6,"1960-01-06","id_6","Newspaper2","Singing Dancing Laughing"
I then run the following stripped-down Python code (based on the gensim tutorials) to perform a simple LDA:
import gensim
from gensim import corpora, models, similarities, parsing
import pandas as pd
from six import iteritems
import os
import pyLDAvis.gensim
class MyCorpus(object):
    def __iter__(self):
        for row in pd.read_csv('//mpifg.local/dfs/home/lu/Meine Daten/Imagined Futures and Greek State Bonds/Topic Modelling/Python/test.csv', index_col=False, header=0, encoding='utf-8')['text']:
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(row.split())

if __name__ == '__main__':
    dictionary = corpora.Dictionary(row.split() for row in pd.read_csv(
        '//.../test.csv', index_col=False, encoding='utf-8')['text'])
    dictionary.save('//.../greekdict.dict')  # store the dictionary, for future reference
    ## create an mmCorpus
    corpora.MmCorpus.serialize('//.../greekcorpus.mm', MyCorpus())
    # load the stored dictionary and corpus
    dictionary = corpora.Dictionary.load('//.../greekdict.dict')
    corpus = corpora.MmCorpus('//.../greekcorpus.mm')
    # train model
    lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=50, iterations=1000)
I get the following output and the script exits:
...\Python\venv\lib\site-packages\setuptools-28.8.0-py3.6.egg\pkg_resources_vendor\pyparsing.py:832: DeprecationWarning: invalid escape sequence \d
\...\Python\venv\lib\site-packages\setuptools-28.8.0-py3.6.egg\pkg_resources_vendor\pyparsing.py:2736: DeprecationWarning: invalid escape sequence \d
\...\Python\venv\lib\site-packages\setuptools-28.8.0-py3.6.egg\pkg_resources_vendor\pyparsing.py:2914: DeprecationWarning: invalid escape sequence \g
\...\Python\venv\lib\site-packages\pyLDAvis_prepare.py:387: DeprecationWarning: .ix is deprecated. Please use .loc for label based indexing or .iloc for positional indexing
I cannot find a solution and, to be honest, I have no idea where exactly the problem comes from. I spent hours making sure that the CSV is encoded as UTF-8 and that it is exported (from R) and imported (in Python) correctly.
What am I doing wrong, or where else could I look? Cheers!
A DeprecationWarning is exactly that: a warning that a feature is deprecated, meant to prompt you to switch to some other functionality so that your code stays compatible in the future. So in your case I would just watch for updates to the libraries you use.
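To illustrate that such a warning is only a message and not an error, here is a minimal sketch using the standard `warnings` module. The function `old_api` is hypothetical and merely stands in for a deprecated library call; it is not part of gensim, pandas or pyLDAvis.

```python
import warnings


def old_api():
    """Hypothetical stand-in for a deprecated library function."""
    warnings.warn("old_api is deprecated", DeprecationWarning, stacklevel=2)
    return 42


# Record warnings instead of printing them, so we can inspect them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_api()  # the call still completes and returns a value

print(result, [w.category.__name__ for w in caught])

# The same call with DeprecationWarning filtered out: nothing is recorded.
with warnings.catch_warnings(record=True) as silenced:
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    quiet_result = old_api()

print(quiet_result, len(silenced))
```

Filtering only hides the messages. If the script really stops, the cause is probably somewhere other than these warnings.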
Starting with the last warning: it looks like it originates from pandas and has been logged against pyLDAvis.
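Since this depends on which versions you have installed, it may help to print them first. This is only a sketch: it assumes Python 3.8 or newer (for `importlib.metadata`), and `installed_version` is a helper name I made up for illustration.

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Packages that appear in the question and in the warning paths.
packages = ["pandas", "pyLDAvis", "gensim", "setuptools"]
versions = {name: installed_version(name) for name in packages}

for name, version in versions.items():
    print(f"{name}: {version or 'not installed'}")
```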
The remaining ones come from the pyparsing module, but it does not seem that you are importing it explicitly. Maybe one of the libraries you use depends on it and relies on some relatively old, deprecated functionality; judging by the paths in your output, it is the copy bundled inside setuptools 28.8.0. To get rid of the warnings, I would start by checking whether upgrading helps. Good luck!
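For background on what "invalid escape sequence \d" means: it is raised when a normal (non-raw) string literal contains a backslash sequence that Python does not recognise, which is common in regular expressions. The sketch below reproduces it with the standard library only; `compile_and_collect` is a helper name I made up, and the exact warning category depends on the Python version (DeprecationWarning on 3.6, SyntaxWarning on recent releases).

```python
import re
import warnings

# Source text with a plain string literal containing \d (triggers the warning).
source_with_plain_string = "pattern = '\\d+'"
# The same literal written as a raw string, which is the usual fix.
source_with_raw_string = "pattern = r'\\d+'"


def compile_and_collect(source):
    """Compile and run source; return its namespace and any warning messages."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        code = compile(source, "<example>", "exec")
    namespace = {}
    exec(code, namespace)
    return namespace, [str(w.message) for w in caught]


plain_ns, plain_warnings = compile_and_collect(source_with_plain_string)
raw_ns, raw_warnings = compile_and_collect(source_with_raw_string)

print(plain_warnings)  # contains an "invalid escape sequence" message
print(raw_warnings)    # empty: the raw string compiles cleanly

# Both literals produce the same pattern, so behaviour is unchanged.
print(re.fullmatch(raw_ns["pattern"], "1960") is not None)
```

So the fix belongs in the library's source (raw strings), not in your script, which is why upgrading is the thing to try.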