Tags: python, numpy, pandas, pymongo, blaze

Insert into MongoDB returns cannot encode object


I'm doing a rather simple insert into a local MongoDB, sourced from a Python pandas DataFrame. Essentially I'm calling dataframe.loc[n].to_dict() and getting my dictionary directly from the df. All is well until I attempt the insert, where I get a 'Cannot encode object' error. Looking at the dict directly, everything looked good, but then (while writing this question) it dawned on me to check each type in the dict, and I found that a long ID number had been converted to a numpy.int64 instead of a plain int (when I created the dict manually with an int, it inserted fine).
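A minimal sketch of that type check, assuming NumPy is installed. The row below is built by hand to mimic the kind of dict described above, and the ID value is made up for illustration:

```python
import numpy as np

# Hypothetical row, built by hand to mimic the dict described above;
# the ID value is made up.
row = {
    "Screen Name": "XXXXXXXXXX",
    "Followers": 13,
    "Twitter ID": np.int64(1234567890),
}

# Collect every value that is a NumPy scalar rather than a built-in type.
suspect = {
    key: type(value)
    for key, value in row.items()
    if isinstance(value, np.generic)
}

for key, value_type in suspect.items():
    print(key, value_type)
```

Any key that shows up in `suspect` holds a NumPy scalar, which is the kind of value the insert rejected here.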

So, I was unable to find anything in the pandas documentation about arguments to to_dict that would let me override this behavior. While there are brute-force ways to fix this, there must be a more elegant way to sort it out without resorting to that sort of thing.

The question, then, is how to convert a row of a DataFrame to a dict for insertion into MongoDB while ensuring I am using only acceptable content types ... OR, can I back up further here and use a simpler approach to get each row of a DataFrame to be a document within Mongo?

Thanks

As requested, here is an addendum to the post with a sample of the data I am using.

{'Account Created': 'about 3 hours ago',
 'Followers': 13,
 'Following': 499,
 'Screen Name': 'XXXXXXXXXX',
 'Status': 'Alive',
 'Tweets': 12,
 'Twitter ID': 0000000000L}

This is directly from the to_dict output that faulted on insert. I copied it into a 'test' dict and that worked perfectly fine. If I print out the values of each of the dicts, I get the following...

to_dict = ['Alive', 'a_aheref77', 'about 3 hours ago', 12, 13, 499, 0000000000L, ObjectId('551bd8cfae89e9370851aa64')]

test = ['Alive', 'XXXXXXXX', 'about 3 hours ago', 499, 13, 12, 0000000000, ObjectId('551bd6fdae89e9370851aa63')]

The only difference (as far as I can tell) is the long int. Interestingly enough, when I did the Mongo insert with the 'test' dict, that field shows up as 'Number Long' within the document. Hope this helps clarify things.


Solution

  • Take a look at the odo library, in particular its MongoDB docs. Pandas isn't likely to grow any kind of to_mongo method in the near future, so odo is where this sort of functionality should go. Here's an example with a simple DataFrame:

    In [13]: import pandas as pd
    
    In [14]: from odo import odo
    
    In [15]: df = pd.DataFrame({'a': [1, 2, 3], 'b': list('abc')})
    
    In [17]: m = odo(df, 'mongodb://localhost/db::t')
    
    In [18]: list(m.find())
    Out[18]:
    [{u'_id': ObjectId('551bfb20362e696200d568d9'), u'a': 1, u'b': u'a'},
     {u'_id': ObjectId('551bfb20362e696200d568da'), u'a': 2, u'b': u'b'},
     {u'_id': ObjectId('551bfb20362e696200d568db'), u'a': 3, u'b': u'c'}]
    

    You can get the required deps and odo by doing

    conda install odo pymongo --channel blaze
    

    or

    pip install odo
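If adding odo is not an option, one alternative is to convert NumPy scalars to built-in Python types before handing the rows to pymongo. This is only a sketch and not part of the answer above: the helper names `to_native` and `frame_to_documents` are hypothetical, the sample data is made up, and the pymongo call is left as a comment because it assumes a running local server and a collection named `t`.

```python
import numpy as np
import pandas as pd


def to_native(value):
    """Return a built-in Python object for NumPy scalars; pass others through."""
    if isinstance(value, np.generic):
        return value.item()
    return value


def frame_to_documents(df):
    """Build one dict per row, with NumPy scalars converted to native types."""
    return [
        {column: to_native(value) for column, value in zip(df.columns, row)}
        for row in df.itertuples(index=False, name=None)
    ]


# Made-up sample data with column names similar to the question's.
df = pd.DataFrame(
    {
        "Screen Name": ["user_a", "user_b"],
        "Followers": [13, 7],
        "Twitter ID": [1234567890123, 9876543210987],
    }
)

documents = frame_to_documents(df)
print(documents)

# Assumed usage with pymongo (needs a running MongoDB, so it is not executed here):
# from pymongo import MongoClient
# MongoClient("mongodb://localhost")["db"]["t"].insert_many(documents)
```

The conversion relies on the `item()` method of NumPy scalars, which returns the closest built-in Python type, so an `int64` ID becomes a plain `int`.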