Search code examples
pythonjsonnumpypandasdump

Record Array to json.dumps


I need to generate a json from a Pandas DataFrame, but using df.to_json shows segmentation error, so I want to find another way to create the json and the only thing I got was to create a records array from the dataframe.

Now I need to create the json.dumps with the names of the files. Something like this

{ "id":123, "name":"myname"}

This is the code I've managed to create and this is my file (http://pastebin.com/iYewEfTg):

import pandas as pd
import json

columns = [u'SalesOrderID', u'OrderDate', u'DueDate', u'ShipDate', u'SalesOrderNumber', u'Title', u'FirstName', u'MiddleName', u'LastName', u'Suffix', u'PhoneNumber', u'PhoneNumberType', u'EmailAddress', u'EmailPromotion', u'AddressType', u'AddressLine1', u'AddressLine2', u'City', u'StateProvinceName', u'PostalCode', u'CountryRegionName', u'SubTotal', u'TaxAmt', u'Freight', u'TotalDue', u'UnitPrice', u'ProductName', u'ProductSubcategory', u'ProductCategory']
data = pd.read_csv('../Uploads/TxtDataSimplified.txt', header=0, names=columns, sep='\t')
data2 = data.to_records(index=0)   
arrayJSON = []
for r in data2:
    for c in columns:
        d=[]
        d[c] = r.__getattribute__(c)
    arrayJSON.append(d)

I need a JSON like this:

[  
{  
   'City':'Sooke',
   'FirstName':'Devin',
   'Title':nan,
   'LastName':'Phillips',
   'SubTotal':'189,97',
   'OrderDate':'2014-06-30 00:00:00.000',
   'AddressType':'Home',
   'PhoneNumberType':'Home',
   'TaxAmt':'15,1976',
   'AddressLine2':nan,
   'AddressLine1':'2742 Cincerto Circle',
   'DueDate':'2014-07-12 00:00:00.000',
   'TotalDue':'209,9169',
   'ShipDate':'2014-07-07 00:00:00.000',
   'StateProvinceName':'British Columbia',
   'MiddleName':nan,
   'ProductCategory':'Accessories',
   'PhoneNumber':'425-555-0163',
   'CountryRegionName':'Canada',
   'PostalCode':'V0',
   'SalesOrderNumber':'SO75123',
   'Suffix':nan,
   'ProductName':'All-Purpose Bike Stand',
   'SalesOrderID':75123,
   'EmailAddress':'[email protected]',
   'EmailPromotion':0,
   'Freight':'4,7493',
   'UnitPrice':'159',
   'ProductSubcategory':'Bike Stands'
},
{  
   'City':'Sooke',
   'FirstName':'Devin',
   'Title':nan,
   'LastName':'Phillips',
   'SubTotal':'189,97',
   'OrderDate':'2014-06-30 00:00:00.000',
   'AddressType':'Home',
   'PhoneNumberType':'Home',
   'TaxAmt':'15,1976',
   'AddressLine2':nan,
   'AddressLine1':'2742 Cincerto Circle',
   'DueDate':'2014-07-12 00:00:00.000',
   'TotalDue':'209,9169',
   'ShipDate':'2014-07-07 00:00:00.000',
   'StateProvinceName':'British Columbia',
   'MiddleName':nan,
   'ProductCategory':'Clothing',
   'PhoneNumber':'425-555-0163',
   'CountryRegionName':'Canada',
   'PostalCode':'V0',
   'SalesOrderNumber':'SO75123',
   'Suffix':nan,
   'ProductName':'AWC Logo Cap',
   'SalesOrderID':75123,
   'EmailAddress':'[email protected]',
   'EmailPromotion':0,
   'Freight':'4,7493',
   'UnitPrice':'8,99',
   'ProductSubcategory':'Caps'
}
]

And the error that I'm getting is:

Traceback (most recent call last):                                                                                                                                
  File "/home/ubuntu/workspace/python/tests2.py", line 11, in <module>                                                                                            
    d[c] = r.__getattribute__(c)                                                                                                                                  
TypeError: list indices must be integers, not unicode   

But I really appreciate a help with the final result, I've been changings from error to error, but unable to get to what I want. I need the JSON to insert it in MongoDB.


Solution

  • Like the error says, d is a list, which you are trying to index with unicode strings. You have to change this to a dictionary (d = {}).

    However, the output still wouldn't be what you'd expect. Instead you can do this:

    for r in data2:
        arrayJSON.append(dict(zip(columns, r.tolist())))
    

    or even this:

    arrayJSON = [dict(zip(columns, r.tolist())) for r in data2]
    

    tolist() will convert the record r to a normal list containing native python values. This is so that it can be serialized by json.dumps. json.dumps may still include values such as NaN though, which is not actually valid JSON. You can replace these values in your DataFrame by using: data.fillna(value="", inplace=True).

    This is what it all would look like:

    import pandas as pd
    import json
    
    columns = [...]
    data = pd.read_csv('../Uploads/TxtDataSimplified.txt', header=0, names=columns, sep='\t')
    data.fillna(value="", inplace=True)
    data2 = data.to_records(index=0)
    
    arrayJSON = [dict(zip(columns, r.tolist())) for r in data2]
    print(json.dumps(arrayJSON))