OK, I have been beating my head against the wall with this one all afternoon. I know that there are many similar posts, but I keep getting errors and am probably making a stupid mistake.
I am using the apyori package, found here, to do some transaction basket analysis: https://pypi.python.org/pypi/apyori/1.1.1
It appears that the package's dump_as_json() method spits out dictionaries of RelationRecords for each possible basket.
I want to combine these JSON-formatted dictionaries into one pandas dataframe, but have had fits with different errors when attempting to use pd.read_json().
Here is my code:
import apyori, shutil, os
from apyori import apriori
from apyori import dump_as_json
import pandas as pd
import json
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO
transactions = [
    ['Jersey','Magnet'],
    ['T-Shirt','Cap'],
    ['Magnet','T-Shirt'],
    ['Jersey', 'Pin'],
    ['T-Shirt','Cap']
]
results = list(apriori(transactions))
results_df = pd.DataFrame()
for RelationRecord in results:
    dump_as_json(RelationRecord,output_file)
print output_file.getvalue()
json_file = json.dumps(output_file.getvalue())
print json_file
print data_df.head()
Any ideas how to get the JSON-formatted dictionaries stored in output_file into a pandas dataframe?
I would suggest reading up on StackOverflow's guidelines on producing a Minimal, Complete, and Verifiable example. Also, statements like "I keep getting errors" are not very helpful without the actual error messages. That said, I took a look at your code and the source code for the apyori package. Typos aside, it looks like the problem is here:
for RelationRecord in results:
    dump_as_json(RelationRecord,output_file)
You're writing every record into the same output, which creates a one-object-per-line JSON file (usually called JSON Lines or newline-delimited JSON). As a whole document, that isn't valid JSON, so parsing it in one go fails. Instead, you could parse each record separately and keep the results as a list of homogeneous dictionaries, or some other pd.DataFrame-friendly structure:
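output = []
for RelationRecord in results:
    # Use a fresh buffer per record so each one holds a single JSON object
    o = StringIO()
    dump_as_json(RelationRecord, o)
    # Parse that one object back into a dict
    output.append(json.loads(o.getvalue()))
# Build the dataframe from the list of dicts
data_df = pd.DataFrame(output)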
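To see why the shared buffer is the problem, here is a minimal sketch that does not need apyori at all (Python 3, standard library only). The records and their field names are made up for illustration and are not the output of a real run; the point is only that several JSON objects written one per line do not parse as a single document, while each line parses fine on its own.

```python
import io
import json

# Hypothetical records standing in for what dump_as_json() writes;
# the field names are illustrative, not taken from a real run.
records = [
    {"items": ["Cap"], "support": 0.4},
    {"items": ["Cap", "T-Shirt"], "support": 0.4},
]

# Write one JSON object per line into a single shared buffer,
# mimicking repeated dump calls on the same output_file.
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record) + "\n")

text = buffer.getvalue()

# Parsing the whole buffer as one JSON document fails.
try:
    json.loads(text)
    whole_document_ok = True
except json.JSONDecodeError:
    whole_document_ok = False

# Parsing each non-empty line separately succeeds.
parsed = [json.loads(line) for line in text.splitlines() if line.strip()]

print(whole_document_ok)
print(parsed == records)
```

If you would rather keep the single buffer, a reasonably recent pandas should also be able to read it directly with pd.read_json(io.StringIO(text), lines=True), which treats each line as a separate record.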