I'm struggling with converting local JSON files into Parquet files. Each file should be converted with pandas and saved as its own Parquet file, so that I end up with the same number of files, just as Parquets.
I looped through my directory, collected a list of all existing JSON files, and put them into a pandas DataFrame.
    import os
    import pandas as pd

    path = 'trackingdata/'
    df = list()
    for root, dirs, files in os.walk(path, topdown=False):
        for name in files:
            df.append(os.path.join(root, name))
    df = pd.DataFrame(df)
Is it better to loop through the DataFrame now and transform each file with

    df.to_parquet('trackingdata.parquet')

or would it be better to put the transformation into the code above, inside the directory loop? And how can I transform each file to Parquet without joining them all together?
How about defining a json_to_parquet converter:
    def json_to_parquet(filepath):
        df = pd.read_json(filepath, typ='series').to_frame("name")
        # splitext is safer than split(".") for paths containing other dots
        parquet_file = os.path.splitext(filepath)[0] + ".parquet"
        df.to_parquet(parquet_file)
Depending on how your JSON is formatted, you may need to change the read_json line and/or use the tips here.
Then just process each file one at a time:
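For example, if each file held a list of records rather than a single object, typ='series' would not apply and the default orient would be the right call. A small runnable sketch (the file name and sample data are made up for illustration):

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical sample: a records-style JSON file, i.e. a list of objects.
records = [{"id": 1, "x": 0.5}, {"id": 2, "x": 1.5}]

tmpdir = tempfile.mkdtemp()
json_path = os.path.join(tmpdir, "sample.json")
with open(json_path, "w") as f:
    json.dump(records, f)

# With a list of records, the default read_json turns each
# object into one row of the DataFrame.
df = pd.read_json(json_path)
print(df.shape)  # (2, 2)
```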
    path = 'trackingdata/'
    for root, dirs, files in os.walk(path, topdown=False):
        for name in files:
            json_to_parquet(os.path.join(root, name))