
Transform JSON files in a directory into Parquet files with Python & pandas


I'm struggling to convert local JSON files into Parquet files. Each file should be converted with pandas and saved as its own Parquet file, so I end up with the same number of files, just in Parquet format.

I looped through my directory, collected a list of all my existing JSON files, and put that list into a pandas DataFrame.

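import os
import pandas as pd

path = 'trackingdata/'

# Collect the path of every file under `path`
df = list()
for root, dirs, files in os.walk(path, topdown=False):
   for name in files:
      df.append(os.path.join(root, name))
df = pd.DataFrame(df)  # one row per file path, not the file contents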

Is it better to loop through the DataFrame now and transform each file with

df.to_parquet('trackingdata.parquet')

or would it be better to put the conversion into the loop above, while walking the directory? And how can I convert each file to Parquet without joining them all together?


Solution

  • How about defining a json_to_parquet converter:

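    import os
    import pandas as pd

    def json_to_parquet(filepath):
        df = pd.read_json(filepath, typ='series').to_frame("name")
        # splitext strips only the final extension, so dots elsewhere in the path are safe
        parquet_file = os.path.splitext(filepath)[0] + ".parquet"
        df.to_parquet(parquet_file)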
    

    Depending on how your JSON is formatted, you may need to change the read_json line and/or use the tips here

    Then just process each file one at a time:

    path = 'trackingdata/'

    for root, dirs, files in os.walk(path, topdown=False):
        for name in files:
            # skip non-JSON files, e.g. .parquet output from an earlier run
            if name.endswith('.json'):
                json_to_parquet(os.path.join(root, name))