Search code examples
pythonpandasdataframeexport-to-csv

How to create csv files based on column value


I have a master dataset in pandas usa_df that looks like this

usa_df = {
    "state_name": ["Alabama", "Alabama", "Alaska", "Alaska", "Arkansas"],
    "persons": [1, 2, 1, 1, 1],
    "day_week": ["monday", "monday", "tuesday", "monday", "tuesday"]
}

There are about 200k rows in the dataset, and each state name appears for several thousand rows. Each row is a recorded crime, with the state name, persons, and day of the week being datapoints out of about a hundred).

I have a code that filters the state_name column for a specific state and then exports it as a csv

state_df = usa_df[usa_df.state_name == "Alabama"]
state_df.reset_index(drop=True)
os.chdir("/path/to/file")
state_df.to_csv("./Alabama.csv", index=False)

This method works fine, but I am trying to find a way that will automatically create a file for each state without having to do this fifty times.

Being new, I don't know how to proceed from here beyond maybe setting a list of states as a list[str], but beyond that I'm lost. Is it possible to automatically create all 50 files for each row that matches a specific state? Thanks for your help!


Solution

  • A quick and simple solution would be using a for loop over all unique state names. You can get them from your dataframe itself:

    import pandas as pd
    
    usa_df = {
        "state_name": ["Alabama", "Alabama", "Alaska", "Alaska", "Arkansas"],
        "persons": [1, 2, 1, 1, 1],
        "day_week": ["monday", "monday", "tuesday", "monday", "tuesday"]
    }
    
    usa_df = pd.DataFrame(usa_df)
    
    for current_name in sorted(set(usa_df.state_name)):
        state_df = usa_df[usa_df.state_name == current_name]
        state_df.to_csv(f"/path/to/file/{current_name}.csv", index=False)
    

    The more efficient solution, as pointed out in @mozway's comment, would be using grouping, so as to not filter the dataframe multiple times:

    import pandas as pd
    
    usa_df = {
        "state_name": ["Alabama", "Alabama", "Alaska", "Alaska", "Arkansas"],
        "persons": [1, 2, 1, 1, 1],
        "day_week": ["monday", "monday", "tuesday", "monday", "tuesday"]
    }
    
    usa_df = pd.DataFrame(usa_df)
    
    for current_name, current_group in usa_df.groupby("state_name"):
        current_group.to_csv(f"/path/to/file/{current_name}.csv", index=False)
    

    Also note:

    • You don't need to reset the index, as you save with index=False anyway.
    • Rather than changing your current working directory, you can give the full path to the to_csv() method.