Tags: python, pandas, missing-data

Filling in missing values with Pandas


Link: CSV with missing values

I am trying to figure out the best way to fill in the missing 'region_cd' and 'model_cd' values in my CSV file with Pandas. The 'RevenueProduced' field tells you what the right value is for either missing field. My idea is to query my dataframe for all the rows that share the same 'region_cd' and 'RevenueProduced' and make their 'model_cd' values match (and vice versa for the missing 'region_cd').

import io  # needed for io.StringIO below

import pandas as pd
import requests as r

#variables needed for ease of file access
url = 'http://drd.ba.ttu.edu/isqs3358/hw/hw2/'
file_1 = 'powergeneration.csv'



res = r.get(url + file_1)
res.status_code
df = pd.read_csv(io.StringIO(res.text), delimiter=',')

There are likely many ways to solve this, but I am just starting out with Pandas and I am stumped, to say the least. Any help would be awesome.


Solution

  • This assumes that each RevenueProduced value maps to exactly one region_cd and one model_cd.

    Take a look at the groupby pandas function. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

    You could do the following:

    # create mask to grab only regions with values
    mask = df['region_cd'].notna()
    
    # group by `RevenueProduced`, take the first `region_cd` in each group and reset the index
    region_df = df[mask].groupby('RevenueProduced')["region_cd"].first().reset_index()
    
    # check out the built-in zip function to understand what's happening here
    region_map = dict(zip(region_df.RevenueProduced, region_df.region_cd))
    
    # store data in new column, although you could overwrite "region_cd"
    df.loc[:, 'region_cd_NEW'] = df["RevenueProduced"].map(region_map)
    
    

    Repeat the same process for model_cd. I haven't run this code, since I don't have access to your CSV at the time of writing, but I hope it helps.

    Here is the documentation for .map series method. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html

    (Keep in mind that a Series is just a single column of a DataFrame.)
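
The steps above can be combined into one small helper that fills only the missing entries and leaves existing values alone. This is a minimal sketch, not tested against the real powergeneration.csv: the sample DataFrame and its values are made up for illustration, and the helper name fill_from_revenue is hypothetical. It also checks the assumption that each RevenueProduced value maps to a single code before filling.

```python
import pandas as pd

# Hypothetical sample standing in for powergeneration.csv (values are invented).
df = pd.DataFrame({
    "RevenueProduced": [100, 100, 200, 200, 300],
    "region_cd": ["N", None, "S", "S", None],
    "model_cd": ["A1", "A1", None, "B2", None],
})


def fill_from_revenue(frame, target, key="RevenueProduced"):
    """Return frame[target] with gaps filled from rows sharing the same key."""
    # Only rows where the target is known can contribute to the lookup.
    known = frame.dropna(subset=[target])

    # Verify the assumption: each key maps to at most one distinct target value.
    distinct = known.groupby(key)[target].nunique()
    ambiguous = distinct[distinct > 1]
    if not ambiguous.empty:
        raise ValueError(f"{key} values with several {target}s: {list(ambiguous.index)}")

    # Series indexed by key, holding the known target value for that key.
    lookup = known.groupby(key)[target].first()

    # map() looks each key up in `lookup`; fillna() only touches missing entries.
    return frame[target].fillna(frame[key].map(lookup))


for col in ["region_cd", "model_cd"]:
    df[col] = fill_from_revenue(df, col)

print(df)
```

Rows whose RevenueProduced never appears alongside a known code (the last row in this sample) stay missing, since there is nothing to infer them from.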