Filling in missing values with Pandas

Link: CSV with missing Values

I am trying to figure out the best way to fill in the 'region_cd' and 'model_cd' fields in my CSV file with Pandas. The 'RevenueProduced' field can tell you what the right value is for either missing fields. My idea is to make some query in my dataframe that looks for all the fields that have the same 'region_cd' and 'RevenueProduced' and make all the 'model_cd' match (vice versa for the missing 'region_cd').

import pandas as pd
import requests as r

#variables needed for ease of file access
url = 'http://drd.ba.ttu.edu/isqs3358/hw/hw2/'
file_1 = 'powergeneration.csv'



res = r.get(url + file_1)
res.status_code
df = pd.read_csv(io.StringIO(res.text), delimiter=',')

There is likely many ways to solve this but I am just starting Pandas and I am stumped to say the least. Any help would be awesome.

Solution

Assuming that each RevenueProduced maps to exactly one region_cd and one model_cd.

Take a look at the groupby pandas function. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

You could do the following:

# create mask to grab only regions with values
mask = df['region_cd'].notna()

# group by region, collect the first `RevenueProduced` and reset the index
region_df = df[mask].groupby('RevenueProduced')["region_cd"].first().reset_index()

# checkout the built-in zip function to understand what's happening here
region_map = dict(zip(region_df.RevenueProduced, region_df.region_cd))

# store data in new column, although you could overwrite "region_cd"
df.loc[:, 'region_cd_NEW'] = df["RevenueProduced"].map(region_map)

You would do the exact same process with model_cd. I haven't run this code since at the time of writing this I don't have access to your csv, but I hope this helps.

Here is the documentation for .map series method. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html

(Keep in mind a series is just a column in a dataframe)