I have an unbalanced dataset, namely unbalanced.df
that looks as follows:
Date | ID | City | State | Quantity |
---|---|---|---|---|
2019-01 | 10001 | Los Angeles | CA | 500 |
2019-02 | 10001 | Los Angeles | CA | 995 |
2019-03 | 10001 | Los Angeles | CA | 943 |
2019-01 | 10002 | Houston | TX | 4330 |
2019-03 | 10002 | Houston | TX | 2340 |
2019-01 | 10003 | Sacramento | CA | 235 |
2019-02 | 10003 | Sacramento | CA | 239 |
2019-03 | 10003 | Sacramento | CA | 233 |
As you can see, Houston does not have 2019-02
as a Date. This happens all throughout my panel data with different cities.
I want to make this panel symmetric by adding NA rows on the missing dates, such that the new panel data looks like this, balanced.df
:
Date | ID | City | State | Quantity |
---|---|---|---|---|
2019-01 | 10001 | Los Angeles | CA | 500 |
2019-02 | 10001 | Los Angeles | CA | 995 |
2019-03 | 10001 | Los Angeles | CA | 943 |
2019-01 | 10002 | Houston | TX | 4330 |
2019-02 | 10002 | Houston | TX | NaN |
2019-03 | 10002 | Houston | TX | 2340 |
2019 -01 | 10003 | Sacramento | CA | 235 |
2019-02 | 10003 | Sacramento | CA | 239 |
2019-03 | 10003 | Sacramento | CA | 233 |
In this case, I have an absolute minimum date and absolute maximum date, so I want to make sure that all cities follow the same dates. How can I fill my panel with NaN rows for cities and have therefore the same number of rows for each ID, City and State?
One option, that offers an efficient abstraction, is with complete from pyjanitor to get missing rows for the combination of Date
vs the group of ('ID', 'City', 'State')
:
# pip install pyjanitor
import pandas as pd
import janitor
df.complete(('ID', 'City', 'State'), 'Date')
Date ID City State Quantity
0 2019-01 10001 Los Angeles CA 500.0
1 2019-02 10001 Los Angeles CA 995.0
2 2019-03 10001 Los Angeles CA 943.0
3 2019-01 10002 Houston TX 4330.0
4 2019-02 10002 Houston TX NaN
5 2019-03 10002 Houston TX 2340.0
6 2019-01 10003 Sacramento CA 235.0
7 2019-02 10003 Sacramento CA 239.0
8 2019-03 10003 Sacramento CA 233.0