I am following: https://medium.com/@anateresa.mdneto/starbucks-capstone-project-79f84b2a1558 Data is from the links in the article.
Data is initially loaded with this code:
file1 = "e:\\python\\pandas\\datasets\\Starbucks\\portfolio.json"
portfolio = pd.read_json(file1, orient='records', lines=True)
file2 = "e:\\python\\pandas\\datasets\\Starbucks\\profile.json"
profile = pd.read_json(file2, orient='records', lines=True)
file3 = "e:\\python\\pandas\\datasets\\Starbucks\\transcript.json"
transcript = pd.read_json(file3, orient='records', lines=True)
I have a df (transcript) that one columns values are dicts. Most dicts are a single key:value pair, but some dicts have two key:value pairs.
I start by extracting\exploding the column of dicts so that I get a new column for each of the keys.
There are four unique keys and I get four new columns.
I also print the head() and check the notnull() counts:
transcript_cp = transcript.copy(deep=True)
transcript_cp = transcript_cp.join(pd.DataFrame(transcript_cp.pop('value').values.tolist()))
print(transcript_cp.head(), '\n')
print(transcript_cp['offer id'].notnull().sum())
print(transcript_cp['amount'].notnull().sum())
print(transcript_cp['offer_id'].notnull().sum())
print(transcript_cp['reward'].notnull().sum(), '\n')
Output:
person event time offer id amount offer_id reward
0 78afa995795e4d85b5d9ceeca43f5fef offer received 0 9b98b8c7a33c4b65b9aebfe6a799e6d9 NaN NaN NaN
1 a03223e636434f42ac4c3df47e8bac43 offer received 0 0b1e1539f2cc45b7b9fa7c272da2e1d7 NaN NaN NaN
2 e2127556f4f64592b11af22de27a7932 offer received 0 2906b810c7d4411798c6938adc9daaa5 NaN NaN NaN
3 8ec6ce2a7e7949b1bf142def7d0e0586 offer received 0 fafdcd668e3743c1bb461111dcafc2a4 NaN NaN NaN
4 68617ca6246f4fbc85e91a2a49552598 offer received 0 4d5c57ea9a6940dd891ad53e9dbe8da0 NaN NaN NaN
134002
138953
33579
33579
'offer id' and 'offer_id' are really the same thing. There is a typo in the name, so I want\need to combine these two columns into a single column.
For this to work correctly The following assumptions must hold:
Here is how I prove my assumptions are valid:
df1 = transcript_cp.isna()
df2 = pd.crosstab(df1['offer id'], df1['offer_id'])
print(df2)
offer_id False True
offer id
False 0 134002
True 33579 138953
False\False == not null\not null There are zero instances of both columns being non null for any given single row.
False\True == not null\null There are 134002 instances where 'offer id' is not null but 'offer_id' is.
True\False == null\not null There are 33579 instances where 'offer id' is null but 'offer_id' is not.
True\True == null\null There are 138953 instances where both are null.
To make a new row that has the values from 'offer id' and 'offer_id' combined I am using np.where.
transcript_cp['TEMP'] = np.where(transcript_cp['offer_id'] != np.nan, transcript_cp['offer_id'], transcript_cp['offer id'])
However, my total for non null values never adds up to 134002 + 33579 = 167581.
Using the above np.where code I get 33579.
If I flip my 'offer id' and 'offer_id' (see below), which to me should make no difference whatsoever, I get 134002.
transcript_cp['TEMP'] = np.where(transcript_cp['offer id'] != np.nan, transcript_cp['offer id'], transcript_cp['offer_id'])
Am I not using np.where correctly? I thought it reads: if condtion True, result1 else result2.
So I am saying if the checked column is not null, return that value, otherwise return the value in the other column.
Based upon my crosstab results I believe I should be getting 167581 non null values, and it should not make any difference in which order I perform the np.where.
np.nan != np.nan
is evaluated to True
. So there are differences between the two commands (what happens when offer id
is nan
?).
Why don't you just use fillna
:
transcript_cp['offer id'].fillna(transcript_cp['offer_id'])