I am sorry, I didn't really know how to word the title of this question. I do not work with Python very often, and I am just starting to work with the pandas and numpy packages.
I am getting unexpected results when trying to concatenate and append a pandas dataframe in a for loop.
I have a data set that I pulled from SQL and put into a pandas dataframe (df):
print(df.head())
date visitor visitor_score home home_score W L
0 20160405 BOS 6 CLE 2 94 67
1 20160406 BOS 6 CLE 7 94 67
2 20160408 BOS 8 TOR 7 89 73
3 20160409 BOS 8 TOR 4 89 73
4 20160410 BOS 0 TOR 3 89 73
I have another data set from SQL that I also put into a pandas dataframe (dfBostonStats):
print(dfBostonStats.head())
teamID ab h 2b 3b hr so sb ra er era IPouts HA \
0 BOS 5670 1598 343 25 208 1160 83 694 640 4.0 4319 1342
hra soa e fp bpf ppf dp
0 176 1362 75 0.987 108 106 139
I want to concatenate that data frame (dfBostonStats) to each row of the first data frame (df).
I determined that I could use pandas.concat, and I confirmed this by concatenating the first row of df:
print(pd.concat([df.iloc[[0]], dfBostonStats], axis=1))
date visitor visitor_score home home_score W L teamID ab \
0 20160405 BOS 6 CLE 2 94 67 BOS 5670
h ... era IPouts HA hra soa e fp bpf ppf dp
0 1598 ... 4.0 4319 1342 176 1362 75 0.987 108 106 139
I then tried to concatenate each row in a for loop, but it gives me an unexpected result. It concatenates the first row properly, but after that every row of df is paired with an extra row that contains only the values of the second dataframe (dfBostonStats):
for index, element in df.iterrows():
    tempdf = pd.concat([df.iloc[[index]], dfBostonStats], axis=1)
    concatDataFrame = concatDataFrame.append(tempdf, ignore_index=True)

print(concatDataFrame.head())
date visitor visitor_score home home_score W L teamID \
0 20160405 BOS 6.0 CLE 2.0 94.0 67.0 BOS
1 NaN NaN NaN NaN NaN NaN NaN BOS
2 20160406 BOS 6.0 CLE 7.0 94.0 67.0 NaN
3 NaN NaN NaN NaN NaN NaN NaN BOS
4 20160408 BOS 8.0 TOR 7.0 89.0 73.0 NaN
ab h ... era IPouts HA hra soa e fp \
0 5670.0 1598.0 ... 4.0 4319.0 1342.0 176.0 1362.0 75.0 0.987
1 5670.0 1598.0 ... 4.0 4319.0 1342.0 176.0 1362.0 75.0 0.987
2 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN
3 5670.0 1598.0 ... 4.0 4319.0 1342.0 176.0 1362.0 75.0 0.987
4 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN
bpf ppf dp
0 108.0 106.0 139
1 108.0 106.0 139
2 NaN NaN NaN
3 108.0 106.0 139
4 NaN NaN NaN
I cannot figure out why it produces rows containing only dfBostonStats rather than only the concatenated rows.
On a side note, I know a copy occurs inside the for loop on every iteration, which causes a performance hit, but I figured I would deal with that once the data looks how it should.
I think you need to join the first dataframe on column visitor and the second on column teamID, so use merge with a left join. No loop is necessary:
print (df)
date visitor visitor_score home home_score W L
0 20160405 BOS 6 CLE 2 94 67
1 20160406 BOS 6 CLE 7 94 67
2 20160408 AAA 8 TOR 7 89 73
3 20160409 AAA 8 TOR 4 89 73
4 20160410 AAA 0 TOR 3 89 73
print (dfBostonStats)
teamID ab h 2b 3b hr so sb ra er era IPouts HA \
0 BOS 5670 1598 343 25 208 1160 83 694 640 4.0 4319 1342
0 AAA 4 5 6 4 5 1160 83 694 640 4.0 4319 1342
hra soa e fp bpf ppf dp
0 176 1362 75 0.987 10 106 139
0 176 1362 75 0.987 10 106 139
df2 = df.merge(dfBostonStats, left_on='visitor', right_on='teamID', how='left')
print (df2)
date visitor visitor_score home home_score W L teamID ab \
0 20160405 BOS 6 CLE 2 94 67 BOS 5670
1 20160406 BOS 6 CLE 7 94 67 BOS 5670
2 20160408 AAA 8 TOR 7 89 73 AAA 4
3 20160409 AAA 8 TOR 4 89 73 AAA 4
4 20160410 AAA 0 TOR 3 89 73 AAA 4
h ... era IPouts HA hra soa e fp bpf ppf dp
0 1598 ... 4.0 4319 1342 176 1362 75 0.987 10 106 139
1 1598 ... 4.0 4319 1342 176 1362 75 0.987 10 106 139
2 5 ... 4.0 4319 1342 176 1362 75 0.987 10 106 139
3 5 ... 4.0 4319 1342 176 1362 75 0.987 10 106 139
4 5 ... 4.0 4319 1342 176 1362 75 0.987 10 106 139
[5 rows x 27 columns]
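As for why the loop produced the extra rows: as far as I can tell, pd.concat with axis=1 lines rows up by index label, not by position. df.iloc[[0]] has label 0, the same as the single row of dfBostonStats, so they land on one row. df.iloc[[1]] has label 1, which dfBostonStats does not have, so the outer join on the index gives two half-empty rows. Here is a minimal sketch of that; the small frames and the name stats are made-up stand-ins for illustration, not your real data. Also note that DataFrame.append was removed in pandas 2.0, so the loop would not run at all on a current install.

```python
import pandas as pd

# Tiny stand-ins for the frames in the question (values are illustrative only).
df = pd.DataFrame({
    "date": [20160405, 20160406],
    "visitor": ["BOS", "BOS"],
    "visitor_score": [6, 6],
})
stats = pd.DataFrame({"teamID": ["BOS"], "ab": [5670]})

# Row 0 of df carries index label 0, the same label as the only row of stats,
# so the two pieces align on a single row.
first = pd.concat([df.iloc[[0]], stats], axis=1)

# Row 1 of df carries index label 1, which stats does not have. The outer join
# on the index keeps both labels, so each resulting row is half NaN.
second = pd.concat([df.iloc[[1]], stats], axis=1)

# merge matches on column values instead of index labels, so every row of df
# picks up the stats of its visitor team.
merged = df.merge(stats, left_on="visitor", right_on="teamID", how="left")

print(first.shape, second.shape, merged.shape)
```

If you really wanted to keep concat, resetting the index of each one-row slice with reset_index(drop=True) before concatenating should make the labels match, but merge does the same job in one call.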