I am working with a data set from Kaggle on NBA All-Stars (https://www.kaggle.com/fmejia21/nba-all-star-game-20002016) [link for anyone who wants to run it themselves]. The data set looks like this:
In [3]: df1.head(3)
Out[3]:
Year Player Pos ... Selection Type NBA Draft Status Nationality
0 2016 Stephen Curry G ... Western All-Star Fan Vote Selection 2009 Rnd 1 Pick 7 United States
1 2016 James Harden SG ... Western All-Star Fan Vote Selection 2009 Rnd 1 Pick 3 United States
2 2016 Kevin Durant SF ... Western All-Star Fan Vote Selection 2007 Rnd 1 Pick 2 United States
[3 rows x 9 columns]
What I am trying to do is grab the draft position under the 'NBA Draft Status' column and store it in another column, so I begin just by checking the split:
In [4]: df1['NBA Draft Status'].str.split(' ')
Out[4]:
0 [2009, Rnd, 1, Pick, 7]
1 [2009, Rnd, 1, Pick, 3]
So it seems simple enough: just grab the item at index 4 of the split. If it's a second-round pick, add 30 to that number. I use this:
In [5]: positions = []
   ...: for draft in df1['NBA Draft Status']:
   ...:     if 'Rnd 2' in draft:
   ...:         position = draft.split(' ')[4]
   ...:         position = int(position) + 30
   ...:         positions.append(position)
   ...:     else:
   ...:         position = draft.split(' ')[4]
   ...:         position = int(position)
   ...:         positions.append(position)
and it throws an index error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-5-0946ed392ea2> in <module>
6 positions.append(position)
7 else:
----> 8 position = draft.split(' ')[4]
9 position = int(position)
10 positions.append(position)
IndexError: list index out of range
So here is my question: why is it out of range? While investigating, I found that I can print the item at this index, but for whatever reason I can't append it to an empty list. This works:
In [6]: for draft in df1['NBA Draft Status']:
   ...:     print(draft.split(' ')[4])
   ...:     break
   ...:
7
Can someone explain to me what is going on? I know this is rather wordy but I didn't know how else to convey the problem without giving some backdrop to the data set.
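The problem is that some values in df1['NBA Draft Status'] contain only 3 spaces, so calling .split() on them returns a list that is 4 items long. With 0-based indexing the last valid index is 3, so [4] raises the IndexError. Your print test did not hit this because the break stops the loop after the first row, which is a normal 5-item value.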
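As a quick illustration, here is the same split applied to two values copied from the column, one drafted and one undrafted. The variable names are only for this sketch:

```python
# Two values taken from the 'NBA Draft Status' column.
drafted = "2009 Rnd 1 Pick 7"
undrafted = "1996 NBA Draft, Undrafted"

drafted_parts = drafted.split(' ')
undrafted_parts = undrafted.split(' ')

print(drafted_parts)         # ['2009', 'Rnd', '1', 'Pick', '7']
print(undrafted_parts)       # ['1996', 'NBA', 'Draft,', 'Undrafted']
print(len(undrafted_parts))  # 4, so valid indexes are 0..3 and [4] fails
```

The first list has an item at index 4, the second does not.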
You can find those rows by counting the items each value splits into:
df1['length'] = df1['NBA Draft Status'].apply(lambda draft: len(draft.split()))
df2 = df1.loc[df1.length == 4, :]
df2['NBA Draft Status']
Out[74]:
309 1996 NBA Draft, Undrafted
334 1996 NBA Draft, Undrafted
346 1998 NBA Draft, Undrafted
348 1996 NBA Draft, Undrafted
360 1996 NBA Draft, Undrafted
371 1998 NBA Draft, Undrafted
Name: NBA Draft Status, dtype: object
Drop them (these are the undrafted players) with df1 = df1.loc[df1.length == 5, :] and then rerun your code. It should work.
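If you would rather keep the undrafted players in the data set, one alternative is to parse the round and pick with a regular expression and return a missing value when the pattern is absent. This is only a sketch: the function name draft_position and the picks_per_round parameter are made up here, the default of 30 simply mirrors the "+30" from the question, and the "Rnd 2" sample string is invented for the example.

```python
import re

# Matches values such as "2009 Rnd 1 Pick 7"; undrafted rows will not match.
DRAFT_PATTERN = re.compile(r"Rnd (\d+) Pick (\d+)")


def draft_position(status, picks_per_round=30):
    """Return the overall pick number, or None when no round/pick is present.

    picks_per_round=30 is an assumption carried over from the question.
    """
    match = DRAFT_PATTERN.search(status)
    if match is None:
        return None
    round_number = int(match.group(1))
    pick = int(match.group(2))
    # Round 1 adds nothing, round 2 adds one full round of picks.
    return (round_number - 1) * picks_per_round + pick


print(draft_position("2009 Rnd 1 Pick 7"))          # 7
print(draft_position("2005 Rnd 2 Pick 10"))         # 40 (invented sample value)
print(draft_position("1996 NBA Draft, Undrafted"))  # None
```

On the DataFrame from the question this could be applied with something like df1['Draft Position'] = df1['NBA Draft Status'].map(draft_position), where 'Draft Position' is a hypothetical column name; undrafted rows would then hold a missing value instead of raising an error.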