Search code examples
pythonpandasdataframedatetimeindex

Unable to slice pandas dataframe (with date as key) using date as string


I'm generating an empty dataframe with a series of dates as the index. Data will be added to the dataframe at a later point.

cbd=pd.date_range(start=pd.datetime(2017,01,02),end=pd.datetime(2017,01,30),period=1)

df = pd.DataFrame(data=None,columns=['Test1','Test2'],index=cbd)

df.head()
           Test1 Test2
2017-01-02   NaN   NaN
2017-01-03   NaN   NaN
2017-01-04   NaN   NaN
2017-01-05   NaN   NaN
2017-01-06   NaN   NaN

A few slicing methods don't seem to work. The following returns a KeyError:

df['2017-01-02']

However any of the following work:

df['2017-01-02':'2017-01-02']
df.loc['2017-01-02']

What am I missing here? Why doesn't the first slice return a result?


Solution

  • Dual behavior of [] in df[]

    • When you don't use : inside [], then the value(s) inside it will be considered as column(s).
    • And when you use : inside [], then the value(s) inside it will be considered as row(s).

    Why the dual nature?

    Because most of the time people want to slice the rows instead of slicing the columns.

    So they decided that x and y in df[x:y] should correspond to rows,

    and x in d[x] or x, y in df[[x,y]] should correspond to column(s).

    Example:

    df = pd.DataFrame(data = [[1,2,3], [1,2,3], [1,2,3]],
                                     index = ['A','B','C'], columns = ['A','B','C'])
    print df
    

    Output:

       A  B  C
    A  1  2  3
    B  1  2  3
    C  1  2  3
    

    Now when you do df['B'], it can mean 2 things:

    • Take the 2nd index B and give you the 2nd row 1 2 3

                       OR
      
    • Take the 2nd column B and give you the 2nd column 2 2 2.

    So in order to resolve this conflict and keep it unambiguous df['B'] will always mean that you want the column 'B', if there is no such column then it will throw an Error.

    Why does df['2017-01-02'] fails?

    It will search for a column '2017-01-02', Because there is no such column, it throws an error.

    Why does df.loc['2017-01-02'] works then?

    Because .loc[] has syntax of df.loc[row,column] and you can leave out the column if you will, as in your case, it simply means df.loc[row]