Tags: python, pandas, dataframe, sample

Why does my new column not get assigned after using the .sample method?


So I was just answering a question and I came across something interesting:

The dataframe looks like this:

  string1 string2
0     abc     def
1     ghi     jkl
2     mno     pqr
3     stu     vwx

So when I do the following, the assigning of new columns works:

df['string3'] = df.string2

print(df)

  string1 string2 string3
0     abc     def     def
1     ghi     jkl     jkl
2     mno     pqr     pqr
3     stu     vwx     vwx

But when I use pandas.Series.sample, the new column does not get assigned, at least not in the sampled order:

df['string4'] = df.string2.sample(len(df.string2))
print(df)
  string1 string2 string3 string4
0     abc     def     def     def
1     ghi     jkl     jkl     jkl
2     mno     pqr     pqr     pqr
3     stu     vwx     vwx     vwx

So I tested some things:

Test 1: Using sample without assignment gives the correct output:

df.string2.sample(len(df.string2))

2    pqr
1    jkl
0    def
3    vwx
Name: string2, dtype: object

Test 2: I cannot overwrite the existing column either:

df['string2'] = df.string2.sample(len(df.string2))
print(df)
  string1 string2
0     abc     def
1     ghi     jkl
2     mno     pqr
3     stu     vwx

This works but why?

df['string2'] = df.string2.sample(len(df.string2)).values
print(df)
  string1 string2
0     abc     jkl
1     ghi     def
2     mno     vwx
3     stu     pqr

Why do I need to explicitly use .values or .tolist() to get the assignment right?


Solution

  • pandas is index sensitive: when you assign a Series, pandas aligns it with the DataFrame on index labels, not on position. sample only reorders the rows; every value keeps its original label, so after alignment each value lands back in its original row and the column looks unchanged (just as it would after sort_index). A NumPy array or a list carries no index, so the values are assigned by position, which yields the shuffled output.
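A minimal sketch of the difference, rebuilding the frame from the question. The names `shuffled`, `aligned` and `positional` are only for illustration, and `random_state=0` is an assumption added to make the run repeatable:

```python
import pandas as pd

df = pd.DataFrame({
    "string1": ["abc", "ghi", "mno", "stu"],
    "string2": ["def", "jkl", "pqr", "vwx"],
})

# sample() reorders the rows, but each value keeps its original index label
shuffled = df["string2"].sample(frac=1, random_state=0)

# Assigning a Series: values are matched to rows by label,
# so the shuffle is undone
df["aligned"] = shuffled

# Assigning a plain array: values are written by position,
# so the shuffled order is kept
df["positional"] = shuffled.to_numpy()

print(df)
```

The `aligned` column should come out identical to `string2`, while `positional` keeps whatever order `sample` produced.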

    An edge-case example: when none of the Series labels exist in the DataFrame index, every row becomes NaN.

    df['string3']=pd.Series(['aaa','aaa','aaa','aaa'],index=[100,111,112,113])
    df
    Out[462]: 
      string1 string2 string3
    0     abc     vwx     NaN
    1     ghi     jkl     NaN
    2     mno     dfe     NaN
    3     stu     pqr     NaN
    

    Index alignment is also why conditional assignment with .loc works. You can always do

    df.loc[df.condition,'value']=df.value*100 
    # rows that are not selected are left unchanged
    

    This is equivalent to what you would do with np.where:

    df['value']=np.where(df.condition,df.value*100 ,df.value)
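To check that the two forms agree, here is a small sketch on a made-up frame. The `demo` data and the names `via_loc` and `via_where` are assumptions for illustration, not part of the original question:

```python
import numpy as np
import pandas as pd

# Hypothetical data with a boolean 'condition' column and a numeric 'value' column
demo = pd.DataFrame({"condition": [True, False, True], "value": [1, 2, 3]})

# .loc: the right-hand side is a full-length Series, but only the rows
# selected by the mask receive a value, matched by index label
via_loc = demo.copy()
via_loc.loc[via_loc["condition"], "value"] = via_loc["value"] * 100

# np.where: builds the whole column positionally, choosing per row
via_where = demo.copy()
via_where["value"] = np.where(
    via_where["condition"], via_where["value"] * 100, via_where["value"]
)

print(via_loc["value"].tolist())    # [100, 2, 300]
print(via_where["value"].tolist())  # [100, 2, 300]
```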
    

    Another common case: using groupby.apply with a non-aggregating function and trying to assign the result back. Why does it fail?

    df['String4']=df.groupby('string1').apply(lambda x :x['string2']+'aa')

    TypeError: incompatible index of inserted column with frame index

    Let us look at what groupby.apply returns:

    df.groupby('string1').apply(lambda x : x['string2']+'aa')
    Out[466]: 
    string1   
    abc      0    vwxaa
    ghi      1    jklaa
    mno      2    dfeaa
    stu      3    pqraa
    Name: string2, dtype: object
    

    Notice that it adds one more level to the index, so the result has a MultiIndex, while the original df has only a single-level index. That mismatch causes the error message.


    How to fix it?


    Drop the extra first level so that only the original index (the second level of the groupby result) remains, then assign it back:

    df['String4']=df.groupby('string1').apply(lambda x : x['string2']+'aa').reset_index(level=0,drop=True)
    df
    Out[469]: 
      string1 string2 string3 String4
    0     abc     vwx     NaN   vwxaa
    1     ghi     jkl     NaN   jklaa
    2     mno     dfe     NaN   dfeaa
    3     stu     pqr     NaN   pqraa
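The level-dropping step can be tried in isolation. In this sketch the two-level result is built by hand as a stand-in for the groupby.apply output, so it does not depend on how a particular pandas version labels that output. The `transform` line at the end is a different technique, not the one used above, shown only as a possible way to avoid the extra level in the first place:

```python
import pandas as pd

df = pd.DataFrame({
    "string1": ["abc", "ghi", "mno", "stu"],
    "string2": ["def", "jkl", "pqr", "vwx"],
})

# Hand-built stand-in for the groupby.apply result:
# outer level = group key, inner level = original row label
two_level = pd.Series(
    ["defaa", "jklaa", "pqraa", "vwxaa"],
    index=pd.MultiIndex.from_arrays([df["string1"], df.index]),
    name="string2",
)

# Drop the outer level so only the original row labels remain
flat = two_level.reset_index(level=0, drop=True)
df["String4"] = flat  # aligns on the original labels

# Alternative (assumption: a row-wise operation per group is all that is needed):
# transform returns a result indexed like the original frame
df["String5"] = df.groupby("string1")["string2"].transform(lambda s: s + "aa")

print(df)
```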
    

    As Erfan mentioned in the comments: how can we avoid accidentally assigning unwanted values to a pandas.DataFrame?

    There are two different ways of assigning:

    1st, with an array, list, or tuple: these CANNOT ALIGN, so the assignment fails when the length of the object differs from the length of the df.

    2nd, with a pandas object: this ALWAYS aligns, and no error is raised even when the lengths differ (labels missing from the object become NaN).

    However, when the assigned object has a duplicated index, it raises an error:

    df['string3']=pd.Series(['aaa','aaa','aaa','aaa'],index=[100,100,100,100])
    ValueError: cannot reindex from a duplicate axis
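A short sketch of both behaviours, followed by one possible guard. `assign_strict` is a hypothetical helper written for this answer, not a pandas function; it simply refuses a Series whose index differs from the frame's index, including a merely reordered one:

```python
import pandas as pd

df = pd.DataFrame({"string1": ["abc", "ghi"], "string2": ["def", "jkl"]})

# 1) A list cannot align: a length mismatch raises ValueError
try:
    df["from_list"] = ["x"]
    list_error = None
except ValueError as exc:
    list_error = exc

# 2) A Series always aligns: the missing label 0 silently becomes NaN
df["from_series"] = pd.Series(["x"], index=[1])


def assign_strict(frame, column, values):
    """Hypothetical guard: reject a Series whose index is not identical
    (same labels, same order) to the frame's index."""
    if isinstance(values, pd.Series) and not values.index.equals(frame.index):
        raise ValueError("index of values does not match the frame index")
    frame[column] = values


# A reordered Series is rejected instead of being silently re-aligned
try:
    assign_strict(df, "reordered", df["string2"].iloc[::-1])
    strict_error = None
except ValueError as exc:
    strict_error = exc

print(df)
```

Whether such a guard is worth having depends on the code base; passing `.to_numpy()` explicitly is the simpler choice when positional assignment is what you want.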