Search code examples
pythonpandasdataframeminstring-length

Comparing a dataframe on string lengths for different columns


I am trying to get the string lengths for different columns. Seems quite straightforward with:

df['a'].str.len()

But I need to apply it to multiple columns. And then get the minimum on it.

Something like:

df[['a','b','c']].str.len().min

I know the above doesn't work, but hopefully you get the idea. Column a, b, c all contain names and I want to retrieve the shortest name.

Also because of huge data, I am avoiding creating other columns to save on size.


Solution

  • I think you need list comprehension, because string function works only with Series (column):

    print ([df[col].str.len().min() for col in ['a','b','c']])
    

    Another solution with apply:

    print ([df[col].apply(len).min() for col in ['a','b','c']])
    

    Sample:

    df = pd.DataFrame({'a':['h','gg','yyy'],
                       'b':['st','dsws','sw'],
                       'c':['fffff','','rr'],
                       'd':[1,3,5]})
    
    print (df)
    
         a     b      c  d
    0    h    st  fffff  1
    1   gg  dsws         3
    2  yyy    sw     rr  5
    
    print ([df[col].str.len().min() for col in ['a','b','c']])
    [1, 2, 0]
    

    Timings:

    #[3000 rows x 4 columns]
    df = pd.concat([df]*1000).reset_index(drop=True)
    
    In [17]: %timeit ([df[col].apply(len).min() for col in ['a','b','c']])
    100 loops, best of 3: 2.63 ms per loop
    
    In [18]: %timeit ([df[col].str.len().min() for col in ['a','b','c']])
    The slowest run took 4.12 times longer than the fastest. This could mean that an intermediate result is being cached.
    100 loops, best of 3: 2.88 ms per loop
    

    Conclusion:

    apply is faster, but not works with None.

    df = pd.DataFrame({'a':['h','gg','yyy'],
                       'b':[None,'dsws','sw'],
                       'c':['fffff','','rr'],
                       'd':[1,3,5]})
    
    
    print (df)
         a     b      c  d
    0    h  None  fffff  1
    1   gg  dsws         3
    2  yyy    sw     rr  5
    
    print ([df[col].apply(len).min() for col in ['a','b','c']])
    

    TypeError: object of type 'NoneType' has no len()

    print ([df[col].str.len().min() for col in ['a','b','c']])
    [1, 2.0, 0]
    

    EDIT by comment:

    #fail with None
    print (df[['a','b','c']].applymap(len).min(axis=1))
    0    1
    1    0
    2    2
    dtype: int64
    

    #working with None
    print (df[['a','b','c']].apply(lambda x: x.str.len().min(), axis=1))
    0    1
    1    0
    2    2
    dtype: int64