python pandas dataframe variable-length

Total length of elements, and subsets, in a pandas dataframe


How can I count the total number of elements in each cell of a DataFrame column, including the elements of any nested sub-sequences, and put the result in a new column?

import pandas as pd
# Build the data first so its length is known before the index is created.
data = [[1, (2,5,6)], [2, (3,4)], [3, 4], [(5,6), (7,8,9)]]
x = pd.Series(data, index=range(1, len(data)+1))
df = pd.DataFrame({'A': x})

I tried the following code, but it gives 2 in every row because len only counts the top-level items:

df['Length'] = df['A'].apply(len)

print(df)

                         A  Length
    1       [1, (2, 5, 6)]       2
    2          [2, (3, 4)]       2
    3               [3, 4]       2
    4  [(5, 6), (7, 8, 9)]       2

However, what I want to get is as follows:

                         A  Length
    1       [1, (2, 5, 6)]       4
    2          [2, (3, 4)]       3
    3               [3, 4]       2
    4  [(5, 6), (7, 8, 9)]       5

thanks


Solution

  • Given:

    import pandas as pd
    x = pd.Series([[1, (2,5,6)], [2, (3,4)], [3, 4], [(5,6), (7,8,9)]])
    df = pd.DataFrame({'A': x}) 
    

    You can write a recursive generator that will yield 1 for each nested element that is not iterable. Something along these lines:

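    try:
        from collections.abc import Iterable  # Python 3.3+
    except ImportError:
        from collections import Iterable      # Python 2

    def glen(LoS):
        def iselement(e):
            # Strings are iterable, but each one counts as a single element.
            return not (isinstance(e, Iterable) and not isinstance(e, str))
        for el in LoS:
            if iselement(el):
                yield 1
            else:
                # Recurse into the nested list or tuple.
                for sub in glen(el):
                    yield sub

    df['Length'] = df['A'].apply(lambda e: sum(glen(e)))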
    

    Yielding:

    >>> df
                         A  Length
    0       [1, (2, 5, 6)]       4
    1          [2, (3, 4)]       3
    2               [3, 4]       2
    3  [(5, 6), (7, 8, 9)]       5
    

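    That works in Python 2 and 3, provided Iterable is imported from collections.abc on Python 3.10 and later, where the collections.Iterable alias no longer exists. With Python 3.3 or later, you can use yield from to replace the inner loop: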

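    from collections.abc import Iterable

    def glen(LoS):
        def iselement(e):
            # Strings are iterable, but each one counts as a single element.
            return not (isinstance(e, Iterable) and not isinstance(e, str))
        for el in LoS:
            if iselement(el):
                yield 1
            else:
                # Delegate to the recursive generator for nested sequences.
                yield from glen(el)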
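
If a generator is not needed, the same count can be written as a plain recursive function. The following is a minimal sketch under that assumption, not part of the original answer: count_leaves is a hypothetical name, and it is run here on the question's rows as an ordinary list so that it can be checked without pandas.

```python
from collections.abc import Iterable


def count_leaves(obj):
    """Count the non-iterable leaves in an arbitrarily nested structure.

    count_leaves is a hypothetical helper name used for illustration.
    """
    # Strings and bytes are iterable, but treat each as one element.
    # Without this guard, a one-character string would recurse forever.
    if isinstance(obj, (str, bytes)) or not isinstance(obj, Iterable):
        return 1
    # Any other iterable: add up the leaf counts of its items.
    return sum(count_leaves(item) for item in obj)


# The same rows as column 'A' in the question.
rows = [[1, (2, 5, 6)], [2, (3, 4)], [3, 4], [(5, 6), (7, 8, 9)]]
lengths = [count_leaves(row) for row in rows]
print(lengths)  # [4, 3, 2, 5]
```

With the DataFrame from the question, df['Length'] = df['A'].apply(count_leaves) should fill the column with the same values, and no lambda or sum is needed because the function returns the total directly.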