Tags: python, pandas, object, accessor

Is there a pandas accessor for the underlying object in each cell?


In the codebase I have pandas objects (pd.DataFrame / pd.Series) that contain custom objects.

It would simplify the codebase significantly if I could call a method or property from the underlying objects without resorting to .apply.

To illustrate the point, consider a pandas Series of "Car" objects.

class Car:
    ...
    def max_speed(self) -> float:
        ...

x = pd.Series([car1, car2, car3])

Currently I could get the average car speed by doing:

x.apply(lambda x: x.max_speed()).mean()

I think it'd be nice if I could skip the .apply(lambda x: x...) and replace it with something like:

x.obj.max_speed().mean()

where obj would be my custom accessor.

To further illustrate the point, consider a second class, Plane:

class Plane:
    def cruise_height(self) -> float:
        ...

In the codebase I have:

x1 = pd.Series([car1, car2, car3])
x2 = pd.Series([plane1, plane2, plane3])

and I could get the average car speed / plane cruise height with

x1.apply(lambda x: x.max_speed()).mean()
x2.apply(lambda x: x.cruise_height()).mean()

I think it'd be more readable if I could do:

x1.obj.max_speed().mean()
x2.obj.cruise_height().mean()

I imagine this would be similar to how the .str accessor exposes the underlying string methods:

pd.Series(['Hello', 'World']).str.get(0) # returns ['H', 'W']
pd.Series(['Hello', 'World']).str.upper()
# etc

Solution

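  • As described in the pandas documentation, you can register a custom Series accessor with the pd.api.extensions.register_series_accessor decorator. The accessor below calls each method defined on the elements' class once and stores the results as a Series attribute, so the attribute is read without parentheses: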

    import pandas as pd
    
    @pd.api.extensions.register_series_accessor("spec")
    class SpecAccessor:
        def __init__(self, pandas_obj: pd.Series):
            self._obj = pandas_obj
            if self._obj.empty:
                return
            # Assumes all elements are instances of the same class
            cls = type(self._obj.iloc[0])
            for attr, value in vars(cls).items():
                # Call each zero-argument method on every element and store
                # the results as a Series attribute on the accessor
                if callable(value) and not attr.startswith("__"):
                    ser = self._obj.apply(lambda o: getattr(o, attr)())
                    setattr(self, attr, ser)
    

    So that with the following classes and instances:

    class Car:
        def __init__(self, speed: float):
            self._speed = speed
    
        def max_speed(self) -> float:
            return self._speed * 1.5
    
    class Plane:
        def __init__(self, max_height: float):
            self._max_height = max_height
    
        def cruise_height(self) -> float:
            return self._max_height * 0.6
    
    car1 = Car(10.0)
    car2 = Car(30.5)
    car3 = Car(50.9)
    
    plane1 = Plane(5_000.0)
    plane2 = Plane(3_000.5)
    plane3 = Plane(9_000.9)
    

    You can do:

    print(pd.Series([car1, car2, car3]).spec.max_speed)
    # Outputs
    0    15.00
    1    45.75
    2    76.35
    dtype: float64
    
    print(pd.Series([plane1, plane2, plane3]).spec.cruise_height)
    # Outputs
    0    3000.00
    1    1800.30
    2    5400.54
    dtype: float64
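
The accessor above computes every method eagerly and exposes the results as attributes. To get closer to the x.obj.max_speed().mean() syntax from the question, one option is to forward method calls lazily through __getattr__. This is only a minimal sketch: the accessor name "obj", the class name ObjAccessor, and the helper call_on_each are made up for illustration, and it assumes every element of the Series has the requested method.

```python
import pandas as pd


@pd.api.extensions.register_series_accessor("obj")
class ObjAccessor:
    """Hypothetical accessor that forwards method calls to each element."""

    def __init__(self, pandas_obj: pd.Series):
        self._obj = pandas_obj

    def __getattr__(self, name: str):
        # Only reached for names that are not real attributes of the accessor.
        # Private names are rejected so a missing _obj cannot recurse.
        if name.startswith("_"):
            raise AttributeError(name)

        def call_on_each(*args, **kwargs) -> pd.Series:
            # Call the named method on every element; the index is preserved.
            return self._obj.apply(lambda o: getattr(o, name)(*args, **kwargs))

        return call_on_each


class Car:
    def __init__(self, speed: float):
        self._speed = speed

    def max_speed(self) -> float:
        return self._speed * 1.5


cars = pd.Series([Car(10.0), Car(30.5), Car(50.9)], index=["a", "b", "c"])
speeds = cars.obj.max_speed()  # a float Series indexed a, b, c
mean_speed = speeds.mean()
```

Under the hood this is still an element-wise .apply, so it changes readability rather than performance. Nothing is computed until a method is actually called, and arguments are passed through to each element's method.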