Tags: python, pandas, pint

Parsing units out of column


I've got some data that I'm reading into Python with Pandas, and I want to keep track of units with the Pint package. The values span a range of scales, so the units are mixed, e.g. lengths are mostly in meters but some are in centimeters.

For example the data:

what,length
foo,5.3 m
bar,72 cm

and I'd like to end up with the length column in some form that Pint understands. Pint's Pandas integration suggests that it only supports the whole column having the same datatype (and so the same unit), which seems reasonable. I'm happy with some arbitrary unit being picked (e.g. the first, the most common, or just the SI base unit) and everything expressed in terms of that.

I was expecting some convenient way of getting from the data I have to what Pint expects, but I don't see anything.

import pandas as pd
import pint_pandas

length = pd.Series(['5.3 m', "72 cm"], dtype='pint[m]')

This doesn't do the right thing at all. For example:

length * 2

outputs

0    5.3 m5.3 m
1    72 cm72 cm
dtype: pint[meter]

so it's just leaving the values as strings (multiplying by 2 simply repeats each string). Calling length.pint.convert_object_dtype() doesn't help either; everything stays as strings.


Solution

  • Going through the examples, it looks like pint_pandas is expecting numbers rather than strings. You can use apply to do the conversion:

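    from pint import UnitRegistry
    ureg = UnitRegistry()

    # df is the DataFrame read from the data above. Each string such as
    # "5.3 m" is parsed into a Quantity, then the column is converted to meters.
    df["length"].apply(lambda i: ureg(i)).astype("pint[m]")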
    

    However, why keep the column as Quantity objects instead of just plain float numbers?