python pandas jupyter numeric custom-data-type

Ounces to Pounds in Python

I have a column of weights (there are actually 5k weights), a small batch looks like this:

weight
15.00 oz
19.00 oz
2 lb 15.00 oz 
1 lb 19.00 oz

What I wanted to do is to convert the whole weight to pounds, something like this:

weight
0.9375 lb
1.1875 lb
2.9375 lb
2.1875 lb

What shall I do to achieve this? What I have tried so far is:

df[['lbs','oz']] = df.Weight.str.split("lb",expand=True)

but this doesn't work, as for rows with no 'lb' unit, the code doesn't work. that looked like this:

pounds    ounces
 15.00    oz
 19.00    oz
 2lb      15.00oz
 1lb      19.00oz

Solution

This works, but there's almost certainly a neater 'more pandas' way to do it... This should be fast enough to process 5,000 values.

Imports:

import pandas as pd

Test data set-up (including the data with oz values after the .):

df = pd.DataFrame(["15.00 oz",
                   "19.00 oz",
                   "2 lb 15.00 oz",
                   "1 lb 19.00 oz",
                   "1 lb 12.80 oz",
                   "1 lb",
                   "nothing"],
                   columns=["weight"])

Produces:

          weight
0       15.00 oz
1       19.00 oz
2  2 lb 15.00 oz
3  1 lb 19.00 oz
4  1 lb 12.80 oz
5           1 lb
6        nothing

Define a function to map from the individual lb/oz values to a single lb value. This takes an array of tuples, which may be empty, such as: [(,'15.00')] or [] or [('1', '12.80')] (The 'numbers' in the matches are still of type str at this point):

def lbsFromMatchedNumbers(matchData):
    if len(matchData) == 0:
        return None
    (lbs, oz) = matchData[0]
    lbs = float(lbs or 0)
    oz = float(oz or 0)
    ounces_in_pound = 16.0
    return lbs + (oz / ounces_in_pound)

Find all the items in the 'weight' row, and then process them with the function, and assign to new 'lb' column:

matchPattern = "^(?:(\d+) lb ?)?(?:(\d+(?:.\d+)?) oz)?$"
df["lb"] = df["weight"].str.findall(matchPattern).apply(lbsFromMatchedNumbers)

Produces:

          weight      lb
0       15.00 oz  0.9375
1       19.00 oz  1.1875
2  2 lb 15.00 oz  2.9375
3  1 lb 19.00 oz  2.1875
4  1 lb 12.80 oz  1.8000
5           1 lb  1.0000
6        nothing     NaN

Note: This works if there are only lb or oz numbers, as shown in extra rows in the sample data I've used. If there's neither, it produces NaN.

Explanation of the regex part

We're using a regex ('regular expression') to match the portions of the 'weight' text content using this pattern: "^(?:(\d+) lb ?)?(?:(\d+(?:.\d+)?) oz)?$"

Regex syntax used

\d looks for a single 0-9 value
\d+ looks for one or more 0-9 values (e.g., 1 or 435245)
? (a , then a ?) looks for either a space, or nothing (the ? makes it optional)
(hello)? looks for 'hello', but continues anyway if it's not found (due to the ?
Brackets group items together
Brackets with followed by ?:, (?:like this), group items together, but don't save this as one of the 'matched groups'. In our example only the two numbers are returned, when they are matched (as they are enclosed in plain brackets)

Our particular example regex

Putting that together, this regex basically says:

From the very start of the string (marked by ^)
This part is optional:
- look for 1-or-more 0-9 digits - if you find this it is 'group 1'
- then a space
- then the text 'lb'
- then (optionally) a space
Followed by (also optional):
- This whole part is 'group 2':
  - 1-or-more 0-9 digits
  - this bit is optional:
    - a ., followed by 1-or-more 0-9 digits
- then a space
- then the text 'oz'
- then the end of the string (marked by $)