python, regex, pandas, extract

Python Regex Extract Width x Depth x Height


I am trying to extract the physical dimensions of items from a "Description" column in a DataFrame and store them in a new column.

Dimensions usually appear in this format (120x80x100) in the middle of long descriptions like:

Lorem ipsum dolor sit amet, consectetur adipiscing elit 120x80x100 ed do eiusmod tempor...

But sometimes they have spaces in between:

120 x 80 x 100

Or they don't have a height:

120x80
120 x 80

Any help? Thanks in advance


Solution

  • You can use the regex \d+\s*x\s*\d+(?:\s*x\s*\d+)?

    Explanation:

    • \d+: One or more digits
    • \s*: Zero or more whitespace characters
    • x: The literal character x
    • (?:\s*x\s*\d+)?: Optional non-capturing group that matches the third dimension (height) when it is present
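
    As a quick check, the pattern above can be tried with Python's built-in re module. This is a minimal sketch: the sample strings are modelled on the formats in the question, and the names DIMENSIONS and find_dimensions are made up for illustration.

```python
import re

# The pattern from the answer: width x depth, with an optional x height.
DIMENSIONS = re.compile(r"\d+\s*x\s*\d+(?:\s*x\s*\d+)?")

# Sample strings modelled on the formats listed in the question.
samples = [
    "Lorem ipsum dolor sit amet 120x80x100 ed do eiusmod tempor",
    "Lorem ipsum 120 x 80 x 100 dolor",
    "Lorem ipsum 120x80 dolor",
    "Lorem ipsum 120 x 80 dolor",
    "no dimensions here",
]


def find_dimensions(text):
    """Return the first dimension string in text, or None if there is none."""
    match = DIMENSIONS.search(text)
    return match.group(0) if match else None


results = [find_dimensions(s) for s in samples]
print(results)
```

    Using search rather than match matters here, because the dimensions sit in the middle of the description and match only looks at the start of the string.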

    If you want to limit each number to one to three digits, replace \d+ with \d{1,3}, which gives the regex \d{1,3}\s*x\s*\d{1,3}(?:\s*x\s*\d{1,3})?
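
    One thing worth checking with the \d{1,3} version is how it behaves next to longer numbers: on its own it can still match the tail of a four-digit number. The sketch below compares it with a variant that adds the lookarounds (?<!\d) and (?!\d). That variant is an assumption on my part, not part of the regex above, and the sample text is made up.

```python
import re

# The bounded pattern from the answer, unchanged.
bounded = re.compile(r"\d{1,3}\s*x\s*\d{1,3}(?:\s*x\s*\d{1,3})?")

# Assumed variant: lookarounds reject a digit directly before or after the match.
strict = re.compile(r"(?<!\d)\d{1,3}\s*x\s*\d{1,3}(?:\s*x\s*\d{1,3})?(?!\d)")

text = "pallet 1200x80 cm"  # made-up example with a four-digit width

loose_match = bounded.search(text)  # matches the last three digits of 1200
strict_match = strict.search(text)  # finds nothing in this text

print(loose_match.group(0) if loose_match else None)
print(strict_match)
```

    Whether a partial match like this is acceptable depends on the data, so treat the stricter pattern as an option rather than a requirement.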

    If your code needs a capturing group (for example, pandas' Series.str.extract requires at least one), wrap the whole pattern in parentheses:

    (\d{1,3}\s*x\s*\d{1,3}(?:\s*x\s*\d{1,3})?)
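
    Applied to the DataFrame from the question, the grouped pattern could be used roughly as follows. This is a sketch under assumptions: the sample rows and the new column name "Dimensions" are hypothetical, and only the "Description" column name comes from the question.

```python
import pandas as pd

# Hypothetical data; only the "Description" column name comes from the question.
df = pd.DataFrame({
    "Description": [
        "Lorem ipsum dolor sit amet 120x80x100 ed do eiusmod tempor",
        "Lorem ipsum 120 x 80 dolor",
        "no dimensions here",
    ]
})

# str.extract needs a capturing group, hence the outer parentheses.
pattern = r"(\d{1,3}\s*x\s*\d{1,3}(?:\s*x\s*\d{1,3})?)"

# expand=False returns a Series for a single group; rows without a match become NaN.
df["Dimensions"] = df["Description"].str.extract(pattern, expand=False)

print(df["Dimensions"])
```

    str.extract keeps only the first match in each row. If a description can contain several sets of dimensions, Series.str.findall with the same pattern returns all of them as a list.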