
RegEx for matching temperatures (°c)


I would like to match all temperatures and temperature ranges, whether or not they contain spaces. So far, I can match the ones without spaces using:

re.findall(r'[0-9°c-]+', text)

What would I need to add to the regex so that it also matches the ones containing spaces? E.g., 50 ° C (50, space, °, space, C) should be matched as a whole instead of as three pieces.


Solution

  • You may use

    -?\d+(?:\.\d+)?\s*°\s*c(?:\s*-\s*-?\d+(?:\.\d+)?\s*°\s*c)?
    

    See the regex demo.

    The pattern consists of a -?\d+(?:\.\d+)?\s*°\s*c block that appears twice: once for the first value and once more, optionally, for the range part. The block also matches negative and fractional temperature values:

    • -? - an optional hyphen
    • \d+ - 1+ digits
    • (?:\.\d+)? - an optional fractional part
    • \s* - 0+ whitespaces
    • ° - the degree symbol
    • \s* - 0+ whitespaces
    • c - a literal lowercase c character.

    The (?:\s*-\s*<ABOVE_BLOCK>)? part is optional (it matches 1 or 0 times): a hyphen surrounded by 0+ whitespaces, followed by the same block as described above.

    In Python, it makes sense to build the pattern dynamically:

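    import re

    tb = r'-?\d+(?:\.\d+)?\s*°\s*c'        # one temperature block
    rx = r'{0}(?:\s*-\s*{0})?'.format(tb)  # the block plus an optional "- block" range part
    results = re.findall(rx, s)            # s is the input string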
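
    The bullet-by-bullet breakdown above can also be kept next to the pattern itself. The following is a minimal sketch, assuming the strict pattern (° and c both required); the names TEMP, RANGE and sample are made up for illustration. With re.VERBOSE, whitespace and # comments inside the pattern are ignored:

```python
import re

# One temperature block, annotated part by part (TEMP is an illustrative name).
TEMP = r'''
    -?            # optional hyphen (minus sign)
    \d+           # 1+ digits
    (?:\.\d+)?    # optional fractional part
    \s*°          # 0+ whitespaces, then the degree symbol
    \s*c          # 0+ whitespaces, then a lowercase c
'''

# The block, optionally followed by " - " and the same block again.
RANGE = r'{0}(?:\s*-\s*{0})?'.format(TEMP)
rx = re.compile(RANGE, re.VERBOSE)

sample = '30° c - 50 ° c, then 2°c and -4.5 °c'  # made-up test string
results = rx.findall(sample)
print(results)
# => ['30° c - 50 ° c', '2°c', '-4.5 °c']
```

    Since every group is non-capturing, findall returns the whole matches rather than group contents.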
    

    If c is optional, replace \s*c with (?:\s*c)?.

    If both ° and c are optional, replace \s*°\s*c with (?:\s*°\s*c)? (° and c appear together or not at all) or with (?:\s*°(?:\s*c)?)? (° may also appear without c).
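
    To see how these replacements differ, here is a small sketch that checks each variant against three made-up inputs with re.fullmatch. The dictionary labels and the samples are illustrative only:

```python
import re

num = r'-?\d+(?:\.\d+)?'  # shared number part of the temperature block

# The three ways of making the unit optional that are described above.
variants = {
    'c optional': num + r'\s*°(?:\s*c)?',
    '° c optional as a unit': num + r'(?:\s*°\s*c)?',
    '° optional, c only after °': num + r'(?:\s*°(?:\s*c)?)?',
}
samples = ['30 ° c', '30 °', '30']

# For each variant, record which samples it matches in full.
table = {name: [bool(re.fullmatch(pat, text)) for text in samples]
         for name, pat in variants.items()}
for name, row in table.items():
    print(name, row)
# => c optional [True, True, False]
# => ° c optional as a unit [True, False, True]
# => ° optional, c only after ° [True, True, True]
```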

    Here is the temperature block pattern where the degree symbol and the c character are both optional but, when present, keep the same order as before (c can only follow °):

    tb = r'-?\d+(?:\.\d+)?(?:\s*°(?:\s*c)?)?'
    

    Full Python demo code:

    import re
    s = 'This is some temperature 30° c - 50 ° c  2°c  34.5 °c 30°c - 40 °c and "30° - 40, and -45.5° - -56.5° range' 
    tb = r'-?\d+(?:\.\d+)?(?:\s*°(?:\s*c)?)?'
    rx = r'{0}(?:\s*-\s*{0})?'.format(tb)
    results = re.findall(rx, s)
    print(results)
    # => ['30° c - 50 ° c', '2°c', '34.5 °c', '30°c - 40 °c', '30° - 40', '-45.5° - -56.5°']
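
    The question's example uses an uppercase C, while the patterns here contain a lowercase c only. One way to accept both, sketched here with a made-up test string, is to compile the pattern with re.IGNORECASE (re.I):

```python
import re

tb = r'-?\d+(?:\.\d+)?(?:\s*°(?:\s*c)?)?'
# re.IGNORECASE lets the literal c in the pattern match both c and C.
rx = re.compile(r'{0}(?:\s*-\s*{0})?'.format(tb), re.IGNORECASE)

s = 'Store at 50 ° C, ship at 2°C - 8 °c'  # made-up test string
results = rx.findall(s)
print(results)
# => ['50 ° C', '2°C - 8 °c']
```

    Replacing c with [cC] in the block would have the same effect without a flag.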
    

    If the degree symbol may be missing while c is still present, split the nested group into two independent optional groups:

    tb = r'-?\d+(?:\.\d+)?(?:\s*°)?(?:\s*c)?'
                          ^-------^^-------^
    

    See this regex demo and the full Python code demo:

    import re
    s = 'This is some temperature 30° c - 50 ° c  2°c  34.5 °c 30°c - 40 °c and "30° - 40, and -45.5° - -56.5° range 30c - 50 °c" or 30c - 40' 
    tb = r'-?\d+(?:\.\d+)?(?:\s*°)?(?:\s*c)?'
    rx = r'{0}(?:\s*-\s*{0})?'.format(tb)
    results = re.findall(rx, s)
    print(results)
    

    Output:

    ['30° c - 50 ° c', '2°c', '34.5 °c', '30°c - 40 °c', '30° - 40', '-45.5° - -56.5°', '30c - 50 °c', '30c - 40']
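
    One thing to keep in mind with this loosest variant: because ° and c are independently optional, a number followed by any word starting with c also matches up to that c. The sketch below shows the effect on a made-up string. The \b guard is a hypothetical tweak, not part of the patterns above, and whether it is wanted depends on the data:

```python
import re

loose = r'-?\d+(?:\.\d+)?(?:\s*°)?(?:\s*c)?'
# Hypothetical tweak: \b requires the c to end a word, so "cats" no longer counts.
guarded = r'-?\d+(?:\.\d+)?(?:\s*°)?(?:\s*c\b)?'

text = 'fed 30 cats at 30c'  # made-up test string
loose_hits = re.findall(loose, text)
guarded_hits = re.findall(guarded, text)
print(loose_hits)    # => ['30 c', '30c']
print(guarded_hits)  # => ['30', '30c']
```

    Both versions still match bare numbers, since every unit part is optional.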