I would like to extract all temperatures and temperature ranges, with or without spaces inside them. For now, I am able to get those without spaces using:
re.findall(r'[0-9°c-]+', text)
What would I need to add to the regex so that it also matches the ones that contain spaces? E.g., 50 space ° space C should be matched as a whole instead of as three pieces.
You may use
-?\d+(?:\.\d+)?\s*°\s*c(?:\s*-\s*-?\d+(?:\.\d+)?\s*°\s*c)?
See the regex demo.
The pattern consists of a -?\d+(?:\.\d+)?\s*°\s*c block that appears twice (the second occurrence sits inside an optional group and matches the range part). The block matches negative and fractional temperature values:
-? - an optional hyphen
\d+ - 1+ digits
(?:\.\d+)? - an optional fractional part
\s* - 0+ whitespaces
° - the degree symbol
\s* - 0+ whitespaces
c - a c char.
The (?:\s*-\s*<ABOVE_BLOCK>)? part matches 1 or 0 repetitions of a hyphen enclosed with 0+ whitespaces, followed by the same block as described above.
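As a minimal sketch of the breakdown above, the same pattern can be written with re.VERBOSE so that each piece carries its own comment. The sample string and the TEMP_RANGE name are assumptions made for illustration only.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE: whitespace inside the
# pattern is ignored and everything after "#" on a line is a comment.
TEMP_RANGE = re.compile(r"""
    -?\d+(?:\.\d+)?      # optional hyphen, digits, optional fractional part
    \s*°\s*c             # degree symbol and c, each after optional whitespace
    (?:                  # optional range part:
        \s*-\s*          #   a hyphen enclosed with optional whitespace
        -?\d+(?:\.\d+)?  #   the second number
        \s*°\s*c         #   its degree symbol and c
    )?
""", re.VERBOSE)

sample = 'between 30° c - 50 ° c, then 2°c and -4.5 °c'  # hypothetical input
matches = TEMP_RANGE.findall(sample)
print(matches)
# => ['30° c - 50 ° c', '2°c', '-4.5 °c']
```

Since the pattern contains only non-capturing groups, findall returns the whole matched substrings.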
In Python, it makes sense to build the pattern dynamically:
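import re
tb = r'-?\d+(?:\.\d+)?\s*°\s*c'        # one temperature block
rx = r'{0}(?:\s*-\s*{0})?'.format(tb)  # block, then an optional "- block" range part
results = re.findall(rx, s)            # s is the input string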
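Note that the question's example writes the unit as an uppercase C, while the pattern contains a lowercase c. One possible way to accept both, assuming that is what is wanted, is to pass the re.IGNORECASE flag. The input string below is made up for illustration.

```python
import re

tb = r'-?\d+(?:\.\d+)?\s*°\s*c'
rx = r'{0}(?:\s*-\s*{0})?'.format(tb)

s = 'Keep it at 50 ° C, or within 30°C - 40 °c'  # hypothetical input

# Without the flag only the lowercase c is accepted.
case_sensitive = re.findall(rx, s)
# With re.IGNORECASE both c and C are accepted.
case_insensitive = re.findall(rx, s, flags=re.IGNORECASE)

print(case_sensitive)
# => ['40 °c']
print(case_insensitive)
# => ['50 ° C', '30°C - 40 °c']
```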
If c is optional, replace \s*c with (?:\s*c)?.
If ° and c are optional, replace \s*°\s*c with (?:\s*°\s*c)? or (?:\s*°(?:\s*c)?)?.
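The two replacements for an optional ° and c behave differently when only the degree symbol is present. The sketch below compares them on a made-up string; the variable names are chosen for illustration only.

```python
import re

number = r'-?\d+(?:\.\d+)?'
# ° and c are matched together or not at all.
both_or_neither = number + r'(?:\s*°\s*c)?'
# ° may appear alone; c is only matched right after °.
c_needs_degree = number + r'(?:\s*°(?:\s*c)?)?'

s = '30° c, 40°, 50'  # hypothetical input
first = re.findall(both_or_neither, s)
second = re.findall(c_needs_degree, s)

print(first)
# => ['30° c', '40', '50']
print(second)
# => ['30° c', '40°', '50']
```

With the first variant the lone ° after 40 is dropped from the match, while the nested variant keeps it.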
Here is the temperature block pattern where the degree symbol and the c char are both optional but keep the same order as before (c can only follow °):
tb = r'-?\d+(?:\.\d+)?(?:\s*°(?:\s*c)?)?'
Full Python demo code:
import re
s = 'This is some temperature 30° c - 50 ° c 2°c 34.5 °c 30°c - 40 °c and "30° - 40, and -45.5° - -56.5° range'
tb = r'-?\d+(?:\.\d+)?(?:\s*°(?:\s*c)?)?'
rx = r'{0}(?:\s*-\s*{0})?'.format(tb)
results = re.findall(rx, s)
print(results)
# => ['30° c - 50 ° c', '2°c', '34.5 °c', '30°c - 40 °c', '30° - 40', '-45.5° - -56.5°']
If the degree symbol may be missing while c may still be present, move the grouping boundary so that each one gets its own optional group:
tb = r'-?\d+(?:\.\d+)?(?:\s*°)?(?:\s*c)?'
                      ^-------^^-------^
See this regex demo and the full Python code demo:
import re
s = 'This is some temperature 30° c - 50 ° c 2°c 34.5 °c 30°c - 40 °c and "30° - 40, and -45.5° - -56.5° range 30c - 50 °c" or 30c - 40'
tb = r'-?\d+(?:\.\d+)?(?:\s*°)?(?:\s*c)?'
rx = r'{0}(?:\s*-\s*{0})?'.format(tb)
results = re.findall(rx, s)
print(results)
Output:
['30° c - 50 ° c', '2°c', '34.5 °c', '30°c - 40 °c', '30° - 40', '-45.5° - -56.5°', '30c - 50 °c', '30c - 40']
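One caveat with the independent (?:\s*c)? group: it also accepts a c that merely starts another word, so 30 cm yields 30 c. A possible guard, which is an assumption and not part of the pattern above, is a word boundary \b after c. The input string is hypothetical.

```python
import re

s = 'at 30 cm or 30c - 40'  # hypothetical input

# Pattern from above: c is accepted even when it begins another word.
tb = r'-?\d+(?:\.\d+)?(?:\s*°)?(?:\s*c)?'
rx = r'{0}(?:\s*-\s*{0})?'.format(tb)
loose = re.findall(rx, s)

# Assumed tweak: \b requires that no letter, digit or underscore follows c.
tb_guarded = r'-?\d+(?:\.\d+)?(?:\s*°)?(?:\s*c\b)?'
rx_guarded = r'{0}(?:\s*-\s*{0})?'.format(tb_guarded)
guarded = re.findall(rx_guarded, s)

print(loose)
# => ['30 c', '30c - 40']
print(guarded)
# => ['30', '30c - 40']
```

The bare 30 is still returned because every unit part of the block is optional, so plain numbers match as well.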