python regex data-cleaning

How can I clean a string containing a price range joined by a dash (–) using regex in Python?


I want to clean a price-range string such as 'GBP 10,000,000 – GBP 15,000,000': remove the currency code GBP and replace the dash (–) with a comma (,) using regex in Python. The output I want is (10000000,15000000).

This is what I tried: re.sub('[GBP,/s-]','', text), which produces the output ' 10000000 – 15000000'. I would also like to get rid of the leading and trailing whitespace and replace the dash with a comma, so that the result is the tuple (10000000,15000000).


Solution

  • Using re.sub with a callback function we can try:

    import re

    inp = "GBP 10,000,000 – GBP 15,000,000"
    output = re.sub(r'[A-Z]{3} (\d{1,3}(?:,\d{3})*) – [A-Z]{3} (\d{1,3}(?:,\d{3})*)', lambda m: '(' + m.group(1).replace(',', '') + ',' + m.group(2).replace(',', '') + ')', inp)
    print(output)  # (10000000,15000000)
    

    If you want an actual list/tuple of matches, then I suggest using re.findall:

    inp = "GBP 10,000,000 – GBP 15,000,000"
    output = [x.replace(',', '') for x in re.findall(r'[A-Z]{3} (\d{1,3}(?:,\d{3})*)', inp)]
    print(output)  # ['10000000', '15000000']
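If you need a real tuple of integers rather than a string or a list of strings, the findall approach can be wrapped in a small helper. This is only a sketch: the name parse_price_range is made up for illustration, and it assumes every amount is preceded by a three-letter uppercase currency code and uses commas as thousands separators.

```python
import re

def parse_price_range(text):
    """Hypothetical helper: return the amounts found in `text` as a tuple of ints."""
    # Capture the comma-grouped number that follows a three-letter currency code.
    amounts = re.findall(r'[A-Z]{3}\s*(\d{1,3}(?:,\d{3})*)', text)
    # Strip the thousands separators and convert each amount to an int.
    return tuple(int(a.replace(',', '')) for a in amounts)

inp = "GBP 10,000,000 – GBP 15,000,000"
print(parse_price_range(inp))  # (10000000, 15000000)
```

Because the pattern only looks at the currency code and the number, it does not matter whether the separator between the two prices is a hyphen (-) or an en dash (–).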