Right way to scrape this noisy price tag

Given a <div> containing a price with a lot of noise:

Price 1\u00a0500\u00a0000 EUR

and you want only the pure amount (1500000), what is the best way to implement this in Scrapy?

I tried combining a regex:

il.add_css('price', 'div.price_tag::text', re=r'([.\d]+)\s*(?:EUR)')

together with a general pipeline that strips non-ASCII characters:

def process_item(self, item, spider):
    def remove_non_ascii(text):
        # Keep only characters in the ASCII range (code points below 128).
        return ''.join(ch for ch in text if ord(ch) < 128)

    for key, value in item.items():
        item[key] = remove_non_ascii(value)
    return item

But the pipeline seems to run after the regex has already been applied, so the regex only finds "000" instead of "1500000".
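
The "000" result can be reproduced outside Scrapy with the standard re module. This is a minimal sketch, assuming the extracted text node is exactly the string shown above; the variable names are only illustrative:

```python
import re

# Assumed input: the text of the div, with non-breaking spaces (U+00A0)
# between the digit groups.
text = "Price 1\u00a0500\u00a0000 EUR"

# The pattern from the question, written as a raw string.
pattern = re.compile(r"([.\d]+)\s*(?:EUR)")

# [.\d]+ cannot cross a non-breaking space, so only the last digit group,
# the one directly before " EUR", satisfies the whole pattern.
matches = pattern.findall(text)
print(matches)  # ['000']
```

The non-breaking spaces are still in the text when the regex runs, which is why stripping them later in a pipeline cannot recover the leading digits.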

Of course, one could add a .replace() somewhere for these specific cases, but I would prefer to stick with the standard methods available and keep the code maintainable.

Solution

You can use

\d+(?:\s\d+)*(?=\s*EUR)

See the regex demo.

Details:

\d+ - one or more digits
(?:\s\d+)* - zero or more sequences of a whitespace and one or more digits
(?=\s*EUR) - a positive lookahead that matches a location in the string that is immediately followed by zero or more whitespace characters and then EUR.

NOTE that \s and the other shorthand character classes are Unicode-aware by default in Python 3.x regexes when the pattern is a str, so \s matches the non-breaking space (\u00a0) and you need no additional flags.
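
To check the pattern outside Scrapy, here is a small sketch using only the standard re module. PRICE_RE and parse_price are hypothetical names, not part of Scrapy, and the input string is assumed to be the one from the question. The match still contains the non-breaking spaces, so they are removed before converting to an integer:

```python
import re

# The suggested pattern: digit groups separated by single whitespace
# characters, followed (without being consumed) by optional whitespace and EUR.
PRICE_RE = re.compile(r"\d+(?:\s\d+)*(?=\s*EUR)")


def parse_price(text):
    """Return the amount as an int, or None if no price is found."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    # \s is Unicode-aware for str patterns, so this also strips U+00A0.
    return int(re.sub(r"\s+", "", match.group()))


print(parse_price("Price 1\u00a0500\u00a0000 EUR"))  # 1500000
print(parse_price("no price here"))                   # None
```

In a Scrapy project, a function like this could presumably be used as an input processor on the field (for example via MapCompose) instead of the re argument, so the cleanup happens before the item reaches any pipeline.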