Given a <div> containing a price with a lot of noise:
Price 1\u00a0500\u00a0000 EUR
and you want only the pure amount (1500000), what is the best way to implement this in Scrapy?
I tried to combine regex:
il.add_css('price', 'div.price_tag::text', re='([.\d]+)\s*(?:EUR)')
together with a general pipeline removing non-ascii code:
def process_item(self, item, spider):
    def remove_non_ascii(text):
        # Keep only characters in the 7-bit ASCII range
        return ''.join(i for i in text if ord(i) < 128)

    for key, value in item.items():
        item[key] = remove_non_ascii(value)
    return item
But it seems the pipeline is executed after the regex, and hence it would only find "000" instead of "1500000".
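The behaviour described above can be reproduced outside Scrapy with the standard `re` module. This is only a sketch: the sample string is the one from the question, and `re.findall` is used here as a stand-in for whatever Scrapy does internally with the `re=` argument. It shows that the digits are separated by non-breaking spaces, so `[.\d]+` can only reach `EUR` from the last group.

```python
import re

# Sample text from the question; \u00a0 is a non-breaking space.
raw = "Price 1\u00a0500\u00a0000 EUR"

# The pattern from the question, written as a raw string.
original_pattern = re.compile(r"([.\d]+)\s*(?:EUR)")

# "1" and "500" are each followed by a non-breaking space and more digits,
# not by EUR, so only the final digit group satisfies the pattern.
matches = original_pattern.findall(raw)
print(matches)  # ['000']
```

Because the regex runs on the raw text, stripping non-ASCII characters later in a pipeline cannot recover the digits that were never captured.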
Of course one could build in a .replace() somewhere for those specific cases, but I would prefer to stick with the standard methods available and keep the code more maintainable.
You can use
\d+(?:\s\d+)*(?=\s*EUR)
Details:
\d+ - one or more digits
(?:\s\d+)* - zero or more sequences of a whitespace character followed by one or more digits
(?=\s*EUR) - a positive lookahead that matches a location in the string that is immediately followed by zero or more whitespace characters and then EUR.
NOTE that \s and other shorthand character classes are Unicode-aware by default in a Python 3.x regex, so \s also matches the non-breaking space \u00a0 and you need no additional flags.
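A minimal sketch of how this pattern behaves on the sample string, using only the standard `re` module. The helper name `extract_amount` is hypothetical. Note that the match itself still contains the non-breaking spaces, so the whitespace removal at the end is an extra step assumed here to get the pure amount.

```python
import re

# Pattern from the answer: digit groups separated by single whitespace
# characters, followed (via lookahead) by optional whitespace and EUR.
AMOUNT_RE = re.compile(r"\d+(?:\s\d+)*(?=\s*EUR)")


def extract_amount(text):
    """Return the amount before EUR as a digit string, or None if absent.

    Hypothetical helper for illustration only.
    """
    match = AMOUNT_RE.search(text)
    if match is None:
        return None
    # The match is e.g. "1\u00a0500\u00a0000"; drop every whitespace
    # character (Unicode-aware, so \u00a0 is removed as well).
    return re.sub(r"\s", "", match.group())


print(extract_amount("Price 1\u00a0500\u00a0000 EUR"))  # 1500000
```

In a Scrapy project, a function like this could presumably be applied as an input processor on the price field, which would keep the cleanup next to the field definition rather than in a general pipeline.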