I have a huge list of URLs linking to Amazon products. Each URL contains a piece of information I need, called the ASIN number.
I understand that one of the best ways to extract that information is with regular expressions. I found a pattern in the URLs that could help.
The respective ASIN numbers are:
1- B07P4LVZNL, located between: dp/B07P4LVZNL/ref=sr_1_f
2- B07DXPN7TK, located between: dp/B07DXPN7TK/ref=sr_1_fkmr2_
3- B07R23QGH6, located between: gp/B07R23QGH6/ref=sr_1_fkmr2_
I tried this code:
asin = re.match("http[s]?://www.amazon.com(\w+)(.*)/(dp|gp/product)/(?P<asin>\w+).*", href, flags=re.IGNORECASE)
href is the variable where I have stored the URLs.
But it doesn't work well. This is the type of result I get:
<re.Match object; span=(0, 175), match='https://www.amazon.com/adidas-Originals-Solid-Mel>
<re.Match object; span=(0, 171), match='https://www.amazon.com/adidas-Game-Mode-Polo-Mult>
<re.Match object; span=(0, 167), match='https://www.amazon.com/adidas-Tech-Tee-Black-X-La>
Thank you for your help
I suggest using
/[dg]p/([^/]+)
It matches /dp/ or /gp/ and then captures into Group 1 one or more characters other than /.
See the regex demo. In Python:
import re

# Search anywhere in the URL for /dp/ or /gp/ and capture the next path segment
asin = re.search(r'/[dg]p/([^/]+)', href, flags=re.IGNORECASE)
if asin:
    print(asin.group(1))
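To apply this to a whole list, here is a minimal sketch that wraps the same pattern in a helper function. The names ASIN_RE, extract_asin and urls are made up for illustration, and the URLs are hypothetical ones built around the fragments quoted in the question, so adapt them to your real data.

```python
import re

# Same pattern as above: "/dp/" or "/gp/", then one or more non-slash characters.
ASIN_RE = re.compile(r'/[dg]p/([^/]+)', flags=re.IGNORECASE)

def extract_asin(url):
    """Return the path segment after /dp/ or /gp/, or None if there is no match."""
    match = ASIN_RE.search(url)
    return match.group(1) if match else None

# Hypothetical URLs built around the fragments quoted in the question.
urls = [
    "https://www.amazon.com/example-item/dp/B07P4LVZNL/ref=sr_1_f",
    "https://www.amazon.com/example-item/dp/B07DXPN7TK/ref=sr_1_fkmr2_",
    "https://www.amazon.com/example-item/gp/B07R23QGH6/ref=sr_1_fkmr2_",
    "https://www.amazon.com/some-page-without-an-asin",
]

asins = [extract_asin(u) for u in urls]
print(asins)  # ['B07P4LVZNL', 'B07DXPN7TK', 'B07R23QGH6', None]
```

Note that the pattern captures whatever segment follows /dp/ or /gp/. If some of your URLs have the form /gp/product/<ASIN>, as the regex in the question suggests, it would capture "product" instead, and an alternation such as /(?:dp|gp/product)/([^/]+) may suit those URLs better.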