I need to extract the brand name, model, and (when present) the trim level of cars listed on a website. The problem is that with two groups in my regex I cannot capture the third element (the trim level of the car), and with three groups I get nothing for cars that have no trim level.
<a href="https://XXX.ir/car/bmw/x4">بیامو ایکس ۴ </a>
<a href="https://XXX.ir/car/peugeot/405/glx">پژو ۴۰۵ جیالایکس</a>
my_regex_1 = r'https:\/\/XXX\.ir\/car\/(.+)\/(.+)\/(.+)'
my_regex_2 = r'https:\/\/XXX\.ir\/car\/(.+)\/(.+)\/'
My code:
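import re

import requests
from bs4 import BeautifulSoup

mainpage = requests.get('https://bama.ir/')
soup = BeautifulSoup(mainpage.text, 'html.parser')
# Keep only anchors that carry an href; item['href'] raises KeyError otherwise.
brands = soup.find_all('a', href=True)
# The trim level is optional; '$' anchors the pattern at the end of the href,
# so the second group cannot stop after a single character.
pattern = re.compile(r'https://bama\.ir/car/([^/]+)/([^/]+)(?:/([^/]+))?/?$')
infos = []
for item in brands:
    info = pattern.match(item['href'])
    if info:
        # groups() is (brand, model, trim); trim is None when it is absent.
        infos.append(info.groups())
print(infos)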
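Try this regex, which wraps the trim level in an optional non-capturing group: https:\/\/XXX\.ir\/car\/([^\/]+?)\/([^\/]+?)(?:\/([^\"]+))?\"
The trailing \" matches the closing quote of the href attribute, so the pattern is meant to be run against the raw HTML. Without that anchor (for example, when matching the bare href value), the lazy second group stops after one character and the optional group never matches; in that case end the pattern with $ instead.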
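As a rough sketch of how the optional group behaves, the snippet below applies a variant of the pattern to the two sample URLs from the question. It assumes the regex is run against the bare href value rather than the raw HTML, so it ends with $ instead of a closing quote. The names CAR_URL and parse_car_url are hypothetical, chosen only for this illustration.

```python
import re

# Brand and model are required; the trim level sits in an optional
# non-capturing group. '$' anchors the end of the bare href value.
CAR_URL = re.compile(r'https://XXX\.ir/car/([^/]+)/([^/]+)(?:/([^/]+))?/?$')


def parse_car_url(url):
    """Return (brand, model, trim), or None if the URL is not a car page.

    trim is None when the URL has no trim-level segment.
    """
    match = CAR_URL.match(url)
    return match.groups() if match else None


print(parse_car_url('https://XXX.ir/car/bmw/x4'))           # ('bmw', 'x4', None)
print(parse_car_url('https://XXX.ir/car/peugeot/405/glx'))  # ('peugeot', '405', 'glx')
print(parse_car_url('https://XXX.ir/about'))                # None
```

Returning None for the missing trim level (rather than an empty string, as re.findall would) makes it easy to tell "no trim" apart from a real value when storing the results.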