python beautifulsoup html-parsing

How to fetch all links from HTML without an anchor tag?


I want to fetch all the links from the page requested in the code below, in particular the ones that start with https://api.somthing.com/v1/companies/. All the regexes I found online only match simple links such as https://api.somthing.com

import requests
import re
from bs4 import BeautifulSoup

url = 'https://www.linkdin.com/'

# Download the page and parse the raw HTML
response = requests.get(url)
html_doc = response.text
soup = BeautifulSoup(html_doc, "html.parser")
print(soup)


Solution

  • You can extract the URLs directly from the response text with re.findall:

    p = r'https://api\.something\.com/.*?(?=")'
    
    urls = re.findall(p, html_doc)
    

    Output:

    ['https://api.something.com/v1/companies/postings/733260034',
     'https://api.something.com/v1/companies/postings/371262356',
     'https://api.something.com/v1/companies/postings/465637233',
     'https://api.something.com/v1/companies/postings/315747724',
    ...
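
To try the approach without hitting the network, here is a minimal sketch that runs the same idea on a small inline HTML string. The sample markup, the name SAMPLE_HTML and the helper extract_api_urls are made up for illustration. The sketch assumes the URLs sit as quoted strings inside a script block rather than in href attributes, which would explain why searching for anchor tags finds nothing. The pattern is a slight variation on the one above: it stops at the first quote, whitespace or angle bracket instead of using a lookahead.

```python
import re
from typing import List

# Hypothetical sample markup: the API URLs are quoted strings inside a
# <script> block, not <a href="..."> attributes, so looking for anchor
# tags would not find them.
SAMPLE_HTML = '''
<html><body>
<a href="https://example.com/about">About</a>
<script type="application/json">
{"postings": ["https://api.something.com/v1/companies/postings/1",
              "https://api.something.com/v1/companies/postings/2",
              "https://api.something.com/v1/companies/postings/1"],
 "home": "https://api.something.com"}
</script>
</body></html>
'''

# Match the API host followed by a path, stopping at the first quote,
# whitespace or angle bracket. The bare host without a trailing slash
# is deliberately not matched.
API_URL = re.compile(r'https://api\.something\.com/[^"\'\s<>]*')


def extract_api_urls(html: str) -> List[str]:
    """Return the matching URLs, de-duplicated, in order of first appearance."""
    return list(dict.fromkeys(API_URL.findall(html)))


if __name__ == "__main__":
    for found in extract_api_urls(SAMPLE_HTML):
        print(found)
```

With real pages, you would pass the html_doc string from the question to extract_api_urls instead of SAMPLE_HTML. If the page builds its links in JavaScript after loading, they may not be present in the downloaded text at all, and no regex will find them.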