Tags: python, beautifulsoup, href, limit

Need to limit BeautifulSoup href result to first occurrence - or - account for an open parenthesis in href string


I want ONLY the <a href> link for NPPES Data Dissemination in the Full Replacement Monthly NPI File section of https://download.cms.gov/nppes/NPI_Files.html. There are other NPPES Data Dissemination links in the Weekly Incremental NPI Files section that I do NOT want. Here is the code, which currently gets ALL NPPES Data Dissemination files from both the monthly and weekly sections:

import subprocess
import re
from bs4 import BeautifulSoup
import requests
import wget

def get_urls(soup):
    urls = []
    for a in soup.find_all('a', href=True):
        ul = a.find_all(text=re.compile('NPPES Data Dissemination'))
        if ul != []:
            urls.append(a)
    print('done scraping the url...')
    return urls

def download_and_extract(urls):
    for texts in urls:
        text = str(texts)
        file = text[55:99]
        print('zip file :', file)
        zip_link = texts['href']
        print('Downloading %s :' % zip_link)
        slashurl = zip_link.split('/')
        print(slashurl)
        wget.download("https://download.cms.gov/nppes/" + slashurl[1])

r = requests.get('https://download.cms.gov/nppes/NPI_Files.html')
soup = BeautifulSoup(r.content, 'html.parser')
urls = get_urls(soup)
download_and_extract(urls)

Tried: limit=1 does not work as I have it below; all NPPES Data Dissemination files are still collected:

def get_urls(soup):
    urls = []
    for a in soup.find_all('a', href=True):
        ul = a.find_all(text=re.compile('NPPES Data Dissemination'), limit=1)
        if ul != []:
            urls.append(a)
    print('done scraping the url......!!!!')
    return urls

Tried: using the open parenthesis, 'NPPES Data Dissemination (', since it appears only in the Full Replacement Monthly NPI File section, but I get errors with the code below:

def get_urls(soup):
    urls = []
    for a in soup.find_all('a', href=True):
        ul = a.find_all(text=re.compile('NPPES Data Dissemination ('), limit=1)
        if ul != []:
            urls.append(a)
    print('done scraping the url......!!!!')
    return urls 
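For reference, the failure can be reproduced without BeautifulSoup at all: `(` is a regex metacharacter (it opens a group), so the unescaped pattern fails to compile before `find_all` ever runs. A minimal reproduction:

```python
import re

# '(' opens a group in regex syntax, so the unescaped pattern is an
# unterminated subpattern and raises re.error at compile time.
try:
    re.compile('NPPES Data Dissemination (')
except re.error as exc:
    print('re.error:', exc)

# Escaping the parenthesis produces a valid pattern.
print(re.compile(r'NPPES Data Dissemination \(').pattern)
```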

Thank you for any assistance you may provide!


Solution

  • If what you need is only the first link

    What happens here is that the limit you set only caps how many matches are returned within each link; the loop still searches every link.

    The simple way to get only the first link is to add a break once a match is found, so the loop stops:

    def get_urls(soup):
        urls = []
        for a in soup.find_all('a', href=True):
            ul = a.find_all(text=re.compile('NPPES Data Dissemination'))
            if ul != []:
                urls.append(a)
                # break (stop loop) if found
                break
        print('done scraping the url......!!!!')
        return urls
    

    Update: looking at the website, you can actually do this with the regex alone (without the break). Note that the open parenthesis must be escaped as \( because ( is a regex metacharacter; that is also why your unescaped pattern raised errors:

    Full Replacement Monthly NPI File -> re.compile('NPPES Data Dissemination \(')

    Full Replacement Monthly NPI Deactivation File -> re.compile('NPPES Data Dissemination - Monthly Deactivation Update')

    Weekly Incremental NPI Files -> re.compile('NPPES Data Dissemination - Weekly Update')
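    As a quick check of the three patterns, here is a standalone sketch using sample link texts modeled on the page's naming (the version/date parts are illustrative placeholders, not real file names):

    ```python
    import re

    # Illustrative link texts in the style of the NPI Files page.
    links = [
        'NPPES Data Dissemination (V.2 Monthly)',
        'NPPES Data Dissemination - Monthly Deactivation Update',
        'NPPES Data Dissemination - Weekly Update',
    ]

    patterns = {
        'Full Replacement Monthly NPI File': re.compile(r'NPPES Data Dissemination \('),
        'Monthly Deactivation File': re.compile('NPPES Data Dissemination - Monthly Deactivation Update'),
        'Weekly Incremental NPI Files': re.compile('NPPES Data Dissemination - Weekly Update'),
    }

    # Each pattern matches exactly one of the sample texts.
    for section, pattern in patterns.items():
        matched = [t for t in links if pattern.search(t)]
        print(section, '->', matched)
    ```

    With the escaped \( pattern passed to find_all, the original get_urls keeps only the monthly full-replacement link and no break is needed.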