I want ONLY the "NPPES Data Dissemination" link (an `<a href>` element) in the Full Replacement Monthly NPI File section of https://download.cms.gov/nppes/NPI_Files.html. The Weekly Incremental NPI Files section has other "NPPES Data Dissemination" links that I do NOT want. Here is the code that gets ALL NPPES Data Dissemination files in the monthly and weekly sections:
import subprocess
import re
from bs4 import BeautifulSoup
import requests
import wget
def get_urls(soup):
    urls = []
    for a in soup.find_all('a', href=True):
        ul = a.find_all(text=re.compile('NPPES Data Dissemination'))
        if ul != []:
            urls.append(a)
    print('done scraping the url...')
    return urls
def download_and_extract(urls):
    for texts in urls:
        text = str(texts)
        file = text[55:99]
        print('zip file :', file)
        zip_link = texts['href']
        print('Downloading %s :' % zip_link)
        slashurl = zip_link.split('/')
        print(slashurl)
        wget.download("https://download.cms.gov/nppes/" + slashurl[1])
r = requests.get('https://download.cms.gov/nppes/NPI_Files.html')
soup = BeautifulSoup(r.content, 'html.parser')
urls = get_urls(soup)
download_and_extract(urls)
Tried: limit=1 does not work the way I have it below; all NPPES Data Dissemination files are still collected
def get_urls(soup):
    urls = []
    for a in soup.find_all('a', href=True):
        ul = a.find_all(text=re.compile('NPPES Data Dissemination'), limit=1)
        if ul != []:
            urls.append(a)
    print('done scraping the url......!!!!')
    return urls
Tried: If I add the open parenthesis, 'NPPES Data Dissemination (', which appears only in the Full Replacement Monthly NPI File section, I get errors (below)
def get_urls(soup):
    urls = []
    for a in soup.find_all('a', href=True):
        ul = a.find_all(text=re.compile('NPPES Data Dissemination ('), limit=1)
        if ul != []:
            urls.append(a)
    print('done scraping the url......!!!!')
    return urls
thank you for any assistance you may provide!!!!
If you only need the first link:
What happens here is that the limit you set applies to the text search inside each individual link, but your loop still visits every link on the page, so every matching link is collected.
The simple way to get only the first link is to add a break once a match is found, which stops the loop.
def get_urls(soup):
    urls = []
    for a in soup.find_all('a', href=True):
        ul = a.find_all(text=re.compile('NPPES Data Dissemination'))
        if ul != []:
            urls.append(a)
            # break (stop loop) if found
            break
    print('done scraping the url......!!!!')
    return urls
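The difference between collecting every match and stopping at the first one can be checked without touching the network. This is a minimal sketch, assuming a few made-up link texts (not copied from the CMS page); `all_matches` and `first_match` are hypothetical helpers that mirror the loop without and with break.

```python
import re

# Hypothetical link texts, in page order; the real page's wording may differ.
link_texts = [
    "NPPES Data Dissemination (Month Year)",
    "NPPES Data Dissemination - Monthly Deactivation Update",
    "NPPES Data Dissemination - Weekly Update",
]

pattern = re.compile("NPPES Data Dissemination")


def all_matches(texts):
    """Collect every text that matches, like the loop without break."""
    return [t for t in texts if pattern.search(t)]


def first_match(texts):
    """Stop at the first matching text, like the loop with break."""
    for t in texts:
        if pattern.search(t):
            return [t]
    return []


print(len(all_matches(link_texts)))  # every sample text matches the broad pattern
print(first_match(link_texts))       # only the first matching text is kept
```

Note that break only gives the monthly full file if that link happens to come first on the page, so it depends on page order.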
Update: looking at the website, you can actually select the section with the regex alone (no break needed). The parenthesis must be escaped, and a raw string (r'...') keeps the backslash intact for the regex engine:
Full Replacement Monthly NPI File -> re.compile(r'NPPES Data Dissemination \(')
Full Replacement Monthly NPI Deactivation File -> re.compile(r'NPPES Data Dissemination - Monthly Deactivation Update')
Weekly Incremental NPI Files -> re.compile(r'NPPES Data Dissemination - Weekly Update')
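To see how these patterns separate the three sections, and why the unescaped parenthesis in the question raised an error, here is a small sketch using only the standard library. The sample link texts are assumptions made up for illustration, and `SECTION_PATTERNS` and `classify` are hypothetical names, not part of the original code.

```python
import re

# Raw strings keep the backslash in "\(" intact for the regex engine.
SECTION_PATTERNS = {
    "monthly_full": re.compile(r"NPPES Data Dissemination \("),
    "monthly_deactivation": re.compile(
        r"NPPES Data Dissemination - Monthly Deactivation Update"
    ),
    "weekly": re.compile(r"NPPES Data Dissemination - Weekly Update"),
}


def classify(link_text):
    """Return the name of the first pattern that matches, or None."""
    for name, pattern in SECTION_PATTERNS.items():
        if pattern.search(link_text):
            return name
    return None


# An unescaped "(" opens a group that never closes, so compiling fails.
try:
    re.compile("NPPES Data Dissemination (")
    unescaped_ok = True
except re.error:
    unescaped_ok = False

# re.escape builds an equivalent escaped pattern from the literal text.
escaped = re.escape("NPPES Data Dissemination (")

print(classify("NPPES Data Dissemination (Month Year)"))
print(classify("NPPES Data Dissemination - Weekly Update"))
print(unescaped_ok)
```

In get_urls, passing the "monthly_full" pattern as the text= argument in place of the broad 'NPPES Data Dissemination' pattern should then keep only the links from the Full Replacement Monthly NPI File section, assuming the page's link texts follow the wording above.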