python, csv, parsing, web-scraping, beautifulsoup

Scraping and Parsing a website for information


I am trying to gather information about all golf courses in the US. I have written a script to scrape data from the PGA website, which lists about 18,000 golf courses. The script is not running properly and I am having trouble fixing it. It is supposed to create an ownership column that says whether each course is private or public. I was able to find that information, but when the script runs it is placed in random rows of the CSV instead of being joined with the right golf course. How do I fix this so every row holds all the data that belongs together: name, address, phone number, website, and ownership?

Second, I want the address field parsed out into separate columns in my CSV: street name and number, city, state, ZIP code, and country.
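A minimal sketch of the city/state/ZIP split, assuming the listing's second address line follows a "City, State 12345" shape (the function name and the exact format are assumptions, so check a few real rows first; the country is not present in the scraped field, so it is not produced here):

```python
import re

def split_city_state_zip(text):
    """Split a 'City, State 12345' string into city/state/zip parts.

    Illustrative only: assumes the PGA listing's city-state-zip field
    uses a comma after the city and ends with a numeric ZIP.
    """
    match = re.match(
        r'\s*(?P<city>[^,]+),\s*(?P<state>[^\d]+?)\s+(?P<zip>[\d-]+)\s*$',
        text)
    if not match:
        return {'city': '', 'state': '', 'zip': ''}
    return {k: match.group(k).strip() for k in ('city', 'state', 'zip')}

print(split_city_state_zip('Williamsburg, Virginia 23185-5905'))
# {'city': 'Williamsburg', 'state': 'Virginia', 'zip': '23185-5905'}
```

The lazy `[^\d]+?` on the state group lets the pattern handle both spelled-out names and two-letter abbreviations, since it stops expanding as soon as the digits of the ZIP begin.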

Lastly, I was wondering if it is possible to write a function so that, when an address contains a P.O. Box in its string, the box is moved into a separate column named PO Box. How do I go about that?
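One way to do that is to search each street string for a P.O. Box and pull it out into its own value; a hedged sketch, where the pattern and the tuple return shape are my own choices rather than anything from the site:

```python
import re

# Matches variants like "P.O. Box 123", "PO Box 123", "P. O. Box 123".
# Illustrative pattern, not taken from the PGA page.
PO_BOX_RE = re.compile(r'\bP\.?\s*O\.?\s*Box\s+[\w-]+', re.IGNORECASE)

def split_po_box(street):
    """Return (street_without_po_box, po_box) for a street string."""
    match = PO_BOX_RE.search(street)
    if not match:
        return street.strip(), ''
    po_box = match.group(0)
    remainder = (street[:match.start()] + street[match.end():]).strip(' ,')
    return remainder, po_box

print(split_po_box('P.O. Box 1234'))    # ('', 'P.O. Box 1234')
print(split_po_box('100 Fairway Dr'))   # ('100 Fairway Dr', '')
```

When building each CSV row you would call this on the street field and write the two pieces into their respective columns.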

I want to save all of this information into a single CSV containing all the data I need.

Here is my script:

import csv
import requests
from bs4 import BeautifulSoup

courses_list = []
for i in range(1):      # Number of pages plus one
     url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
     r = requests.get(url)
     soup = BeautifulSoup(r.content, "html.parser")

     g_data1 = soup.find_all("div", {"class": "views-field-nothing-1"})
     g_data2 = soup.find_all("div", {"class": "views-field-nothing"})

     for item in g_data2 and g_data1:
          try:
               ownership = item.contents[1].find_all("div", {"class": "views-field-course-type"})[0].text
               print(ownership)
          except:
               ownership = ''
          try:
               name = item.contents[1].find_all("div", {"class": "views-field-title"})[0].text
               print(name)
          except:
               name = ''
          try:
               address1 = item.contents[1].find_all("div", {"class": "views-field-address"})[0].text
          except:
               address1 = ''
          try:
               address2 = item.contents[1].find_all("div", {"class": "views-field-city-state-zip"})[0].text
          except:
               address2 = ''
          try:
               website = item.contents[1].find_all("div", {"class": "views-field-website"})[0].text
          except:
               website = ''
          try:
               Phonenumber = item.contents[1].find_all("div", {"class": "views-field-work-phone"})[0].text
          except:
               Phonenumber = ''

          course = [name, address1, address2, website, Phonenumber, ownership]
          courses_list.append(course)

with open('Testing.csv', 'a', newline='', encoding='utf-8') as file:
     writer = csv.writer(file)
     for row in courses_list:
          writer.writerow(row)

Solution

  • I think Beautiful Soup might be overkill here; you should be able to parse this page with regular expressions alone. I've just done the first two fields for you and will leave the rest to you to fill in. By the way, don't name a variable `list` — that shadows a Python built-in.

    import re
    import requests
    
    L = []
    for i in range(1):      # Number of pages plus one 
         url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
         r = requests.get(url)
         ownership = re.findall('(?<=<div class="views-field-course-type"><span class="field-content">)([^<]+)',r.text)    
         address = re.findall('(?<=<div class="views-field-address"><span class="field-content">)([^<]+)', r.text)
         L.extend(zip(ownership,address))
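The remaining fields can be pulled the same way by swapping in the other CSS class names from the question (`views-field-title`, `views-field-work-phone`, `views-field-website`); a sketch with a small helper — note the exact inner markup of each field is an assumption, so verify the lookbehind string against the live page source:

```python
import re

def field(html, css_class):
    """Grab the text of every <div class="..."><span class="field-content"> block.

    Sketch only: the lookbehind mirrors the div/span structure seen in
    the answer's two patterns and is assumed to hold for other fields.
    """
    pattern = (r'(?<=<div class="{}"><span class="field-content">)([^<]+)'
               .format(css_class))
    return re.findall(pattern, html)

sample = '<div class="views-field-title"><span class="field-content">Sample GC</span></div>'
print(field(sample, 'views-field-title'))  # ['Sample GC']

# e.g. with r.text from the request above:
# names  = field(r.text, 'views-field-title')
# phones = field(r.text, 'views-field-work-phone')
# rows   = list(zip(names, ownership, address, phones))
```

Because each formatted lookbehind is a fixed-width string, this stays within Python's fixed-length lookbehind restriction.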
    

    If you want to export as a CSV, a Pandas DataFrame is probably the easiest way to go:

    import pandas as pd
    df = pd.DataFrame(L, columns = ['Ownership','Address'])
    df.to_csv('c:/golfcourselist.csv')
    df.head()
    
      Ownership             Address
    0   Private   1801 Merrimac Trl
    1    Public     12551 Glades Rd
    2    Public  13601 SW 115th Ave
    3    Public  465 Warrensburg Rd
    4    Public    45120 Waxpool Rd
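As for why the ownership values landed in random rows in the original script: `for item in g_data2 and g_data1:` does not combine the two lists — `and` simply evaluates to `g_data1` whenever `g_data2` is non-empty, so only one list is ever walked. If the two `find_all` result lists line up one-to-one per course (an assumption worth checking against the page, as is which of the two classes actually holds the ownership div), pairing them with `zip` keeps each course's ownership on its own row. A minimal sketch that also searches each block directly instead of going through `item.contents[1]`:

```python
def text_of(block, css_class):
    """Text of the first matching div inside a result block, or ''."""
    found = block.find_all("div", {"class": css_class})
    return found[0].text.strip() if found else ''

def rows_from_blocks(g_data1, g_data2):
    """Pair each ownership block (g_data1) with its info block (g_data2).

    Assumes the two soup.find_all lists are parallel, one entry per
    course, with ownership under views-field-nothing-1 — verify both
    assumptions against the live markup.
    """
    rows = []
    for own_block, info_block in zip(g_data1, g_data2):
        rows.append([
            text_of(info_block, "views-field-title"),
            text_of(info_block, "views-field-address"),
            text_of(info_block, "views-field-city-state-zip"),
            text_of(info_block, "views-field-website"),
            text_of(info_block, "views-field-work-phone"),
            text_of(own_block, "views-field-course-type"),
        ])
    return rows
```

The row order matches the `course` list in the question, so the result can be fed straight to the same `csv.writer` loop.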