I wrote a function that scrapes data from a website for multiple zip codes. The code works for most zip codes, but there are some where I'm getting an Unknown String Error.
Here is the code I'm using:
import time
from datetime import date, timedelta
from bs4 import BeautifulSoup
import urllib2
from dateutil.parser import parse
import pandas as pd
import random
import os
url = 'https://www.sittercity.com/jobs/search?distance=50&&page=1&per_page=100000&search_strategy=babbysitting_job&&selected_facets%5Bnew_jobs%5D=true&sort=relevance&zipcode=94513'
soup = BeautifulSoup(urllib2.urlopen(url))
posts = [t.text for t in soup.find_all(class_ = "item posted-by")]
dates = [parse(item, fuzzy = True) for item in posts]
The error comes from the 34th item in the posts list. However, every element in the list has the same data type, so I'm confused. The 33rd item in the list parses fine. For example:
This works:
dates_single = parse(posts[32], fuzzy = True)
But this doesn't:
dates_single = parse(posts[33], fuzzy = True)
Here are the values of posts[32] and posts[33]:
>>> posts[33]
u'Posted by April A. on 3/28/2016'
>>> posts[32]
u'Posted by Chandrika M. on 3/30/2016'
I read through the dateutil.parser documentation and none of the "Unknown String Error" use cases seem to fit.
Your error occurs due to a conflict between April (detected as a month name) and 3 (detected as a month number).
Minimal example:
from dateutil.parser import parse
parse(u'Posted by Chandrika M. on 3/30/2016', fuzzy=True) # datetime.datetime(2016, 3, 30, 0, 0)
parse(u'Posted by April A. on 3/28/2016', fuzzy=True) # ValueError: Unknown string format
parse(u'Posted by XYZ A. on 3/28/2016', fuzzy=True) # datetime.datetime(2016, 3, 28, 0, 0)
Since your format is well defined, you can simply convert the date directly, without any heuristics:
import datetime
s = u'Posted by April A. on 3/28/2016'
# The date is always the last whitespace-separated token.
datetime.datetime.strptime(s.split()[-1], "%m/%d/%Y") # datetime.datetime(2016, 3, 28, 0, 0)
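If you want to apply the same idea to the whole posts list, something along these lines might work. It is only a sketch: the helper name extract_posted_date and the regular expression are my own, and it assumes every post contains a date written as M/D/YYYY. It pulls the date out with a regex rather than relying on it being the last token, so trailing text would not break it.

```python
import re
from datetime import datetime

# Assumed pattern: a date such as 3/28/2016 somewhere in the text.
DATE_RE = re.compile(r"\b(\d{1,2}/\d{1,2}/\d{4})\b")

def extract_posted_date(text):
    """Return the first M/D/YYYY date in text as a datetime, or None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    # strptime raises ValueError if the digits are not a real date.
    return datetime.strptime(match.group(1), "%m/%d/%Y")

posts = [
    u'Posted by Chandrika M. on 3/30/2016',
    u'Posted by April A. on 3/28/2016',
]
dates = [extract_posted_date(p) for p in posts]
```

Posts without a recognizable date come back as None, so you can filter them out or inspect them afterwards instead of having the whole list comprehension fail.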