I wrote this code to print the unique "area number" from each URL. However, the loop doesn't work: it prints the same number every time. Please see below:
    import urllib3
    from bs4 import BeautifulSoup

    http = urllib3.PoolManager()
    url = open('MS Type 1 URL.txt', encoding='utf-8-sig')
    links = []
    for link in url:
        y = link.strip()
        links.append(y)
    url.close()
    print('Amount of Links: ', len(links))

    for x in links:
        j = (x.find("=") + 1)
        g = (x.find('&housing'))
        print(link[j:g])
Results are:
As you can see, it prints the area number '23', which appears in only one of the URLs, but never the '38' from the other URL.
There's a typo in your code. You iterate over the links list and bind its elements to the variable x, but you print a slice of the variable link, which still holds the last line read by the first loop, so the same string is printed on every iteration. You can change print(link[j:g]) to print(x[j:g]), but it's better to give your variables more descriptive names, so here's the fixed version of your loop:
    for link in links:
        j = link.find('=') + 1
        g = link.find('&housing')
        print(link[j:g])
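To make the cause of the repeated output concrete, here is a minimal sketch that reproduces both behaviours side by side. The two URLs are hypothetical stand-ins for the lines of the text file (only the area values 38 and 23 come from the question), and the empty first loop just mimics the file-reading loop leaving link bound to the last line.

```python
# Hypothetical stand-ins for the lines read from the text file.
links = [
    "http://example.com/query_report?area=38&housing_type=3",
    "http://example.com/query_report?area=23&housing_type=3",
]

# Mimic the file-reading loop: afterwards `link` is still bound
# to the last element it saw.
for link in links:
    pass

buggy, fixed = [], []
for x in links:
    j = x.find("=") + 1
    g = x.find("&housing")
    buggy.append(link[j:g])  # indices come from x, but the slice is of the stale `link`
    fixed.append(x[j:g])     # slice the current loop variable instead

print(buggy)  # ['23', '23']
print(fixed)  # ['38', '23']
```

Python keeps a loop variable alive after the loop ends, which is why the original code raised no error and silently printed the last URL's value each time.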
I also want to show you a proper way to extract the area value from URLs:
    from urllib.parse import urlparse, parse_qs

    url = 'http://millersamuel.com/aggy-data/home/query_report?area=38&housing_type=3&measure=4&query_type=quarterly&region=1&year_end=2020&year_start=1980'
    area = parse_qs(urlparse(url).query)['area'][0]
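The [0] at the end is needed because parse_qs maps each parameter name to a list of values (a name may appear more than once in a query string). A short sketch, using a shortened hypothetical URL on example.com rather than the real one, shows the intermediate results:

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical, shortened URL for illustration only.
url = "http://example.com/query_report?area=38&housing_type=3&region=1"

parts = urlparse(url)
print(parts.query)   # area=38&housing_type=3&region=1

params = parse_qs(parts.query)
print(params)        # {'area': ['38'], 'housing_type': ['3'], 'region': ['1']}

area = params["area"][0]
print(area)          # 38
```

Unlike slicing between '=' and '&housing', this does not depend on area being the first parameter or on housing_type following it directly.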
So instead of using the str.find method, you can write this:
    # urls is the list of URL strings (the links list in your code)
    for url in urls:
        parsed_qs = parse_qs(urlparse(url).query)
        if 'area' in parsed_qs:
            area = parsed_qs['area'][0]
            print(area)
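Putting the pieces together, one possible end-to-end shape is sketched below. The helper names extract_area and areas_from_lines are made up for this example, and the sample lines are hypothetical; with the real data you would pass the open file object instead of the list.

```python
from urllib.parse import urlparse, parse_qs


def extract_area(url):
    """Return the 'area' query parameter of url, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get("area")
    return values[0] if values else None


def areas_from_lines(lines):
    """Yield (url, area) for every non-blank line."""
    for line in lines:
        url = line.strip()
        if url:
            yield url, extract_area(url)


# With the real file this would look like:
#     with open('MS Type 1 URL.txt', encoding='utf-8-sig') as f:
#         for url, area in areas_from_lines(f):
#             print(area)
# Hypothetical sample lines stand in for the file here.
sample = [
    "http://example.com/query_report?area=38&housing_type=3\n",
    "http://example.com/query_report?housing_type=3&area=23\n",
    "\n",
    "http://example.com/query_report?housing_type=3\n",
]

results = list(areas_from_lines(sample))
for url, area in results:
    print(area)
```

Returning None for a URL without an area parameter, rather than skipping it, is a design choice of this sketch; it keeps the output aligned with the input lines so missing values are easy to spot.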
Used functions: