Tags: python, python-3.x, loops, beautifulsoup, urllib3

For Loop doesn't spit out needed results


I wrote this piece of code to print the unique "area number" from each URL. However, the loop doesn't work: it prints the same number every time. Please see below:

import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()

url = open('MS Type 1 URL.txt',encoding='utf-8-sig')

links = []
for link in url:
    y = link.strip()
    links.append(y)

url.close()

print('Amount of Links: ', len(links))

for x in links:
    j = (x.find("=") + 1)
    g = (x.find('&housing'))
    print(link[j:g])

Results are:

http://millersamuel.com/aggy-data/home/query_report?area=38&housing_type=3&measure=4&query_type=quarterly&region=1&year_end=2020&year_start=1980 23

http://millersamuel.com/aggy-data/home/query_report?area=23&housing_type=1&measure=4&query_type=annual&region=1&year_end=2020&year_start=1980 23

As you can see, it prints the area number '23' both times. That number appears in only one of these URLs; the '38' from the other URL never shows up.


Solution

  • There's a typo in your code. You iterate over the links list and bind each element to x, but you print a slice of link, the variable left over from the earlier file-reading loop. After that loop finishes, link still holds the last line of the file, so the same string is sliced on every iteration. Changing print(link[j:g]) to print(x[j:g]) fixes it, but it's better to give your variables more descriptive names, so here's the fixed version of your loop:

    for link in links:
        j = link.find('=') + 1
        g = link.find('&housing')
        print(link[j:g])
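
To make the cause concrete, here is a minimal sketch of the same mistake. The shortened strings in links are stand-ins for the real URLs (an assumption for brevity), and the buggy and fixed lists exist only for this illustration:

```python
# Shortened stand-ins for the two URLs in the question (illustrative only).
links = ["area=38&housing_type=3", "area=23&housing_type=1"]

# A for loop leaves its variable bound to the last element once it finishes,
# just like the file-reading loop in the question does.
for link in links:
    pass

buggy, fixed = [], []
for x in links:
    j = x.find("=") + 1
    g = x.find("&housing")
    buggy.append(link[j:g])  # slices the stale variable from the earlier loop
    fixed.append(x[j:g])     # slices the element of the current iteration

print(buggy)  # ['23', '23']
print(fixed)  # ['38', '23']
```

The buggy list repeats the area of the last URL, which matches the output shown in the question.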
    

    I'd also like to show a more robust way to extract the area value from a URL, using the standard library's URL parser instead of string offsets:

    from urllib.parse import urlparse, parse_qs
    url = 'http://millersamuel.com/aggy-data/home/query_report?area=38&housing_type=3&measure=4&query_type=quarterly&region=1&year_end=2020&year_start=1980'
    area = parse_qs(urlparse(url).query)['area'][0]
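
The [0] index may look odd at first, so here is a small sketch of what the two calls return. The names parts and query are chosen just for this illustration:

```python
from urllib.parse import urlparse, parse_qs

url = ('http://millersamuel.com/aggy-data/home/query_report'
       '?area=38&housing_type=3&measure=4&query_type=quarterly'
       '&region=1&year_end=2020&year_start=1980')

parts = urlparse(url)          # splits the URL into scheme, netloc, path, query, ...
query = parse_qs(parts.query)  # maps each parameter name to a list of its values

print(parts.query)       # the raw text after the '?'
print(query['area'])     # ['38']
print(query['area'][0])  # 38
```

parse_qs returns a list for every key because a parameter may legally appear more than once in a query string, which is why the first element is taken explicitly.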
    

    So instead of using the str.find method, you can write this:

    # links is the list of URLs read from the file in the question
    for link in links:
        parsed_qs = parse_qs(urlparse(link).query)
        if 'area' in parsed_qs:
            area = parsed_qs['area'][0]
            print(area)
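
Putting it together, the approach could be wrapped in a small helper, sketched below. The function name extract_area is hypothetical, and the third URL is made up here only to show what happens when the area parameter is missing:

```python
from urllib.parse import urlparse, parse_qs


def extract_area(url):
    """Return the first 'area' query value of url, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get('area')
    return values[0] if values else None


links = [
    # The two URLs from the question.
    'http://millersamuel.com/aggy-data/home/query_report?area=38&housing_type=3&measure=4&query_type=quarterly&region=1&year_end=2020&year_start=1980',
    'http://millersamuel.com/aggy-data/home/query_report?area=23&housing_type=1&measure=4&query_type=annual&region=1&year_end=2020&year_start=1980',
    # Made-up URL without an area parameter (assumption, for illustration).
    'http://millersamuel.com/aggy-data/home/query_report?region=1',
]

areas = [extract_area(link) for link in links]
print(areas)  # ['38', '23', None]
```

Unlike the str.find version, this does not depend on area being the first parameter or on '&housing' following it.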
    

    Used functions: urllib.parse.urlparse and urllib.parse.parse_qs.