python beautifulsoup filter list-comprehension

Filtering a list - TypeError


I'm a beginner and I want to filter the list lista_aziende_raw. I know for a fact that the list contains names that include Azienda, but I get an empty list. I guess that in only looks for identical matches, hence no results. Is there a way to get what I want using a list comprehension, or should I just use the re module?

lista_aziende_raw = []

for link in soup.find_all('a'):
    lista_aziende_raw.append(link.get('href'))

parsing_name = ['Azienda']

for link in soup.find_all('a'):
    lista_aziende_raw.append(link.get('href')) 

print(lista_aziende_raw)


lista_aziende_filtered = [x for x in lista_aziende_raw if x in parsing_name] 

print(lista_aziende_filtered)

I also tried putting it in an if statement, following this answer, but that raises TypeError: argument of type 'NoneType' is not iterable.

lista_aziende_raw = []
for link in soup.find_all('a'):
    lista_aziende_raw.append(link.get('href')) 

print(lista_aziende_raw)

def parser(starting_list):
    parsing_name = 'Azienda'
    lista_aziende_filtered = []

    for x in starting_list:
        if parsing_name in x:
            lista_aziende_filtered.append(x)
    return lista_aziende_filtered

print(parser(lista_aziende_raw))

What am I missing? How would you do it?

EDIT: Here the output of print(lista_aziende_raw). The full error for the list comprehension:

File "/home/enrico/Documents/Learning_Python/edo/web_scraper.py", line 41, in <listcomp>
    lista_aziende_filtered = [x for x in lista_aziende_raw if x in parsing_name] 
                                                              ^^^^^^^^^^^^^^^^^
TypeError: 'in <string>' requires string as left operand, not NoneType

Full error for the if statement:

 print(parser(lista_aziende_raw))
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/enrico/Documents/Learning_Python/edo/web_scraper.py", line 27, in parser
    if parsing_name in x:
       ^^^^^^^^^^^^^^^^^
TypeError: argument of type 'NoneType' is not iterable

Solution

  • One of the elements in your list is None (lista_aziende_raw[10]), because link.get('href') returns None for an <a> tag that has no href attribute. Modify your initial block to:

    for link in soup.find_all('a'):
        link_text = link.get('href')
        if isinstance(link_text, str): # check that the link_text is a string
            lista_aziende_raw.append(link_text) 
    

    The additional isinstance(link_text, str) checks that the object returned by link.get is a string. You could also simply write if link_text: ... since None is falsy, though I used the more explicit check for clarity.

    With that change, your second attempt, as well as the other suggestion, should work.

    The only other problem is that most of your URLs actually have 'azienda' in lowercase, so your condition should be if "azienda" in x.lower().
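
To tie the points above together, here is a minimal sketch of the whole filter as a list comprehension. The hrefs in the sample list are made up for illustration (I don't have your page), and they stand in for what link.get('href') would return, including a None for a tag with no href. It also shows why your first attempt came back empty: in against a list checks for an equal element, while in against a string checks for a substring.

```python
# Hypothetical hrefs standing in for the values collected from
# soup.find_all('a'); the None mimics an <a> tag without an href.
lista_aziende_raw = [
    "/aziende/azienda-rossi",
    None,
    "/contatti",
    "/Azienda/bianchi",
    "",
]

parsing_name = "azienda"

# `x in ["Azienda"]` asks "is x equal to an element of this list?",
# so no URL that merely contains the word will ever match.
exact_matches = [x for x in lista_aziende_raw if x in ["Azienda"]]

# Skip anything that is not a string first, then do a
# case-insensitive substring test on what remains.
lista_aziende_filtered = [
    x
    for x in lista_aziende_raw
    if isinstance(x, str) and parsing_name in x.lower()
]

print(exact_matches)
print(lista_aziende_filtered)
```

Because isinstance(x, str) comes first in the and, x.lower() is never called on None, so you don't need the re module for this unless the pattern gets more complicated than a plain substring.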