
Is there such a thing as an "if x (or any variable) has any value" check in Python?


I'm trying to build a web crawler that generates a text file for multiple different websites. After it crawls a page, it is supposed to collect all of the links on that page. However, I have run into a problem while crawling Wikipedia. The Python script gives me this error:

Traceback (most recent call last):
  File "/home/banana/Desktop/Search engine/data/crawler?.py", line 22, in <module>
    urlwaitinglist.write(link.get('href'))
TypeError: write() argument must be str, not None

I looked deeper into it by printing the discovered links, and the output has "None" at the top. I'm wondering if there is a function to check whether a variable has any value.

Here is the code I have written so far:

from bs4 import BeautifulSoup
import os
import requests
import random
import re

toscan = "https://en.wikipedia.org/wiki/Wikipedia:Contents"
url = toscan
source_code = requests.get(url)
plain_text = source_code.text

removal_list = ["http://", "https://", "/"]

for word in removal_list:
    toscan = toscan.replace(word, "")

soup = BeautifulSoup(plain_text, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
    urlwaitinglist = open("/home/banana/Desktop/Search engine/data/toscan", "a")
    urlwaitinglist.write('\n')
    urlwaitinglist.write(link.get('href'))
    urlwaitinglist.close()
    
print(soup.get_text())

directory = "/home/banana/Desktop/Search engine/data/Crawled Data/"

results = soup.get_text()

results = results.strip()

f = open("/home/banana/Desktop/Search engine/data/Crawled Data/" + toscan + ".txt", "w")
f.write(url)
f.write('\n')
f.write(results)
f.close()

Solution

  • Looks like not every <a> tag you are grabbing has an href attribute, so link.get('href') can return None. I would suggest converting each link you grab to a string and checking that it is not None. It is also bad practice to open a file without using a 'with' statement. Below is an example, built on your code, that grabs every http/https link and writes them to a file:

    from bs4 import BeautifulSoup
    import requests
    import re

    file_directory = './' # your specified directory location
    filename = 'urls.txt' # your specified filename

    url = "https://en.wikipedia.org/wiki/Wikipedia:Contents"
    res = requests.get(url)
    html = res.text

    soup = BeautifulSoup(html, 'html.parser')
    links = []

    for link in soup.find_all('a'):
        href = link.get('href')  # may be None if the <a> tag has no href attribute
        print(href)
        # keep only absolute http/https links; str() guards against None
        if re.search(r'^https?://', str(href)):
            links.append(str(href))

    # 'with' closes the file automatically, even if an error occurs
    with open(file_directory + filename, 'w') as file:
        for link in links:
            file.write(link + '\n')
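
  • To answer the title question directly: there is no special function needed. A minimal sketch of the two common idioms (the href value here is made up for illustration):

    ```python
    # Suppose this came from link.get('href'); it is None when the
    # <a> tag has no href attribute.
    href = None

    # Explicit check: only skips None.
    if href is not None:
        print(href)

    # Truthiness check: skips None *and* empty strings, which is
    # usually what you want when collecting URLs.
    if href:
        print(href)
    ```

    For this crawler the truthiness check is the safer choice, since an empty href is just as useless to write to the file as a missing one.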