Why doesn't BeautifulSoup manage to download information from Wix?
I'm trying to use BeautifulSoup to download images from my website. The same code works on other sites (example of the code actually working), but it does not work on my Wix site. Is there anything I can change in my site's settings to make it work?
EDIT: CODE
from bs4 import BeautifulSoup
import urllib2
import shutil
import requests
from urlparse import urljoin
import time

def make_soup(url):
    req = urllib2.Request(url, headers={'User-Agent': "Magic Browser"})
    html = urllib2.urlopen(req)
    return BeautifulSoup(html, 'html.parser')

def get_images(url):
    soup = make_soup(url)
    images = [img for img in soup.findAll('img')]
    print (str(len(images)) + " images found.")
    print 'Downloading images to current working directory.'
    image_links = [each.get('src') for each in images]
    for each in image_links:
        try:
            filename = each.strip().split('/')[-1].strip()
            src = urljoin(url, each)
            print 'Getting: ' + filename
            response = requests.get(src, stream=True)
            # delay to avoid corrupted previews
            time.sleep(1)
            with open(filename, 'wb') as out_file:
                shutil.copyfileobj(response.raw, out_file)
        except:
            print ' An error occurred. Continuing.'
    print 'Done.'

def main():
    url = HIDDEN ADDRESS
    get_images(url)

if __name__ == '__main__':
    main()
BeautifulSoup can only parse HTML; it does not execute JavaScript. Wix sites are built by JavaScript that runs when the page loads in a browser. When you request the page's HTML via urllib, you don't get the rendered HTML, only the base HTML with the scripts that build the rendered page. To scrape the images, you'd need something like Selenium or a headless Chrome browser to render the site by running its JavaScript, and then feed the rendered HTML to BeautifulSoup.
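A minimal sketch of that approach, assuming Python 3, Selenium 4, and a local Chrome install (none of which are in the question's code). The function names `render_html` and `extract_image_urls` are hypothetical, and waiting for the first `img` tag is only one possible way to decide that the page has rendered; a real Wix page may need a different wait condition.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def render_html(url, wait_seconds=10):
    """Load `url` in headless Chrome and return the HTML after scripts have run.

    Assumption: Selenium 4 and Chrome are installed. Selenium is imported here
    so that the parsing helper below can be used without it.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Assumption: the page counts as rendered once any <img> is in the DOM.
        WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_element_located((By.TAG_NAME, "img"))
        )
        return driver.page_source
    finally:
        driver.quit()


def extract_image_urls(html, base_url):
    """Return absolute URLs for every <img> tag that has a src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        urljoin(base_url, img["src"])
        for img in soup.find_all("img")
        if img.get("src")
    ]
```

With these in place, something like `extract_image_urls(render_html(url), url)` would replace the `make_soup` step in the question, and the download loop could stay largely as it is.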
Here's an example of the body of a Wix site. As you can see, it has no content other than a single div that gets populated via JavaScript.
...
<body>
<div id="SITE_CONTAINER"></div>
</body>
...
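To check whether a plain HTTP request returned only this skeleton, one option is to look for the container and see whether it has any child tags. This is a rough sketch in Python 3: the helper name `is_unrendered` is hypothetical, the default `SITE_CONTAINER` id is taken from the example above and may differ on other sites, and the `<html>` wrapper in the sample string stands in for the elided parts of the page.

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML returned by a plain request (elided parts omitted).
SKELETON = """
<html>
<body>
<div id="SITE_CONTAINER"></div>
</body>
</html>
"""


def is_unrendered(html, container_id="SITE_CONTAINER"):
    """Return True if the container div exists but holds no child tags."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id=container_id)
    # find(True) returns the first descendant tag of any name, or None.
    return container is not None and container.find(True) is None


print(is_unrendered(SKELETON))  # True
print(len(BeautifulSoup(SKELETON, "html.parser").find_all("img")))  # 0
```

If this returns True for the HTML your script fetched, the "0 images found" result comes from the missing rendering step, not from the parsing code.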