python, html, web, beautifulsoup, screen-scraping

Can't extract the text and find all by BeautifulSoup


I want to extract all the available items in the "équipements" (amenities) section, but I only get the first four items, followed by '+ Plus'.

import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent with the request.
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
url = 'https://www.airbnb.fr/rooms/8261637?s=bAMrFL5A'
response = requests.get(url, headers=headers)
bsobj = BeautifulSoup(response.text, 'lxml')
# Matches only elements whose class attribute is exactly "row amenities".
b = bsobj.find_all("div", {"class": "row amenities"})

The result in b does not contain the full list inside the tag. Its last item is '+ Plus', and the output ends as follows.

<span data-reactid=".mjeft4n4sg.0.0.0.0.1.8.1.0.0.$1.1.0.0">+ Plus</span></strong></a></div></div></div></div></div>]

Solution

  • This happens because the page fills in the data with ReactJS after it loads. If you download the page with urllib2 or requests, you get the HTML before the JavaScript has run, so the data is not there.

    Instead, use Selenium WebDriver: open the page in a browser and let it run all the JavaScript. Then you can get access to all the data you expect.
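
    A minimal sketch of that approach, under a few assumptions: Firefox and geckodriver are installed, the amenities still live in a div with the classes "row" and "amenities" as in the question, and the helper names fetch_rendered_html and amenity_texts are made up for illustration. The page's markup may have changed since the question was asked, so treat the selector as a placeholder.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, wait_seconds=10):
    """Load the page in a real browser so its JavaScript runs, then return the HTML."""
    # Imported here so amenity_texts() below can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # assumption: Firefox and geckodriver are available
    try:
        driver.get(url)
        # Wait until the amenities container (selector assumed from the question) exists.
        WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div.row.amenities"))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()


def amenity_texts(html):
    """Return the visible text fragments inside every div with classes row and amenities."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for block in soup.select("div.row.amenities"):
        texts.extend(block.stripped_strings)
    return texts
```

    Calling amenity_texts(fetch_rendered_html(url)) with the url from the question would then return the text of whatever the browser has rendered. If the remaining items only appear after clicking the '+ Plus' link, that link probably has to be clicked through Selenium before reading driver.page_source; its selector is not shown in the question, so that step is left out here.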