Tags: python, web-scraping, binary-data

Binary numbers all come out as 0


I'm trying to convert my web-scraping data into binary values: if the class name contains yes, the value is 1; if it contains no, the value is 0. When I print binary_value, it is 0 for every element, even for those whose class contains yes. I'm not sure what I'm missing. Thanks in advance.

import cfscrape
from bs4 import BeautifulSoup

scraper = cfscrape.create_scraper()
response = scraper.get('https://www.hipflat.co.th/projects/ruam-rudee-penthouse-lvukdc')
soup = BeautifulSoup(response.text, 'html.parser')

divs = soup.find_all('div', class_=lambda x: x and ("amenities__icon amenities__icon--yes" in x or "amenities__icon amenities__icon--no" in x))

# Convert the elements to binary numbers
for div in divs:
  if "amenities__icon amenities__icon--yes" in div['class']:
    binary_value = 1
  else:
    binary_value = 0

  print(binary_value)

This is what appears in the terminal when I print(div) instead:

<div class="amenities__icon amenities__icon--yes"></div>
<div class="amenities__icon amenities__icon--no"></div>
<div class="amenities__icon amenities__icon--yes"></div>
<div class="amenities__icon amenities__icon--no"></div>
<div class="amenities__icon amenities__icon--yes"></div>
<div class="amenities__icon amenities__icon--no"></div>
<div class="amenities__icon amenities__icon--yes"></div>
<div class="amenities__icon amenities__icon--no"></div>
<div class="amenities__icon amenities__icon--yes"></div>
<div class="amenities__icon amenities__icon--no"></div>
<div class="amenities__icon amenities__icon--no"></div>
<div class="amenities__icon amenities__icon--no"></div>

Solution

  • Note: it would be shorter to get divs using .select with a CSS selector:

    divs = soup.select('div.amenities__icon:is(.amenities__icon--yes, .amenities__icon--no)')
    

    The problem is this check from the question:

      if "amenities__icon amenities__icon--yes" in div['class']:
    

    You can just check if "amenities__icon--yes" in div['class'], since every div in divs already has amenities__icon; the lambda expression passed to find_all ensures it.

    div['class'] can be expected to be a list of strings, and none of its items will contain a space: HTML classes are separated by spaces, so BeautifulSoup splits the attribute into a list when it parses the tag. A string with a space in it can therefore never be an item of that list, which is why the check is always False and every binary_value is 0. (This becomes quite obvious if you print the classes with for div in divs: print(div['class']).)

    So, the correct way to check for both amenities__icon and amenities__icon--yes classes would be

      if "amenities__icon" in div['class'] and "amenities__icon--yes" in div['class']:
    

    or, if you wanted that specific order for some reason, you could join the classes back into a single string before checking

      if "amenities__icon amenities__icon--yes" in " ".join(div['class']):
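The three checks discussed above can be compared without scraping anything. This is only a minimal sketch: the classes list is written by hand as an assumption of what div['class'] would hold for one of the "yes" divs, and the variable names are made up for illustration.

```python
# Hypothetical class list, written by hand to mimic what BeautifulSoup is
# expected to return for class="amenities__icon amenities__icon--yes".
classes = ["amenities__icon", "amenities__icon--yes"]

# List membership compares whole items, so a string containing a space
# can never match any item of the list.
original_check = "amenities__icon amenities__icon--yes" in classes

# Checking for the single distinguishing class matches an actual item.
single_class_check = "amenities__icon--yes" in classes

# Joining first turns the test into a substring search on one string.
joined_check = "amenities__icon amenities__icon--yes" in " ".join(classes)

print(original_check, single_class_check, joined_check)  # False True True
```

The first result is the behaviour seen in the question; the other two are the fixes suggested here.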
    

    With a list comprehension,

    [int("amenities__icon amenities__icon--yes" in " ".join(d['class'])) for d in divs] # OR
    # [1 if "amenities__icon amenities__icon--yes" in " ".join(d['class']) else 0 for d in divs]
    

    you get [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0] for the divs shown in the question.
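For a version that can be run offline, here is a rough sketch that swaps the live page for a small hand-written HTML string (an assumption that only mirrors the first three divs printed in the question) and uses the simpler single-class check. It needs only bs4, not cfscrape.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the scraped page, so no network access is needed.
html = """
<div class="amenities__icon amenities__icon--yes"></div>
<div class="amenities__icon amenities__icon--no"></div>
<div class="amenities__icon amenities__icon--yes"></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Select every div that carries the amenities__icon class.
divs = soup.select("div.amenities__icon")

# d["class"] is a list of class names, so test for the one distinguishing class.
binary_values = [int("amenities__icon--yes" in d["class"]) for d in divs]

print(binary_values)  # [1, 0, 1]
```

On the real page the selector may need to be narrowed, for example with the :is(...) form shown in the note at the top of this answer.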