python html web-scraping scrapy

How to extract this HTML data not enclosed in <div> using Scrapy?


I've been trying to extract one specific value from a page for a Scrapy scraping project. My current Python code is:

bedrooms_info = house_listing.css(
                '.search-results-listings-list__item-description__characteristics__item:contains("Chambres") ::text').get()
            bedrooms = self.extract_number(bedrooms_info) if bedrooms_info else None

The extract_number method used above is:

    def extract_number(self, value):
        try:
            # Use regular expression to extract numeric values
            match = re.search(r'\d+', value)
            return int(match.group()) if match else None
        except (TypeError, ValueError):
            return None

And the relevant HTML from the website in question is:

<div class="search-results-listings-list__item-description__item search-results-listings-list__item-description__characteristics">
            <div class="search-results-listings-list__item-description__characteristics__item">
            <!--?xml version="1.0"?-->
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 46 41" class="search-results-listings-list__item-description__characteristics__icon search-results-listings-list__item-description__characteristics__icon--bedrooms"><path d="M5.106 0c-.997 0-1.52.904-1.52 1.533v11.965L.074 23.95c-.054.163-.074.38-.074.486V39.2c-.017.814.727 1.554 1.54 1.554.796 0 1.54-.74 1.52-1.554v-3.555h39.88V39.2c-.016.814.724 1.554 1.52 1.554.813 0 1.56-.74 1.54-1.554V24.436c0-.106-.017-.326-.074-.486l-3.512-10.449V1.537c0-.633-.523-1.534-1.52-1.534H5.106V0zm1.54 3.07h32.708v3.663a5.499 5.499 0 0 0-2.553-.614h-9.708c-1.614 0-3.06.687-4.093 1.77a5.648 5.648 0 0 0-4.093-1.77H9.2c-.924 0-1.793.217-2.553.614V3.07zm2.553 6.098h9.708c1.45 0 2.553 1.12 2.553 2.547v.523H6.646v-.523c0-1.426 1.103-2.547 2.553-2.547zm17.894 0H36.8c1.45 0 2.553 1.12 2.553 2.547v.523H24.54v-.523c0-1.426 1.103-2.547 2.553-2.547zm-20.88 6.12H39.79l2.553 7.615H3.656l2.556-7.615zM3.06 25.973h39.88v6.625H3.06v-6.625z"></path></svg>
            <div class="search-results-listings-list__item-description__characteristics-popover">Chambres</div>
            1
        </div>
                    </div>

I've been trying for a whole day to extract the number of bedrooms (in the above code, it's the 1). However, all my program is returning is null. If anyone has any insights into how I could extract that specific number, I'd appreciate it.

I've tried multiple different approaches, most of them ending with null. One alternative led me to extracting "Chambres" rather than the actual number of bedrooms. This alternative approach also returns null:

bedrooms_info = house_listing.css(
                'div.search-results-listings-list__item-description__characteristics__item::text').get()

Solution

  • You are very close.

    The only change you really need is to call getall instead of get on your CSS query.

    .search-results-listings-list__item-description__characteristics__item:contains("Chambres") ::text

    In plain English, your CSS query says: find the element with class .search-results-listings-list__item-description__characteristics__item whose text contains "Chambres", then return the text nodes inside it. Because of the space before ::text, that includes the text of its descendants as well as its own.

    So your selector is correct. The only issue is that the query matches several text nodes, and get returns only the first one. With the HTML as posted, that first node is the whitespace right after the opening <div>, so extract_number finds no digits and returns None.

    getall returns every match as a list, and the one you are looking for is the last item.
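
To make the "several text nodes" point concrete, here is a rough sketch that uses only the standard library rather than Scrapy. It collects every text node from a trimmed copy of the markup, which approximates what `::text` with getall would yield. The short class names (`item`, `popover`), the shortened SVG path and the `TextCollector` helper are stand-ins invented for this illustration, and the exact whitespace may differ from what Scrapy's parser produces.

```python
import re
from html.parser import HTMLParser

# Trimmed copy of the listing markup. Class names and SVG path data are
# shortened stand-ins, not the site's real values.
HTML = (
    '<div class="item">\n'
    '    <!--?xml version="1.0"?-->\n'
    '<svg viewBox="0 0 46 41"><path d="M0 0"></path></svg>\n'
    '    <div class="popover">Chambres</div>\n'
    '    1\n'
    '</div>'
)


class TextCollector(HTMLParser):
    """Hypothetical helper: records every text node in document order."""

    def __init__(self):
        super().__init__()
        self.fragments = []

    def handle_data(self, data):
        self.fragments.append(data)


collector = TextCollector()
collector.feed(HTML)
collector.close()
fragments = collector.fragments

# The first fragment is whitespace only, so a digit search finds nothing.
print(repr(fragments[0]), re.search(r"\d+", fragments[0]))

# The last fragment carries the bedroom count.
print(repr(fragments[-1]), re.search(r"\d+", fragments[-1]).group())
```
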

    So an example of successfully extracting the "1" value would be:

    html = """
    <html>
    <div class="search-results-listings-list__item-description__item search-results-listings-list__item-description__characteristics">
                <div class="search-results-listings-list__item-description__characteristics__item">
                <!--?xml version="1.0"?-->
    <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 46 41" class="search-results-listings-list__item-description__characteristics__icon search-results-listings-list__item-description__characteristics__icon--bedrooms"><path d="M5.106 0c-.997 0-1.52.904-1.52 1.533v11.965L.074 23.95c-.054.163-.074.38-.074.486V39.2c-.017.814.727 1.554 1.54 1.554.796 0 1.54-.74 1.52-1.554v-3.555h39.88V39.2c-.016.814.724 1.554 1.52 1.554.813 0 1.56-.74 1.54-1.554V24.436c0-.106-.017-.326-.074-.486l-3.512-10.449V1.537c0-.633-.523-1.534-1.52-1.534H5.106V0zm1.54 3.07h32.708v3.663a5.499 5.499 0 0 0-2.553-.614h-9.708c-1.614 0-3.06.687-4.093 1.77a5.648 5.648 0 0 0-4.093-1.77H9.2c-.924 0-1.793.217-2.553.614V3.07zm2.553 6.098h9.708c1.45 0 2.553 1.12 2.553 2.547v.523H6.646v-.523c0-1.426 1.103-2.547 2.553-2.547zm17.894 0H36.8c1.45 0 2.553 1.12 2.553 2.547v.523H24.54v-.523c0-1.426 1.103-2.547 2.553-2.547zm-20.88 6.12H39.79l2.553 7.615H3.656l2.556-7.615zM3.06 25.973h39.88v6.625H3.06v-6.625z"></path></svg>
                <div class="search-results-listings-list__item-description__characteristics-popover">Chambres</div>
                1
            </div>
                    </div>
    </html>
    """
    
    import scrapy
    import re
    
    selector = scrapy.Selector(text=html)
    
    bedrooms_info = selector.css('.search-results-listings-list__item-description__characteristics__item:contains("Chambres") ::text').getall()
    bedrooms = bedrooms_info[-1]  # '\n            1\n        '
    # re.search scans the whole string; re.match would fail on the leading newline
    print(int(re.search(r'\d+', bedrooms).group()))  # 1
    

    OUTPUT

    1
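
Picking the last item works for this markup but depends on the number always coming last. A slightly more defensive variant, sketched below, scans every fragment returned by getall and keeps the first one that contains digits. The name `first_number` is hypothetical and not part of Scrapy, and the sample list is hand-written to resemble the getall result rather than captured from a real run.

```python
import re


def first_number(fragments):
    """Return the first integer found in a list of text fragments, or None.

    Hypothetical helper: `fragments` is expected to be the list of strings
    that a `::text` query returns via getall().
    """
    for fragment in fragments:
        match = re.search(r"\d+", fragment)
        if match:
            return int(match.group())
    return None


# Hand-written sample resembling the getall() result for the listing above.
sample = ["\n            ", "\n", "\n            ", "Chambres", "\n            1\n        "]
print(first_number(sample))
```

In the spider, the same idea could replace the extract_number call by passing the getall result directly, assuming no other fragment in the element contains digits.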