I'm really new to Scrapy and web scraping in general. I've been working through the tutorials and wanted to branch out and see what I could do.
I have some content from a website I want to scrape, which has the details in three columns. I want to pull the details from each column.
I can't work out how to get the data for each heading from the columns. For example, if I want to store the body type in a Scrapy field called body_type, how would I get the text "Coachbuilt"? The other thing is that the content I want may not always be in the same location in the table, but it will always be preceded by the right name. E.g. Transmission is the first field in the second column, with a value of Manual, but it may not always be there; it could be in the first column.
I've got as far as getting the whole top-level div using response.css(".single-product__left-desc__specification").get()
then from there I can get the first column with firstcol=response.css(".single-product__left-desc__specification .first-col")
and subsequently the col-4 content with fields=firstcol.css(".col-4 strong::text").getall()
but that only gives me the field names. I can then do something like values=firstcol.css(".col-8::text").getall() but that doesn't give me all the data, and I'm not sure how to match the fields to the values to create the fields I want. Can someone point me in the right direction?
Here is the html snippet.
<div class="single-product__left-desc__specification">
<div class="row">
<div class="first-col col-md-6 col-lg-5 p-0">
<div class="row">
<div class="col-4"><strong>Body Type</strong></div>
<div class="col-8 text-left pl-0">Coachbuilt</div>
</div>
<div class="row">
<div class="col-4 pr-0"><strong>Make</strong></div>
<div class="col-8 text-left pl-0">Adria</div>
</div>
<div class="row">
<div class="col-4"><strong>Range</strong></div>
<div class="col-8 text-left pl-0">Matrix 670 SLT Supreme </div>
</div>
<div class="row">
<div class="col-4"><strong>Reg Year</strong></div>
<div class="col-8 text-left pl-0 product-single__taxonomy-reg">2018</div>
</div>
<div class="row">
<div class="col-4"><strong>Layout</strong></div>
<div class="col-8 text-left pl-0">End Washroom</div>
</div>
</div>
<div class="col-md-6 col-lg-7 p-0">
<div class="row">
<div class="col-md-6 col-lg-7 p-0">
<div class="row">
<div class="col-6"><strong>Transmission</strong></div>
<div class="col-6 text-left pl-0">Manual</div>
</div>
<div class="row">
<div class="col-6"><strong>Drive</strong></div>
<div class="col-6 text-left pl-0">RHD</div>
</div>
<div class="row">
<div class="col-6"><strong>Engine Size</strong></div>
<div class="col-6 text-left pl-0">2.3</div>
</div>
<div class="row">
<div class="col-6"><strong>Engine Power</strong></div>
<div class="col-6 text-left pl-0">143bhp</div>
</div>
<div class="row">
<div class="col-6"><strong>Bed Type</strong></div>
<div class="col-6 text-left pl-0">Single Beds</div>
</div>
</div>
<div class="last-col col-md-6 col-lg-5 p-0">
<div class="row">
<div class="col-6"><strong>Mileage</strong></div>
<div class="col-6 text-left pl-0">
<div class="product-single__taxonomy">11111</div>
</div>
</div>
<div class="row">
<div class="col-6"><strong>Berths</strong></div>
<div class="col-6 text-left pl-0">
<div class="product-single__taxonomy">4</div>
</div>
</div>
<div class="row">
<div class="col-6"><strong>Weight</strong></div>
<div class="col-6 text-left pl-0">3800</div>
</div>
<div class="row">
<div class="col-6"><strong>Seatbelts</strong></div>
<div class="col-6 text-left pl-0">4</div>
</div>
</div>
</div>
</div>
</div>
</div>
Ideally, if someone can point to docs, videos, howtos that would help me work this out, that would be fantastic!
Many thanks
The solution should look something like this:
result = {}
for row in response.css('.single-product__left-desc__specification .row .row'):
    # the field name is the text of the <strong> tag in this row
    name = row.css('strong::text').get('')
    # the value is all text inside the .pl-0 cell, including nested divs
    value = ''.join([c.strip('\n ') for c in row.css('.pl-0').css('*::text').getall()])
    result[name] = value

for k, v in result.items():
    print(f"{k} : {v}")
So in general I made a compound CSS query, .single-product__left-desc__specification .row .row, and for each result (inside the for loop) I made a separate query for the field name (inside the strong tag) and another for its value (the more complicated part, which collects all the text from the other cell of the row).
Output: a dictionary with the following content:
Body Type : Coachbuilt
Make : Adria
Range : Matrix 670 SLT Supreme
Reg Year : 2018
Layout : End Washroom
Transmission : Manual
Drive : RHD
Engine Size : 2.3
Engine Power : 143bhp
Bed Type : Single Beds
Mileage : 11111
Berths : 4
Weight : 3800
Seatbelts : 4
In Scrapy, the dictionary can also be returned as an item by adding yield result inside the parse method.
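Since the question asks for a field called body_type rather than "Body Type", you may want to rename the keys before yielding. Below is a rough sketch of one way to do that with plain Python; to_field_name and to_item are names I invented, and the sample dict just mimics the shape produced by the loop above. It assumes lowercase snake_case keys are what you want.

```python
import re


def to_field_name(label):
    """Turn a label such as 'Body Type' into a key such as 'body_type'."""
    # Lowercase, then collapse every run of non-alphanumeric characters to "_".
    return re.sub(r'[^0-9a-z]+', '_', label.strip().lower()).strip('_')


def to_item(result):
    """Rename the keys of a scraped {label: value} dict to snake_case."""
    return {to_field_name(k): v for k, v in result.items()}


# Sample data in the shape produced by the scraping loop.
result = {'Body Type': 'Coachbuilt', 'Reg Year': '2018', 'Transmission': 'Manual'}
item = to_item(result)
print(item['body_type'])  # Coachbuilt

# Inside a spider's parse method the last step would then be:
#     yield to_item(result)
```

Because the keys are derived from the labels, it does not matter which column a field such as Transmission shows up in.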