
Scrapy: Extract Dynamic Table Data Directly from the Data Source


Using Scrapy, I want to extract the data shown in a dynamic table on a webpage. Because the table is populated dynamically, an XPath query for the tbody tag returns no data:

In [1]: response.xpath('//table/tbody').getall()
Out[1]: ['<tbody></tbody>']

On the other hand, an XPath query for the table tag shows that the tag itself already contains all the data, in a structured form, inside its data-entities attribute:

In [2]: response.xpath('//table').getall()
Out[2]: ['<table class="table icms-dt rs_preserve" cellspacing="0" width="100%" id="publikation" data-webpack-module="datatables" data-entity-type="publikation" data-entities="{&quot;emptyColumns&quot;:[&quot;privatKategorie&quot;,&quot;_thumbnail&quot;],&quot;data&quot;:[{&quot;name&quot;:&quot;&lt;a href=\\&quot;\\/_rte\\/publikation\\/35897\\&quot;&gt;Nutzungsbedingungen&lt;\\/a&gt;&quot;,&quot;name-sort&quot;:&quot;nutzungsbedingungen&quot;,&quot;herausgeber&quot;:&quot;Informatikdienst&quot;,&quot;herausgeber-sort&quot;:&quot;informatikdienst&quot;,&quot;datum&quot;:&quot;16.12.2010&quot;,&quot;datum-sort&quot;:&quot;2010-12-16&quot;,&quot;kategorieId&quot;:&quot;publikation&quot;,&quot;kategorieId-sort&quot;:&quot;publikation&quot;,&quot;privatKategorie&quot;:&quot;&quot;,&quot;privatKategorie-sort&quot;:&quot;&quot;,&quot;_thumbnail&quot;:&quot;&quot;,&quot;_downloadBtn

I want to extract the table data in a structured way, e.g. by row and column. Is there a way to do this, with BeautifulSoup for instance? Any ideas and help are highly appreciated.

The table can be examined with scrapy shell as follows:

scrapy shell "rapperswil-jona.ch/publikationen"

Solution

  • Here you go:

    import json

    # The attribute value is returned as a string containing JSON
    raw_data = response.xpath('//table/@data-entities').get()
    data = json.loads(raw_data)
    

    The data is stored in the table's data-entities attribute. The XPath above extracts that attribute and returns its value as a string.

    Because the string is JSON, json.loads() converts it to a dict.

    The actual rows are under the key data. Accessing it gives a list with one dict per table row, which you can loop over, export to CSV, or process further as you wish:

    for item in data['data']:
        print(item['name-sort'])
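
To make the "by row and column" step concrete, here is a minimal sketch of flattening the parsed entries into CSV rows. It runs offline on a sample string built from the fields visible in the question's output. The helper to_row, the regular expression, and the English column names are assumptions for illustration, not part of the original answer. In a real spider, raw_data would come from the XPath shown above.

```python
import csv
import io
import json
import re

# Sample string shaped like the data-entities value; the field names and values
# are taken from the output shown in the question (only one row, shortened).
raw_data = json.dumps({
    "emptyColumns": ["privatKategorie", "_thumbnail"],
    "data": [{
        "name": '<a href="/_rte/publikation/35897">Nutzungsbedingungen</a>',
        "name-sort": "nutzungsbedingungen",
        "herausgeber": "Informatikdienst",
        "datum": "16.12.2010",
        "datum-sort": "2010-12-16",
    }],
})

# Assumption: the "name" field is a single <a href="...">text</a> snippet.
LINK_RE = re.compile(r'<a href="([^"]*)"[^>]*>(.*?)</a>')


def to_row(item):
    """Flatten one entry into plain columns (hypothetical helper)."""
    name = item.get("name", "")
    match = LINK_RE.search(name)
    return {
        "title": match.group(2) if match else name,
        "url": match.group(1) if match else "",
        "publisher": item.get("herausgeber", ""),
        "date": item.get("datum-sort", ""),
    }


rows = [to_row(item) for item in json.loads(raw_data)["data"]]

# Write to an in-memory buffer; swap in open("out.csv", "w", newline="") for a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "url", "publisher", "date"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The regular expression is only a shortcut for this one known snippet shape. If the name field can contain other markup, parsing it with a proper HTML parser is likely the safer choice.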