Tags: http, nginx, web-scraping, screen-scraping

Scraping countermeasures on nginx?


I'm writing some code that has to get some data from a website. Nothing controversial (I think) - it's for a kids' sports club, and it has to fetch their times from the national organisation's website. It's not proprietary or commercial data.

The problem is that the returned data appears to be deliberately corrupted. I may just be being paranoid, but I've spent a few hours checking this. I'm using my own code, and the Live HTTP Headers Firefox extension to find out exactly what the browser sends to the site. I duplicate the GET headers exactly, except that I leave out Accept-Encoding, since I don't want to handle gzip. I've tried Connection set to both close and keep-alive, but it makes no difference.
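For reference, here's a minimal sketch of the kind of request/recv loop I'm describing (Python for brevity; the host, path, and headers are placeholders rather than the real ones):

    import socket

    HOST = "example.org"    # placeholder for the real site
    PATH = "/results/club"  # placeholder path

    # Hand-built request mirroring the headers captured with Live HTTP Headers,
    # minus Accept-Encoding so the body comes back uncompressed.
    request = (
        f"GET {PATH} HTTP/1.1\r\n"
        f"Host: {HOST}\r\n"
        "User-Agent: Mozilla/5.0\r\n"
        "Accept: text/html\r\n"
        "Connection: close\r\n"
        "\r\n"
    )

    with socket.create_connection((HOST, 80)) as sock:
        sock.sendall(request.encode("ascii"))
        parts = []
        while True:
            buf = sock.recv(4096)  # takes many calls for a full page
            if not buf:
                break
            parts.append(buf)

    response = b"".join(parts)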

The returned page has a few additional hex character sequences spread around it - nothing much, but enough to mess up my parsing. The characters and their locations change on every request. My initial thought was that I was messing up the stitching together of the buffers I get back (I have to call recv maybe 20 times to get the entire page), but that doesn't seem to be the problem. The scraped version of the page always ends like this, for example:

    </body>

    7
    </html>
    0

where the live page always ends </body></html>.

Any idea what's going on here? The site appears to be behind Cloudflare, running nginx. Is this something nginx can do? Is it possible that they're corrupting the plain-text version of the page and sending good data only in the gzipped version? I'm not keen to start unzipping data.


Solution

  • It turns out there's no corruption and no countermeasure: the spurious hex sequences are the chunk-size lines of HTTP/1.1 chunked transfer encoding (Transfer-Encoding: chunked), which my code was treating as part of the body. Requesting HTTP/1.0 avoids this entirely, since chunked encoding doesn't exist in HTTP/1.0: practically any set of headers with HTTP/1.0 returns the body verbatim. This works:

    GET /foo/bar/etc HTTP/1.0
    Host: baz.org
    Connection: close
    

    but this one comes back chunked, with the framing interleaved in the returned doc:

    GET /foo/bar/etc HTTP/1.1
    Host: baz.org
    Connection: close
    

    So, nothing to do with scraping.
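
    If you'd rather keep HTTP/1.1 (for keep-alive, say), the framing is easy to strip once the headers have been split off. Here's a minimal sketch of a chunked-body decoder in Python (the function name, and the assumption that body starts at the first chunk-size line, are mine rather than from any library):

        def decode_chunked(body: bytes) -> bytes:
            # Decodes an HTTP/1.1 chunked transfer-encoded body.
            # Each chunk is: <size in hex>\r\n<size bytes of data>\r\n,
            # terminated by a zero-size chunk.
            out = bytearray()
            pos = 0
            while True:
                eol = body.index(b"\r\n", pos)
                # A chunk-size line may carry extensions after ';' - ignore them.
                size = int(body[pos:eol].split(b";")[0], 16)
                if size == 0:
                    break  # zero-size chunk terminates the body
                start = eol + 2
                out += body[start:start + size]
                pos = start + size + 2  # skip the CRLF that ends the chunk
            return bytes(out)

    With that in place, the stray 7 and 0 in the scraped page above make sense: 7 is the hex size of the final chunk containing </html>, and 0 is the terminating zero-size chunk.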