Search code examples
pythonweb-crawlerpypi

PyPi download counts seem unrealistic


I put a package on PyPi for the first time ~2 months ago, and have made some version updates since then. I noticed this week the download count recording, and was surprised to see it had been downloaded hundreds of times. Over the next few days, I was more surprised to see the download count increasing by sometimes hundreds per day, even though this is a niche statistical test toolbox. In particular, older versions of package are continuing to be downloaded, sometimes at higher rates than the newest version.

What is going on here?

Is there a bug in PyPi's downloaded counting, or is there an abundance of crawlers grabbing open source code (as mine is)?


Solution

  • This is kind of an old question at this point, but I noticed the same thing about a package I have on PyPI and investigated further. It turns out PyPI keeps reasonably detailed download statistics, including (apparently slightly anonymised) user agents. From that, it was apparent that most people downloading my package were things like "z3c.pypimirror/1.0.15.1" and "pep381client/1.5". (PEP 381 describes a mirroring infrastructure for PyPI.)

    I wrote a quick script to tally everything up, first including all of them and then leaving out the most obvious bots, and it turns out that literally 99% of the download activity for my package was caused by mirrorbots: 14,335 downloads total, compared to only 146 downloads with the bots filtered. And that's just leaving out the very obvious ones, so it's probably still an overestimate.

    It looks like the main reason PyPI needs mirrors is because it has them.