Search code examples
pythonsqlgoogle-bigquerypypi

Is it possible to write a BigQuery to retrieve binned counts of PyPI downloads over time?


The following code is an SQL query for google's BigQuery that counts the number of times my PyPI package has been downloaded in the last 30 days.

#standardSQL
SELECT COUNT(*) AS num_downloads
FROM `the-psf.pypi.downloads*`
WHERE file.project = 'pycotools'
  -- Only query the last 30 days of history
  AND _TABLE_SUFFIX
    BETWEEN FORMAT_DATE(
      '%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
    AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())

Is it possible to modify this query so that I get the number of downloads every 30 days since the package was uploaded? The output would be a .csv that looks something like this:

date          count
01-01-2016    10
01-02-2016    20
    ..        ..
01-05-2018    100

Solution

  • I recommend to use the EXTRACT or MONTH() and to count only the file.project field as it will let the query run faster. the query you could use is:

    #standardSQL
    SELECT
      EXTRACT(MONTH FROM _PARTITIONDATE) AS month_, 
      EXTRACT(YEAR FROM _PARTITIONDATE) AS year_,
      count(file.project) as count
    FROM
      `the-psf.pypi.downloads*`
    WHERE
      file.project= 'pycotools'
        GROUP BY 1, 2
        ORDER by 1 ASC
    

    I tried it with the public dataset:

    #standardSQL
    SELECT
      EXTRACT(MONTH FROM pickup_datetime) AS month_, 
      EXTRACT(YEAR FROM pickup_datetime) AS year_,
      count(rate_code) as count
    FROM
      `nyc-tlc.green.trips_2015`
    WHERE
      rate_code=5
    GROUP BY 1, 2
    ORDER by 1 ASC
    

    or using legacy

    SELECT
      MONTH(pickup_datetime) AS month_, 
      YEAR(pickup_datetime) AS year_,
      count(rate_code) as count
    FROM
      [nyc-tlc:green.trips_2015]
      WHERE
      rate_code=5
      GROUP BY 1, 2
      ORDER by 1 ASC
    

    the result is:

    month_  year_   count    
    1       2015    34228    
    2       2015    36366    
    3       2015    42221    
    4       2015    41159    
    5       2015    41934    
    6       2015    39506        
    

    I see you are using _TABLE_SUFFIX, so if you are querying partitioned table you can use the _PARTITIONDATE column instead of formatting the date and using the date_sub function. This will use less compute time as well.

    To query from one partition:

    SELECT
      [COLUMN]
    FROM
      [DATASET].[TABLE]
    WHERE
      _PARTITIONDATE BETWEEN '2016-01-01'
      AND '2016-01-02'