

pypistats.org

2018-04-28

I recently had a week off between jobs to work on some personal projects, one of which was a website I had wanted to build that aggregated download stats for Python packages.

Download stats haven't been available from the Python Package Index (PyPI) for several years now. To get any information on downloads you currently have to query directly against the raw download records hosted on Google BigQuery. Doing so involves setting up an account with Google Cloud services, which gives you up to 1 TB of queried data per month for free; any additional querying costs $5/TB. There are a few Python packages that provide a simple interface for getting aggregate download stats from BigQuery, but they still require authentication to Google Cloud and are subject to the same query limits.
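
For a sense of scale, a single day's download count for one package can be pulled with a query like the one below. This is a sketch using the google-cloud-bigquery client against the current public dataset; the dataset's layout has changed since 2018 (per-day tables then, one partitioned table now):

# Sketch: count one day of downloads for a single package using the
# google-cloud-bigquery client (requires Google Cloud credentials).
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT COUNT(*) AS downloads
    FROM `bigquery-public-data.pypi.file_downloads`
    WHERE file.project = 'requests'
      AND DATE(timestamp) = '2018-04-27'
"""

row = list(client.query(sql).result())[0]
print(f"requests: {row.downloads} downloads")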

I had recently built a web dashboard at my old job and thought I would do the same for Python package downloads. I wrote a few queries to test against BigQuery's data, and after estimating how much data a month's worth of queries would scan, I realized I could stay under the 1 TB limit and still provide a decent amount of data on every package on PyPI.
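
BigQuery can report how much data a query would scan without actually executing it, which is one way to do this kind of estimation. A minimal sketch using the client's dry-run mode, reusing the sql string from the previous sketch:

# Estimate cost with a dry run: BigQuery reports the bytes a query
# would scan without running it or counting against the quota.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

job = client.query(sql, job_config=job_config)  # sql from the sketch above
tb = job.total_bytes_processed / 1e12
print(f"Would scan {tb:.3f} TB (~${5 * tb:.2f} beyond the free tier)")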

Great, I thought. I built the website in a few days using the Flask web framework, with a simple GitHub OAuth integration so GitHub users could track small sets of packages they maintain. I also set up a PostgreSQL database in AWS RDS to store the aggregate data. After doing some research, I thought it would be cool to deploy serverless to AWS using Zappa.
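
The GitHub OAuth handshake itself is small. A minimal sketch of the flow in Flask, with placeholder credentials and routes rather than the site's actual code:

# Minimal sketch of GitHub OAuth in Flask; the credentials, routes,
# and session handling are placeholders, not the production code.
import requests
from flask import Flask, redirect, request, session

app = Flask(__name__)
app.secret_key = "change-me"
CLIENT_ID, CLIENT_SECRET = "<github-app-id>", "<github-app-secret>"

@app.route("/login")
def login():
    # Send the user to GitHub to authorize the app
    return redirect(
        f"https://github.com/login/oauth/authorize?client_id={CLIENT_ID}"
    )

@app.route("/callback")
def callback():
    # Exchange the temporary code for an access token
    resp = requests.post(
        "https://github.com/login/oauth/access_token",
        data={"client_id": CLIENT_ID, "client_secret": CLIENT_SECRET,
              "code": request.args["code"]},
        headers={"Accept": "application/json"},
    )
    token = resp.json()["access_token"]
    # Look up the authenticated user so the app knows whose packages to track
    user = requests.get(
        "https://api.github.com/user",
        headers={"Authorization": f"token {token}"},
    ).json()
    session["username"] = user["login"]
    return redirect("/")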

Zappa deploys Flask or Django projects to AWS serverless infrastructure using API Gateway and Lambda. It's not as simple as that, however. I spent a ton of time creating a minimal IAM policy that would allow a zappa user to touch every AWS service required to provision the site. Since I was using RDS, the Lambda function also had to be part of the same VPC as the database. In addition, the outgoing API calls to GitHub and PyPI (for package metadata) meant the Lambda function had to reside in a subnet with access to a NAT Gateway with an Elastic IP in order to communicate with the internet. These services are not part of the free tier and cost about $50 a month, significantly more than running a single EC2 instance on Elastic Beanstalk (EB). Additionally, the daily ingestion task was timing out on Lambda, which at the time capped execution at five minutes.
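
For reference, the VPC wiring lives in Zappa's zappa_settings.json. A minimal sketch, with placeholder resource IDs and module paths rather than the project's actual settings:

{
    "production": {
        "app_function": "pypistats.wsgi.app",
        "aws_region": "us-east-1",
        "runtime": "python3.6",
        "s3_bucket": "zappa-pypistats",
        "vpc_config": {
            "SubnetIds": ["subnet-0123abcd"],
            "SecurityGroupIds": ["sg-0123abcd"]
        }
    }
}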

I decided EB/EC2 was the way to go. Since Zappa was a no-go, I converted the daily ingestion job to a Celery task. I now had four services to run on my free-tier instance: Flask, the Celery worker, celery-beat (the scheduler), and redis (the message queue). None of the EB-supported platforms could run all of these together, so I either had to create a custom platform using Packer or deploy using Docker.
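
As a sketch, the Celery conversion looks like this; the task body, module names, and schedule are illustrative:

# Sketch: the daily ingestion job as a Celery task with a beat
# schedule; names and timing are illustrative, not the actual code.
from celery import Celery
from celery.schedules import crontab

app = Celery("pypistats", broker="redis://localhost:6379/0")

@app.task
def ingest_download_stats():
    """Pull the day's aggregate download stats from BigQuery and
    write them to the PostgreSQL database."""
    ...

# celery-beat reads this schedule and queues the task once a day;
# the redis broker carries it to the worker.
app.conf.beat_schedule = {
    "daily-ingestion": {
        "task": "tasks.ingest_download_stats",
        "schedule": crontab(hour=0, minute=30),
    },
}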

After creating images with Docker and docker-compose, the solution I settled on was a single Docker image with all of the services managed by supervisor:

FROM python:3.6-slim

# supervisor manages the processes; redis serves as the Celery message broker
RUN apt-get update && apt-get install -y supervisor redis-server
RUN pip install pipenv

# Store the pipenv virtualenv in a known location
ENV WORKON_HOME=/venv

WORKDIR /app

# Copy the dependency manifests first so the install layer caches between builds
COPY Pipfile Pipfile.lock /app/

RUN pipenv install --verbose

# Copy the application code, including the supervisor config and run scripts
COPY . /app

# Flask serves on port 5000
EXPOSE 5000

# Allow the Celery worker to run as the root user
ENV C_FORCE_ROOT=1

# Run supervisor in the foreground as the container's main process
CMD ["/usr/bin/supervisord"]

This image installs supervisor and redis, and manages the Python project's dependencies using pipenv. Supervisor is executed on entry and manages the four services on the same instance:

[supervisord]
; Run in the foreground so the container stays alive
nodaemon=true

; Each program sends its logs to the container's stdout

[program:redis]
; Message broker for Celery
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
stderr_logfile=/dev/stdout
stderr_logfile_maxbytes=0
command=redis-server

[program:pypistats]
; The Flask application
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
stderr_logfile=/dev/stdout
stderr_logfile_maxbytes=0
command=bash -c "scripts/run_flask.sh"

[program:celery-worker]
; Executes the ingestion tasks; runs as an unprivileged user
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
stderr_logfile=/dev/stdout
stderr_logfile_maxbytes=0
user=nobody
command=bash -c "scripts/run_celery.sh"

[program:celery-beat]
; Schedules the daily ingestion task
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
stderr_logfile=/dev/stdout
stderr_logfile_maxbytes=0
command=bash -c "scripts/run_beat.sh"

Each startup script exports environment variables from a file and launches its respective process.
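
The run scripts themselves aren't shown here; a minimal sketch of what scripts/run_flask.sh might look like, assuming a .env file and gunicorn as the WSGI server (the module path pypistats.wsgi:app is a placeholder):

#!/usr/bin/env bash
# Hypothetical sketch of scripts/run_flask.sh; the env file and
# gunicorn invocation are assumptions, not the actual script.
set -o allexport
source /app/.env        # export every variable defined in the file
set +o allexport

# Serve the Flask app on the port exposed by the Dockerfile
exec gunicorn --bind 0.0.0.0:5000 pypistats.wsgi:app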

I launched the instance into EB using the eb CLI. The security groups had to be adjusted to allow communication with the database and to accept HTTPS traffic. The last step was to set up the domain pypistats.org and an SSL certificate using AWS Certificate Manager (ACM, also free), and to point Route 53 at the EB load balancer.
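
Roughly, the deployment amounts to a few eb commands; the application and environment names and the region here are illustrative:

# Initialize the EB application on the Docker platform, create the
# environment, and deploy; names and region are illustrative.
eb init pypistats --platform docker --region us-east-1
eb create pypistats-env
eb deploy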

PyPIStats.org is now live and provides daily aggregate download stats for every Python package, broken down by Python major version, Python minor version, and operating system, as well as overall (with and without known PyPI mirrors). There is also a simple JSON API for retrieving recent download stats and download time series for each package and category.
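
For example, recent download counts for a package can be fetched with a single GET request; the endpoint path here matches the site's current API:

# Fetch recent download counts from the pypistats.org JSON API.
import requests

resp = requests.get("https://pypistats.org/api/packages/requests/recent")
print(resp.json())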

See the project on GitHub and explore the live site.

Further reading

pypistats.org

AWS Serverless

Docker & Supervisor

Python Package Index
