Christopher Flynn

Python development best practices 2019

2019-06-15

Since joining SimpleBet as Platform Data Architect in April, one of my duties has been to improve and productionize our internal machine learning framework, which is built in Python and used by our team of 30+ data scientists, sports betting analysts, and now our first class of interns. When I joined, the framework needed significant improvement: it had been built entirely by the data science team, which includes some brilliant folks, but few with software development experience, and it needed to be cleaned up quite a bit. In the past few weeks it has come a long way, into something much cleaner, better organized, and more robust. Some of the major improvements to the framework so far come from applying the following Python best practices:

Environment Management

Nowadays, using the system-installed Python and pip is a poor way to work on Python projects and will surely lead to trouble down the road, especially when you find yourself maintaining multiple projects across multiple versions of Python. When I joined this project, some of the data scientists were using virtual environments and some weren't; some were on Python 3.6 and others on 3.7. We needed to get everyone working in the same environment. The first step was to get set up with pyenv and poetry.

pyenv

pyenv is a tool for managing multiple versions of Python locally. It's excellent when you need to maintain Python projects that span multiple versions of Python, although it can be a bit tricky to install. A few dependencies are required before using pyenv, plus some additional SDK headers on macOS Mojave. From my experience, the recommended setup steps are as follows:

First, ensure that Xcode is installed along with its command line tools. Xcode can be downloaded from the Mac App Store with an Apple ID. Once it is installed, run the command

xcode-select --install

to ensure the CLI tools are installed. If you are using macOS Mojave (10.14+), you should also run the following command, which installs some additional SDK headers required for building Python.

sudo installer -pkg /Library/Developer/CommandLineTools/Packages/macOS_SDK_headers_for_macOS_10.14.pkg -target /

To install pyenv we will use Homebrew, a package manager for macOS. To install brew, run:

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Before setting up pyenv, however, there are some dependencies we must install first:

brew install openssl readline sqlite3 xz zlib

These dependencies should ensure that Python is installed without warnings or errors. You should also ensure that these values are included in your ~/.bash_profile or equivalent:

# need this to install cryptography
export LDFLAGS="-L$(brew --prefix openssl)/lib"
export CFLAGS="-I$(brew --prefix openssl)/include"

# need this to install python with pyenv
export LDFLAGS="${LDFLAGS} -L/usr/local/opt/zlib/lib"
export CPPFLAGS="${CPPFLAGS} -I/usr/local/opt/zlib/include"
export PKG_CONFIG_PATH="${PKG_CONFIG_PATH} /usr/local/opt/zlib/lib/pkgconfig"

Restart your terminal.

Next install pyenv:

brew install pyenv

Using pyenv we can install the different versions of Python that we need. To install Python 3.7, for instance, we use the command

pyenv install 3.7.3

We can also see which versions we have installed using

pyenv versions

To set the python version for a specific project, we would navigate to the top level folder of a project and run the command

pyenv local 3.7.3

This creates a .python-version file in the folder, which tells pyenv to use Python 3.7.3 whenever you run the python command inside this project. Now we're ready to start using poetry.

poetry

poetry is a tool for managing Python projects. Its biggest advantage is that it manages virtual environments as well as project dependencies, and its excellent dependency resolver makes you far less likely to end up in dependency hell.

To install poetry run:

curl -sSL https://raw.githubusercontent.com/sdispater/poetry/master/get-poetry.py | python

In addition, modify your ~/.bashrc to include these lines

# pyenv
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"

# poetry
source $HOME/.poetry/env

And in your ~/.bash_profile include these lines

# pyenv
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"

# poetry
source $HOME/.poetry/env
export PATH="$HOME/.poetry/bin:$PATH"

Restart your terminal.

To initialize a project using poetry, run the command

poetry init

This sets up a pyproject.toml file (the new standard project config file). The command walks you through defining some metadata about the project and setting up the initial dependencies. I prefer to skip the interactive dependency step and instead add dependencies one at a time using the add command. For example, to add requests as a dependency, we run

poetry add requests

This will create a virtual environment (using Python 3.7, thanks to pyenv and the pyenv local command we ran earlier) and install requests and its sub-dependencies. You will see that requests has been added to the pyproject.toml file under the [tool.poetry.dependencies] section, and that a new file, poetry.lock, has been created. The lock file records every dependency added (requests and its sub-dependencies). Each entry includes the compatible version ranges of other packages for the dependency resolver, as well as hashes of the packages, ensuring that future builds install exactly the same versions of the software. Poetry also installs packages in an order that respects their dependencies, unlike pip, which simply installs packages from requirements.txt in file order (usually alphabetical), which can break builds in which certain packages must be installed before others.
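As a rough sketch, the dependency section of pyproject.toml might look something like this after the command above (the version constraints shown are illustrative, not what poetry would necessarily pin today):

```toml
[tool.poetry.dependencies]
# the supported Python range for the project
python = "^3.7"
# caret constraint added by `poetry add requests`
requests = "^2.22"
```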

To run Python commands through poetry, prefix any command you would normally run with poetry run. For instance, to run your tests using pytest, you should run poetry run pytest.

Alternatively, you can spawn a subshell within the activated virtual environment using poetry shell. This activates the environment and allows you to run commands without the poetry run prefix.

To install development dependencies in our project, such as tools used for testing, formatting, documentation, benchmarking, etc., we pass the -D flag when we install. For instance, if we install the documentation tool sphinx we would run

poetry add -D sphinx

The sphinx package is then added to the [tool.poetry.dev-dependencies] section of the pyproject.toml file. This lets us separate the required runtime dependencies from the packages needed solely for development.
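For example, the dev-dependencies section might then look like this (the version constraint is illustrative):

```toml
[tool.poetry.dev-dependencies]
# development-only tool, not installed in production builds
sphinx = "^2.1"
```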

Testing

Python comes with a unit testing framework built into the standard library, but it's severely limited and not very flexible. More recently, pytest has proven to be highly flexible and robust and is easily the standard testing framework within the community. It also has a lot of plugin support, which broadens its feature set considerably.

pytest

To install pytest as a development dependency in our project

poetry add -D pytest

For using pytest see the documentation.
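As a minimal sketch, a pytest test module is just a file of plain functions; the module and function names below are hypothetical:

```python
# test_addition.py -- pytest collects any file named test_*.py and runs
# any function prefixed with test_; plain assert statements are the
# test assertions, with no TestCase boilerplate required.


def add(a, b):
    """Return the sum of two numbers."""
    return a + b


def test_add():
    assert add(2, 3) == 5
    assert add(-1, 1) == 0
```

Running poetry run pytest from the project root will discover and execute this test automatically.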

Some of the plugins I would recommend are pytest-cov, for integration with the coverage package, and pytest-xdist, for executing tests in parallel across multiple processes. To enable both with the pytest command, include a pytest.ini file at the top level of your project with these contents

[pytest]
addopts = --cov=mypackagefolder/ --cov-report html -n auto --dist=loadscope mytestfolder/

coverage

Running pytest with the options above will create a folder called htmlcov at the top of your project. You can run

open htmlcov/index.html

which will open the coverage report in your web browser. Here you can easily see which portions of your code base were hit, missed, or skipped from your test suite.

vcrpy

For testing parts of your code that scrape pages or hit a REST API for data payloads, it is best practice to mock out the responses of those web services rather than hitting them directly with each run of the tests. One way to do this is to use the vcrpy package, which is more or less a port of Ruby's vcr package.

This allows you to mock the responses by only hitting the service once, the first time you run the test. The package will save the response into a yaml file, which is then played back the next time the request is invoked in the test. This prevents you from hitting (and perhaps overloading) a live service if you have hundreds or thousands of unit tests. It’s easy to implement using a context manager and defining where you want to save the response’s cassette file:

import requests
import vcr

# On the first run, the real request is made and the response is recorded
# to the cassette file; subsequent runs replay the saved response instead
# of hitting the live service.
with vcr.use_cassette("fixtures/vcr_cassettes/synopsis.yaml"):
    response = requests.get("http://www.iana.org/domains/reserved")
    assert "Example domains" in response.text

Code Formatting

One of the biggest issues with a codebase written by many people with different language experience and skill levels is that everyone wants to write Python in their own preferred style. This makes the code hard to read because the codebase is inconsistent, and some folks simply don't abide by agreed-upon code styles. The Python interpreter doesn't care what the code looks like, beyond requiring proper indentation throughout. To solve this issue at SimpleBet, we use black code formatting.

black

black is now an officially maintained Python package and has quickly become the standard Python code formatter. One advantage of black is that it compares your code's abstract syntax tree before and after formatting, to ensure that it doesn't mangle any of your code's logic in the process. It is also unique in that it exposes almost no configuration options other than the line length for wrapping code.
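To give a feel for the kind of changes black makes, here is a hedged before/after sketch (the function is hypothetical; black normalizes quotes to double quotes and spacing around operators and commas):

```python
# Before black (shown as a comment so the snippet stays runnable):
#   def greet(name,punctuation='!'):
#       return 'Hello, '+name+punctuation

# After black:
def greet(name, punctuation="!"):
    return "Hello, " + name + punctuation
```

Both versions have the same abstract syntax tree up to formatting, which is exactly what black verifies before writing the file back.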

To install black (as a development dependency) using poetry, run:

poetry add -D black --allow-prereleases

The --allow-prereleases flag is necessary because, as of the time of writing, black has no formal releases and is still in beta.

To run black against your code, simply run

black .

and in CI integrations, use black --check . to return exit code 1 if the code is not formatted. To add black's limited configuration, include this in your pyproject.toml:

[tool.black]
line-length = 120
target-version = ['py37']
include = '\.pyi?$'
exclude = '''

(
  /(
      \.eggs         # exclude a few common directories in the
    | \.git          # root of the project
    | \.hg
    | \.mypy_cache
    | \.tox
    | \.venv
    | _build
    | buck-out
    | build
    | dist
  )/
  | foo.py           # also separately exclude a file named foo.py in
                     # the root of the project
)
'''

I recommend a line length of 120 to stay compatible with PyCharm's default line-marker settings in the editor (our data science team uses PyCharm).

isort

The isort library is also very convenient for organizing imports. I prefer to separate imports by source, so that there are three blocks of imports at the top of each file: standard library packages, third-party packages, and first-party (local) module imports. Within those blocks I also recommend grouping plain import statements separately from from ... import statements, and alphabetizing each of these sub-blocks independently.
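A sketch of the resulting layout looks like this (the third- and first-party names are hypothetical, shown as comments so the snippet stays runnable with only the standard library):

```python
# Block 1: standard library, plain imports first, then from-imports,
# each sub-block alphabetized independently.
import json
import os

from collections import OrderedDict

# Block 2: third-party packages, e.g.
#   import requests

# Block 3: first-party (local) modules, e.g.
#   import mypackage.utils
```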

All of this is accomplishable with isort, although it can be difficult sometimes to get it to play nice with black formatting. One configuration I’ve found to work pretty well is the following, which can be added to the pyproject.toml file:

[tool.isort]
force_grid_wrap = 0
force_single_line = true
include_trailing_comma = true
line_length = 120
lines_after_imports = 2
multi_line_output = 3
use_parentheses = true

Continuous Integration

Lacking continuous integration tests can be problematic. On many occasions, pull requests were merged into our codebase that left the entire package broken and completely unusable. To remedy this, it is standard practice to run all tests on every pull request, which lets maintainers block broken code from being merged and disrupting colleagues' work. At SimpleBet we use circleci, which integrates seamlessly with GitHub.

circleci

We use a configuration that defines three jobs. The CI jobs leverage the circleci/python Docker image, which comes prepackaged with Python and some tooling, including poetry.

The first job builds the project by installing dependencies with poetry and caching them in an archive that can be quickly restored to speed up future builds, as long as the dependencies haven't changed.

A completed build then triggers two other jobs: running tests with pytest, and checking formatting with black. Here is a sample configuration:

version: 2.1
executors:
  myproject:
    docker:
      - image: circleci/python:3.7.3
    working_directory: ~/repo
jobs:
  build:
    executor: myproject
    steps:
      - checkout
      - restore_cache:
          keys:
            - deps-{{ checksum "poetry.lock" }}
      - run:
          name: Install Dependencies
          command: |
            poetry install
      - save_cache:
          key: deps-{{ checksum "poetry.lock" }}
          paths:
            - /home/circleci/.cache/pypoetry/virtualenvs
  test:
    executor: myproject
    steps:
      - checkout
      - restore_cache:
          keys:
            - deps-{{ checksum "poetry.lock" }}
      - run:
          name: Run tests
          command: |
            poetry run pytest -n 2 --dist=loadscope
  black:
    executor: myproject
    steps:
      - checkout
      - restore_cache:
          keys:
            - deps-{{ checksum "poetry.lock" }}
      - run:
          name: Check code formatting
          command: |
            poetry run black --check .

workflows:
    version: 2
    build_and_test:
      jobs:
        - build
        - test:
            requires:
              - build
        - black:
            requires:
              - build

The test job leverages both cores of our two-vCPU container to speed up the test run. The black job runs the formatter with the --check flag, returning exit code 1 if the formatting does not conform to the standard defined by the package.

Pre-commit

The pre-commit framework is a tool that adds pre-commit hooks to your project. These hooks run every time you execute git commit -m ... and will prevent the commit from succeeding if any hook fails. To install pre-commit, use

brew install pre-commit

To run black as a pre-commit hook, add the following to a .pre-commit-config.yaml file at the top level of your project:

repos:
  - repo: https://github.com/python/black
    rev: stable
    hooks:
      - id: black
        language_version: python3.7

To install the hook, go to the top of your project and run

pre-commit install

Now, every time you commit, black will be run against your code base. If black succeeds the commit is added, and if not the commit is prevented until you format your code.

Further reading

Best practices tools

Python Package Index
