Machine Learning Systems Architect, PhD Mathematician
This post describes what I consider to be Python best practices for software development as of 2019:
Nowadays, using the system-installed Python and pip is a bad way to work on Python projects and will surely lead to trouble down the road, especially when you find yourself working on multiple projects across multiple versions of Python. When I joined this project, some of the data scientists were using virtual environments and some weren't; some were on Python 3.6 and others on 3.7. We needed to get everyone working in the same environment. The first step was to get set up with asdf and poetry.
asdf is a tool for managing different versions of different languages locally. It's excellent when you need to maintain multiple Python projects that span multiple versions of Python. Under the hood it uses pyenv, which can sometimes be a bit tricky to install: there are a few dependencies required before using asdf with pyenv, and some additional SDK headers are needed on macOS Mojave. From my experience, the recommended setup steps are as follows:
First, ensure that Xcode is installed along with its command line tools. To download Xcode, make sure you have an Apple ID and follow this link. Once installed, run the command
xcode-select --install
to ensure the CLI tools are installed. If you are using macOS Mojave (10.14+), you should also run the following command, which installs some additional SDK headers required for building Python:
sudo installer -pkg /Library/Developer/CommandLineTools/Packages/macOS_SDK_headers_for_macOS_10.14.pkg -target /
To install asdf we will use Homebrew, a package manager for macOS. To install brew, run:
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
There are, however, some dependencies we must install before setting up asdf:
brew install coreutils automake autoconf openssl libyaml readline sqlite3 libxslt libtool unixodbc unzip xz zlib curl
These dependencies should ensure that Python installs without warnings or errors. You should also ensure that these values are included in your ~/.bash_profile or equivalent:
# need this to install cryptography
export LDFLAGS="-L$(brew --prefix openssl)/lib"
export CFLAGS="-I$(brew --prefix openssl)/include"
# need this to install python with pyenv
export LDFLAGS="${LDFLAGS} -L/usr/local/opt/zlib/lib"
export CPPFLAGS="${CPPFLAGS} -I/usr/local/opt/zlib/include"
export PKG_CONFIG_PATH="${PKG_CONFIG_PATH}:/usr/local/opt/zlib/lib/pkgconfig"
Restart your terminal.
Next install asdf:
brew install asdf
With asdf we install plugins for whichever language we want to manage. Since we want to manage Python, install the Python plugin:
asdf plugin-add python
With asdf we can then install the different versions of Python that we need. To install Python 3.7, for instance, we use the command
asdf install python 3.7.3
We can also see which versions we have installed using
asdf list python
To see the full list of available versions:
asdf list-all python
To set the Python version for a specific project, navigate to the top-level folder of the project and run the command
asdf local python 3.7.3
This creates a .tool-versions file in the folder, which tells asdf to use Python 3.7.3 whenever you type the command python in this project. Now we're ready to start using poetry.
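The .tool-versions file itself is tiny; after running the command above it contains a single line mapping the tool name to the pinned version:

```
python 3.7.3
```

Since asdf manages many languages through the same mechanism, a polyglot project can pin several runtimes in this one file.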
poetry is a tool for managing Python projects. Its biggest advantage is that it manages virtual environments as well as project dependencies. It also has an excellent dependency resolver, so you are less likely to find yourself in dependency hell.
To install poetry run:
curl -sSL https://raw.githubusercontent.com/sdispater/poetry/master/get-poetry.py | python
In addition, modify your ~/.bashrc to include these lines:
# poetry
source $HOME/.poetry/env
And in your ~/.bash_profile include these lines:
# poetry
source $HOME/.poetry/env
export PATH="$HOME/.poetry/bin:$PATH"
Restart your terminal.
To initialize a project using poetry, run the command
poetry init
This sets up a pyproject.toml file (the new standard project config file). The command walks you through defining some metadata about the project and setting up the initial dependencies. I prefer to skip the interactive part of declaring dependencies here; instead, it's better to add dependencies one at a time using the add command. For example, to add requests as a dependency, we run
poetry add requests
This will create a virtual environment (on Python 3.7, thanks to asdf and the asdf local command we ran earlier) and install requests and its subdependencies. You will see that requests has been added to the pyproject.toml file under the [tool.poetry.dependencies] section, and that a new file, poetry.lock, has been created. This lock file records all of the dependencies added (requests and its sub-dependencies). Each entry includes the compatible versions of other packages required by the dependency resolver, as well as hashes of the packages, ensuring that future builds install exactly the same versions of the software. The order of installation is also maintained in the lock file to help ensure deterministic builds. This is in contrast to pip, which installs packages from requirements.txt in the order of the file (usually alphabetical), which can break builds in which certain packages must be installed before others.
To run Python commands through poetry, prefix any command you would normally run with poetry run. For instance, if you want to run your tests using pytest, you should run poetry run pytest. Alternatively, you can spawn a subshell within the activated virtual environment using poetry shell. This activates the environment and allows you to run commands without the poetry run prefix.
To install development dependencies in our project, such as tools used for testing, formatting, documentation, or benchmarking, we pass the -D flag when we install. For instance, to install the documentation tool sphinx we would run
poetry add -D sphinx
The sphinx package is then added to the [tool.poetry.dev-dependencies] section of the pyproject.toml file. This allows us to separate the required runtime dependencies from the packages required solely for development.
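After the two add commands above, the dependency sections of pyproject.toml look roughly like this (the version constraints shown are illustrative; poetry records whatever latest compatible versions it resolves when you run add):

```toml
[tool.poetry.dependencies]
python = "^3.7"
requests = "^2.22"

[tool.poetry.dev-dependencies]
sphinx = "^2.1"
```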
Python comes with a unit testing framework built into the standard library (unittest), but it's limited and not very flexible. pytest has proven to be highly flexible and robust and is easily the standard testing framework within the community. It also has a lot of plugin support, which broadens its feature set considerably.
To install pytest as a development dependency in our project:
poetry add -D pytest
For using pytest see the documentation.
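As a quick illustration of the style, here is a minimal test module (the slugify function is a made-up example, not part of any project discussed here). pytest collects any file named test_*.py and runs every function whose name starts with test_:

```python
# test_slugify.py


def slugify(title):
    """Lowercase a title and join its words with hyphens."""
    return "-".join(title.lower().split())


def test_slugify_basic():
    assert slugify("Hello World") == "hello-world"


def test_slugify_collapses_whitespace():
    # split() with no arguments drops repeated and surrounding whitespace
    assert slugify("  Python   Best Practices ") == "python-best-practices"
```

Running poetry run pytest from the project root will collect and run both tests.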
Some of the plugins I would recommend are pytest-cov, for integration with the coverage package, and pytest-xdist, for using multiprocessing to execute tests in parallel. To leverage both with the pytest command, include a pytest.ini file at the top level of your project with these contents:
[pytest]
addopts = --cov=mypackagefolder/ --cov-report html -n auto --dist=loadscope mytestfolder/
Running the tests with the above configuration will create a folder called htmlcov at the top of your project. You can run
open htmlcov/index.html
which will open the coverage report in your web browser. There you can easily see which portions of your code base were hit, missed, or excluded by your test suite.
For testing parts of your code that scrape web pages or hit a REST API for data payloads, it's best practice to mock out the responses of those web services rather than hitting them directly with each run of the tests. One way to do this is to use the vcrpy package, which is more or less a port of the vcr package from the Ruby world. It allows you to mock the responses while only hitting the service once, the first time you run the test. The package saves the response into a YAML file, which is then played back the next time the request is invoked in the test. This prevents you from hitting (and perhaps overloading) a live service when you have hundreds or thousands of unit tests. It's easy to use with a context manager, defining where you want to save the response's cassette file:
import requests
import vcr

with vcr.use_cassette("fixtures/vcr_cassettes/synopsis.yaml"):
    response = requests.get("http://www.iana.org/domains/reserved")
    assert "Example domains" in response.text
To enforce a consistent code style everywhere, use black code formatting. black is now an officially maintained Python package and has quickly become the standard Python code formatter. The advantage of black is that it uses your code's abstract syntax tree to compare the functionality before and after formatting, ensuring that it doesn't mangle any of your code's logic in the process. It is also unique in that it exposes almost no configuration options other than the line length for wrapping code.
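The safety check black performs can be sketched with the standard library's ast module: two snippets that differ only in formatting parse to identical syntax trees. This is a simplified illustration of the idea, not black's actual implementation:

```python
import ast

# The same logic, formatted two different ways.
messy = "x=[1,2,3]\ny={ 'a':1 }"
tidy = 'x = [1, 2, 3]\ny = {"a": 1}'

# Whitespace and quote style do not appear in the syntax tree,
# so dumping both ASTs yields identical strings.
assert ast.dump(ast.parse(messy)) == ast.dump(ast.parse(tidy))
print("equivalent")
```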
To install black (as a development dependency) using poetry, run:
poetry add -D black --allow-prereleases
The --allow-prereleases flag is necessary here because, as of the time of writing, black has no formal releases and is still in beta.
To run black against your code, simply run
black .
and in CI integrations, use black --check . to return exit code 1 if the code is not formatted. To add a limited configuration for black, include this in your pyproject.toml:
[tool.black]
line-length = 120
target-version = ['py37']
include = '\.pyi?$'
exclude = '''
(
  /(
      \.eggs       # exclude a few common directories in the
    | \.git        # root of the project
    | \.hg
    | \.mypy_cache
    | \.tox
    | \.venv
    | _build
    | buck-out
    | build
    | dist
  )/
  | foo.py         # also separately exclude a file named foo.py in
                   # the root of the project
)
'''
I recommend a line length of 120 in order to be compatible with PyCharm’s default line marker settings in the editor (our data science team uses PyCharm).
The isort library is also very convenient for organizing imports. I prefer to separate imports by source, so that there are three blocks of imports at the top of each file: standard library packages, third-party packages, and first-party (local) module imports. Within those blocks I also recommend grouping import and from ... import statements, and then alphabetizing each of these subblocks independently.
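Purely as an illustration of that layout (the third-party and first-party package names here are hypothetical), the top of a file would look like:

```python
# standard library
import json
import os
from pathlib import Path

# third-party
import requests
from requests.exceptions import HTTPError

# first-party (local)
import myproject.utils
from myproject.models import Classifier
```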
All of this is achievable with isort, although it can sometimes be difficult to get it to play nicely with black formatting. One configuration I've found to work pretty well is the following, which can be placed in the .isort.cfg file:
[settings]
force_single_line=True
multi_line_output=3
include_trailing_comma=True
force_grid_wrap=0
use_parentheses=True
line_length=120
Lacking continuous integration tests can be problematic. It is standard practice to run all tests on every pull request, which allows maintainers to prevent broken code from being merged. I recommend CircleCI, which integrates seamlessly with GitHub.
We use a configuration which performs three jobs. The CI jobs leverage the circleci/python Docker image, which comes prepackaged with Python and some tooling, including poetry. The first job builds the project by installing dependencies using poetry and caching them in an archive that can be quickly restored to speed up future builds, as long as the dependencies haven't changed. A completed build then triggers two other jobs: running tests using pytest, and checking formatting using black. Here is a sample configuration:
version: 2.1
executors:
  myproject:
    docker:
      - image: circleci/python:3.7.3
    working_directory: ~/repo
jobs:
  build:
    executor: myproject
    steps:
      - checkout
      - restore_cache:
          keys:
            - deps-{{ checksum "poetry.lock" }}
      - run:
          name: Install Dependencies
          command: |
            poetry install
      - save_cache:
          key: deps-{{ checksum "poetry.lock" }}
          paths:
            - /home/circleci/.cache/pypoetry/virtualenvs
  test:
    executor: myproject
    steps:
      - checkout
      - restore_cache:
          keys:
            - deps-{{ checksum "poetry.lock" }}
      - run:
          name: Run tests
          command: |
            poetry run pytest -n 2 --dist=loadscope
  black:
    executor: myproject
    steps:
      - checkout
      - restore_cache:
          keys:
            - deps-{{ checksum "poetry.lock" }}
      - run:
          name: Check code formatting
          command: |
            poetry run isort -y -c
            poetry run black --check .
workflows:
  version: 2
  build_and_test:
    jobs:
      - build
      - test:
          requires:
            - build
      - black:
          requires:
            - build
The test job leverages both cores of our 2 vCPU container to speed up the tests. The isort and black code formatters are run with their check flags so that they return exit code 1 if the formatting does not conform to the standards defined by our configuration.
The pre-commit framework is a tool that adds pre-commit hooks to your project. These hooks run every time you try to git commit -m ... and will prevent the commit from succeeding if they fail. To install pre-commit, use
brew install pre-commit
To install black as a pre-commit hook, add the following .pre-commit-config.yaml file to the top of your project:
repos:
  - repo: https://github.com/python/black
    rev: stable
    hooks:
      - id: black
        language_version: python3.7
To install the hook, go to the top of your project and run
pre-commit install
Now, every time you commit, black will be run against your code base. If black succeeds, the commit is created; if not, the commit is blocked until you format your code.