Machine Learning
Systems Architect,
PhD Mathematician
I use matplotlib’s pyplot regularly to generate plots for product analytics or business intelligence, whether the data comes from an internal data warehouse or an external API. The bare-bones matplotlib defaults are fine for quick-and-dirty plots, but a handful of simple features make these figures much easier to comprehend (and easier on the eyes). This post runs through the ones I use most.
Perhaps one of the simplest changes you can make to your plots is switching the built-in style that pyplot uses to draw them. There are several built-in styles to choose from. You can see a gallery of most of them here. Many of them are based on the seaborn package, a statistical visualization library built on matplotlib. To see a list of built-in styles:
import matplotlib.style
print(matplotlib.style.available)
As of writing this post, the styles include:
Switching to one of these styles takes only one line of code.
import matplotlib.pyplot as plt
plt.style.use('ggplot')
You can also temporarily use one of these styles for a single plot using the style as a context manager.
with plt.style.context('bmh'):
    plt.plot(x, y)
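As a quick check that the context manager really is temporary, the following sketch compares one rc setting before, inside, and after the with block. The sample data and the choice of the ggplot style are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up sample data for illustration
x = [0, 1, 2, 3]
y = [0, 1, 4, 9]

before = plt.rcParams["axes.facecolor"]
with plt.style.context("ggplot"):
    # Inside the block the ggplot settings are active
    inside = plt.rcParams["axes.facecolor"]
    fig, ax = plt.subplots()
    ax.plot(x, y)
# Leaving the block restores the previous settings
after = plt.rcParams["axes.facecolor"]

print(before, inside, after)
plt.close(fig)
```

The plot created inside the block keeps the temporary style, while anything drawn afterwards goes back to the settings that were active before.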
Furthermore, you can design your own fully custom style using a matplotlibrc file. You can find a full example file here.
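Short of writing a full matplotlibrc file, a small custom style can also be expressed as a dictionary of rc settings. The sketch below assumes a few hypothetical preferences (the keys are real rc parameters, the values are only examples) and applies them through the same context manager.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical house style: the values are illustrative, not recommendations
custom_style = {
    "axes.grid": True,
    "lines.linewidth": 2.5,
    "font.size": 12,
}

with plt.style.context(custom_style):
    fig, ax = plt.subplots()
    (line,) = ax.plot([0, 1, 2], [1, 3, 2])

# The line picked up the width from the custom style
linewidth = line.get_linewidth()
print(linewidth)
plt.close(fig)
```

Once the dictionary grows beyond a few entries, moving it into a style file is probably easier to maintain.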
I do a lot of analysis of time series data, and more often than not the x-axis values start out as string representations of dates or timestamps. Luckily, matplotlib handles this data gracefully as long as it consists of date or datetime objects.
If you have string-represented timestamps, you can use Python’s built-in datetime module to convert the values. The method is datetime.strptime(), shown here. If the strings are in a pandas dataframe, you can use the pandas function to_datetime() to convert them, documented here. Matplotlib also provides its own helper functions, date2num() and num2date(), to convert datetimes to the floating-point day numbers it uses internally and back. Example:
import datetime
import pandas as pd
# Some dates
dates = ['2017-01-01', '2017-01-02', '2017-01-03']
# Convert using strptime()
datetimes = [datetime.datetime.strptime(d, '%Y-%m-%d') for d in dates]
# print(datetimes)
# [datetime.datetime(2017, 1, 1, 0, 0), datetime.datetime(2017, 1, 2, 0, 0), datetime.datetime(2017, 1, 3, 0, 0)]
# Create a dataframe with column 'date'
df = pd.DataFrame(dates, columns=['date'])
# Convert the column of strings to timestamps
# Assign the whole column (rather than df.loc[:, 'date']) so the dtype becomes datetime64
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
# print(df['date'].values)
# array(['2017-01-01T00:00:00.000000000', '2017-01-02T00:00:00.000000000',
# '2017-01-03T00:00:00.000000000'], dtype='datetime64[ns]')
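The matplotlib helper functions mentioned above can be tried on their own. This is a minimal sketch with one made-up date; the exact number printed depends on the epoch your matplotlib version uses, so only the one-day difference and the round trip are worth relying on.

```python
import datetime
from matplotlib.dates import date2num, num2date

day = datetime.datetime(2017, 1, 1)

# Dates become floats measured in days
num = date2num(day)
next_num = date2num(day + datetime.timedelta(days=1))

# num2date converts back and returns a timezone-aware datetime (UTC by default)
back = num2date(num)

print(num, next_num - num, back)
```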
Once you’ve converted to datetime objects, you can plot just as you normally would.
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(df['date'], df['y_values'])
One of the downsides of plotting dates on the x-axis is that the x-tick labels usually end up crowding and overlapping. Using pyplot’s setp function, we can set properties of plot objects. We pass the x-tick labels as the first argument; the subsequent keyword arguments are the properties of the tick label text objects that we want to change. We set the rotation to 30 degrees (counter-clockwise) and set ha, the horizontal alignment, to right. This anchors the right end of each tick label at the position it labels.
plt.setp(ax.get_xticklabels(), rotation=30, ha='right')
It turns out that matplotlib also has a figure method that does this automatically, called autofmt_xdate(). Here is how you use it.
fig.autofmt_xdate()
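To see that the two approaches end up with the same label settings, this sketch applies setp to one figure and autofmt_xdate() to another, then reads the rotation and alignment back. The ten days of data are invented for the example.

```python
import datetime
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up data: ten consecutive days
dates = [datetime.date(2017, 1, 1) + datetime.timedelta(days=i) for i in range(10)]
values = list(range(10))

# Manual approach with setp
fig_manual, ax_manual = plt.subplots()
ax_manual.plot(dates, values)
plt.setp(ax_manual.get_xticklabels(), rotation=30, ha='right')

# Automatic approach; the defaults are rotation=30 and ha='right'
fig_auto, ax_auto = plt.subplots()
ax_auto.plot(dates, values)
fig_auto.autofmt_xdate()

manual = {(lab.get_rotation(), lab.get_ha()) for lab in ax_manual.get_xticklabels()}
auto = {(lab.get_rotation(), lab.get_ha()) for lab in ax_auto.get_xticklabels()}
print(manual, auto)

plt.close(fig_manual)
plt.close(fig_auto)
```

The manual version is still handy when you want a different angle, or when only one of several axes needs rotated labels.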
Tick locators can be used to programmatically set the positions of major and minor ticks. The matplotlib ticker module provides a number of tick locators for placing ticks numerically or removing them altogether. There is also the dates module, which is great for setting tick marks on specific dates, e.g. every Monday. This is something I like to use on plots that span several months.
from matplotlib.dates import WeekdayLocator, MO # TU, WE, TH, FR, SA, SU
ax.xaxis.set_major_locator(WeekdayLocator(byweekday=MO))
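One way to confirm what the locator does is to read the tick positions back and convert them to dates. The sketch below assumes sixty days of made-up data and checks which weekday each major tick falls on.

```python
import datetime
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.dates import MO, WeekdayLocator, num2date

# Made-up data: sixty consecutive days
dates = [datetime.date(2017, 1, 1) + datetime.timedelta(days=i) for i in range(60)]
values = list(range(60))

fig, ax = plt.subplots()
ax.plot(dates, values)
ax.xaxis.set_major_locator(WeekdayLocator(byweekday=MO))

# Convert each tick position back to a datetime; weekday() == 0 means Monday
tick_weekdays = [num2date(t).weekday() for t in ax.get_xticks()]
print(tick_weekdays)
plt.close(fig)
```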
You might not need the full timestamp for tick labels, especially for data that you know is fairly recent. Sometimes the month and day are sufficient. For this there is a date formatter as part of the dates API. It takes a datetime.strftime() format string that specifies how the dates should appear.
from matplotlib.dates import DateFormatter
ax.xaxis.set_major_formatter(DateFormatter('%m-%d'))
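A formatter object is callable, so the format string can be previewed without drawing anything. This sketch feeds it a single made-up timestamp.

```python
import datetime
from matplotlib.dates import DateFormatter, date2num

formatter = DateFormatter('%m-%d')

# The formatter takes the numeric date that matplotlib uses for tick positions
label = formatter(date2num(datetime.datetime(2017, 3, 5, 14, 30)))
print(label)  # 03-05
```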
I also use the function formatter to control how large numbers are displayed. This formatter is part of the ticker API and accepts a function with two arguments (a tick value x and a position pos) which returns a formatted string. Here we format numbers on the y-axis with one decimal place and thousands-based magnitude suffixes (e.g. 4,200 = 4.2K, 5,000,000 = 5.0M) in lieu of using commas in the full value.
from matplotlib.ticker import FuncFormatter
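def number_formatter(number, pos=None):
    """Convert a number into a human readable format."""
    suffixes = ['', 'K', 'M', 'B', 'T', 'Q']
    magnitude = 0
    # Stop at the last suffix so very large values cannot index past the list
    while abs(number) >= 1000 and magnitude < len(suffixes) - 1:
        magnitude += 1
        number /= 1000.0
    return '%.1f%s' % (number, suffixes[magnitude])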
ax.yaxis.set_major_formatter(FuncFormatter(number_formatter))
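The formatter function can be sanity-checked outside of a plot, since a FuncFormatter instance is callable with a value and a position. This sketch uses a hypothetical human_readable function that follows the same logic as above, plus a few made-up values.

```python
from matplotlib.ticker import FuncFormatter


def human_readable(number, pos=None):
    """Hypothetical stand-in following the same suffix logic as the post."""
    suffixes = ['', 'K', 'M', 'B', 'T', 'Q']
    magnitude = 0
    # Divide by 1000 until the value is below 1000 or the suffixes run out
    while abs(number) >= 1000 and magnitude < len(suffixes) - 1:
        magnitude += 1
        number /= 1000.0
    return '%.1f%s' % (number, suffixes[magnitude])


formatter = FuncFormatter(human_readable)

# Made-up tick values; the second argument is the tick position
samples = [999, 4200, 5000000, -1500000]
labels = [formatter(value, i) for i, value in enumerate(samples)]
print(labels)  # ['999.0', '4.2K', '5.0M', '-1.5M']
```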
There are two other formatters, StrMethodFormatter and FormatStrFormatter, which accept new-style and old-style format strings, respectively, as arguments to define the format.
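As a rough illustration of the difference, the sketch below builds one of each and calls them directly. The format strings and sample values are only examples; StrMethodFormatter refers to the tick value as x.

```python
from matplotlib.ticker import FormatStrFormatter, StrMethodFormatter

# New-style format string: the tick value is available as x
new_style = StrMethodFormatter('{x:,.0f}')

# Old-style (%-based) format string
old_style = FormatStrFormatter('%.2f')

with_commas = new_style(1234567, 0)
two_decimals = old_style(3.14159, 0)
print(with_commas)   # 1,234,567
print(two_decimals)  # 3.14
```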
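Sometimes it’s good to provide some simple formulas to explain what’s going on in the plots. You probably want to use LaTeX to do it. Matplotlib’s built-in mathtext already renders simple $...$ expressions; to hand text rendering over to a full LaTeX installation (which must be available on your system), alter the matplotlib rc (runtime configuration) to allow the use of the markup.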
from matplotlib import rc
rc('text', usetex=True)
ax.set_title(r"Euler's formula: $e^{i\pi} + 1 = 0$")
Be sure to indicate to Python that you’re using a raw string with the r prefix. This means that Python treats backslashes in the string literally, so you can type pure LaTeX markup without worrying about which characters to escape.
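The effect of the r prefix is easy to see with a command such as \theta, where the leading \t would otherwise be read as a tab. The sketch below also draws a title with matplotlib’s built-in mathtext, with usetex explicitly switched off, so it should run without a LaTeX installation.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

plain = "$\theta$"   # without the prefix, "\t" becomes a tab character
raw = r"$\theta$"    # with the prefix, the backslash is kept as typed

print(repr(plain))
print(repr(raw))

# Render with mathtext only; usetex stays off inside this block
with plt.rc_context({'text.usetex': False}):
    fig, ax = plt.subplots()
    ax.set_title(raw)
    fig.canvas.draw()
    title = ax.get_title()
plt.close(fig)
```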
If you can’t find a spot for the legend that doesn’t obstruct some of your data, you can add some transparency.
ax.legend(loc='best', framealpha=0.5)
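The transparency ends up on the legend frame, which can be inspected afterwards. A small sketch with a made-up series:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4], label='Series A')  # made-up data

legend = ax.legend(loc='best', framealpha=0.5)

# The frame is the box drawn behind the legend entries
frame_alpha = legend.get_frame().get_alpha()
print(frame_alpha)
plt.close(fig)
```

Only the background box becomes translucent; the legend text and line samples stay fully opaque.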
For some plots it might be useful to show the viewer at a glance which days fall on a weekend. For this we can use the weekday() method of datetime objects, which returns a value from 0 to 6 corresponding to Monday through Sunday. Then use the built-in date2num() function from the matplotlib dates module to get the plot position. Finally, use the axvspan() axes method to set the start, end, and color of the background band drawn around the data point.
from matplotlib.dates import date2num
for d in df['date']:
    if d.weekday() in [5, 6]:
        pos = date2num(d)
        ax.axvspan(pos - 0.5, pos + 0.5, color='#DDDDDD')
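Here is the same idea as a standalone sketch, using two made-up weeks that start on a Monday so that exactly four weekend days get a band.

```python
import datetime
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.dates import date2num

# Made-up data: 14 days starting on Monday, 2017-01-02
dates = [datetime.date(2017, 1, 2) + datetime.timedelta(days=i) for i in range(14)]
values = list(range(14))

fig, ax = plt.subplots()
ax.plot(dates, values)

spans = []
for d in dates:
    if d.weekday() in (5, 6):  # 5 = Saturday, 6 = Sunday
        pos = date2num(d)
        # Shade half a day on either side of the data point
        spans.append(ax.axvspan(pos - 0.5, pos + 0.5, color='#DDDDDD'))

weekend_days = [d.isoformat() for d in dates if d.weekday() in (5, 6)]
print(weekend_days)
plt.close(fig)
```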
At the end of the majority of my plots, and before saving to an image, I use the tight_layout() method. This method automatically adjusts the subplot spacing so that axes, tick labels, and titles fit within the figure area. It makes better use of the figure space and can minimize cropping and overlapping between subplots, but beware that it doesn’t always work.
plt.tight_layout()
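A typical ordering is to build every subplot first, call tight_layout(), and only then save. The sketch below assumes a 2x2 grid of made-up panels and writes the PNG to an in-memory buffer instead of a file.

```python
import io
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(6, 4))
for i, ax in enumerate(axes.flat):
    ax.plot([0, 1, 2], [0, i, 2 * i])  # made-up data
    ax.set_title('Panel %d' % i)
    ax.set_xlabel('x')
    ax.set_ylabel('y')

# Adjust spacing after all titles and labels exist, then save
fig.tight_layout()

buf = io.BytesIO()
fig.savefig(buf, format='png')
png_bytes = buf.getvalue()
print(len(png_bytes))
plt.close(fig)
```

Calling tight_layout() before the labels and titles are set means it cannot account for them, so it usually belongs right before savefig() or show().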
Here’s a simple example using some generated data. We use almost all of the topics discussed in this post.
source:
import datetime
import random
from matplotlib.dates import DayLocator, DateFormatter, date2num
from matplotlib.ticker import FuncFormatter
from matplotlib import rc
import matplotlib.pyplot as plt
import pandas as pd
# Activate latex and the bmh style
rc('text', usetex=True)
plt.style.use('bmh')
# Define the nice number formatter
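def number_formatter(number, pos=None):
    """Convert larger number into a human readable format."""
    suffixes = ['', 'K', 'M', 'B', 'T', 'Q']
    magnitude = 0
    # Stop at the last suffix so very large values cannot index past the list
    while abs(number) >= 1000 and magnitude < len(suffixes) - 1:
        magnitude += 1
        number /= 1000.0
    return '%.1f%s' % (number, suffixes[magnitude])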
# Create some data and the dates
data = [400000 + 50 * t ** 3 + random.normalvariate(0, 50000) for t in range(30)]
dend = datetime.date.today()
delta = datetime.timedelta(days=1)
dates = [dend - (29 - t) * delta for t in range(30)]
# Set up the dataframe
df = pd.DataFrame(columns=['date', 'visitors'])
df['date'] = dates
df['visitors'] = data
# Create the plot
fig, ax = plt.subplots(figsize=(10,3))
ax.plot(df['date'], df['visitors'], label='Blog Visitors')
# X-axis
plt.setp(ax.get_xticklabels(), rotation=30, ha='right')
ax.xaxis.set_major_locator(DayLocator())
ax.xaxis.set_major_formatter(DateFormatter('%m-%d'))
ax.set_xlim(df['date'].values[0], df['date'].values[-1])
# Y-axis
ax.yaxis.set_major_formatter(FuncFormatter(number_formatter))
ax.set_ylim(bottom=0)
# Labels
ax.set_title(r"Daily Blog Views: $\sum_{i=0}^{\infty} v_i \to \infty$")
ax.set_xlabel('Date')
ax.set_ylabel('Visitors')
ax.legend(loc='best', framealpha=0.5)
# Weekend indicators
for d in df['date']:
    if d.weekday() in [5, 6]:
        pos = date2num(d)
        ax.axvspan(pos - 0.5, pos + 0.5, color='#DDDDDD')
# Use tight layout
fig.tight_layout()
plt.show()
Happy (and better) plotting!