Databricks DBAPI using pyhive

2018-09-10

I’ve been working a lot with Databricks recently to optimize our ETL, analytics, and data science workflows. Databricks is an analytics platform and orchestration tool built on top of the Apache Spark analytics engine with an Apache Hive-compatible SQL dialect. In particular, we’ve been repartitioning some of our larger tables in an attempt to speed up query execution.
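For context, the repartitioning itself is ordinary Spark. A minimal sketch, assuming a SparkSession bound to the name spark as in Databricks notebooks; the table name events, the column event_date, and the partition count 200 are hypothetical:

# Rewrite a table, redistributing rows by a commonly filtered column.
# The table, column, and partition count here are hypothetical.
df = spark.table("events")
(df.repartition(200, "event_date")
   .write
   .mode("overwrite")
   .partitionBy("event_date")
   .saveAsTable("events_by_date"))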

Databricks provides a notebook-style interface for interacting with tables, but the interface is browser-based and can be pretty clunky with the way DOM elements are updated on the fly. The syntax highlighting provided by the notebooks is also limited and cannot be customized. For these reasons, I prefer to use JupyterLab when prototyping analytics tasks.

Since Databricks is Hive-like under the hood, one can connect to an interactive Databricks cluster using a JDBC driver for Hive. I use this method to connect via DataGrip for sample queries. Using Dropbox's pyhive package, a DBAPI connection/cursor pair can be made to a Databricks cluster with Python. Here is Databricks' example, using Python 2:

import os
import sys
from pyhive import hive
from thrift.transport import THttpClient
import base64

# Personal access token and workspace hostname are read from the environment.
assert os.environ.has_key('DB_TOKEN'),\
    'Databricks token is required, use DB_TOKEN environment variable'
TOKEN = os.environ['DB_TOKEN']

assert os.environ.has_key('WORKSPACE_NAME'),\
    'Workspace name is required, use WORKSPACE_NAME environment variable'
WORKSPACE_NAME = os.environ['WORKSPACE_NAME']

assert len(sys.argv) > 1, "Specify cluster id or name"
CLUSTER = sys.argv[1]

# Thrift-over-HTTP transport pointed at the cluster's SQL endpoint.
transport = THttpClient.THttpClient(
    'https://%s:443/sql/protocolv1/o/0/%s' % (WORKSPACE_NAME, CLUSTER))
# HTTP basic auth with the literal username "token" and the token as password.
transport.setCustomHeaders({
    'Authorization': 'Basic %s' % base64.standard_b64encode('token:%s' % TOKEN)})

# Open a Hive DBAPI connection over the transport and grab a cursor.
cursor = hive.connect(thrift_transport=transport).cursor()

cursor.execute('show tables')
for table in cursor.fetchall():
    print(table)

sys.exit()
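Note that this example is Python 2 only: dict.has_key is gone in Python 3 ('DB_TOKEN' in os.environ replaces it), and base64.standard_b64encode requires bytes there. A minimal sketch of the Python 3 version of the header construction:

# Python 3: encode to bytes before base64-encoding, decode back to str for the header.
auth = base64.standard_b64encode(('token:%s' % TOKEN).encode('utf-8')).decode('utf-8')
transport.setCustomHeaders({'Authorization': 'Basic %s' % auth})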

I’ve written a simple wrapper around pyhive called databricks-dbapi, which provides a DBAPI cursor with no setup beyond the connection parameters and credentials:

import os
from databricks_dbapi import databricks

TOKEN = os.environ["DATABRICKS_TOKEN"]  # dbapi12345678....
HOST = os.environ["DATABRICKS_HOST"]  # <workspace>.cloud.databricks.com
CLUSTER = "my-interactive-cluster"

cursor = databricks.connect(HOST, CLUSTER, TOKEN).cursor()

cursor.execute("SELECT * FROM my_table LIMIT 100")
for record in cursor.fetchall():
    print(record)

The package makes it trivial to open a connection to a cluster, and it is compatible with both Python 2 and 3.
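Since the returned object is a standard DBAPI connection, it can typically be handed to tools that accept one. Here is a sketch of pulling a query result into a pandas DataFrame (pandas only officially tests read_sql against SQLAlchemy and sqlite3 connections, but plain DBAPI connections generally work):

import os

import pandas as pd
from databricks_dbapi import databricks

TOKEN = os.environ["DATABRICKS_TOKEN"]
HOST = os.environ["DATABRICKS_HOST"]
CLUSTER = "my-interactive-cluster"

# pandas issues the query through the connection's cursor.
connection = databricks.connect(HOST, CLUSTER, TOKEN)
df = pd.read_sql("SELECT * FROM my_table LIMIT 100", connection)
print(df.head())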

See the project on GitHub for more info.

Further reading

databricks-dbapi
Dependencies
Python Package Index
