
Databricks Dialect for SQLAlchemy

2019-08-14

Thanks to contributions by Evan Thomas, the Python databricks-dbapi package now provides a Databricks dialect for SQLAlchemy. This is particularly useful for hooking Databricks up to a dashboard frontend application like Apache Superset. The dialect is compatible with both standard Databricks and Azure Databricks.


The SQLAlchemy support comes as an optional extra. It can be installed with pip using the command

pip install databricks-dbapi[sqlalchemy]

This also registers the prefix databricks+pyhive as a dialect-driver combination for SQLAlchemy URLs.

To set up a SQLAlchemy engine, use one of the following examples:

from sqlalchemy.engine import create_engine

# Standard Databricks with token
# provide token, company name for url, database name, cluster name
engine = create_engine(
    "databricks+pyhive://token:<databricks_token>@<companyname>.cloud.databricks.com:443/<database>",
    connect_args={"cluster": "<cluster>"}
)

# Azure Databricks with token
# provide token, region for url, database name, http_path (with cluster name)
engine = create_engine(
    "databricks+pyhive://token:<databrickstoken>@<region>.azuredatabricks.net:443/<database>",
    connect_args={"http_path": "<azure_databricks_http_path>"}
)

The URLs passed to the create_engine function are a bit different from the connection URLs specified in the Databricks ODBC/JDBC documentation. This is due to the way SQLAlchemy parses database URLs internally. The actual URL paths are constructed properly in the Databricks connect() function, which wraps Hive/Thrift.
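To see what SQLAlchemy does with such a URL, here is a quick sketch using SQLAlchemy's make_url helper, with the same placeholder values as the examples above:

from sqlalchemy.engine.url import make_url

# SQLAlchemy splits the URL into driver, credentials, host, port,
# and database; the dialect then rebuilds the actual Thrift
# connection path from these pieces inside connect()
url = make_url(
    "databricks+pyhive://token:<databricks_token>@<companyname>.cloud.databricks.com:443/<database>"
)
print(url.drivername)  # databricks+pyhive
print(url.host)        # <companyname>.cloud.databricks.com
print(url.port)        # 443
print(url.database)    # <database>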

For standard Databricks, it is sufficient to pass the cluster name in connect_args, in addition to filling in the URL as in the example above.

For Azure Databricks, pass the full http_path from the cluster's ODBC/JDBC connection page in connect_args; this path includes the account ID and cluster name.

In either case, you should be able to access every database on the cluster, regardless of which database is passed in the URL (the default is default).
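For example, once the engine is created you can query tables in any database by qualifying the table name. A minimal sketch, assuming a hypothetical table other_database.some_table exists on the cluster:

from sqlalchemy import text

# the database and table names here are hypothetical placeholders
with engine.connect() as conn:
    rows = conn.execute(text("SELECT * FROM other_database.some_table LIMIT 10"))
    for row in rows:
        print(row)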

Further reading

Databricks
