Machine Learning
Systems Architect,
PhD Mathematician
Thanks to contributions by Evan Thomas, the Python databricks-dbapi package now supports using Databricks as a SQL dialect within SQLAlchemy. This is particularly useful for hooking up Databricks to a dashboard frontend application like Apache Superset. It provides compatibility with both standard Databricks and Azure Databricks.
The SQLAlchemy support comes as an extra integration. It can be installed optionally with pip
with the command
pip install databricks-dbapi[sqlalchemy]
This also registers the prefix databricks+pyhive
as a dialect-driver combination for SQLAlchemy urls.
To set up a SQLAlchemy engine, use one of the following examples:
from sqlalchemy.engine import create_engine
# Standard Databricks with token
# provide token, company name for url, database name, cluster name
engine = create_engine(
"databricks+pyhive://token:<databricks_token>@<companyname>.cloud.databricks.com:443/<database>",
connect_args={"cluster": "<cluster>"}
)
# Azure Databricks with token
# provide token, region for url, database name, http_path (with cluster name)
engine = create_engine(
"databricks+pyhive://token:<databrickstoken>@<region>.azuredatabricks.net:443/<database>",
connect_args={"http_path": "<azure_databricks_http_path>"}
)
The urls passed to the create_engine
function are a bit different from the connection URLs specified in the Databricks ODBC/JDBC documentation. This is due to the way that SQLAlchemy internally parses database URLs. The actual URL paths are constructed properly in the Databricks connect()
function which wraps Hive/Thrift.
For standard Databricks, it is sufficient to pass the cluster
name in connect_args
in addition to the filled URL example given.
For Azure Databricks, one should pass the full http_path
in connect_args
from the ODBC/JDBC connection page for the cluster, which includes the account ID and cluster name.
In either case you should be able to access every database regardless of which database (default is default
) is passed in the URL.