The following post covers the pros, cons and step-by-step instructions for connecting a Databricks cloud data warehouse to Microsoft Excel. The spreadsheet is often the preferred tool for building models and combining various data sets into one analysis. Connecting Databricks to Microsoft Excel can make it easy for business teams to pull real-time data into Excel for analysis. An alternative to Excel is Row Zero, a spreadsheet designed for big data and more easily connects to Databricks and other cloud data warehouses. Read on for more information or skip to specific sections using the table of contents below.
Table of Contents
- Power Query (6 steps + ODBC driver installation)
- Power Pivot (8 steps + ODBC driver installation)
- Microsoft Query (5 steps + ODBC driver installation)
- Open CSV (8 steps) Row Zero - Bonus (3 steps)
- What is Databricks?
- Benefits of connecting Databricks to Microsoft Excel
- Considerations when connecting Databricks to Microsoft Excel
- Conclusion
How to Connect Databricks to Excel
Download and Install the Databricks ODBC Driver
In order to make use of many of the below options for connecting Microsoft Excel to Databricks, you must first download and install the Databricks ODBC driver. Instructions and download links can be found in Databricks ODBC driver documentation. You can also go straight to operating system specific instructions:
Operating system agnostic Databricks ODBC driver instructions are included below (Linux, Mac OS, Windows)
First visit the downloads page and download the appropriate Databricks ODBC driver for your operating system (Linux, Mac, or Windows).
Install the Databricks ODBC Driver Once you've downloaded the correct drive version, unzip or decompress the file in your downloads folder. Inside, you will find an executable file type (e.g. Simba Spark 2.8 64-bit.msi). Double click on the executable to install the driver and follow the steps in the installation wizard.
Configure and Add the Databricks ODBC Driver
In Windows, go to the start menu and search for 'ODBC Data Sources.' Select and open the ODBC Data Source Administrator.
Navigate to the 'Drivers' tab and look for your Databricks driver in the list. If you don't see it, it has not been installed correctly and you should try repeating the previous step. Assuming your driver is visible in the list, click on the 'User DSN' or 'System DSN' tabs, click 'Add,' select the Databricks driver from the list, and click 'Finish.'
Next add the connection credentials for your Databricks database. You will need hostname, which can be found Databricks under 'SQL Warehouses > Connection details (e.g. dbc-7hle3mj4-wu38.could.databricks.com). The port is '443'. The mechanism should be set to 'OAuth 2.0'.
Next click on 'OAuth Options' and on the next screen, under Authentication Flow select 'Brower Based Authorization Code' and uncheck 'Ignore SQL_DRIVER_NONPROMPT.' Then click 'OK'.
Now back on the DSN setup page, click 'HTTP Options'. Navigate to your Databricks account and copy the HTTP path from Connection details (e.g. /sql/1.0/warehouses/09u4rj09jf0439ur39jf). Paste the HTTP path back in your DSN setup.
Next click on 'SSL Options' and select 'Enable SSL' and click 'Ok.'
Lastly, hit 'Test' at the bottom of the DSN Setup window to test the connection.
4 Best Ways to Import Databricks data to Excel
The following instructions cover 4 methods for connecting a Databricks cloud data lakehouse to Excel and one additional option for using a more powerful spreadsheet.
1. Connect Excel to Databricks with Power Query
- The following instructions assume Databricks ODBC driver has already been downloaded.
- Open Excel and click Data > Get Data > From Other Sources > From ODBC.
- In the 'From ODBC' menu select the data source name given when the ODBC driver was installed.
- Click on advanced options to enter a SQL statement to be executed upon connection or use the bottom drop down to select methods for reducing row size for the query results. Row size is important because Microsoft Excel can slow down or crash when it tries to open large data sets.
- When prompted, enter the Databricks username and password for the database. If the database is not password protected select 'default or custom' in the upper right corner. Then press 'Connect'
- A Navigator window will appear showing all the tables in your database. Select the table you want to work with and click 'Load'
- Finally, your data will be visible in Microsoft Excel in a data table. Excel data tables have different properties than data normally displayed in Excel but that is content for a separate blog.
2. Connect Excel to Databricks with Power Pivot
- The following instructions assume the Databricks ODBC driver has already been downloaded.
- Open Excel and click on the 'Power Pivot' heading from the top level menu.
- Once the Power Pivot menu opens, click on Manage > Get External Data > From Other Sources. Get external data from other sources
- Select 'Others (OLEDB/ODBC)' from the list of options
- Click on the 'Build' button under the text box.
- On the Data Link Properties screen click on the 'Provider' tab and select 'Microsoft OLE DB Provider for ODBC Drivers.'
- Return to the connection tab and select your DSN from the 'Data Source Name' drop-down. If you don't see your Databricks DSN, review the Databricks ODBC download instructions.
- Enter your user name and password, and test the connection.
- Click 'Next' and use the 'Select Tables and Views' page of the wizard to choose the tables to import.
3. Connect Excel to Databricks with Microsoft Query
- The following instructions assume the Databricks ODBC driver has already been downloaded.
- In Excel, click on the data tab, then 'other sources' and 'Microsoft Query.'
- In the 'Choose Data Source' Menu, select your DSN from the list.
- Uncheck the box for 'Use the Query Wizard to Create/Edit Queries' and click 'Ok.
- On the following screen select the tables to add to your query and click the 'add' button.
- In the graphical editor there are options to filter, sort, join, and create other parameters for your query.
4. Download and Open
The CSV option is simple and free but is not an actual live connection between Excel and Databricks. CSV import is a one-time data import to Excel that will need to be repeated every time the data in Databricks changes. Follow the steps below for the CSV import from Databricks to Excel.
- Log in to your Databricks Account
- Click on 'SQL Editor' in the Databricks UI
- Write your query. 'Select * from tablename' is an easy way to pull in all data from a particular table. If SQL is a familiar language, write a more complex query to pull the data needed into Excel. You can view the tables in your data warehouse in the menu on the left.
- Run your query with the 'run' button. Using the drop down, you can decide if you want to limit the run to 1000 results or all the results.
- Download results by clicking on the 'Raw Results' button and selecting 'download as csv' or 'Download Excel.' Remember, if you download to Excel and your data is bigger than the Excel row limit, your data won't fit in an Excel spreadsheet. If you download the CSV, you can then open the CSV with Excel or other applications.
- Repeat the process if Databricks data is updated.
Row Zero
Connect Databricks to Row Zero
Row Zero is the world's fastest spreadsheet that runs in the cloud, making it easy to connect to data warehouses, like Databricks, and import giant datasets. Once data is in Row Zero, users can filter/sort, pivot, graph, and use all the spreadsheet functions they are familiar with.
How to Analyze Databricks Data in a Spreadsheet
There are 3 steps to easily connect Databricks to Row Zero and analyze data in a spreadsheet. Follow the instructions below to get started.
- Connect Row Zero to your Databricks account
- Write a query to import data
- Begin analysis in Row Zero
What is Databricks?
Databricks Computing is a cloud data warehousing platform that simplifies data storage and analysis. Unlike traditional data warehousing solutions, Databricks operates on a cloud-native architecture, enabling organizations to seamlessly store, manage, and analyze vast amounts of data across multiple cloud data sources without the constraints of hardware limitations. Databricks utilizes a structured data storage model known as a "data lakehouse," which combines the benefits of data warehousing and data lakes. This architecture allows for separation of storage and compute, providing elasticity and scalability.
Benefits of connecting Databricks to Microsoft Excel
Real-time Data
Connecting Microsoft Excel to Databricks database enables financial and accounting models to be built on production or operational real-time data stored in Databricks. Rather than static models or analysis that are stale as soon as they are built, connecting to a Databricks database ensures models are always up-to-date.
Refresh Live Data Models
Once you connect Databricks to Excel, you have a live connection and it is easy to re-run or refresh the Databricks query to pull the latest dataset into the spreadsheet without having to recreate another static analysis. Live data models are time savers and keep your analyses fresh.
Dashboards
When you connect Databricks to Excel, it is easy to turn a simple analysis into a dashboard or live model. Simply refresh the queried data from Databricks and all graphs, pivot tables, and spreadsheet models will update.
Considerations when connecting Databricks to Microsoft Excel
Security
Importing data from Databricks to Excel poses an information security risk because the data is moving from a cloud data warehouse where activity is traceable and behind layers of security to a locally stored Excel file stored on a computer. Once the data is in Excel, that data goes wherever the computer goes. The computer could be lost, stolen, or shared. The file could also be emailed to personal who wouldn't otherwise have access to the data. A more secure option is a cloud based spreadsheet, like Row Zero
Data Set Size
While Microsoft Excel is a great tool it cannot easily support large data sets. Microsoft Excel will often slow down or crash as data set sizes reach tens or hundreds of thousands rows Excel has a 1,048,576 row limit, which we've written about here. It is common for real-time data sets stored in Databricks's cloud data lakehouse to be quite large. To import data from Databricks to Microsoft Excel, the Databricks query will need to trim the data down to a size Microsoft Excel can manage. There are options within Microsoft Excel, like data model and power query, to enable work with large data sets but they don't allow users to see all the data or use the same spreadsheet flexibility Excel supports with smaller data sets.
Sharing
Microsoft Excel is offered in two versions, desktop and Excel 365. Desktop Excel is the traditional version everyone thinks of. The program runs on local computers and is fast and performant. Desktop Excel is difficult to share due to the fact that it runs on users' local computers. It is common for teams to email around multiple different versions creating static one off analyses that need to be updated, versioned, or can get out of sync if two users make updates to the same version. Companies often rely on cloud data storage repositories like Sharepoint to manage all the files with checkout and check-in logic that preserve spreadsheet version integrity. Excel 365 is the new version of Excel that makes sharing easy, though it is not commonly used for big models or data analysis because it is slower than desktop Excel. If a fast powerful spreadsheet with the ability to share across team members is desired, Row Zero is powerful spreadsheet that runs in the cloud and can be shared with anyone
Complexity
Creating a live connection from Databricks to Microsoft Excel is not as easy and intuitive as one might think. There are often complications that can be hard to resolve. The information below does its best to help prevent errors that will inevitably occur and there is plenty of content online to help resolve each one.
Conclusion
After reviewing many of the most common options for connecting Databricks cloud data warehouse to Excel, it is clear there are many trade-offs to consider. The advantages to a live connection are, making dashboards, and working on live data models. All are possible with a Microsoft Excel connection to Databricks. There are several downsides to connecting Databricks to Excel. First, Many options are somewhat complicated. The step by step instructions outlined in this post are not terribly complicated but can stump even technical analysts navigating different versions of Excel and drivers. Second, due to Excel running locally, the ODBC drivers must be downloaded and installed on each individual computer, which prevents Excel workbooks from being shared with others. Third, Excel begins to slow down the more data you load. Even with Power Pivot and data tables, Excel will slow down or freeze with big data sets. The processing power of each local computer is yet another constraint. While Excel is the ubiquitous analytics software, Row Zero offers a more robust spreadsheet option that functions just like Excel but with more power for bigger data sets, easy shareability, and easier connections from Row Zero to Databricks. Give Row Zero a try for free.
If you use another cloud data warehouse or blob storage, Row Zero also connects to Redshift, Snowflake, Postgres, and S3.