Matillion ETL is a powerful cloud platform for integrating with data warehouses such as Snowflake, Redshift, Microsoft Synapse and Google BigQuery, but have you ever needed to integrate on-premises sources?
Forget about complicated VPNs, SSH tunnels or opening ports!
By combining Matillion ETL with Progress DataDirect Hybrid Data Pipeline, we will explore how easily and securely data behind the firewall can be exposed to Matillion ETL by leveraging its support for third-party JDBC drivers.
In this tutorial, I will walk you through the following four easy steps, with more detailed instructions below.
Step 1 – Install and Configure Hybrid Data Pipeline
Install Progress DataDirect Hybrid Data Pipeline into AWS using this tutorial.
Be sure to download all of the components you will require from the trial site:
- Hybrid Data Pipeline Server
- JDBC Driver
- On-Premises Connector
After you have installed Hybrid Data Pipeline, you will need to open TCP ports 8443, 40501 and 11280 in the cloud so the On-Premises Connector can access the server. In my case, I created an AWS security group specifically for the Hybrid Data Pipeline server.
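If your server runs in AWS as described, the security group rules can be scripted with the AWS CLI. This is only a minimal sketch: the group ID and source CIDR below are placeholders you must replace, and the commands are echoed rather than executed so you can review them before applying:

```shell
#!/bin/sh
# Placeholders -- substitute your own security group ID and the CIDR range
# the On-Premises Connector will connect from.
GROUP_ID="sg-0123456789abcdef0"
SOURCE_CIDR="203.0.113.0/24"

# The three TCP ports the Hybrid Data Pipeline server needs open.
for PORT in 8443 40501 11280; do
  # Echoed for review; remove the leading "echo" to actually apply each rule.
  echo aws ec2 authorize-security-group-ingress \
    --group-id "$GROUP_ID" \
    --protocol tcp \
    --port "$PORT" \
    --cidr "$SOURCE_CIDR"
done
```

Restricting the source CIDR to the network your On-Premises Connector egresses from is tighter than opening the ports to the world.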
You should be able to access https://<hdpserver>:8443 from the system you plan to install the On-Premises Connector (OPC) on. Be sure you can log in using the d2cadmin credentials you specified during the installation.
Step 2 – Install the Progress DataDirect Hybrid Data Pipeline On-Premises Connector
To enable connections from the cloud to your on-premises sources, you will need to install the On-Premises Connector, a component of Hybrid Data Pipeline.
First, download the four redist files that were created when you installed Hybrid Data Pipeline. You can find these on your Linux server here:
Next, place the four redist files, along with the On-Premises Connector installer downloaded earlier, into a folder on the Windows system where you plan to install it. Run the installer following this guide.
Once installed, verify that it connects to the Hybrid Data Pipeline server by opening the OPC configuration tool from your Program Files menu. Click the Status tab, then click Test. All of the tests should return green:
You may also find this tutorial helpful.
Step 3 – Create a data source in Hybrid Data Pipeline
Next, you will define a data source within the Hybrid Data Pipeline interface to connect to your on-premises data source.
- Log into your Hybrid Data Pipeline instance: https://<server>:8443
- Follow this documentation to create a data source. The server name, port and credentials should be specified from the perspective of the On-Premises Connector.
- Be sure to select your On-Premises connector at the bottom of the data source configuration page in the Connector ID drop down menu:
- Click Test to verify the server can successfully connect to your data source.
Step 4 – Install the Hybrid Data Pipeline JDBC Driver into Matillion ETL to load the data
To connect Matillion ETL to your data source, you will need to install the JDBC driver you downloaded earlier. I installed it on my local machine, where I access the Matillion web interface. Documentation for installing the JDBC driver can be found in these directions. Note that you will be using the redist files downloaded from the Hybrid Data Pipeline server above.
Once installed, you will need to add the driver profile to Matillion ETL:
- Connect to your Matillion server via SSH
- Modify the /usr/share/emerald/WEB-INF/classes/jdbc-providers.properties file as noted in this Matillion article.
- As an example, I added this section to the bottom of the file (note I use a different port in my environment):
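An illustrative entry is sketched below. The property key names are assumptions on my part — use the exact key format shown in the Matillion article linked above — but the driver class and URL prefix are those of the DataDirect Hybrid Data Pipeline JDBC driver:

```properties
# Illustrative only -- match the key format from the Matillion article.
# <hdp-server> is your Hybrid Data Pipeline host; 8443 is the default port
# (adjust if, like me, you use a different one in your environment).
HybridDataPipeline.NAME=Hybrid Data Pipeline
HybridDataPipeline.DRIVER=com.ddtek.jdbc.ddhybrid.DDHybridDriver
HybridDataPipeline.URL=jdbc:datadirect:ddhybrid://<hdp-server>:8443
```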
- Restart Matillion ETL via the Admin menu.
- In the Matillion ETL Web UI, under Admin, select Manage Database Drivers:
- Select Hybrid Data Pipeline under Group
Browse to the directory where you installed the JDBC driver and upload ddhybrid.jar. On my system, it is here:
/Users/aburg/Progress/DataDirect/Progress DataDirect Hybrid Data Pipeline for JDBC 4.6/lib
Click Test to verify the driver can be loaded.
You should now be able to create a database query orchestration job in Matillion ETL which connects to the on-premises data source. For the connection URL, be sure to define the path to your Hybrid Data Pipeline server and the name of the data source you created.
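Assuming the default port, the connection URL follows the Hybrid Data Pipeline JDBC driver's URL format, with the data source passed as a connection property. The bracketed values are placeholders for your own server and the data source name you created in Step 3:

```
jdbc:datadirect:ddhybrid://<hdp-server>:8443;hybridDataPipelineDataSource=<data-source-name>
```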
The username and password are your Hybrid Data Pipeline account credentials.
If you are interested in using Progress DataDirect Hybrid Data Pipeline to simplify on-premises data access from Matillion ETL, please download a trial to get connected to your data quickly and safely or contact us to learn more.