Connect to On-Premises databases from Google Composer

Introduction

Google Cloud Composer is a fully managed workflow orchestration service built on Apache Airflow and operated using Python. It enables you to author, schedule and monitor workflows as directed acyclic graphs (DAGs) of tasks.

In this tutorial, we will walk you through how you can connect to an on-premise datasource Hive behind a private firewall from Google Cloud Composer using the Progress DataDirect Hybrid Data Pipeline.

Setting Up Progress DataDirect Hybrid Data Pipeline

Follow these easy instructions to get started.

  1. Install Hybrid Data Pipeline in your DMZ or in the cloud by following the below tutorials for:
  2. To connect to On-Premises databases, you need to install an On-Premises agent on one of your servers behind the firewall that lets the Hybrid Data Pipeline Server communicate with the database.
  3. To install the Hybrid Data Pipeline’s On-Premise Agent and configure it with the cloud service where you installed Hybrid Data Pipeline Server, please follow the below tutorials. If your Hybrid Data Pipeline Server is in:
  4. Download and install the Hybrid Data Pipeline JDBC connector.

Once you have everything set up, navigate to http://<server-address>:8080/d2c-ui or https://<server-address>:8443/d2c-ui to view the Hybrid Data Pipeline UI.

  1. Log in with the credentials d2cadmin and provide the password you used while installing the Hybrid Data Pipeline Server.
  2. Once you have logged in, create a New Data Source by clicking on New Data Source button as shown below.

    New Datasource

  3. You should now see list of all Data Stores as shown below. Choose Hadoop as your data store.

    datastores list

  4. On the Configuration, fill out all the connection parameters that you would generally use to connect to your Oracle database and set the Connector ID. The Connector ID is the ID of the on-premises connector that you have installed for this server. If you have installed and configured the on-premise connector, you should automatically see the Connector ID In drop down.

    Edit data source configuration

  5. Now Click on Test Connection and you will be able to connect to your Hadoop Hive data source On-Premise. Click on UPDATE button to save the configuration.

Configure Google Cloud Composer

  1. After you have created your Google Cloud Composer environment, go to PYPI packages tab and add the below packages to your nodes.
    • Jaydebeapi
    • Jpype1==0.6.3

    Install Pypi package

  2. Next go to the Environment configuration and from there open your DAGS folder in Google Storage.
  3. Upload the DataDirect Hybrid Data Pipeline JDBC connector to this folder.
  4. Note that all the items copied to DAGS folder will be available at the path /home/airflow/gcs/dags/ on the cluster.
  5. Go to Airflow Web UI and under Admin menu -> Create New Connection.

    Create connection

  6. Provide the following details in your configuration:
    • Conn Id: hdpjdbc
    • Conn Type: Jdbc Connection
    • ConnectionURL: jdbc:datadirect:ddhybrid://hdp.demo.datadirect.com;hybridDataPipelineDatasource=datalake
    • Login: <Your HDP username>
    • Password: <Your HDP Password>
    • Driver Path: /home/airflow/gcs/dags/ddhybrid.jar
    • Driver Class: com.ddtek.jdbc.ddhybrid.DDHybridDriver

      Create JDBC Connection

Creating a DAG

  1. For this tutorial, we will be reading data from on-premises Hadoop instance and display the records in DAG logs.
  2. Below is the DAG Python program. Save it to a Python file, for example hdp.py. Alternatively, you can access the code on GitHub.
    from datetime import timedelta
      
    import airflow
    import os
    import csv
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator
    from airflow.operators.dummy_operator import DummyOperator
    from airflow.hooks.jdbc_hook import JdbcHook
     
    #Creating JDBC connection using Conn ID
    JdbcConn = JdbcHook(jdbc_conn_id='hdpjdbc')
    def getconnection():
        JdbcConn.get_connection('hdpjdbc')
        print("connected")
    def writerrecords():
        id=JdbcConn.get_records(sql="SELECT * FROM customer")
        with open('records.csv', 'w') as csvFile:
            writer = csv.writer(csvFile)
            writer.writerows(id)
    def displyrecords():
        with open('records.csv', 'rt')as f:
            data = csv.reader(f)
            for row in data:
                print(row)
    default_args = {
        'owner': 'airflow',
        'depends_on_past': False,
        'start_date': airflow.utils.dates.days_ago(2),
        'email': ['airflow@example.com'],
        'email_on_failure': False,
        'email_on_retry': False,
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
        # 'queue': 'bash_queue',
        # 'pool': 'backfill',
        # 'priority_weight': 10,
        # 'end_date': datetime(2016, 1, 1),
    }
      
    dag = DAG(
        'datadirect_sample', default_args=default_args,
        schedule_interval="@daily")
      
    t1 = PythonOperator(
        task_id='getconnection',
        python_callable=getconnection,
        dag=dag,
    )
    t2 = PythonOperator(
        task_id='WriteRecords',
        python_callable=writerrecords,
        dag=dag,
    )
      
    t3 = PythonOperator(
        task_id='DisplayRecords',
        python_callable=displyrecords,
        dag=dag,
    )
      
    t1 >> t2 >> t3

  3. Upload the Python program to the DAGS Folder in Google Bucket.

    Upload to bucket

Running the DAG

  1. On the Airflow Web UI, you should see the DAG as shown below.

    List of DAGS
  2. Click on the trigger button under links to manually trigger it. Once the DAG has started, go to the graph view to see the status of each individual task. All the tasks should be green to confirm proper execution.

    Graph view of DAGS

  3. After the execution, if you click on the task DisplayRecords and go to logs, you should see all of your data printed there as shown below.

    Airflow Tasks Log

We hope this tutorial helped you to get started with how you can access on-premises Hadoop Data from Google Cloud Composer. Please contact us if you need any help or have any questions.

Cloud and Hybrid, JDBC TUTORIAL

Connect to On-Premises databases from Google Composer

View all Tutorials

Connect any application to any data source anywhere

A product specialist will be glad to get in
touch with you