Tutorial: Using Google Cloud Dataflow to Ingest Data Behind a Firewall

by Saikrishna Teja Bobba Posted on February 27, 2018

In this tutorial, you'll learn how to easily extract, transform and load (ETL) on-premises Oracle data into Google BigQuery using Google Cloud Dataflow.

Google Cloud Dataflow is a service for processing and enriching real-time streaming and batch data. Dataflow pipelines are built with the Apache Beam SDK, and this tutorial uses the Beam SDK for Java to read and write data. As you might expect with a cloud-based service, the Java SDK's predefined I/O connectors mainly target cloud and big data stores.

However, Dataflow can reach well beyond big data and cloud stores to many other sources through the JDBC interface. Using Progress DataDirect JDBC connectors, you can open Dataflow's processing power to a wide range of on-premises data sources, including Oracle, SQL Server, IBM DB2, PostgreSQL and many more. Expanding your data sources this way lets you integrate diverse external databases with the Google ecosystem and eliminate non-Google data silos.

Combining on-premises data with cloud technologies almost always raises immediate security concerns. The DataDirect Hybrid Data Pipeline lets you securely access data behind a firewall without complex network configuration such as SSH tunnels, reverse proxies or VPNs. It can also be deployed to work with existing network configurations, which is often required in industries such as financial services.

Firewall-Friendly Access to On-Premises Data Sources

The DataDirect Hybrid Data Pipeline JDBC driver can be used to ingest both on-premises and cloud data to Google Cloud Dataflow through the Apache Beam Java SDK interface. We've written a detailed tutorial to show you how to extract, transform and load (ETL) on-premises Oracle data into Google BigQuery using Google Cloud Dataflow.

Our tutorial demonstrates how to connect to an on-premises Oracle database, read the data, apply a simple transformation and write it to BigQuery. This does not require any additional components from the database vendors.
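To make the "simple transformation" step more concrete, here is a minimal sketch in plain Java, with no Beam or JDBC dependency. It is not taken from the tutorial: the class name RowTransformSketch, the method toWarehouseRow and the sample column names are all hypothetical, and the transformation itself (lowercasing column names and trimming string values) is only an assumed example of the kind of per-row cleanup you might apply between reading from Oracle and writing to BigQuery.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class RowTransformSketch {

    // Hypothetical per-row transformation: lowercase the column names and
    // trim whitespace from string values. Other value types pass through.
    public static Map<String, Object> toWarehouseRow(Map<String, Object> sourceRow) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : sourceRow.entrySet()) {
            String column = entry.getKey().toLowerCase(Locale.ROOT);
            Object value = entry.getValue();
            if (value instanceof String) {
                value = ((String) value).trim();
            }
            out.put(column, value);
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample row with made-up columns, standing in for one database record.
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("EMPLOYEE_ID", 100);
        row.put("FIRST_NAME", "  Steven ");

        // prints {employee_id=100, first_name=Steven}
        System.out.println(toWarehouseRow(row));
    }
}
```

In a real pipeline, logic like this would run once per record inside the pipeline's transform step. Keeping it as a pure function of one row makes it easy to unit test without a database or a cloud project.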

You can use a similar process with any of the Hybrid Data Pipeline's supported data sources, such as SQL Server, Hive, IBM DB2, Salesforce and Amazon Redshift. Check out the tutorial, and please contact us if you need help or have any questions.

View the Tutorial

Saikrishna Teja Bobba

Saikrishna is a DataDirect Developer Evangelist at Progress. Prior to working at Progress, he worked as a software engineer for three years after earning his undergraduate degree, and he recently graduated from NC State University with a master's degree in computer science. His interests are in the areas of data connectivity, SaaS and mobile app development.
