Tutorial: Discover how to build a pipeline with Kafka, leveraging the DataDirect PostgreSQL JDBC driver to move data from PostgreSQL to HDFS. Let’s go streaming!
Apache Kafka is an open source distributed streaming platform that lets you build streaming data pipelines between different applications. You can also use it to build real-time applications that interact with streams of data. It is designed to be scalable, with high throughput and low latency.
Earlier this year, Apache Kafka announced a new tool called Kafka Connect, which helps users move datasets in and out of Kafka using connectors, and it supports JDBC connectors out of the box! One of the major benefits for DataDirect customers is that you can now build an ETL pipeline on Kafka using your DataDirect JDBC drivers: connect to your data sources, pull the data into Kafka, and export it from there to another data source.
Before proceeding any further with this tutorial, make sure that the following are installed and configured properly. This tutorial assumes you are working on Ubuntu 16.04 LTS with PostgreSQL, Apache Hadoop and Hive installed.
To make installation easier for people trying this out for the first time, we will install Confluent Platform. It takes care of installing Apache Kafka, Schema Registry and Kafka Connect, which includes connectors for moving files, JDBC connectors and the HDFS connector for Hadoop.
wget -qO - http://packages.confluent.io/deb/2.0/archive.key | sudo apt-key add -
sudo add-apt-repository "deb http://packages.confluent.io/deb/2.0 stable main"
sudo apt-get update
sudo apt-get install confluent-platform-2.11.7
Next, install the DataDirect PostgreSQL JDBC driver by running its installer:
java -jar PROGRESS_DATADIRECT_JDBC_POSTGRESQL_ALL.jar
Create the source connector configuration (postgres.properties) with the following contents:
name=test-postgres-jdbc
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=1
connection.url=jdbc:datadirect:postgresql://<server>:<port>;User=<user>;Password=<password>;Database=<dbname>
mode=timestamp+incrementing
incrementing.column.name=<id>
timestamp.column.name=<modifiedtimestamp>
topic.prefix=test_jdbc_
table.whitelist=actor
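The connection.url placeholders can also be filled in from the shell rather than by hand. The following is a minimal sketch, not part of the original tutorial: the function name build_jdbc_url and the sample values (localhost, 5432, kafka_user, secret, dvdrental) are hypothetical. Note that the DataDirect URL separates its properties with semicolons, so the value needs quoting whenever it is passed around in a shell.

```shell
#!/bin/sh
# Build a DataDirect PostgreSQL JDBC URL in the format used by connection.url.
# The function name and all sample values below are hypothetical.
build_jdbc_url() {
    # $1=server  $2=port  $3=user  $4=password  $5=database
    printf 'jdbc:datadirect:postgresql://%s:%s;User=%s;Password=%s;Database=%s\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# Print the connection.url line for a local database (sample values only).
url=$(build_jdbc_url localhost 5432 kafka_user secret dvdrental)
echo "connection.url=$url"
```

The printed line can be pasted into postgres.properties in place of the placeholder version.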
Create the sink connector configuration (hdfs.properties) with the following contents:
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
topics=test_jdbc_actor
hdfs.url=hdfs://<;
flush.size=2
hive.metastore.uris=thrift://<;
hive.integration=true
schema.compatibility=BACKWARD
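The sink only receives data if its topics value matches the topic the source writes to, which is topic.prefix followed by the table name (here test_jdbc_ plus actor gives test_jdbc_actor). A small sanity check along these lines can catch a mismatch before Kafka Connect is started. This is a sketch under assumptions: the helper names get_prop and topics_match are hypothetical, and it handles only a single table in table.whitelist, as in this tutorial.

```shell
#!/bin/sh
# Print the value of a key from a simple key=value .properties file.
# Hypothetical helper; does not handle comments, escapes or line continuations.
get_prop() {
    # $1=file  $2=key
    awk -F= -v k="$2" '$1 == k { sub(/^[^=]*=/, ""); print; exit }' "$1"
}

# Succeed when the sink's topics equals the source's topic.prefix + table.whitelist.
# Assumes table.whitelist names exactly one table.
topics_match() {
    # $1=source properties file  $2=sink properties file
    expected="$(get_prop "$1" topic.prefix)$(get_prop "$1" table.whitelist)"
    actual="$(get_prop "$2" topics)"
    [ "$expected" = "$actual" ]
}

# Usage (paths are the same placeholders used elsewhere in this tutorial):
# topics_match /path/to/postgres.properties /path/to/hdfs.properties \
#     && echo "source and sink topic names line up"
```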
Make the DataDirect PostgreSQL driver jar available to Hive and on the CLASSPATH:
ln -s /path/to/datadirect/lib/postgresql.jar /path/to/hive/lib/postgresql.jar
export CLASSPATH=/path/to/datadirect/lib/postgresql.jar
Start Hadoop (HDFS and YARN):
cd /path/to/hadoop/sbin
./start-dfs.sh
./start-yarn.sh
Start ZooKeeper, Kafka and Schema Registry, in that order:
zookeeper-server-start /path/to/zookeeper.properties
kafka-server-start /path/to/server.properties
schema-registry-start /path/to/schema-registry.properties
To start ingesting data from PostgreSQL, the final thing that you have to do is start Kafka Connect. You can start Kafka Connect by running the following command:
connect-standalone /path/to/connect-avro-standalone.properties /path/to/postgres.properties /path/to/hdfs.properties
This imports the data from PostgreSQL into Kafka using the DataDirect PostgreSQL JDBC driver and creates a topic named test_jdbc_actor. The HDFS connector then reads the topic test_jdbc_actor and exports the data from Kafka to HDFS. The data stays in Kafka, so you can reuse it to export to any other data source.
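To confirm that data arrived, you can look for files in HDFS and query the table that Hive integration creates. The sketch below assumes the HDFS connector's default settings, which are not shown in the configuration above: a top-level directory named topics, one partition=<n> subdirectory per Kafka partition, and a Hive table named after the topic. The helper name hdfs_topic_dir is hypothetical.

```shell
#!/bin/sh
# Directory where the HDFS connector is expected to write a topic partition,
# assuming the default topics.dir ("topics") and the default partitioner.
hdfs_topic_dir() {
    # $1=topic name  $2=Kafka partition number (defaults to 0)
    printf '/topics/%s/partition=%s\n' "$1" "${2:-0}"
}

dir=$(hdfs_topic_dir test_jdbc_actor)
echo "Expecting Avro files under: $dir"

# Only call the Hadoop and Hive CLIs when they are on the PATH.
if command -v hdfs >/dev/null 2>&1; then
    hdfs dfs -ls "$dir" || echo "could not list $dir" >&2
fi
if command -v hive >/dev/null 2>&1; then
    # Assumes the Hive table is named after the topic.
    hive -e 'SELECT * FROM test_jdbc_actor LIMIT 5;' || echo "hive query failed" >&2
fi
```

With flush.size=2, files should appear once at least two records have been read from the topic.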
We hope this tutorial helped you understand how to build a simple ETL pipeline using Kafka Connect and the DataDirect PostgreSQL JDBC driver. The approach is not limited to PostgreSQL: by following similar steps, you can create ETL pipelines with any of our DataDirect JDBC drivers, which cover relational databases like Oracle, DB2 and SQL Server, cloud sources like Salesforce and Eloqua, and Big Data sources like CDH Hive, Spark SQL and Cassandra. Also, subscribe to our blog via email or RSS feed for more awesome tutorials.
Saikrishna is a DataDirect Developer Evangelist at Progress. Before joining Progress, he worked as a software engineer for 3 years after earning his undergraduate degree, and he recently graduated from NC State University with a Master's in Computer Science. His interests are in the areas of Data Connectivity, SaaS and Mobile App Development.
Copyright © 2017 Progress Software Corporation and/or its subsidiaries or affiliates. All Rights Reserved.