Saikrishna recently experimented with bulk exporting huge datasets from Hadoop to SQL Server and measured the performance of various scenarios. Take a look at the results.
With the rise of big data in the cloud and demand for open analytics, we’re seeing bulk data movement as a common data integration pattern.
Big data platforms are migrating to cloud services such as Amazon EMR, IBM BigInsights on Cloud, Microsoft Azure HDInsight and SAP Altiscale to better handle elastic workloads that consume expanding cloud-resident data sets and streams. Advanced analytics techniques are then used to crunch these data sets into valuable insights.
But how can these insights be exported to enterprise on-premises business analytics platforms for distribution, such as Microsoft Power BI or Oracle Analytics?
This article looks into the performance of moving big data insights from the cloud to on-premises data warehouses or marts using a cloud-agnostic hybrid data pipeline.
The following observations were recorded while bulk exporting data to SQL Server with Apache Sqoop in various scenarios using Progress DataDirect Hybrid Connectivity solutions.
This scenario was set up to check the bulk export capabilities to SQL Server using DataDirect Cloud. Here I have a single-node Hadoop installation running in pseudo-distributed mode on AWS EC2, with Sqoop and Hive installed alongside it.
Behind the firewall, we have SQL Server 2016 installed on an on-premises Progress vCloud. A huge dataset was exported from Hadoop to SQL Server using Sqoop and the JDBC client from the DataDirect Cloud service.
Dataset used: New York City Taxi Data 2016
sqoop export \
  --connect 'jdbc:datadirect:ddcloud:database=SQLNonBulk;user=<user>;password=<password>;TransactionMode=ignore' \
  --driver com.ddtek.jdbc.ddcloud.DDCloudDriver \
  --table 'NYCTAXIDATA' \
  --export-dir /user/hive/warehouse/datadirect.db/nyctaxidata \
  --input-lines-terminated-by "\n" \
  --input-fields-terminated-by ',' \
  --batch \
  -m 10
Note: The batch sizes in hybrid connectivity scenarios are configured in the DataDirect Cloud service, rather than the JDBC client.
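To repeat the run against a different DataDirect Cloud data source, the connection URL from the command above can be assembled from its parts. This is a minimal sketch, not part of the original experiment: the function name `build_ddcloud_url` is hypothetical, and the user and password values are the same placeholders used in the article.

```shell
# Hypothetical helper: assemble the DataDirect Cloud JDBC URL in the same
# form as the URL passed to --connect in the sqoop export command.
# Arguments: data source name, user, password.
build_ddcloud_url() {
  database="$1"
  user="$2"
  password="$3"
  printf 'jdbc:datadirect:ddcloud:database=%s;user=%s;password=%s;TransactionMode=ignore\n' \
    "$database" "$user" "$password"
}

# Placeholders only; substitute real credentials before running Sqoop.
DD_URL="$(build_ddcloud_url SQLNonBulk '<user>' '<password>')"
echo "$DD_URL"
```

The resulting value can then be passed to Sqoop as `--connect "$DD_URL"`, leaving the remaining options unchanged.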
The performance gains demonstrated in this experiment, up to 2,300%, show that cloud-based Hadoop platforms can integrate data insights with enterprise analytics platforms in time for intra-day reporting and analytics. Throughput varied between runs, due in part to the non-production infrastructure, but bulk export performance improved consistently with DataDirect hybrid bulk export capabilities.
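For readers comparing their own runs, one common way to express such a gain is from the elapsed times of two exports of the same data: gain = (baseline time / bulk time - 1) x 100. Under that formula, a 2,300% gain means the bulk run finished 24 times faster. The article does not state which formula it used, so this is an assumption; the function name `gain_percent` and the sample timings below are hypothetical and are not measurements from the experiment.

```shell
# Hypothetical helper: percentage throughput gain computed from two elapsed
# times (in seconds) for exporting the same number of rows.
# Arguments: baseline seconds, bulk seconds.
gain_percent() {
  awk -v base="$1" -v bulk="$2" \
    'BEGIN { printf "%.0f\n", (base / bulk - 1) * 100 }'
}

# Illustrative timings only, not measured values: 2400 s versus 100 s.
gain_percent 2400 100   # prints 2300
```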
Although this experiment used a SQL Server database, similar bulk movement facilities are available with other popular DataDirect hybrid data sources such as Oracle and Salesforce.
As with any data connectivity performance testing, Progress recommends testing in a production-like environment.
Try the experiment against your big data platform.
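When trying this yourself, it helps to record the wall-clock time of each export so that scenarios can be compared. The following is a rough sketch under stated assumptions: the function name `time_run` is hypothetical, it has one-second resolution, and it measures the whole command, including Sqoop and MapReduce startup overhead.

```shell
# Hypothetical helper: run a command and print its wall-clock duration in
# seconds on stdout. The command's own stdout is redirected to stderr so that
# only the duration is captured by $(...).
time_run() {
  start=$(date +%s)
  "$@" 1>&2
  status=$?
  end=$(date +%s)
  echo $((end - start))
  return "$status"
}

# Example with a stand-in command; replace "sleep 1" with the sqoop export.
elapsed=$(time_run sleep 1)
echo "elapsed seconds: $elapsed"
```

Running each scenario several times and comparing the recorded durations gives a more reliable picture than a single run, particularly on shared or non-production infrastructure.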
Saikrishna is a DataDirect Developer Evangelist at Progress. Before joining Progress, he worked as a software engineer for three years after completing his undergraduate degree, and he recently graduated from NC State University with a master's degree in Computer Science. His interests are in the areas of data connectivity, SaaS and mobile app development.
Copyright © 2017 Progress Software Corporation and/or its subsidiaries or affiliates.