Microsoft BCP Performance on Sqoop EXPORT to SQL Server from Hadoop

Microsoft BCP Performance on Sqoop EXPORT to SQL Server from Hadoop

April 23, 2018 0 Comments
Microsoft BCP Performance on Sqoop EXPORT to SQL Server from Hadoop_870x450

In this tutorial, you'll learn how to export data out of Hadoop to boost throughput using the DataDirect SQL Server JDBC driver and Apache Sqoop.

We've gotten everyone connected to SQL Server using Progress DataDirect's exclusive support for both NTLM and Kerberos authentication from Linux with Sqoop. Now, we plan to blow your minds with high flying bulk insert performance into SQL Server using Sqoop's Generic JDBC Connector. Linux clients will get similar throughput to the Microsoft BCP tool.

sqoop-logo

So far, Cloudera and HortonWorks have been pointing shops to the high performance DataDirect SQL Server JDBC driver to help load data volumes anywhere from 10GB to 1TB into SQL Server Data Marts and Warehouses. It's common for the DataDirect SQL Server JDBC driver to speed up load times by 15-20X; and Sqoop will see similar improvement since it leverages JDBC batches that we transparently convert into SQL Server's native bulk load protocol. Moving data out of Hadoop and into external JDBC sources are exciting projects that represent the democratization of big data for downstream application consumers. You're definitely doing something right if you are ready to read on!

Get Started with Fast Performance for Sqoop EXPORT to SQL Server

  1. Download the DataDirect Connect for JDBC drivers for and follow the quick start guides supplied with the download.
  2. Copy the sqlserver.jar file to $SQOOP_HOME/lib directory on your client machine. (This will be /usr/lib/sqoop/lib if you installed from an RPM or Debian package). The JDBC driver needs to be installed only on the machine where Sqoop is executed; and not on each node in your Hadoop cluster.
  3. Verify the Database’s recovery mode per the msdn article on Considerations for Switching from the Full or Bulk-Logged Recovery Model. To verify the recovery mode, the database user can run the following query:
    SELECT name, recovery_model_desc
    FROM sys.databases
    WHERE name = ‘database_name’ ;

    Note the recovery_model_desc returned by this query (expect to return,’BULK_LOGGED’).

  4. From command line, run the Sqoop export command using similar properties as below. Or specify the equivalent using the Hue web UI for Sqoop jobs. sqoop export --connect 'jdbc:datadirect:sqlserver://nc-sqlserver:1433;database=test;user=test01;password=test01;EnableBulkLoad=true;BulkLoadBatchSize=1024;BulkLoadOptions=0' --driver com.ddtek.jdbc.sqlserver.SQLServerDriver --table 'blah_1024MB' --export-dir /user/hdfs/blah_1024MB/ --input-lines-terminated-by "n" --input-fields-terminated-by ',' --batch -m 10

Notes

  • --batch mode is used for underlying insert statement execution.
  • --driver must be specified when using a Generic JDBC connector.
  • --connect is the JDBC URL. “EnableBulkLoad=true” authorizes the DataDirect SQL Server driver to utilize the bulk load protocol for the inserting of rows. The “BulkLoadBatchSize” value indicates to the driver the number of rows it will attempt to bulk load on a single roundtrip to the server. If this value is less than the sqoop.export.records.per.statement value, then each call to “executeBatch” will result in more than one round trip to the server in order to insert the batch of rows.
  • --table: the table to be populated in the target relational database as data is transferred from HDFS
  • --export-dir: identifies the HDFS directory which contains the Hadoop table to be exported.
  • --input-lines-terminated-by: identifies the character which separates rows in the HDFS files.
  • --input-fields-terminated-by: identifies the character which separates columns in the HDFS files.
  • -D sqoop.export.records.per.statement is not recommended nor the equivalent of JDBC batch size. Rather, it specifies the number of rows per SQL statements for data sources that support multi row inserts such as Postgres.
INSERT INTO films (code, title, did, date_prod, kind) VALUES
  ('B6717', 'Tampopo', 110, '1985-02-10', 'Comedy'),
  ('HG120', 'The Dinner Game', 140, DEFAULT, 'Comedy');

View the Sqoop user's guide for complete reference.

Special thanks to Mike Spinak, Principal Software Engineer and Danh Huynh, Systems Administrator with their help with setup and testing in the 6 node Cloudera CDH5.2.0-1.cdh5.2.0.p0.36 cluster to export data into SQL Server 2008 R2.

Show Me the Numbers

Results are still coming in from several shops and the best to date is a load of 40 GB into SQL Server within 10 minutes. In the above system, we were loading 37 GB in 18.5 minutes. There are several properties across Hadoop, Sqoop, JDBC driver and SQL Server you can tune to improve performance even further.

Download your trial driver today and tweet your load times to @DataDirect_News. If you have questions, please contact us; or call us at 1-800-876-3101.

Download Free Trial

Saikrishna-Teja-Bobba_164x164.jpg

Saikrishna Teja Bobba

Saikrishna is a DataDirect Developer Evangelist at Progress. Prior to working at Progress, he worked as Software Engineer for 3 years after getting his undergraduate degree, and recently graduated from NC State University with Masters in Computer Science. His interests are in the areas of Data Connectivity, SaaS and Mobile App Development.

Comments
Comments are disabled in preview mode.
Topics
 
 
Latest Stories in
Your Inbox
Subscribe
More From Progress
OData: The Fastest Way to RESTify Your Databases
OData: The Fastest Way to RESTify Your Databases
Download Whitepaper
 
thumbnail_Kinvey_BusinessValue_232_132
The Business Value of Kinvey
Download Whitepaper
 
EME_Webinar_thumbnail
Finding App ROI Through Time And Money
Watch Webinar