Written by Deepak Vohra
In the era of big data, transferring bulk data between the components of a big data analytics ecosystem is a common requirement. Sqoop is a top-level Apache project for bulk data transfer between the Apache Hadoop ecosystem and a structured datastore such as Oracle Database. In this article we discuss how to transfer data between Oracle Database and HDFS (Hadoop Distributed File System), Apache Hive, and Apache HBase.
The Sqoop tool provides the following commands to bulk transfer data between the Hadoop ecosystem and Oracle Database.

import - Imports a database table to HDFS as a text file or a binary representation format such as Avro or SequenceFiles. Command-line arguments may be used to import to Hive or HBase.
export - Exports HDFS files to a pre-existing Oracle database table.
import-all-tables - Imports all tables into HDFS.
job - Creates a saved job to be run later.
merge - Combines two datasets in HDFS.
metastore - Creates a metadata repository for creating and running saved jobs.
codegen - Generates Java classes to encapsulate and interpret imported records.
create-hive-table - Creates a table in the Hive metastore based on a database table.
eval - Runs SQL queries on a database table for evaluation before a subsequent import.
list-databases - Lists all the database schemas. Oracle Database XE does not support this command.
list-tables - Lists all the tables in a database.
We shall discuss the import and export commands in subsequent sections.
Only certain versions of Hadoop run with certain versions of Sqoop and Hive; any Oracle Database version may be used. The following version combination was used:

Apache Hadoop 1.0.0
Apache Hive 0.9.0
Apache Sqoop 1.4.1-incubating
Apache HBase 0.94.1
Java 6
Oracle Database (Oracle Database 10g Express Edition is used in this article)
We also need to set the environment variables for Sqoop, Hive, HBase, and Oracle database in the ~/.bashrc file.
sudo vi ~/.bashrc
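The settings below are a sketch of what the ~/.bashrc entries could look like; the install paths are assumptions based on the product versions used in this article, so adjust them to your own layout.

```shell
# Example environment settings (install paths are assumptions;
# adjust to the actual locations on your machine)
export HADOOP_HOME=/usr/local/hadoop-1.0.0
export SQOOP_HOME=/usr/local/sqoop-1.4.1-incubating__hadoop-1.0.0
export HIVE_HOME=/usr/local/hive-0.9.0
export HBASE_HOME=/usr/local/hbase-0.94.1
export ORACLE_HOME=/usr/lib/oracle/xe/app/oracle/product/10.2.0/server
export PATH=$PATH:$HADOOP_HOME/bin:$SQOOP_HOME/bin:$HIVE_HOME/bin:$HBASE_HOME/bin
```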
As the Oracle JDBC driver is used to connect to Oracle database from Sqoop, copy the Oracle Database JDBC driver JAR file to the Sqoop classpath.
sudo mv $ORACLE_HOME/jdbc/lib/ojdbc14_g.jar /usr/local/sqoop-1.4.1-incubating__hadoop-1.0.0/lib
The sqoop import command is used to import bulk data into the Hadoop filesystem, Hive, and HBase. To bulk transfer the Oracle Database HR.DEPARTMENTS table to HDFS, specify the target HDFS directory with the --target-dir argument (arg), the Oracle database connection parameters with the --connect arg, the table name with the --table arg, and the columns with the --columns arg.
bin/sqoop import --connect "jdbc:oracle:thin:@localhost:1521:XE" --password "calgary10" --username "SYSTEM" --table "HR.DEPARTMENTS" --columns "DEPARTMENT_ID,DEPARTMENT_NAME,MANAGER_ID,LOCATION_ID" --target-dir "/tmp/hdfs_sqoopimport" --verbose
The output from the Sqoop import shows the running of the Map/Reduce job.
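Once the job completes, the imported records can be inspected directly in HDFS. A quick check could look like the following; the part-file name shown is the typical Map-task output name and may differ on your cluster.

```shell
# List the target directory and print the imported records
bin/hadoop fs -ls /tmp/hdfs_sqoopimport
bin/hadoop fs -cat /tmp/hdfs_sqoopimport/part-m-00000
```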
The Sqoop import into HDFS may be run with a saved job. First, create a saved job with the sqoop job command and the --create arg.
bin/sqoop job --create import_job -- import --connect "jdbc:oracle:thin:@localhost:1521:XE" --password "calgary10" --username "SYSTEM" --table "HR.DEPARTMENTS" --columns "DEPARTMENT_ID,DEPARTMENT_NAME,MANAGER_ID,LOCATION_ID" --target-dir "/tmp/hdfssqoopimport" --verbose
Subsequently, run the import_job job with the following command.
bin/sqoop job --exec import_job
As the output shows, the HR.DEPARTMENTS table gets imported into HDFS just as with the sqoop import command.
It may be required to bulk export data from HDFS to Oracle database. In the previous section we bulk imported data from Oracle database to HDFS, into the /tmp/hdfs_sqoopimport directory. In this section we export the HDFS data back to Oracle database. As the HR.DEPARTMENTS table already has data, create a copy of the table (HR.DEPARTMENTSCOPY) without the data. Then run the sqoop export command to export from HDFS to Oracle database.
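The empty copy table can be created in SQL*Plus, for example with a CREATE TABLE ... AS SELECT whose WHERE 1=0 predicate copies the table structure but no rows:

```sql
-- Create HR.DEPARTMENTSCOPY with the same structure as
-- HR.DEPARTMENTS, but without any rows
CREATE TABLE HR.DEPARTMENTSCOPY AS
  SELECT * FROM HR.DEPARTMENTS WHERE 1=0;
```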
bin/sqoop export --connect "jdbc:oracle:thin:@localhost:1521:XE" --hadoop-home "/usr/local/hadoop-1.0.0" --password "calgary10" --username "SYSTEM" --export-dir "/tmp/hdfs_sqoopimport" --table "HR.DEPARTMENTSCOPY" --verbose
The export directory is specified with the --export-dir arg, and the Oracle database table with the --table arg. The Map/Reduce export job runs as shown below.
As the following output indicates, 27 records get exported to the Oracle Database table HR.DEPARTMENTSCOPY.
If a SELECT query is run on the HR.DEPARTMENTSCOPY table, the 27 exported records get listed.
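For example, in SQL*Plus:

```sql
-- Count the exported records; the count should match the 27 rows exported
SELECT COUNT(*) FROM HR.DEPARTMENTSCOPY;
-- List the exported records
SELECT * FROM HR.DEPARTMENTSCOPY;
```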
To import into HBase, specify the --hbase-create-table arg to create the HBase table, specify the HBase table name with the --hbase-table arg, and specify the HBase table column family with the --column-family arg. Specify the Oracle database table to import with the --table arg.
sqoop import --connect "jdbc:oracle:thin:@localhost:1521:XE" --hadoop-home "/usr/local/hadoop-1.0.0" --password "calgary10" --username "SYSTEM" --hbase-create-table --hbase-table "DEPARTMENTS" --column-family "dept" --table "HR.DEPARTMENTS"
The output from the command shows that 27 records get transferred to HBase from Oracle Database table HR.DEPARTMENTS.
The HR.DEPARTMENTS table gets imported into HBase and may be listed with the scan command in the HBase shell.
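For example, from the HBase shell (the table name matches the one given in the import command above):

```shell
# Start the HBase shell, then scan and count the imported table
bin/hbase shell
hbase> scan 'DEPARTMENTS'
hbase> count 'DEPARTMENTS'
```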
The Sqoop tool may also be used to import into Hive, using the --hive-import arg to the sqoop import command. Specify the Hive table with the --hive-table arg, and that the Hive table be created with the --create-hive-table arg. Specify the Oracle database table to import with the --table arg.
bin/sqoop import --connect "jdbc:oracle:thin:@localhost:1521:XE" --hadoop-home "/usr/local/hadoop-1.0.0" --password "calgary10" --username "SYSTEM" --hive-import --create-hive-table --hive-table "depts" --table "HR.DEPARTMENTS" --verbose
As the output indicates, the HR.DEPARTMENTS table gets imported into Hive.
Sqoop imports the data into the Hive warehouse directory in HDFS.
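The imported table can be queried from the Hive CLI, and its backing files inspected in HDFS; the warehouse path below is the Hive default (/user/hive/warehouse) and would differ if hive.metastore.warehouse.dir has been changed.

```shell
# Query the imported table from the Hive CLI
bin/hive -e "SELECT * FROM depts;"
# Inspect the table's data files under the Hive warehouse directory
bin/hadoop fs -ls /user/hive/warehouse/depts
```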
Sqoop imports/exports bulk data from/to Oracle database by first getting the metadata about the Oracle database table, using the SQL statement SELECT t.* FROM DEPARTMENTS t WHERE 1=0. Sqoop may not be able to get all the required metadata, such as column names, using this SQL query, so it is recommended to specify the column names in the sqoop import command, as we did. Also, Oracle database table columns may need to be cast to less precise types by Sqoop. To specify a column mapping other than the default, the --map-column-java and --map-column-hive args are provided.
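As an illustration, if the NUMBER column MANAGER_ID were to be mapped to the Java type Integer rather than Sqoop's default choice, the mapping could be given as follows; the column/type pairing and the target directory here are hypothetical.

```shell
bin/sqoop import --connect "jdbc:oracle:thin:@localhost:1521:XE" \
  --username "SYSTEM" --password "calgary10" \
  --table "HR.DEPARTMENTS" \
  --map-column-java "MANAGER_ID=Integer" \
  --target-dir "/tmp/hdfs_sqoopimport2"
```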
In this article we learned to bulk transfer data between Oracle database and the Hadoop ecosystem: HDFS, Hive, and HBase.