by Deepak Vohra

In an earlier tutorial on Oracle Data Integrator we discussed integrating an Apache Hive table into Oracle Database 11g with Oracle Data Integrator (ODI) 11g. The IKM File-Hive to Oracle (OLH-OSCH) knowledge module, which was used for the Hive integration, may also be used for integrating an HDFS file into Oracle Database. In this tutorial we shall create an ODI data model for HDFS file data and subsequently use the IKM File-Hive to Oracle (OLH-OSCH) knowledge module to integrate an HDFS file into Oracle Database. This tutorial has the following sections.

  • Setting the Environment
  • Creating Source and Target Data Stores
  • Creating the Topology for HDFS File
  • Creating the Topology for Oracle Database
  • Creating the Design Data Model for HDFS File
  • Creating the Design Data Model for Oracle Database
  • Creating an Integration Project
  • Creating an Integration Interface
  • Running the Interface

Setting the Environment

We have used Oracle Linux 6.5 on Oracle VirtualBox 4.3. The prerequisite software is the same as in the Hive integration tutorial, except that Hive is not required. The following software must be installed (a quick version check follows the list).

  • Oracle Data Integrator 11g
  • Oracle Database 11g
  • Hadoop 2.0.0 CDH 4.6
  • Oracle Loader for Hadoop 3.0.0
  • Java 7

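The installed versions can be verified from a shell before proceeding; a quick check, assuming the binaries are on the PATH:

java -version
hadoop version
sqlplus -V
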
The procedure to install and configure Oracle Data Integrator (ODI), Hadoop, and Oracle Loader for Hadoop (OLH) was discussed in the Hive integration tutorial. Set the environment variables for Oracle Database, Hadoop, OLH, and Java in the bash shell.

vi ~/.bashrc

export ODI_HOME=/home/dvohra/dbhome_1
export HADOOP_PREFIX=/odi/hadoop-2.0.0-cdh4.6.0
export HADOOP_CONF=$HADOOP_PREFIX/etc/hadoop
export JAVA_HOME=/odi/jdk1.7.0_55
export ORACLE_HOME=/home/oracle/app/oracle/product/11.2.0/dbhome_1
export ORACLE_SID=ORCL
export OLH_HOME=/odi/oraloader-3.0.0-h2
export HADOOP_MAPRED_HOME=/odi/hadoop-2.0.0-cdh4.6.0/bin
export HADOOP_HOME=/odi/hadoop-2.0.0-cdh4.6.0/share/hadoop/mapreduce2
export HADOOP_CLASSPATH=$HADOOP_HOME/*:$HADOOP_HOME/lib/*:$OLH_HOME/jlib/*
export ODI_ADDITIONAL_CLASSPATH=$HADOOP_HOME/*:$OLH_HOME/jlib/*:$HADOOP_CONF
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_MAPRED_HOME:$ORACLE_HOME/bin
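
After saving ~/.bashrc, apply the settings to the current shell:

source ~/.bashrc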

If not already created, create the master_odi_repo user, and also the master repository, a login, and a work repository, as discussed in the Hive tutorial. Then start HDFS (NameNode and DataNode).
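
A minimal sketch of the start commands for a single-node setup, assuming the NameNode has already been formatted, uses the hadoop-daemon.sh script:

$HADOOP_PREFIX/sbin/hadoop-daemon.sh start namenode
$HADOOP_PREFIX/sbin/hadoop-daemon.sh start datanode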

Creating Source and Target Data Stores

Next, create the source datastore, which is a text file in HDFS. Create the following text file, wlslog.txt.

Apr-8-2014-7:06:16-PM-PDT Notice WebLogicServer AdminServer BEA-000365 Server state changed to STANDBY       
Apr-8-2014-7:06:17-PM-PDT Notice WebLogicServer AdminServer BEA-000365 Server state changed to STARTING     
Apr-8-2014-7:06:19-PM-PDT Notice WebLogicServer AdminServer BEA-000365 Server state changed to ADMIN         
Apr-8-2014-7:06:20-PM-PDT Notice WebLogicServer AdminServer BEA-000365 Server state changed to RESUMING     
Apr-8-2014-7:06:22-PM-PDT Notice WebLogicServer AdminServer BEA-000331 Started WebLogic AdminServer         
Apr-8-2014-7:06:23-PM-PDT Notice WebLogicServer AdminServer BEA-000365 Server state changed to RUNNING       
Apr-8-2014-7:06:24-PM-PDT Notice WebLogicServer AdminServer BEA-000360 Server started in RUNNING mode        

Because we shall be designing the data model for the HDFS file in ODI, including adding its columns, the HDFS file data must be uniformly formatted, with the values of a given column having the same width in every row: the column values in each row must start and end at the same character index.
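
This can be checked from the shell. The following one-liner, assuming each line of wlslog.txt is padded with trailing spaces to a fixed width, prints one value per distinct line length, so a single value indicates uniform formatting:

awk '{ print length($0) }' wlslog.txt | sort -u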

Create a /wls directory in HDFS and put the wlslog.txt file in HDFS.

hdfs dfs -mkdir hdfs://10.0.2.15:8020/wls
hdfs dfs -put wlslog.txt hdfs://10.0.2.15:8020/wls
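
To confirm the directory and file are in place, list the directory and output the file:

hdfs dfs -ls hdfs://10.0.2.15:8020/wls
hdfs dfs -cat hdfs://10.0.2.15:8020/wls/wlslog.txt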

Run the following SQL script in SQL*Plus to create the Oracle Database table OE.WLSSERVER.

CREATE TABLE OE.wlsserver (
  time_stamp VARCHAR2(4000),
  category   VARCHAR2(4000),
  type       VARCHAR2(4000),
  servername VARCHAR2(4000),
  code       VARCHAR2(4000),
  msg        VARCHAR2(4000)
);
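
The statement can be run interactively or saved to a script and invoked from the command line; a sketch, assuming the statement is saved as wlsserver.sql (an illustrative filename) and the OE sample schema is installed and unlocked:

sqlplus / as sysdba @wlsserver.sql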

Creating the Topology for HDFS File

An HDFS-specific technology is not provided in ODI; the File technology is used for a file in HDFS. The HDFS file topology consists of the following components:

  • A File Data Server
  • A Physical Schema
  • A Logical Schema

To configure a data server for the text file datastore, select Topology>Physical Architecture>File. Right-click on File>FILE_GENERIC and select Open. In the Definition tab, specify the Data Server Name. The Technology is pre-selected as File. Specify the Connection User and Password for the Linux login.

Select the JDBC tab. The JDBC Driver and JDBC URL fields must be kept empty. Click on Save to save the FILE_GENERIC data server.


Next, create a physical schema for the data server. Right-click on FILE_GENERIC data server and select New Physical Schema.


In the physical schema Definition, specify in the Directory (Schema) field the hdfs:// URL of the HDFS directory into which the wlslog.txt file was put, which is hdfs://10.0.2.15:8020/wls. Specify the same directory in the Directory (Work Schema) field. Click on Save to save the physical schema.


A physical schema gets added to the data server.


Next, create a logical schema associated with the physical schema. Right-click on Topology>Logical Architecture>Technologies>File and select New Logical Schema.


In the Logical Schema Definition, specify a Name (HDFS, for example). Select the physical schema created earlier for the Global context. Click on Save.


A new logical schema gets added to the File Technology in the Logical Architecture.


Creating the Topology for Oracle Database

Create the topology for the Oracle Database 11g table datastore as discussed in the Hive integration tutorial, if it was not already created for the Hive integration.

Creating the Design Data Model for HDFS File

Next we shall create the design model for the HDFS file source datastore. Select Designer>Models>New Model Folder. In the Model Folder Definition, specify a folder Name (HDFSFile, for example). Click on Save.


A Model folder gets created. Right-click on the HDFSFile Model Folder and select New Model.


Specify a Model Name (HDFSFile). Select the File Technology and the HDFS Logical Schema created earlier. Select the Action Group as <Generic Action>. Click on Save to save the Model.


Next, add a datastore to the model. Right-click on the HDFSFile model and select New Datastore.


Specify a Datastore Name, and select Datastore Type as Table. Specify Resource Name as wlslog.txt.


Select the Columns tab and click on Add Column to add columns for the HDFS file data: TIME_STAMP, CATEGORY, TYPE, SERVERNAME, CODE, and MSG. Also select their Type, which is String for all of the columns. Specify the Start column index for each column; for example, the fourth column, SERVERNAME, starts at index 49 in each data row of the HDFS text file wlslog.txt. Specify the Physical Length for each column; for example, the TIME_STAMP column has a physical length of 26 in each data row. Specify the Logical Length, which is the same as the Physical Length, for each column.
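
The remaining offsets can be read off the fixed-width rows of wlslog.txt. As a sanity check before defining the columns in ODI, the following commands, assuming the character ranges derived from the sample rows above (TIME_STAMP in columns 1-26 and SERVERNAME in columns 49-60, each range including its trailing padding), print the values of those two columns:

cut -c1-26 wlslog.txt
cut -c49-60 wlslog.txt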


Click on Save. The datastore gets added to the model.


Creating the Design Data Model for Oracle Database

The data model (OracleDB) and the datastore (WLSSERVER) for the Oracle Database table OE.WLSSERVER are created as discussed in the Hive integration tutorial.


Creating an Integration Project

Create an integration project (Export HDFS File To Oracle Database) to integrate the HDFS datastore with the Oracle Database table datastore. The integration project is created as discussed in the Hive integration tutorial. Add the IKM File-Hive to Oracle knowledge module to the project.


In the Definition for the IKM File-Hive to Oracle knowledge module, select Source Technology as <Undefined> and Target Technology as Oracle.


Creating an Integration Interface

Create an interface (HDFS-OracleDB) for defining the mapping and data flow between the source and target datastores, with the same procedure as for the Hive integration.


In the interface Definition, select Staging Area Different From Target and select File: HDFS as the staging area.


Add the HDFS file datastore HDFS-WSLOG as a Source dataset. Add the Oracle Database datastore WLSSERVER as the Target Datastore. Map the columns of the source datastore to the same-named columns of the target datastore.


The Quick-Edit tab should list Staging Area in the Execute On column for the Target Datastore WLSSERVER.


The flow of data between the source and target datastores is displayed in the Flow tab. The data flows from the staging area in HDFS to the Oracle Database table OE.WLSSERVER. The IKM Selector should be IKM File-Hive to Oracle.

Running the Interface

In this section we shall run the interface HDFS-OracleDB. Right-click on the HDFS-OracleDB interface and select Execute.


The HDFS file data gets integrated into an Oracle Database table. Select the Operator tab to display the sequence of steps performed in integrating the HDFS file data into Oracle Database.


The Oracle staging table and the temporary files and data files used in the integration are deleted after the integration completes.


To view the integrated data, right-click on WLSSERVER in Designer>Models>OracleDB and select View Data.


The integrated data gets displayed.


Run a SELECT * FROM OE.WLSSERVER query in SQL*Plus to list the integrated data.


The output from the SQL query lists the 7 rows of log data from the HDFS file wlslog.txt integrated into Oracle Database.

SQL> SELECT * FROM OE.WLSSERVER;

TIME_STAMP
--------------------------------------------------------------------------------
CATEGORY
--------------------------------------------------------------------------------
TYPE
--------------------------------------------------------------------------------
SERVERNAME
--------------------------------------------------------------------------------
CODE
--------------------------------------------------------------------------------
MSG
--------------------------------------------------------------------------------
Apr-8-2014-7:06:16-PM-PDT
Notice
WebLogicServer
AdminServer
BEA-000365
Server state changed to STANDBY

Apr-8-2014-7:06:17-PM-PDT
Notice
WebLogicServer
AdminServer
BEA-000365
Server state changed to STARTING

Apr-8-2014-7:06:19-PM-PDT
Notice
WebLogicServer
AdminServer
BEA-000365
Server state changed to ADMIN

Apr-8-2014-7:06:20-PM-PDT
Notice
WebLogicServer
AdminServer
BEA-000365
Server state changed to RESUMING

Apr-8-2014-7:06:22-PM-PDT
Notice
WebLogicServer
AdminServer
BEA-000331
Started WebLogic AdminServer

Apr-8-2014-7:06:23-PM-PDT
Notice
WebLogicServer
AdminServer
BEA-000365
Server state changed to RUNNING

Apr-8-2014-7:06:24-PM-PDT
Notice
WebLogicServer
AdminServer
BEA-000360
Server started in RUNNING mode


7 rows selected.

SQL>

In this tutorial we integrated an HDFS file into an Oracle Database 11g table using Oracle Data Integrator.