Apache Flume is designed for collecting logs. In earlier tutorials on Flume we discussed streaming Oracle Database logs to HDFS, HBase and Oracle NoSQL Database, all of which are designed for storing large quantities of data. HDFS, HBase and Oracle NoSQL Database do provide query support, but search is not their main feature. Apache Solr is designed for indexing and searching data, and Flume provides the MorphlineSolrSink to stream data to Solr. Combining Flume and Solr therefore provides a suitable platform to collect, index and search logs. In this tutorial we shall collect Oracle Database logs in Solr using Flume and subsequently search the indexed logs in Solr. The tutorial has the following sections.
 
 

Setting the Environment

 
The following software is required for this tutorial.
 
-Apache Flume 1.4
-Apache Solr 4.10.3
-Oracle Database 11g or 12c
-Java 7
-Maven 3.x
 
Create a directory /flume on Linux (Oracle Linux 6.6 is used in this tutorial) and set its permissions to 777 (read, write and execute for all users).
 
mkdir /flume
chmod -R 777 /flume
cd /flume
 
Maven is required to compile and install the flume-ng-morphline-solr-sink jar file. Download the Maven tar.gz file and extract it.
 
wget http://mirror.olnevhost.net/pub/apache/maven/binaries/apache-maven-3.2.2-bin.tar.gz
tar xvf apache-maven-3.2.2-bin.tar.gz
 
Download and extract the Flume 1.4 tar.gz file.
 
wget http://archive-primary.cloudera.com/cdh4/cdh/4/flume-ng-1.4.0-cdh4.6.0.tar.gz
tar -xvf flume-ng-1.4.0-cdh4.6.0.tar.gz
 
Download and extract the Apache Solr 4.10.3 tgz file.
 
wget http://apache.mirror.gtcomm.net/lucene/solr/4.10.3/solr-4.10.3.tgz
tar xvf  solr-4.10.3.tgz
 
Create a morphlines.conf file in the Flume configuration directory conf. The morphlines.conf file shall be used to configure the Solr server URL, collection, and the Solr commands to parse the source input data. 
 
vi /flume/apache-flume-1.4.0-cdh4.6.0-bin/conf/morphlines.conf
 
Create the flume-env.sh file from the template.
 
cp $FLUME_HOME/conf/flume-env.sh.template $FLUME_HOME/conf/flume-env.sh
 
Add the following JAVA_OPTS to the flume-env.sh file.
 
JAVA_OPTS="-Xms1000m -Xmx1000m -Xss128k -XX:MaxDirectMemorySize=256m
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC"
 
We need to download and add the following jar files to the Flume lib directory.
 
-lucene-core-4.10.3.jar: Apache Lucene Java Core
-lucene-codecs-4.10.3.jar: Codecs and postings formats for Lucene
-lucene-analyzers-common-4.10.3.jar: Lucene Common Analyzers
-lucene-analyzers-kuromoji-4.10.3.jar: Lucene Kuromoji Japanese Morphological Analyzer
-lucene-spatial-4.10.3.jar: Spatial Strategies for Lucene
-lucene-phonetic-3.6.2.jar: Lucene Phonetic Analyzer
-org.restlet-2.3.1.jar: Restlet Core - API and Engine
-org.apache.servicemix.bundles.restlet-1.1.10_3.jar: OSGi bundle wrapping the org.restlet and com.noelios.restlet 1.1.10 jar files
-spatial4j-0.4.1.jar: Spatial4j, a general-purpose spatial/geospatial ASL-licensed open-source Java library
-org.apache.commons.fileupload-1.2.2.LIFERAY-PATCHED-1.jar: Apache Commons File Upload
-lucene-analyzers-phonetic-4.10.3.jar: Lucene Phonetic Filters
-lucene-queries-4.10.3.jar: Lucene Queries Module
-kite-morphlines-core-0.18.0.jar: Kite Morphlines Core
-metrics-core-3.0.2.jar: Metrics Core Library
-metrics-healthchecks-3.0.2.jar: Metrics Health Checks
-config-1.2.1.jar: Config
-flume-ng-morphline-solr-sink-1.5.2.jar: Flume Morphline Solr Sink
 
Add the environment variables for Solr, Flume, Maven and Java to the bash shell file.
 
vi ~/.bashrc
 
export SOLR_HOME=/flume/solr-4.10.3/example/solr/collection1
export SOLR_CONF=/flume/solr-4.10.3/example/solr/collection1/conf
export MAVEN_HOME=/flume/apache-maven-3.2.2-bin
export FLUME_HOME=/flume/apache-flume-1.4.0-cdh4.6.0-bin
export FLUME_CONF=/flume/apache-flume-1.4.0-cdh4.6.0-bin/conf
export JAVA_HOME=/flume/jdk1.7.0_55
export PATH=$PATH:$FLUME_HOME/bin:$SOLR_HOME/bin:$MAVEN_HOME/bin
export CLASSPATH=$SOLR_HOME/lib/*:$FLUME_HOME/lib/*
 
Change directory (cd) to the /flume/apache-flume-1.4.0-cdh4.6.0-bin/flume-ng-sinks directory and run the mvn install command to generate the Flume Morphline Solr Sink jar file.
 
cd /flume/apache-flume-1.4.0-cdh4.6.0-bin/flume-ng-sinks
cd flume-ng-morphline-solr-sink
mvn install
 
Alternatively download the flume-ng-morphline-solr-sink-1.4.0.jar file.
 
wget http://central.maven.org/maven2/org/apache/flume/flume-ng-sinks/flume-ng-morphline-solr-sink/1.4.0/flume-ng-morphline-solr-sink-1.4.0.jar
 
Copy the flume-ng-morphline-solr-sink-1.4.0.jar file to the Flume lib directory.
 
cp flume-ng-morphline-solr-sink-1.4.0.jar $FLUME_HOME/lib
 

Configuring Flume

 
In this section we configure Flume in the $FLUME_HOME/conf/flume.conf file. Set the following properties.
 
agent.sources = exec1 (Flume source name)
agent.sinks = sink1 (Flume sink name)
agent.channels = ch1 (Flume agent channel)
agent.channels.ch1.type = memory (Flume channel type)
agent.sources.exec1.type = exec (Flume source type)
agent.sources.exec1.command = cat /home/oracle/app/oracle/diag/rdbms/orcl/ORCL/trace/alert_ORCL.log (exec command to get the Oracle Database log file data)
agent.sources.exec1.channels = ch1 (sets the channel on the source)
agent.sinks.sink1.morphlineId = morphline1 (sets the sink morphline id)
agent.sinks.sink1.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink (sets the Flume sink class)
agent.sinks.sink1.channel = ch1 (sets the channel on the sink)
agent.sinks.sink1.morphlineFile = /flume/apache-flume-1.4.0-cdh4.6.0-bin/conf/morphlines.conf (sets the morphline configuration file)
agent.sinks.sink1.batchSize = 1 (sets the batch size, the maximum number of events per Flume transaction)
agent.sinks.sink1.batchDurationMillis = 10000 (sets the maximum duration per Flume transaction; the transaction commits after the specified duration or when the batch size is exceeded, whichever comes first)
agent.channels.ch1.capacity = 1000000 (sets the channel capacity)
 
The flume.conf file is listed:
 
agent.sources=exec1
agent.sinks=sink1
agent.channels=ch1
agent.channels.ch1.type=memory
 
agent.sources.exec1.type=exec
agent.sources.exec1.command=cat /home/oracle/app/oracle/diag/rdbms/orcl/ORCL/trace/alert_ORCL.log
agent.sources.exec1.channels=ch1
 
agent.sinks.sink1.morphlineId = morphline1
agent.sinks.sink1.type= org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.sink1.channel=ch1
agent.sinks.sink1.morphlineFile = /flume/apache-flume-1.4.0-cdh4.6.0-bin/conf/morphlines.conf
 
agent.sinks.sink1.batchSize = 1
agent.sinks.sink1.batchDurationMillis = 10000
 
agent.channels.ch1.capacity = 1000000
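The essential point of the configuration above is the wiring: the source and the sink must both reference the declared channel. As a purely illustrative sketch (not part of the tutorial's toolchain), a few lines of script can parse the key=value pairs and verify that wiring:

```python
# Sanity-check the source/sink/channel wiring of a Flume properties file.
# The embedded config repeats the wiring-related lines from flume.conf above.
conf = """
agent.sources=exec1
agent.sinks=sink1
agent.channels=ch1
agent.sources.exec1.channels=ch1
agent.sinks.sink1.channel=ch1
"""

props = {}
for line in conf.strip().splitlines():
    if "=" in line:
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()

source = props["agent.sources"]    # exec1
sink = props["agent.sinks"]        # sink1
channel = props["agent.channels"]  # ch1

# Both the source and the sink must point at the declared channel.
assert props[f"agent.sources.{source}.channels"] == channel
assert props[f"agent.sinks.{sink}.channel"] == channel
print("wiring ok:", source, "->", channel, "->", sink)
```

A mismatch here (for example a sink pointing at an undeclared channel) is a common cause of a Flume agent starting with no data flowing.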
 
Next, configure the morphlines.conf file, which uses the HOCON format, a slightly modified JSON. Specify the Solr server location(s) in the SOLR_LOCATOR variable, which consists of the collection, solrUrl and solrHomeDir attributes: specify collection as collection1, solrUrl as http://localhost:8983/solr/, and solrHomeDir as /flume/solr-4.10.3/example/solr/collection1. Morphlines consume Flume events, convert/transform them into a stream of records, and pipe the stream through a chain of commands for eventual consumption by Solr. The readLine command reads a line of input; specify its charset as UTF-8, as the default character set is not supported. The Solr schema requires id as one of the fields, and readLine generates only a field called message for each line parsed, so add a unique id field using the generateUUID command. The sanitizeUnknownSolrFields command removes fields not specified in the Solr schema.xml file, the logDebug command logs the record at DEBUG level, and the loadSolr command loads the record into the Solr server. The morphlines.conf file is listed:
 
# Specify server locations in a SOLR_LOCATOR variable; used later in
# variable substitutions:
SOLR_LOCATOR : {
  # Name of solr collection
  collection : collection1
 
  solrUrl : "http://localhost:8983/solr/" 
 
  solrHomeDir: "/flume/solr-4.10.3/example/solr/collection1" }
 
# Specify an array of one or more morphlines, each of which defines an ETL
# transformation chain. A morphline consists of one or more potentially
# nested commands. A morphline is a way to consume records such as Flume events,
# HDFS files or blocks, turn them into a stream of records, and pipe the stream
# of records through a set of easily configurable transformations on the way to
# a target application such as Solr.
morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**", "org.kitesdk.**"]
 
    commands : [
      {
        readLine {
           charset : UTF-8
 
        }
      }
     
      {
          generateUUID {
              field : id
          }
      }
 
      # Consume the output record of the previous command and pipe another
      # record downstream.
      #
      # This command deletes record fields that are unknown to Solr
      # schema.xml.
      #
      # Recall that Solr throws an exception on any attempt to load a document
      # that contains a field that is not specified in schema.xml.
      {
        sanitizeUnknownSolrFields {
          # Location from which to fetch Solr schema
          solrLocator : ${SOLR_LOCATOR}
        }
      } 
           
      # log the record at DEBUG level to SLF4J
      { logDebug { format : "output record: {}", args : ["@{}"] } }   
     
      # load the record into a Solr server or MapReduce Reducer
      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
 
    ]
  }
]
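Conceptually, the command chain above turns each Flume event into a record and passes it through a series of transformations. The following sketch mimics that readLine, generateUUID and sanitizeUnknownSolrFields flow in plain Python; the field names message and id come from the tutorial, everything else is illustrative:

```python
import uuid

SCHEMA_FIELDS = {"id", "message", "_version_"}  # fields declared in schema.xml

def read_line(event_body):
    # readLine: turn one line of the log into a record with a message field
    return {"message": event_body}

def generate_uuid(record):
    # generateUUID: add a unique id field, required by the Solr schema
    record["id"] = str(uuid.uuid4())
    return record

def sanitize_unknown_solr_fields(record):
    # drop fields absent from schema.xml; Solr rejects documents with unknown fields
    return {k: v for k, v in record.items() if k in SCHEMA_FIELDS}

event = "ORACLE_BASE from environment = /home/oracle/app/oracle"
record = sanitize_unknown_solr_fields(generate_uuid(read_line(event)))
print(sorted(record))  # ['id', 'message']
```

The record that finally reaches loadSolr contains exactly the fields the Solr schema expects.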
 
 

Configuring Solr

 
Configure the fields to be stored from the Flume stream using the Solr schema.xml file. The id field is required in each record streamed to the Solr server and must be unique. The _version_ field is also used in each record but is generated automatically; it should not be added to the record in the morphlines.conf file, as adding it there generates an error. The readLine command generates the message field; add the message field to schema.xml. A snippet from schema.xml (including the vi command to open it) is listed:
 
vi /flume/solr-4.10.3/example/solr/collection1/conf/schema.xml
 
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.5">
<field name="message" type="string" indexed="true"  stored="true"  multiValued="false" />
   <field name="_version_" type="long" indexed="true" stored="true"/>
   <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
 <uniqueKey>id</uniqueKey>
    
</schema>
 
The Solr server auto-commits updates after a maximum time of 15000 ms (15 seconds). The openSearcher option in autoCommit is set to false by default, which means a new searcher is not opened after a commit; as a result, the newly indexed data is not visible to the searcher.
 
<autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
 
If the default setting of false is used for openSearcher, the Solr server must be restarted before the newly indexed data becomes available for search. Alternatively, set openSearcher to true in solrconfig.xml.
 
<autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>true</openSearcher>
  </autoCommit>
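Another option, rather than opening a searcher on every hard commit, is a soft commit: Solr's autoSoftCommit setting makes newly indexed documents searchable without the I/O cost of a hard commit. A possible solrconfig.xml fragment (the 5-second interval is illustrative, not from the tutorial):

```xml
<autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
</autoCommit>
<!-- soft commit: make newly indexed documents searchable every 5 seconds -->
<autoSoftCommit>
    <maxTime>5000</maxTime>
</autoSoftCommit>
```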
 
Start the Solr server with the following commands.
 
cd /flume/solr-4.10.3/example/
java -jar start.jar
 
The Solr server gets started.
 
 

Streaming Oracle Database Logs to Solr

 
With the Solr server running, Oracle Database logs may be streamed to the server using Flume. Run the following command to start the Flume agent.
 
flume-ng agent  --conf $FLUME_HOME/conf/ -f $FLUME_HOME/conf/flume.conf -n agent -Dflume.root.logger=INFO,console
 
The Flume channel, source and sink get started. As indicated in the output the sink is a Morphline sink.
 
 
A more detailed output from the Flume agent is listed:
 
[root@localhost flume]# flume-ng agent  --conf $FLUME_HOME/conf/ -f $FLUME_HOME/conf/flume.conf -n agent -Dflume.root.logger=INFO,console
Info: Sourcing environment configuration script /flume/apache-flume-1.4.0-cdh4.6.0-bin/conf/flume-env.sh
Configuration provider starting
2015-02-19 20:17:12,719 (conf-file-poller-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:133)] Reloading configuration file:/flume/apache-flume-1.4.0-cdh4.6.0-bin/conf/flume.conf
2015-02-19 20:17:12,897 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty(FlumeConfiguration.java:1016)] Processing:sink1
2015-02-19 20:17:13,277 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration.validateConfiguration(FlumeConfiguration.java:140)] Post-validation flume configuration contains configuration for agents: [agent]
2015-02-19 20:17:13,281 (conf-file-poller-0) [INFO - org.apache.flume.node.AbstractConfigurationProvider.loadChannels(AbstractConfigurationProvider.java:150)] Creating channels
2015-02-19 20:17:13,504 (conf-file-poller-0) [INFO - org.apache.flume.channel.DefaultChannelFactory.create(DefaultChannelFactory.java:40)] Creating instance of channel ch1 type memory
2015-02-19 20:17:13,560 (conf-file-poller-0) [INFO - org.apache.flume.node.AbstractConfigurationProvider.loadChannels(AbstractConfigurationProvider.java:205)] Created channel ch1
2015-02-19 20:17:13,572 (conf-file-poller-0) [INFO - org.apache.flume.source.DefaultSourceFactory.create(DefaultSourceFactory.java:39)] Creating instance of source exec1, type exec
2015-02-19 20:17:13,757 (conf-file-poller-0) [INFO - org.apache.flume.sink.DefaultSinkFactory.create(DefaultSinkFactory.java:40)] Creating instance of sink: sink1, type: org.apache.flume.sink.solr.morphline.MorphlineSolrSink
2015-02-19 20:17:14,192 (conf-file-poller-0) [INFO - org.apache.flume.node.AbstractConfigurationProvider.getConfiguration(AbstractConfigurationProvider.java:119)] Channel ch1 connected to [exec1, sink1]
2015-02-19 20:17:14,754 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:145)] Starting Channel ch1
2015-02-19 20:17:15,323 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:119)] Monitored counter group for type: CHANNEL, name: ch1: Successfully registered new MBean.
2015-02-19 20:17:15,330 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:95)] Component type: CHANNEL, name: ch1 started
2015-02-19 20:17:15,342 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:173)] Starting Sink sink1
2015-02-19 20:17:15,345 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:184)] Starting Source exec1
2015-02-19 20:17:15,386 (lifecycleSupervisor-1-3) [INFO - org.apache.flume.source.ExecSource.start(ExecSource.java:163)] Exec source starting with command:cat /home/oracle/app/oracle/diag/rdbms/orcl/ORCL/trace/alert_ORCL.log
2015-02-19 20:17:15,387 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.sink.solr.morphline.MorphlineSink.start(MorphlineSink.java:88)] Starting Morphline Sink sink1 (MorphlineSolrSink) ...
2015-02-19 20:17:15,457 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:119)] Monitored counter group for type: SINK, name: sink1: Successfully registered new MBean.
2015-02-19 20:17:15,480 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:95)] Component type: SINK, name: sink1 started
2015-02-19 20:17:15,559 (lifecycleSupervisor-1-3) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:119)] Monitored counter group for type: SOURCE, name: exec1: Successfully registered new MBean.
2015-02-19 20:17:15,567 (lifecycleSupervisor-1-3) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:95)] Component type: SOURCE, name: exec1 started
2015-02-19 20:17:17,109 (pool-3-thread-1) [INFO - org.apache.flume.source.ExecSource$ExecRunnable.run(ExecSource.java:370)] Command [cat /home/oracle/app/oracle/diag/rdbms/orcl/ORCL/trace/alert_ORCL.log] exited with 0
2015-02-19 20:17:17,977 (lifecycleSupervisor-1-1) [INFO - org.kitesdk.morphline.api.MorphlineContext.importCommandBuilders(MorphlineContext.java:89)] Importing commands
2015-02-19 20:18:29,208 (lifecycleSupervisor-1-1) [INFO - org.kitesdk.morphline.api.MorphlineContext.importCommandBuilders(MorphlineContext.java:106)] Done importing commands
2015-02-19 20:18:29,715 (lifecycleSupervisor-1-1) [INFO - org.apache.solr.core.SolrResourceLoader.<init>(SolrResourceLoader.java:136)] new SolrResourceLoader for directory: '/flume/solr-4.10.3/'
2015-02-19 20:18:42,328 (lifecycleSupervisor-1-1) [INFO - org.apache.solr.core.SolrConfig.initLibs(SolrConfig.java:565)] Adding specified lib dirs to ClassLoader
 [INFO - org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:574)] unique key field: id
2015-02-19 20:19:56,768 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.sink.solr.morphline.MorphlineSink.start(MorphlineSink.java:101)] Morphline Sink sink1
 started.
 
The Oracle Database logs get streamed to the Solr server.
 

Searching in Solr

 
Start the Solr Admin console with the URL http://localhost:8983/solr/. Select collection1 in the Solr Admin console. Click on Overview. The statistics list the instance as collection1, and the Data and Index directories.
 
 
Specify the URL http://localhost:8983/solr/#/collection1/query. The query handler page gets displayed with the Request-Handler as /select. The default query is *:*, which selects all documents in the collection.
 
 
The start and rows parameters are 0 and 10 respectively, which means 10 rows are selected starting from row 0. The default document format of the search result is json, set in the wt selection list; some of the other options are xml and csv. Click on Execute Query to search collection1 using the specified query.
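The same query the console issues can also be composed by hand. A minimal sketch, assuming the collection1 query handler lives at /solr/collection1/select, that URL-encodes the q, start, rows and wt parameters:

```python
from urllib.parse import urlencode

def select_url(base, q="*:*", start=0, rows=10, wt="json"):
    # Build the /select query URL with percent-encoded parameters.
    params = urlencode({"q": q, "start": start, "rows": rows, "wt": wt})
    return f"{base}/select?{params}"

url = select_url("http://localhost:8983/solr/collection1")
print(url)
# http://localhost:8983/solr/collection1/select?q=%2A%3A%2A&start=0&rows=10&wt=json
```

Note that urlencode percent-encodes the *:* query, which is why quoting matters when the same URL is typed into a shell.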
 
 
A JSON result set gets listed. The numFound attribute in the response indicates the number of documents found in the collection1.
 
 
Only the documents within the start and rows range get listed, not all the 6921 documents indicated by numFound. start is the zero-based position of the first result and rows is the number of records or documents on the page.
 
 
The start and rows may be set to any value within the range of the number of documents in the collection. For example, to list 20 documents starting from position 10 specify start as 10 and rows as 20.
 
 
The json format limits how many documents can usefully be listed on a page, since each JSON document spans multiple lines. If start is set to 0 and rows to 6921, the json format does not list all the documents for a query. The csv format is more compact and lists all the data in the collection. To list the query result as csv, select wt as csv.
 
 

Searching with Curl

 
The curl tool may be used to transfer data from the Solr server. To list the first 10 documents, run the following curl command. Quote the URL so that the shell does not interpret the & character.
 
curl "http://localhost:8983/solr/select?q=*:*&rows=10"
 
Curl lists the 10 documents from the Solr server.
 
 
A particular document may also be listed by specifying its id in the query parameter of the curl command.
 
 
To delete all documents from the server run the following curl commands.
 
curl http://localhost:8983/solr/update --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8'
curl http://localhost:8983/solr/update --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8'
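The same delete-and-commit sequence can be issued from a script. The sketch below uses only the standard library and builds the POST requests without sending them (it assumes the Solr 4 update handler at /solr/update, as in the curl commands above):

```python
from urllib.request import Request

SOLR_UPDATE = "http://localhost:8983/solr/update"
HEADERS = {"Content-type": "text/xml; charset=utf-8"}

def update_request(xml_body):
    # Build (but do not send) a POST to the Solr update handler.
    return Request(SOLR_UPDATE, data=xml_body.encode("utf-8"), headers=HEADERS)

delete_all = update_request("<delete><query>*:*</query></delete>")
commit = update_request("<commit/>")
print(delete_all.get_method(), delete_all.full_url)  # POST http://localhost:8983/solr/update
```

Sending each request with urllib.request.urlopen (against a running server) would have the same effect as the two curl commands.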
 
 
In this tutorial we collected and indexed Oracle Database logs in the Solr server and subsequently searched the logs from the Solr Admin console and with the curl tool.