Written by Juan Carlos Olamendy Turruellas

 

Introduction

 

In this article, I want to talk about the evolution of data pipeline architectures from traditional centralized, batch-oriented, report-only data warehouses towards modern ones based on distributed data stores, distributed computing, near real-time processing, and the use of machine learning and analytics to support decision making in today’s fast-changing business environment.

 

Big data has matured and become one of the pillars of any business strategy today, specifically in sales and digital marketing, where it helps increase revenue and customer loyalty and meet customer expectations. In a highly competitive and regulated environment, businesses are required every day to make decisions based on data instead of intuition. To make good decisions, it’s necessary to process a huge amount of data efficiently (with as few computing resources and as little processing latency as possible), add new data sources (structured, semi-structured and unstructured ones such as UI activities, logs, performance events, sensor data, emails, documents, social media, etc.) and support the decisions using machine learning algorithms and visualization techniques.

 

Some companies, such as Netflix, publicly declare themselves data-oriented because their core business, products and services are based on insights derived from data analysis.

 

So, let’s see how we can transform our traditional data warehouse architecture into a modern one that can handle the challenges of big data and large-scale computing.

 

Traditional data warehouse architecture

 

A traditional data warehouse architecture comprises the following elements:

  • Transactional systems (OLTP). Produce and record the transactional data/facts resulting from business operations
  • A data warehouse (DWH). A centralized data store that integrates and consolidates the transactional data
  • ETL processes. Batch processes that move the transactional data from the OLTP systems to the DWH
  • Data marts and cubes. Derived and aggregated views of the DWH
  • Analytical and visualization tools. Enable the visualization of the data stored in the data marts and cubes for reporting and auditing purposes. They perform first-generation analytics using data mining techniques
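
To make the role of the ETL process concrete, here is a minimal sketch of one batch run. It is only an illustration under stated assumptions: two in-memory SQLite databases stand in for the OLTP system and the DWH, and the table names (orders, sales_by_region) and the function run_etl_batch are hypothetical names invented for this example.

```python
import sqlite3

# Hypothetical OLTP source: one row per order (names are illustrative).
oltp = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)")
oltp.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "north", 120.0), (2, "north", 80.0), (3, "south", 50.0)],
)

# Hypothetical DWH target holding an aggregated, report-oriented table.
dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE sales_by_region (region TEXT PRIMARY KEY, total REAL)")


def run_etl_batch(source, target):
    """One batch run: extract from OLTP, aggregate, load into the DWH."""
    # Extract: read the transactional facts.
    rows = source.execute("SELECT region, amount FROM orders").fetchall()
    # Transform: aggregate amounts per region.
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0.0) + amount
    # Load: replace the aggregated table inside one transaction.
    with target:
        target.execute("DELETE FROM sales_by_region")
        target.executemany(
            "INSERT INTO sales_by_region VALUES (?, ?)", sorted(totals.items())
        )
    return totals


run_etl_batch(oltp, dwh)
report = dwh.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
print(report)
```

Note that the order_id column never reaches the warehouse in this sketch: the batch job keeps only what the target table needs, which is why a reporting tool on top of the DWH cannot drill back down to individual orders.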

 

This kind of architecture can be illustrated in the following figure.

 


 

Figure 01

 

This architecture has some drawbacks as shown below:

  • Data sources are limited to transactional systems (OLTP).
  • The major workload is based on ETL batch processing (jobs). It’s well known that detail can be lost in this step, since typically only the data that fits the target schema is kept.
  • The integration and consolidation of data is very complex due to the rigid nature of ETL jobs.
  • The data schema is not flexible enough (on the OLTP, DWH, data mart and cube sides) to be extended for new analytics use cases.
  • It’s very complex to integrate semi-structured and unstructured data sources. So, we lose very important information in the form of log files, sensor data, emails and documents.
  • It doesn’t naturally support real-time and interactive analytics. It’s designed to be batch-oriented.
  • It’s very limited when we need to scale the solution.
  • It’s designed to be used in on-premises environments, so it’s very complex to extend and deploy in cloud-based and hybrid environments.

 

So, in order to overcome the limitations of the previous architecture, we need to think in terms of new paradigms.

 

Modern pipeline architectures

 

The modern pipeline architecture is an evolution of the previous one. It integrates new data sources and new computing paradigms, as well as artificial intelligence, machine learning algorithms and cognitive computing.

 

In this new approach, we have a pipeline engine with the following features:

  • Unified data architecture for all the data sources, no matter the structure of the origin. Integration with existing data marts and data warehouses. This is where the concept of the Enterprise Data Lake appears
  • Flexible data schemas designed to be changed frequently. Use of NoSQL-based data stores
  • Unified computing platform for processing any kind of workload (batch, interactive, real-time, machine learning, cognitive, etc.). Use of distributed platforms such as Hadoop, Spark, Kafka, Flume, etc.
  • Deployable in hybrid and cloud-based environments
  • Horizontal scalability, so we can process growing data volumes by just adding new nodes to the cluster
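
The first two features above (one store for any source, and schemas that are allowed to change) can be sketched in a few lines. This is a minimal illustration rather than a real data lake: a Python list stands in for the store, and the sample records, the log format matched by LOG_PATTERN and the function to_document are all assumptions made up for this example.

```python
import json
import re

# Hypothetical raw inputs of three different shapes landing in the same store.
raw_inputs = [
    ("oltp", {"order_id": 7, "customer": "c42", "amount": 19.9}),
    ("clickstream", '{"user": "c42", "event": "add_to_cart", "sku": "A-1"}'),
    ("weblog", "2017-03-01T10:15:00 GET /checkout 200"),
]

# Assumed log layout: timestamp, HTTP method, path, three-digit status code.
LOG_PATTERN = re.compile(r"^(\S+) (\S+) (\S+) (\d{3})$")


def parse_text(payload):
    """Interpret a text payload on read: JSON first, then the assumed log layout."""
    try:
        parsed = json.loads(payload)
        if isinstance(parsed, dict):
            return parsed
    except ValueError:
        pass
    match = LOG_PATTERN.match(payload)
    if match:
        ts, method, path, status = match.groups()
        return {"ts": ts, "method": method, "path": path, "status": int(status)}
    # Keep unknown data as-is instead of dropping it.
    return {"raw": payload}


def to_document(source, payload):
    """Wrap any payload in a common envelope without forcing a fixed schema."""
    body = payload if isinstance(payload, dict) else parse_text(payload)
    return {"source": source, "body": body}


lake = [to_document(source, payload) for source, payload in raw_inputs]
for document in lake:
    print(document)
```

The design choice to notice is that nothing is rejected at write time. Each record keeps its own set of fields, and data that cannot be parsed yet is stored raw so that a later use case can still interpret it.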

 

Although there is a huge number of technologies related to big data and analytics (as big as big data itself), I’ll show a reference architecture for a modern data pipeline. I’ll illustrate the functional aspect of every layer using particular technologies so that you can research them further and learn more (see Figure 02).

 


 

Figure 02

 

From this reference architecture, we can derive specific use cases for your business.

 

Use Case 01. Data warehouse offloading

 

Data warehouses have been in use at major companies for many years. With the exponential growth of data, DWHs are reaching their capacity limits and batch windows are growing, putting SLAs at risk. One approach is to migrate heavy ETL and calculation workloads to Hadoop in order to achieve faster processing times, lower the cost per unit of stored data and free up DWH capacity for other workloads.

 

Here we have two major options:

  • One option is to load raw data from the OLTP systems into Hadoop, then transform the data into the required models, and finally move the data into the DWH. We can also extend this scenario by integrating semi-structured, unstructured and sensor-based data sources. In this sense, the Hadoop environment acts as an Enterprise Data Lake.
  • Another option is to move data from the DWH into Hadoop using Sqoop in order to do pre-calculations, and then store the result in data marts to be visualized using traditional tools. Because the storage cost on Hadoop is much lower than on a DWH, we can save money and keep the data for a longer time. We can also extend this use case by taking advantage of the analytical power of the platform and creating predictive models using Spark MLlib, Mahout or the R language, or by applying cognitive computing using IBM Watson, in order to support future business decisions.
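
The first option can be sketched as follows. This is a toy illustration under explicit assumptions, not a Hadoop job: a temporary directory stands in for the data lake, an in-memory SQLite database stands in for the DWH, and the file name, table name and both functions are hypothetical.

```python
import csv
import sqlite3
import tempfile
from collections import defaultdict
from pathlib import Path

# Stand-in for the data lake: a directory of raw files (HDFS in the article).
lake_dir = Path(tempfile.mkdtemp())
raw_file = lake_dir / "orders_raw.csv"
raw_file.write_text(
    "order_id,customer,amount\n"
    "1,c1,10.0\n"
    "2,c1,15.5\n"
    "3,c2,7.0\n"
)


def transform_in_lake(path):
    """Heavy step done outside the DWH: aggregate raw orders per customer."""
    totals = defaultdict(float)
    with path.open(newline="") as handle:
        for row in csv.DictReader(handle):
            totals[row["customer"]] += float(row["amount"])
    return sorted(totals.items())


def load_into_dwh(rows):
    """Only the compact, modeled result reaches the warehouse."""
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE customer_totals (customer TEXT, total REAL)")
    with warehouse:
        warehouse.executemany("INSERT INTO customer_totals VALUES (?, ?)", rows)
    return warehouse


dwh = load_into_dwh(transform_in_lake(raw_file))
result = dwh.execute(
    "SELECT customer, total FROM customer_totals ORDER BY customer"
).fetchall()
print(result)
```

The raw file stays in the lake after the load, so the detailed data remains available for other workloads while the warehouse only stores and serves the small aggregated table.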

 

We can visualize this scenario as shown in Figure 03.

 


 

Figure 03

 

Use Case 02. Near real-time analytics on data warehouse

 

In this scenario, we have Flume agents installed on every data source to ingest data into the pipeline. We can also use Kafka as a streaming data source (the front end for real-time analytics) to store the incoming data in the form of events/messages. As part of the ingestion process, the data is stored directly on the Hadoop file system or in a scalable, fault-tolerant, distributed database such as HBase or Cassandra. Then the data is processed and predictive models are created using Spark, Scala and MLlib. The results are indexed in Elasticsearch to improve the search capabilities of the platform, the predictive models can be stored in the Hadoop file system, and the results of calculations can be stored in Cassandra. The data can be consumed by traditional tools as well as by web and mobile applications via a REST API.
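
The processing step in this scenario usually boils down to computing aggregates over small time windows as events arrive. The sketch below shows that idea in plain Python instead of Spark, so it is only an approximation of the technique: a list stands in for a Kafka topic, and the event layout (timestamp in seconds, page name), the 60-second window and the function window_counts are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical events as (epoch_seconds, page); a list stands in for a Kafka topic.
events = [
    (0, "home"), (12, "cart"), (59, "home"),
    (60, "home"), (61, "checkout"), (125, "cart"),
]

WINDOW_SECONDS = 60


def window_counts(stream, window_seconds=WINDOW_SECONDS):
    """Count events per page inside fixed (tumbling) time windows."""
    windows = defaultdict(Counter)
    for timestamp, page in stream:
        # Each event belongs to exactly one window, keyed by the window start.
        window_start = (timestamp // window_seconds) * window_seconds
        windows[window_start][page] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


counts = window_counts(events)
for start, per_page in counts.items():
    print(start, per_page)
```

In a real deployment, each completed window would be written to a serving store (Cassandra or Elasticsearch in the scenario above), so dashboards and applications see results within about one window length instead of waiting for the next batch run.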

 

This architectural design has the following benefits:

  • Adds near real-time processing capabilities on top of batch-oriented processing
  • Lowers the latency for getting actionable insights, which improves the agility of the business in making decisions
  • Lowers the storage cost significantly compared to traditional data warehousing technologies, because we can build low-cost clusters of commodity data and processing nodes on-premises and in the cloud
  • Is ready to be migrated to the cloud and to take advantage of elasticity, adding computing resources when the workload increases and releasing them when they’re not needed

 

We can visualize this scenario as shown in Figure 04.

 


 

 

Figure 04

 

Conclusion

 

In this post, I've talked about the evolution of data pipeline architectures from the traditional data warehouse towards today's modern ones. Using the architectural patterns and strategies explained above, you can adapt your data pipeline architectures to be more scalable and resilient, and to help you make better decisions in today's changing world.