IoT – What is a Lambda Architecture?

by Sherwin Jaleel

Distributed Systems 

The value that IoT solutions deliver lies in the streaming data these systems process. When you consider the importance of the data created by IoT systems, you soon appreciate how critical it is to access that data reliably, when and where it is needed. Typically, an IoT solution has four components: sensors/devices, connectivity, data processing, and a user interface. These components are interconnected, and data must flow between them for the solution to deliver value. This characteristic places IoT systems in the category of distributed systems.

CAP Theorem

In theoretical computer science, a fundamental theorem associated with distributed systems is the CAP Theorem. Distributed systems must make trade-offs between the ‘C,’ ‘A’ and ‘P’ in CAP, which stand for the qualities of Consistency, Availability, and Partition tolerance.

It is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees: Consistency, Availability and Partition tolerance.
CAP Theorem

Consistency: Clients reading data must see the same data, no matter which component of the distributed system the data is accessed from.

Availability: Availability assures the clients making requests for data that they will get a response, even if one or more components in the distributed system are down.

Partition tolerance: A partition is a communications break within a distributed system, such as a lost or temporarily delayed connection between two components. Partition tolerance means the system continues to operate despite such breaks.

Partition tolerance is non-negotiable for most IoT systems. Therefore, according to the CAP Theorem, IoT systems can guarantee only one of the other two qualities: Consistency or Availability. Most IoT solutions must deal with this tension between consistency and availability, and it is this tension that the Lambda architecture addresses.
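
To make the trade-off concrete, the toy Python sketch below models a single replica that, during a partition, must either refuse to answer (choosing consistency over availability) or answer from a possibly stale local copy (choosing availability over consistency). All class and method names here are hypothetical illustrations, not part of any real system.

```python
# Minimal sketch of the CAP trade-off during a network partition.
# Replica, read() and PartitionError are hypothetical names for illustration only.

class PartitionError(Exception):
    """Raised when a replica refuses to serve during a partition (consistency-first choice)."""


class Replica:
    def __init__(self, prefer_consistency: bool):
        self.prefer_consistency = prefer_consistency
        self.local_value = None      # possibly stale local copy of the data
        self.partitioned = False     # True when the link to peer replicas is down

    def read(self):
        if self.partitioned and self.prefer_consistency:
            # Consistency-first system: refuse to answer rather than risk stale data.
            raise PartitionError("replica cannot confirm the latest value")
        # Availability-first system (or no partition): always answer, possibly with stale data.
        return self.local_value


replica = Replica(prefer_consistency=False)   # an availability-first IoT ingestion node
replica.local_value = {"sensor": "co2", "ppm": 417}
replica.partitioned = True
print(replica.read())   # still answers, but the reading may be out of date
```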

Batch and Real-Time Stream Data Processing

The distinction between batch processing and real-time processing is one of the most fundamental principles in the IoT world. In a batch processing model, a collection of data points from a data stream is grouped over a specific time interval and processed together. Under the real-time model, data is processed in near real-time as it arrives. Both models are valuable for IoT solutions, but each addresses a different use case.

Consider a scenario where a disaster management operations centre must detect and deal with forest fires before they spread, become uncontrollable, and cause chaos and destruction. An operator might want to be alerted whenever the carbon content in the atmosphere increases by 1% more than three times in under 10 seconds. Such analytics needs only a small window of streaming data (about 10 seconds). If this data were processed using a batch model, the operator would not get the critical alerts until the batch processing completed, which could take hours or sometimes even days. In the IoT world, real-time processing of data streams handles such use cases.
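
To make the hot-path rule above concrete, here is a minimal Python sketch of a 10-second sliding-window check. The threshold constants, function name and sample readings are illustrative assumptions rather than details of any real deployment.

```python
from collections import deque

WINDOW_SECONDS = 10      # streaming window from the scenario above
RISE_THRESHOLD = 0.01    # a 1% increase over the previous reading
MAX_RISES = 3            # alert when more than three such rises occur in the window

window = deque()         # (timestamp, reading) pairs inside the current window


def on_reading(timestamp: float, reading: float) -> bool:
    """Return True when the alert condition is met after ingesting this reading."""
    window.append((timestamp, reading))
    # Drop readings that have fallen out of the 10-second window.
    while window and timestamp - window[0][0] > WINDOW_SECONDS:
        window.popleft()
    # Count how many consecutive readings rose by at least 1% over the previous one.
    points = list(window)
    rises = sum(
        1 for (_, prev), (_, cur) in zip(points, points[1:])
        if cur >= prev * (1 + RISE_THRESHOLD)
    )
    return rises > MAX_RISES


# Example: five readings in eight seconds, each roughly 2% higher than the last.
readings = [(0, 400.0), (2, 408.0), (4, 416.2), (6, 424.5), (8, 433.0)]
for ts, value in readings:
    if on_reading(ts, value):
        print(f"ALERT at t={ts}s: rapid carbon increase detected")
```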

On the other hand, consider a scenario where a data scientist wants to train a machine learning model to predict forest fires before they occur. The data scientist will want the carbon content readings leading up to every known previous incident when an alarm was raised. A wide window of data collected over several months or years must be trawled to obtain the required data. Batch processing is more suitable for such a use case.
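
A cold-path counterpart might look like the pandas sketch below, which trawls historical readings for the windows leading up to known alarms. The column names, sample values and 30-minute look-back are assumptions made purely for illustration.

```python
import pandas as pd

# In practice these frames would be read from the data lake (e.g. with pd.read_parquet);
# small inline samples keep the sketch self-contained. Columns are illustrative.
readings = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2023-07-01 11:40", "2023-07-01 11:50", "2023-07-01 12:00", "2023-08-15 09:00"]
        ),
        "ppm": [405.0, 420.0, 455.0, 401.0],
    }
)
incidents = pd.DataFrame({"alarm_time": pd.to_datetime(["2023-07-01 12:05"])})

LOOKBACK = pd.Timedelta(minutes=30)   # window of readings leading up to each alarm

# Collect the readings preceding every known incident to build labelled training windows.
training_windows = []
for alarm_time in incidents["alarm_time"]:
    mask = (readings["timestamp"] >= alarm_time - LOOKBACK) & (
        readings["timestamp"] < alarm_time
    )
    window = readings.loc[mask].copy()
    window["label"] = 1               # positive example: an alarm followed this window
    training_windows.append(window)

training_set = pd.concat(training_windows, ignore_index=True)
print(training_set)
```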

Data from IoT solutions caters for both use cases. IoT data required in near real-time usually flows through what is referred to as the “hot path” within an IoT system. IoT data that is required at a slower rate usually flows through what is referred to as the “cold path.”

Lambda Architecture

Lambda architecture is a workload pattern for handling batch and real-time streaming workloads within a single system. The main components of the Lambda architecture are the Streaming layer (which implements the “hot path”), the Batch layer (which implements the “cold path”), and a common Serving layer that combines the outputs from both paths in order to present consistent data to any person or downstream system that wants it.

The goal of the Lambda architecture is to present a holistic view of an organization’s data, both historical and near real-time, within a combined Serving layer, thereby delivering the Consistency quality of the CAP Theorem.

Availability is ensured by deploying the IoT solution into a resilient ecosystem, such as the cloud (think Azure, AWS, etc.).

The image above is an overview of the Lambda architecture. There are three layers in a Lambda architecture: the Streaming (or Speed) layer, the Batch layer, and the Serving layer. In a Lambda architecture, all new data is fed continuously to the Batch and Streaming layers simultaneously.

Data fed into the Batch layer goes through an ETL process and is ultimately persisted in a physical data store. The Batch layer typically processes data on predefined schedules and at regular intervals. It is responsible for managing the master dataset and creating pre-computed views of the data.

The Streaming (Speed) layer manages the most recently added data, which is not yet available in the Batch layer. Since there is a lag before incoming data becomes available in the Batch layer, the Speed layer exists to narrow that latency gap by making all new data available in near real-time.

The Batch and Speed layers (i.e. the cold and hot paths) converge at the Serving layer. The Batch layer feeds into the Serving layer, which indexes the batch view for optimized and efficient querying. The Speed layer also updates the Serving layer with incremental updates based on the most recent data. The Serving layer presents a holistic and consistent view of the data that caters to user queries.
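
One minimal way to picture how the two paths converge is sketched below: a precomputed batch view, a real-time view holding only data newer than the batch cut-off, and a serving function that merges the two. The layer names, values and naive averaging are illustrative only, not a prescription for a real serving layer.

```python
from datetime import datetime, timezone

# --- Batch layer: precomputed view, rebuilt on a schedule over the master dataset ---
batch_view = {
    "sensor-42": {"avg_ppm": 412.7, "as_of": datetime(2024, 1, 1, tzinfo=timezone.utc)}
}

# --- Speed layer: incremental view covering only data newer than the batch cut-off ---
realtime_view = {
    "sensor-42": {"avg_ppm": 418.3, "since": datetime(2024, 1, 1, tzinfo=timezone.utc)}
}


def serve(sensor_id: str) -> dict:
    """Serving layer: merge the batch view with the real-time increment for a query."""
    batch = batch_view.get(sensor_id)
    recent = realtime_view.get(sensor_id)
    if batch and recent:
        # A real system would weight by record counts; a simple average keeps the sketch short.
        return {"avg_ppm": (batch["avg_ppm"] + recent["avg_ppm"]) / 2}
    return batch or recent or {}


print(serve("sensor-42"))
```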

As an example, a baseline of Azure-based services that can be used to build a Lambda architecture is shown in the diagram below. Similar baseline architectures can be constructed from the AWS cloud services ecosystem.

Azure IoT Hub caters to the demands of the Streaming layer. It provides bi-directional communication between IoT devices and a scalable message queue that can handle millions of simultaneously connected devices. Azure Stream Analytics, another component shown in the Streaming layer, can run continual queries against an unbounded stream of data in near real-time. It can aggregate the data over temporal windows and persist results to sinks such as storage, databases, or directly to reports in Power BI. Azure Time Series Insights is another Azure service that can operate within the Streaming layer of a Lambda architecture. It collects, processes and stores time-series data, and is designed for ad hoc data exploration and operational analysis, uncovering hidden trends and spotting anomalies in near real-time.
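
On the device side of the Streaming layer, feeding telemetry into IoT Hub can look roughly like the sketch below, which assumes the azure-iot-device Python SDK's synchronous client. The connection string and payload fields are placeholders, not values from the article.

```python
import json
import time

# Requires the azure-iot-device package (pip install azure-iot-device).
from azure.iot.device import IoTHubDeviceClient, Message

# Placeholder connection string; a real one comes from the device's IoT Hub registration.
CONNECTION_STRING = "HostName=<hub>.azure-devices.net;DeviceId=<device>;SharedAccessKey=<key>"

client = IoTHubDeviceClient.create_from_connection_string(CONNECTION_STRING)
client.connect()

# Send a carbon reading every few seconds; downstream, Stream Analytics or
# Time Series Insights can query this stream in near real-time.
for _ in range(3):
    payload = {"sensor": "co2", "ppm": 417.2, "ts": time.time()}
    client.send_message(Message(json.dumps(payload)))
    time.sleep(5)

client.disconnect()
```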

Azure Event Hubs can ingest millions of event messages per second and fits in both the Streaming and Batch layers. Azure Data Factory is Azure’s cloud ETL service, suitable for data integration and data transformation, both functions of the Batch layer in a Lambda architecture. Azure Data Lake Storage Gen2 can store and serve multiple petabytes of information while sustaining hundreds of gigabits of throughput, allowing massive amounts of data to be managed efficiently. Within the Batch layer, the data lake stores the data that is continuously fed to the Batch and Streaming layers simultaneously.
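
Within the Batch layer, the master dataset is commonly laid down as date-partitioned files. The sketch below uses pandas to write Parquet partitions to a local folder standing in for a Data Lake container, since the actual storage wiring depends on the deployment; the folder layout, column names and sample data are assumptions for illustration.

```python
from pathlib import Path

import pandas as pd

# A batch of raw readings arriving from the ingestion pipeline (illustrative data).
readings = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2024-05-01 10:00", "2024-05-01 10:05", "2024-05-02 09:00"]
        ),
        "sensor": ["co2-01", "co2-01", "co2-02"],
        "ppm": [411.2, 414.8, 409.5],
    }
)

# Partition the master dataset by date, mirroring a typical data-lake folder layout.
root = Path("datalake/raw/carbon")   # stands in for an ADLS Gen2 container path
for date, day in readings.groupby(readings["timestamp"].dt.date):
    out_dir = root / f"date={date}"
    out_dir.mkdir(parents=True, exist_ok=True)
    # Writing Parquet needs pyarrow or fastparquet installed.
    day.to_parquet(out_dir / "part-0000.parquet", index=False)
```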

Azure Synapse is a distributed query system and enterprise analytics service that accelerates time to insight across data warehouses and big data systems. It is a suitable service for the Serving layer because of its capability to query and serve data from several sources. Azure Stream Analytics can output to a dedicated SQL pool table in Azure Synapse Analytics and can process throughput rates of up to 200 MB/sec, supporting real-time analytics and hot-path data processing needs for workloads such as reporting and dashboarding. Data scientists and engineers use Azure ML to train and deploy models; data for these models comes from the Batch layer (training data) and the Streaming layer (inference data). Power BI turns unrelated sources of data into coherent, visually immersive, and interactive insights, providing consistent data views across both the Batch and Streaming layers.

Limitations of Lambda Architecture

A drawback of the Lambda architecture is its complexity. Keeping the three layers synchronized is difficult, and running both the Batch and Speed layers requires more computational power and time. Moreover, the processing logic appears in two different places (the cold and hot paths), leading to duplication of computational logic.