Data platform modernization

LUMIQ.AI · Jun 23, 2021

A real need or just another sales pitch?

The first wave of analytics started with managers needing information about how the business was operating. The analytical output produced was reports, and later, dashboards. Often, these analytical systems were built directly on top of the transactional databases or file systems.

As the adoption of these reports increased and other analytical needs arose (including dashboards and slicing/dicing capabilities), segregating the analytical workloads from the transactional systems was the logical next step.

This is how the world’s first data warehouse was born.

The warehouses were populated with structured data from the underlying operational systems. After that, all queries for analysis (most often SQL queries running on top of RDBMS) were performed within the warehouses.

As analytical needs evolved from knowing what happened in the past to understanding what will happen in the future, a whole new era of analytics emerged that focused on discovering the decisions that would create favorable business outcomes.

However, as with most things in life, before understanding the “modernization” (or the future), it is crucial to understand what happened in the past and how we reached where we are today.

Here is a bird’s-eye view of the evolution of analytical platforms that we’ll dive into more deeply later:

Analytical platforms through the ages

Rewinding to the 1980s

On the whole, the data that enterprises generated could be grouped as follows:

  • Transactional data — This represented the daily operations of an organization, i.e. the business events captured by LOB applications (think of policy administration in insurance, the core banking system in banking, and so on).
  • Analytical data — This supported BI, dashboarding, querying, analysis, and ultimately decision-making, i.e. it described business performance and lived in dashboards and BI tools, data warehouses (including data marts), the ODS, statistical analysis, and so on.
  • Master data — This represented the key business entities upon which transactions were executed and the dimensions around which analysis was performed.

ETL (Extract, Transform, and Load) was the primary process for moving master data and structured transactions into the analytical stores. It extracted data from the different sources, transformed it (applying calculations, for example), and finally loaded it into the data warehouse.
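
As a minimal sketch of this ETL pattern in Python, the snippet below pulls records from a transactional source, applies a simple transformation, and loads the result into a warehouse table. The database files, table names, and columns are hypothetical and used only for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical connections: a transactional source system and an analytical warehouse.
source = sqlite3.connect("policy_admin.db")   # operational system (illustrative)
warehouse = sqlite3.connect("warehouse.db")   # analytical store (illustrative)

# Extract: pull raw transactions from the operational system.
transactions = pd.read_sql(
    "SELECT policy_id, premium, issue_date FROM policies", source
)

# Transform: apply calculations and conform the data to the warehouse model.
transactions["issue_year"] = pd.to_datetime(transactions["issue_date"]).dt.year
yearly_premium = (
    transactions.groupby(["policy_id", "issue_year"], as_index=False)["premium"].sum()
)

# Load: write the conformed data into a warehouse fact table.
yearly_premium.to_sql("fact_yearly_premium", warehouse, if_exists="append", index=False)
```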

External, surrogate, or partner data was also routed similarly. Data marts were created to drive the BI, reporting, and dashboards.

Fast forward to the 2010s

One of the fundamental limitations of data warehouses was that semi-structured and unstructured data could not fit into them.

The response was the data lake architecture, which promised elastic storage and elastic compute, that is, the flexibility to decouple compute and storage and scale them independently. The Hadoop ecosystem and its variants emerged and primarily solved the problem of data storage.

The creation of the data lake was accompanied by the emergence of three significant trends at the beginning of the 2010s:

  • Big Data

A tsunami of structured, semi-structured, and unstructured data (best described by Doug Laney’s classic three V’s: volume, velocity, and variety) became mainstream. The data warehouses of the 80s were not prepared to store, process, and analyze this data.

  • The birth of the new ‘Cloud’ paradigm

This was made possible by the emergence of AWS, Azure, and GCP, to name a few, which made it easy to consume compute, storage, and managed services on the cloud.

  • The arrival of data science

With the democratization of AI/ML technologies, the cost of implementing them dropped dramatically. More and more businesses wanted to leverage the newly available data and move into predictive and prescriptive analytics. AI/ML use cases started becoming mainstream.

The data lake architecture was an attempt at fixing the problems of the data warehouse and accommodating new technological advancements.

However, operating AI/ML initiatives in a data lake architecture was not streamlined. The lake lacked concurrency, security, and governance. Above all, it lacked the reliability and query performance that were the hallmark of the data warehouse. Multiple versions of the same data existed across the ecosystem of the lake, data scientists struggled with the tools, and managing the life cycle of models required a “jugaad” or “hack”.

Fast forward to the 2020s

Data-driven initiatives consolidated into three broad themes — BI (business intelligence), AI (artificial intelligence), and DI (data interchange).

Analytical use case types

BI addressed the descriptive and diagnostic aspects, while AI covered the predictive and prescriptive ones. The combination of BI and DI resulted in a strong focus on understanding customers and anticipating their behavior which, in turn, led to a new era of cross-company collaborations. There was an explosion of innovative business models that relied heavily on frequent data exchange patterns (like e-commerce platforms selling insurance and other financial products).

Nowadays, cloud-native applications leveraging serverless computing have emerged as strong contenders for internal- and external-facing data interchange/exchange applications. The following functional architecture represents how AI, BI, and DI use cases can be deployed seamlessly using the lakehouse architecture.
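
As a rough sketch of what one such serverless data-exchange endpoint can look like, the handler below follows the AWS Lambda calling convention; the event shape, field names, and response payload are assumptions made purely for illustration.

```python
import json


def handler(event, context):
    """Minimal serverless data-exchange endpoint (AWS Lambda-style handler)."""
    body = json.loads(event.get("body") or "{}")
    policy_id = body.get("policy_id")

    # In a real integration, the request would be enriched from the analytical
    # platform here, e.g. with a propensity score or a pre-computed quote.
    response = {"policy_id": policy_id, "status": "QUOTED", "premium": 2500.0}

    return {"statusCode": 200, "body": json.dumps(response)}
```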

The need for modernization

Modernization is required to cater to the needs of the new analytical landscape that has emerged. The new age of analytics requires managing all kinds of data (structured, semi-structured, unstructured, or streaming) and aggregating it.

Working with modern-day data also requires the ability to handle multiple storage formats. These include open and standardized formats such as Parquet, as well as other columnar- or row-optimized stores, including RDBMS and NoSQL. Transaction (ACID) support is needed so that multiple processes can read and write the same data reliably.
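
To make the formats and transaction point concrete, here is a minimal sketch using the open-source delta-spark package, one of several table formats (alongside Apache Iceberg and Apache Hudi) that layer ACID transactions on top of Parquet files. The table path and schema are hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark package is installed; the configuration follows the
# standard Delta Lake setup for Spark.
spark = (
    SparkSession.builder
    .appName("acid-table-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Writer: append new policy records; the table's transaction log makes the write atomic.
new_policies = spark.createDataFrame(
    [("P-1001", "ACTIVE", 2500.0)], ["policy_id", "status", "premium"]
)
new_policies.write.format("delta").mode("append").save("/data/lakehouse/policies")

# Readers: concurrent readers always see a consistent snapshot of the table.
spark.read.format("delta").load("/data/lakehouse/policies").show()
```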

Additionally, API capabilities are often required so that various tools and engines, including machine learning code in Python or R, can access the data directly and efficiently. From a use-case perspective, each of AI, BI, and DI needs to be supported right from development in the lab through to deployment in production.
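
For example, a data scientist can read a curated dataset straight out of the platform’s open storage formats into Python and train a model on it; the file path, feature columns, and model choice below are purely illustrative.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Read a curated Parquet dataset directly from the platform's storage layer
# (the path and column names are hypothetical).
df = pd.read_parquet("/data/curated/policy_features.parquet")

X = df[["premium", "tenure_months", "claims_last_year"]]
y = df["lapsed"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a simple lapse-propensity model on the extracted features.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Hold-out accuracy:", model.score(X_test, y_test))
```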

A modern data warehouse is capable of meeting the data needs of today. The warehouse here is essentially a concept and is best described as a “platform”: a platform that has a warehouse architecture as a part of it but also many other architectural components. This platform runs on a modern cloud-native architecture and fulfills all the modern analytical needs while also servicing the traditional warehouse requirements. It provides solid data plumbing that can bring in fast and slow, new and old data from virtually any data source, leveraging technology as varied as Change Data Capture (CDC), Apache Kafka, Apache Airflow, AWS Glue, or Informatica, to name a few.
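
As an illustration of this plumbing and orchestration layer, here is a minimal Apache Airflow 2.x DAG, one of the tools named above, that chains a hypothetical extract step with a load step; the pipeline name and task bodies are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_from_source():
    # Placeholder: pull incremental changes from the source, e.g. via CDC or an API.
    pass


def load_into_platform():
    # Placeholder: land the extracted batch into the platform's raw zone.
    pass


with DAG(
    dag_id="nightly_policy_ingest",      # hypothetical pipeline name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_from_source)
    load = PythonOperator(task_id="load", python_callable=load_into_platform)

    extract >> load                      # run the extract before the load
```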

There also needs to be a visualization front end like Tableau or other BI tools. Data marts, mechanisms for building and deploying data science models and maintaining their lifecycle, and tools for data governance are also needed. Most importantly, modern data warehouses require people and integrations with the business processes that consume the analytics and make the decisions, whether fully automated, manual, or assisted.

Characteristics of a modern data platform

All in all, the modern data platform becomes a catalyst and an accelerator for businesses that are looking to adopt predictive/prescriptive analytics or have already enabled AI, BI, or DI analytical workloads. The platform helps them achieve quicker go-to-market and faster value realization.

Like the data warehouses of the old world, these “data platforms” are already becoming an integral part of the data landscape, especially at companies that are innovating with data.

Lumiq’s approach to dealing with data

Lumiq has executed over 20 implementations of this functional architecture across several financial services companies. The starting points for these journeys have been AI or BI, and in some cases modernization itself, but the experiences and critical learnings can be summarised as follows:

  • Crawl-walk-run, and most importantly, “learn.”
  • Experience first. Build use case by use case but ensure that the same data is not loaded again and again.
  • Build like LEGO. New tools will keep coming up, and they will be better than the previous ones.
  • Everything is a trade-off, and hybrid approaches work best. It is not about cloud vs on-premises: PII may be stored on-premises while the rest of the data lives on the cloud. Nor is it about a managed database service on the cloud vs a cloud warehouse vs a virtual machine hosting a database, or polyglot persistence vs maintaining just a single copy of the data. The architecture must facilitate as much flexibility and adaptability as possible.
  • Migrations will happen, during which data may get organized and re-organized. Carry forward the learnings and, most importantly, the metadata.
  • Embrace uncertainty since the requirements will keep on evolving.

Do the principles outlined above feel overwhelming to put into practice? We have successfully applied them as core principles on every client project. If our approach intrigues you, we would love to discuss tailored options for your data transformation journey. Contact us here.
