
From OLTP to Lakehouse: An engineering problem-solution chronicle

A concise storyline on how modern data systems evolved, why, and with what trade‑offs.
September 18, 2025 by NNZ


Period 0 - Codd's Relational Model (1970s-1980s)

In the 1960s and 1970s, as industries like banking, airlines, and retail became increasingly computerized, they faced a problem: the need for systems that could process a high volume of concurrent transactions quickly and correctly. 

The dominant database models of the time were hierarchical and network databases, and they were ill-equipped to handle these demands.

These systems were fundamentally navigational. People would write complex application code that traversed physical data pointers and understood the on-disk layout of information. This created a tight coupling between applications and storage layers, making systems brittle and difficult to scale. 

Retrieving data, even for simple queries, required specialized technical mastery and significant programming effort.

The theoretical breakthrough that solved this impasse came in 1970 from Edgar F. "Ted" Codd, a mathematician at IBM's San Jose research laboratory. In his seminal paper, A Relational Model of Data for Large Shared Data Banks [1], Codd proposed a radical new approach. The core motivation was to achieve data independence: meaning a sharp, clean separation between the logical representation of data (how users and applications view it) and its physical storage and organization.


Mechanism. IBM's System R project (1974–1979) demonstrated that a relational system could achieve high performance at production scale:

  • SQL as the declarative interface.
  • A cost‑based optimizer selects access paths using statistics.
  • Compilation and plan caching amortize parse/plan overhead.


TLDR. The goal is data independence (logical ≠ physical). Codd's central idea was to represent all data in simple, two-dimensional tables, which he called "relations," and to define relationships between these tables based on common data values, not on machine-level pointers stored within the records.


Edgar "Ted" Codd - source: https://www.ibm.com/history/relational-database



Interlude - a quick stop to look at SQL

Why it matters. SQL is the relational model’s executable interface. Its history explains why declarative query + optimizer became the universal contract across engines.

Origins (1973–1979). At IBM, Chamberlin & Boyce design SEQUEL for System R; renamed SQL. System R proves the loop: declarative query → cost‑based plan → compiled execution. SQL covers DDL, DML, and DCL (privileges).

Early products (1979–1995). Oracle ships a commercial SQL DB (1979). IBM follows with SQL/DS (1981) and DB2 (1983). INGRES (QUEL) and Postgres initially use non‑SQL query languages but later adopt SQL (PostgreSQL, mid‑1990s); Sybase/Microsoft release SQL Server (late 1980s); MySQL appears (1995).

Why SQL won.

  • Declarative intent + optimizer decouples what from how.
  • Sound transformations: backed by relational algebra/calculus.
  • Ecosystem interoperability: vendors, tools, analysts, and training standardized on SQL; dialects differ, but the core travels.

SQL won as opposed to what? SQL's primary rival was QUEL, developed for the Ingres database at UC Berkeley. Many academic experts and database pioneers considered QUEL a superior language. However, SQL won due to commercial and strategic factors, not just technical merit.

Contrast with Pre-Relational DMLs. SQL's DDL/DML structure also represented a major leap from the languages of earlier network databases like CODASYL. The terms DDL and DML were first introduced with the CODASYL model, but its DML was procedural and navigational: it consisted of verbs embedded in a host programming language (like COBOL) that required programmers to manually traverse record-by-record pointer chains. SQL's declarative, set-based DML abstracted this complexity away entirely.
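The contrast can be made concrete with a minimal sketch using Python's built-in sqlite3 module (the tables and data are illustrative): a single declarative, set-based join replaces the record-by-record pointer traversal a CODASYL program would spell out by hand.

```python
import sqlite3

# In-memory database; schema and data are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id),
                         amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# Declarative and set-based: state WHAT you want; the engine chooses
# the access path (scan vs index, join order) -- no pointer chasing.
rows = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)  # [('Ada', 65.0), ('Grace', 15.0)]
```

The application never sees where or how the rows are stored, which is exactly the data independence Codd was after.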

Read more. 

  • SQL vs QUEL
  • What goes around comes around. Michael Stonebraker, Joseph M. Hellerstein.

Practical notes. Read EXPLAIN output and actual execution plans. Reach for dialect features (e.g., T‑SQL, PL/SQL, HQL) when justified; know your engine’s deviations (nulls, time semantics, ANSI vs legacy joins).
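As one concrete way to practice reading plans, SQLite exposes EXPLAIN QUERY PLAN; the sketch below (plan wording varies by SQLite version) shows the optimizer switching from a full scan to an index search once an index exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, ts TEXT)")

# Without an index: the plan reports a full table scan.
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = 42"
).fetchall()
print(plan_before)  # detail column reads e.g. 'SCAN events'

# After adding an index, the optimizer picks an index search instead.
conn.execute("CREATE INDEX idx_events_user ON events(user_id)")
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = 42"
).fetchall()
print(plan_after)  # detail column names idx_events_user
```

The same habit transfers to any engine: change the schema or statistics, re-read the plan, and verify the optimizer did what you expected.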


Period 1 - OLTP (1980s) - Running the Business

Problem. Banks/retail/airlines need fast, correct transactions. A bank’s ATM withdrawal is the canonical OLTP operation.

Mechanism. ACID RDBMS with concurrency control (locks/MVCC), write‑ahead logging, recovery. Short indexed reads/writes; normalized schemas.

Why it worked. The ACID guarantees (Atomicity, Consistency, Isolation, Durability) mean you can trust that balances are correct and orders are not lost.

Limit. These systems are optimized for transactions. They are not built for multi‑terabyte scans or complex historical analytics.

TLDR. OLTP is about NOW. Small, precise updates that must never be wrong.
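The atomicity half of ACID can be sketched with Python's sqlite3 (account names, balances, and the overdraft rule are made up for illustration): either both legs of a transfer commit, or neither does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` atomically; roll back on any failure (e.g. overdraft)."""
    try:
        with conn:  # sqlite3 wraps this block in one transaction
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
            # Enforce a business rule mid-transaction: no negative balances.
            (bal,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)).fetchone()
            if bal < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
        return True
    except ValueError:
        return False  # the partial debit was rolled back

transfer(conn, "alice", "bob", 30)   # succeeds: 70 / 80
transfer(conn, "alice", "bob", 500)  # fails: balances unchanged
print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
# [('alice', 70), ('bob', 80)]
```

The failed transfer leaves no trace: the debit that briefly drove the balance negative is undone by the rollback, which is precisely the "never wrong" guarantee OLTP systems sell.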


Period 2 - the Data Warehouse (1990s) - Analytics

Problem. Executives want integrated, historical analytics without taxing OLTP (running these large queries on OLTP systems would grind them to a halt).

Solution. A separate Data Warehouse for Online Analytical Processing (OLAP) (Inmon’s EDW; Kimball’s dimensional models).

Properties. Still ACID; data arrives via ETL into star/snowflake schemas on MPP systems; it provides a curated single source of truth for Business Intelligence (BI).
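A minimal star-schema sketch in sqlite3 (table names and figures are illustrative): a central fact table of measurements joined to descriptive dimensions, which is the query shape Kimball's dimensional models optimize for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimensions describe; the fact table measures.
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales  (date_key INTEGER, product_key INTEGER, revenue REAL);

    INSERT INTO dim_date VALUES (1, 1998), (2, 1999);
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'music');
    INSERT INTO fact_sales VALUES (1, 1, 100.0), (1, 2, 50.0), (2, 1, 80.0);
""")

# The typical OLAP shape: aggregate the fact, sliced by dimension attributes.
rows = conn.execute("""
    SELECT d.year, p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_date d    ON d.date_key = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY d.year, p.category
    ORDER BY d.year, p.category
""").fetchall()
print(rows)  # [(1998, 'books', 100.0), (1998, 'music', 50.0), (1999, 'books', 80.0)]
```

The star layout keeps every BI question in this one join-and-aggregate pattern, which is what makes the warehouse both fast and predictable.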

Trade‑offs. Strong governance and query performance but rigid onboarding for new data sources; weak fit for un/semi‑structured data and fast schema drift.

TLDR. Two systems: ACID OLTP to run the business, ACID OLAP to understand it.


Period 3 - the Data Lake (2000s) - Volume, variety, velocity

Problem. The rise of "Big Data" from web companies (Google, Yahoo, Facebook) produced massive, messy, and rapidly evolving datasets (logs, clickstreams, social media). Forcing this data through the rigid modeling of a Data Warehouse was too slow and costly.

Solution. The Data Lake on HDFS (later cloud object storage, like S3). Land raw files cheaply. Query with schema‑on‑read engines: Hive, Impala; later Presto/Trino and Spark SQL.

Trade‑offs. Storage/compute decoupled and cheap, but reliability/governance suffered.

TLDR.

  • Schema‑on‑write (Data Warehouse): validate/conform before load → strong guarantees, slower ingestion.
  • Schema‑on‑read (Data Lake): ingest first, interpret at query time → flexible, but fragile.
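The two ingestion styles can be caricatured in a few lines of Python (field names and records are illustrative): schema-on-write rejects a bad record loudly at load time; schema-on-read accepts everything and then fails, or silently misbehaves, at query time.

```python
import json

records = ['{"user": "ada", "amount": 25}',
           '{"user": "grace"}']  # second record is missing a field

# Schema-on-write: validate before storing; bad rows never enter the store.
def load_validated(raw):
    store = []
    for line in raw:
        rec = json.loads(line)
        if "user" not in rec or "amount" not in rec:
            raise ValueError(f"rejected at ingest: {rec}")
        store.append(rec)
    return store

# Schema-on-read: store the raw bytes; interpretation is deferred to queries.
raw_store = list(records)  # ingest always "succeeds"

def total_amount(raw):
    # Every query must now defend against missing fields itself.
    return sum(json.loads(line).get("amount", 0) for line in raw)

try:
    load_validated(records)
except ValueError as e:
    print(e)  # schema-on-write fails fast, at ingest

print(total_amount(raw_store))  # 25 -- the query "works" but silently drops a record
```

The flexibility is real, but so is the fragility: the schema-on-read total is quietly wrong, and nothing in the pipeline says so.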

HDFS/object stores have no native transactions. If a write fails mid‑way, partial files remain; readers may see garbage. 

For example: If a Spark job writing 100 Parquet files fails after writing file #57, those 57 partial files remain. A reader querying that directory has no way to know the job failed.
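This failure mode can be reproduced as a toy with plain files (the directory layout and the injected crash are contrived): the writer dies partway through, and a naive reader happily lists the partial output as if it were the whole table.

```python
import os
import tempfile

table_dir = tempfile.mkdtemp()

def write_job(n_files, fail_at):
    """Write part-files directly into the table directory, crashing after
    `fail_at` files -- as a transactionless batch job would."""
    for i in range(n_files):
        if i == fail_at:
            raise RuntimeError("executor lost")  # simulated mid-job failure
        with open(os.path.join(table_dir, f"part-{i:05d}"), "w") as f:
            f.write(f"data {i}\n")

try:
    write_job(n_files=100, fail_at=57)
except RuntimeError:
    pass  # the job failed, but its partial output was not cleaned up

# A reader that just lists the directory cannot tell the job failed:
visible = sorted(os.listdir(table_dir))
print(len(visible))  # 57 -- partial files, silently served as "the table"
```

Nothing in the storage layer records that a commit was in flight, so the 57 orphan files are indistinguishable from a complete, successful write.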


Hive later added transactions (with constraints and costs), but the storage layer remained non‑transactional.


Period 3.Bis - the Swamp Problem (2010s) - Unreliable lakes

Symptoms. Partial and orphan files, duplicated truths, missing schema, PII drift, small‑file explosions, brittle partitions, unknown lineage.

Root causes. No ACID at storage; concurrent writers; failed batch commits; hard deletes/updates; weak metadata discipline.

TLDR. A lake without transactions and metadata discipline naturally becomes a swamp.


Period 4 - the Data Lakehouse (2020s) - Governance

Core idea. Keep cheap lake storage; add a transaction log + table metadata so readers see a consistent snapshot of which files comprise a table version.

Implementations. Open table formats like Delta Lake, Apache Iceberg, and Apache Hudi achieve this goal using different architectural approaches.

  • Delta Lake: Uses an ordered, append-only transaction log (_delta_log) as the single source of truth. It is deeply integrated with the Apache Spark ecosystem.  
  • Apache Iceberg: Employs a hierarchical, snapshot-based metadata tree (metadata files point to manifest lists, which point to manifest files). It is designed to be engine-agnostic and highly scalable for massive tables.  
  • Apache Hudi: Built around a timeline that tracks all actions. It is differentiated by its support for both Copy-on-Write (read-optimized) and Merge-on-Read (write-optimized) storage, making it ideal for streaming and CDC use cases.  
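The shared mechanism behind all three can be sketched in a few lines of Python (a toy model, not any real format): data files are immutable, and a small, atomically published log entry names the exact set of files that constitute each table version. Readers consult only the latest committed entry, so a crashed writer's orphan files are simply invisible.

```python
import json
import os
import tempfile

table = tempfile.mkdtemp()
log_dir = os.path.join(table, "_log")
os.makedirs(log_dir)

def commit(version, files):
    """Atomically publish a new table version listing its data files."""
    tmp = os.path.join(log_dir, f"{version:08d}.json.tmp")
    with open(tmp, "w") as f:
        json.dump({"version": version, "files": files}, f)
    # os.rename is atomic on POSIX: readers see the entry fully or not at all.
    os.rename(tmp, os.path.join(log_dir, f"{version:08d}.json"))

def snapshot():
    """Return the file list of the latest committed version."""
    entries = sorted(e for e in os.listdir(log_dir) if e.endswith(".json"))
    with open(os.path.join(log_dir, entries[-1])) as f:
        return json.load(f)["files"]

# Version 0: two data files, then a commit that makes them visible.
for name in ("part-0", "part-1"):
    with open(os.path.join(table, name), "w") as f:
        f.write("data")
commit(0, ["part-0", "part-1"])

# A second job writes a file but crashes BEFORE committing version 1.
with open(os.path.join(table, "part-2"), "w") as f:
    f.write("orphan")

print(snapshot())  # ['part-0', 'part-1'] -- the orphan file is invisible
```

This is the essence of the lakehouse fix: the partial-write problem from Period 3 disappears because visibility is governed by the log, not by what happens to sit in the directory.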


TLDR. Reliability of a warehouse, flexibility/cost of a lake → the Lakehouse.


Summary

| Period | Primary Pressure | Dominant Solution | What it fixed | What it broke / exposed |
|---|---|---|---|---|
| 0. 1970s | Decouple apps from physical storage; simplify data access | Relational model + SQL + cost‑based optimization | Data independence; declarative queries; optimizer picks access paths | Needed years of engineering to match navigational DB performance |
| 1. 1980s | Correct, concurrent transactions | OLTP on ACID RDBMS | Reliable ATM‑class updates; isolation/recovery | Bad at historical analytics |
| 2. 1990s | Enterprise analytics without harming OLTP | Data Warehouse (OLAP) on MPP | Curated, ACID analytical store; consistent BI | Rigid schema‑on‑write; high cost; slow onboarding |
| 3. 2000s | Big Data: volume/variety/velocity | Data Lake (HDFS → object storage) + schema‑on‑read | Cheap, elastic storage; flexible ingest; open formats | No storage‑level ACID → partial/orphan files; governance gaps |
| 3.b 2010s | Unreliable lakes; governance debt | - | - | Swamps: small files, duplicate truths, unknown lineage |
| 4. late‑2010s to 2020s | Reliability/governance on the lake | Lakehouse (Delta/Iceberg/Hudi) | ACID over files; snapshots/time travel; stats & layout | Requires compaction/clustering discipline; migration work |
| Org | Centralized teams bottleneck | Data Mesh (org pattern) | Domain‑owned data products; self‑serve platform; federated governance | Socio‑technical change; standards & platform maturity |

Sources & References

[1] E. F. Codd, “A Relational Model of Data for Large Shared Data Banks,” Communications of the ACM, 1970. https://dl.acm.org/doi/10.1145/362384.362685

[2] M. Astrahan et al., “A History and Evaluation of System R,” Communications of the ACM / historical reports. https://dsf.berkeley.edu/cs262/SystemR-annotated.pdf

[3] P. Selinger et al., “Access Path Selection in a Relational DBMS,” SIGMOD, 1979. https://dl.acm.org/doi/10.1145/582095.582099

[4] J. Gray, A. Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann, 1992.

[5] C. Mohan et al., “ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks,” ACM TODS, 1992. https://dl.acm.org/doi/10.1145/146935.146936


Credits

Photo by Giammarco Boscaro on Unsplash

Citation

Please cite this article as :

NNZ. (Sept 2025). From OLTP to Lakehouse: An engineering problem-solution chronicle. NonNeutralZero. https://www.nonneutralzero.com/blog/engineering-7/from-oltp-to-lakehouse-an-engineering-problem-solution-chronicle-24/.

or

@article{nnz2025oltptolakehouse, 
 title   = "From OLTP to Lakehouse: An engineering problem-solution chronicle",
 author  = "NNZ",
 journal = "nonneutralzero.com",
 year    = "2025",
 month   = "sept",
 url     = "https://www.nonneutralzero.com/blog/engineering-7/from-oltp-to-lakehouse-an-engineering-problem-solution-chronicle-24"
}

