Period 0 - Codd's Relational Model (1970s-1980s)
In the 1960s and 1970s, as industries like banking, airlines, and retail became increasingly computerized, they faced a problem: the need for systems that could process a high volume of concurrent transactions quickly and correctly.
The dominant database models of the time were hierarchical and network databases, and they were ill-equipped to deliver the desired flexibility.
These systems were fundamentally navigational. People would write complex application code that traversed physical data pointers and understood the on-disk layout of information. This created a tight coupling between applications and storage layers, making systems brittle and difficult to scale.
Retrieving data, even for simple queries, required specialized technical mastery and significant programming effort.
The theoretical breakthrough that solved this impasse came in 1970 from Edgar F. "Ted" Codd, a mathematician at IBM's San Jose research laboratory. In his seminal paper, A Relational Model of Data for Large Shared Data Banks [1], Codd proposed a radically new approach. The core motivation was data independence: a sharp, clean separation between the logical representation of data (how users and applications view it) and its physical storage and organization.
POC. IBM's System R project (1974–1979) demonstrated that a relational system could achieve production-grade performance at scale:
- SQL as the declarative interface.
- A cost‑based optimizer that selects access paths using statistics.
- Compilation and plan caching to amortize parse/plan overhead.
TLDR. The goal is data independence (logical ≠ physical). Codd's central idea was to represent all data in simple, two-dimensional tables, which he called "relations," and to define relationships between these tables based on common data values, not on machine-level pointers stored within the records.
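To make the idea concrete, here is a minimal sketch using Python's built-in sqlite3 module (the schema and names are illustrative, not from Codd's paper): two relations are connected only by matching values at query time, never by stored pointers.

```python
import sqlite3

# Two plain tables ("relations"); the only link between them is a shared value.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders    (order_id    INTEGER PRIMARY KEY,
                            customer_id INTEGER,   -- a value, not a pointer
                            amount      REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders    VALUES (10, 1, 99.5), (11, 1, 12.0), (12, 2, 30.0);
""")

# The relationship is expressed purely through matching values at query time;
# nothing in the stored records points at a physical location.
rows = con.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.name
""").fetchall()
print(rows)  # e.g. [('Ada', 111.5), ('Grace', 30.0)]
```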
Edgar "Ted" Codd - source: https://www.ibm.com/history/relational-database
Learn more.
- [1] A relational model of data for large shared data banks
- [2] Relational Database: A Practical Foundation for Productivity
Interlude - let's take a quick look at SQL
Why it matters. SQL is the relational model’s executable interface. Its history explains why declarative query + optimizer became the universal contract across engines.
Origins (1973–1979). At IBM, Chamberlin & Boyce design SEQUEL for System R; renamed SQL. System R proves the loop: declarative query → cost‑based plan → compiled execution. SQL covers DDL, DML, and DCL (privileges).
Early products (1979–1995). Oracle ships a commercial SQL DB (1979). IBM follows with SQL/DS (1981) and DB2 (1983). INGRES (QUEL) and Postgres initially use non‑SQL languages but later adopt SQL; Sybase/Microsoft release SQL Server (late 1980s); MySQL appears (1995); PostgreSQL adopts SQL (mid‑1990s).
Why SQL won.
- Declarative intent + optimizer decouples what from how.
- Sound transformations: backed by relational algebra/calculus.
- From ecosystem lock‑in to interoperability: vendors, tools, analysts, and training standardized on SQL; dialects differ, but the core travels.
SQL won as opposed to what. SQL's primary rival was QUEL, developed for the Ingres database at UC Berkeley. Many academic experts and database pioneers considered QUEL a superior language. However, SQL won due to commercial and strategic factors, not just technical merit.
Contrast with Pre-Relational DMLs. SQL's DDL/DML structure also represented a major leap from the languages of earlier network databases like CODASYL. The terms DDL and DML were first introduced with the CODASYL model, but its DML was procedural and navigational: it consisted of verbs embedded in a host programming language (such as COBOL) that required programmers to manually traverse record-by-record pointer chains. SQL's declarative, set-based DML abstracted this complexity away entirely.
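To make the contrast concrete, here is a stylized Python sketch (not real CODASYL or embedded COBOL): navigational access walks pointer chains record by record, while the declarative version only states the desired result.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class OrderRecord:
    """One record in a pointer-linked chain, as a navigational DB would expose it."""
    amount: float
    next_in_chain: Optional["OrderRecord"] = None

def total_navigational(first: Optional[OrderRecord]) -> float:
    # Navigational style: the program itself owns the access path and
    # manually follows physical "next" pointers, record by record.
    total, rec = 0.0, first
    while rec is not None:
        total += rec.amount
        rec = rec.next_in_chain
    return total

chain = OrderRecord(12.0, OrderRecord(99.5))
print(total_navigational(chain))  # 111.5

# Declarative style: state *what* you want; the engine decides *how*.
DECLARATIVE = "SELECT SUM(amount) FROM orders WHERE customer_id = ?"
```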
Read more.
- SQL vs QUEL
- What goes around comes around. Michael Stonebraker, Joseph M. Hellerstein.
Practical notes. Read EXPLAIN output and execution plans. Reach for dialect features (e.g., T‑SQL, PL/SQL, HQL) when justified; know your engine’s deviations (nulls, time semantics, ANSI vs legacy joins).
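As one illustration, SQLite's EXPLAIN QUERY PLAN (driven here from Python's sqlite3 module) shows the optimizer changing access paths when an index appears; other engines expose richer plans under different syntax (EXPLAIN ANALYZE, SHOWPLAN, and so on).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")

# Without an index, the planner's only option is a full table scan.
for row in con.execute("EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 1"):
    print(row)  # detail column reads like: 'SCAN orders'

con.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")

# Same declarative query; the engine now picks a different access path.
for row in con.execute("EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 1"):
    print(row)  # ... 'SEARCH orders USING INDEX idx_orders_customer (customer_id=?)'
```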
Period 1 - OLTP (1980s) - Running the Business
Problem. Banks/retail/airlines need fast, correct transactions. A bank’s ATM withdrawal is the canonical OLTP operation.
Mechanism. ACID RDBMS with concurrency control (locks/MVCC), write‑ahead logging, recovery. Short indexed reads/writes; normalized schemas.
Why it worked. The ACID guarantees (Atomicity, Consistency, Isolation, Durability) mean you can trust that balances are correct and orders are not lost.
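A minimal sketch of the canonical withdrawal, using Python's sqlite3 as a stand-in ACID engine (schema and amounts are illustrative): the transaction either commits in full or rolls back in full.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL NOT NULL)")
con.execute("INSERT INTO accounts VALUES (1, 100.0)")

def withdraw(con: sqlite3.Connection, account_id: int, amount: float) -> None:
    try:
        with con:  # one transaction: commits on success, rolls back on any error
            (balance,) = con.execute(
                "SELECT balance FROM accounts WHERE id = ?", (account_id,)
            ).fetchone()
            if balance < amount:
                raise ValueError("insufficient funds")
            con.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, account_id),
            )
    except ValueError:
        pass  # atomicity: the failed withdrawal leaves no partial update behind

withdraw(con, 1, 30.0)   # succeeds -> balance 70.0
withdraw(con, 1, 500.0)  # fails    -> balance still 70.0
print(con.execute("SELECT balance FROM accounts WHERE id = 1").fetchone())
```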
Limit. These systems are optimized for transactions. They are not built for multi‑terabyte scans or complex historical analytics.
TLDR. OLTP is about NOW. Small, precise updates that must never be wrong.
Period 2 - the Data Warehouse (1990s) - Analytics
Problem. Executives want integrated, historical analytics without taxing OLTP (running these large queries on OLTP systems would grind them to a halt).
Solution. A separate Data Warehouse for Online Analytical Processing (OLAP) (Inmon’s EDW; Kimball’s dimensional models).
Properties. Still ACID; data arrives via ETL into star/snowflake schemas on MPP systems; it provides a curated single source of truth for Business Intelligence (BI).
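A toy star schema in SQLite via Python, just to fix ideas (names are illustrative; real warehouses run this shape on MPP engines): one fact table of measurements surrounded by descriptive dimensions, queried by joining and aggregating.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Dimensions: descriptive attributes to slice by.
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    -- Fact: the measurable events, keyed into the dimensions.
    CREATE TABLE fact_sales  (date_key INTEGER, product_key INTEGER, revenue REAL);

    INSERT INTO dim_date    VALUES (20250101, 2025, 1), (20250201, 2025, 2);
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO fact_sales  VALUES (20250101, 1, 10.0), (20250101, 2, 20.0),
                                   (20250201, 1, 5.0);
""")

# A typical BI question: revenue by month and category.
for row in con.execute("""
    SELECT d.month, p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_date d    ON d.date_key = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY d.month, p.category
"""):
    print(row)
```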
Trade‑offs. Strong governance and query performance but rigid onboarding for new data sources; weak fit for un/semi‑structured data and fast schema drift.
TLDR. Two systems: ACID OLTP to run the business, ACID OLAP to understand it.
Period 3 - the Data Lake (2000s) - Volume, variety, velocity
Problem. The rise of "Big Data" from web companies (Google, Yahoo, Facebook) produced massive, messy, and rapidly evolving datasets (logs, clickstreams, social media). Forcing this data through the rigid modeling of a Data Warehouse was too slow and costly.
Solution. The Data Lake on HDFS (later cloud object storage, like S3). Land raw files cheaply. Query with schema‑on‑read engines: Hive, Impala; later Presto/Trino and Spark SQL.
Trade‑offs. Storage/compute decoupled and cheap, but reliability/governance suffered.
TLDR. Land raw data cheaply now; impose structure when you read it.
Note.
Schema‑on‑write (Data Warehouse): validate/conform before load → strong guarantees, slower ingestion.
Schema‑on‑read (Data Lake): ingest first, interpret at query time → flexible, but fragile.
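A toy contrast in Python (record layout and field names are made up): schema-on-write rejects bad records once at load time, while schema-on-read defers interpretation, so every reader absorbs the breakage.

```python
import json

# Schema-on-write: validate/conform before the record is accepted.
def write_validated(store: list, record: dict) -> None:
    if not isinstance(record.get("user_id"), int):
        raise ValueError("rejected at load time")
    store.append({"user_id": record["user_id"], "event": str(record.get("event", ""))})

warehouse: list = []
write_validated(warehouse, {"user_id": 1, "event": "click"})  # accepted
# write_validated(warehouse, {"user_id": "oops"})  # would fail fast, at load

# Schema-on-read: accept any bytes now; interpret (and maybe fail) at query time.
lake_lines = ['{"user_id": 1, "event": "click"}', '{"user_id": "oops"}', 'not json']

def read_with_schema(lines):
    for line in lines:
        try:
            rec = json.loads(line)
            yield int(rec["user_id"]), rec.get("event")
        except (ValueError, KeyError, TypeError):
            yield None  # the reader, not the writer, pays for the bad record

print(list(read_with_schema(lake_lines)))  # [(1, 'click'), None, None]
```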
HDFS/object stores have no native transactions. If a write fails mid‑way, partial files remain; readers may see garbage.
For example: if a Spark job writing 100 Parquet files fails after writing file #57, those 57 files remain as orphans. A reader querying that directory has no way to know the job failed.
Hive later added transactions (with constraints and costs), but the storage layer remained non‑transactional.
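A toy simulation of the Spark example above in plain Python (paths and names are illustrative): after the crash, the directory listing alone cannot distinguish a finished table from an aborted job.

```python
import os, tempfile

table_dir = tempfile.mkdtemp()

def flaky_job(n_files: int = 100, fail_after: int = 57) -> None:
    # Pretend each file is one Parquet part; the job dies partway through.
    for i in range(1, n_files + 1):
        if i > fail_after:
            raise RuntimeError("executor lost")  # simulated mid-job crash
        with open(os.path.join(table_dir, f"part-{i:05d}.parquet"), "w") as f:
            f.write("rows...")

try:
    flaky_job()
except RuntimeError:
    pass

# Without a transaction log, the orphaned files look like a valid table.
print(len(os.listdir(table_dir)))  # 57 -- and nothing marks the job as failed
```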
Period 3.Bis - the Swamp Problem (2010s) - Unreliable lakes
Symptoms. Partial and orphan files, duplicated truths, missing schema, PII drift, small‑file explosions, brittle partitions, unknown lineage.
Root causes. No ACID at storage; concurrent writers; failed batch commits; hard deletes/updates; weak metadata discipline.
TLDR. A lake without transactions and metadata discipline naturally becomes a swamp.
Period 4 - the Data Lakehouse (2020s) - Governance
Core idea. Keep cheap lake storage; add a transaction log + table metadata so readers see a consistent snapshot of which files comprise a table version.
Implementations. Open table formats like Delta Lake, Apache Iceberg, and Apache Hudi share the same goal but take different architectural approaches:
- Delta Lake: Uses an ordered, append-only transaction log (_delta_log) as the single source of truth. It is deeply integrated with the Apache Spark ecosystem.
- Apache Iceberg: Employs a hierarchical, snapshot-based metadata tree (metadata files point to manifest lists, which point to manifest files). It is designed to be engine-agnostic and highly scalable for massive tables.
- Apache Hudi: Built around a timeline that tracks all actions. It is differentiated by its support for both Copy-on-Write (read-optimized) and Merge-on-Read (write-optimized) storage, making it ideal for streaming and CDC use cases.
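A minimal sketch of the core idea in Python, loosely inspired by Delta's _delta_log but not any real protocol (file layout and names are invented): data files become visible only once a commit record lists them, so a crashed writer leaves every reader's snapshot intact.

```python
import json, os, tempfile

table_dir = tempfile.mkdtemp()
log_dir = os.path.join(table_dir, "_log")
os.makedirs(log_dir)

def commit(version: int, added_files: list) -> None:
    # Write the commit record aside, then rename it into place: the rename is
    # the atomic step, so a commit either fully exists or does not exist.
    tmp = os.path.join(log_dir, f"{version:020d}.json.tmp")
    final = os.path.join(log_dir, f"{version:020d}.json")
    with open(tmp, "w") as f:
        json.dump({"add": added_files}, f)
    os.rename(tmp, final)

def snapshot() -> list:
    # Readers reconstruct "the table" from the log, never from a raw listing.
    files = []
    for entry in sorted(os.listdir(log_dir)):
        if entry.endswith(".json"):
            with open(os.path.join(log_dir, entry)) as f:
                files.extend(json.load(f)["add"])
    return files  # only committed files; orphans are simply invisible

with open(os.path.join(table_dir, "part-00001.parquet"), "w") as f:
    f.write("rows...")      # writer crashes here, before committing

print(snapshot())           # [] -- the uncommitted file is invisible
commit(0, ["part-00001.parquet"])
print(snapshot())           # ['part-00001.parquet']
```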
TLDR. Reliability of a warehouse, flexibility/cost of a lake → the Lakehouse.
Summary
| Period | Primary Pressure | Dominant Solution | What it fixed | What it broke / exposed |
| --- | --- | --- | --- | --- |
| 0. 1970s | Decouple apps from physical storage; simplify data access | Relational model + SQL + cost‑based optimization | Data independence; declarative queries; optimizer picks access paths | Needed years of engineering to match navigational DB performance |
| 1. 1980s | Correct, concurrent transactions | OLTP on ACID RDBMS | Reliable ATM‑class updates; isolation/recovery | Bad at historical analytics |
| 2. 1990s | Enterprise analytics without harming OLTP | Data Warehouse (OLAP) on MPP | Curated, ACID analytical store; consistent BI | Rigid schema‑on‑write; high cost; slow onboarding |
| 3. 2000s | Big Data: volume/variety/velocity | Data Lake (HDFS → object storage) + schema‑on‑read | Cheap, elastic storage; flexible ingest; open formats | No storage‑level ACID → partial/orphan files; governance gaps |
| 3.b 2010s | Unreliable lakes; governance debt | _ | _ | Swamps: small files, duplicate truths, unknown lineage |
| 4. late‑2010s to 2020s | Reliability/governance on the lake | Lakehouse (Delta/Iceberg/Hudi) | ACID over files; snapshots/time travel; stats & layout | Requires compaction/clustering discipline; migration work |
| Org | Centralized teams bottleneck | Data Mesh (org pattern) | Domain‑owned data products; self‑serve platform; federated governance | Socio‑technical change; standards & platform maturity |
Sources & References
[1] E. F. Codd, “A Relational Model of Data for Large Shared Data Banks,” Communications of the ACM, 1970. https://dl.acm.org/doi/10.1145/362384.362685
[2] M. Astrahan et al., “A History and Evaluation of System R,” Communications of the ACM / historical reports. https://dsf.berkeley.edu/cs262/SystemR-annotated.pdf
[3] P. Selinger et al., “Access Path Selection in a Relational DBMS,” SIGMOD, 1979. https://dl.acm.org/doi/10.1145/582095.582099
[4] J. Gray, A. Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann, 1992.
[5] C. Mohan et al., “ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks,” ACM TODS, 1992. https://dl.acm.org/doi/10.1145/146935.146936
Credits
Photo by Giammarco Boscaro on Unsplash
Citation
Please cite this article as:
NNZ. (Sept 2025). From OLTP to Lakehouse: An engineering problem-solution chronicle. NonNeutralZero. https://www.nonneutralzero.com/blog/engineering-7/from-oltp-to-lakehouse-an-engineering-problem-solution-chronicle-24/.
or
From OLTP to Lakehouse: An engineering problem-solution chronicle