The challenge
- A legacy on-premises big-data platform constrained scalability and slowed pipeline development.
- High-volume machine and engineering data demanded elastic compute and incremental processing.
- Logic was duplicated across applications, with limited testing and governance.
Our approach
- Migrated the pipelines to a medallion (bronze / silver / gold) lakehouse on Azure Databricks with Unity Catalog governance.
- Built incremental ingestion with Structured Streaming and Change Data Feed, merging sources into master Delta tables.
- Applied a clean architecture: pure, fully tested business-logic modules wired to readers and writers through a data-getter abstraction.
- Extracted reusable infrastructure into a versioned internal Spark library, published and shared across the team’s applications.
- Hardened delivery with Azure DevOps CI/CD, Databricks Asset Bundles for per-environment jobs, and strict quality gates (linting, typing, 90%+ test coverage).
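The clean-architecture split described above can be illustrated with a minimal sketch in plain Python. It is not the project's actual code: the `Reading` type, the `DataGetter` protocol, the `max_temperature_by_machine` rule, and the `InMemoryGetter` test double are all hypothetical names chosen for illustration. The point is only that the business logic is a pure function, while I/O sits behind an abstraction that can be swapped for a fake in tests.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Reading:
    """Hypothetical record type standing in for a row of machine data."""
    machine_id: str
    temperature_c: float


class DataGetter(Protocol):
    """Hypothetical abstraction: anything that can supply readings."""

    def get_readings(self) -> list[Reading]: ...


def max_temperature_by_machine(readings: list[Reading]) -> dict[str, float]:
    """Pure business logic: no I/O, so it can be tested without a cluster."""
    result: dict[str, float] = {}
    for reading in readings:
        current = result.get(reading.machine_id)
        if current is None or reading.temperature_c > current:
            result[reading.machine_id] = reading.temperature_c
    return result


class InMemoryGetter:
    """Test double; a production getter would read from a table instead."""

    def __init__(self, rows: list[Reading]) -> None:
        self._rows = rows

    def get_readings(self) -> list[Reading]:
        return list(self._rows)


def run_pipeline(getter: DataGetter) -> dict[str, float]:
    """Wiring layer: fetch through the abstraction, then apply the pure logic."""
    return max_temperature_by_machine(getter.get_readings())
```

Because `run_pipeline` depends only on the protocol, a unit test can pass an `InMemoryGetter` and assert on the returned dictionary, which is one way a high coverage target stays practical.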
Architecture
Source systems
- Machine & engineering data
Bronze
- Structured Streaming
- Change Data Feed
Silver
- Business-logic transforms
- Tested modules
Gold
- Master Delta tables
- Merge / CDC
Dashboard
- Engineering analytics
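To make the merge / CDC step concrete, here is a small plain-Python simulation of applying one batch of change rows to a master table. It is a sketch under stated assumptions, not the pipeline's implementation: the real work would run as a Delta merge on Spark, and the `apply_changes` function, the `id` key, and the `rpm` column are hypothetical. The `_change_type` values and the `_commit_version` column follow the Delta Lake Change Data Feed naming.

```python
from typing import Any

Row = dict[str, Any]


def apply_changes(
    master: dict[str, Row], changes: list[Row], key: str = "id"
) -> dict[str, Row]:
    """Return a new master table with a batch of change rows applied."""
    result = dict(master)
    # Apply changes in commit order so that later commits win.
    for change in sorted(changes, key=lambda c: c["_commit_version"]):
        kind = change["_change_type"]
        if kind == "update_preimage":
            # The pre-update image carries the old values; nothing to apply.
            continue
        if kind == "delete":
            result.pop(change[key], None)
        elif kind in ("insert", "update_postimage"):
            # Drop the change-feed metadata columns before upserting.
            row = {k: v for k, v in change.items() if not k.startswith("_")}
            result[change[key]] = row
        else:
            raise ValueError(f"unknown change type: {kind!r}")
    return result
```

Only the keys touched by the batch are rewritten, which is the property that lets incremental processing avoid a full rebuild of the master table.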
Outcomes
- A scalable, governed cloud lakehouse replacing the constrained legacy platform.
- Incremental, CDC-based master tables instead of costly full rebuilds.
- A reusable Spark library that standardizes pipelines across multiple applications.
- Automated, multi-environment deployments held to strict test and code-quality gates.
