DataAxis
ASML
Data Engineering · Data Engineer

Migrating big-data pipelines from on-premise to Azure Databricks

Re-platformed high-volume engineering data pipelines from a legacy on-premise system onto Azure Databricks, using a medallion lakehouse driven by change data capture (CDC) and a shared, versioned Spark library to power an engineering dashboard.

Semiconductors & High-Tech

The challenge

  • A legacy on-premise big-data platform constrained scalability and slowed pipeline development.
  • High-volume machine and engineering data demanded elastic compute and incremental processing.
  • Logic was duplicated across applications, with limited testing and governance.

Our approach

  • Migrated the pipelines to a medallion (bronze / silver / gold) lakehouse on Azure Databricks with Unity Catalog governance.
  • Built incremental ingestion with Structured Streaming and Change Data Feed, merging sources into master Delta tables.
  • Applied a clean architecture: pure, fully tested business-logic modules wired to readers and writers through a data-getter abstraction.
  • Extracted reusable infrastructure into a versioned internal Spark library, published and shared across the team’s applications.
  • Hardened delivery with Azure DevOps CI/CD, Databricks Asset Bundles for per-environment jobs, and strict quality gates (linting, typing, 90%+ test coverage).
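The clean-architecture bullet above can be illustrated with a minimal plain-Python sketch. It is an illustration under stated assumptions, not the project's actual code: the names `Reading`, `DataGetter`, `InMemoryGetter`, `mean_by_machine`, and `run_job` are hypothetical, and plain lists stand in for Spark DataFrames so the example runs without a cluster. The point it shows is that the business logic takes and returns plain data, so it can be tested without any reader or writer attached.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Reading:
    """Hypothetical record type standing in for one row of machine data."""
    machine_id: str
    value: float


class DataGetter(Protocol):
    """Hypothetical data-getter abstraction: anything that can supply readings."""

    def get_readings(self) -> list[Reading]: ...


def mean_by_machine(readings: list[Reading]) -> dict[str, float]:
    """Pure business logic: no I/O, so it can be unit-tested in isolation."""
    grouped: dict[str, list[float]] = {}
    for reading in readings:
        grouped.setdefault(reading.machine_id, []).append(reading.value)
    return {machine: sum(vals) / len(vals) for machine, vals in grouped.items()}


class InMemoryGetter:
    """Test double; a production getter would read from a table instead."""

    def __init__(self, rows: list[Reading]) -> None:
        self._rows = rows

    def get_readings(self) -> list[Reading]:
        return list(self._rows)


def run_job(getter: DataGetter) -> dict[str, float]:
    """Wiring layer: fetch through the abstraction, then apply the pure logic."""
    return mean_by_machine(getter.get_readings())
```

In a test, `run_job(InMemoryGetter([...]))` exercises the full path with no storage involved; swapping in a getter backed by real tables would change only the wiring, not the logic under test.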

Architecture

Source systems

  • Machine & engineering data

Bronze

  • Structured Streaming
  • Change Data Feed

Silver

  • Business-logic transforms
  • Tested modules

Gold

  • Master Delta tables
  • Merge / CDC

Dashboard

  • Engineering analytics

On-premise pipelines migrated to a CDC-driven medallion lakehouse on Azure Databricks (Unity Catalog governance).
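
To make the merge / CDC step more concrete, here is a small plain-Python sketch of how change rows can be folded into a master table. It is a stand-in for illustration only: on Databricks this step would more likely be a Delta `MERGE` over a change feed, and the case study does not describe the actual implementation. The `_change_type` values and the `_commit_version` column follow Delta Lake's Change Data Feed conventions; the function name `apply_change_feed`, the `id` key, and the dict-based table are assumptions.

```python
from typing import Any

Row = dict[str, Any]


def apply_change_feed(
    master: dict[str, Row], changes: list[Row], key: str = "id"
) -> dict[str, Row]:
    """Return a new master table with the change rows applied in commit order."""
    result = dict(master)  # shallow copy so the input table is left untouched
    # sorted() is stable, so rows within one commit keep their original order.
    for change in sorted(changes, key=lambda c: c["_commit_version"]):
        kind = change["_change_type"]
        if kind == "update_preimage":
            continue  # old image of an updated row; the postimage has the new state
        if kind == "delete":
            result.pop(change[key], None)
        elif kind in ("insert", "update_postimage"):
            # Strip change-feed metadata columns (leading underscore) before storing.
            result[change[key]] = {
                col: val for col, val in change.items() if not col.startswith("_")
            }
        else:
            raise ValueError(f"unknown change type: {kind!r}")
    return result
```

Only the rows named in the change feed are touched, which is the property that lets master tables be maintained incrementally rather than rebuilt in full.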

Outcomes

  • A scalable, governed cloud lakehouse replacing the constrained legacy platform.
  • Incremental, CDC-based master tables instead of costly full rebuilds.
  • A reusable Spark library that standardizes pipelines across multiple applications.
  • Multi-environment, automated deployments with high test and code-quality standards.
