ASML — Migrating big-data pipelines from on-premise to Azure Databricks

Background

ASML builds the photolithography machines that the world’s leading chipmakers depend on. Each machine and manufacturing step generates enormous volumes of engineering and sensor data that engineers use to monitor quality and performance. We joined a platform team responsible for the data pipelines behind an internal engineering dashboard. The goal was to move that workload off a constrained legacy on-premise platform and onto a modern, scalable cloud lakehouse without disrupting the engineers who rely on it daily.

The challenge

A legacy on-premise big-data platform constrained scalability and slowed pipeline development as data volumes grew.
High-volume machine and engineering data demanded elastic compute and incremental processing rather than expensive full rebuilds.
Business logic was duplicated across applications, with limited automated testing and inconsistent governance.
The migration had to preserve correctness and continuity for an engineering dashboard already in active use.

Our approach

Designed and migrated the pipelines to a medallion (bronze / silver / gold) lakehouse on Azure Databricks with Unity Catalog governance.
Built incremental ingestion with Structured Streaming and Change Data Feed, merging report and event-log sources into master Delta tables.
Applied a clean architecture: pure, fully tested business-logic modules are wired to readers and writers through a data-getter abstraction, so pipelines never hardcode sources or sinks.
Extracted reusable infrastructure (readers, writers, Spark helpers, logging) into a versioned internal Spark library, published as a wheel and shared across the team’s applications.
Hardened delivery with Azure DevOps CI/CD, Databricks Asset Bundles for per-environment jobs, and strict quality gates: linting, type checking, and at least 90% test coverage on business logic.
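
The data-getter idea in the approach above can be illustrated with a minimal sketch. Everything in it is hypothetical: the names DataGetter, InMemoryReader, InMemoryWriter and clean_events, as well as the dataset names, are assumptions made for illustration and are not the project's actual API. Plain Python lists of dicts stand in for Spark DataFrames so the example runs without a cluster.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol

Row = dict[str, Any]  # stand-in for a DataFrame row (assumption)


class Reader(Protocol):
    def read(self) -> list[Row]: ...


class Writer(Protocol):
    def write(self, rows: list[Row]) -> None: ...


@dataclass
class InMemoryReader:
    """Hypothetical reader used in tests instead of a real table."""
    rows: list[Row]

    def read(self) -> list[Row]:
        return list(self.rows)


@dataclass
class InMemoryWriter:
    """Hypothetical writer that collects output for assertions."""
    written: list[Row] = field(default_factory=list)

    def write(self, rows: list[Row]) -> None:
        self.written.extend(rows)


@dataclass
class DataGetter:
    """Maps logical dataset names to readers and writers."""
    readers: dict[str, Reader]
    writers: dict[str, Writer]

    def get(self, name: str) -> list[Row]:
        return self.readers[name].read()

    def put(self, name: str, rows: list[Row]) -> None:
        self.writers[name].write(rows)


def clean_events(rows: list[Row]) -> list[Row]:
    """Pure business logic: drop rows without a machine id, normalise status."""
    return [
        {**row, "status": str(row["status"]).strip().lower()}
        for row in rows
        if row.get("machine_id") is not None
    ]


def run_pipeline(getter: DataGetter) -> None:
    # The pipeline only knows logical names, never concrete sources or sinks.
    getter.put("events_clean", clean_events(getter.get("events_raw")))
```

In a unit test the getter is built from in-memory readers and writers, as here. In a deployed job the same run_pipeline could be handed readers and writers backed by real tables, leaving clean_events untouched.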

My role on the project

Designing and building the cleaning and event-log processing pipelines end to end.
Implementing the streaming / Change Data Feed logic that incrementally builds master tables.
Generalising reusable components into the shared Spark library and versioning its releases.
Setting up CI/CD, environment promotion (dev / acc / prod), and the automated quality gates.
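
To make the incremental master-table logic concrete, the sketch below applies a batch of change rows to a keyed master table. It is a simplified stand-in in plain Python, not the project's code: the function name apply_change_feed, the "id" key and the sample columns are assumptions. The _change_type and _commit_version columns, and the insert / update_preimage / update_postimage / delete values, follow the Delta Lake Change Data Feed format. On Databricks the equivalent step would more likely be a Delta MERGE run per micro-batch.

```python
from typing import Any, Iterable

Row = dict[str, Any]


def apply_change_feed(
    master: dict[Any, Row],
    changes: Iterable[Row],
    key: str = "id",  # hypothetical primary-key column
) -> dict[Any, Row]:
    """Apply change rows to a keyed master table and return the new state."""
    # Keep only the latest relevant change per key. Pre-images carry the old
    # values and are not needed to build the new state.
    latest: dict[Any, Row] = {}
    for change in changes:
        if change["_change_type"] == "update_preimage":
            continue
        k = change[key]
        if k not in latest or change["_commit_version"] >= latest[k]["_commit_version"]:
            latest[k] = change

    result = dict(master)  # shallow copy so the input table is not mutated
    for k, change in latest.items():
        if change["_change_type"] == "delete":
            result.pop(k, None)
        else:
            # Upsert the row, dropping the change-feed metadata columns.
            result[k] = {c: v for c, v in change.items() if not c.startswith("_")}
    return result
```

Only keys that appear in the change batch are touched, which is what avoids a full rebuild. One simplification to note: when several changes for the same key share a commit version, the sketch keeps whichever arrives last in the batch.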

Architecture

Source systems: machine and engineering data
Bronze: Structured Streaming, Change Data Feed
Silver: business-logic transforms, tested modules
Gold: master Delta tables, merge / CDC
Dashboard: engineering analytics

On-premise pipelines migrated to a CDC-driven medallion lakehouse on Azure Databricks (Unity Catalog governance).

Outcomes

A scalable, governed cloud lakehouse replacing the constrained legacy platform.
Incremental, CDC-based master tables instead of costly full rebuilds.
A reusable Spark library that standardises pipelines across multiple team applications.
Multi-environment, automated deployments with high test and code-quality standards.