Data Engineering · Planned

Real-Time Streaming Pipeline

Kafka + Flink pipeline processing live event data at scale

Kafka · Flink · Java · Docker · PostgreSQL

A streaming data pipeline that ingests, processes, and aggregates real-time event data. It uses Apache Kafka for message brokering and Apache Flink for stateful stream processing with exactly-once state semantics.

What It Solves

Operational event streams need to be processed while the data is still useful. This project is designed around the problems that show up in real systems: late-arriving events, stateful aggregations, replay, and failure recovery.

What I Built

The pipeline ingests events through Kafka, processes them with Flink, and writes aggregates to serving and analytical stores. It includes windowed aggregations, watermarking, and recovery scenarios so the implementation can be tested beyond the happy path.

How It Works

  1. Kafka topics receive live event data from producers.
  2. Flink jobs apply validation, enrichment, windowing, and stateful aggregations.
  3. PostgreSQL stores serving-layer aggregates for dashboard or API access.
  4. A batch-friendly sink preserves processed data for historical analysis.
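
As a rough illustration of steps 2 and 3, the sketch below validates events and sums them per key and tumbling window in plain Java. It is not Flink API code: the `Event` shape, the `key@windowStart` result format, and the 60-second window are assumptions made for the example, and a real job would express the same logic with Flink's keyed windows.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of validation plus tumbling-window aggregation.
// All names here are hypothetical; this is not Flink API code.
class TumblingWindowSketch {

    // Assumed event shape: a key, an event-time timestamp (ms), and an amount.
    record Event(String key, long timestampMs, long amount) {}

    // Validation step: drop events with a missing key or a negative timestamp.
    static boolean isValid(Event e) {
        return e != null && e.key() != null && !e.key().isBlank() && e.timestampMs() >= 0;
    }

    // Start of the tumbling window that contains the given timestamp.
    static long windowStart(long timestampMs, long windowSizeMs) {
        return timestampMs - (timestampMs % windowSizeMs);
    }

    // Sum amounts per (key, window start). Result keys look like "key@windowStart".
    static Map<String, Long> aggregate(List<Event> events, long windowSizeMs) {
        Map<String, Long> sums = new TreeMap<>();
        for (Event e : events) {
            if (!isValid(e)) {
                continue; // invalid events never reach the aggregation
            }
            String windowKey = e.key() + "@" + windowStart(e.timestampMs(), windowSizeMs);
            sums.merge(windowKey, e.amount(), Long::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
            new Event("checkout", 1_000, 2),
            new Event("checkout", 59_000, 3),
            new Event("checkout", 61_000, 5),
            new Event("", 5_000, 99) // rejected by validation
        );
        System.out.println(aggregate(events, 60_000));
    }
}
```

Each entry in the resulting map corresponds to one row that could be written to a PostgreSQL serving table keyed by event key and window start.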

Skills Demonstrated

Data Engineering

  • Streaming pipeline design
  • Event-time processing
  • Windowed aggregations
  • Data validation

Infrastructure

  • Dockerized local services
  • Failure simulation
  • Recovery testing

Backend

  • Java stream processing
  • PostgreSQL serving tables
  • Operational workflow design

What I Learned

  • How stream processors handle late data with watermarks.
  • How to design a pipeline that can recover from service interruptions.
  • How to separate real-time serving needs from historical analysis needs.
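
The watermark idea from the first point can be sketched in a few lines of plain Java. This is a simplified model, not Flink's implementation: the class name, the fixed out-of-orderness bound, and the rule "an event is late once the watermark has passed the end of its window" are assumptions for illustration, and timestamps are assumed to be non-negative.

```java
// Plain-Java model of a bounded-out-of-orderness watermark.
// Hypothetical names; this is not Flink API code.
class WatermarkSketch {
    private final long maxOutOfOrdernessMs;
    private final long windowSizeMs;
    private long maxTimestampSeen = Long.MIN_VALUE;

    WatermarkSketch(long maxOutOfOrdernessMs, long windowSizeMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
        this.windowSizeMs = windowSizeMs;
    }

    // The watermark trails the highest timestamp seen so far by a fixed bound.
    // Before any event arrives there is no watermark yet.
    long currentWatermark() {
        return maxTimestampSeen == Long.MIN_VALUE
            ? Long.MIN_VALUE
            : maxTimestampSeen - maxOutOfOrdernessMs;
    }

    // Returns true if the event can still be added to its window,
    // false if the watermark has already passed the end of that window.
    boolean accept(long eventTimestampMs) {
        long windowEnd = eventTimestampMs - (eventTimestampMs % windowSizeMs) + windowSizeMs;
        boolean late = windowEnd <= currentWatermark();
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestampMs);
        return !late;
    }

    public static void main(String[] args) {
        WatermarkSketch wm = new WatermarkSketch(5_000, 60_000);
        long[] timestamps = {10_000, 62_000, 58_000, 66_000, 59_000};
        for (long ts : timestamps) {
            System.out.println(ts + " accepted=" + wm.accept(ts)
                + " watermark=" + wm.currentWatermark());
        }
    }
}
```

In this model an out-of-order event is still accepted while the watermark sits before the end of its window; once the watermark moves past that point, the event counts as late and would need a separate path such as a side output or a correction job.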

Next Improvements

  • Add a lightweight dashboard on top of the serving aggregates.
  • Add automated tests for replay and duplicate-event scenarios.
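
One possible shape for the replay and duplicate-event tests is shown below. It is a minimal sketch that assumes every event carries a unique `eventId` (an assumption, since the event schema is not defined here) and checks that replaying the same log twice leaves the totals unchanged. The in-memory set stands in for what would be keyed state in a real Flink job, where it would also need an expiry policy to stay bounded.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java sketch of idempotent processing under replay.
// Hypothetical names; this is not Flink API code.
class ReplayDedupSketch {

    // Assumed event shape: a unique ID, an aggregation key, and an amount.
    record Event(String eventId, String key, long amount) {}

    private final Set<String> seenEventIds = new HashSet<>();
    private final Map<String, Long> totals = new HashMap<>();

    // Applies the event once; returns false when the ID was already processed.
    boolean apply(Event e) {
        if (!seenEventIds.add(e.eventId())) {
            return false; // duplicate delivery or replayed event
        }
        totals.merge(e.key(), e.amount(), Long::sum);
        return true;
    }

    // Replays a whole log and reports how many events changed the totals.
    int replay(List<Event> log) {
        int applied = 0;
        for (Event e : log) {
            if (apply(e)) {
                applied++;
            }
        }
        return applied;
    }

    // Read-only snapshot of the current totals.
    Map<String, Long> totals() {
        return Map.copyOf(totals);
    }

    public static void main(String[] args) {
        List<Event> log = List.of(
            new Event("e1", "checkout", 10),
            new Event("e2", "checkout", 5),
            new Event("e1", "checkout", 10) // duplicate of e1
        );
        ReplayDedupSketch job = new ReplayDedupSketch();
        System.out.println("first pass applied:  " + job.replay(log));
        System.out.println("second pass applied: " + job.replay(log));
        System.out.println("totals: " + job.totals());
    }
}
```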