Data Engineering Planned

Real-Time Streaming Pipeline

Kafka + Flink pipeline processing live event data at scale

KafkaFlinkJavaDockerPostgreSQL

A streaming data pipeline that ingests, processes, and aggregates real-time event data. Uses Apache Kafka for message brokering and Apache Flink for stateful stream processing with exactly-once semantics.

Problem

What It Solves

Operational event streams need to be processed while the data is still useful. This project is designed around the problems that show up in real systems: late-arriving events, stateful aggregations, replay, and failure recovery.

Solution

What I Built

The pipeline ingests events through Kafka, processes them with Flink, and writes aggregates to serving and analytical stores. It includes windowed aggregations, watermarking, and recovery scenarios so the implementation can be tested beyond the happy path.

Architecture

How It Works

Kafka topics receive live event data from producers.
Flink jobs apply validation, enrichment, windowing, and stateful aggregations.
PostgreSQL stores serving-layer aggregates for dashboard or API access.
A batch-friendly sink preserves processed data for historical analysis.

Skills

Skills Demonstrated

Data Engineering

Streaming pipeline design
Event-time processing
Windowed aggregations
Data validation

Infrastructure

Dockerized local services
Failure simulation
Recovery testing

Backend

Java stream processing
PostgreSQL serving tables
Operational workflow design

Experience

What I Learned

How stream processors handle late data with watermarks.
How to design a pipeline that can recover from service interruptions.
How to separate real-time serving needs from historical analysis needs.

Roadmap

Next Improvements

Add a lightweight dashboard on top of the serving aggregates.
Add automated tests for replay and duplicate-event scenarios.