Real-Time Streaming Pipeline
Kafka + Flink pipeline processing live event data at scale
A streaming data pipeline that ingests, processes, and aggregates real-time event data. It uses Apache Kafka for message brokering and Apache Flink for stateful stream processing with exactly-once semantics.
Problem
What It Solves
Operational event streams need to be processed while the data is still useful. This project is designed around the problems that show up in real systems: late-arriving events, stateful aggregations, replay, and failure recovery.
Solution
What I Built
The pipeline ingests events through Kafka, processes them with Flink, and writes aggregates to serving and analytical stores. It includes windowed aggregations, watermarking, and recovery scenarios so the implementation can be tested beyond the happy path.
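To make the windowing and watermarking idea concrete, here is a minimal plain-Java sketch of an event-time tumbling window. It is not the project's Flink job: the class `TumblingWindowSketch`, its method names, and the simplified rule "watermark = highest event time seen minus a fixed out-of-orderness bound" are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of event-time tumbling windows. Not Flink code:
// names are hypothetical and the watermark rule is deliberately simplified.
final class TumblingWindowSketch {
    private final long windowSizeMs;
    private final long maxOutOfOrdernessMs;
    // Open windows keyed by window start time, each holding a running event count.
    private final TreeMap<Long, Long> openWindows = new TreeMap<>();
    private long maxEventTimeMs = Long.MIN_VALUE;
    private long lateEventCount = 0;

    TumblingWindowSketch(long windowSizeMs, long maxOutOfOrdernessMs) {
        this.windowSizeMs = windowSizeMs;
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    // Watermark: the sketch assumes no event older than this is still in flight.
    long watermark() {
        return maxEventTimeMs == Long.MIN_VALUE
                ? Long.MIN_VALUE
                : maxEventTimeMs - maxOutOfOrdernessMs;
    }

    long lateEventCount() {
        return lateEventCount;
    }

    // Adds one event and returns (windowStart, count) for each window
    // whose end the watermark has now reached.
    List<Map.Entry<Long, Long>> onEvent(long eventTimeMs) {
        List<Map.Entry<Long, Long>> closed = new ArrayList<>();
        long windowStart = eventTimeMs - Math.floorMod(eventTimeMs, windowSizeMs);

        // The event's window already closed: count it as late and skip it.
        if (windowStart + windowSizeMs <= watermark()) {
            lateEventCount++;
            return closed;
        }

        openWindows.merge(windowStart, 1L, Long::sum);
        maxEventTimeMs = Math.max(maxEventTimeMs, eventTimeMs);

        // Emit every window that ends at or before the new watermark.
        while (!openWindows.isEmpty()
                && openWindows.firstKey() + windowSizeMs <= watermark()) {
            closed.add(openWindows.pollFirstEntry());
        }
        return closed;
    }
}
```

With 10-second windows and a 2-second bound, feeding event times 1000, 9000, 11000, 8000, 12500, 5000 (in that arrival order) counts the out-of-order 8000 event, closes the first window with a count of 3 when 12500 arrives, and treats the final 5000 event as late.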
Architecture
How It Works
- Kafka topics receive live event data from producers.
- Flink jobs apply validation, enrichment, windowing, and stateful aggregations.
- PostgreSQL stores serving-layer aggregates for dashboard or API access.
- A batch-friendly sink preserves processed data for historical analysis.
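The stages listed above can be sketched in plain Java as a validation check, an enrichment lookup, and an idempotent write to the serving layer. Everything here is hypothetical: the source does not describe the event schema, so the `Event` fields, the type-to-category lookup, and the in-memory map standing in for a PostgreSQL table are assumptions. The keyed overwrite is meant to mirror what an `INSERT ... ON CONFLICT ... DO UPDATE` statement would do in PostgreSQL.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for the validate -> enrich -> serve stages.
// Field names, categories, and the in-memory "table" are assumptions.
final class PipelineStagesSketch {

    // Hypothetical event shape; the real schema is not given in the write-up.
    static final class Event {
        final String eventId;
        final String type;
        final long eventTimeMs;

        Event(String eventId, String type, long eventTimeMs) {
            this.eventId = eventId;
            this.type = type;
            this.eventTimeMs = eventTimeMs;
        }
    }

    // Hypothetical enrichment lookup from event type to a reporting category.
    private static final Map<String, String> CATEGORY_BY_TYPE = new HashMap<>();
    static {
        CATEGORY_BY_TYPE.put("page_view", "engagement");
        CATEGORY_BY_TYPE.put("click", "engagement");
        CATEGORY_BY_TYPE.put("purchase", "revenue");
    }

    // Validation: reject events with a missing ID, missing type, or non-positive timestamp.
    static boolean isValid(Event e) {
        return e != null
                && e.eventId != null && !e.eventId.isEmpty()
                && e.type != null && !e.type.isEmpty()
                && e.eventTimeMs > 0;
    }

    // Enrichment: attach a category, falling back to "other" for unknown types.
    static String enrich(Event e) {
        return CATEGORY_BY_TYPE.getOrDefault(e.type, "other");
    }

    // Serving "table" keyed by category and window start.
    private final Map<String, Long> servingTable = new HashMap<>();

    // Upsert: writing the same window aggregate again overwrites the row
    // instead of adding a second one, so a replayed write is harmless.
    void upsertAggregate(String category, long windowStartMs, long count) {
        servingTable.put(category + "|" + windowStartMs, count);
    }

    Long read(String category, long windowStartMs) {
        return servingTable.get(category + "|" + windowStartMs);
    }

    int rowCount() {
        return servingTable.size();
    }
}
```

Keying the serving row by category and window start is one way to keep repeated writes from a restarted job from producing duplicate rows; the actual key design would depend on the dashboard or API queries.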
Skills
Skills Demonstrated
Data Engineering
- Streaming pipeline design
- Event-time processing
- Windowed aggregations
- Data validation
Infrastructure
- Dockerized local services
- Failure simulation
- Recovery testing
Backend
- Java stream processing
- PostgreSQL serving tables
- Operational workflow design
Experience
What I Learned
- How stream processors use watermarks to track event-time progress and decide when late data can still be counted.
- How to design a pipeline that can recover from service interruptions.
- How to separate real-time serving needs from historical analysis needs.
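The recovery lesson can be illustrated with a small plain-Java model: if a snapshot of operator state is stored together with the input offset it corresponds to, a restarted job can restore both and replay the rest of the log, ending with the same result as a run that never failed. This is a simplified sketch of the checkpoint-and-replay idea, not the project's code; `CheckpointReplaySketch`, the list standing in for a Kafka partition, and the per-key count state are assumptions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of checkpoint-and-replay recovery. The List<String> log
// stands in for a replayable input partition; all names are hypothetical.
final class CheckpointReplaySketch {

    // A checkpoint pairs a copy of the state with the offset to resume from.
    static final class Checkpoint {
        final int nextOffset;
        final Map<String, Long> counts;

        Checkpoint(int nextOffset, Map<String, Long> counts) {
            this.nextOffset = nextOffset;
            this.counts = new HashMap<>(counts); // defensive copy of the state
        }
    }

    private Map<String, Long> counts = new HashMap<>();
    private int nextOffset = 0;

    // Consumes records from the current offset up to (not including) endOffset.
    void processUntil(List<String> log, int endOffset) {
        while (nextOffset < endOffset && nextOffset < log.size()) {
            counts.merge(log.get(nextOffset), 1L, Long::sum);
            nextOffset++;
        }
    }

    Checkpoint checkpoint() {
        return new Checkpoint(nextOffset, counts);
    }

    // Restores state and offset together so no record is skipped or counted twice.
    void restore(Checkpoint cp) {
        counts = new HashMap<>(cp.counts);
        nextOffset = cp.nextOffset;
    }

    Map<String, Long> counts() {
        return new HashMap<>(counts);
    }
}
```

A test built on this model would process part of a log, take a checkpoint, process a few more records, then discard that instance to simulate a crash. A fresh instance restored from the checkpoint and run to the end of the log should report the same counts as an uninterrupted run.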
Roadmap
Next Improvements
- Add a lightweight dashboard on top of the serving aggregates.
- Add automated tests for replay and duplicate-event scenarios.
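As a starting point for the duplicate-event tests, the behaviour under test might look like the plain-Java sketch below: an aggregator that remembers event IDs and ignores any ID it has already counted. The class `DuplicateEventSketch` and the assumption that every event carries a unique `eventId` are hypothetical, and the unbounded ID set is only suitable for a test; a long-running job would need to expire old IDs.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical deduplicating counter for use in duplicate-event tests.
// Assumes each event has a unique eventId; the ID set is never pruned here.
final class DuplicateEventSketch {
    private final Set<String> seenEventIds = new HashSet<>();
    private final Map<String, Long> countsByType = new HashMap<>();

    // Returns true if the event was counted, false if its ID was seen before.
    boolean accept(String eventId, String type) {
        if (!seenEventIds.add(eventId)) {
            return false; // duplicate delivery: leave the aggregate unchanged
        }
        countsByType.merge(type, 1L, Long::sum);
        return true;
    }

    long count(String type) {
        return countsByType.getOrDefault(type, 0L);
    }
}
```

A test could deliver the same event twice and assert that the count for its type increases only once.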