Beijing Taxi Monitoring System

Real-time processing of 10,000+ taxis with Apache Flink, Kafka, and Redis

System Overview

Real-time Taxi Monitoring Platform

Processing GPS data from 10,357 taxis across Beijing with Apache Flink

Data Source

Taxi GPS Data Files

Each taxi has a separate file with timestamped GPS coordinates:

1,2008-02-02 15:36:08,116.51172,39.92123
1,2008-02-02 15:46:08,116.51135,39.93883
1,2008-02-02 15:56:08,116.51627,39.91034
...
10,357 individual taxi files

Key Features

Real-time Processing

Processes GPS events with sub-second latency using Apache Flink

Speed Calculation

Computes real-time speed between consecutive GPS points

Geofencing

Monitors taxis within 15km radius of Beijing center

System Architecture

End-to-End Data Pipeline

How GPS data flows through the system in real-time

Data Producer

Python script processing GPS files

Apache Kafka

Distributed event streaming

Apache Flink

Stream processing engine

Redis

In-memory data store

Dashboard

Real-time visualization

Detailed Data Flow

Data Producer

  • Reads 10,357 taxi data files
  • Processes in parallel batches
  • Publishes to Kafka topic
  • Maintains chronological order

Kafka Cluster

  • Topic: taxi-locations
  • High throughput (15k msg/sec)
  • Partitioned by taxi_id
  • Watermark events

Redis Store

  • Hashes for current locations
  • Sorted sets for active taxis
  • Lists for trajectories
  • TTL for auto-cleanup

Dashboard

  • Real-time updates
  • Interactive map
  • Alert notifications

Data Storage

Redis Data Structures

Optimized for low-latency real-time access

Current Locations

Stored as Redis Hashes with TTL (5 minutes)

HSET location:1001
lat "39.915"
lon "116.404"
time "2008-02-02 15:36:08"
speed "45.2"
EXPIRE location:1001 300

Taxi Metrics

Stored in Redis Hashes for fast access

HSET metrics:speed
1001 "45.2"
1002 "32.5"
1003 "28.7"
HSET metrics:distance
1001 "12.5"
1002 "8.3"

Active Taxis

Sorted Set with timestamp as score

ZADD taxi:active
1204567890 "1001"
1204567891 "1002"
1204567892 "1003"
ZREMRANGEBYSCORE
taxi:active -inf (now-300)

RedisSink.java Implementation

Optimizations

  • Pipelined writes for batch operations
  • 5 minute TTL for all location data
  • Asynchronous writer thread with queue
  • Automatic retry on connection failures

Data Structures

location:<taxi_id> - Hash
metrics:speed - Hash
metrics:distance - Hash
taxi:active - Sorted Set
// Process record in Redis pipeline
private void processRecord(Pipeline pipelined, T input) {
long currentTimeSeconds = System.currentTimeMillis() / 1000;
if (input instanceof TaxiLocation) {
TaxiLocation location = (TaxiLocation) input;
String locationKey = "location:" + location.getTaxiId();
// Store current location with expiration
pipelined.hset(locationKey, "lat", String.valueOf(location.getLatitude()));
pipelined.hset(locationKey, "lon", String.valueOf(location.getLongitude()));
pipelined.hset(locationKey, "time", location.getTimestamp());
pipelined.expire(locationKey, LOCATION_TTL_SECONDS);
}
}

Development

Development & Operations Commands

Essential commands for building, deploying and monitoring the system

Build Project Using Dockerized Maven

Build the Java project using a temporary Maven container.

Windows (run from project root):

docker run -it --rm ^
      -v "%cd%\taxi_locations\consumer\taxi_flink:/app" ^
      -w /app ^
      maven:3.8.6-eclipse-temurin-17 ^
      mvn clean package

macOS/Linux (run from project root):

docker run -it --rm \
      -v "$(pwd)/taxi_locations/consumer/taxi_flink:/app" \
      -w /app \
      maven:3.8.6-eclipse-temurin-17 \
      mvn clean package

Visualization

Pseudo Implementation of Real-time Monitoring Dashboard

Interactive visualization of taxi movements and alerts

Beijing Taxi Monitoring

Live 08:00:00

Beijing City Map

System Metrics

Active Taxis 3
Total Distance 1,248 km
Avg Speed 42.5 km/h

Recent Alerts

Speed Alert 08:02:15
Taxi #1001: 82 km/h
Zone Alert 08:01:30
Taxi #1003 left monitored area

Taxi Details

Taxi #1001
Speed
45.2 km/h
Status
In Zone
Last Update
08:00:00
Location
39.915, 116.404

Data Polling

Refresh Interval 5 second
Dashboard polls Redis for updates every second

Trajectory Smoothing

Interpolation Linear Interpolation
Raw GPS points are smoothed for better visualization

Alert Notification

Delivery Latency < 500ms
Alerts appear on dashboard within half a second