UK Real Estate Data Warehouse & OLAP Cube


End-to-end data-engineering pipeline over ~1.3M UK property sales: from raw CSVs to a MySQL star schema, a Mondrian OLAP cube, and a PDF analytical report. Two interchangeable ETL paths (Pentaho + pure SQL), reproducible via Docker.

A complete data-warehousing pipeline around the UK Government’s Price Paid Data, built for the Data Analysis and Integration course at IST.

What I built

Dimensional model

A textbook Kimball star schema with one fact table and three dimensions:

                          ┌─────────────────┐
                          │    dim_time     │
                          └────────┬────────┘
                                   │
 ┌──────────────────┐      ┌───────┴──────────┐      ┌──────────────────┐
 │   dim_location   │──────┤   fact_sales     │──────│   dim_property   │
 └──────────────────┘      └──────────────────┘      └──────────────────┘
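A minimal DDL sketch of this schema (column names and types are illustrative assumptions, not the project's exact definitions):

```sql
-- Hypothetical star-schema DDL; the project's actual columns may differ.
CREATE TABLE dim_time (
  time_key   INT PRIMARY KEY,          -- smart key, e.g. 20230115
  full_date  DATE NOT NULL,
  year       SMALLINT,
  quarter    TINYINT,
  month      TINYINT
);

CREATE TABLE dim_location (
  location_key INT AUTO_INCREMENT PRIMARY KEY,
  postcode     VARCHAR(8),
  town         VARCHAR(64),
  district     VARCHAR(64),
  county       VARCHAR(64)
);

CREATE TABLE dim_property (
  property_key  INT AUTO_INCREMENT PRIMARY KEY,
  property_type CHAR(1),               -- D/S/T/F/O in Price Paid Data
  new_build     BOOLEAN,
  tenure        CHAR(1)                -- F = freehold, L = leasehold
);

CREATE TABLE fact_sales (
  time_key     INT NOT NULL,
  location_key INT NOT NULL,
  property_key INT NOT NULL,
  price        DECIMAL(12,2) NOT NULL,
  FOREIGN KEY (time_key)     REFERENCES dim_time(time_key),
  FOREIGN KEY (location_key) REFERENCES dim_location(location_key),
  FOREIGN KEY (property_key) REFERENCES dim_property(property_key)
);
```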

Two interchangeable ETL paths

                      Pentaho (.ktr)                  Pure SQL
  Tool                Pentaho Data Integration GUI    Plain SQL scripts
  Editable as text    Painful (XML graph)             Trivial
  Execution model     Streaming, row-by-row           Set-based, index-friendly
  CI-friendly         Hard                            Easy (just pipe SQL)

Both populate the same warehouse tables; a sanity-check query proves parity between the source totals and the fact aggregate. The pure-SQL path uses INSERT IGNORE, a recursive CTE for the calendar dimension, and INSERT ... ON DUPLICATE KEY UPDATE for the fact.
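A sketch of those three idioms (staging table and column names are assumptions for illustration):

```sql
-- Calendar dimension via a recursive CTE; INSERT IGNORE makes re-runs no-ops.
-- Needs cte_max_recursion_depth large enough to cover the date range.
INSERT IGNORE INTO dim_time (time_key, full_date, year, month)
WITH RECURSIVE cal (d) AS (
  SELECT DATE '1995-01-01'
  UNION ALL
  SELECT d + INTERVAL 1 DAY FROM cal WHERE d < DATE '2023-12-31'
)
SELECT YEAR(d) * 10000 + MONTH(d) * 100 + DAY(d), d, YEAR(d), MONTH(d)
FROM cal;

-- Idempotent fact load: a second ETL run updates rows instead of duplicating.
INSERT INTO fact_sales (time_key, location_key, property_key, price)
SELECT t.time_key, l.location_key, p.property_key, s.price
FROM staging_sales s
JOIN dim_time     t ON t.full_date     = s.date_of_transfer
JOIN dim_location l ON l.postcode      = s.postcode
JOIN dim_property p ON p.property_type = s.property_type
ON DUPLICATE KEY UPDATE price = VALUES(price);

-- Parity check: source total must equal the fact aggregate.
SELECT (SELECT SUM(price) FROM staging_sales) =
       (SELECT SUM(price) FROM fact_sales) AS totals_match;
```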

OLAP cube and analytics

I built a Mondrian XML cube on top of the warehouse with three hierarchies, one per dimension (time, location, and property).

Two measures: Sales (sum of price) and Transactions (count).

I wrote MDX analyses for Greater London breakdowns, plus a PDF analytical report in Pentaho Report Designer that lists every county with over £1 bn in total sales, illustrated by a top-5 pie chart rendered via a Groovy JFreeChart post-processor.
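A representative MDX query in the style of those analyses (cube, hierarchy, and member names here are hypothetical, not the schema's exact identifiers):

```mdx
-- Sales value and transaction count per Greater London district.
SELECT
  {[Measures].[Sales], [Measures].[Transactions]} ON COLUMNS,
  [Location].[Greater London].Children ON ROWS
FROM [PropertySales]
```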

Reproducibility

The whole pipeline runs under Docker Compose, with MySQL 8 pre-configured for local_infile loading and a raised CTE recursion limit, plus cross-platform bootstrap scripts (PowerShell + Bash). An end-to-end run completes in under a minute.
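The MySQL service can be configured entirely from the compose file; this fragment is a sketch under assumed service names and credentials, not the repository's actual file:

```yaml
# Illustrative docker-compose fragment; actual names/env vars may differ.
services:
  mysql:
    image: mysql:8
    command:
      - --local-infile=1                 # allow LOAD DATA LOCAL INFILE for the CSVs
      - --cte-max-recursion-depth=20000  # calendar CTE spans ~10k days
    environment:
      MYSQL_ROOT_PASSWORD: example
    ports:
      - "3306:3306"
```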