// projects/apple-health-data-lake.md

Apple Health Data Lake

Name: Apple Health Data Lake
Author: Daniel La Corte

For two years, my heart rate, HRV, sleep, and step data sat in the Apple Health app: visible through charts, but not usable for analysis. I wanted to run SQL queries, explore correlations, and build my own aggregates. Not inside another app, but on raw data I fully control.

The missing piece was an iOS app called Health Auto Export. It exposes Apple Health data over a local TCP server on port 9000 using JSON-RPC 2.0, as long as the app is in the foreground and both devices are on the same Wi-Fi network. No cloud, no third-party data sharing.
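A minimal sketch of what one such call could look like from Python. The envelope fields (`jsonrpc`, `method`, `params`, `id`) come from the JSON-RPC 2.0 spec. Everything else here is an assumption for illustration and is not confirmed against the app: the method name `export_metrics`, its parameters, the example IP address, and the newline-terminated framing.

```python
import json
import socket


def build_request(method, params, request_id=1):
    """Serialize a JSON-RPC 2.0 request as newline-terminated UTF-8 bytes."""
    envelope = {"jsonrpc": "2.0", "method": method, "params": params, "id": request_id}
    return (json.dumps(envelope) + "\n").encode("utf-8")


def parse_response(raw):
    """Return the `result` member, or raise if the server sent an `error`."""
    message = json.loads(raw.decode("utf-8"))
    if "error" in message:
        err = message["error"]
        raise RuntimeError(f"JSON-RPC error {err.get('code')}: {err.get('message')}")
    return message["result"]


def call(host, method, params, port=9000, timeout=30.0):
    """Send one request and keep reading until the buffer parses as JSON."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(method, params))
        buffer = b""
        while True:
            chunk = sock.recv(65536)
            if not chunk:
                break  # server closed the connection
            buffer += chunk
            try:
                return parse_response(buffer)
            except ValueError:
                continue  # incomplete JSON or split UTF-8 sequence; read more
    return parse_response(buffer)


if __name__ == "__main__":
    # Hypothetical method name and parameters; the phone's IP is an example.
    data = call("192.168.1.50", "export_metrics", {"days": 730})
    print(type(data))
```

Re-parsing the whole buffer after every chunk is simple but wasteful for a large response; a real client would rely on whatever framing the server actually uses.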

On top of that I built a three-layer pipeline following the Medallion architecture, with Athena as the query layer on top:

Bronze: a Python CLI (health-sync) connects to the iPhone, pulls raw JSON, and uploads it to S3 as-is. The initial pull retrieved 730 days of data in a single request.
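The "as-is" upload can be sketched in a few lines. This is an illustration, not the health-sync code: the key layout, the bucket name, and the helper names `bronze_key` and `upload_raw` are assumptions. The only AWS call is boto3's `put_object`.

```python
from datetime import datetime, timezone


def bronze_key(pulled_at, prefix="bronze"):
    """Build an S3 key from the pull time (hypothetical layout)."""
    stamp = pulled_at.astimezone(timezone.utc)
    return f"{prefix}/{stamp:%Y/%m/%d}/health-{stamp:%Y%m%dT%H%M%SZ}.json"


def upload_raw(s3_client, bucket, payload, pulled_at=None):
    """Store the untouched response bytes; Bronze does no parsing or reshaping."""
    pulled_at = pulled_at or datetime.now(timezone.utc)
    key = bronze_key(pulled_at)
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=payload,
        ContentType="application/json",
    )
    return key
```

With boto3 installed, a call would look like `upload_raw(boto3.client("s3"), "my-health-lake", raw_bytes)`, where the bucket name is a placeholder. Keying by pull time rather than measurement date keeps Bronze a faithful record of what the phone returned and when.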

Silver: a Lambda triggered by S3 events normalizes each Bronze file into flat Parquet rows, one per measurement, partitioned by measurement date. Bulk historical files get split into ~730 daily partitions.
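A sketch of the flattening and date-partitioning step, in plain Python so it runs without dependencies. The payload shape (a `metrics` list whose entries carry `name`, `units`, and a `data` list of `date`/`qty` points) and ISO 8601 timestamps are assumptions for illustration; the real Bronze structure may differ.

```python
from collections import defaultdict
from datetime import datetime


def flatten(payload):
    """Yield one flat row per measurement from the assumed payload shape."""
    for metric in payload.get("metrics", []):
        for point in metric.get("data", []):
            ts = datetime.fromisoformat(point["date"])
            yield {
                "metric": metric["name"],
                "unit": metric.get("units"),
                "measured_at": ts.isoformat(),
                "value": float(point["qty"]),
                # Partition column: the calendar date of the timestamp as given.
                "date": ts.date().isoformat(),
            }


def partition_by_date(rows):
    """Group rows under Hive-style partition keys such as 'date=2024-05-01'."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[f"date={row['date']}"].append(row)
    return dict(partitions)
```

Each list in the result could then be written as one Parquet file under its partition prefix, for example with pyarrow's `Table.from_pylist` and `parquet.write_table`. A two-year bulk file naturally fans out into roughly 730 such partitions.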

Gold: a second Lambda runs daily at 02:00 UTC, reads a Silver partition, and writes daily aggregates (avg, min, max, count per metric) as Gold Parquet.
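The aggregation itself is small. A sketch, assuming Silver rows arrive as dicts with hypothetical `date`, `metric`, and `value` fields:

```python
from collections import defaultdict


def daily_aggregates(rows):
    """Compute avg, min, max, and count per (date, metric) from Silver rows."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["date"], row["metric"])].append(row["value"])
    return [
        {
            "date": date,
            "metric": metric,
            "avg": sum(values) / len(values),
            "min": min(values),
            "max": max(values),
            "count": len(values),
        }
        for (date, metric), values in sorted(grouped.items())
    ]
```

If the daily trigger is an EventBridge schedule, 02:00 UTC corresponds to the expression `cron(0 2 * * ? *)`.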

Athena: AWS Glue registers the schema, and Athena provides SQL access without running a database server.
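A sketch of submitting a query from Python through boto3's `start_query_execution`. The database, table, and column names (`health_gold.daily_metrics`, `dt`, `avg_value`, and so on) and the results bucket are hypothetical placeholders, not the project's actual schema.

```python
# Hypothetical table and column names, for illustration only.
LAST_30_DAYS_HEART_RATE = """
SELECT dt, avg_value, min_value, max_value
FROM health_gold.daily_metrics
WHERE metric = 'heart_rate'
ORDER BY dt DESC
LIMIT 30
"""


def run_query(athena_client, sql, database, output_location):
    """Submit a query to Athena and return its execution id."""
    response = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

With boto3 installed, `run_query(boto3.client("athena"), LAST_30_DAYS_HEART_RATE, "health_gold", "s3://my-athena-results/")` would start the query. Athena runs asynchronously, so results are fetched afterwards using the returned id.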

The entire system (CLI, two Lambdas, CDK infrastructure stack) is roughly 500 lines of Python. The hardest part was not AWS, but the quirks of the iPhone’s TCP server: which metric names it accepts, how the data is structured, and how to pull two years in a single connection.

github.com/dlacorte/apple-health-data-lake