We were building Reblaze (started 2011), a cloud WAF / DDoS-mitigation platform. Every HTTP request—good, bad, or ugly—had to be stored for offline anomaly-detection and clustering.
Traffic profile
- Baseline: ≈ 15 B requests/day
- Under attack: the same 15 B can arrive in 2-3 hours
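To put those figures in per-second terms, here is a quick back-of-the-envelope calculation. The helper name `rate_per_sec` is made up for illustration; the inputs are the numbers quoted above, using integer division.

```shell
# Average requests per second for a given request count and window (seconds).
# rate_per_sec is a hypothetical helper, not part of any tool.
rate_per_sec() {
  echo $(( $1 / $2 ))
}

REQUESTS=15000000000                      # 15 B requests

rate_per_sec "$REQUESTS" 86400            # baseline, 24 h  -> 173611
rate_per_sec "$REQUESTS" $(( 3 * 3600 ))  # attack, 3 h     -> 1388888
rate_per_sec "$REQUESTS" $(( 2 * 3600 ))  # attack, 2 h     -> 2083333
```

So an attack means roughly 8x to 12x the baseline ingest rate, sustained for hours.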
Why BigQuery (even in alpha)?
It was the only thing that could swallow that firehose and stay queryable minutes later, which is crucial when you’re under attack and your data source must not melt down.
Pipeline (all shell + cron)
Edge nodes → write JSON logs locally; a local cron job pushes them to Cloud Storage
Tiny VM with a cron loop
- Scans `pending/`, composes many small blobs into one “max-size” blob in `processing/`.
- Executes `bq load …` into the customer’s isolated dataset.
- On success, moves the blob to `done/`; on failure, drops it back to `pending/`.
Downstream ML/alerting pulls straight from BigQuery
That handful of `gsutil`, `bq`, and `mv` commands moved multiple petabytes a week without losing a byte. Later pipelines—Dataflow, Logstash, etc.—never matched its throughput or reliability.
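For a sense of scale, one pass of the loader VM's loop could look roughly like the sketch below. It is a reconstruction from the description, not the original script. The bucket, dataset, and table names are placeholders, and it assumes the target table already exists, that object names contain no whitespace, and that the small source blobs are deleted once composed (the description does not say how they were retired).

```shell
#!/bin/sh
# Placeholder names (assumptions, not from the original setup).
BUCKET="${BUCKET:-gs://example-logs}"
DATASET="${DATASET:-customer_a}"
TABLE="${TABLE:-requests}"

# One pass: compose pending blobs, load into BigQuery, file the result.
process_batch() {
  batch="batch-$(date +%s)-$$.json"
  staged="$BUCKET/processing/$batch"

  # gsutil compose accepts at most 32 source objects per call.
  parts=$(gsutil ls "$BUCKET/pending/*.json" 2>/dev/null | head -n 32)
  [ -n "$parts" ] || return 0             # nothing pending

  # Merge the small blobs into one object, then drop the originals so
  # the next pass does not pick them up again (assumed behaviour).
  # $parts is deliberately unquoted: one argument per object URL.
  gsutil compose $parts "$staged" || return 1
  gsutil rm $parts || return 1

  if bq load --source_format=NEWLINE_DELIMITED_JSON \
       "$DATASET.$TABLE" "$staged"; then
    gsutil mv "$staged" "$BUCKET/done/$batch"
  else
    # Failed load: put the blob back in pending/ for a later retry.
    gsutil mv "$staged" "$BUCKET/pending/$batch"
    return 1
  fi
}
```

Calling `process_batch` from cron (or from a `while` loop with a short `sleep`) gives the retry behaviour for free: a blob that failed to load is simply back in `pending/` on the next pass.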