
> If you can make it support the autoscaling Spark jobs, the jupyter-hub env the analysts use, and all our gRPC APIs, and have everything mTLS'd together, and have all of the DNS + cert + routing magic with similar setup and maintenance effort, I'll convert.

Let's define what you're talking about here and I can show you the way.

What is the difficulty in autoscaling Spark jobs? What Jupyter env do they use? You say gRPC API, but that's generic: what do those services really do? Sync request/response backed by a DB? Async processing? What infra do they need to "work"?

Where are you running your workloads? Cloud? Which one?



> What is the difficulty in autoscaling Spark jobs?

I mean, running Spark is an awful experience at the best of times, but let's just go from there.

Spark drivers pull messages off Kafka and scale executors up or down dynamically based on how much load they have coming through. That means you need stable host names, ideally without manual intervention. The drivers and the executors should also use workload-specific roles; we use IRSA for this, and it works quite nicely.

Multiple Spark clusters need to run oblivious to each other, and they shouldn't "tread on each other", so the provisioning topology should avoid scheduling containers from the same job (or competing jobs) onto the same node. Similarly, a given cluster should ideally sit within a single AZ to minimise latency. It doesn't matter which AZ, but it shouldn't be hardcoded: the scheduler (i.e. K8s) should choose based on available compute. Some of the jobs load models, so they need an EBS volume attached as scratch space.
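
Roughly, the scheduling half of that maps onto a spark-on-k8s-operator SparkApplication like the sketch below. This is illustrative, not our actual config: the job name, service account, and executor counts are made up, and the label keys assume what Spark and the operator attach to pods.

    # Hypothetical SparkApplication; names and limits are placeholders.
    apiVersion: sparkoperator.k8s.io/v1beta2
    kind: SparkApplication
    metadata:
      name: kafka-scorer
    spec:
      dynamicAllocation:
        enabled: true                  # executors come and go with pending work
        minExecutors: 2
        maxExecutors: 40
      driver:
        serviceAccount: kafka-scorer   # annotated with an IAM role ARN for IRSA
      executor:
        affinity:
          podAffinity:                 # keep the whole job in one AZ; the scheduler picks which
            requiredDuringSchedulingIgnoredDuringExecution:
              - topologyKey: topology.kubernetes.io/zone
                labelSelector:
                  matchLabels:
                    sparkoperator.k8s.io/app-name: kafka-scorer
          podAntiAffinity:             # no two executors (ours or a competing job's) on one node
            requiredDuringSchedulingIgnoredDuringExecution:
              - topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    spark-role: executor

The stable host names come from the driver's headless service, and the zone affinity keys off wherever the driver happened to land, so nothing is pinned to a specific AZ.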

> What Jupyter env do they use?

We use JupyterHub wired up to our identity provider. An analyst logs on, a pod is scheduled somewhere in the cluster with available compute, and their unique EBS volume is automounted. If they go to lunch or sit idle, state is saved, the pod is scaled down, and the EBS volume is automatically detached.
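
With the zero-to-jupyterhub Helm chart, that setup is mostly chart values; a minimal sketch, with placeholder hostnames and IdP endpoints rather than our real ones:

    # Hypothetical values.yaml excerpt for the zero-to-jupyterhub chart.
    hub:
      config:
        JupyterHub:
          authenticator_class: generic-oauth    # wired to the identity provider
        GenericOAuthenticator:
          client_id: jupyterhub                 # placeholder
          oauth_callback_url: https://hub.example.com/hub/oauth_callback
          authorize_url: https://idp.example.com/oauth2/authorize
          token_url: https://idp.example.com/oauth2/token
    singleuser:
      storage:
        type: dynamic        # one PVC per user -> an EBS volume via the EBS CSI driver
        capacity: 20Gi
        dynamic:
          storageClass: gp3
    cull:
      enabled: true          # idle servers get culled...
      timeout: 3600          # ...after an hour; the pod goes, the volume just detaches

The per-user PVC is what gives the "go to lunch, come back, everything's still there" behaviour: culling kills the pod, the EBS volume detaches, and it reattaches on the next spawn.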

> That's generic: what do those services really do? Sync request/response backed by a DB? Async processing? What infra do they need to "work"?

The API stuff is by far the easiest: request/response, backed by DBs, plus the odd analytics or monitoring tool. The servers autoscale horizontally, and some of them call other services hosted within the same cluster. All east-west traffic inside the cluster is secured with mTLS via Linkerd, and between that and our automatic metric collection we get metrics, latencies, etc. like what you get with the AWS dashboards, but with more detail (memory usage, for one), plus automatic log collection.
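
One of those services is a pretty vanilla Deployment/HPA pair; a minimal sketch assuming Linkerd's proxy injection and plain CPU-based scaling (the service name, image, and thresholds are invented):

    # Hypothetical gRPC service; the linkerd.io/inject annotation is the whole mTLS story.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: orders-api
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: orders-api
      template:
        metadata:
          labels:
            app: orders-api
          annotations:
            linkerd.io/inject: enabled    # sidecar: mTLS east-west, plus latency/RPS metrics for free
        spec:
          containers:
            - name: server
              image: registry.example.com/orders-api:v1    # placeholder
              ports:
                - containerPort: 8080
    ---
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: orders-api
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: orders-api
      minReplicas: 2
      maxReplicas: 20
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70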

> Where are you running your workloads?

AWS, but with minimal contact. S3 is basically a commodity API nowadays; the only tight integration is IRSA, and I believe GCP has a very similar mechanism (Workload Identity). So most of this should work on any of the other clouds with minimal alteration.
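
And that one tight integration is literally a single annotation on a service account; the GKE version looks near-identical. Both values below are placeholders:

    # Hypothetical ServiceAccount; the annotation is the only cloud-specific line.
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: spark-jobs
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/spark-jobs    # EKS / IRSA
        # GKE (Workload Identity) equivalent:
        # iam.gke.io/gcp-service-account: spark-jobs@my-project.iam.gserviceaccount.com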



