These would be the exact details I'd need to advise you.

It depends on how many jobs you have, how much churn there is, and what exactly you want to monitor. If jobs are short-lived, tracking individual jobs would be unwise; something like the ELK stack, which is intended for event logging, would be a better fit. If jobs are long-lived and there aren't many of them, you should be okay. Otherwise, you'd be looking at tracking system-level rather than per-job stats.

To give a very rough idea: if you can keep it below, say, 10M metrics across the entire history held by a single Prometheus server, that should be okay with the current implementation.
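As a back-of-the-envelope illustration of why churn matters, here is a minimal Python sketch. It assumes every job carries its own label value, so each newly started job creates a fresh set of series, and that those series count against the server for the whole retention window. The function names, parameters, and example figures are hypothetical, and the 10M threshold is only the rough figure above, not a hard limit.

```python
# Rough, hypothetical estimate of how many distinct series a single
# Prometheus server accumulates over its retention window.
# All names and parameters are illustrative assumptions, not Prometheus settings.

ROUGH_BUDGET = 10_000_000  # the "say 10M" rule of thumb, not a hard limit


def series_over_history(new_jobs_per_day: int,
                        series_per_job: int,
                        retention_days: int,
                        long_lived_jobs: int = 0) -> int:
    """Count distinct series, assuming each job gets its own label value.

    Every newly started job creates a fresh set of series, so churn
    accumulates for as long as the data is retained. Long-lived jobs
    contribute a fixed set of series regardless of retention.
    """
    churned = new_jobs_per_day * retention_days * series_per_job
    steady = long_lived_jobs * series_per_job
    return churned + steady


def fits_budget(total_series: int, budget: int = ROUGH_BUDGET) -> bool:
    """True if the estimate stays within the rough budget."""
    return total_series <= budget


if __name__ == "__main__":
    # Many short-lived jobs: 5,000 jobs/day, 100 series each, 15 days retained.
    short = series_over_history(5_000, 100, 15)
    print(short, fits_budget(short))    # 7500000 True
    # The same churn retained for 30 days doubles the count.
    longer = series_over_history(5_000, 100, 30)
    print(longer, fits_budget(longer))  # 15000000 False
    # A few hundred long-lived jobs barely register.
    stable = series_over_history(0, 100, 30, long_lived_jobs=300)
    print(stable, fits_budget(stable))  # 30000 True
```

The takeaway is that per-job series from short-lived jobs grow with both the churn rate and the retention period, whereas a fixed set of long-lived or system-level series does not.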