Data scientists shouldn’t need to know Kubernetes

antman · on Sept 25, 2021

There are people mostly with an IT background who think that for data science you don’t need to know math and just monkey see monkey do sutoml based on atutorial, inspirational MOOCs and libraries that appeared magically out of thin air.

There are people with a math background who think data science is just an extension of statistics, so business, knowledge of scalable information storages, and productization is irrelevant.

There are both kind of posts here on HN. My take has been to hire math people with some cs msc, cs people with datascience msc, and business people that also know sales.

For me that has worked painlessly but your milage may vary. I haven’t seen that black swan CV capable in all three disciplines, but I have seen CVs that seem to think that they can tackle every problem because they have read all towardsds and kaggle tutorials. Marginalization? Kubeflow? POV?, 2 out of 3 are usually foreign concepts.

urthor · on Sept 25, 2021

It's mostly work ethic I find.

I've met quite a lot of Black Swans, and been employed alongside precisely zero.

I know one hard science PhD who runs their own K8s cluster at home and plays with Linux distros.

They describe themselves as "a statistician who can program."

Generally speaking it's more common for them to come from the math side of the fence. From the IT side I'll say the math is a bit harder than the computer stuff.

It's genuinely, 100% work ethic.

musingsole · on Sept 25, 2021

> I know one hard science PhD who runs their own K8s cluster at home and plays with Linux distros.

That's super awesome for that data scientist, but the question for a business is can/should you structure yourself in such a way that you NEED employees with that cornercase level of joint expertise.

The answer is you really can't. Individuals have awesome strengths that they developed for reasons particular to them. Use those strengths when you can. But the business has to rely on a common denominator of a role or else it'll never fill it when their unicorn leaves to go backpacking in Europe.

urthor · on Sept 25, 2021

Agree. You need to structure your talent pipeline, and organization, based on the average level of talent you can likely receive at your compensation bracket. You cannot create a single point of dependency on an employee who you'll never be able to replace for the same amount of money.

However, the issue is that productivity is logarithmic.

The unfortunate truth the school of hard knocks has shown me is that someone without the "roll your sleeves up" attitude to learn Docker is generally speaking just not going to be that effective when push comes to shove.

Now if you're using tools to abstract the time of data scientists who are CAPABLE of learning Docker, that is a different story.

But someone who starts grumbling about having to learn the command line to containerize their pipeline is generally speaking on the west side of the Pareto principle.

I can only guess at which particular area they will trip up, but it'll be somewhere.

antman · on Sept 25, 2021

Agree thar work ethic is the most important thing since complicated qualitative things cannot be measured, trust precedes everything. But work ethic does not complete the puzzle because people dont know always what they dont know.

For the example you mentioned, I will use a simplification I make to explain levels of expertise of challenging knowledge: 1.ABOUT: Know about something (heard it, know some examples) 2. KNOW: Know that something well (I now understand it and can leverage it towards an end to end a useful thing, also know its weaknesses) 3. HUMBLE: Realize I did not know many things about it but now know many ways of using it, can correct and extend other people's work, most of the time. 4. EXPERT: Know why it was structured that way. Contribute to the knowledge/tool itself.

So for that PhD an initial estimate would be a 3 or 4 scale on the math level, 1 or 2 on the kubernetes level (don't know him ofcourse I can be wrong without first discussing). If he works independently level 2 kubernetes is pretty great. If he needs to be part of a larger support team, a level 3 knowledge based on my (admittedly back of the napkin and ambiguous) categorization might prove to be less risky.

ozim · on Sept 25, 2021

And what makes you think that when presented with a problem those people who can grasp 2 concepts cannot get the 3rd one.

Was it shooting questions from the hip on the spot while interviewing them?

Or you hired 10 people and worked with them for at least 6 months to really know what they are capable of?

I think the former because no one has enough budget to hire people stick with them for 6 months just to see how they fare.

So what is your N to back up your claim?

Because it sounds like you really have something to say.

antman · on Sept 25, 2021

Smart and creative peope can grasp a lot of things but not everything is pure thought. Experience and experimentation time is required and there are only 24 hours in a day. Also the ds field has a lot of young people that did not have that much time or opportunity yet.

My N is a few hundred, not all my personal hires. I have visibility because now I do project management office duties (build sub teams per project), lead most of the interviews on the ds side, internal technical consulting duties. Ten you mentioned is my target number for hires previous and next week approx.

My claim is based on experience from the academic and the consulting space for global corp (which included consulting for other corps to build their ds teams, rarely though). I hope my claim appears logical and is useful.

ozim · on Sept 27, 2021

Few hundreds seems quite fair to have a good opinion.

_6pvr · on Sept 25, 2021

Most people involved in tech, including most devs, shouldn't need to know/care about Kubernetes. The reason anyone thinks otherwise is the massive amounts of marketing money vested parties have pumped into sales (read: DevRel/Dev Evangelism, dev influencers).

throwaway894345 · on Sept 25, 2021

The objective is to minimize how much devs need to know. There are a couple of ways to do that. The first is to pull out the traditional ops skill set into a traditional ops team so the app devs throw code over the fence to the ops team to operate, and hilarity ensues because the ops team is measured on uptime but they can’t affect the first order causes of downtime (code issues), so instead they try to make it harder to ship changes frequently which slows the business.

The other solution is that devs operate the apps themselves. This is infeasible with a traditional VM setup because managing VMs effectively involves tons of specialist knowledge and it’s unreasonable to expect dev teams to master it while also being expert developers.

Enter Kubernetes. Now you have a core DevOps/SRE team managing the “platform” (the Kubernetes cluster and various add-ons such as operators) which gives the application developers a high-level interface for operating their applications. They need to know a bit of Kubernetes, but it’s a whole lot less than mastering the traditional VM-based ops skill set. Moreover, as the Kubernetes ecosystem continues to mature, the surface area with which developers interact becomes smaller.

perfectspiral · on Sept 25, 2021

"Enter Kubernetes. Now you have a core DevOps/SRE team managing the “platform” (the Kubernetes cluster and various add-ons such as operators) which gives the application developers a high-level interface for operating their applications."

I've personally changed my opinion on this in the last ~2 years ... observing at work what it takes for people to stand up and manage a Kubernetes platform it really just feels like incredible waste that we have hundreds and thousands of SREs across our industry all building their own unique compute platforms when the public cloud vendors have already done that work.

The Serverless paradigm just seems fundamentally superior to me, but it also seems like it inherently requires vendors to be very opinionated in order to provide a good developer experience with it which AWS is not and aren't ... at least not yet anyway.

throwaway894345 · on Sept 25, 2021

> I've personally changed my opinion on this in the last ~2 years ... observing at work what it takes for people to stand up and manage a Kubernetes platform it really just feels like incredible waste that we have hundreds and thousands of SREs across our industry all building their own unique compute platforms when the public cloud vendors have already done that work.

I wasn’t suggesting standing up K8s from scratch, but rather extending GKE or EKS or similar with things like cert-manager snd external DNS.

I was skeptical coming from a shop that was deeply invested in AWS and serverless, but Kubernetes has a lot less friction and the abstractions can be pretty high level. For example, we can create a service with HTTPS, fully managed certificates, reverse proxy, and DNS just by creating an ingress resource for that service. It’s a lot nicer than cobbling together ACM, Route 53, API Gateway, etc (even though I have plenty of experience with the latter). A lot of this is possible because Kubernetes is extensible and there’s a big ecosystem for it. AWS isn’t (particularly) extensible, so you end up depending on them to support your use case. When you have a competent SRE team managing your platform and providing high level abstractions, Kubernetes kind of feels like what serverless promised to be—much more so than AWS’s serverless offerings (and I still like AWS!).

perfectspiral · on Sept 25, 2021

Ye to be clear, I meant even building on top of GKE/EKS it seems like a lot of work configuring things, testing etc ... at least from the outside looking in.

throwaway894345 · on Sept 25, 2021

I think it still is a fair bit of work, but it’s something that an SRE team can manage and hand off to users a higher level abstraction than what could be provided atop proprietary cloud APIs (there’s not really a good way to abstract over things in AWS-land because AWS isn’t really extensible).

So you actually get some nice separation between folks who manage the Kubernetes-based platform and the developers who interact with high-level Kubernetes resources. The salient point is that developers aren’t the ones doing all of that work and they don’t need to coordinate with the platform team on any regular basis.

urthor · on Sept 25, 2021

Managed Kubernetes is the serverless paradigm, done right I find.

Kubernetes is hard because it needs to be hard, stateful machines are stateful because they have to be stateful.

Managed Kubernetes is the compromise between vendors desire for vendor lock-in, and customers wanting a standardized interface for serverless applications.

zapita · on Sept 25, 2021

You don’t need Kubernetes to implement an embedded SRE model or an internal platform. You’re describing a good organizational model but making the mistake of crediting a tool for it.

throwaway894345 · on Sept 26, 2021

I’m not saying that Kubernetes is required but you need something to empower devs. Rolling their own VMs is too much work and too easy to mess up.

alexchamberlain · on Sept 25, 2021

Not sure I agree to be honest. I don't think most developers should know how to run K8s, but I think most developers should know how to run their code on K8s. These guys aren't idiots - putting abstractions and guide rails in the way is just patronising.

That's not to say everyone has to be an expert either - there's a place for experts to optimise setups etc too.

_6pvr · on Sept 25, 2021

> Not sure I agree to be honest. I don't think most developers should know how to run K8s, but I think most developers should know how to run their code on K8s.

This is silly. Most devs have too many other things they know they don't know, to also add on something like kubernetes.

marvelous · on Sept 25, 2021

IMO, it's not silly at all. Most devs have to know the commands and configuration to do rolling deployments on the target infrastructure, fetch logs, how the readiness protocoll integrates with automatic restarts, ingress, etc. With k8s, all this is standard and transferable. With ad-hoc simpler solutions, this is all per-team tribal knowledge, and in my experience it's not even simpler to use for us devs.

_6pvr · on Sept 25, 2021

> Most devs have to know the commands and configuration to do rolling deployments on the target infrastructure, fetch logs, how the readiness protocoll integrates with automatic restarts, ingress, etc.

Is this serious? You think _most devs_, meaning a group that includes FE devs, mobile app devs, IoT, open source, DBAs, security engineers, need to know these things?

> With k8s, all this is standard and transferable. With ad-hoc simpler solutions, this is all per-team tribal knowledge, and in my experience it's not even simpler to use for us devs.

Most teams do not have to manage most/all of the things you're describing.

This really feels like more K8s marketing disguised as a HN post.

marvelous · on Sept 25, 2021

Yes, it's serious for _most devs_ who have code that could run on k8s: back-end devs. And _most_ devs who touch anything in FE, mobile apps, IoT, DBAs also have to touch the corresponding back-end and its associated platform tooling where k8s is a bliss compared to all the team-specific stuff we encounter in the back-end. Now I agree that k8s is a nightmare for the infra people who run it, but honestly it is insanely comfy for the (back-end) devs who need to use it.

_6pvr · on Sept 26, 2021

> Yes, it's serious for _most devs_ who have code that could run on k8s: back-end devs

What an impactful little nuance to leave out of all previous conversation :)

> And _most_ devs who touch anything in FE, mobile apps, IoT, DBAs also have to touch the corresponding back-end

I do not agree with this statement at all. If your org is large enough to use K8s at scale, your mobile devs aren't touching your infra. Being aware that infra exists is not the same as modifying and managing infra.

> Now I agree that k8s is a nightmare for the infra people who run it, but honestly it is insanely comfy for the (back-end) devs who need to use it.

If your BE devs involvement in K8s is cloning a repository that may or may not contain a directory of k8s config that they never open, yes.

tluyben2 · on Sept 25, 2021

Absolutely. It is a timesink and really not very valuable for most devs: they will not ever use it themselves anyway and there is too much to learn while the normal dev stuff already has that as well. In bigger (only marginally bigger than a one person shop) companies you have admins/devops and they don't want you to touch any of it anyway.

strzibny · on Sept 25, 2021

I completely agree with you. And if you want backend engineers to know more about ops, sure. But let them learn the groundwork, not forced them into K8s. As for data scientists needing K8s knowledge, that's ridiculous to me.

tomrod · on Sept 25, 2021

Data scientist here with very recent learning on K8s space. Exposure and general conceptual understanding is extremely helpful to have to assist in design of solutions. However, agreed that expecting me to maintain or lead the ownership of a K8s standup is outside the wheelhouse.

danjac · on Sept 25, 2021

They shouldn't have to, true. But enough companies have bought the Coolaid that it's a job requirement and you'll have to learn it anyway, which means developers will try and shoehorn it into their projects whether it makes sense or not so they can have it on their resume and then companies will need to make it a job requirement when those developers leave and they need to maintain it...

throwaway894345 · on Sept 25, 2021

What’s the alternative? Devs master a VM/Ops skill set (strictly more work)? Or devs throw code over the wall to an ops team (and progress grinds to a trickle)? https://news.ycombinator.com/item?id=28652561

scrose · on Sept 25, 2021

Prefacing this with the fact that I’ve only worked at smaller startups(<500 people)

Arguably, most of these places do not need dedicated ops teams, nor do they need to host and manage their own infrastructure, yet they do.

The most productive startup I worked at used Heroku to bootstrap many of their applications and we didn’t need a single ops person. People were able to switch between teams and follow the same short and standardized process to build and deploy code. They didn’t need to ‘master’ any specialized ops skills and there was typically someone on each team who could quickly debug failing deploys.

The least efficient startup I worked at insisted on hosting all their own infra because managed solutions like Heroku were ‘too expensive’. Except we ended up with multi-month long infrastructure rollouts, process additions, changes and infra upgrades that likely cost many orders of magnitude more than managed solutions to implement, with less features than we’d get out of the box with a managed service like Heroku. We also had nowhere near the scale necessary, or headcount, for it to be worth it to self-manage.

I’m typically the guy who works on the backend but also gets called in for ops and infrastructure work, and at least for smaller companies that aren’t dealing with hundreds of millions of requests per day, I think the managed route makes way more sense, even if you feel like you’re overspending on infrastructure.

throwaway894345 · on Sept 25, 2021

I was comparing Kubernetes to self-managed VMs, not Heroku. Heroku absolutely makes a lot of sense in many cases (small teams, simple use cases, etc).

catlifeonmars · on Sept 25, 2021

Managed platforms. Take Shopify for instance. It’s a platform that allows individuals with very little programming knowledge to build, ship, and operate online retail services, but doesn’t suffer from the segmentation of product lifecycle into dev and ops. The platform user still owns the end to end product lifecycle.

throwaway894345 · on Sept 25, 2021

Yeah, I buy that.

Bertram_Oglebs · on Sept 25, 2021

'...um before all these tools being able to reach similar conclusions ?' (-;

(OT) ^^ somehow related comic: https://ibb.co/JktgqSV

best,

ab0aa907 · on Sept 25, 2021

100% agreed. Kubernetes/DevOps is huge cognitive load, way over-engineered for an average project. Kubernetes should not come into picture unless you can afford a full-time DevOps person for your team. If you can’t then you are not big enough or haven’t solved a real problem yet.

commandlinefan · on Sept 25, 2021

> shouldn't need to know

Hm - maybe shouldn’t need to, but why wouldn’t you want to? Even if its not strictly your job/responsibility, its always helpful to know how things work when things go wrong.

_6pvr · on Sept 26, 2021

Because if you followed this logic there would be many lifetimes of things to pay attention to, most of which are just noise surrounding topics you find valuable.

Jugurtha · on Sept 25, 2021

We "do ML" for large organizations as a tiny consultancy. The way we've been able to improve the working conditions for ourselves (developers and data scientists) was by focusing on two things:

- Process: we analyzed what worked and what didn't in past projects. Continuously auditing and trying to extract learnings. We made sure people we built for at the client organization were involved. We scoped more thoroughly. We involved parts client organization that could torpedo the project downstream (legal, security, etc) upfront. Made fewer assumptions. Listened more.

- Tooling: we built a machine learning platform[0] to make sure a data scientist doesn't tap on anyone's shoulder to troubleshoot their system, set up their computing environment, or deploy their model. They could do it themselves. Furthermore, it wasn't necessary to get people who could move across the stack.

Changing our processes and the way we do consulting had a huge impact. A badly scoped project will in some way or another create toil downstream and create a situation where you need people to do full-stack and you need "all-hands-on-deck" constantly. That's just bad, and after we ruthlessly reworked the process, we had better results, better relations with clients, better cadence, etc. I emphasize on this because we were a larger team at some point running around working on so many projects simultaneously that everyone was practically burned out.

-[1]: https://news.ycombinator.com/item?id=28373127

teruakohatu · on Sept 25, 2021

It looks good. Resubmit that to Show NH.

Jugurtha · on Oct 7, 2021

Hi, I re-submitted it here: https://news.ycombinator.com/item?id=28777589

Jugurtha · on Sept 25, 2021

Thanks. It fell between the cracks on HN, and I didn't want to re-submit it not to be spammy.

Although we technically added multi Kubernetes cluster support. It was only GKE, and now it runs notebooks and workloads on AWS EKS, Azure AKS, and DigitalOcean as well. I'm not sure it's enough of an improvement according to the Show HN rules to re-submit. Plus I'm reworking the landing page and docs to add more clarity on what this thing does, with gifs showing RTC and all.

Do you have any feedback?

boulos · on Sept 25, 2021

Your headline "Get Data Products Right" is much more vague than the first sentence of your Show HN: "iko.ai offers real-time collaborative notebooks to train, track, deploy, and monitor models"

I would update both the title tag and that headline to be a condensed version of that sentence. I'd also suggest considering the buzzword "lifecycle" to merge write/deploy/track/monitor (test?): "Collaborative notebooks for your ML-model lifecycle".

Jugurtha · on Oct 7, 2021

Hi,

I submitted a Show HN here: https://news.ycombinator.com/item?id=28777589 after revamping the landing page. What do you think?

Jugurtha · on Sept 25, 2021

Thanks, boulos. (I considered sendig you a weird incident on GCP, by the way).

>Your headline "Get Data Products Right" is much more vague than the first sentence of your Show HN: "iko.ai offers real-time collaborative notebooks to train, track, deploy, and monitor models"

In the current draft, the headlie stays because it's the goal but the sentence "The machine learning platform for real world projects" is replaced by "Real-time collaborative notebooks to train, track, deploy, and monitor your machine learning models".

>I'd also suggest considering the buzzword "lifecycle" to merge write/deploy/track/monitor (test?): "Collaborative notebooks for your ML-model lifecycle".

I considered it, and even to use MLOps, but I'll postpone it for now. Every "validate-the-market" landing page claims "end-to-end lifecycle management no-code MLOps AI", therefore I wanted to be humble, thus specific in what this does for now.

The docs will also be improved and the "UX flow" as well to get the users unstuck from sign-in to job done smoothly. We won't look at making it pretty for now, though.

You're right in that the copywritig is vague. Maybe this helps: https://docs.google.com/document/d/1xwPcPXAxT-nVimAiT2AWEIu9...

Jensson · on Sept 25, 2021

Data scientists wants salaries like software engineers which is why they get requirements like software engineers. There are plenty of data scientist positions where all you need to know is excel, but those doesn't pay nearly as well. And if you look at the typical software engineering position there is almost always a slew of adjacent technologies, it is hard to get a position today where you only have to know one thing.

musingsole · on Sept 25, 2021

I don't believe pay directly influences job responsibilities like that. Maybe scale of responsibilities. But more pay doesn't mean you start doing something outside the job description.

The business leaders and managers trying to load kubernetes work on data scientists are doing so because the managers don't know what they're doing, what they want or who they need to get it done. Instead, they have the one hire they got greenlit last year and if that person can't do EVERYTHING, your group is screwed.

Zababa · on Sept 25, 2021

That's pretty much what I came to say. Expectations to know all the stack are what software engineers face.

miscaccount · on Sept 25, 2021

Not only stack but all roles from devops to devsecops to QA to performance and loads testing.

Seems they want to replace one team with all responsibilities

Zababa · on Sept 25, 2021

Pretty much! QA teams are almost not a thing anymore, and you're lucky if you have specific people taking care of ops and tooling these days. Most of the time it's "the most involved people will work on them when they have a bit of time".

piggybox · on Sept 25, 2021

"Data scientists wants salaries like software engineers" This is a bit weird. In general, DS is still one of highest paid jobs in recent years, if you check any job market report.

jrockway · on Sept 25, 2021

I think what's going on here is that tech leadership folks know that the models the scientists develop eventually need to feed into their live product (so need to be "production ready"), but there isn't enough work to have two teams; one to develop the models, and one to run them in production. Thus, the ideal employee is an expert in everything! That's valuable, but not likely to be something you find when both data science and SRE are deep fields where people are very successful only knowing one of them ;)

I work on something called Pachyderm, which is a Kubernetes-based data storage and job execution system that tries to bridge this gap. We have a managed solution (https://hub.pachyderm.com) where we provision your Kubernetes cluster and do all the management (keeping the software up to date, authentication and authorization, etc.) and in fact don't even expose kubectl to you. You'll never see any of the Kubernetes stuff (though you might recognize certain error messages, I suppose). You just supply your code and a specification for how data flows around your pipelines, add your data, and we do the rest. Data scientists can interact with the versioned inputs and outputs through notebooks, but you're getting the full suite of production features behind the scenes -- a history of exactly which data inputs went into which data outputs, incremental processing, seamless autoscaling (set cpus: 8, gpus: 1 in your pipeline specification, and we find you a machine that meets that spec, add it to your cluster in less than a minute, schedule your work there, and remove the machine when the job finishes), etc.

Sorry for the sales pitch. I pretty much never use HN to shill my paid work, but it seems especially relevant to this sort of problem. Maybe you don't need the unicorn employee that is an expert in multiple fields -- focus on the data science and let us actually deal with the ugliness of computers ;)

(And if you do like Kubernetes but don't want to write your own orchestration system, Pachyderm itself is open source.)

marcinzm · on Sept 25, 2021

> but there isn't enough work to have two teams

Two teams causes an issue where scientists chuck models over the wall for the engineers to somehow rebuild into a semi-workable approach. The end result isn't great because you can't build good production models without taking production deployment into account. You also can't convert non-production models into production models without understanding the modeling assumptions that happened.

The general result is that the engineers and leadership finds the results underwhelming to horrible. The scientists often don't care because what happens on the other side of the wall isn't their problem.

That doesn't mean everyone has to know everything but separating people into teams is not the answer. Have a single team with people of different focuses and areas of expertise.

nerdponx · on Sept 25, 2021

Forgive my ignorance, but wasn't (isn't?) Pachyderm a Hadoop data version control tool? Did the product pivot?

samuell · on Sept 28, 2021

The (very) short version is that it is a much better alternative to Hadoop, built for the container era.

commandlinefan · on Sept 25, 2021

> there isn't enough work to have two teams

There may not be enough work for two teams 100% of the time, but there sure is when TSHTF. Manufacturers understood the need for some slack, but software companies still haven’t figured this out.

urthor · on Sept 25, 2021

I don't think it's a particularly new feature of software development that a few highly paid employees who've got the entire stack in their brains are vastly more productive than a vast cross functional team.

urthor · on Sept 25, 2021

Also I will say.

There is some fantastic tooling for machine learning.

Databricks, GCP, everyone knows it.

The issue is that the data industry was raised from birth in complete fear of the boogeyman.

The boogeyman is Oracle. And the frankly ridiculous things Oracle did in the bad old days.

Hence most places have a constant internal conflict between "look here are all these brilliant data science tools" and "ah shit, GCP costs a ton of money when some idiot runs a select * query on a join across 5 TB of data."

But there are plenty of great tools.

tomrod · on Sept 25, 2021

Can you speak a bit more to this? I dislike Oracle with a passion, but I am not sure how the GCP comment connects.

urthor · on Sept 25, 2021

It's just $$$.

You can save a LOT of money in GCP by specifying the columns you actually need in your queries, and various very simple SQL optimization techniques.

Everyone is scared of the cost of these vendor tools.

tofflos · on Sept 25, 2021

It's a price data scientists have to pay in order to work in rapidly evolving business and solution spaces. Someone within the local organization has to experience all these tools before being able to reach similar conclusions. Many organizations are still struggling to get the data science infrastructure in place so they look for full-stack people to help get the ball rolling and start making progress on some initial set of prioritized business problems.

A few organizations are further along on that journey enabling their data scientists to focus on things other than process and tooling. Full-stack will be in demand until the solution space stabilizes and the bulk of organizations catch up.

hobofromabroad · on Sept 25, 2021

That might be true for startups. But larger business organizations are far better of creating a specific heterogeneous team with data scientist, data engineer and ops in one. At least starting out. That way, there is inherent knowledge transfer. You are not artificially limiting your hiring pool and can actually get some T shaped folks being experts in a certain domain.

Later on you can then build more specific teams or even more cross functional ones.

Of course, if you only want feel the waters and check if DS use cases are viable at all, consider getting a (few) freelancers and but a somewhat technically inclined person in charge. If that's a success use it to get funding for a proper team.

rjzzleep · on Sept 25, 2021

This is a pretty good post. I completely agree that a data scientist should not need to know Kubernetes.

There is a section about Airflow and while the author doesn't advocate for it, I've very much like it many many times. People still recommend it, but I find it to be an absolute nightmare to deal with.

One thing I have learned dealing with different data science teams is something else though. I have gone through every single pipelining tool(including pachyderm) and stream processing tool that was available at the time. The thing that people forget is that every single one of them has a thing that throws you off of what you actually want to accomplish or has some sort of caveats in your use case.

The important thing to note is that the job of the architect or whatever you want to call that person, is to provide an infrastructure where the data scientist can just run their code. And no matter which one of these environments you use you still need to build glue code for your use case. Even if that glue code is python library with a good distribution mechanism.

tdeck · on Sept 25, 2021

> There is a section about Airflow and while the author doesn't advocate for it, I've very much like it many many times. People still recommend it, but I find it to be an absolute nightmare to deal with.

Airflow's UX is just needlessly easter-eggy and bad. The one thing I'd want out of the dashboard is the list of recent job runs and whether they succeeded or failed, so of course that's hidden in such a way that a novice has to click 10 different places to find it. There's also the fact that they chose to call a timestamp "execution time" when it often doesn't correspond to the time the job is executed. Want to add parameters to your task? You better like hand-writing JSON or pasting it into a textbox because apparently that's a weird thing to do, so why bother adding any UI support for it.

crucialfelix · on Sept 25, 2021

I found Argo Workflows (k8s job and pipeline manager) much easier to work with and manage than Airflow. But I know Kubernetes and find it easy, so ..

FpUser · on Sept 25, 2021

I am a developer and do not know much about k8s. Well I know the theory and what they're for and could learn to use it in practice. However I have yet to find a single case amongst my clients where all this infrastructure overhead will provide positive ROI. I do not deal at Google scale and for normal businesses a single instance of properly written server deployed on dedicated hardware covers all their needs many times over. It serves as many requests as they can ever hope for without breaking a sweat.

kureikain · on Sept 25, 2021

I had extensive airflow and I generally agree that Airflow isn't a good solution. It good when you process a single atomic/"unit of work" per step, when each step process multiple files etc and if it's restart you have to write code to handle skip those processed file for example.

But I want to point out a few things that are wrong in the artcile to help other evaluate airflow.

> Second, Airflow’s DAGs are not parameterized, which means you can’t pass parameters into your workflows. So if you want to run the same model with different learning rates, you’ll have to create different workflows.

You can pass the parameter to workflows by giving it a JSON config. When trigger on the UI, you can paste the JSON with the right argument/parameters into your DAGs. So you can train model with different arguments etc

> Third, Airflow’s DAGs are static, which means it can’t automatically create new steps at runtime as needed.

You can absolutely create new steps at run time. The point of airflow is everything is just Python code that is evaluate to generate DAGs, as long as you generate the DAGs and write the operator. It will happily run and log. It may have trouble rendered on the UIs and cause some weird issue (tasks won't advanced after certain steps regardless when I last work on them but they are bugs).

You can write an operator, the operator in turn can initiate any other known operators, and point the next steps to those operators. Here is an example: https://stackoverflow.com/questions/41517798/proper-way-to-c...

samuell · on Sept 28, 2021

The one drawback I did note with Airflow was none of the mentioned ones, but this: It does not allow defining data dependencies at the data level. That is, in terms of individual inputs and outputs of a process or task.

I cover this issue in some detail in a blog post from a few years back: https://rillabs.org/posts/workflows-dataflow-not-task-deps

dudeinjapan · on Sept 25, 2021

Waiters shouldn't need to know anything about cooking.

However, knowing a bit about cooking might one a better waiter.

thom · on Sept 25, 2021

Full stack data scientists exist. They have certain advantages over others. Specialists exist. They have certain advantages over others. Live your life, be free.

sandGorgon · on Sept 25, 2021

I'm kind of surprised at seeing kubeflow vs metaflow levels of abstraction honestly.

If you are indeed talking from a data scientist POV - then the right abstractions here are Dask and Ray Distributed.

Both can run on Kubernetes as the underlying orchestration layer - but are a pythonic interface to distributed data science primitives.

falcolas · on Sept 25, 2021

My opinion is simply: You should understand the environment your code runs in. Be it bare metal, Kubernetes, or anything in-between. How that environment works determines how your code works - or doesn’t work.

Despite our best efforts, we have yet to abstract away the runtime environment. Despite Java’s best efforts.

bwship · on Sept 25, 2021

I don't really agree on this. If your data scientists are extracting important information about your data in Python or R. The actual hard work of this is them figuring out the algorithms to run, not what it is being run on. They develop this code to sift through data in a data warehouse, a database, or flat files and then come up with answers. What servers, or cloud infra, or kubernetes fleet it then runs on is of 0 concern to the actual code they just laid down.

6nf · on Sept 25, 2021

Do you believe front-end CSS / HTML designers should understand the entire stack down to the machine code and hardware running the VMs? I don't think I can agree with this, our stacks are too tall these days.

thinkharderdev · on Sept 25, 2021

I think this actually gets at an important distinction. Are Data Scientists more like designers or developers? UX designers shouldn't need to know anything about k8s (or any other infrastructure) but developers should. Ultimately if you are not only responsible for building something but also running it in production and maintaining availability then you need to understand the infrastructure it runs on to some degree.

vasco · on Sept 25, 2021

That runs in a browser, so they should understand the browser if you follow their logic.

shoo · on Sept 25, 2021

i don't think it is necessary or sufficient for all individuals to have a deep understanding of the runtime environment. i agree that if the team needs to ship production code, it would be a good idea if at least one person in the team has a good understanding of how the code runs.

but there are other failure modes -- if everyone on the team is great at writing efficient production code, but no one understands the business context or the problem domain or understands if the problem they're attempting to solve is even vaguely feasible from some kind of theoretical perspective (maybe someone with a decent statistics background could demonstrate the entire premise of the project is flawed, and needs to be rethought, using a blackboard and no computers at all), maybe they'll spend months or years building and deploying a lot of fast, beautiful, completely worthless machinery.

tluyben2 · on Sept 25, 2021

Except that most devs who learn this stuff but do not use it daily (or ever) (and why would they, they are devs), will know just enough to have opinions and too little for them to make sense. You (in general, maybe YOU do) do not understand the env your code runs on: it is layers on layers on layers with millions of LoC in between; you know some abstraction and maybe you know a bit more about this abstraction than others but you still do not understand it really. If you run Java or .NET Core or whatever popular with good support, your day to day programming won't matter for whatever env it runs on; if you write best practice code in those envs, writing different code for whether it runs in k8s or bare metal is... weird in almost every case. Someone in the team should know how to tweak the knobs and if there are things you should not do (use the filesystem for persistence and other trivial things) but the average dev or data scientist really doesn't need to know about it in any significant detail.

But I am curious where you have seen modern runtimes fail and where the code was the issue (not tweaks to the JVM settings); any concrete examples where well written, best practice code worked on the laptop but failed in k8s?

quadrifoliate · on Sept 25, 2021

> But I am curious where you have seen modern runtimes fail and where the code was the issue (not tweaks to the JVM settings); any concrete examples where well written, best practice code worked on the laptop but failed in k8s?

Not sure about OP, but the most times I have seen devs have issues with Kubernetes is in the tweaking of the knobs around deployments including security. Startup v/s readiness v/s liveness probes, rolling updates, auto-scaling, pod security policies and such are usually all-new to developers, and have a lot of different options. Most devs just want "give me the one that works, with good defaults", and need a higher level abstraction.

tluyben2 · on Sept 25, 2021

But at most companies I have seen those are handled by specific roles in the company who are in the team as well. Not all devs on the team need this knowledge. Depending on the service, you need resourcing. We have monoliths and microservices running on ecs and eks and we have 1 person who does the knobs turning and 1 person (me) who can take over if need be. I see no need to burden others with this, I dare say it, crap, because it is just not really useful or needed for writing business functionality that our clients want and need and pay for.

OP seemed to imply that coders needed to know this stuff because their code might not work: if that means turning knobs on the outside (runtimes/containers) then sure, but the devs don't need to know, but their comment about the JVM implies something else and I am curious what that is.

pjmlp · on Sept 25, 2021

Java was doing alright, but then plenty of people decided they didn't want to go along with it.

We develop on Windows, and deploy on multiple kinds of scenarios and OS stacks, I hardly have to care what lies underneath.

Same applies to .NET, although in smaller extent, given its Windows focus until quite recently.

harpratap · on Sept 25, 2021

>Despite our best efforts, we have yet to abstract away the runtime environment. Despite Java’s best efforts.

I think containers are a pretty good attempt at abstracting away runtime environments, no? Same docker image works on your local docker setup, docker-compose, vanilla kubernetes, managed kubernetes, fancy PaaS like CloudRun, Fargate, Heroku etc

isbvhodnvemrwvn · on Sept 25, 2021

That's just running the code, you need to connect to something or have something connected to it, handle failures etc - docker doesn't solve these on its own.

mmarq · on Sept 25, 2021

These requests are not unreasonable in organisations that only need to run some simple (from a mathematical standpoint) operations against a complex (from an IT perspective) dataset. Quite often you don't need a full time statistician or mathematician, but you can make it a full time job if you hire a sysadmin or a developer that understand statistical distributions and hypothesis testing, and you put them in charge of the whole data infrastructure.

I'm not saying this is the majority of data scientists jobs, but in some organisations I worked for the data analyst was a guy that run `SELECT MIN(v), MAX(v) AVG(v) from TableX` against a MySql DB, so they were also in charge of DB administration and data ingestion, otherwise it would not have been a full time job.

alxmrs · on Sept 25, 2021

My favorite infrastructure abstraction tool in this category is Apache Beam. I like that it lets you think in Python and an explicit Map Reduce DAG. Serialization errors are a bear to deal with. But, the power and composability of the framework make it nearly addictive.

ricklamers · on Sept 25, 2021

This post really resonates with why we created Orchest [0]

From the article: "involve two full sets of tools: one for the dev environment, and another for the prod environment"

This is what we think should change. We intend to bring dev and prod into a single cohesive environment. Initially it will be difficult to cover all types of production workloads (like the post mentioned, production is a spectrum). But what we've observed is that through container encapsulation we can create well defined production workloads that we can run on any container orchestrator while shielding the data scientists from that complexity during pipeline development _and_ deployment.

With a container first approach to DAGs it becomes trivial not just to mix library versions but even languages (e.g. feature extraction in Scala and model fitting in Python). In practice, this flexibility has resulted in a significant productivity increase because existing code "just works". No "one virtual environment to rule them all" necessary.

I like how the article does justice to the fact that there's a subtle yet important difference between mere workflow orchestrators and workflow orchestrators that take on meaningful responsibility when it comes to infrastructure. To really unburden the data scientist from having to be a full-stack unicorn you need to hide the underlying stack to the point where it's invisible. In that sense, the OS kernel analogy really works. Similarly, how many data analysts writing SQL have ever worried about database node sharding?

A big problem we see in the space is that there are still way too many leaky abstractions and data scientists end up dealing with architecture & config yet again, for many a task out of their depth. We hope to contribute to a better ecosystem, one where data scientists spend their time looking at the data, relating it to the domain, shipping value generating data pipelines/models, and communicating about results with their stakeholders. Not fighting config & infra.

[0] https://github.com/orchest/orchest

spicyramen · on Sept 25, 2021

Very limited and unfair comparison between Kubeflow and metaflow. Metaflow is dependent on AWS (it is mentioned but not emphasized). To me this is a non-starter. It makes sense for Netflix but not for the rest of the world

vtuulos · on Sept 25, 2021

As the article mentions, Metaflow will start supporting Kubernetes natively soon, although data scientists don't need to care about it :) Nothing changes in your Metaflow code when you move e.g. from AWS to Azure, so Metaflow isn't fundamentally dependent on AWS in any way.

Netflix is an AWS shop, so naturally we started with AWS integrations.

m0zg · on Sept 25, 2021

Increasingly data scientists need to know a thing or two about underlying tech. Otherwise you're limiting yourself to stuff that can be built on a single machine, and that doesn't get you very far. That said, with that list of qualifications they'll be looking for a very long time, especially if they aren't prepared to hire a $400/hr contractor to do all that stuff. Such people exist, there are just very few of them, and they're booked solid months in advance.

savin-goyal · on Sept 25, 2021

A single machine can take you remarkably far these days, given the availability of high RAM/Disk/CPU machines in the cloud.

mark_l_watson · on Sept 25, 2021

I agree. A huge GCP VPS with a good GPU attached is very inexpensive when you only start it when you are in a work sprint.

Just this week I have been experimenting with SageMaker and SageMaker Studio. Too early for a real evaluation, but it looks like SageMaker Studio hits many requirements: good for experimenting, run large distributed jobs, good model and code versioning tools, easy to publish REST APIs, etc. Just yesterday someone asked me to review 3rd party tools, and I look forward to getting a better understanding of how SageMaker Studio stacks up against turn-key systems.

I have built my career from standing on the shoulders of giants. I am not shy about just using the results in academic papers, using open source libraries, tools and frameworks, etc. that other people have written.

So, I agree with you that so much can be done on a single beefy VPS, but services and frameworks that allow easy use of multiple servers are also important.

m0zg · on Sept 25, 2021

In plain old data science? Sure. In deep learning? Nope. Gotta be distributed unless you want to wait until the Sun burns out.

lvl100 · on Sept 25, 2021

This is laughable. 15 years of DL? I ran neural net models more than 15 years ago. It wasn’t even accepted back then. Heck people looked at you weird if you mentioned Python. As far as I am concerned if you tell me you did DL before 2013 as a “DATA SCIENTIST” you are full of shit.

As far as OP, how do you learn Docker without Kubernetes these days? To me this is like saying you don’t need to learn Windows because all you do is run the solver in Excel.

TruthWillHurt · on Sept 25, 2021

What DO they know? Their Python code is sub-par, a procedural script not suitable for production use. They can't use Git, They don't write tests. They don't understand how to deploy/use CICD.

Maybe they should stick to spreadsheets, or upskill a bit so they don't consume so much of the engineers time.

zwaps · on Sept 25, 2021

You pay these people for their PhD level knowledge of math and stats, because that is a sparse skill: No matter how many Coursera courses one does, you can not upskill anyone to that level (at least, I have never seen it).

So, if their time is better spent applying that knowledge rather than thinking about infrastructure trivialities, then by all means, pay an engineer to clean up. In the end, that's still more cost-efficient.

That being said, I refuse to believe that anyone leaving university today with a degree in stats/ML/econometrics etc. doesn't know git and can not be taught good programming that doesn't at least interfere with operations.

But as soon as you start requiring your experts to do infrastructure, you are either wasting money, or you hired a quotation mark "data scientist" with a degree from medium.com and towardsdatascience.org or whatever - in which case by all means, require them to do engineering duties.

tchalla · on Sept 25, 2021

It's not surprising that some scientists aren't the best at engineering practices given that it's not their speciality. Much like some engineers aren't experts in scientific either. May be, both science and engineers should learn to understand their limitations and collaborate towards achieving a common goal. That would be productive over being condescending.

EastSmith · on Sept 25, 2021

Nobody that is not in a system administrator / dev ops role needs to know about it. I do not want to know about it. I am not explaining react reconciliation in my scrum updates, so stop giving me updates about Kubernetes.

justsomeuser · on Sept 25, 2021

Sure they don’t need to know how to schedule their computations on CPU’s as another team member can handle it, but I think the reality is that if you work in software you have to constantly be learning.

tuananh · on Sept 25, 2021

by that def, developers shouldn't need to know Kubernetes as well..

however, with the raise of devops culture, everyone should know the stack so they can use the platform effectively. everyone needs to up skill.

sgt101 · on Sept 25, 2021

I am really puzzled by "production is a spectrum". Production means that the code is run with a support team to an sla - the support team must have accepted it to service and be confident that they can deal with what might go wrong.

That's production.

tedk-42 · on Sept 25, 2021

Kubernetes, Linux, CI/CD pipelines, unit testing...<insert tech here/>

To be honest the landscape is constantly changing and people should learn as much as they can.

I call ignorance on these kinds of posts.

etaioinshrdlu · on Sept 25, 2021

I don't want to learn just anything, I want to be careful to learn things that won't go out of date quickly.

tedk-42 · on Sept 25, 2021

OK PHP is great for you then. C and Java won't go out of date quickly.

Not the hottest tech out there but they have a long used-by date if that's what your major concern is.

If you believe k8s will just 'go away', you don't really have a good clue about what it tries to solve and instead, get confused in its complexity. Having been around the block, i can see it sticking around for at least 10 years.

strzibny · on Sept 25, 2021

Learn more of the underlying knowledge (which is what I teach in https://deploymentfromscratch.com/) and your knowledge will last longer. Ansible YAMLs or your CI/CD provider YAMLs are just abstractions.

But forcing anyone, especially data scientists into a specific and quite complex tool of the day? Pass.

tedk-42 · on Sept 25, 2021

No-one is forcing anyone to do anything.

Tools are built by people that use them. If your team chooses to deploy their applications on a k8s stack, it's on them to own that and not treat it like a black box.

I'm completely against the entitled belief that a person 'shouldnt need to know how to <x>'.

I can stretch the example in many ways: 1) if you're commit secrets into your source code and claiming a 'data scientist shouldnt need to know about secrets management' 2) if you're building a data analysis script and you leave it as an undocumented mess that's not got no unit tests and one day it breaks, you shouldn't claim that 'a data scientist shouldnt need to know about testing'

Oh cry cry there's a tech that everyone is using but i don't want to learn it / i dislike doing that particular thing / working with that piece of tech.

Build your own damn tech stack/computer if you think you can do it better. Or ask in the job interview if your team is running their data science platform on k8s if you dislike operating apps on it so much and deny the job.

rc_hackernews · on Sept 25, 2021

Bought your book awhile ago from someone that mentioned it on my Twitter feed.

I’m really enjoying it! You did a great job on it.

I especially liked the chapter on networking since it was always something I was weak in.

strzibny · on Sept 25, 2021

Thanks a lot for your kind words. Super happy you are enjoying it!

alexnewman · on Sept 25, 2021

i’ve heard a lot about people don’t want to learn the stack they program on.

fithisux · on Sept 25, 2021

Sooner or later DSes will need to become Full-Stack. Knowing Kubernetes will be an advantage.

streetcat1 · on Sept 25, 2021

There is a reason that operating systems is a mandatory course in any respectable CS program. Kuberentes is no difference.

Data scientist should know about kubernetes as much as they should know how to program.

phendrenad2 · on Sept 25, 2021

Developers in general shouldn't need to know about Kubernetes, but it's become trendy to slash your IT/Ops teams to the bone and instead accept that your developers will just spend all of their time trying to configure GCP.

thinkharderdev · on Sept 25, 2021

I don't understand how you would do your job as a developer without understanding the infrastructure it runs on. I agree that it can make sense to have dedicated people do all the infrastructure setup/management/etc, but when you have an application running in production there are a lot of considerations which can't be cleanly separated from underlying infrastructure. Not to mention troubleshooting production issues. When something is not working in prod, the first thing I do is check basic operational stuff with the underlying deployment. Are all the pods still running? Have there been any restarts? If there is some DNS/network error how can I spin up a pod in the cluster to check on various things?

throwaway894345 · on Sept 25, 2021

With an Ops team, developers aren’t expected to operate their code. That’s the ops team’s problem. And the ops team is measured on uptime, which is a function of the code itself, which they can’t actually change—devs own that. What the ops team can do is to slow down the rate of deployments (another input to downtime/uptime). Rather than many small deployments, they’ll have larger deployments once or twice a quarter (at best).

So a desire to ship features regularly and preserve agility and quality is the “trendy” that the GP is talking about.

thinkharderdev · on Sept 25, 2021

Regardless of how often you ship, things still break sometimes though right? And you still need to find out why when they do. Often the issue is some interaction between application and infrastructure which requires knowledge of both to understand. Long before k8s was a thing and I worked in an environment like you describe above I still knew how the infrastructure worked even if I personally wasn't allowed to touch it.

throwaway894345 · on Sept 25, 2021

> Regardless of how often you ship, things still break sometimes though right? And you still need to find out why when they do.

The point is that under the traditional model, ops is responsible for the debugging, and they are typically already familiar with the infrastructure. Of course, things in organizations are rarely neatly isolated like this, so certainly developers would help with the debugging in many other, and having infra expertise will help.

phendrenad2 · on Sept 25, 2021

> When something is not working in prod, the first thing I do is check basic operational stuff with the underlying deployment. Are all the pods still running? Have there been any restarts? If there is some DNS/network error how can I spin up a pod in the cluster to check on various things?

And how much less downtime would you have if domain experts were doing that part?