I’m glad I’m not the only one. When I inherited some “production notebooks” (if that’s a thing) I couldn’t believe it was nearly impossible to do basic things such as test and review changes (via version control).
At our company, if it's in a notebook it's not considered ready for production; it must run as a script before Eng will consider taking it over from DS. It's actually not that hard to write a notebook in such a way that it converts easily to a script. Just make sure that your variables/functions/whatever are initialized above the cell(s) they're used in, declare all imports in the top cell, and periodically move cells to fix any inconsistencies with these rules (checking that you didn't break anything, of course). I've always said that "Data Scientist" doesn't mean "I don't do engineering." Good basic eng practice makes data science more productive and brings it into production more robustly. How do you know your models work well if the code that generated them is inscrutable?
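For what it's worth, the two rules above (imports in the top cell, names defined above the cell that uses them) can be checked mechanically. Here is a rough sketch, not a real linter: `check_notebook` and `check_file` are names I made up, it only relies on the `cells` / `cell_type` / `source` fields of the .ipynb JSON, and it works at whole-cell granularity, so it will miss ordering problems inside a single cell and will choke on `%%` cell magics.

```python
import ast
import builtins
import json


def _bound_and_loaded(tree):
    """Collect the names a cell binds and the names it reads, anywhere in the cell."""
    bound, loaded = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            (loaded if isinstance(node.ctx, ast.Load) else bound).add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            bound.add(node.name)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)
        elif isinstance(node, ast.alias):
            # "import a.b" binds "a"; "import a as x" binds "x"
            bound.add((node.asname or node.name).split(".")[0])
        elif isinstance(node, ast.ExceptHandler) and node.name:
            bound.add(node.name)
    return bound, loaded


def check_notebook(nb):
    """Return human-readable problems that would bite a top-to-bottom script export."""
    problems = []
    defined = set(dir(builtins))
    code_cells = [c for c in nb.get("cells", []) if c.get("cell_type") == "code"]
    for i, cell in enumerate(code_cells, start=1):
        src = cell.get("source", "")
        if isinstance(src, list):  # .ipynb stores source as a string or a list of lines
            src = "".join(src)
        # Drop IPython line magics and shell escapes so ast can parse the rest.
        src = "\n".join(
            line for line in src.splitlines()
            if not line.lstrip().startswith(("%", "!"))
        )
        tree = ast.parse(src)
        if i > 1 and any(isinstance(n, (ast.Import, ast.ImportFrom)) for n in ast.walk(tree)):
            problems.append(f"cell {i}: import outside the top cell")
        bound, loaded = _bound_and_loaded(tree)
        for name in sorted(loaded - bound - defined):
            problems.append(f"cell {i}: '{name}' is used before any cell defines it")
        defined |= bound
    return problems


def check_file(path):
    """Convenience wrapper: load an .ipynb file from disk and check it."""
    with open(path, encoding="utf-8") as fh:
        return check_notebook(json.load(fh))
```

Cells are numbered by counting code cells only, so markdown cells don't shift the numbers. Running something like this before each commit is a cheap way to do the "periodically move cells" step without relying on memory.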
I wonder how much of the "3 engineers for 1 data scientist" ratio I hear all the time is due to Data Engineering being assigned to clean up code that should have been better in the first place.
I think cleanup is part of it. As someone in the DS job family who has taken a large interest in SDE work, I've also noticed that separating the job families can result in churn.
For example, I might think of three model choices A, B and C. C may be the worst of the three, but only very marginally worse. It can also be the case that C is an order of magnitude easier to keep and maintain in production.
I've seen cases where the wrong choice here ends up requiring three SDEs for half a year (about 18 engineer-months), where if they had given up the tiny benefit of the best model, they could have done it with one SDE in one month.
You don't use Jupyter notebooks in production; they are super useful for pitching ideas to clients/bosses and doing some early prototyping. I feel sorry for anyone that has to work with "pure data scientists" that have no clue about software engineering practices...
It depends on what you're doing, yeah? In RMarkdown notebooks... yeah, I wouldn't write models in one. But if the focus is on embedding some visualizations and tables into a document, and then refreshing the document every so often to pull in new data, I can see that as a production use for a notebook. TL;DR: Can be useful for reporting, wouldn't use it anywhere else in the pipeline.
I'm coming to realize one of the key skills for a data engineer to have nowadays is "productionizing" notebook code from data scientists and PMs and teaching them to make it more testable and modular in the first place.
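To make "more testable and modular" concrete, here is the kind of refactor I usually mean: take the logic that lived in a notebook cell and leaned on globals, turn it into a plain function that takes its inputs as arguments, and put a small test next to it. This is only an illustration; `clean_prices` and the `sku` / `price` fields are made up.

```python
def clean_prices(rows):
    """Drop rows with a missing or non-positive price; convert the rest to float."""
    cleaned = []
    for row in rows:
        price = row.get("price")
        if price is None:
            continue
        price = float(price)
        if price <= 0:
            continue
        # Copy the row so the caller's data is not mutated.
        cleaned.append({**row, "price": price})
    return cleaned


def test_clean_prices():
    rows = [
        {"sku": "a", "price": "3.50"},
        {"sku": "b", "price": None},
        {"sku": "c", "price": "-1"},
    ]
    assert clean_prices(rows) == [{"sku": "a", "price": 3.5}]
```

Once the logic is a function with no hidden state, the notebook can import it for exploration and the pipeline can import the same code, so there is only one version to review and test.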
Though the name "data engineer" may be newish, the role is really an old one - and this aspect has always been the single most important part of the role.