I think MLflow is a good idea (very) badly executed. I would like to have a library that combines:
- simple logging of (simple) metrics during and after training
- simple logging of all arguments the model was created with
- simple logging of a textual representation of the model
- simple logging of general architecture details (number of parameters, regularisation hyperparameters, learning rate, number of epochs etc.)
- and of course checkpoints
- simple archiving of the model (and relevant data)
and all that without much (coding) overhead and only using a shared filesystem (!), and with easy notebook integration. MLflow just has way too many unnecessary features and is unreliable and complicated. When it doesn't work it's so frustrating, and it's also quite often super slow. But I always end up creating something like MLflow when working on an architecture for a long time.
EDIT: having written this... I feel like trying to write my own simple library after finishing the paper. A few ideas have already accumulated in my notes that would make my life easier.
EDIT2: I actually remember trying to use SQLite to manage my models! But the server I worked on was locked down, and going through the process to get somebody to install SQLite for me was just not worth it. It also wasn't available on the cluster for big experiments, where it would have been even more work to get, so I gave up on the idea of trying SQLite.
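For illustration, here's a rough sketch of the kind of shared-filesystem tracker described in the wish list above. The `Run` class, file layout, and method names are all hypothetical, not an existing library:

```python
import json
import time
import uuid
from pathlib import Path

# Hypothetical minimal tracker: one directory per run on a shared
# filesystem, params as a JSON file, metrics appended as JSON lines.
class Run:
    def __init__(self, root="./shared_experiments", name=None):
        self.dir = Path(root) / (name or uuid.uuid4().hex[:8])
        self.dir.mkdir(parents=True, exist_ok=True)

    def log_params(self, **params):
        (self.dir / "params.json").write_text(json.dumps(params, indent=2))

    def log_metric(self, key, value, step=None):
        with open(self.dir / "metrics.jsonl", "a") as f:
            f.write(json.dumps({"key": key, "value": value,
                                "step": step, "time": time.time()}) + "\n")

run = Run(name="resnet_baseline")
run.log_params(lr=1e-3, epochs=10)
run.log_metric("val_loss", 0.42, step=1)
```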
Yep - totally agree. I respect the attempt to introduce something that is basically an opinionated CRUD app as a central place to put your model metadata. But it's not really ready for large-scale production, or for use by teams bigger than about 5.
It's kind of flaky and slow. It doesn't have namespacing. It's overly opinionated on the workflow (the way that states work, with a model version being in exactly one of dev, staging, or prod, is super hard to work with).
But beyond that, the biggest problem I have with MLFlow is what I call the "part of this complete breakfast" problem, which the ML/data-science arena is particularly susceptible to these days: the marketing talks a lot about what problems can be solved using the product, but not a lot about what parts of the problem the product actually solves. This is often because an honest answer to the latter question would be "not much". In the case of MLFlow, that would be totally fine, because honestly an opinionated CRUD app is a very useful thing. But it should be a lot more honest about what it does. It's not a system for automatically tracking model metrics, it's a database into which you can write model metrics with a known key structure.
I've never encountered a Python installation on any operating system where `import sqlite3` worked but the underlying libraries were not available.
I imagine this is because SQLite is VERY easy to bundle with Python itself. So on some platforms the OS SQLite is used, but on others it gets shipped as part of the Python installation itself.
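For reference, a quick way to confirm this on a stock Python install (no extra packages, no server):

```python
import sqlite3

# sqlite3 ships with CPython's standard library; the SQLite engine is
# either bundled with the interpreter or linked from the OS, so this
# works on a stock install with no extra packages and no server.
print(sqlite3.sqlite_version)                             # underlying SQLite engine version
conn = sqlite3.connect(":memory:")                        # in-memory DB, no files needed
print(conn.execute("SELECT sqlite_version()").fetchone())
```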
+1. I also think it's faster that way for both environment setup and ad hoc rapid experiments. From my experience, using the library in a team doesn't scale well; it becomes pretty slow.
I really like guild.ai. The best thing is that its developers assumed people are lazy, so it automatically creates flags for global variables and tracks them.
@tomrod, thank you for the callout. By the way, we are integrating MLflow into Flyte in a way that you do not need to start the web server to view the logs. They are available locally and statically in the Flyte UI. Of course, you can also use the MLflow server.
The elephant in the room with data is that we don’t need a lot of the fancy and powerful technology. SQL against a relational database gets us extraordinarily far. Add some Python scripts where we need some imperative logic and glue code, and a sprinkle of CI/CD if we really want to professionalise the work of data scientists. I think this covers the vast majority of situations.
Despite being around it for some time, I’m not sure big data or machine learning needed to be a thing for the vast majority of businesses.
"Let’s now execute the script multiple times, one per set of parameters, and store the results in the experiments.db SQLite database... After finishing executing the experiments, we can initialize our database (experiments.db) and explore the results."
Be warned that issuing queries while DML is in process can result in SQLITE_BUSY, and the default behavior is to abort the transaction, resulting in lost data.
Setting WAL mode for greater concurrency between a writer and reader(s) can lead to corruption if the IPC structures are not visible:
"To accelerate searching the WAL, SQLite creates a WAL index in shared memory. This improves the performance of read transactions, but the use of shared memory requires that all readers must be on the same machine [and OS instance]."
If the database will not be entirely left alone during DML, then the busy handler must be addressed.
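A minimal sketch of addressing this with Python's built-in sqlite3, assuming the experiments.db file from the article's example (the timeout values here are arbitrary):

```python
import sqlite3

# timeout=30 makes the connection retry for up to 30 s instead of
# failing immediately with SQLITE_BUSY while another process writes.
conn = sqlite3.connect("experiments.db", timeout=30)

# WAL lets one writer and many readers coexist, but the shared-memory
# WAL index means all connections must be on the same machine/OS instance.
conn.execute("PRAGMA journal_mode=WAL")

# Same effect as registering a busy handler at the SQLite level (value in ms).
conn.execute("PRAGMA busy_timeout=30000")
```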
Unless your income depends on carrying out the exact demands of some money guy whose most common phrase while using a computer is "it won't let me" and who wants "big data".
Then you just suck it up and build one of the totally unnecessary big data systems that have been excreted all over the business world these days. I don't think the problem is that devs are over-engineering.
I wonder what it's called; it makes me think of the tragedy of the commons, but that's probably not quite right.
Hierarchies and bureaucracies, by Jean Tirole. I know because this was the phenomenon I wanted to study in grad school, only to find he had scooped me (on this and several other items) by several decades.
Edit: Tirole, Jean. "Hierarchies and bureaucracies: On the role of collusion in organizations." JL Econ. & Org. 2 (1986): 181.
36 years is not old in terms of research. 2,223 cites on Google Scholar and many in the past year. Seminal research often identifies the problem but not all solutions.
What gets me is how many companies paid through the nose to push their data into things like Hive and slowed down 99% of their queries to make one "run once a quarter" report run about 25% faster.
At least that was my experience a number of years back.
Maybe like 20 years ago you were right but today there's a generation that's been working for 10 years on systems built like that. They don't know any better, and in most cases nobody is around to teach them otherwise.
Yeah and even if you do need to do proper big-dataset-ML... a SQL box and maybe something like a blob storage for large artifacts (S3, Azure storage account, whatever) is all you need as well. But if your boss bought The MLOps Experience, you gotta do what the cool kids are doing!
I work in an environment where there are multiple tech teams developing models for multiple use cases on VMs and GPU clusters spread across our corporate intranet. Once you move beyond a single dev working on a model on their laptop, you absolutely need something that can handle not just metrics tracking, but making the model binaries available and providing a means to ensure reproducibility by the rest of the team. That's what MLFlow is providing for us. The API is a mess, but at least we didn't have to code up some bespoke in-house framework, we just put some engineers on task to play around with it for a few hours and figure out the nuances of basic interactions and deployed it.
Agree. Once you have a team, you need to have a service they can all interact with. This release is a first step, we want to get the user experience right for an individual and then think of how to expand that to teams. Ultimately, the two things we're the most excited about are 1) you don't need to add any extra code (and it works with all libraries, not a pre-defined set) 2) SQL as the query language
I don't get why a lot of people are calling mlflow a shitshow when it has done so much to get data scientists out of recording experiments via CSV. I can log models and parameters and use the UI to track different runs. After comparisons, I can use the registry to register different stages. If you have other model diagnostic charts, you can log those artifacts as well. I think mlflow v2 has autologging included, so why all the fuss?
People tend to forget that first movers rarely tend to also have the best design. MLFlow (and DVC) brought us out of the dark ages. Now we can build better tools, with the benefit of hindsight.
Claiming that something is "broken" or "trash" when you mean "I don't like it" is a good way to make yourself feel big and smart, but it's not actually constructive.
Okay that's coming across as a pretty snide remark aimed at me, I'll bite.
Yes, I can understand why you comment that. I don't like blind slagging of free software either.
But there are ALSO those whose day job it is, and has been for the last 2 years, to use a badly designed overcomplex horrorshow of a tool that could be replaced easily by something better ... if it wasn't for the lock-in effects and strong marketing.
So I'm venting my frustration and at the same time expressing my gratitude to the person who made something fresh, that shows us things can be better.
I can't build the replacement to MLFlow myself, but I can cheer people on who do, and let them know their efforts are sorely needed.
Could you provide context on why SQLite would replace MLflow? From the standpoint of model tracking (record and query experiments), projects (package code for reproducibility on any platform), deployment of models in multiple environments, a registry for storing and managing models, and now recipes (to simplify model creation and deployment), MLflow helps with the MLOps life cycle.
Fair point. MLflow has a lot of features to cover the end-to-end dev cycle. This SQLite tracker only covers the experiment tracking part.
We have another project to cover the orchestration/pipelines aspect: https://github.com/ploomber/ploomber and we have plans to work on the rest of features. For now, we're focusing on those two.
Basically, doing a group by at millisecond resolution with a sum on the IP packet length to get a rough metric for bandwidth.
Once you have that, you can see the milliseconds with the highest bandwidth. Some extra math can also get you to Gigabits/second in a more network engineer friendly format.
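A rough sketch of that kind of query against a SQLite table of packets. The schema here (a packets table with a float epoch timestamp ts in seconds and an ip_len column in bytes) is an assumption, not from the original comment:

```python
import sqlite3

# Hypothetical schema: one row per captured packet, with a float epoch
# timestamp (ts, seconds) and the IP total length (ip_len, bytes).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE packets (ts REAL, ip_len INTEGER)")
conn.executemany("INSERT INTO packets VALUES (?, ?)",
                 [(1.0001, 1500), (1.0002, 1500), (1.0011, 60)])

# Sum IP packet lengths per millisecond bucket; bytes/ms * 8 * 1000 / 1e9
# gives a rough Gbit/s figure for that millisecond.
rows = conn.execute("""
    SELECT CAST(ts * 1000 AS INTEGER)    AS ms,
           SUM(ip_len)                   AS bytes_per_ms,
           SUM(ip_len) * 8 * 1000 / 1e9  AS gbit_per_s
    FROM packets
    GROUP BY ms
    ORDER BY bytes_per_ms DESC
    LIMIT 10
""").fetchall()
print(rows)
```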
I did a histogram-type thing the same way by using a window function (similarly, a SQLite table scraped off pcap recordings). I can't remember if it was a fixed-width window (number of samples) or some time window.
Dropped it in datasette with datasette-vega and got a nice little plot
Yeah, MLFlow is a shitshow. The docs seem designed to confuse, the API makes Pandas look good and the internal data model is badly designed and exposed, as the article says.
But, hordes of architects and managers who almost have a clue have been conditioned to want and expect mlflow. And it's baked into Databricks too, so for most purposes you'll be stuck with it.
Props to the author for daring to challenge the status quo.
I have never seen a worse documented library. Initially I thought that they were lazy, now I realize that it cannot be documented because it is a total mess of a library held together with tape.
Docstrings are one thing, but functionality discovery, picking up from scratch, troubleshooting, etc. are... not fun, nor easy with the documentation. If you know it well already and use it a lot, it's easier to forgive its documentation faults since you can wave off the problems as "that's just learning something new".
But for a lot of people who use it infrequently its documentation is a frustrating mess. Simple problems turn into significant time sinks of trying to find which page of the documentation to look at.
A lot of issues are made worse by shit-awful interop between libraries that claim to fully support dataframes but often fail in non-obvious ways... meaning back to the documentation mines.
I'd argue that the fact that there's a market for a single author to write two books about it is indicative of documentation problems.
Fair enough. I'm highly biased and my recent book is the most popular Pandas book currently, so it is evidence that folks prefer opinionated documentation.
However, I always thought the 10 minutes to Pandas page was decent for getting started. I picked up Polars recently and thought it was more difficult than Pandas because there weren't any quick intro docs. What projects have great introductory docs for you?
Also, I am curious to learn more about the specifics of interop libraries you are referring to.
Learning a new tool is generally a challenge. I think another challenge with a lot of data tools is that non-programmers tend to be the major audience. I make my living teaching "non-programmers" how to use these tools.
That said, I always teach "go to the docstrings and stay in your environment (to not break flow) if you can." The pydata docstrings are better than most, including Python (the language).
Yeah, I think for your audience, pandas makes total sense! When I first started using it, it was through an ambitiously large project with tons of gaps in the data, untype-able text for 1% of rows, data that didn't fit in memory, etc. So my personal experience is a bit tainted by putting myself through a hell that could have been solved sooner by spending more time learning instead of bashing my keyboard with a hammer.
I've long suspected that Pandas has taken a similar stance to e-mail scammers. Where e-mail scammers inject all kinds of broken English and bad punctuation to ensure they get their targets of choice, Pandas has broken and often inaccurate documentation in order to get only the chosen ones to work with their software.
However, maybe it makes more sense that it's just a mess that's hard to document.
The Pandas documentation has improved quite a bit. Last I checked, the only part of the reference docs with a big gap was the description of "extension arrays" and accessors.
The user guide material absolutely needs work, and the examples in the reference docs tend to be a little contrived. But I absolutely have seen worse-documented libraries, such as Gunicorn and Pydantic.
I'm surprised to see Pydantic in here; I've used Pandas and Pydantic both quite a lot, and have found the Pydantic docs to be quite good! Also a much smaller library with a saner API, and thus easier to document well.
What makes the documentation so bad in your opinion? I’m not arguing but curious since I use pandas all day at my job and can’t think of any times the docs weren’t clear to me. (Plotly I have had some annoying times with!)
What bothers me the most is the egregious flexibility in the data types accepted for any argument.
If it's a string, do this. If it's a list, do that. If it's a dictionary of lists, do this other thing.
No, I want you to force me to provide my data in the right way and raise a noisy exception if I don't.
Series and DataFrame have "alternate constructors" for this purpose, and the loc/iloc accessors give you a bit more control.
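To illustrate both the complaint and the alternate constructors mentioned, a small hypothetical example:

```python
import pandas as pd

# The same constructor happily accepts very different input shapes...
pd.DataFrame({"a": [1, 2], "b": [3, 4]})            # dict of lists
pd.DataFrame([{"a": 1, "b": 3}, {"a": 2, "b": 4}])  # list of dicts
pd.DataFrame([[1, 3], [2, 4]], columns=["a", "b"])  # list of lists + column names

# ...while the "alternate constructors" make the expected shape explicit.
pd.DataFrame.from_records([(1, 3), (2, 4)], columns=["a", "b"])
pd.DataFrame.from_dict({"a": [1, 2], "b": [3, 4]}, orient="columns")
```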
I agree that the magic type auto-detection is a bit too magical and sloppy, but you have to realize that data analysts and scientists have historically been incredibly sloppy programmers who wanted as much magic as possible. It's only in recent years that researchers have begun to value some amount of discipline in their research code.
Every time I open up pandas I jealously remember the expressive beauty of R for these tasks. But because we're all "serious" of course we must use Python for production lest we not be serious.
R is a trash of a language. It doesn't have any sense of coherency to it at all.
They keep trying to fix the underlying problems by duct-taping paradigms onto it over and over (S3, S4, R6, etc.). There's never a clear sense of the best way to do anything, but plenty of options to do a thing in a very hacky 'script-kiddy' way.
Looking out at the community of different projects it becomes clear that everyone is pretty lost as to what design principles should be used for certain tasks, so every repo has its own way of doing things (I know personal style occurs in other languages, but commonalities are much less recognizable in R projects).
It's tragic that such a large community uses it.
Trash language is a bit harsh. I'm not sure I would try to put an R project into production or build a huge project with it but, at the very least, R/R Studio was the best scientific calculator I've ever used. Was particularly great during college
Yep, this is a mark of someone that's never used R but has heard a lot of incredibly ill informed criticism around it.
One look at dplyr code next to pandas would of course disabuse anyone of the notion that R is trash, and the tragedy is that Python, in its current state, will never have anything like that. That's the advantage of the language being influenced by Lisp vs not.
I agree that it is a trash language and that, aside from the fact that many frontier academic ideas are available in it and some plotting preferences are solidly prescriptive, it should be thrown into the trash bin.
Python, Julia when it gets its druthers for TTFP, Octave, Fortran, C, and eventually Rust. These are the tools I've found in use over and over and over again across business, government, and non-profits.
Everywhere R is used by an org, I have seen major gaps in capacity to deliver, specifically because R doesn't scale well.
I'm not emotionally invested in tools so am happy to identify the user experience and operational experience as "trash."
"Trash", despite its connotations of lacking value, is really just a chaotic disorganized mess of something made by artifice with dubious reclaim/reuse/recycle value. Being a subjective assessment, it is natural that one person's trash is a treasure to another.
I take issue with your implication that I'm emotionally invested in something when I shouldn't be. You are free to dislike R and not use it, but to claim that it's "trash" is to wrongly disavow its usefulness for the many people that do find it useful, and to cast aspersions on the judgement of all those people.
Hey, I apologize here; my point on emotional investment was that I, personally, am not emotionally invested in it, and I did not mean to cast aspersions at you for your defense of the language nor at people who have preferences for it. Specifically, I meant that I'm comfortable enough in my understanding of the language to classify it and its standard library as better off in the garbage bin relative to the alternatives available.
It's fine that people like it. What's good about it isn't unique, and what's unique about it isn't that great. And there are certainly switching costs for some orgs to consider.
It's forced upon many of them that are in finance, banking, insurance, ...
Mainly because those tend to run on Microsoft Azure, which has no decent analytics offering, and are pushing Databricks extremely hard. The CTO or whatever just pushes databricks. On paper it checks all the boxes. Mlops, notebooks, experiment management. It just does all of those things very badly, but the exec doesn't care. They only care about the microsoft credits.
Just to avoid using Jupyter, so the compliance teams stay happy as well, because Microsoft salespeople scared them away from open source.
We pushed back on it very, very, very hard, and finally convinced "IT" to not turn off our big Linux server running JupyterHub. We actually ended up using Databricks (PySpark, Delta Lake, hosted MLFlow) quite a bit for various purposes, and were happy to have it available.
But the thought of forcing us into it as our only computing platform was a spine-chilling nightmare. Something that only a person who has no idea what data analysts and data scientists actually do all day would decide to do.
What would you go with instead for collaborative notebooks?
I ask because normally I tend pretty strongly towards the "NO just let the DSes/analysts work how they want to", which in this case would be running Jupyter locally. However DBr's notebooks seem genuinely useful.
Is your issue "but I don't need Spark" or "i wanna code in a python project, not a notebook?", or something else?
Imo if DBr cut their wedding to Spark and provided a Python-only nb environment they'd have a killer offering on their hands.
> What would you go with instead for collaborative notebooks?
Production workloads should be code. In source control. Like everybody else.
Notebooks inevitably degrade into confusing, messy blocks of “maybe applicable, maybe not” text, old results and plots embedded in the file because nobody stripped them before committing, and comments like “don’t run cells below here”.
They’re acceptable only as a prototyping and exploration tool. Unfortunately, a whole “generation” of data scientists and engineers has been trained to basically only use notebooks.
It's ubiquitous. I've consulted for a 100 person company that built a data product on top of some IoT data. Everything was in databricks, literally everything. (Not endorsing that, just an observation)
Talking to a 2000+ person org now that is standardizing data science across the org using... you guessed it
Pretty interesting. I think this is part of this trend of releasing half-baked products: some of the stuff in there is really cool, just enough to get you in, but it doesn't scale and is usually complex to deploy/use.
I think this is a neat solution for an engineer who is working on their own and wants to go back and look at the data from various experiments.
I don't see this scaling to many engineers working in a team, who would want to see each others experiment data, or even store artifacts like checkpoints and such. And lastly, in many cases ACLs are required as well when certain models trained with sensitive data shouldn't be shared with engineers outside of a team/group.
SQLite is literally a backend for MLflow, so the argument being made really is that you should just use SQL when you can, which is kind of adjacent to any criticisms of MLflow.
Is querying the underlying SQL database officially supported in MLflow? Last time I used it, it wasn't documented. I took a look at the database and it wasn't end-user friendly.
As someone replied above, it's because SQL is just 1 backend and it's weird to expose an API that only works on 1 backend. Once you have many devs working together, you need a remote server. If you have a remote abstracted backend, it needs to have a unified API surface so the same client can talk to any backend. You might argue "This interface should be SQL", and to that I would say there are many file stores (like your local file system) that are not easy to control with SQL.
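For context, pointing MLflow at a SQLite backend and then querying the file directly looks roughly like the sketch below. Note that the table layout (e.g. a `metrics` table) is MLflow's internal schema, not a documented public API, so treat the direct query as an assumption that may break across versions:

```python
import mlflow
import sqlite3

# Use SQLite as the tracking backend (no server needed for a single user).
mlflow.set_tracking_uri("sqlite:///mlflow.db")

with mlflow.start_run():
    mlflow.log_param("lr", 1e-3)
    mlflow.log_metric("val_loss", 0.42)

# The same file can be queried directly, but the schema (tables such as
# "metrics" and "params") is internal and may change between versions.
conn = sqlite3.connect("mlflow.db")
print(conn.execute("SELECT key, value FROM metrics LIMIT 5").fetchall())
```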
Not convinced by the example. I don’t see why you can’t use standard scikit-learn for it.
First, the example doesn’t take advantage of sklearn’s built in, super simple parallelization via n_jobs
Then, the entire example could be better wrapped with sklearn’s own cross_validate() which gives you the same functionality: a table of results across experiments.
If you use a different estimator, you can easily concatenate the results into a single df
The rest is the same.
Why do you need SQLite for this? (SQLite is great, of course, for the right use cases.)
And if you're doing many orders more experiments (1000s instead of 10s) then that’s probably where MLflow is good (haven’t actually used MLflow)
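A sketch of the cross_validate approach suggested above, assuming a RandomForest classifier on a toy dataset; the parameter values are purely illustrative:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = load_breast_cancer(return_X_y=True)

results = []
for n in (10, 50, 100):  # illustrative "set of parameters"
    cv = cross_validate(RandomForestClassifier(n_estimators=n),
                        X, y, cv=5, n_jobs=-1,        # built-in parallelization
                        scoring=("accuracy", "f1"))
    results.append(pd.DataFrame(cv).assign(n_estimators=n))

# One table of results across experiments, no tracking server required.
table = pd.concat(results, ignore_index=True)
print(table.groupby("n_estimators")[["test_accuracy", "test_f1"]].mean())
```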
DVC also fills the "lightweight tracking" niche, although it relies on automatically creating Git branches as its technique for tracking experiments. I personally find that distasteful, so I don't use it specifically for experiment tracking, but the feature is there.
It doesn't require creating a branch when you iterate; it requires creating a branch or commit if you want to share it with the team and see it on GitHub or in Studio. But even those lightweight iterations (https://dvc.org/doc/command-reference/exp/run) could be shared as well via a Git server; they just won't be visible via the UI in GH/Studio at the moment.
Right, but experiments aren't always linear. Do you really want to make a new commit for every iteration of a hyperparameter search? What if you are using a black-box optimizer that supports parallel/concurrent updates?
I don't want to use Git to track all that. I want to use Git to store the final results of running such an experiment in the same commit as the code that implemented it. I just don't like the DVC experiment workflow, but I am more than happy to use DVC for storing the fitted model(s) at the end of the run.
If you use `dvc exp run` you don't need to commit anything, as I mentioned above. You can run multiple experiments in parallel, etc. A commit happens only if/when you want to select the best result and share it with the team. But even that is optional.
I highly recommend ClearML for effortless experiment tracking that just works. It does a lot more of MLOps besides experiment tracking, but I haven’t used those functionalities.
I had researched and spent time with several other tools including DVC, GuildAI and MLFlow but finally settled on ClearML. WandB pricing is too aggressive for my liking (they force an annual subscription of $600 last I checked)
There are a lot of tools in this space. Shameless plug to follow.
I helped build and use Disdat, which is a simple data versioning tool. It notably doesn't have the metadata capture libraries MLFlow has for different model libs, but it's meant to be a lower layer on which that can be built. Thus you won't see particulars about tracking "models" or "experiments", because models/experiments/features/intermediates are all just data thingies (or bundles in Disdat parlance). For the last 2+ years we've used Disdat to track runs and outputs of a custom distributed planning tool, and used Disdat-Luigi (an integration of Disdat with Luigi to automatically consume/produce versioned data) to manage model training and prediction pipelines (some with 10ks of artifacts). https://disdat.gitbook.io/disdat-documentation
I mean, come on, SQLite doesn't even support concurrency. Are people seriously considering using it in a production scenario?
If you work in a DS team where you're the only DS, then it probably suits your needs. Otherwise I can't imagine how you could achieve anything production grade
Being able to use SQL for later analysis is definitely a good idea. For smaller models SQLite for sure is enough but as soon as you want to scale your HPO across multiple servers or even just processes, you will need something that supports a multi-user database. E.g. Optuna supports PostgreSQL and also defaults to SQLite as far as I know.
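A minimal Optuna sketch of that setup. The study name and SQLite file are arbitrary; swapping the storage URL for a PostgreSQL one (e.g. postgresql://user:pass@host/db) is what you'd do for HPO spread across servers or processes:

```python
import optuna

def objective(trial):
    x = trial.suggest_float("x", -10, 10)
    return (x - 2) ** 2

# SQLite works for a single machine; for distributed HPO, point the
# storage URL at a multi-user database such as PostgreSQL instead.
study = optuna.create_study(
    study_name="demo",
    storage="sqlite:///optuna.db",
    load_if_exists=True,
)
study.optimize(objective, n_trials=20)
```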
As noted in an earlier comment, I think there is a false equivalence between end-to-end MLOps platforms like MLflow and tools for experiment tracking. The project looks like a solid tracking solution for individual data scientists, but it is not designed for collaboration among teams or organizations.
> There were a few things I didn’t like: it seemed too much to have to start a web server to look at my experiments, and I found the query feature extremely limiting (if my experiments are stored in a SQL table, why not allow me to query them with SQL).
While a relational database (like SQLite) can store hyperparameters and metrics, it cannot scale to the many aspects of experiment tracking for a team/organization, from visual inspection of model performance results to sharing models to lineage tracking from experimentation to production. As noted in the article, you need a GUI on top of a SQL database to make model experimentation meaningful. The MLflow web service allows you to scale across your teams/organizations with interactive visualizations, built-in search & ranking, shareable snapshots, etc. You can run it against a variety of production-grade relational DBs, so users can query the data directly through the SQL database, or through a UI that makes it easier to search for those not interested in using SQL.
> I also found comparing the experiments limited. I rarely have a project where a single (or a couple of) metric(s) is enough to evaluate a model. It’s mostly a combination of metrics and evaluation plots that I need to look at to assess a model. Furthermore, the numbers/plots themselves have no value in isolation; I need to benchmark them against a base model, and doing model comparisons at this level was pretty slow from the GUI.
The MLflow UI allows you to compare thousands of models from the same page in tabular or graphical format. It renders the performance-related artifacts associated with a model, including feature importance graphs, ROC & precision-recall curves, and any additional information that can be expressed in image, CSV, HTML, or PDF format.
> If you look at the script’s source code, you’ll see that there are no extra imports or calls to log the experiments, it’s a vanilla Python script.
MLflow already provides low-code solutions for MLOps, including autologging. After running a single line of code - mlflow.autolog() - every model you train across the most prominent ML frameworks, including but not limited to scikit-learn, XGBoost, TensorFlow & Keras, PySpark, LightGBM, and statsmodels is automatically tracked with MLflow, including all relevant hyperparameters, performance metrics, model files, software dependencies, etc. All of this information is made immediately available in the MLflow UI.
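For reference, a minimal sketch of that autologging flow with scikit-learn; the dataset and model here are arbitrary examples:

```python
import mlflow
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge

mlflow.autolog()  # single line: params, metrics, and the model get logged

X, y = load_diabetes(return_X_y=True)
with mlflow.start_run():
    Ridge(alpha=0.5).fit(X, y)  # tracked automatically, no extra logging calls
```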
Addendum:
As noted, there is a false equivalence between an end-to-end MLOps lifecycle platform like MLflow and tools for experiment tracking. To succeed with end-to-end MLOps, teams/organizations also need projects to package code for reproducibility on any platform across many different package versions, deploy models in multiple environments, and a registry to store and manage these models - all of which is provided by MLflow.
It is battle-tested with hundreds of developers and thousands of organizations using widely-adopted open source standards. I encourage you to chime in on the MLflow GitHub on any issues and PRs, too!
+1. I'd also like to note that it's very easy to get started with MLflow; our quickstart walks you through the process of installing the library, logging runs, and viewing the UI: https://mlflow.org/docs/latest/quickstart.html.
We'd love to work with the author to make MLflow Tracking an even better experiment tracking tool and immediately benefit thousands of organizations and users on the platform. MLflow is the largest open source MLOps platform with over 500 external contributors actively developing the project and a maintainer group dedicated to making sure your contributions & improvements are merged quickly.