As someone who uses tensorflow a lot, I predict an enormous clusterfuck of a transition. Tensorflow has turned into a multiheaded monster, supporting many things and approaches but none of them very well.
I mean, when relying on third party code, things like `tf.enable_control_flow_v2()` and `tf.disable_control_flow_v2()` can and will go horribly wrong. Some operations change behaviour depending on whether a global flag is set. And not just any operations, but control flow operations! That will lead to bugs that are very hard to track down.
In my opinion there are architectural problems with TF that have not been addressed in this update. There is still global state in TF2. There is still a difference in behaviour between eager and non-eager mode. Control flow is still a second-class citizen.
If you need to transition from TF1 to TF2, consider doing the TF1 to PyTorch transition instead.
Not only is upgrading hard, but so is installation (on Windows at least). Each TensorFlow version needs a specific Python version, a specific CUDA version, a specific tensorflow-gpu version, and many other easy-to-get-wrong pieces. The problem is not the requirements themselves, but that it's very hard to know which versions are compatible. There are endless threads on GitHub of people failing to use TensorFlow after spending days trying to install it.
Try using these containers[1] from my peer team at IBM. They run on a variety of architectures (x86, ppc64le, with and without GPUs in most cases).
In addition if you don't want to fiddle with the containers, there's also a conda channel[2] that lines stuff up. I work on a peer team in machine vision, and use these for personal and professional projects.
Tensorflow sounds like an ideal candidate for running in a container. List out all your approved compatible versions in the Dockerfile and distribute it with your source code and anyone can reproduce your results with the exact same setup.
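As a sketch of that idea (the versions and file names here are illustrative, not a blessed combination), a Dockerfile pinning everything might look like:

```dockerfile
# Base image pins TF + CUDA + cuDNN + Python together; pick the tag
# you actually validated (this one is just an example).
FROM tensorflow/tensorflow:1.15.0-gpu-py3

WORKDIR /app

# Freeze everything else too, so the environment is reproducible.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
CMD ["python", "train.py"]
```

The official `tensorflow/tensorflow` images already bundle a known-good CUDA/cuDNN pairing, which sidesteps most of the version-matching misery.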
Yes, but for personal use, unless someone else has already made those containers, I still have to go through the trial-and-error process of finding the right combination of versions myself. Sure, the second installation will be easier, but if I just want it on my PC it doesn't really help.
You can `docker pull username/mytensorflowcontainer` and start from someone else's work. Looks like Tensorflow has a how-to on the site: https://www.tensorflow.org/install/docker including working cpu-only and gpu-enabled examples.
Anaconda may work well as a virtual environment for some ml projects, but it is by no means a solution for getting a gpu-working installation of tensorflow on Windows.
As a software engineer and non-data scientist, I hate Anaconda because it feels like it's a tool that tries to be the be-all, end-all package management tool for everyone in the data science field, yet it feels like a sloppily built, bloated whale. It's even managed to overwrite PATH on some of my Linux machines, which is where I drew the line.
I vastly prefer creating hermetic environments with either venv or Docker. They're much cleaner and easier to work with. I wish data scientists would adopt these tools instead.
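For anyone unfamiliar, the venv workflow is only a few commands (names and paths here are illustrative):

```shell
# Per-project environment with venv; everything lives under ./.venv
python3 -m venv .venv
. .venv/bin/activate              # subsequent python/pip use the venv

python -c 'import sys; print(sys.prefix)'   # confirms we're inside .venv

# After installing your pinned deps (pip install -r requirements.txt),
# record the exact resolved versions so others can reproduce the env:
pip freeze > requirements.lock
```

Committing the lock file alongside the code is what makes the environment reproducible for the next person.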
Sadly, many of the ML models I investigate on Github don't even have their package requirements frozen. It's an uphill battle...
> I vastly prefer creating hermetic environments with either venv or Docker. They're much cleaner and easier to work with. I wish data scientists would adopt these tools instead.
I suspect you have a lot of time on your hands. But for me the 'batteries included' approach really nails it: why repeat the headache over and over when a single entity can take care of it in such a way that incompatibilities are almost impossible to create? The hardest time I've had was re-creating an environment that ran some Python code from a while ago; with Anaconda it was super easy.
I'm sure it has its limitations and just like every other tool there are situations where it is best to avoid it but for now it suits me very well.
Have you ever used any of those features of tensorflow though? They're all, er, idiosyncratic. If you're a decent software engineer and are following the maths from a book, I couldn't recommend JAX highly enough. (I work on big tf RL projects every day.)
That’s true. I’m still a student and haven’t really shipped a large ML project. Most of my ETL is done using scrapers, manual Pandas transformations, and storage in flat files. But yeah, I see your point, and frankly I don’t think there’s too big of a user base around these specialized features. Not to mention that some can be, erm, difficult to use. A friend tried using the YouTube-8M dataset in TFRecords format for a project and was extremely annoyed at the complexity.
> At first glance, it seems to be a GPU/TPU based NumPy?
Yes, with a compiler to make this fast.
> The thing is, TF has more than tensor ops. It has pre-defined NN layers, data loading/serialization, distributed training, metrics, and model serving.
Yes, it is a simpler and smaller API.
For things like data loading, you can use the tool of your choice -- TF, pytorch, whatever. For pre-defined NN layers, there are libraries that build this as a very thin wrapper around JAX's low-level API; see e.g. stax, which is included in JAX.
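The "NumPy plus a compiler" point fits on one screen. A minimal sketch, assuming `jax` is installed, showing the composable `grad` and `jit` transforms on plain NumPy-style code:

```python
import jax
import jax.numpy as jnp

# Ordinary NumPy-style code...
def loss(w, x, y):
    pred = jnp.dot(x, w)
    return jnp.mean((pred - y) ** 2)

# ...made differentiable and XLA-compiled by composing transforms.
grad_loss = jax.jit(jax.grad(loss))

w = jnp.zeros(3)
x = jnp.ones((8, 3))
y = jnp.ones(8)
print(grad_loss(w, x, y))   # gradient of the MSE w.r.t. w
```

There's no graph-building API and no session: `jax.grad` and `jax.jit` are just function transformations you stack on regular Python functions.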
I know. I don't want to go that far, but if I had to choose, I would also go for JAX and help make it feature-complete. It isn't there yet, though, and thus probably not as useful for everyone yet.
Agreed. Tensorflow kinda reminds me of OpenGL in that its dependence on global flags causes some really annoying bugs, especially when you're using third-party libraries or pre-trained models. `enable_eager_execution`, `enable_tensor_equality`, and `enable_v2_tensorshape` have all completely broken my code at one point or another.
Is it production ready for serving PyTorch models? How about if I wanted to use something like Go to serve those models? That's fairly straightforward with Keras (Python) trained TF models.
On our team, we serve PyTorch models in production using libtorch, the C++ library for loading models. You can easily call the C++ code from Go if you wrap it in a C interface.
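The Python side of that pipeline is just TorchScript export. A minimal sketch (the model and file name are placeholders) of producing an artifact that libtorch's `torch::jit::load` can read:

```python
import torch
import torch.nn as nn

# Placeholder model; substitute your trained network.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()

# Trace with a representative input, then serialize for libtorch.
example = torch.randn(1, 4)
traced = torch.jit.trace(model, example)
traced.save("model.pt")   # load in C++ with torch::jit::load("model.pt")
```

The saved `model.pt` carries both weights and the traced graph, so the C++ (and hence Go-wrapped) side needs no Python at all.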
Last I checked there was basically zero serving story for PyTorch. The trade-off seems all too common: TensorFlow optimizes for enormous production applications first, while PyTorch optimizes for developer ease first.
If you are trying to apply one of these libraries to a production system that doesn't get a lot of throughput, you probably shouldn't be using them (try a linear model first). If you have a high-throughput application you probably want tensorflow, and just deal with the shittiness.
If you have any existing ML/DL experience, picking up PyTorch is a breeze. You could get a pretty solid understanding of the framework with an MNIST handwritten digit recognition model over the course of an afternoon, so don’t sweat looking for the “right” tutorial.
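To back that up: a whole supervised training step fits on one screen. A sketch with random tensors standing in for MNIST batches (shapes match flattened 28x28 digits; everything else is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny classifier; 784 inputs stand in for flattened 28x28 MNIST images.
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(32, 784)          # fake batch of images
y = torch.randint(0, 10, (32,))   # fake labels

for step in range(100):
    opt.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()               # eager autograd: just call it
    opt.step()

print(loss.item())                # overfits the fixed batch, so it drops fast
```

Swap the fake tensors for a `DataLoader` over `torchvision.datasets.MNIST` and you have the afternoon tutorial.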
Highly recommend it! I love pytorch so much, it's basically numpy with automatic backprop and CUDA support. It evaluates eagerly by default, which makes debugging a lot easier since you can just print your tensors, and IMO it's much simpler to jump between high-level and low-level details in pytorch than in tensorflow+keras. Just as one example, activation functions in pytorch are applied by calling a python function on your layer, instead of passing a string argument with the function name to the layer constructor, so you write
layer = F.relu(nn.Linear(128, 64)(input))
instead of
layer = Dense(64, activation='relu')(input)
As a result, it's a lot more straightforward to try out custom activation functions in PyTorch.
The problem is that tensorflow is an umbrella name for a bunch of related technologies: a matrix calculation engine, a graph definition language, a distributed graph calculation engine, ML algorithm libraries, ML training libraries. On top of that it's extremely poorly documented. At the end of the day, anything beyond the most trivial usage turns out to be mutually incompatible (this operation is not implemented for TPUs or GPUs, this API doesn't work with that API), and most development is cut-and-paste trial and error. Then you go to read the source, but creative Python renaming and importing leads you on a multi-hour wild goose chase.
If you switch to PyTorch, what are you going to use for prod deployment? Is there any way to use TPUs?
> This is also the last major release of multi-backend Keras. Going forward, we recommend that users consider switching their Keras code to tf.keras in TensorFlow 2.0
Cannot but agree. On top of that, the 1.14 documentation was simply deleted from the tensorflow website, and now we are scratching our heads over what to do when it comes to model maintenance.
We serve our models via the TF Java API since our system is Java/Scala based. We can't even update the existing TF Java API because it is incompatible with anything prior to 1.15. It's an utter mess.
There's absolutely no evidence that MXNet is faster than TF. At the high end, all three (TF/PyTorch/MXNet) are similarly performant. The reality is that implementation matters more than framework when you are talking about performance.
In terms of features and functionality, they are very much interchangeable. A new project could be written in any of the three major frameworks and be equally good. The only standout feature I'm aware of is that TF has the best support for doing inference on devices, but that won't be true forever. In terms of actually migrating a codebase from one to the other, the APIs are different enough in small ways that it would be a large amount of effort.