Adding backends for TensorRT, ONNX, JAX, etc. is on our TODO list (and we'd love to see PRs to add support for these and others)!
We actually do use TensorRT with several of our models, but our approach is generally to do all TRT-related processing before the Neuropod export step. For example, we might do something like:
TF model -> TF-TRT optimization -> Neuropod export
or
PyTorch model
-> (convert subset of model to a TRT engine)
-> PyTorch model + custom op to run TRT engine
-> TorchScript model + custom op to run TRT engine
-> Neuropod export
Since Neuropod wraps the underlying model (including custom ops), this approach works well for us.
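To make the TF path a bit more concrete, here's a rough sketch of what that flow can look like in Python (TF 1.x-era APIs). The model, tensor names, and specs are made up, and the exact TF-TRT and Neuropod packager arguments may differ slightly from what's shown:

```python
# Rough sketch: TF-TRT optimization followed by Neuropod export.
# Paths, tensor names, and specs below are hypothetical.
from tensorflow.python.compiler.tensorrt import trt_convert as trt
from neuropod.packagers import create_tensorflow_neuropod

# 1. Run TF-TRT on an existing SavedModel; convert() returns an optimized GraphDef
converter = trt.TrtGraphConverter(
    input_saved_model_dir="detector_saved_model",
    precision_mode="FP16",
)
optimized_graph_def = converter.convert()

# 2. Package the optimized graph as a Neuropod, mapping logical names to graph tensors
create_tensorflow_neuropod(
    neuropod_path="detector_neuropod",
    model_name="2d_object_detector",
    graph_def=optimized_graph_def,
    node_name_mapping={
        "image": "preprocess/image:0",
        "boxes": "postprocess/boxes:0",
    },
    input_spec=[{"name": "image", "dtype": "uint8", "shape": (None, None, 3)}],
    output_spec=[{"name": "boxes", "dtype": "float32", "shape": (None, 4)}],
)
```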
I wrote our internal lightweight version of Neuropod at another SDC startup where we did use TensorRT. Our ML researchers worked in PyTorch, and more often than not the PyTorch -> ONNX -> TensorRT conversion did not work. We ended up needing to replicate the network architecture using the TensorRT library and manually convert the weights from PyTorch. Then we'd use TensorRT serialization to compile the models so they could be run in C++. I imagine they may have tried this in Neuropod and run into the same conversion problems. TensorRT was a big investment to get running smoothly, but it did shave 20% or so off our inference latency.
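For anyone curious what "replicate the architecture and manually convert the weights" looks like, here's a rough sketch with the TensorRT 7-era Python API. The two-layer MLP and its weight names are made up for illustration; a real network means a lot more of this, which is part of why it was such a big investment:

```python
# Rough sketch (TensorRT 7-era Python API): rebuild a hypothetical two-layer MLP
# layer by layer, copy the PyTorch weights in, and serialize the engine for C++ use.
import tensorrt as trt
import torch

logger = trt.Logger(trt.Logger.WARNING)
state_dict = torch.load("mlp_weights.pt")  # e.g. {"fc1.weight": ..., "fc1.bias": ..., ...}

builder = trt.Builder(logger)
network = builder.create_network()  # implicit-batch network
x = network.add_input("input", trt.float32, (128, 1, 1))

# Replicate each layer and hand it the corresponding PyTorch weights as numpy arrays
fc1 = network.add_fully_connected(x, 64,
                                  state_dict["fc1.weight"].numpy(),
                                  state_dict["fc1.bias"].numpy())
relu = network.add_activation(fc1.get_output(0), trt.ActivationType.RELU)
fc2 = network.add_fully_connected(relu.get_output(0), 10,
                                  state_dict["fc2.weight"].numpy(),
                                  state_dict["fc2.bias"].numpy())
network.mark_output(fc2.get_output(0))

builder.max_batch_size = 1
builder.max_workspace_size = 1 << 28
engine = builder.build_cuda_engine(network)

# The serialized engine can then be deserialized and executed from the C++ runtime
with open("mlp.engine", "wb") as f:
    f.write(engine.serialize())
```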
It's gotten better in TensorRT 7. I'm using it quite successfully. It does have a lot of corner cases, though, that much is true, and the documentation is really poor, which, coupled with it being mostly closed source, severely limits adoption.
That said, I'm getting ridiculously good performance with it, even without using the Tensor Cores.
Neuropod was created at Uber ATG (not AI Labs) and powers hundreds of models across the company (ATG and the core business). It's been used in production for over a year and we're continuing to actively work on it.
The blog post I linked above goes into more detail, but here's a relevant quote about usage within Uber:
> Neuropod has been instrumental in quickly deploying new models at Uber since its internal release in early 2019. Over the last year, we have deployed hundreds of Neuropod models across Uber ATG, Uber AI, and the core Uber business. These include models for demand forecasting, estimated time of arrival (ETA) prediction for rides, menu transcription for Uber Eats, and object detection models for self-driving vehicles.
This is a good question. I want to write a more detailed post about this in the future, but here are a few points for now:
- Neuropod is an abstraction layer, so it can do useful things on top of just running models locally. For example, we can transparently proxy model execution to remote machines. This can be super useful for running large-scale jobs with compute-intensive models. Including GPUs in all our cluster machines doesn't make sense from a resource-efficiency perspective, so instead, if we proxy model execution to a smaller cluster of GPU-enabled servers, we can get higher GPU utilization while using fewer GPUs. The "Model serving" section of the blog post ([1]) goes into more detail on this. We can also do interesting things with model isolation (see the "Out-of-process execution" section of the post).
- ONNX converts models while Neuropod wraps them. We use TensorFlow, TorchScript, etc. under the hood to run a model. This is important because we have several models that use custom ops, TensorRT, etc. We can use the same custom ops that we use at training time during inference. One of the goals of Neuropod is to make experimentation, deployment, and iteration easier so not having to do additional "conversion" work is useful.
- When we started building Neuropod, ONNX could only do trace-based conversions of PyTorch models. We've generally had lots of trouble with correctness of trace-based conversions for non-trivial models (even with TorchScript). Removing intermediate conversion steps (and their corresponding verification steps) can save a lot of time and make the experimentation process more efficient.
- Being able to define a "problem" interface was important to us (e.g. "this is the interface of a model that does 2d object detection"). This lets us have multiple implementations that we can easily swap out because we concretely defined an interface. This capability is useful for comparing models across frameworks without doing a lot of work (see the sketch after this list). The blog post ([1]) talks about this in more detail.
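To make the "problem interface" point concrete, here's a rough sketch using Neuropod's Python packaging/loading APIs. The 2D-detection spec, dummy model, and paths are made up, and the exact packager arguments may differ from what's shown:

```python
# Rough sketch: implementations of the same "2d object detection" problem
# interface can be packaged from different frameworks but called identically.
import numpy as np
import torch
from neuropod.packagers import create_torchscript_neuropod
from neuropod.loader import load_neuropod

# The problem interface: the same input/output spec regardless of framework
INPUT_SPEC = [{"name": "image", "dtype": "uint8", "shape": ("height", "width", 3)}]
OUTPUT_SPEC = [{"name": "boxes", "dtype": "float32", "shape": ("num_boxes", 4)}]


class _DummyDetector(torch.nn.Module):
    """Stand-in for a real detector: returns one full-image box."""

    def forward(self, image):
        h, w = image.shape[0], image.shape[1]
        return {"boxes": torch.tensor([[0.0, 0.0, float(w), float(h)]])}


# Package a TorchScript implementation against the interface
create_torchscript_neuropod(
    neuropod_path="detector_torchscript",
    model_name="2d_object_detector",
    module=torch.jit.script(_DummyDetector()),
    input_spec=INPUT_SPEC,
    output_spec=OUTPUT_SPEC,
)

# A TensorFlow implementation would be packaged with create_tensorflow_neuropod
# against the exact same specs. Callers don't care which one they get:
some_image = np.zeros((480, 640, 3), dtype=np.uint8)
with load_neuropod("detector_torchscript") as model:
    boxes = model.infer({"image": some_image})["boxes"]
```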
The blog post ([1]) goes into a lot more detail about our motivations and use cases so it's worth a read.
Found it interesting that most of the commits are from one contributor (OP). Are you the most active contributor, or was this an artifact of open-sourcing it? Just wondering: if you get hit by a bus tomorrow, what would we do? :)
There's also a blog post that has more detail: https://eng.uber.com/introducing-neuropod/
Super excited to open-source it!