Nice work! I'm working on a similar standalone DevOps AI Agent (OpsTower.ai). This post shows how the agent is structured and how it performs against a 40-question evaluation dataset: https://www.opstower.ai/2023-evaluating-ai-agents/
While I used tiktoken to trim the message history (and stay below the token limit), I generally found that I didn't get better completions by putting a lot of data into the context. Usually the completions just got more confusing. I put a limited amount of info into the context and have generally stayed below the token limit.
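For the curious, the trimming only takes a few lines with tiktoken. A minimal sketch, assuming an OpenAI-style list of `{"role", "content"}` messages (the model name and token budget are placeholders):

```python
# Minimal sketch: drop the oldest messages until the history fits a token budget.
# Model name, budget, and message format are assumptions for illustration.
import tiktoken

def trim_history(messages, model="gpt-4", max_tokens=3000):
    enc = tiktoken.encoding_for_model(model)

    def count(msgs):
        # Rough count: encode each message's content (ignores per-message overhead).
        return sum(len(enc.encode(m["content"])) for m in msgs)

    trimmed = list(messages)
    while count(trimmed) > max_tokens and len(trimmed) > 1:
        # Keep a leading system message if there is one; drop the oldest chat turn.
        trimmed.pop(1 if trimmed[0].get("role") == "system" else 0)
    return trimmed
```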
> Are you storing message/ chat histories between sessions
Right now, yes. It's pretty important to store everything (each request/response) to debug issues with the prompt, the context, and the agent call loop.
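Even something as simple as appending every exchange to a JSONL file goes a long way. A rough sketch (the file path and record fields are just placeholders):

```python
# Minimal sketch: append every agent request/response pair to a JSONL log.
import json, time

def log_exchange(request, response, path="agent_log.jsonl"):
    record = {
        "ts": time.time(),      # when the call happened
        "request": request,     # prompt/messages sent to the model
        "response": response,   # completion returned by the model
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```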
This certainly looks like a cleaner way to deploy an ML model than SageMaker. Couple of questions:
* Is this really for more intensive model inference applications that need a cluster? It feels like for a lot of my models, a cluster is overkill.
* A lot of the ML deployment tools (Cortex, SageMaker, etc.) don't seem to rely on first pushing changes to version control and then deploying from there. Is there any reason for this? I can't come up with a reason why this shouldn't be the default. For example, this is how Heroku works for web apps (and this is a web app at the end of the day).
You're 100% right that Cortex is designed for the production use-case. A lot of our users are running Cortex for "small" production use cases, since the Cortex cluster can include just a single EC2 instance for model serving (autoscaling allows deployed APIs to scale down to 1 replica). For ML use-cases that don't need an API (a lot of data analysis work, for example), Cortex is probably overkill.
As for your second question, we definitely want to integrate tightly with version control systems. Since right now we are 100% open source and don't offer a managed service, we don't have a place to run the webhook listeners. That said, most of our users version control their code/configuration (we do that with our examples as well: https://github.com/cortexlabs/cortex/examples), and it should be straightforward to integrate Cortex into an existing CI/CD workflow; the Cortex CLI just needs to be installed, and then running `cortex deploy` with the updated code/configuration will trigger a rolling update.
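For example, the deploy step in CI can be as small as a wrapper around the CLI. A sketch, assuming the Cortex CLI is already installed and configured on the runner (the Python wrapper is just for illustration; a plain shell step works the same way):

```python
# Sketch of a CI deploy step: run "cortex deploy" after tests pass.
# Assumes the Cortex CLI is installed and configured on the CI runner.
import subprocess

def deploy():
    # Triggers a rolling update of the deployed API with the checked-out code/config.
    subprocess.run(["cortex", "deploy"], check=True)

if __name__ == "__main__":
    deploy()
```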
If you're referring to version control for the actual model files, Cortex is unopinionated about where those are hosted, so long as they can be accessed by your Predictor (what we call the Python file that initializes your model and serves predictions). If you're interested in implementing version control for your models, I'd recommend checking out DVC.
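For reference, a Predictor is just a Python file that loads the model once and then answers requests. A minimal sketch - the class/method names, config keys, and pickle-based loading here are illustrative, so check the Cortex docs for the exact interface your version expects:

```python
# predictor.py - illustrative sketch of a Predictor. The model can live anywhere
# (S3, a DVC remote, local disk) as long as this file can load it.
import pickle

class PythonPredictor:
    def __init__(self, config):
        # e.g. download the model from S3/DVC here; a local file keeps the sketch simple.
        with open(config.get("model_path", "model.pkl"), "rb") as f:
            self.model = pickle.load(f)

    def predict(self, payload):
        # payload is the parsed request body; assumes an sklearn-style model.
        return self.model.predict([payload["features"]]).tolist()
```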
Neat approach. I've also seen public-facing Trello boards used with varying success (at the very least to give users a hopefully clear picture of which features/issues are prioritized).
The server timing metrics here are actually extracted from an APM tracing tool (Scout).
Tracing services generally do not give immediate feedback on the timing breakdown of a web request. At worst, the metrics are heavily aggregated. At best, you'll need to wait a couple of minutes for a trace.
The Server Timing API (which is how this works) gives immediate performance information, shortening the feedback loop and letting you do a quick gut check on a slow request before jumping to your tracing tool.
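For example, the header itself is just a comma-separated list of metric names with optional durations and descriptions. A minimal sketch using Flask (the durations are hard-coded here; in practice the APM agent fills them in from its instrumentation):

```python
# Minimal sketch: exposing a timing breakdown via the Server-Timing response header.
# Flask and the hard-coded values are for illustration only.
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "ok"

@app.after_request
def add_server_timing(response):
    # Entries are name;dur=<milliseconds>;desc="<label>", comma-separated.
    response.headers["Server-Timing"] = (
        'db;dur=53.2;desc="ActiveRecord", view;dur=18.7, app;dur=41.0'
    )
    return response
```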
> To minimize the HTTP overhead the provided names and descriptions should be kept as short as possible - e.g. use abbreviations and omit optional values where possible.
I could see significant issues if we tried to send data in a timeline fashion (such as creating a metric for each database call in an N+1 scenario).
One idea: pass down a URI (e.g. https://scoutapp.com/r/ID) that, when clicked, provides the full trace information.
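Something like this would keep the header tiny while still linking out to the full trace. A sketch - the URL pattern and helper are hypothetical; whatever the tracing tool actually exposes would go there:

```python
# Sketch: point the Server-Timing entry at the full trace via the "desc" field.
def trace_link_header(trace_id):
    # The URL pattern is hypothetical - substitute your tracing tool's trace URL.
    return {"Server-Timing": f'trace;desc="https://scoutapp.com/r/{trace_id}"'}
```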
Application instrumentation - whether via Prometheus, StatsD, Scout, or New Relic - solves a very different problem than this. The server timing metrics here are actually extracted from an APM tool (Scout), so you get the best of both worlds.
With those tools, you do not get immediate feedback on the timing breakdown of a web request. At worst, the metrics are heavily aggregated. At best, you'll need to wait a couple of minutes for a trace.
Profiling tools that give immediate feedback on server-side production performance have their place, just like those that collect and aggregate metrics over time.