LLMOps: Match report from the top of the 5th
Stefan Krawczyk, Chief Executive Officer, DAGWorks Inc.
Hi everyone. Yeah, so just a bit about me. I created an open source package called Hamilton at Stitch Fix, and then subsequently decided to start a company around it called DAGWorks. Otherwise, for the bulk of my career, you could say, now that there's a term for it, that I was in MLOps for a pretty long time.
So I'm here to talk to you about LLMOps. So who here knows what LLMOps is? Who here is practicing LLMOps now? Okay, a few of you. All right. Who here knows what MLOps is? Anyone? Who here is practicing or has some initiative doing MLOps?
Okay, most of you. And then who here is on the technical side of things? All right, most of you. Who here is on the management side? Hang on, a few people. Cool. So I'm going to try to give you an overview of the space, and then if you do know it, then hopefully some color and perspective.
But to level set, why should you even care about LLMOps, MLOps, or even DevOps, for that matter, right? If your CEO came up to you and asked, why should I care, what would your response be? For me, the one-word answer is leverage, right?
Delivering sustained value over time requires some sort of abstraction. So, for example, in machine learning, you're building models and shipping them to production, right? And now with LLMOps, you're engineering prompts and changing apps, and then, as in the previous talks, building RAG systems.
The idea with this kind of ops is that you're trying to usually tactically build out some sort of platform or process that provides you leverage, right? And so, for the resourcing that you apply, the idea is that you get more out on the other end if you apply these processes correctly.
This is a short talk, so here's my brief overview: I'll try to situate LLMOps against MLOps. I'll give you a bit of a match report from the top of the fifth, abusing some baseball metaphors, and then I'll give you a forecast of where I think things are going and leave you with a take-home.
So, MLOps versus LLMOps, you know, two fields, right? The interesting thing to note is that MLOps as a field has only been around for roughly four years. Obviously machine learning has been around longer than that, but the term and the field, just around four years.
LLMOps, though, is very nascent. This time last year, if you had asked someone, have you heard of LLMOps, pretty much no one would have known what that meant, right?
And so it's only this year that this field has really come about. And so, how does it compare to MLOps? Well, let's just recap MLOps a little bit. Here I have a baseball diamond. And so the idea is, with MLOps, you're helping get machine learning to production.
So we have some idea, and we have the right data and resources. We then develop some sort of prototype to round first base. To get around to second, we take what we developed, and we try to get it to production.
And then rounding third to home is, we have it in production. We can monitor and maintain it. We have an effective process, and we're showing business value. And so MLOps, for me, is this set of processes and systems that you need to help you round the bases and get to a home run.
So design, to model development, to operations. Famously, here's this paper from Google on what the things are that you need to help you get around the bases. The point here being that the machine learning code is the small, tiny box in the center,
while there are so many other boxes that are required to help you operationalize and bring machine learning to production. So what about LLMOps? Well, I actually think it's a special case of MLOps, so a subset, if you will.
Because you have the same kind of high-level problems that you're thinking about: do you have the right data and the right idea to implement? The development-to-prototype-to-production phase now is prompt engineering with APIs.
And then maintenance and business value, you still have to measure it. You still have to monitor it. You have to understand, is data changing, et cetera, and have processes for ensuring that you can get things to production stably.
And so with respect to the systems and things, you might have heard of prompts, fine-tuning, embeddings, vector databases, foundational models. To me, they're just special cases of MLOps, of these boxes that already existed in the MLOps world.
So at a high level, I think they share the same general shape of problems. So code and data, and you've got to observe them and have processes around them.
At a low level, if you were to zoom in, right, and just focus on LLMOps as opposed to MLOps, well, in MLOps you didn't need GPUs to get stuff done, but with LLMOps they are required. That's what LLMs run on.
And so GPUs are a central, core component. The application integration pace is very different, right? It's much easier to round the bases to third base with LLMOps, because all you're doing is starting with a foundational model and engineering some prompts, and then you pretty much have something that's working pretty easily and quickly, whereas with MLOps it was much harder to get to production.
And then because of that, there are many more models in a single application with LLMOps, right? You have prompts and API calls that you can chain. So if you're building a RAG system, there are many places where you have prompts and API calls in a single call chain.
And so that means that the software development lifecycle, how do you version things, how do you monitor things, is a little more challenging than with MLOps, because you generally didn't stack too many models together unless you were building, say, a recommendation system.
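To make the versioning point concrete, here is a minimal sketch of one way to fingerprint a whole chain of prompts and model calls, so that a change to any single step shows up as a new version. It is not from the talk: the step names, templates, model identifiers, and the `chain_version` and `changed_steps` helpers are all hypothetical, and content hashing is just one possible approach.

```python
import hashlib
import json

# Hypothetical two-step chain for a RAG-style application. The step names,
# prompt templates, and model identifiers are illustrative assumptions.
CHAIN = [
    {
        "step": "rewrite_query",
        "model": "model-a",
        "template": "Rewrite this question for search: {question}",
    },
    {
        "step": "answer",
        "model": "model-b",
        "template": "Answer using the context.\nContext: {context}\nQuestion: {question}",
    },
]


def chain_version(chain):
    """Return a short, deterministic fingerprint of the whole call chain."""
    # Canonical JSON (sorted keys, fixed separators) keeps the hash stable.
    canonical = json.dumps(chain, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def changed_steps(old, new):
    """List the names of steps whose prompt or model differs between chains."""
    old_by_step = {s["step"]: s for s in old}
    return [s["step"] for s in new if old_by_step.get(s["step"]) != s]
```

Logging `chain_version(CHAIN)` alongside each production request would let you tie an output back to the exact set of prompts and models that produced it, and `changed_steps` points at which link in the chain moved.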
And then evaluation is a little fuzzier with LLMs than with traditional MLOps, right? You don't have a single floating point value, you have text. And so there are many ways you can evaluate it. So it's a little more challenging than in MLOps to figure out, is this a good answer?
So let's imagine we're in some game and we're at the top of the fifth. The two teams that are playing, the way that I see it, are the proprietary foundational models versus the open source foundational models, right?
So proprietary being OpenAI, Cohere, Anthropic, et cetera, and open source being Falcon, Llama, et cetera. The reason why I think we're at the top of the fifth is that we're at about the fifth generation of these models, right?
So GPT-1 and GPT-2 were actually many years ago, but what really started it was GPT-3, in which case, with the latest updates, I think we're at about the fifth inning. So this means we still have a lot of innings to go.
And then if you're a fan, which team or side you choose really depends on your privacy concerns, and on whether you're really worried about cost or about controlling most of the stack. In terms of the tooling to help you round the bases, there's been a Cambrian explosion, if you will, of point solutions: things to manage prompts, things to help you trace the application, things to help with evaluation, self-hosting of models, open source models, fine-tuning, embeddings, vector databases, people focused on privacy and governance, tracking costs on API calls. There's a plethora of options.
You have a lot of choice. And then in terms of MLOps providers, everyone now is transitioning to offer support for LLMOps. So H2O is a classic example of an old-school MLOps provider now offering a lot of LLMOps capabilities.
So just to go through a bit of a challenge, or play, that I've seen. Who here actually transitioned from GPT-3.5 to GPT-4 for an application? Anyone? A couple of you. Okay. We'll compare notes afterwards.
But so with these foundational model updates, you kind of have to make a decision if you have an application: do you update, right? And the challenge really is that, given a prompt, these two foundational models are going to give you different outputs.
So if you had built an application and tuned things in a certain way, you now potentially have to rewrite your entire application. And then as the technology improves, there are other things that this transition brought about.
So larger context windows. In the previous talk, the RAG talk, for example, a larger context window leads to better results. So does that actually simplify your engineering now? Is there less to maintain, are you deleting code?
And then these newer foundational models are usually more expensive, because you're paying for tokens in and tokens out. And so is this model that much better for the cost? And then also, from an application performance standpoint, they have different latency characteristics.
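As a rough illustration of the "tokens in and tokens out" arithmetic, here is a minimal sketch of how one might compare the cost of two models for the same workload. Nothing here is from the talk: the `Pricing` class, the helper functions, and any prices you plug in are assumptions, and real per-token prices have to come from your provider's current price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-1,000-token prices, in whatever currency you bill in."""
    input_per_1k: float
    output_per_1k: float


def call_cost(pricing, tokens_in, tokens_out):
    """Cost of one API call: input and output tokens are priced separately."""
    return (tokens_in / 1000) * pricing.input_per_1k + (
        tokens_out / 1000
    ) * pricing.output_per_1k


def monthly_cost(pricing, tokens_in, tokens_out, calls_per_month):
    """Cost of a month of identical calls (a deliberate simplification)."""
    return call_cost(pricing, tokens_in, tokens_out) * calls_per_month


def extra_monthly_cost(old, new, tokens_in, tokens_out, calls_per_month):
    """How much more (or less) the new model costs per month than the old one."""
    return monthly_cost(new, tokens_in, tokens_out, calls_per_month) - monthly_cost(
        old, tokens_in, tokens_out, calls_per_month
    )
```

Note that a larger context window usually means more tokens in per call, so the upgrade question involves both the price per token and how many tokens you now choose to send.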
So in general, if you have this app, right, you've got to be able to answer the question, is this worth it? And the people that I've seen do this best have established pretty good LLMOps practices.
So being able to version prompts, understanding how to push things out into production. But in general, this has really been built around evaluation as the backbone. So given a change in a prompt and an API, they can quickly and easily evaluate, am I getting the results that I expect? Not only in the development cycle, but also when things have been pushed out to production.
So that was a voluntary change. What about curveballs? Well, until very recently, until OpenAI announced a new feature just yesterday, people actually had to update their prompts kind of ad hoc, or rather unknowingly, when OpenAI pushed out a new foundational model update.
Because, at least until very recently, with most foundational models it's kind of hard to control the output, especially if you don't own the model and aren't serving it.
So that was something you had to contend with. And then these foundational model providers are slowly building out more capabilities. So if you had jumped on GPT-3.5 and you had to build out something to upload PDFs, well, OpenAI just rolled out something the other week that negates the need for that.
So there are more features. And so if you had engineered things, you now have to ask the question: do I remove what I engineered, or the solution I brought in, and instead use these new foundational model provider features?
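The "evaluation as the backbone" practice described above can be sketched as a small regression check that runs the same cases against the current setup and a candidate one, whether the change is a new prompt, a new model, or a provider feature replacing your own code. This is a minimal sketch under stated assumptions, not the speaker's tooling: the case format, the substring check, and the `pass_rate` and `safe_to_ship` helpers are all hypothetical, and the `generate` callables stand in for whatever function wraps your prompt and API call.

```python
from typing import Callable, Dict, List

# Each case pairs an input with text the answer is expected to contain.
# Substring matching is a deliberately crude stand-in for a real evaluator.
EvalCase = Dict[str, str]


def contains_expected(output: str, expected: str) -> bool:
    """Case-insensitive check that the expected text appears in the output."""
    return expected.lower() in output.lower()


def pass_rate(generate: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Fraction of cases whose generated output contains the expected text."""
    if not cases:
        raise ValueError("need at least one evaluation case")
    passed = sum(
        contains_expected(generate(case["input"]), case["expected"])
        for case in cases
    )
    return passed / len(cases)


def safe_to_ship(
    baseline: Callable[[str], str],
    candidate: Callable[[str], str],
    cases: List[EvalCase],
    tolerance: float = 0.0,
) -> bool:
    """True if the candidate scores no worse than the baseline, within tolerance."""
    return pass_rate(candidate, cases) >= pass_rate(baseline, cases) - tolerance
```

Running the same cases on a schedule against the production endpoint is one way to notice an unannounced model update, since the pass rate moves even though your code did not.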
So in the last couple of minutes of this talk, let's talk about where I think things might be going. The one thing to take note of is that computation and cost curves are going to continue to go down.
So the cost today is going to be different six months from now. It's going to be different a year from now. So if you're planning in terms of operationalizing, you should take that into account and plan for it.
The other is foundational models: much like open-source databases and proprietary ones both exist, I think both are here to stay. So there are always going to be improvements in both camps. Then I think over time, you should continue to expect better context windows.
And actually, less need for complex prompts, or rather, you shouldn't have to build complex prompts, because these models will get better and so simpler prompts will get you further. Then with multi-modal models, you can expect more interesting capabilities, and for that to continue to get better.
Organizationally, I think data is still your moat. Just like in machine learning, that is your moat, that is what your special secret sauce is. But the key to operations here is really evaluations.
You've really got to be able to evaluate the impact of what it's doing on your business, but also evaluate your changes pretty easily. What's interesting to note, for me, is that front-end and back-end developers are going to have to learn LLMOps.
Because a lot of these foundational models are now behind APIs and all it takes is prompts, but they haven't learned the processes: that data can change and you need to monitor for it. But otherwise, from a cost curve perspective, and as time goes on, you're going to explore fine-tuning.
So how can you get better output from your models? Fine-tuning is what you're going to explore there. And then otherwise, don't fire your data science and machine learning team, because I think that's where they're going to end up spending some time.
And then lastly, from a tooling perspective in the space, I think you'll continue to see a reduction in wrappers, or things that thinly wrap foundational model APIs, since the providers will slowly offer those capabilities themselves.
But in general, in the space, I think there's going to be a heavy focus on fine-tuning and evaluation, since those are two very core, related things in rounding the bases and creating a great LLM product and experience.
So my take-home is, one, if you're going to play, plan for rapid evolution. If you're going to build something, expect to change it in six months, maybe even sooner. And if you are going to play, you need to ensure that you have strong LLMOps practices, so that you can change things, push them out, understand their impact and value, and evaluate them to ensure that you're actually building something that isn't broken.
But otherwise, a small plug for what I'm doing. If you're interested in making this kind of ops process much simpler, I am, you know, standardizing code here with my open source package called Hamilton, and then with a few add-ons from DAGWorks.
But otherwise, thanks for listening. Happy to take questions.