H2O GenAI World Conference San Francisco

Open Source h2oGPT: Mastering LangChain and Agents with Open Source LLMs using h2oGPT

Jon McKinney, Director of Research, H2O.ai

Arno Candel, Chief Technology Officer, H2O.ai

AI-generated transcript

 

00:06

Thank you very much for coming. It's a very exciting time in the world of AI, and I'm going to talk to you about a little slice of that. I'm Jon McKinney, and Arno Candel will talk a little bit at the end.

 

00:18

So I'm going to talk about an open source project called h2oGPT. You can go ahead and go to the GitHub if you like, or if you'd like to try it out, it's gpt.h2o.ai. So, as I said, AI is very exciting right now.

 

00:35

It's very complicated at the same time. And this is just an arbitrary slice. Don't take it too seriously. If I'm missing something, it's probably on purpose. But basically, this is a rough overview of what's going on right now.

 

00:51

Obviously, at the bottom, we have NVIDIA, CUDA, GPUs, maybe other kinds of hardware like Cerebras, whatever. And all these things have interesting problems. We always have GPU out-of-memory problems. Then there's Torch that sits on top of that.

 

01:06

It's a very useful open source tool, and you can build models like LLMs. And it has its own set of difficulties. As you move up, you might want to do some fine-tuning, like with LLM Studio.

 

01:23

And with that, you can build a whole host of different kinds of models, but it becomes complicated. There are many different kinds of fine-tuning you can do. On top of that, we have something like LangChain or LlamaIndex, where you're trying to put together a bunch of data sources and combine them and tell the LLM how to use that data.

 

01:42

It's also very complex, and it can be fragile. h2oGPT is an example application of that, where we're doing RAG, retrieval-augmented generation, where you're taking all those sources and trying to give them in a simple way to the consumer, either as data sources like PDFs, images, videos, audio, whatever.

 

02:06

And of course at the top, the dream is that you have something like an agent which is controlling all of this in some way. Now, of course, this is very simplistic. In reality, NVIDIA is very interested in working on fine-tuning.

 

02:20

They have NeMo. Things like LLM Studio can be used for fine-tuning, but they can do much broader things. With LangChain, and certainly with LlamaIndex, you can fine-tune models within LlamaIndex, so it's kind of breaching those barriers.

 

02:38

And of course with agents, you would hope that you would be able to incorporate all of this together in a sort of general framework, but we're a long way from that. Specifically with RAG, which is what h2oGPT is focused on, this is a generic picture of what that is.

 

02:56

You have some data and some question, and it goes into some kind of model, like an embedding model, which represents that English-language information as a vector. It then combines that with the rest of all your data, finds the similarity, and with some extra handling, it might go into the LLM and come back as an answer.
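As a rough sketch of that flow (not the actual h2oGPT code; the embedding model, chunks, and the llm() call at the end are placeholders):

```python
# Minimal RAG retrieval sketch: embed chunks, embed the question,
# pick the most similar chunks, and stuff them into an LLM prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder embedding model

chunks = ["McDonald's 2022 annual report ...", "Revenue by segment ..."]  # document chunks
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

question = "What did McDonald's sell the most of?"
q_vec = embedder.encode([question], normalize_embeddings=True)[0]

# Vectors are normalized, so a dot product is cosine similarity.
scores = chunk_vecs @ q_vec
top_chunks = [chunks[i] for i in np.argsort(-scores)[:3]]

prompt = ("Answer using only this context:\n" + "\n---\n".join(top_chunks)
          + f"\n\nQuestion: {question}")
# answer = llm(prompt)   # hand the retrieved context to whatever LLM you host
```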

 

03:17

Now a complication can be: what if you ask a question of a bunch of documents, something like a bunch of McDonald's annual reports, and you have a question like, what does McDonald's do?

 

03:30

What does McDonald's do? Or what did they sell the most of? In general, this is a really bad query for similarity search, because "McDonald's" is obviously going to be there a lot, and "do" is of course a horrible word.

 

03:44

So it's really hard, but you can take approaches where you do what's called query understanding, where you process the user's question. You don't take it literally as the thing you do similarity search on, but you do some improved search.
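A hedged sketch of that query-understanding idea (the rewrite prompt and the llm() callable are placeholders, not h2oGPT's exact approach):

```python
# Query understanding: rewrite a vague user question into a form that
# works better for similarity search, then retrieve with the rewrite.
# llm() is a placeholder for any chat-completion call.
def rewrite_for_retrieval(question: str, llm) -> str:
    prompt = (
        "Rewrite the question below as a short list of specific keywords and "
        "phrases that would appear in relevant documents. Drop filler words.\n"
        f"Question: {question}\nKeywords:"
    )
    return llm(prompt)

# "What does McDonald's do?" might become something like
# "McDonald's business description, restaurant operations, franchising, menu",
# which retrieves far better than the literal question.
```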

 

03:56

So there are many ways to improve RAG, and Arno will talk a little bit about that later. So specifically, h2oGPT open source has a multi-chat view. Think of it as like ChatGPT, but with multiple options that are simultaneously rendered.

 

04:13

And you can also upload documents for retrieval-augmented generation. You can have web search, just like Bing or ChatGPT. It also has search agents and Python agents that are from LangChain, with some improvements.
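For context, a rough sketch of the classic LangChain agent pattern referenced here, using 2023-era langchain 0.0.x imports (the OpenAI LLM and the serpapi search tool are placeholder choices, not h2oGPT's exact wiring):

```python
# Rough sketch of a LangChain search agent (langchain 0.0.x style APIs;
# newer releases have moved these imports around).
from langchain.llms import OpenAI
from langchain.agents import initialize_agent, load_tools, AgentType

llm = OpenAI(temperature=0)               # placeholder LLM backend
tools = load_tools(["serpapi"], llm=llm)  # web-search tool; needs an API key

agent = initialize_agent(tools, llm,
                         agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
                         verbose=True)
agent.run("How much net profit did New Zealand contribute to CBA in 2023?")
```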

 

04:27

And the other kind of interesting part of it: we focused on what our Kaggle Grandmasters have provided, which is best-of-breed models. So for example, if you use classic open source things like Tesseract, the models for OCR, optical character recognition, are not that good.

 

04:43

If you use what we use, it's much, much improved. And these are things that our Kaggle grandmasters like Ryan Chesler have helped us add. And of course, we focused on a lot of other engineering aspects to make this very efficient. 
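As a small illustration of the kind of stronger OCR path being described, here is a sketch using DocTR, one open-source OCR option (treat the specific library choice and the file name as illustrative; the real ingestion pipeline wraps this sort of step with more preprocessing):

```python
# Open-source OCR that handles hard layouts much better than classic Tesseract.
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True)            # detection + recognition models
doc = DocumentFile.from_pdf("annual_report.pdf")  # also works for image files
result = model(doc)
print(result.render())                            # plain-text dump of the recognized text
```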

 

04:58

In the end, what we're interested in is giving to the open source community a very useful, well-tested RAG system, but also incorporating it as a back end for our enterprise offering. So this is roughly what it looks like.

 

05:13

It's a little small, but you basically have Llama 2 70 billion, Llama 2 13 billion, Zephyr 7 billion, and ChatGPT, all being asked some question about a CBA annual report.

 

05:26

Can it get the answer? I asked, go ahead and switch to the next one, how much net profit did New Zealand contribute in 2023? Now this is something on the order of a 200-page document.

 

05:38

As humans, we can go through that. But the question is, what are these models doing? And do they do it correctly? So we can look at the sources and say, well, in one particular example for the 70 billion, the number that it got, 1 billion 356 million, is in there somewhere.

 

06:03

So it is in the sources. But where does this really come from? Can we trust it? It's hard to say here, and Arno will talk more about how we can do that better. This is what the LLM actually saw. It's a really difficult stream of text.

 

06:17

Now, as a human, where's the answer? Is there a page where there are some answers? I'll give you a second to stare at this page over here and see if you can find the answer. Again, the question was, how much net profit did New Zealand contribute in 2023?

 

06:34

It's not too hard to find. If I give you the page, I'll give you a few seconds to think about it in case it takes time. You can see the number down there. You already saw the number. But how does the model do this? 

 

06:49

If it's looking at something like this, this column of text is actually what got extracted out of this sort of infographic. That's pretty hard. There are also tables which contain all this information, which also hold the answer, and that's why there are multiple sources.

 

07:05

So it's having to incorporate hundreds of pages and do retrieval in an interesting way. It's difficult. And it's difficult enough that if you try the same thing with something like Claude 2, of course in Claude, this document's too big.

 

07:18

And even if you pay 20 bucks a month, you're not going to be able to do this. This is comparing to another kind of not-quite-open-source, but freely available option. And if you try to provide the link, it can't do links.

 

07:30

If you use OpenAI with its advanced data analysis, as of two or three days ago, it would fail or complain about spurious problems. It would complain about the PDF being encrypted or something funny like that.

 

07:45

You provide the link, and it can't search the link. Similar with Bard. Now Bard's more interesting. All the results we showed before, I should go back a little bit, all these results appear correct: ChatGPT, the 13 billion, the 70 billion, and the newest interesting Hugging Face Zephyr.

 

08:16

And it gives nothing about the sources to tell you whether it's right or how to prove it. There's another interesting project out there. It's, again, not open source, but it's still an interesting option because it's pretty cheap.

 

08:30

It's free if you don't use the API. It's ChatPDF. And if you ask it, it'll say, I don't know. And if you ask that question there, it gives a nice pre-summary, but it basically says the answer doesn't exist, even though we know it's in a few places.

 

08:43

And the kind of reference it gives has New Zealand in it, but it's not related to profit. So just one thing about the Kaggle Grandmaster models: these are the kinds of images that would normally be quite difficult, but actually we can do quite well with them.

 

08:59

And this is a very strange skewed table that's just a picture of a table, or even handwriting, and we can read all this kind of stuff, similar to GPT-4V. Not quite as good, but pretty good. So there are some challenges with all this.

 

09:16

One of the ones that Sri brought up in the morning was context length. That's been the bane of existence of open source, and I think we're at a moment now, finally, where we can see that that's not a problem anymore.

 

09:28

It's still a difficulty to get these very-long-context models, where you can put a lot of documents in, to pay attention correctly to the relevant pieces, but that's less of a problem than the original problem.

 

09:42

But there are other challenges, and Arno will talk about this in a second. Sometimes you might have vague questions, as I mentioned: what does McDonald's do? You may have meta questions, so some kinds of questions require looking over the collection of documents, not just one particular retrieval, but an answer that might span multiple documents.

 

10:01

You may be extracting data out of all the documents. Another one is, can you be aware of time over multiple documents? Like your annual report is changing over time. Can you hypothesize what the next annual report would be? 

 

10:17

And of course, something like multimodal agency would be the dream, where you could have an agent generate reports for you and do all sorts of tasks, or at least do a little bit of it, and you go back through and finish it off.

 

10:31

And that would be from the collection of original documents. So I think that's pretty much it. I want to invite Arno up here, and he'll talk about how we're using the open source within our enterprise offering.

 

10:42

And he'll talk about how we're solving some of those challenges. Thanks, Jon. Excuse my voice. I'm still a little bit ill, but the Enterprise Edition of h2oGPT is really built on top of h2oGPT open source.

 

10:58

It adds a little bit more of the enterprise kinds of things, right? Like scalability, high availability and so on. But RAG is the centerpiece. So you have documents, and it highlights nicely in yellow where the information was found.

 

11:12

And it gives you a score and so on. So basically, it's RAG for the enterprise. And you can throw in hundreds of PDFs or millions, whatever you want. Obviously, if you have a million PDFs and you say, what's net revenue, it's not going to work, right?

 

11:26

Because you have a lot of net revenues in there. So as Jon said, you want to make your questions specific enough, or you want to make collections of documents that are specific. So only quarterly earnings reports for one company, not for all companies at once, and stuff like that.

 

11:42

And when this video is over, we'll go into more details. This video, by the way, is on our website. You can watch it again. And we also have demos on the other side here. And there is, I think, a training session in the Contemporary Jewish Museum.

 

11:58

So there's more to be learned. I'm just going to do a little pitch here on the closed-source product. So it was designed from the get-go to be scalable, with Kubernetes and with air-gapped deployments. Everything is scalable, right?

 

12:21

So each system, each subsystem, is a separate process that can run on a different pod. You can spin up multiple workers, multiple chatbots, multiple vLLMs that do the LLM hosting, multiple vector databases, multiple Postgres instances, multiple MinIOs, and so on.

 

12:38

So everything can be scaled. And if you have 1,000 people that want to ask questions all day and night, you can make it work that way. And like I said, it's made for air-gapped environments. But of course, it runs in all the clouds.

 

12:51

It's not beholden to any one particular cloud. And it can run any LLMs, just like Jon was showing earlier. Different LLMs, that's kind of our bread and butter. And you can select the LLMs here in the middle.

 

13:03

You can also have a system prompt. You can say: you're Grok, you're funny, you're sarcastic. And you can basically emulate what Elon is tweeting these days. It's very easy. It's just a one-liner to get it into the mood of being sarcastic and telling jokes all day long.
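That one-liner is literally just a system message at the top of the chat; a minimal sketch (the wording and the question are examples, not the exact prompt from the demo):

```python
# The "personality" is just a system message prepended to the conversation.
messages = [
    {"role": "system",
     "content": "You are Grok-like: helpful, but sarcastic and fond of jokes."},
    {"role": "user",
     "content": "How does H2O help CBA?"},
]
# Pass `messages` to whichever chat LLM you host (vLLM, an OpenAI-compatible API, etc.).
```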

 

13:17

So that's not a secret, how to do that. You can see the question about how H2O helps CBA. This was literally an annual report of CBA, and I asked that question. And because I gave it a system prompt that says be a little bit funny, it gave an answer that was a bit funny.

 

13:33

And we see the different RAG types on the right. So there's the no-RAG option, where you just tell the LLM, give me an answer, and it will tell you an answer without looking at any documents. Then there's the regular RAG, where you embed your question and get an answer.

 

13:49

Then there's one where you make up a fake answer and embed that, so that you get some more relevant words. Maybe your question was too short. You just said, give me net revenue.

 

14:01

But then it will say, oh, net revenue for this company was this, and this matched, blah, blah, blah. And there are some more words in that sentence that are useful for retrieval. So that's called hypothetical document embedding.
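A minimal sketch of hypothetical document embedding, assuming generic llm() and embed() callables and a numpy array of chunk embeddings (not the enterprise product's internals):

```python
# HyDE: ask the LLM to write a fake answer first, then embed that fake
# answer for retrieval, because it contains more of the words a real
# passage would use than the short question does.
def hyde_retrieve(question, llm, embed, chunk_vecs, chunks, k=3):
    fake_answer = llm(f"Write a short, plausible answer to: {question}")
    q_vec = embed(fake_answer)            # embed the fake answer, not the question
    scores = chunk_vecs @ q_vec           # chunk_vecs: (n_chunks, dim) normalized embeddings
    top = scores.argsort()[::-1][:k]
    return [chunks[i] for i in top]

# "What was net revenue?" -> fake answer "Net revenue for FY2023 was $X billion,
# up Y% ..." -> the extra words make the similarity search far more targeted.
```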

 

14:12

And there are two levels of that, basically two or three passes over the LLM. The more passes you do, the more accurate it gets. So if you want a really good answer, you can do that. And there are different levels of accuracy, of course.

 

14:24

What does it mean, accuracy? Well, the simplest way is to say, did you get the net revenue? There is only one number, like Jon was showing, one right number. But sometimes you ask, you know, is this company doing well or not?

 

14:38

And there's not a clear answer that's good or bad. But let's just focus on the yes-or-no kind of questions. Like, did they get it or not? Is the string contained in the answer or not? And I don't even care if there are more strings.

 

14:52

Like, you can have every number in the universe in the answer. And of course, it will say, yes, you got it right. But we're not talking about that corner case. In general, the answer is relatively short. 

 

15:02

So if the number is in there, it means you probably got it right. And this is the leaderboard we make every night. We compare all these different LLMs, and you can see that GPT-4 is the winner, as expected.
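That pass/fail scoring is as simple as it sounds; a small sketch (the example answers and expected string are made up):

```python
# Containment-based scoring in the spirit of the nightly leaderboard:
# an answer counts as correct if the expected string appears in it.
def is_correct(answer: str, expected: str) -> bool:
    return expected.lower() in answer.lower()

evals = [
    ("Net profit from New Zealand was 1,356 million in 2023.", "1,356"),
    ("I could not find that figure in the document.", "1,356"),
]
accuracy = sum(is_correct(a, e) for a, e in evals) / len(evals)
print(f"accuracy = {accuracy:.0%}")  # 50% in this toy example
```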

 

15:12

But Llama 2 70B is quite good. So we are at the 90% mark, more or less, and we are doing more with AI techniques, Grandmaster-inspired, and Jon and I are constantly looking at what's happening in the field.

 

15:27

And so we'll make our best effort to give you a reliable RAG system that runs anywhere. And it obviously has Python clients and nice APIs for summarization, data extraction, JSON. You can do document AI use cases.

 

15:46

You can do meeting notes, all kinds of things, all programmatically. So please see us at the booth later, and thanks for your attention.