H2O GenAI World Conference, San Francisco

Fireside Chat | Arno Candel, Jon McKinney, Pascal Pfeiffer, Philipp Singer, Michelle Tanco

Arno Candel, Chief Technology Officer, H2O.ai

Jon McKinney, Director of Research, H2O.ai

Pascal Pfeiffer, Principal Data Scientist, H2O.ai

Philipp Singer, Senior Principal Data Scientist, H2O.ai

Michelle Tanco, Director of Product Management, H2O.ai


00:06

We'll jump-start with: there's just so much innovation out there in open source. So we'll give the mic to Arno. How would you describe the general challenges in bringing RAG to life in open source, starting with EleutherAI to where we are with Llama and Mistral?

 

00:30

Yeah, so when you have documents and you ask the LLM what's in them, it doesn't just know about that document; you have to somehow give it that document. But what if you have 50 gigabytes of documents and one LLM that only has a thousand-token context window?

 

00:49

That's the challenge. So you have to find the relevant snippets in those documents. And how do you make a snippet? You have to chop up that document into small text pieces, and then each one has to be searchable later. When you say, what was net revenue in 2021, somehow you have to find that relevant snippet, right? And that's the art of doing RAG. So you need to have embeddings, you need to have good parsing, you need to have good ranking of which are the top snippets, because in the end you can only pick five or so that you can throw into the LLM and say: look at those five passages of text and now please answer the question. So imagine that, and then at a scale where you can share it with other people, where you can do this in scalable ways, in air-gapped environments, with any LLM, and so on. So you get the picture; that's why the claim was that this is the world's best RAG. We also have the world number one here, the Kaggle world number one, who managed to win a RAG-based LLM science exam.
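
To make that retrieval step concrete, here is a minimal sketch of the chunk-embed-rank-prompt loop described above. The embedding model, chunk size, and file names are illustrative assumptions and not what h2oGPT uses internally.

```python
# Minimal RAG retrieval sketch: chunk documents, embed, rank, and build a prompt.
# Model name, chunk size, and file names are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

def chunk(text, size=500, overlap=100):
    """Split a long document into overlapping text snippets."""
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small open-source embedding model

documents = {"report_2021.txt": open("report_2021.txt").read()}  # hypothetical corpus
snippets = [s for doc in documents.values() for s in chunk(doc)]
snippet_vecs = embedder.encode(snippets, convert_to_tensor=True)

question = "What was net revenue in 2021?"
q_vec = embedder.encode(question, convert_to_tensor=True)

# Rank all snippets by cosine similarity and keep only the top five for the context window.
top5 = util.semantic_search(q_vec, snippet_vecs, top_k=5)[0]
context = "\n\n".join(snippets[hit["corpus_id"]] for hit in top5)

prompt = (
    "Look at these five passages of text and answer the question.\n\n"
    f"{context}\n\nQuestion: {question}\nAnswer:"
)
# The prompt is then sent to whichever LLM you are running (e.g. a local Llama model).
```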

 

02:14

So maybe there are two world number ones on the stage; one is the best RAG in town. And I am going to ask the question to Jon: how do you evaluate the best RAG, or the performance of RAG? Sure, it is a great question.

 

02:31

So the way in which you use RAG, like Arno mentioned, is to give it some documents. How do you verify that the LLM is actually not hallucinating? How do you confirm that the documents it used were actually used, and that the particular parts of the response were valid and part of the documents?

 

02:53

So that whole RAG evaluation step needs to be built in. That is one of the things that we have been doing at H2O: making sure that the responses make sense, and that requires some human in the loop.

 

03:07

Sometimes you have to look back at the highlighted pieces of text and confirm, because LLMs aren't perfect. So that human in the loop really helps in that regard. That's great. In terms of RAG evaluation, there's probably both the factual accuracy, but there's also the style of the response.

 

03:34

What kind of themes have we applied to ensure that we can compare and contrast? Maybe Michelle can talk; she's done a lot of style-based things. Sure. We found that when we're building RAG, we're not just doing it for fun.

 

03:52

We're solving specific business problems. Every company has their own voice. They want their content to sound like them, even if AI is writing it. So being able to have RAG-based applications that don't just look up facts, but also provide context on what the right voice is, is something we're seeing people do.

 

04:12

So we should talk about the LLM science exam and see how anybody can make their LLMs pass their domain-specific exam, I would imagine. Yeah. So in general, when we talk about LLMs, oftentimes we refer to different kinds of LLMs. There are the base models, which are usually just trained to predict the next token.

 

04:40

But to make them useful for certain types of applications, you need to fine-tune them. So for example, if we talk about the Llama Chat model, this is a model that is fine-tuned for chat use cases.

 

04:49

And similarly, for other specific use cases, you can use LLMs successfully and fine-tune them for your use case. RAG is one thing. For example, in this LLM science exam, the task was that you have a multiple choice question,

 

05:07

a science-based multiple choice question with five different answers, and you need to build a solution that can correctly predict the right answer. And there are different puzzle pieces coming together.

 

05:23

RAG is one puzzle piece and embedding search is one puzzle piece, but the LLM fine-tuning is another important piece to tailor it towards your use case at hand. And yeah, we have open-sourced H2O LLM Studio, which explicitly allows you to fine-tune your LLMs, and I'm sure we will hear more about that throughout the day.
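
One common way to approach such a multiple-choice exam with a causal LLM is to score each answer option by its average token log-likelihood given the question. The sketch below illustrates that idea only; the model name and prompt format are assumptions, and the actual winning Kaggle solution combined retrieval and fine-tuning on top of this.

```python
# Sketch: pick the answer option a causal LLM finds most likely.
# Model name and prompt format are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # any causal LM works here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def option_score(question: str, option: str) -> float:
    """Average log-likelihood of the answer tokens, given the question as context."""
    prompt = f"Question: {question}\nAnswer:"
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tok(prompt + " " + option, return_tensors="pt").input_ids.to(model.device)
    labels = ids.clone()
    labels[:, :prompt_len] = -100  # score only the answer tokens, not the question
    with torch.no_grad():
        loss = model(input_ids=ids, labels=labels).loss  # mean NLL over answer tokens
    return -loss.item()

question = "Which particle mediates the electromagnetic force?"
options = ["photon", "gluon", "W boson", "graviton", "neutrino"]
print(max(options, key=lambda o: option_score(question, o)))  # expected: photon
```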

 

05:50

Probably a question for Pascal. Good segue: when should one use fine-tuning and when should one use RAG? That's actually a very good question, and I think that's one thing that everyone should ask themselves when building models or using LLMs.

 

06:09

So what I usually do when tackling a problem is to go with a base model or any chat model that is already out there, using some kind of prompt engineering to get some initial results. But that is only using the internal knowledge of the model.

 

06:27

And in a second step, you can add context with RAG, use the context to formulate the answer, and also ground the answer in the context. And if you're not content with that, you can go one step further, where you need training data to fine-tune a model to your specific use case.

 

06:55

Many times there are companies or people who already have the fine-tuning data, the training data. And in this case, it makes sense to skip maybe all of the things in front of that. You don't need any prompt engineering.

 

07:12

You don't need any context retrieval, which you could also use for fine-tuning. But yeah, it's always a matter of the problem that you have and the metric or the style that you want to achieve. I heard the words prompt tuning, and that's a good prompt for Mark.

 

07:33

Mark's been building AI for documents for the last four or five years, through different phases of it. And LLMs quite literally intersected your Document AI journey. It's probably worth talking about that experience, which is very likely going to intersect many of your application-building experiences as well, but also prompt optimization and prompt tuning.

 

08:03

Yeah, actually, so with Document AI, like Sri says, we've had the product for a while, and we've been doing it more like classical machine learning. But what we're noticing is customers will sometimes work really hard to go through the process that we have of connecting OCR, doing a classification model, and some post-processing.

 

08:26

And then they'll just dump it into an LLM and realize that they can get something that looks similar. Now, looking similar and being similar are slightly different things, but there's no doubt that what these LLMs just know in them is powerful.

 

08:39

And actually, what we're looking at with Document AI is very similar to what Pascal just said: where are you on the spectrum of things? And what we're experimenting with maybe the most heavily, because it differs from some of the others, is prompt tuning, which we mentioned.

 

08:52

But it's a term that probably quite a few of you haven't really paid attention to. So when people are looking at our documents and they're just tossing them into an LLM, they're almost zero-shotting it.

 

09:05

What is the total? What's the date? What's the company? These are very simple, easy-to-ask questions. You can even go further and ask for it in JSON format. And so that's all kind of prompt engineering.
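
For illustration, a zero-shot extraction prompt of the kind described here might look like the sketch below. The field names and the commented-out call_llm helper are hypothetical, not part of the Document AI product.

```python
# Zero-shot document extraction prompt of the kind described above.
# Field names and the call_llm helper are illustrative assumptions.
extraction_prompt = """You are given the text of an invoice below.
Return a JSON object with exactly these keys: "total", "date", "company".
If a value is missing, use null.

Invoice text:
{document_text}
"""

# response = call_llm(extraction_prompt.format(document_text=ocr_text))
```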

 

09:15

You might tweak a few different prompts. And then prompt tuning, I see, is kind of the next evolution. It's a little easier than going the next step, which is fine-tuning the LLM.

 

09:27

What prompt tuning really does is build a small model that takes your input prompt, embeds it, and puts that together with the input into an LLM. So you're not really touching the LLM.

 

09:38

You're training it with the LLM. So you kind of connect this to the LLM of your choice, but you're actually training an embedding model. And we have some slides later today to show, if you're intrigued about this.

 

09:48

But what it really helps us do is go beyond regular prompt engineering questions where you can't control the output; it actually helps us get control over the output. So it looks a little more like a supervised problem, which is what we have in Document AI.
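
One open-source way to implement what Mark describes, training a small set of soft-prompt embeddings while the LLM itself stays frozen, is the Hugging Face PEFT library's prompt tuning. The sketch below uses it under assumed choices of base model, virtual-token count, and init text; it is not the Document AI implementation.

```python
# Prompt tuning sketch with Hugging Face PEFT: only a handful of soft-prompt
# embeddings are trained; the underlying LLM stays frozen.
# Base model, token count, and init text are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model

base = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,                       # the small learned "prompt embedding"
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text="Extract the requested invoice fields as JSON:",
    tokenizer_name_or_path=base,
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the virtual-token embeddings are trainable

# From here, train on (document, expected output) pairs with a standard Trainer loop;
# at inference the same frozen LLM can serve many use cases, one tiny prompt per task.
```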

 

10:01

So as people know, if you want to ask about an invoice, LLMs do a pretty good job on an invoice. But if you need these specific 17 fields from it, and line items and things like that, that's actually a complex problem.

 

10:11

And a lot of our customers don't have room for error. Like, we can't mess that up and miss some fields here or there; they need it to be dependable in the database. So what we're experimenting with now is prompt tuning, among other things in this spectrum, because it has some nice properties for deployment with a single model.

 

10:27

You can very easily get one for each use case. So, kind of like Pascal says, what we want to allow people to do is start with zero-shot in the tool: let me just get started, I don't even want to bother with supervised results.

 

10:37

Let's just see if the LLM already knows the answer. And then you can correct those annotations and loop through and continue to get a better model, because we move from a zero-shot, fully unsupervised model to a slightly supervised model, even though the LLM is not getting trained at that point.

 

10:52

But you can do that too; you can go to fine-tuning. So the full spectrum of the LLMs is available. But we're really intrigued with the prompt tuning, because the LLMs are so powerful at what they know with a lot of simple documents,

 

11:04

and even some more complex ones. So it's teaching the model where to focus on these documents, and it's very interesting to us. Michelle, bring it all together. Tell us about the GenAI App Store and all the various apps, both in telco and finance, both of which you've been working very closely with.

 

11:21

Yeah, so we're excited to announce today, and you'll see later today, our public GenAI App Store. Basically, everything that all these really smart data scientists on the stage have been talking about is for more of our technical people, but a lot of the people using GenAI to solve problems today end up being business users, who we don't necessarily want to have to learn prompt engineering.

 

11:43

So that's where custom GenAI apps come in: your data science teams can build these applications, using skill sets they already have, that solve a specific problem. So let's say your HR team spends three hours writing internal job summaries. Instead, build them a front end that's customized for their exact needs, where they fill out some forms and spend 10 minutes reviewing content from an LLM rather than three hours writing it from scratch.

 

12:10

So we're seeing lots of use cases across finance, telco, in HR, in marketing, where our customers are wanting to rapidly build custom bespoke applications for their internal and even external use cases. 

 

12:25

So we'll see a lot more examples of some apps today. Some of them are fun for your daily life: you can write home listings or meal plans faster. And some of them are more business use cases. We'll get deep into them.

 

12:38

So, Jon and maybe Arno, for the better part of the last two or three quarters, we've been racing. And there's an incredible race out there between the capital-intensive foundational model companies and the open source community that's really eking out performance.

 

13:04

What are things that you've experienced in LangChain, the incredible integrations? What are things that caught your eye that were exciting, and what is missing? What should the community be building?

 

13:17

Yeah, I can briefly mention. So we started our journey, I feel, with fine-tuning. And when we saw the first models come out, Pythia, et cetera, we started asking: can we make them work for our purposes?

 

13:33

And we were fine-tuning Llama when it came out, trying everything out. And that was our first start. And of course, the grandmasters here fine-tuned Falcon 40B models, et cetera. The journey has been to see the open source community explode, from basically unable to do much against the big guys, OpenAI, et cetera,

 

13:59

to being able to actually make a dent. Thinking of all the competitive advantages that OpenAI has, basically one by one, the open source community has been able to breach them. And people talk about this moat between the big corporate giants and the open source community.

 

14:18

I think that's been eroded over time with all the great innovations, all the way from just being able to have CUDA and Torch out there, to all this stuff like AutoGPT at the top level. Very creative types of solutions that have really made... What about context?

 

14:42

What about context? Context length? Length, yeah. Oh yeah, I mean, even in the beginning we had customers who would say it just can't take a 2048 context length, it's just not acceptable. And to see how right now you can get an open source model like Llama 70 billion with 32,000 tokens, or maybe even 128,000 tokens for some recent models.

 

15:11

And so to see that in the open source community is really encouraging. Being able to, we're being able to figure out how to do that efficiently like with long techniques like with Lora or Longlord, different kinds of attention. 

 

15:24

And to see all that, that there are so many levels of progress and innovation, it's been amazing to see that in the open source community. Kudos to the open source community on that. So yeah, we see a vision where you ask a question like you would ask a human: look at these hundred reports, 10 per company for 10 different companies, and write me a story about how they each evolved over the last quarters.

 

15:51

Which one should I invest in? For what reason? What are the risks? Make me a nice summary. And that will be done all by these LLMs with the right APIs, with the right agent-like behavior, where you've got to fetch the right information, put it somewhere, then pass another time over all the data, and then collect it, and so on.

 

16:11

So it's not just single-shot, right? It's: store the information that you collected, put it somewhere, and then deal with it a little bit later, just like a human would do. So very exciting times, and we are...

 

16:24

Portfolio construction is actually a real use case for us. Bond recommendations in fixed income, right? Sort of earnings calls: one of our customers actually is using earnings call transcripts to pull out signal, but also meetings that you cannot attend, right?

 

16:45

Sort of on the more fun side of things, you're seeing a lot more of, kind of, meetings where you can get a summary that you otherwise would not have gotten. I was going to demo one on whether I can park here or not,

 

17:04

right, so just looking at parking signs. I think the power use case for telco is obviously the contact center. All customers want to have better conversations with their customers. Their customer experience is a huge killer app.

 

17:23

Code generation, it's a lot of code. How well is Code Llama doing? Code Llama is great. We use it a lot just for fun. Like, you ask a question: how should I do this? What's wrong with this code? And it gets it right about half the time or so.

 

17:36

So you should find it useful. One of the apps in our public app store we'll show you later today will help you write better, clean Python code and teach you any improvements it might make. So it's fun. Are there open source apps in your app store?

 

17:49

They are open source. All of them at this point are open source. So we'll show you how to start making your own. I think a question for the Kaggle Grandmasters on the stage: what are some emergent properties? Can LLMs do reasoning?

 

18:06

Yeah, that's a very controversial topic. So, yeah, people think, by the answers that LLMs are giving right now, that they seem to be pretty smart in their reasoning, in what they show. But I don't believe that the architecture itself allows for actual reasoning.

 

18:30

And the models kind of trick the user into thinking that they are reasoning, but they are actually just laying out a plan in tokens, and they take the time to get to the correct answer; but it's really just a byproduct of the loss function that you train on.

 

18:49

So you always train on the next token, and the model will create a sentence that sounds good, and it will be coherent, it will make sense, but only because it has been trained on that. So if it is asked to generate something completely new, which is out of distribution from the training set, this will be pretty hard, and that's where I would say the step to actual reasoning lies, which you can compare to a human.

 

19:19

In terms of whether LLMs can win competitions themselves: it would be boring for me if that were the case, right? But LLMs themselves, I mean, they're starting to get useful, also in the daily life of a data scientist, I would say.

 

19:39

So first for writing code, second for brainstorming, but also in the whole area of label generation for other supervised training, we see it coming up more and more. So I think on their own they're starting to get useful, but building agents that automatically win competitions, I think this is a bit far-fetched at the moment, but who knows?

 

20:09

The whole development around LLMs is kind of crazy in a way, right? How quickly things move, so who knows what will happen at some point? I think for me, when I learned data science a while back, 10-12 years ago, I was watching these conferences and someone had this idea that we can create supervised models for everything.

 

20:30

If you don't have an answer for something in a dataset, I can plug the answer in, I can fill the NA with a prediction model, and it sounded wonderful. We're going to have all these models out there and doing all this stuff. 

 

20:40

In reality, nobody does that, because it takes a lot of time and governance and all that, so we don't have labeled data for a lot of it. It's a chore to do this. And what I think is interesting is to take all those labeled problems that were a little too hard, where there's not enough value to solve that problem.

 

20:55

Well, now we have unsupervised models that can start to answer these questions, and it'll be interesting for practitioners to start to figure out where we can plug these two together. It's probably going to be obvious in a lot of cases once you think about it, but it's finding those obvious things.

 

21:09

LLMs are great labelers. Yeah, exactly, and in general it's a new technology, right? And we should rather learn to work with the technology, and I'm personally not scared that it will replace my job or something like that.

 

21:23

But I see it rather as a helping tool, and there might be new fields emerging, right? We saw it start: suddenly a job like prompt engineer came out, right? Which was not a thing beforehand. But in general, it just opens up new doors and new opportunities.

 

21:44

And English is the new ML, right? Now, this is the team where I think we were sorely missing Mark when we started working with one of our customers, PwC, and we created AutoML, Auto-Mark-Landry, AutoML.

 

21:59

And this is the team that built one of the world's best AutoML products, Driverless AI. Where do you see the convergence of the traditional tabular machine learning world and the LLM world? Yeah, so tool usage will be very useful for LLMs, right?

 

22:14

The LLM can maybe make up some plausible next step: let me fit a model on this data set to predict this column. Once it can say that, it can also call a function that does that, such as our AutoML.

 

22:26

Once you have a model, you can say: make predictions for everybody and tell me who's going to churn, you know? And then sort by churn and tell me the reasons why, Shapley values for example. And then, given that table, write a story on the five customers most likely to churn and why they will churn. All this is doable today, right?

 

22:49

It's just a matter of how you want to connect these pieces. So the data is there, and the LLMs are there to help bridge the gap, but you can also do it by hand today, right? You can call AutoML and get the value you want.
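
A rough sketch of that "by hand" chaining, using H2O-3's AutoML for the churn model and then handing the top rows to an LLM to narrate. The file path, column names, and the narrate_with_llm helper are assumptions for illustration only.

```python
# Sketch: chain AutoML predictions into an LLM-written story, as described above.
# File path, column names, and the narrate_with_llm helper are illustrative assumptions.
import h2o
from h2o.automl import H2OAutoML

h2o.init()
customers = h2o.import_file("customers.csv")        # hypothetical dataset
customers["churn"] = customers["churn"].asfactor()  # binary 0/1 churn label assumed

aml = H2OAutoML(max_models=10, seed=1)
aml.train(y="churn", training_frame=customers)      # "fit a model to predict this column"

preds = aml.leader.predict(customers)               # churn probability for everybody
scored = customers.cbind(preds).as_data_frame()
# "p1" assumes 0/1 labels; H2O names probability columns after the class labels.
top5 = scored.sort_values("p1", ascending=False).head(5)  # five most likely to churn

# Shapley-style reason codes (supported for tree-based leader models):
# contribs = aml.leader.predict_contributions(customers)

prompt = (
    "Given this table of the five customers most likely to churn, "
    "write a short story explaining who they are and why they might churn:\n\n"
    + top5.to_string()
)
# story = narrate_with_llm(prompt)   # hypothetical call to whichever LLM you run
```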

 

23:03

But the LLM will obviously make it a little easier for end users. They can just say: here's the data set, what can I do with it? And then the LLM might suggest some things, especially if you have a few pages of suggestion context, let's say.

 

23:18

That's your corpus that you throw into the system to say, that's what one can do. These are the kinds of functions you can call, the kinds of outcomes you would get if you did that kind of stuff. So it's all just next plausible steps for the LLM. 

 

23:35

So it's going to be an exciting time to combine the two. The way we look at it is from retrieval to predictions, right? So far, we're doing retrieval, which is looking at what happened in unstructured data.

 

23:47

I think when you bring unstructured and structured data together, with the ability to do powerful predictions, you go from feature stores to feature stories, right? So how do you get to telling stories from your data about what's going to happen in the future?

 

24:03

And then predictions will unlock some incredible narratives with LLMs. So super, super excited, both on the vertical side, how we can co-innovate with some of the audience here and the customers,

 

24:16

but also co-innovate with the traditional machine learning world and bring LLMs to power our applications and our engines there. Team Hydrogen is on stage here; they're talking later today. Hydrogen Torch is the engine that we really pioneered by bringing transformers and taking the 'Attention Is All You Need' paper into practice.

 

24:45

Label Genie is the other toolchain that can be really useful for labeling. But I think LLMs have really jump-started almost every facet of what H2O has been offering. So we're super, super pumped. There's one question; I know we're out of time.

 

25:04

The cost of AI is still ridiculously high. vLLM, TGI, what is the latest way to run an LLM at really, really low cost? So the best way to run an LLM on-prem is to put it into vLLM or TGI. We like vLLM because it has an open source license and it has concurrency, where you can have multiple requests at the same time.

 

25:29

And it stuffs them into the same GPU. Basically, the sentences don't have to start at the same time. You can have a sentence going in and something coming out, and then another sentence going in, another sentence at any time.

 

25:40

You can fill up this boat if you want, like it's like a ferry. It's always swimming and you just jump on at any time and jump off at any time. So the LLM is constantly busy streaming different sentences at different positions. 

 

25:55

in their continuation. And you can have 10 or 20 of those going on at the same time on a single GPU or on four GPUs or two GPUs, whatever you need for the memory. But the main limit is always the memory, right? 
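
As a minimal illustration of that "ferry", here is a sketch using vLLM's offline Python API; the engine handles the continuous batching across GPUs for you, and in a real deployment you would more likely run vLLM's OpenAI-compatible server instead. The model name and GPU count are assumptions.

```python
# Minimal vLLM sketch: the engine batches many sequences onto the same GPU(s),
# letting them hop on and off at any time (the "ferry" described above).
# Model name and tensor_parallel_size are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",
    tensor_parallel_size=4,          # spread the weights across 4 GPUs for memory
)
params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [
    "Summarize the Q3 earnings call in three bullet points.",
    "What was net revenue in 2021?",
    "Write a short product description for a fixed-income fund.",
]
# All prompts are scheduled together; vLLM streams tokens for each sequence
# concurrently instead of waiting for a fixed batch to start or finish.
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```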

 

26:09

So if you have a model that needs 100 gigs, then that's about what it is. The only way to reduce the 100 gigs down to 20 gigs or so is by doing three-bit or four-bit quantization. And we've seen good results with that.

 

26:21

So we'll show in a later presentation that even with four-bit, which is a factor of four reduction, we get the same results as with the full 16-bit. And it's funny that 16-bit is "full", right? It used to be 64-bit.

 

26:36

32-bit was like half and not so good. And now 16-bit is actually the top. But yeah, four-bit is just fine. These models don't need billions of unique values and stuff; they do just fine with lower precision.

 

26:49

Quantization, GGUF. Yep. Then there are tons of open source contributions: GGUF, AWQ, all these methods are great, and the ecosystem is thriving. You can run the 70 billion model on one or two GPUs now, right?
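
One common way to get that 4-bit memory reduction is bitsandbytes NF4 quantization through transformers, sketched below; AWQ and GGUF (for example via llama.cpp) are alternative routes. The model name is an assumption.

```python
# Load a model in 4-bit with bitsandbytes NF4 quantization, cutting weight memory
# roughly 4x versus 16-bit. Model name is an illustrative assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # normalized-float 4-bit
    bnb_4bit_compute_dtype=torch.float16, # matmuls still run in 16-bit
)

name = "meta-llama/Llama-2-70b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    quantization_config=bnb_config,
    device_map="auto",                    # shard across whatever GPUs are available
)
# A roughly 140 GB fp16 model drops to around 35-40 GB of weights in 4-bit,
# which is what makes "70B on one or two GPUs" feasible.
```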

 

27:09

And what is exciting about this personally is that different fields of the open source community now kind of come together, because the hardcore engineering community had not been interested in machine learning that much before, and now, with all these inference things coming up, the whole open source community is expanding and merging in a way, rather than being separate, which is also very exciting. The compiler engineers are co-creating with the Kaggle grandmasters, right?

 

27:41

Well, that's all we have for today, although almost all of them have a talk either on the stage or in the training sessions. So you've got a sample of what you can have as conversation starters; I left enough prompts for all of you to pick up on. Have great conversations with the team throughout the day, and with that, thank you. Thank you for all the innovation, and this is just the beginning.

 

28:11

We're still getting started here. Thank you. Thank you.