
Announcing H2O Danube 2: The next generation of Small Language Models from H2O.ai

By Michelle Tanco | April 23, 2024

A new series of Small Language Models from H2O.ai, released under the Apache 2.0 license and ready to be fine-tuned for your specific needs, run offline, and operate with a smaller footprint.

Why Small Language Models? 

Like most decisions in AI and tech, the choice of which Language Model to use for your production use cases comes down to trade-offs. Many Large Language Models are excellent at solving a wide range of Natural Language Understanding use cases out-of-the-box, but they require either sending your data to a third party that hosts the model or having enough GPUs to run the large model privately. On the other hand, Small Language Models are more compact, nimble, and take significantly fewer resources to deploy - these smaller models are great at handling specific tasks and can be tailored to perform well within a narrower scope. When you have an application where resources are scarce and your use case is well defined, SLMs can be the way to go. 

At H2O.ai, democratizing AI isn’t just an idea. It’s a movement. By releasing a series of small, foundational models that can be easily fine-tuned to specific tasks and are lightweight enough to run on your phone, we are expanding the freedom around the creation and use of AI.

H2O-Danube2-1.8B builds upon the success of its predecessor, H2O-Danube-1.8B, with notable upgrades and optimizations that bring it to the forefront of the 2B SLM category. We make all models openly available under the Apache 2.0 license, further democratizing LLMs economically to a wider audience.

What is H2O Danube2?

The H2O Danube series consists of compact foundational models, each containing 1.8 billion parameters. These models are designed to be lightweight and quick to respond, making them suitable for a broad array of use cases. Initially trained on 1 trillion tokens, the Danube models offer robust foundational capabilities. H2O Danube2 represents an incremental enhancement, having been trained on an additional 2 trillion tokens to refine and expand its learning.

H2O Danube2 has a context length of 8K tokens and can be fine-tuned to support a wide range of use cases, including the following (a minimal fine-tuning sketch follows the list):

  • Retrieval augmented generation

  • Open-ended text generation 

  • Brainstorming

  • Summarization

  • Table creation 

  • Data formatting 

  • Paraphrasing 

  • Chain of thought 

  • Rewrite 

  • Extraction 

  • Q&A

  • Chat
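
Below is a minimal sketch of what such a fine-tune could look like using Hugging Face transformers and peft (LoRA adapters); the dataset file, prompt format, and hyperparameters are illustrative assumptions rather than H2O's recipe, and H2O LLM Studio (covered below) offers a no-code route to the same outcome.

```python
# Minimal LoRA fine-tuning sketch for the base model. Assumes transformers,
# peft, datasets, and torch are installed; the dataset path and hyperparameters
# are placeholders for illustration only.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "h2oai/h2o-danube2-1.8b-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # ensure a pad token exists for batching
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Train small LoRA adapters instead of updating all 1.8 billion weights.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

# Any instruction-style dataset with a "text" column works; this file is hypothetical.
data = load_dataset("json", data_files="instructions.jsonl", split="train")
data = data.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=8192),
    remove_columns=data.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="danube2-lora",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        learning_rate=1e-4,
        bf16=True,
    ),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```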

 

In addition to the base foundational model, we have also released H2O Danube2 Chat. This model has been fine-tuned using open source H2O LLM Studio for chat-specific use cases. 

  • h2oai/h2o-danube2-1.8b-base: Base model which can be fine-tuned for a wide range of natural language understanding use cases.

  • h2oai/h2o-danube2-1.8b-chat: Instruction fine-tuned model for chat use cases.
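
As a quick illustration of the chat model in action, here is a minimal inference sketch using Hugging Face transformers; it assumes only that transformers and torch are installed and that the model id above resolves on the Hugging Face Hub, and the prompt is an arbitrary example.

```python
# Minimal inference sketch for the chat model using transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "h2oai/h2o-danube2-1.8b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

messages = [{"role": "user", "content": "Why might a team pick a small language model?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# The prompt plus the completion must fit within the 8K-token context window.
outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```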

 

What’s new in version 2?

The latest release of these foundational models focuses on improved accuracy, increasing performance on the Hugging Face Open LLM Leaderboard average by 9 percentage points and making it the top-performing model under 2 billion parameters. 

In addition to general improved performance, users will see better performance for long-context behavior as well. 

On the technical side, here is how Danube2 differs from and improves upon its predecessor: 

  • Improvements in long-context behavior: By removing the sliding window approach for attention, we effectively improve the retrieval capabilities in long context. H2O Danube2 models can handle a total of 8K tokens for input and output size combined. 

  • Leveraging the Mistral tokenizer: The choice of tokenizer is a crucial aspect of large language models, transforming and compressing the input text into the token representations consumed by the language model. We found that switching to the Mistral tokenizer improves downstream performance while keeping the same vocabulary size of 32,000 (see the small illustration after this list).

  • Improved Filtering: Better filtering and deduplication of training data by using advanced heuristics, as well as machine learning models (GBM and BERT) for predicting the quality of text and using predictions for filtering.

  • Data Curation Improvements: Significant improvements in underlying data curation, leading to a three-stage training of H2O-Danube2. At each stage, we gradually decrease the percentage of noisy web data in favor of higher quality data. 
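
To make the tokenizer point concrete, the short sketch below (assuming transformers is installed) loads the tokenizer shipped with the base model and shows how a sentence is compressed into token ids; the example sentence is arbitrary.

```python
# Small illustration of tokenization: text is compressed into integer token ids
# that the language model consumes. Assumes transformers is installed.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("h2oai/h2o-danube2-1.8b-base")
print(tokenizer.vocab_size)  # 32,000 entries, per the vocabulary size noted above

ids = tokenizer("The Danube flows through ten countries.")["input_ids"]
print(len(ids), ids)  # fewer tokens per sentence generally means cheaper inference
```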

For details, please refer to our Technical Report.

Conclusion

Small Language Models allow organizations to solve natural language problems without the resources needed to use Large Language Models. H2O Danube2 is here to help users create the tailored solutions they need faster and with fewer resources. 

Whether you're looking to integrate advanced AI into your business processes, develop new ways to interact with data, or simply explore the potential of machine learning, H2O Danube2 provides the tools necessary to transform ideas into reality. The model's improved accuracy signifies our commitment to delivering cutting-edge technology that not only meets but anticipates the needs of our users.

Available for download and integration today, H2O Danube2 is ready to help your company be an AI company. We invite developers, researchers, and businesses to explore the possibilities with our community of innovators who are as passionate about AI as you are.

Michelle Tanco, Head of Product

As the Head of Product at H2O.ai, Michelle Tanco’s primary focus lies in delivering a seamless user experience across machine learning applications. With a strong dedication to math and computer science, she is enthusiastic about leveraging these disciplines to address real-world challenges. Before joining H2O, she served as a Senior Data Science Consultant at Teradata, working on high-impact analytics projects to tackle business issues across various industries. Michelle holds a B.A. in Mathematics and Computer Science from Ursinus College. In her downtime, she enjoys spending quality time with her family and expressing her creativity by playing the bass and ukulele.

Philipp Singer

Philipp has won several Kaggle competitions and is, as of this writing, ranked 4th on the global competition leaderboard. Philipp obtained a Ph.D. in computer science with honors at the Technical University of Graz, where he also finished his Master's studies in Software Development and Business Management. During his Ph.D. and following post-doc, he published papers in seminal data science conferences and journals, including the best paper award at The Web Conference. Philipp loves utilizing his passion and skills in machine learning, statistics, and data science to enable data-driven decision making, and he is always eager to delve into new bleeding-edge fields and technologies. In his time off, he likes to do sports like playing football or going to the gym, as well as play video games or watch the newest TV shows.

Pascal Pfeiffer, Kaggle Grandmaster

Pascal is a Kaggle Competitions Grandmaster whose focus is the application of machine learning techniques to unstructured data in the areas of computer vision, natural language processing, and audio. Pascal holds a Ph.D. in electrical engineering. He has previously worked in the automotive industry, specializing in the field of automated driving.

Yauhen Babakhin

Yauhen holds a Master's Degree in Applied Data Analysis from the Belarusian State University. He has over 7 years of experience in Data Science having worked in the Banking, Gaming and eCommerce industries. Yauhen is a Kaggle competitions Grandmaster with a total of 10 gold medals in classic ML, NLP and CV competitions.