H2O LLM DataStudio is a comprehensive no-code application designed to simplify data preparation tasks for Large Language Models (LLMs). This tool comprises three key components: Curate, Prepare, and Augment.
Part one of the H2O LLM DataStudio blog introduced LLM DataStudio and its streamlined no-code data preparation pipelines for LLMs. Part two introduced LLM DataStudio Curate.
We recently released V4.1 of LLM DataStudio to our user community. This version expands the existing capabilities with more customizations and options for your data preparation pipelines. Take a look at our highlight reel below:
We’ve now extended our dataset curation capabilities to generate context-summarization pairs. This allows the user to curate a dataset for another LLM fine-tuning workflow. The workflow uses the same smart chunking and prompting techniques to generate article-summary pairs, which can then be passed on to Prepare pipelines and LLM Studio for fine-tuning.
The tool can also generate a summary paragraph for a whole document.
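To make the idea of chunk-level article-summary pairs concrete, here is a minimal Python sketch. It is not how LLM DataStudio is implemented: the function names (`chunk_text`, `build_summary_pairs`, `first_sentence`), the paragraph-based chunking rule, and the `context`/`summary` record keys are all assumptions made for illustration, and the placeholder summarizer simply stands in for a call to an LLM.

```python
from typing import Callable, Dict, List


def chunk_text(text: str, max_chars: int = 500) -> List[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    A chunk grows until adding the next paragraph would exceed max_chars.
    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


def build_summary_pairs(
    text: str,
    summarize: Callable[[str], str],
    max_chars: int = 500,
) -> List[Dict[str, str]]:
    """Return one {"context", "summary"} record per chunk of the document."""
    return [
        {"context": chunk, "summary": summarize(chunk)}
        for chunk in chunk_text(text, max_chars)
    ]


def first_sentence(chunk: str) -> str:
    """Hypothetical placeholder summarizer: keeps only the first sentence.

    A real pipeline would send the chunk to an LLM with a summarization prompt.
    """
    return chunk.split(". ")[0].rstrip(".") + "."


document = "Alpha one. Alpha two.\n\nBeta one. Beta two."
pairs = build_summary_pairs(document, first_sentence, max_chars=25)
for pair in pairs:
    print(pair)
```

Passing the summarizer in as a function keeps the chunking logic independent of whichever model produces the summaries. A whole-document summary, as described above, would correspond to calling the summarizer once on the full text instead of on each chunk.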
We’ve enhanced the dataset curation component with multi-language support, including English, Spanish, French, German, Italian and Portuguese. This feature lets users diversify their datasets for various language models by generating Question and Answer pairs and Summarization pairs from datasets in those languages.
We’ve extended the dataset curation capabilities by integrating H2OGPTe and Gradio Client into LLM DataStudio. Users can now use their own H2OGPTe credentials or opt for the Gradio Client with a secure token, which ensures that the Llama2 family of models required by the dataset curation component is available. Users can add or modify these credentials on the application’s settings page.
We’ve added some new capabilities in our drag-and-drop data preparation pipelines. These include:
To wrap up, H2O LLM DataStudio is an essential tool that provides a consolidated solution for preparing data for Large Language Models. Because users can curate datasets from unstructured data and then continue building them with no-code preparation pipelines, data preparation for LLMs becomes a smooth task.
In a world where data powers everything, LLM DataStudio is an easy-to-use solution that removes the headache of curating and preparing data. Stay tuned as we continue to democratize LLM data preparation for all organizations. For more information, please contact cloud-feedback@h2o.ai.
Want to learn how to use LLM DataStudio? Take a look at our Step-by-Step training guide here: