H2O LLM DataStudio: V4.1 Release

By Nishaanthini Gnanavel | January 16, 2024

H2O LLM DataStudio is a comprehensive no-code application designed to simplify data preparation tasks for Large Language Models (LLMs). This tool comprises three key components: Curate, Prepare, and Augment.

  1. Curate - Convert documents (PDF, DOC, and audio/video files) into question-answer pairs and summarization pairs.
  2. Prepare - Prepare datasets using various text preprocessing techniques relevant to LLMs, such as profanity and toxicity removal, padding, truncation, sensitive information removal, filtering, flattening, and deduplication.
  3. Augment - Integrate your data with external datasets to enhance their richness and reduce biases. 

Part one of the H2O LLM DataStudio blog series introduced streamlined no-code data preparation pipelines for LLMs and LLM DataStudio itself. Part two introduced LLM DataStudio Curate.

V4.1 Release

We recently released V4.1 of LLM DataStudio to our user community. This version expands the existing capabilities with more customization and more options for your data preparation pipelines. Take a look at the highlights below:

Update 1: Curate your Summarization Pairs

We’ve extended our dataset curation capabilities to generate context-summarization pairs, so you can curate a dataset for another LLM fine-tuning workflow. The workflow uses the same smart chunking and prompting techniques as question-answer curation to generate article-summary pairs, which can then be passed on to Prepare pipelines and LLM Studio for fine-tuning.

The tool can also generate a summary paragraph for a whole document.
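
The chunk-then-prompt flow described above can be sketched roughly as follows. This is a minimal illustration, not the product's implementation: the word-based overlapping chunker is a naive stand-in for "smart chunking", and `chunk_words`, `build_summary_prompt`, `curate_summary_pairs`, and the `summarize` callable are all hypothetical names introduced for this example.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into overlapping word-based chunks (naive stand-in for smart chunking)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


def build_summary_prompt(chunk):
    """Hypothetical prompt template; the actual prompts used by the tool are not shown here."""
    return f"Summarize the following passage in two or three sentences:\n\n{chunk}"


def curate_summary_pairs(text, summarize, chunk_size=200, overlap=20):
    """Return one {'article', 'summary'} record per chunk.

    `summarize` is any callable that maps a prompt string to a summary string,
    for example a thin wrapper around an LLM client.
    """
    return [
        {"article": chunk, "summary": summarize(build_summary_prompt(chunk))}
        for chunk in chunk_words(text, chunk_size, overlap)
    ]
```

Passing the model call in as a plain callable keeps the sketch independent of any particular LLM client, so the same loop could sit in front of whichever backend is configured.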

Update 2: Multi-language support for dataset curation

We’ve enhanced the dataset curation component with multi-language support, including English, Spanish, French, German, Italian, and Portuguese. This feature lets users diversify their datasets for various language models by generating question-answer pairs and summarization pairs from datasets in those languages.

Update 3: H2OGPTe and Gradio client Integration

We’ve extended the dataset curation capabilities by integrating H2OGPTe and the Gradio Client into LLM DataStudio. Users can now use their own H2OGPTe credentials or opt for the Gradio Client with a secure token, which ensures that the essential Llama2 family of models is available to the dataset curation component. Users can add or modify these credentials on the application’s settings page.

Update 4: More exciting Curate Capabilities

  • Document type specification - For further customization of your curated question-answer pairs, there is a new option to provide a Document Type. This input guides the LLM to maintain the style of the document and to understand its terminology.
  • Relevance score - Users were highly interested in a metric that quantifies how well the generated question-answer pairs reflect the provided document. We’ve now added a Relevance column to our QA projects. This column indicates the relevance of each generated pair to its context using cosine similarity. You can use it to mark pairs below a threshold as irrelevant and filter them out when the dataset is imported into your Prepare pipeline.
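
To make the relevance idea concrete, here is a small sketch of scoring question-answer pairs against their context with cosine similarity and labeling those below a threshold. It rests on assumptions: the post does not say which text representation the tool uses, so a simple bag-of-words count vector stands in for it, and the function names and the `0.3` threshold are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a simple stand-in for whatever embedding the tool uses."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_pairs(context, qa_pairs, threshold=0.3):
    """Attach a relevance score and a relevant/irrelevant label to each (question, answer) pair."""
    context_vec = bag_of_words(context)
    scored = []
    for question, answer in qa_pairs:
        relevance = cosine_similarity(context_vec, bag_of_words(f"{question} {answer}"))
        scored.append({
            "question": question,
            "answer": answer,
            "relevance": round(relevance, 3),
            "label": "relevant" if relevance >= threshold else "irrelevant",
        })
    return scored
```

The resulting `label` field is the kind of categorical column that a downstream filter step can use to drop irrelevant pairs.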

Update 5: Data Preparation Enhancements

We’ve added some new capabilities in our drag-and-drop data preparation pipelines. These include:

  • Filter by column - For all problem types, we’ve added a filter that lets you filter rows from the dataset based on the values of a categorical column. It was added to make it easy to filter out curation pairs that are marked as irrelevant.
  • Sensitive information removal - You can now choose which kinds of sensitive information to remove from your dataset. The following are supported:
    • Email Address
    • Phone Number
    • Crypto wallet number
    • International Bank Account Number (IBAN)
    • IP Address
    • Named entities
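
For readers who want a feel for what these two steps do under the hood, the sketch below filters rows on a categorical column and redacts a configurable subset of sensitive patterns. It is an illustration only, not the tool's code: the regular expressions are deliberately simplified (IPv4 only, IBANs without spaces), phone numbers, crypto wallet numbers, and named entities are left out, and `PATTERNS`, `redact`, and `filter_rows` are hypothetical names.

```python
import re

# Simplified patterns for illustration; production-grade detection needs stricter rules.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}


def redact(text, kinds=("email", "ipv4", "iban")):
    """Replace each selected kind of sensitive value with a placeholder such as [EMAIL]."""
    for kind in kinds:
        text = PATTERNS[kind].sub(f"[{kind.upper()}]", text)
    return text


def filter_rows(rows, column, keep_values):
    """Keep only the rows whose value in `column` is one of `keep_values`."""
    keep = set(keep_values)
    return [row for row in rows if row.get(column) in keep]


if __name__ == "__main__":
    rows = [
        {"answer": "Write to jane.doe@example.com", "label": "relevant"},
        {"answer": "Server at 192.168.0.1", "label": "irrelevant"},
    ]
    kept = filter_rows(rows, "label", ["relevant"])
    for row in kept:
        print(redact(row["answer"]))
```

The `kinds` argument mirrors the idea of customizing which sensitive information is removed: passing `kinds=("email",)` leaves IP addresses and IBANs untouched.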

LLM DataStudio Demo

What’s Next?

To wrap up, H2O LLM DataStudio provides a consolidated solution for preparing data for Large Language Models. Because you can curate datasets from unstructured data and then continue dataset creation with no-code preparation pipelines, data preparation for LLMs becomes a smooth task.

In a world where data powers everything, LLM DataStudio is an easy-to-use solution that removes the headache of curating and preparing data. Stay tuned as we continue to democratize LLM data preparation for all organizations. For more information, please contact cloud-feedback@h2o.ai.

Want to learn how to use LLM DataStudio? Take a look at our step-by-step training guide here:

Nishaanthini Gnanavel

Nishaanthini Gnanavel is a Machine Learning Engineer Intern at H2O.ai. With a keen interest in artificial intelligence and data science, Nishaanthini is committed to applying her skills and knowledge to innovative solutions in the industry. Connect with her on LinkedIn.

Genevieve Richards

Genevieve (Gen) is a data science practitioner with over 5 years of experience in AI use cases across different domains, in particular financial services. She holds two bachelor's degrees, in Arts (Linguistics & Psychology) and IT (Computer Science), from the University of Queensland and Queensland University of Technology respectively. She is passionate about AI democratization, Responsible AI, and AI for Social Good. Her LinkedIn profile can be found here.

Tarique Hussain

Tarique completed his B.Tech at IIT-Kharagpur in 2019. He has more than 4 years of experience solving DS/ML use cases across verticals such as Healthcare, BFSI, FMCG, and Manufacturing. He loves keeping up with new developments in data science and enjoys taking part in Kaggle competitions. Currently, Tarique is authoring a comprehensive course on Distributed Machine Learning, and he has mentored individuals in data science through the KaggleX BIPoC mentorship program. His LinkedIn profile can be found here.