LLM DataStudio - V6.0 Release

By Nishaanthini Gnanavel | September 13, 2024

H2O LLM DataStudio is a no-code application created to streamline data preparation tasks for Large Language Models (LLMs). The tool features three main components: Curate, Prepare, and Custom Eval.

  1. Curate - Convert documents (PDFs, DOC files, and audio/video files) into question-answer pairs and summarization pairs.

  2. Prepare - Prepare datasets using various text preprocessing techniques relevant to LLMs, such as profanity and toxicity removal, padding, truncation, sensitive information removal, filtering, flattening, and deduplication.

  3. Custom Eval - Create your own evaluation datasets with different evaluation types (Question Type, Multichoice, Token Presence) from documents (PDFs, DOC & audio/video files) or datasets.

This release introduces several key features that give users extensive customization across the data preparation and curation pipelines.

1) Customization Options for QA Prompts in Curate

Curate offers a range of customization options to tailor QA prompts based on the user’s requirements, providing flexibility at three distinct levels:

  • Basic Customization: Persona Selection

    Users can choose from predefined persona types or create their own custom personas to define the tone and context of the generated answers. Options include:

    • Persona Type: Select a persona such as Teacher, Advocate, or Doctor, or design a custom persona to fit your unique needs.
  • Advanced Customization Options

    For more control, users can adjust the following aspects of the QA prompts:

    • Knowledge Prompt Options: Choose approaches like summarization, theme identification, or other forms of content analysis.
    • Question Type: Select from different types of questions such as factual, open-ended, or analytical.
    • Answer Length: Define the response length as short, medium, or long based on the desired depth of the answer.
    • Question Difficulty: Adjust the complexity of the questions, with options ranging from easy to hard, depending on the level of expertise required.
  • Expert Customization: Full Control Over Prompts

    For expert users, Curate provides an option to bring your own prompts (BYOP), allowing for complete control over the input and output of the QA system, enabling advanced and tailored solutions.
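As a rough illustration of how the basic and advanced options above could combine into a single QA prompt, here is a minimal Python sketch. The class `QAPromptOptions`, the function `build_qa_prompt`, and the prompt wording are all hypothetical and are not taken from LLM DataStudio; the tool's actual prompt templates may look quite different.

```python
from dataclasses import dataclass


@dataclass
class QAPromptOptions:
    """Hypothetical container for the customization options described above."""
    persona: str = "Teacher"          # e.g. Teacher, Advocate, Doctor, or a custom persona
    question_type: str = "factual"    # factual, open-ended, analytical
    answer_length: str = "short"      # short, medium, long
    difficulty: str = "easy"          # easy ... hard


def build_qa_prompt(context: str, opts: QAPromptOptions) -> str:
    """Assemble an illustrative prompt string from the selected options."""
    return (
        f"You are a {opts.persona}. "
        f"Write one {opts.difficulty} {opts.question_type} question about the text below, "
        f"followed by a {opts.answer_length} answer.\n\n"
        f"Text:\n{context}"
    )


prompt = build_qa_prompt(
    "Water boils at 100 degrees Celsius at sea level.",
    QAPromptOptions(persona="Doctor", difficulty="hard"),
)
print(prompt)
```

In this sketch, the expert (BYOP) level would correspond to skipping `build_qa_prompt` entirely and supplying the full prompt string yourself.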

2) Integration with H2OGPTe

Users can now use H2OGPTe for the ingestion pipeline, ingesting documents directly through H2OGPTe rather than relying on the custom DataStudio ingestion pipeline. This integration lets users take advantage of H2OGPTe's advanced document handling capabilities.

3) Customization Options for Relevance Score

Users can select from multiple approaches to calculate the relevance score between the generated questions and their respective contexts. These approaches include:

  • BERT Approach: Utilizes the BERT model to calculate the similarity score between the question and its context.

  • Regex Approach: Uses regular expressions to tokenize text and find common words between the question and its context. The relevance score is calculated as the ratio of unique matched words in the context to the total words in the question.

  • FinBERT Approach: Employs the FinBERT model to calculate the similarity score between the question and its context.
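To make the Regex Approach concrete, the following Python sketch computes a score as described above: unique question words found in the context, divided by the total number of words in the question. The function name `regex_relevance`, the `\w+` tokenizer, the lowercasing, and counting repeated question words in the denominator are assumptions for illustration; the tool's actual tokenization rules are not specified here.

```python
import re


def regex_relevance(question: str, context: str) -> float:
    """Share of question words that also appear in the context (illustrative)."""
    # Assumption: words are runs of word characters, compared case-insensitively.
    question_tokens = re.findall(r"\w+", question.lower())
    if not question_tokens:
        return 0.0
    context_words = set(re.findall(r"\w+", context.lower()))
    # Unique question words that also occur in the context.
    matched = set(question_tokens) & context_words
    # Assumption: the denominator counts every question token, repeats included.
    return len(matched) / len(question_tokens)


score = regex_relevance(
    "What is the capital of France?",
    "Paris is the capital of France.",
)
print(round(score, 3))  # 5 of the 6 question words appear in the context -> 0.833
```

A score near 1 means most of the question's vocabulary comes from its context. Unlike the BERT and FinBERT approaches, this measure only captures exact word overlap, so paraphrased questions score lower.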

4) New Steps in the Prepare Workflow

  • Bias Check

This step checks for potential bias in the text data. The text is analyzed to classify it as "BIASED" or "NEUTRAL" along with a confidence score. Only neutral or acceptably biased data is retained, ensuring a more balanced and fair dataset.

  • Data Anonymization

When the Data Anonymization option under Sensitive Info Check is enabled, this step masks sensitive information in the text to protect privacy. It identifies personal data such as email addresses, phone numbers, cryptocurrency keys, IBANs, IP addresses, and named entities, and replaces it with generic placeholders (for example, <EMAIL_ADDRESS> and <PHONE_NUMBER>).
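The two steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the bias labels and confidence scores are hard-coded stand-ins for a classifier's output, the 0.6 cutoff for "acceptably biased" is hypothetical, the regular expressions are deliberately simple, and the `<IP_ADDRESS>` placeholder name is assumed by analogy with the two placeholders named above. It does not reproduce how LLM DataStudio detects bias or sensitive entities.

```python
import re

# Hypothetical threshold: BIASED rows below this confidence count as "acceptably biased".
MAX_BIAS_CONFIDENCE = 0.6


def keep_row(label: str, confidence: float) -> bool:
    """Keep NEUTRAL rows and BIASED rows the classifier is not confident about."""
    if label == "NEUTRAL":
        return True
    return label == "BIASED" and confidence < MAX_BIAS_CONFIDENCE


# Order matters: the loose phone pattern would otherwise also match IP addresses.
PATTERNS = [
    ("<EMAIL_ADDRESS>", re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")),
    ("<IP_ADDRESS>", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("<PHONE_NUMBER>", re.compile(r"\+?\d[\d\s().-]{7,}\d")),
]


def anonymize(text: str) -> str:
    """Replace each matched entity with its generic placeholder."""
    for placeholder, pattern in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


# Stand-in rows; in practice the label and confidence would come from a bias classifier.
rows = [
    {"text": "Mail jane.doe@example.com for details.", "label": "NEUTRAL", "confidence": 0.93},
    {"text": "Call +1 415-555-0132 from 192.168.0.1 now.", "label": "BIASED", "confidence": 0.41},
    {"text": "Those people are always wrong.", "label": "BIASED", "confidence": 0.97},
]

cleaned = [anonymize(r["text"]) for r in rows if keep_row(r["label"], r["confidence"])]
print(cleaned)
```

Here the third row is dropped by the bias filter, and the remaining two rows have their email address, phone number, and IP address masked. Entities such as IBANs, cryptocurrency keys, and named entities would need dedicated detectors rather than simple patterns like these.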


5) New Workflow from Curate to QA Type in Prepare

The QA pairs generated by a project in the Curate component can be used to create a new QA type project in Prepare by selecting the "Question Answering" option under "Task Type".

6) Download Configurations Feature

Users can now download and save all the customized settings for each step of their Prepare projects.

 

To wrap up, the LLM DataStudio V6.0 release introduces key enhancements for user customization across data preparation and curation workflows. With new options for customizing QA prompts, integration with H2OGPTe, and multiple approaches to relevance scoring, users gain greater control over their processes. The release also adds new steps to the Prepare workflow and a streamlined workflow from Curate to QA type projects in Prepare, making LLM DataStudio a more effective tool for managing data preparation tasks.

H2O LLM DataStudio V6.0 Demo

Nishaanthini Gnanavel

Nishaanthini Gnanavel is a Machine Learning Engineer at H2O.ai. With a keen interest in the field of artificial intelligence and data science, Nishaanthini is committed to leveraging her skills and knowledge to contribute to innovative solutions in the industry. Connect with her on LinkedIn.

Genevieve Richards

Genevieve (Gen) is a data science practitioner with over 5 years of experience in AI use cases across different domains, in particular financial services. She holds two bachelor's degrees, in Arts (Linguistics & Psychology) and IT (Computer Science), from the University of Queensland and Queensland University of Technology respectively. She is passionate about AI democratization, Responsible AI, and AI for Social Good. Her LinkedIn profile can be found here.

Laksika Tharmalingam

Laksika Tharmalingam is a Machine Learning Engineer at H2O.ai. With a strong background in data science and engineering, she's dedicated to pushing the boundaries of AI for a brighter future. She is also a Kaggle Master. Connect with her on LinkedIn.

Prathushan Inparaj

Prathushan Inparaj is a Machine Learning Engineer at H2O.ai. With a strong passion for artificial intelligence and data science, Prathushan is dedicated to applying his expertise to develop cutting-edge solutions in the industry. Connect with him on LinkedIn.