H2O LLM DataStudio is a no-code application created to streamline data preparation tasks for Large Language Models (LLMs). The tool features three main components: Curate, Prepare, and Custom Eval.
Curate - Convert documents (PDFs, DOC & audio/video files) into question-answer pairs and summarization pairs.
Prepare - Prepare datasets using various text preprocessing techniques relevant to LLMs, such as profanity and toxicity removal, padding, truncation, sensitive information removal, filtering, flattening, and deduplication.
Custom Eval - Create your own evaluation datasets with different evaluation types (Question Type, Multichoice, Token Presence) from documents (PDFs, DOC & audio/video files) or datasets.
This release introduces several key features that allow extensive user customization across the data preparation and curation pipelines.
1) Customization Options for QA Prompts in Curate
Curate offers a range of customization options to tailor QA prompts based on the user’s requirements, providing flexibility at three distinct levels:
Basic Customization: Persona Selection
Users can choose from predefined persona types or create their own custom personas to define the tone and context of the generated answers. Options include:
Advanced Customization Options
For more control, users can adjust the following aspects of the QA prompts:
Expert Customization: Full Control Over Prompts
For expert users, Curate provides an option to bring your own prompts (BYOP), giving complete control over the input and output of the QA system.
2) Integration with H2OGPTe
Users can now use H2OGPTe for the ingestion pipeline, ingesting documents directly through H2OGPTe rather than relying on the custom DataStudio ingestion pipeline. This integration lets users leverage H2OGPTe's advanced document handling capabilities.
3) Customization Options for Relevance Score
Users can select from multiple approaches to calculate the relevance score between the generated questions and their respective contexts. These approaches include:
BERT Approach: Utilizes the BERT model to calculate the similarity score between the question and its context.
Regex Approach: Uses regular expressions to tokenize text and find common words between the question and its context. The relevance score is calculated as the ratio of unique matched words in the context to the total words in the question.
FinBERT Approach: Employs the FinBERT model to calculate the similarity score between the question and its context.
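As an illustration of the Regex Approach described above, here is a minimal Python sketch. It is not DataStudio's actual implementation: the function name `regex_relevance`, the `\w+` tokenization pattern, and the lowercasing step are all assumptions made for the example.

```python
import re


def regex_relevance(question: str, context: str) -> float:
    """Hypothetical sketch of a regex-based relevance score.

    Score = unique question words that also appear in the context,
    divided by the total number of words in the question.
    """
    # Assumption: words are runs of word characters, compared case-insensitively.
    question_words = re.findall(r"\w+", question.lower())
    context_words = set(re.findall(r"\w+", context.lower()))
    if not question_words:
        return 0.0
    matched = set(question_words) & context_words
    return len(matched) / len(question_words)


score = regex_relevance(
    "What is the capital of France?",
    "Paris is the capital of France.",
)
```

In this example five of the six question words ("is", "the", "capital", "of", "france") occur in the context, so the score is 5/6. A score near 1 suggests the question stays close to the wording of its context; a real implementation might also remove stop words, which this sketch does not.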
4) Steps in Prepare workflow
The bias check step analyzes the text data for potential bias, classifying each text as "BIASED" or "NEUTRAL" along with a confidence score. Only neutral or acceptably biased data is retained, ensuring a more balanced and fair dataset.
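The retention rule can be pictured with a short Python sketch. Everything here is hypothetical: the `BiasResult` type, the `filter_biased` function, and the 0.7 threshold are invented for illustration, and the sketch assumes that "acceptably biased" means a BIASED label whose confidence falls below a user-chosen threshold. The classifier itself is not shown.

```python
from dataclasses import dataclass


@dataclass
class BiasResult:
    text: str
    label: str    # "BIASED" or "NEUTRAL", as produced by some classifier
    score: float  # classifier confidence in the label, between 0 and 1


def filter_biased(results, max_biased_confidence=0.7):
    """Keep NEUTRAL texts and BIASED texts below the confidence threshold."""
    kept = []
    for result in results:
        if result.label == "NEUTRAL" or result.score < max_biased_confidence:
            kept.append(result.text)
    return kept


sample = [
    BiasResult("The report lists quarterly revenue.", "NEUTRAL", 0.93),
    BiasResult("Only fools would buy this product.", "BIASED", 0.95),
    BiasResult("This option is arguably the better one.", "BIASED", 0.55),
]
retained = filter_biased(sample)
```

With the assumed threshold of 0.7, the first and third texts are retained and the second is dropped.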
When the Data Anonymization option under Sensitive Info Check is enabled, this step masks sensitive information in the text to protect privacy. It identifies personal data such as email addresses, phone numbers, cryptocurrency keys, IBANs, IP addresses, and named entities, and replaces them with generic placeholders (for example, <EMAIL_ADDRESS> and <PHONE_NUMBER>).
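To make the masking idea concrete, here is a simplified Python sketch that covers only three of the listed data types. It is an illustration, not the tool's implementation: the regular expressions are deliberately rough, the `<IP_ADDRESS>` placeholder name is an assumption (only <EMAIL_ADDRESS> and <PHONE_NUMBER> appear in this post), and named entities, IBANs, and cryptocurrency keys would need dedicated recognizers.

```python
import re

# Order matters: IP addresses are masked before phone numbers so the loose
# phone pattern does not swallow dotted digit groups.
PATTERNS = [
    ("<EMAIL_ADDRESS>", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("<IP_ADDRESS>", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),  # assumed placeholder name
    ("<PHONE_NUMBER>", re.compile(r"\+?\d[\d\s().-]{7,}\d")),
]


def anonymize(text: str) -> str:
    """Replace each matched span with its generic placeholder."""
    for placeholder, pattern in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


masked = anonymize(
    "Contact jane.doe@example.com or +1 415-555-0100 from 192.168.0.1."
)
```

Here `masked` becomes "Contact <EMAIL_ADDRESS> or <PHONE_NUMBER> from <IP_ADDRESS>." Because each placeholder names the type of data it replaced, the masked text keeps its structure for downstream training.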
5) New workflow from Curate to QA type in Prepare
The QA pairs generated using a project in the Curate component can be used to create a new QA type project in Prepare by selecting the “Question Answering” option under "Task Type".
6) Download Configurations Feature
Users can now download and save all the customized settings for each step of their Prepare projects.
To wrap up, the LLM DataStudio V6.0 release introduces key enhancements for user customization across data preparation and curation workflows. With new options for customizing QA prompts, integration with H2OGPTe, and multiple approaches to relevance scoring, users gain greater control over their processes. The release also adds new steps to the Prepare workflow, a streamlined workflow from Curate to QA-type projects in Prepare, and downloadable configurations, making LLM DataStudio a more effective tool for managing data preparation tasks.
H2O LLM DataStudio V6.0 Demo