A no-code application and toolkit to streamline data preparation tasks related to Large Language Models (LLMs)
H2O LLM DataStudio is a no-code application designed to streamline data preparation tasks specifically for Large Language Models (LLMs). It offers a comprehensive range of preprocessing and preparation functions such as text cleaning, text quality detection, tokenization, truncation, and the ability to augment the dataset with external and RLHF datasets. The processes and workflows include: question-answer datasets, instruction-response datasets, continue pertaining, human bot conversations etc.
The importance of cleaned data in NLP downstream tasks after fine-tuning lies in its ability to improve the reliability, performance, and ethical considerations of the models. The benefits include – Enhanced Model Performance, Reduced Bias and Unwanted Influences, Consistency and Coherence, Improved Generalization, Ethical Considerations, User Experience, and Trust.
To ensure the quality of datasets is met, one of the first steps in building your own LLM or working with LLMs is the data preparation process. It starts with the ingestion of datasets from various sources, tailored to serve a specific purpose such as continued pretraining, instruct tuning, chatbot development, or RLHF protection. This is followed by an intensive data cleaning stage, where subpar data segments are meticulously identified and eliminated, ensuring the removal of potential noises such as lengthy sequences of spaces or anomalous characters that could disrupt subsequent analysis or modeling efforts. In the pursuit of data quality assurance, a strict data quality checking procedure needs to be implemented, employing advanced techniques like BLEU/METEOR/similarity or RLHF reward models to detect and filter out data of inferior quality. Further refining measures, including length-based filtering to distinguish between concise and extensive answers, and profanity checks to maintain data appropriateness, needs to be applied at this step. This comprehensive process needs to be mapped out in a multi-step workflow, providing a clear visualization of each phase. Upon completion of these refinement procedures, the polished data can be exported in user-preferred formats, facilitating seamless integration with other tools such as the fine-tuning studio.
H2O LLM DataStudio is equipped with all these procedures and provides a smooth experience to users in preparing the datasets for LLM Fine tuning and other related tasks.
LLM DataStudio supports a wide range of workflows and task types, offering the necessary tools to simplify and streamline your data preparation efforts. This includes:
LLM DataStudio supports a multitude of functions to facilitate the preparation of datasets for various task types. The primary goal is to structure data optimally for maximal model performance. Here is an overview of the key functions available:
Create a project, ingest datasets, define the workflow – select the techniques to be applied to the datasets:
Configure each element accordingly with different user inputs such as cleaning thresholds, text grade thresholds, profanity thresholds, etc.
Final output, comparison, and insights in a desired format:
The no-code interface can be used as a web application in H2O Cloud, or it can be run as an API, used as CLI, or a Python package. LLM DataStudio also offers the API to work directly with Python. The necessary steps include:
Following is an example of using llm-data-preparation for a Question Answering type project.
# Import modules
from datastudio
import prep import pandas as pd
# Initializing configuration
cfg = {
"type": "qa",
'config': {
'augmentation': [],
"text_clean": {
"cols": [],
"funcs": ["new_line","whitespace","urls","html"]
},
'length_checker': {
'max_answer_length': 10000,
'max_context_length': 10000,
'max_question_length': 10000,
'min_answer_length': 0,
'min_context_length': 0,
'min_question_length': 0
},
'padding': {'max_length': '600'},
'profanity_checker': 0.3,
'quality_checker': {'max_grade': 30, 'min_grade': 15},
'relevance_checker': 0.15,
'truncate': {'max_length': '10000', 'ratio': 0.15}
},
'datasets': [{
'cols': { 'answer': 'answer', 'context': 'context', 'question': 'question'},
'path': ['input/qna.csv']
}],
'is_hum_bot': 0,
"selected_aug_dataset": 0
}
# Executing pipeline
output_df = prep.prepare_dataset(cfg)
output_df.head()
This script is set up to load a QnA dataset, preprocess and clean the text data, check and filter the data based on length and relevance criteria, and finally, pad and truncate the sequences. The resulting processed data frame is then displayed in the notebook. The final datasets obtained from LLM DataStudio can be pushed to different tools such as H2O LLM Studio for fine-tuning and making your own LLM Models
To wrap up, LLM DataStudio is a fantastic tool that makes preparing data for Large Language Models a lot easier. It’s packed with features like cleaning text, checking data quality, adding more data, and arranging data into a specific format. It’s great at handling many kinds of jobs like creating question-answer pairs, summarizing text, making bots more responsive to instructions, improving bot-human chats, and training language models further. Whether you’re a pro coder or not, anyone can use LLM DataStudio. It’s got a Python API for coders and a no-code web interface for everyone else. What’s more, it’s really careful with data quality and user privacy. It checks for bad words, private information, and repeated data, so users can trust their datasets are high-quality, respectful, and ready for training their language models. In a world where data powers everything, LLM DataStudio is an easy-to-use solution that takes the headache out of preparing data. It changes data preparation from a tough task to an easy, smooth process. For any queries, reach out to our team. Also stay tuned for the next version release with Document to Q:A pair curation capability.