
H2O AI Cloud, NLP, Technical

Improving NLP Model Performance with Context-Aware Feature Extraction

Published: October 08, 2021


Written by: Jo-Fai Chow

I would like to share a simple yet very effective trick to improve feature engineering for text analytics. After reading this article, you will be able to follow the exact steps and try it yourself using our H2O AI Cloud.

First, let’s look at the off-the-shelf natural language processing (NLP) recipes in H2O Driverless AI (one of our AI Cloud’s AutoML products). They include standard text transformations such as Term Frequency-Inverse Document Frequency (TF-IDF) as well as more complex ones such as Convolutional Neural Network (CNN), Bi-directional Gated Recurrent Unit (BiGRU), and Bidirectional Encoder Representations from Transformers (BERT). You can find the full list of available text transformers here.

Off-the-shelf NLP recipes in H2O Driverless AI

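To make the simplest of these recipes concrete, here is a minimal sketch of TF-IDF in plain Python. It only illustrates the idea and is not how Driverless AI implements it: the function name `tfidf`, the whitespace tokenizer, the unsmoothed `log(N / df)` weighting, and the three example tweets are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document (illustrative only)."""
    # Naive tokenizer: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # Term frequency within the document times inverse document frequency.
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up tweets, not rows from the real dataset.
tweets = [
    "flight delayed again",
    "great flight great crew",
    "lost my bag again",
]
scores = tfidf(tweets)
```

A word that is frequent in one tweet but rare across the others (such as "great" here) gets a high weight, while a word shared by several tweets (such as "flight") is weighted down. Note that the weights depend only on word counts, not on the surrounding context.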

So, in other words, we already have many general-purpose NLP recipes to cover the most common text analytics use cases. But we don’t stop there. We know that it is possible to further improve predictive performance with smart and, more importantly, domain-specific feature extraction. That’s why we make the NLP capabilities in Driverless AI extensible via custom recipes. We can leverage state-of-the-art NLP models from the research community and perform context-aware feature extraction with minimal effort in Driverless AI.

Let me show you how.

A Quick Tutorial – Airline Twitter Sentiment

The Airline Twitter Sentiment dataset was scraped in 2015, and contributors were asked to classify each tweet as positive, negative, or neutral. You can find out more about the dataset and download it from here. Out of the 20 columns available in the dataset, we are only interested in text (the single feature) and airline_sentiment (the target).

Airline Twitter Sentiment Dataset

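If you prefer to inspect the data outside the UI first, the following sketch keeps only the two columns we care about. It assumes the CSV header uses the column names text and airline_sentiment; the two sample rows and the other column names are made up for illustration, and a small in-memory string stands in for the downloaded file.

```python
import csv
import io

# Tiny in-memory stand-in for the downloaded CSV; the rows are made up
# and the real file has 20 columns.
sample_csv = io.StringIO(
    "tweet_id,airline_sentiment,airline,text\n"
    "1,negative,United,@united my flight was delayed again\n"
    "2,positive,Delta,@delta thanks for the upgrade!\n"
)

# Keep the single feature (text) and the target (airline_sentiment).
rows = [
    {"text": row["text"], "airline_sentiment": row["airline_sentiment"]}
    for row in csv.DictReader(sample_csv)
]
```

With the real file, you would replace `sample_csv` with an open file handle for the downloaded CSV.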

Step 1 – Split the Data

Follow these steps to import the airline dataset into Driverless AI. Since the Airline Twitter Sentiment dataset is just a single CSV without a dedicated test dataset, we can split the dataset into airline_train and airline_test using the dataset splitter as shown below.

Dataset splitter interface in Driverless AI
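The dataset splitter is a point-and-click tool, but the idea behind it is an ordinary random split. The sketch below shows that idea in plain Python. The helper name `split_rows`, the 80/20 ratio, and the seed are assumptions for illustration, and this simple version does not stratify by the target.

```python
import random


def split_rows(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and cut it into (train, test) lists."""
    shuffled = list(rows)
    # A dedicated Random instance keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy stand-in for the (text, airline_sentiment) pairs in the real CSV.
rows = [(f"tweet {i}", "negative" if i % 2 else "positive") for i in range(10)]
airline_train, airline_test = split_rows(rows)
```

Fixing the seed matters here: it lets us compare the baseline and the improved models on exactly the same held-out rows.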

Step 2 – Build a Baseline Model

Now we are ready to train our first model using airline_train and then evaluate its out-of-sample performance on the held-out airline_test. For the first baseline model, we are going to leave most settings as default. Since we are only using the text column as a single feature for this exercise, we need to remove the rest (see dropped columns settings below) before we launch the experiment.

Driverless AI model training settings for the baseline model


Remember to drop everything but `text` in dropped columns setting


As we haven’t switched on complex text transformation (e.g. CNN, BiGRU, BERT), the transformed features from this simple experiment are all TF-IDF-based. We can certainly improve this baseline model with more complex transformation so let’s move on to the next step.

The most important features for the baseline model are TF-IDF-based text features


Step 3 – Improve the Baseline with CNN and BiGRU Feature Transformation

In order to switch on more complex text transformation, we need to change two values in expert settings as shown below. This will activate word-based CNN and BiGRU text transformation in the automatic feature engineering pipeline. As a result, we can see that the dominant features in the experiment are created based on CNN and BiGRU (instead of TF-IDF-based features in the baseline model). We can also see an improvement in model performance (i.e. lower logloss and error rate). Can we further improve this? Read on.

Enable word-based CNN and BiGRU models in NLP expert settings


New Features from CNN and BiGRU lead to better predictive performance

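Since we compare experiments by logloss and error rate, it helps to see how these two scores are computed for a three-class problem. This is a minimal sketch rather than the Driverless AI scorer: the function names, the clipping constant `eps`, and the toy predictions are all assumptions.

```python
import math


def log_loss(y_true, probabilities, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(y_true, probabilities):
        # Clip to avoid log(0) when a model is confidently wrong.
        p = min(max(probs[label], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(y_true)


def error_rate(y_true, probabilities):
    """Fraction of rows whose most probable class is not the true class."""
    wrong = sum(
        1 for label, probs in zip(y_true, probabilities)
        if max(probs, key=probs.get) != label
    )
    return wrong / len(y_true)


# Toy predictions for three tweets (made-up numbers).
y_true = ["negative", "positive", "neutral"]
probabilities = [
    {"negative": 0.8, "neutral": 0.1, "positive": 0.1},
    {"negative": 0.2, "neutral": 0.3, "positive": 0.5},
    {"negative": 0.6, "neutral": 0.3, "positive": 0.1},
]
```

Error rate only checks whether the top class is right, whereas logloss also penalizes low confidence in the correct class. In this toy example the third tweet is misclassified, so it hurts both scores.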

Enter the Hugging Face Model Hub

Before we get to the next step, let me introduce a fantastic platform called Hugging Face. Here is the statement on their website:

“We are helping the community work together towards the goal of advancing Artificial Intelligence. Not one company, even the Tech Titans, will be able to “solve AI” by themselves – the only way we’ll achieve this is by sharing knowledge and resources. On the Hugging Face Hub we are building the largest collection of models, datasets and metrics in order to democratize and advance AI for everyone. The Hugging Face Hub works as a central place where anyone can share and explore models and datasets.” (Source)

For our Airline Twitter Sentiment exercise, we are going to find a relevant transformer on Hugging Face so that we can perform better feature extraction than those from the general-purpose text transformers in Driverless AI.

Find out more on Hugging Face’s website


Step 4 – Find a Domain-Specific Transformer

From a quick search on Hugging Face using the keyword twitter, we can find the twitter-roberta-base-sentiment model from the Cardiff NLP group. The model was trained on many different tweets. That sounds relevant to our use case here, so let’s give it a try!

Searching for domain-specific models on Hugging Face


Example outputs of the twitter-roberta-base-sentiment model that can be used as new features

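If you would like to see these outputs outside Driverless AI, the sketch below scores a single tweet with the Hugging Face `transformers` library. It assumes `transformers` and `torch` are installed and that the model can be downloaded on first use. The helper names are hypothetical, and the label order follows the model card (LABEL_0 = negative, LABEL_1 = neutral, LABEL_2 = positive). This is not what the custom recipe does internally; it only shows what kind of numbers the model produces.

```python
import math

MODEL_NAME = "cardiffnlp/twitter-roberta-base-sentiment"
# Label order as documented on the model card (LABEL_0, LABEL_1, LABEL_2).
LABELS = ["negative", "neutral", "positive"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sentiment_features(text):
    """Return {label: probability} for one tweet.

    The heavy imports are local so the helper above works without them;
    calling this function downloads the model on first use.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return dict(zip(LABELS, softmax(logits)))
```

Calling `sentiment_features` on a tweet returns three probabilities, one per sentiment class. Numbers like these are what a downstream model can consume as new features.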

Step 5 – Extract Context-Aware Features with the Twitter-Roberta-based Transformer

Now, this is the most important step. If you get this right, you will be able to import many more transformers from Hugging Face.

First, we need to write a simple Python script that imports the twitter-roberta-base-sentiment transformer into Driverless AI. Let’s call this script TwitterRobertaTransformer.py. The two things to change in this script are MODEL_NAME and the class name. Replace them to match another model on Hugging Face and you will be able to import many other transformers into Driverless AI.

from h2oaicore.systemutils import config
from h2oaicore.transformer_utils import CustomTransformer
from h2oaicore.transformers_nlp import BERTTransformer

# Hugging Face model ID; change this to import a different transformer
MODEL_NAME = 'cardiffnlp/twitter-roberta-base-sentiment'


class TwitterRoberta(BERTTransformer, CustomTransformer):
    _mojo = False

    @staticmethod
    def get_default_properties():
        # Operate on exactly one text column
        return dict(col_type="text",
                    min_cols=1,
                    max_cols=1,
                    relative_importance=1)

    @staticmethod
    def get_parameter_choices():
        # Reuse the batch size and padding length from the Driverless AI config
        return dict(model_type=[MODEL_NAME],
                    batch_size=[config.pytorch_nlp_fine_tuning_batch_size],
                    seq_length=[config.pytorch_nlp_fine_tuning_padding_length])

Once we have the script ready, we can go to the recipes tab in expert settings and upload the script as shown below. You will also need to enable it by selecting TwitterRoberta in the specific transformers setting. After that, you should be able to see TwitterRoberta in the feature engineering search space.

Adding new feature transformation via custom recipe


Twitter-Roberta-based transformation is now available for the feature engineering pipeline


As expected, we can get better predictive performance with domain-specific features from the twitter-roberta-base-sentiment model.

Twitter-Roberta-based features further improve the predictive performance


Quick Recap

In short, we start with a simple baseline model using the standard text transformations like TF-IDF and then improve the performance with CNN/BiGRU feature transformations. In order to perform context-aware and domain-specific feature extraction, we import the twitter-roberta-base-sentiment transformer and further improve the model performance.

Comparing model performance based on various text transformations

(score = logloss, lower = better)

Your Turn to Try!

It is possible to improve the model even further (see screenshot below). I am not going to reveal the exact procedure but I am sure you can figure it out fairly quickly. Here are a few hints:

Can we switch on other BERT transformers that come with Driverless AI?
What if we try different accuracy/time/interpretability settings? This leaderboard feature may help.
Can we mix and match other transformers from Hugging Face?

Mix and match different text transformers. Yes, you can do better than this!


Key Takeaways

With custom recipes, it is possible to extend and improve text transformation in Driverless AI using state-of-the-art models from the AI community. Thus, we already have the technology in place to future-proof our automatic feature engineering pipeline. We are excited to see what our users can do with different transformers. For example, could you extract predictive features with BioBERT for health care use cases? Could you get a competitive edge in the stock market with features from FinBERT? The possibilities are endless. We hope that our technology will enable our users to benefit from the latest transformers with minimal effort for many years to come.

How to Get Started?

H2O AI Cloud is the best way to get free, hands-on experience. No installation. All you need is a web browser. Request a demo today.

Credits

The advanced text analytics feature discussed in this article is brought to you by Sudalai Rajkumar, Maximilian Jeblick, and Trushant Kalyanpur.

Jo-Fai Chow

Jo-fai (or Joe) has multiple roles (data scientist / evangelist / community manager) at H2O.ai. Since joining the company in 2016, Joe has delivered H2O talks/workshops in 40+ cities around Europe, US, and Asia. Nowadays, he is best known as the H2O #360Selfie guy. He is also the co-organiser of H2O's EMEA meetup groups including London Artificial Intelligence & Deep Learning - one of the biggest data science communities in the world with more than 11,000 members.
