I would like to share a simple yet very effective trick to improve feature engineering for text analytics. After reading this article, you will be able to follow the exact steps and try it yourself using our H2O AI Cloud.
First of all, let’s have a look at the off-the-shelf natural language processing (NLP) recipes in H2O Driverless AI (one of our AI Cloud’s AutoML products). We have some standard text transformation recipes like Term Frequency-Inverse Document Frequency (TF-IDF) as well as more complex ones like Convolutional Neural Network (CNN), Bi-directional Gated Recurrent Unit (BiGRU), and Bidirectional Encoder Representations from Transformers (BERT). You can find the full list of available text transformers here.
In other words, we already have many general-purpose NLP recipes to cover the most common text analytics use cases. But we don’t stop there. We know that it is possible to further improve predictive performance with smart and, more importantly, domain-specific feature extraction. That’s why we make the NLP capabilities in Driverless AI extensible via custom recipes. We can leverage state-of-the-art NLP models from the research community and perform context-aware feature extraction with minimal effort in Driverless AI.
Let me show you how.
The Airline Twitter Sentiment dataset was scraped in 2015, and contributors were asked to classify tweets as positive, negative, or neutral. You can find out more about the dataset and download it from here. Out of the 20 columns available in the dataset, we are only interested in the text column (the single feature) and the airline_sentiment column (the target).
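If you want to inspect the data outside Driverless AI first, the column selection can be sketched in plain Python. This is only an illustration: the two column names come from the article, while the sample rows, the other column names, and the helper name select_columns are made up for the example.

```python
import csv
import io

# Column names as given in the article.
FEATURE_COL = "text"
TARGET_COL = "airline_sentiment"


def select_columns(csv_file, columns=(FEATURE_COL, TARGET_COL)):
    """Read CSV rows from a file-like object and keep only the given columns."""
    reader = csv.DictReader(csv_file)
    return [{col: row[col] for col in columns} for row in reader]


# Tiny in-memory stand-in for the real file (made-up rows, not real tweets).
sample = io.StringIO(
    "tweet_id,airline_sentiment,airline,text\n"
    "1,negative,ExampleAir,my flight was delayed again\n"
    "2,positive,ExampleAir,great crew today\n"
)
rows = select_columns(sample)
print(rows[0])
```

With the downloaded CSV, you would pass an opened file object instead of the in-memory sample.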
Follow these steps to import the airline dataset into Driverless AI. Since the Airline Twitter Sentiment dataset is just a single CSV without a dedicated test dataset, we can split the dataset into airline_train and airline_test using the dataset splitter as shown below.
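For readers who like to see the idea in code, here is a minimal sketch of a random train/test split. It is not how the Driverless AI dataset splitter is implemented; the 80/20 ratio, the seed, and the function name are assumptions chosen for illustration.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of rows and cut it into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Made-up rows standing in for the tweets.
rows = [{"text": f"tweet {i}", "airline_sentiment": "neutral"} for i in range(10)]
airline_train, airline_test = train_test_split(rows)
print(len(airline_train), len(airline_test))
```

A fixed seed matters here because every experiment in this article should be compared on the same held-out rows.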
Now we are ready to train our first model using airline_train and then evaluate the out-of-sample performance with airline_test. For the first baseline model, we are going to leave most settings as default. Since we are only using the text column as a single feature for this exercise, we need to remove the rest (see dropped columns settings below) before we launch the experiment.
[Screenshot: text in dropped columns setting]

As we haven’t switched on complex text transformations (e.g. CNN, BiGRU, BERT), the transformed features from this simple experiment are all TF-IDF-based. We can certainly improve this baseline model with more complex transformations, so let’s move on to the next step.
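To make the TF-IDF-based baseline features more concrete, here is a minimal sketch of the textbook calculation. It uses relative term frequency and idf = ln(N / df) with whitespace tokenization; these choices are assumptions for illustration, and the TF-IDF transformers in Driverless AI may tokenize, smooth, and normalise differently.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document.

    Weight = (term count / document length) * ln(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up example tweets.
docs = ["flight delayed again", "flight was great", "great crew"]
weights = tfidf(docs)
print(weights[0])
```

In this toy corpus, "delayed" scores higher than "flight" in the first tweet because it appears in fewer documents. Note that TF-IDF only counts words; it knows nothing about word order or meaning, which is the gap the later steps address.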
In order to switch on more complex text transformation, we need to change two values in expert settings as shown below. This will activate word-based CNN and BiGRU text transformation in the automatic feature engineering pipeline. As a result, we can see that the dominant features in the experiment are created based on CNN and BiGRU (instead of TF-IDF-based features in the baseline model). We can also see an improvement in model performance (i.e. lower logloss and error rate). Can we further improve this? Read on.
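The two metrics mentioned above can be sketched as follows. This is a generic illustration of multiclass logloss and error rate, not the scorer code in Driverless AI; the clipping constant and the dict-based probability format are assumptions.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        p = min(max(probs[label], eps), 1 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(y_true)


def error_rate(y_true, y_prob):
    """Share of rows where the most probable class is not the true class."""
    wrong = sum(
        1 for label, probs in zip(y_true, y_prob)
        if max(probs, key=probs.get) != label
    )
    return wrong / len(y_true)


# Made-up predictions for three tweets.
y_true = ["negative", "positive", "neutral"]
y_prob = [
    {"negative": 0.8, "neutral": 0.1, "positive": 0.1},
    {"negative": 0.2, "neutral": 0.3, "positive": 0.5},
    {"negative": 0.6, "neutral": 0.3, "positive": 0.1},
]
print(log_loss(y_true, y_prob), error_rate(y_true, y_prob))
```

Logloss rewards confident correct predictions and punishes confident wrong ones, so it can improve even when the error rate stays the same. Lower is better for both.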
Before we get to the next step, let me introduce a fantastic platform called Hugging Face. Here is the statement on their website:
“We are helping the community work together towards the goal of advancing Artificial Intelligence. Not one company, even the Tech Titans, will be able to “solve AI” by themselves – the only way we’ll achieve this is by sharing knowledge and resources. On the Hugging Face Hub we are building the largest collection of models, datasets and metrics in order to democratize and advance AI for everyone. The Hugging Face Hub works as a central place where anyone can share and explore models and datasets.” (Source)
For our Airline Twitter Sentiment exercise, we are going to find a relevant transformer on Hugging Face so that we can perform better feature extraction than those from the general-purpose text transformers in Driverless AI.
From a quick search on Hugging Face using the keyword twitter, we can find the twitter-roberta-base-sentiment model from the Cardiff NLP group. The model was trained on many different tweets. That sounds relevant to our use case here, so let’s give it a try!
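Before wiring the model into Driverless AI, you may want to try it on its own. The sketch below is an assumption-laden illustration: the label order and the handle/link masking follow what the model card describes (worth double-checking there), score_tweet is a hypothetical helper that needs the transformers and torch packages plus a model download, and none of this is taken from how Driverless AI calls the model internally.

```python
import math

MODEL_NAME = "cardiffnlp/twitter-roberta-base-sentiment"
# Label order as described on the model card: 0 = negative, 1 = neutral, 2 = positive.
LABELS = ("negative", "neutral", "positive")


def preprocess(text):
    """Mask user handles and links, mirroring the model card's example."""
    tokens = []
    for token in text.split(" "):
        if token.startswith("@") and len(token) > 1:
            token = "@user"
        elif token.startswith("http"):
            token = "http"
        tokens.append(token)
    return " ".join(tokens)


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_tweet(text):
    """Hypothetical helper: download the model and score a single tweet."""
    # Imported lazily so the helpers above work without these packages.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
    encoded = tokenizer(preprocess(text), return_tensors="pt")
    logits = model(**encoded).logits[0].detach().tolist()
    return dict(zip(LABELS, softmax(logits)))


print(preprocess("@united thanks for the upgrade! https://example.com/pic"))
```

Calling score_tweet on a tweet should return one probability per sentiment class, which gives a feel for the kind of signal the model can contribute as a feature.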
Now, this is the most important step. If you get this right, you will be able to import many more transformers from Hugging Face.
First, we need to write a simple Python script that imports the twitter-roberta-base-sentiment transformer into Driverless AI. Let’s call this script TwitterRobertaTransformer.py. The two things that matter most in this script are the MODEL_NAME value and the class name. Change them to match another transformer from Hugging Face and you will be able to import many other transformers into Driverless AI.
from h2oaicore.systemutils import config
from h2oaicore.transformer_utils import CustomTransformer
from h2oaicore.transformers_nlp import BERTTransformer

# Hugging Face model ID; change this to import a different transformer.
MODEL_NAME = 'cardiffnlp/twitter-roberta-base-sentiment'


# The class name is what you select later in the specific transformers setting.
class TwitterRoberta(BERTTransformer, CustomTransformer):
    _mojo = False

    @staticmethod
    def get_default_properties():
        # Apply this transformer to exactly one text column.
        return dict(col_type="text",
                    min_cols=1,
                    max_cols=1,
                    relative_importance=1)

    @staticmethod
    def get_parameter_choices():
        # Pin the model and reuse the configured batch size and padding length.
        return dict(model_type=[MODEL_NAME],
                    batch_size=[config.pytorch_nlp_fine_tuning_batch_size],
                    seq_length=[config.pytorch_nlp_fine_tuning_padding_length])
Once we have the script ready, we can go to the recipes tab in expert settings and upload the script as shown below. You will also need to enable it by selecting TwitterRoberta in the specific transformers setting. After that, you should be able to see TwitterRoberta in the feature engineering search space.
As expected, we can get better predictive performance with domain-specific features from the twitter-roberta-base-sentiment model.
In short, we started with a simple baseline model using standard text transformations like TF-IDF and then improved the performance with CNN/BiGRU feature transformations. To perform context-aware and domain-specific feature extraction, we imported the twitter-roberta-base-sentiment transformer and improved the model performance further.
It is possible to improve the model even further (see screenshot below). I am not going to reveal the exact procedure but I am sure you can figure it out fairly quickly. Here are a few hints:
With custom recipes, it is possible to extend and improve text transformation in Driverless AI using state-of-the-art models from the AI community. Thus, we already have the technology in place to future-proof our automatic feature engineering pipeline. We are excited to see what our users can do with different transformers. For example, could you extract predictive features with BioBERT for health care use cases? Could you get a competitive edge in the stock market with features from FinBERT? The possibilities are endless. We hope that our technology will enable our users to benefit from the latest transformers with minimal effort for many years to come.
H2O AI Cloud is the best way to get free, hands-on experience. No installation. All you need is a web browser. Request a demo today.
The advanced text analytics feature discussed in this article is brought to you by Sudalai Rajkumar, Maximilian Jeblick, and Trushant Kalyanpur.