Return to page WIKI


What is BERT?

BERT, short for Bidirectional Encoder Representations from Transformers, is a machine learning (ML) framework for natural language processing. In 2018, Google developed this algorithm to improve contextual understanding of unlabeled text across a broad range of tasks by learning to predict text that might come before and after (bi-directional) other text.


Examples of BERT

BERT is used for a wide variety of language tasks. Below are examples of what the framework can help you do:

  • Determine if a movie’s reviews are positive or negative
  • Help chatbots answer questions
  • Help predicts text when writing an email
  • Can quickly summarize long legal contracts
  • Differentiate words that have multiple meanings based on the surrounding text

Why is BERT important?

BERT converts words into numbers. This process is important because machine learning models use numbers, not words, as inputs. This allows you to train machine learning models on your textual data. That is, BERT models are used to transform your text data to then be used with other types of data for making predictions in a ML model.


Can BERT be used for topic modeling?

Yes. BERTopic is a topic modeling technique that uses BERT embeddings and a class-based TF-IDF to create dense clusters, allowing for easily interpretable topics while keeping important words in the topic descriptions.

What is Google BERT used for?

It’s important to note that BERT is an algorithm that can be used in many applications other than Google. When we talk about Google BERT, we are referencing its application in the search engine system. With Google, BERT is used to understand the intentions of the users’ search and the contents that are indexed by the search engine.

Is BERT a neural network?

Yes. BERT is a neural-network-based technique for language processing pre-training. It can be used to help discern the context of words in search queries.

Is BERT supervised or unsupervised?

BERT is a deep bidirectional, unsupervised language representation, pre-trained using a plain text corpus. and BERT: BERT pre-trained models deliver state-of-the-art results in natural language processing (NLP). Unlike directional models that read text sequentially, BERT models look at the surrounding words to understand the context. The models are pre-trained on massive volumes of text to learn relationships, giving them an edge over other techniques. With GPU acceleration in H2O Driverless AI, using state-of-the-art techniques has never been faster or easier.

Bert vs Other Technologies & Methodologies


Along with GPT (Generative Pre-trained Transformer), BERT receives credit as one of the earliest pre-trained algorithms to perform Natural Language Processing (NLP) tasks.

Below is a table to help you better understand the general differences between BERT and GPT.

Bidirectional. Can process text left-to-right and right-to-left. BERT uses the encoder segment of a transformation model. Autoregressive and unidirectional. Text is processed in one direction. GPT uses the decoder segment of a transformation model.
Applied in Google Docs, Gmail, smart compose, enhanced search, voice assistance, analyzing customer reviews, and so on. Applied in application building, generating ML code, websites, writing articles, podcasts, creating legal documents, and so on.
GLUE score = 80.4% and 93.3% accuracy on the SQUAD dataset. 64.3% accuracy on the TriviaAQ benchmark and 76.2% accuracy on LAMBADA, with zero-shot learning
Uses two unsupervised tasks, masked language modeling, fill in the blanks and next sentence prediction e.g. does sentence B come after sentence A? Straightforward text generation using autoregressive language modeling.

BERT vs transformer

BERT uses an encoder that is very similar to the original encoder of the transformer, this means we can say that BERT is a transformer-based model.

BERT vs word2vec

Consider the two examples sentences

  • “We went to the river bank.”
  • “I need to go to the bank to make a deposit.”

Word2Vec generates the same single vector for the word bank for both of the sentences. BERT will generate two different vectors for the word bank being used in two different contexts. One vector will be similar to words like money, cash, etc. The other vector would be similar to vectors like beach and coast.


Compared to RoBERTa (Robustly Optimized BERT Pretraining Approach), which was introduced and published after BERT, BERT is a significantly undertrained model and could be improved. RoBERTa uses a dynamic masking pattern instead of a static masking pattern. RoBERTa also replaces next sentence prediction objective with full sentences without NSP.

BERT Resources