BERT

1/13/22


Welcome back to the Odyssey! As I’ve mentioned earlier, for this project I will be using the language model BERT. But what is a language model, and how exactly does BERT work? In this post, I want to briefly answer those questions.


What is language modeling?

In essence, it is the problem of modeling the probability distribution of natural language text. To put this in plain English, we are trying to figure out how likely it is for a certain sentence to be said. A language model learns to do this by taking a sequence of words during training and predicting the next word using the representation it has built up. This lets the model obtain ‘meanings’ of words through the contexts they are used in, allowing it to determine how probable each word is to occur. The literature on language modeling is pretty diverse, so I will only highlight a few concepts relevant to my project. However, if you are interested in learning more, check out this course by the University of Toronto.
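
To make the idea a bit more concrete, here is a toy sketch (nothing to do with BERT itself, just the intuition) that scores sentences with a simple bigram count model. The tiny corpus and example sentences are made up for illustration.

```python
# Toy language model: estimate how likely a sentence is by multiplying
# conditional word probabilities learned from a tiny made-up corpus.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each word follows another (a simple bigram model).
bigram_counts = defaultdict(Counter)
for prev, curr in zip(corpus, corpus[1:]):
    bigram_counts[prev][curr] += 1

def next_word_prob(prev, curr):
    """P(curr | prev) estimated from the bigram counts."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][curr] / total if total else 0.0

def sentence_prob(sentence):
    """Probability of a sentence as the product of its bigram probabilities."""
    words = sentence.split()
    prob = 1.0
    for prev, curr in zip(words, words[1:]):
        prob *= next_word_prob(prev, curr)
    return prob

print(sentence_prob("the cat sat on the mat"))   # relatively likely
print(sentence_prob("the mat sat on the cat"))   # much less likely (zero here)
```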

The first thing to note is the use of Transformer architectures in language modeling. As I mentioned above, language models try to predict an output sequence given an input sequence, and longer input sequences provide more context and information for the model to make better predictions. Traditionally, recurrent architectures like RNNs, LSTMs, and GRUs have been used to capture these long-range dependencies. All of them rely on recurrence, which is computationally expensive and suffers from problems like vanishing gradients. Transformers revolutionized language modeling by replacing recurrence with an attention mechanism, meaning the model processes the whole input sequence at once and learns to focus on the most salient parts of the context.
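
For the curious, here is a rough sketch of the scaled dot-product attention at the core of the Transformer, written in plain NumPy; the shapes and values are purely illustrative and not taken from BERT.

```python
# Scaled dot-product attention: each output is a weighted average of the
# value vectors, with weights given by how well a query matches every key.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over keys
    return weights @ V                                   # blend values by weight

# Three token positions, each represented by a 4-dimensional vector.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V
print(out.shape)  # (3, 4) -- one contextualized vector per token
```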

Secondly, all state-of-the-art language models employ pre-training. Pre-training means the model first learns a general representation of the language from a large unlabeled corpus before it is adapted to a specific task. The benefit is that during fine-tuning only a small number of parameters have to be learned from scratch to specialize to the downstream task, making training far less computationally expensive. This allows much bigger and more complex models to be used.
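
As a sketch of what this looks like in practice (assuming the Hugging Face transformers library, which I introduce further down), loading a pre-trained checkpoint and bolting a fresh classification head onto it takes only a few lines; the model name and label count here are just placeholders.

```python
# Pre-trained body + freshly initialized classification head.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"  # placeholder: pre-trained weights from the Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Only the new head starts from scratch; the ~110M pre-trained parameters
# are merely adjusted slightly during fine-tuning.
inputs = tokenizer("A quick example sentence.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, 2): one score per label, before any fine-tuning
```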

Finally, it is important to note the widespread use of unidirectional architectures. This approach makes intuitive sense: when a model is deployed to generate text, it only has access to the preceding context, so it seems natural for training to mirror that setting. However, the authors of BERT argue that this practice restricts the power of the pre-trained representations by limiting the choice of architectures that can be used during pre-training.
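
A tiny illustration of the difference: a unidirectional (causal) attention mask only lets each token see what comes before it, whereas a bidirectional setup lets every token see the whole sentence. The matrices below are just for illustration.

```python
# Causal vs. bidirectional attention masks for a 5-token sequence.
import numpy as np

seq_len = 5
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))
bidirectional_mask = np.ones((seq_len, seq_len), dtype=int)

print(causal_mask)         # lower-triangular: token i attends to tokens 0..i only
print(bidirectional_mask)  # all ones: token i attends to the full sentence
```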

What is BERT?

BERT is a language model that was developed by Google in 2018. The base model boasts around 110 million parameters, and at the time of its release it achieved new state-of-the-art results on 11 natural language processing tasks. BERT stands for Bidirectional Encoder Representations from Transformers. It sounds like a mouthful, but the first and last words are the only ones you really have to pay attention to.

Like most top-performing language models, BERT uses the Transformer architecture and performs pre-training, the benefits of which I discussed above. The major difference, and the reason for its success, is that it learns deep bidirectional representations from unlabeled text by jointly conditioning on both the left and right context. The authors show that this lets the model learn a better representation of the training corpus than a unidirectional approach does.

BERT alleviates the unidirectional constraint by using a masked language modeling objective. Essentially, during training 15% of tokens are replaced with a mask token (i.e. [MASK]), which the model then tries to fill. As discussed above, language models ‘understand’ words by the company they keep (i.e. the surrounding words) and the masked language objective allows the model to use both the left and right contexts of words.
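
If you want to see this in action, the Hugging Face fill-mask pipeline will rank candidates for a [MASK] slot using both sides of the context. The checkpoint below is the standard English BERT, and the sentence is just an example.

```python
# Masked language modeling in action: BERT uses the left context
# ("The capital of France") and the right context ("is a beautiful city")
# to rank candidates for the [MASK] slot.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The capital of France, [MASK], is a beautiful city."):
    print(f"{prediction['token_str']:>10}  {prediction['score']:.3f}")
```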

Hugging Face and German BERT

You might have been wondering how I am going to use BERT in German if the model was pre-trained by Google on English text. Luckily, Google open-sourced their code, and German NLP researchers have pre-trained BERT models on German corpora that they claim perform as well as Google’s English BERT. These models have been published on Hugging Face, a company that seeks to democratize artificial intelligence through open source. Additionally, Hugging Face provides a Python library, which I will be using for this project, that greatly simplifies working with these complex Transformer models by providing functions to preprocess your data and train your model. If you are interested in working with Transformers, I highly recommend that you check out their website.
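
As a rough sketch of what loading one of these German models looks like with the library (the model identifier below is my understanding of the DBMDZ cased checkpoint on the Hub, so treat it as an assumption):

```python
# Pull a German BERT checkpoint from the Hugging Face Hub.
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "dbmdz/bert-base-german-cased"  # assumed DBMDZ identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

print(model.config.vocab_size)  # a German vocabulary, not the English one
```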

BERT has two different variations, uncased and cased, that I need to choose between. The cased version treats capitalized words differently from uncapitalized ones, whereas the uncased version lowercases everything. Although the uncased model has been found to perform better in English, the situation in German is more ambiguous because of the language’s greater use of capitalization (every noun is capitalized). Because of this, I will be experimenting with both the cased and uncased versions of the German BERT model to see which performs better. Additionally, while there are several different German BERT models, this paper demonstrates that the models created by DBMDZ performed the best across the board.
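
To get a feel for the difference, here is a small comparison of the two tokenizers. The DBMDZ model identifiers are again assumptions on my part, and the German sentence is just an example.

```python
# Compare cased vs. uncased German tokenization: the uncased tokenizer
# lowercases everything, discarding the capitalization that marks German nouns.
from transformers import AutoTokenizer

cased = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")
uncased = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-uncased")

sentence = "Der Hund spielt im Garten."  # nouns are capitalized in German
print(cased.tokenize(sentence))    # keeps "Hund" and "Garten" capitalized
print(uncased.tokenize(sentence))  # everything lowercased before tokenizing
```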

Tune in for my next post where I talk more about finetuning!
