Preliminary Results

3/23/22


Welcome back to the Odyssey! In this post I will discuss my first attempt at training and using the BERT model.


After scraping all of the tweets needed to train my models and performing the initial preprocessing, I wanted to get a rough idea of what the model is capable of and the quality of the representation it learned. This post will be split up into the actual training, the problems I ran into, a brief discussion of the results, and how I am moving forward.

Training

Code

 

The code was originally in a Colab notebook, which is where I have been running the finetuning, and is only in a regular file here because it makes it easier to screenshot.
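Since the screenshot doesn’t reproduce well here, below is a rough sketch of what the notebook does (the file name, text column, and output directory are placeholders rather than my exact code):

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# 1. Load the scraped tweets (placeholder file and column names)
dataset = load_dataset("csv", data_files="tweets_2020_06.csv")

# 2. Tokenize the tweets with German BERT's WordPiece tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True,
                        remove_columns=dataset["train"].column_names)

# 3. Load the pretrained model with a masked-language-modeling head
model = AutoModelForMaskedLM.from_pretrained("bert-base-german-cased")

# The collator randomly masks tokens so the model trains on the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# 4. Fine-tune the model
args = TrainingArguments(output_dir="gbert-2020-06")  # hyperparameters are shown in the Trainer section below
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"], data_collator=collator)
trainer.train()
```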

 

Discussion

The code above can be broken down into 4 main parts: loading the data, tokenizing the data, creating the model, and training (finetuning) the model. As you can see, Hugging Face really streamlines the process and the whole program doesn’t take much code. To get a better understanding of what is actually going on, let’s look at the Hugging Face classes that I am using.

Datasets

This library is pretty self-explanatory. It essentially allows me to convert my CSV file into a HuggingFace dataset object. This object is the standard way to pass data into a HuggingFace model for finetuning.

AutoTokenizer

I mentioned in my previous post that one major part of pre-processing for language models is tokenization, which consists of breaking up sentences into smaller units called tokens that represent different entities. This might seem pretty easy: just split the text up into words. However, actually doing this is easier said than done. For example, splitting text on whitespace doesn’t allow the program to represent entities such as “New York” or “rock ‘n’ roll” as a single token. Further, in the real world the number of different words in use is enormous, and creating a unique token for each word leaves the model with a poor understanding of rarely used words. A way around this, and a method that is currently very widely used, is subword-based tokenization, which splits rare words into smaller, meaningful subwords. The tokenization process used by the authors of German BERT (and the tokenizer that I am using) is WordPiece, a state-of-the-art subword tokenizer. Hugging Face lets you import and use this tokenizer through the AutoTokenizer class.

You can read more about tokenization here.
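As a quick illustration, here is roughly how the German BERT WordPiece tokenizer behaves (the exact splits depend on its learned vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")

# A common word will likely stay whole, while a rarer compound word is
# likely split into several subword pieces marked with '##'
print(tokenizer.tokenize("Deutschland"))
print(tokenizer.tokenize("Maskenpflicht"))
```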

AutoModelForMaskedLM

The models available to download on HuggingFace are the website’s main attraction. In order to load a pretrained language model, all I have to do is call an AutoModel class’s from_pretrained function with a valid model name. Since I am performing masked language modeling, I am using the AutoModelForMaskedLM class. To load the German BERT model, I simply call AutoModelForMaskedLM.from_pretrained and pass it the parameter “bert-base-german-cased”.

Trainer and TrainingArguments

Training is the most complicated and computationally expensive part of my code. However, HuggingFace has streamlined training so that only two classes are needed to finetune a state-of-the-art language model. The Trainer class performs the actual training, taking a TrainingArguments object as a parameter to set the hyperparameters. In order to stay close to KhudaBukhsh’s paper I used the same hyperparameters he did: batch size = 4 (in the paper it was 16, but I had to lower it to 4 because of memory constraints), maximum sequence length = 128, maximum predictions per sequence = 20, fine-tuning steps = 20,000 (I did an equivalent amount in epochs), warmup steps = 10, and learning rate = 2e-5.
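Roughly, these hyperparameters map onto a TrainingArguments object like this (the maximum sequence length and maximum predictions per sequence are handled at tokenization and masking time rather than here, and the output directory is a placeholder):

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="gbert-2020-06",        # placeholder directory name
    per_device_train_batch_size=4,     # 16 in the paper; lowered because of memory limits
    num_train_epochs=4,                # instead of a fixed 20,000 steps (see the Problems section)
    warmup_steps=10,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,                       # the AutoModelForMaskedLM loaded earlier
    args=training_args,
    train_dataset=tokenized["train"],  # the tokenized tweets from the sketch above
    data_collator=collator,            # randomly masks tokens for the MLM objective
)
trainer.train()
```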

Results
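To query a fine-tuned model, I give it a cloze statement containing a [MASK] token and look at which tokens it predicts for that position. A minimal sketch of this using the fill-mask pipeline (the checkpoint path and the example statement here are placeholders):

```python
from transformers import pipeline

# Load a fine-tuned monthly checkpoint (placeholder path)
fill_mask = pipeline("fill-mask", model="gbert-2020-06")

# The model ranks vocabulary tokens for the [MASK] position
for prediction in fill_mask("Corona ist [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```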

Below you can see the results from some sample cloze statements I fed to the June 2020 model.

Overall, I was impressed with the quality of the learned representation. When compared to the non-fine-tuned model (German BERT), it is clear that the new model gained an understanding of Germany’s perception of Covid and its related issues. This is evident when we look at the results from cloze statements 2-6. The robustness of the model is especially apparent in examples 2-4, as the different words used for Covid (i.e., Covid, Corona, and Covid-19) all have the same meaning from the model’s point of view.

However, there is also clearly room for improvement. For cloze statements 1, 9, and 14, the fine-tuned model’s output is very similar to the base model’s output despite the fact that the pandemic has had a large impact on the perception of these entities.

Problems and Moving Forward

Training

One of the biggest problems I ran into while trying to actually run the code was the lack of memory on the Google Colab notebooks. As I mentioned above, this prevented me from using the same hyperparameters as the KhudaBukhsh paper (I lowered the batch size to 4 from 16). After noticing that the model had learned an awful representation, I realized that changing the batch size changes the total amount of data the model sees when training for a fixed number of iterations. For example, KhudaBukhsh ran 20,000 iterations with a batch size of 16, so the total number of data points the model sees is 20,000 * 16 = 320,000. When I reduced the batch size to 4, I was only training on a total of 80,000 data points (roughly 1 epoch). To account for this, and so that the amount of training data stays consistent even if I change the batch size in the future, I now train each model for 4 epochs.

Another major issue was the size of the model configurations. Originally I had planned to save multiple versions of a model during fine-tuning so that I would be able to select the best-performing model to use in evaluation. However, after noticing that each model takes roughly a gigabyte of storage space, I had to abandon this plan (I am training a total of 24 models) and am now using just the final model for each month. This of course prevents me from addressing overfitting, so in the future I will have to find another solution.

Data

After going back to KhudaBukhsh’s paper I also realized that I had miscalculated the amount of data he used. For his temporal tracking, he had a total of 6 million comments that were split up to fine-tune 15 different models, meaning that he had about 400,000 comments per model. In contrast, I had limited myself to 150,000 tweets per model. Moving forward, I am going to increase the dataset size to 300,000 tweets (which is the most I can increase it to while ensuring that each month has the same number of tweets). If this amount of data isn’t enough, I will merge datasets and track community perception on a bi-monthly basis.

Taking a much closer look at my data after this first test, I noticed that the tweets are much noisier and more varied than I had accounted for in my data cleaning. In my second attempt I will perform more preprocessing to get rid of noisy text, and in a future post I will go into more depth about how I am doing this.

Vocabulary and Model Choice

One major part of KhudaBukhsh’s process that I omitted in this test was supplementing the tokenizer’s vocabulary. The vocabulary of a subword tokenizer consists of the subwords that the text will be split up into. This vocabulary is determined during pre-training, and the problem is that because it has a limited size (30,000 entries) and my data is very different from the pre-training data, a lot of the words present in my dataset are not in the vocabulary. This means that the model will have a much harder time understanding the meaning of words that are frequently used in my dataset but aren’t stored in the vocabulary as full words. For example, BERT will find it more difficult to understand words such as “Masken” and “Maskenpflicht”, which explains why the predictions for the cloze statements containing these words weren’t as good as others. Additionally, the model is only able to output a single token. This creates a problem for cloze statements like “The biggest problem in Germany is [MASK].”, because the model will be unable to output words made up of multiple subwords, like “Impfung” and “Maskenpflicht”. Moving forward, I will supplement the tokenizer’s vocabulary with the 900 most frequently occurring words in my corpus.
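A rough sketch of what this could look like with the Hugging Face tokenizer API (the words listed are just examples, not the actual top-900 list):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-german-cased")

# The 900 most frequent words from my corpus would go here (examples only)
new_words = ["Maskenpflicht", "Impfung", "Lockdown"]

# add_tokens only adds words that aren't already in the vocabulary
num_added = tokenizer.add_tokens(new_words)

# The embedding matrix has to grow to match the new vocabulary size
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} new tokens")
```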

I have also decided that I will be using the DBMDZ uncased German BERT model in the future because recent research has shown that it outperforms all other German BERT models in almost every benchmark.

Stay tuned for the next post!
