Preprocessing 2.0

4/1/22


Welcome back to the Odyssey! In this post I will discuss the additional preprocessing that I will be applying to my dataset as well as the changes to the tokenizer’s vocabulary.


Data Cleaning

As I mentioned in my last post, a closer look at my data revealed that I needed more preprocessing to properly clean it. The specific changes that I implemented were filtering out additional emojis, fixing errors in the UTF-8 encoding, deleting repeated tweets, and dealing with special characters. I also noticed that my code was taking an unusually long time to run. After doing some research, I found out that iterating through dataframe objects with for-loops is very computationally expensive, especially given the amount of preprocessing I am performing and the amount of data I am using. To address this, I used lambda functions with regular expressions to clean my data, which is a lot less computationally expensive. If you are keen on working with regular expressions, this is a useful website that I found myself using a lot.
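To make that concrete, here is a minimal sketch of the pattern I mean, with a made-up column name and regex (the for-loop version is shown only for comparison):

```python
import re
import pandas as pd

df = pd.DataFrame({"text": ["Tolle Neuigkeiten!!!", "Hallo \\o/ Welt"]})

# Slow: iterating over the DataFrame row by row
# for i, row in df.iterrows():
#     df.at[i, "text"] = re.sub(r"!+", "!", row["text"])

# Faster: apply a lambda with a compiled regex to the whole column at once
exclaim = re.compile(r"!+")
df["text"] = df["text"].apply(lambda t: exclaim.sub("!", t))
```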

Cleaning

Additional Emojis:

In addition to regular emojis, the character emojis “\o/” and “¯\_(ツ)_/¯” are used very frequently, which makes it much more difficult for the model to learn a proper representation. I removed them using the code below.

Example of the character emoji “\o/”.

Code to remove these emojis.
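A minimal sketch of that substitution (the helper name and example tweet are just for illustration):

```python
import re

# The two character emojis, escaped so that backslashes and parentheses
# are matched literally rather than treated as regex syntax.
CHAR_EMOJI = re.compile("|".join(re.escape(e) for e in ["\\o/", "¯\\_(ツ)_/¯"]))

def remove_char_emojis(text: str) -> str:
    return CHAR_EMOJI.sub("", text)

print(remove_char_emojis("Endlich Wochenende \\o/"))  # "Endlich Wochenende "
```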

Errors in UTF-8 encoding:

Even though both HTML and my code use the UTF-8 encoding, a few symbols are not stored properly when my tweets are scraped from Twitter. The problem is that some special characters (e.g. “&”, “<”, and “>”) are reserved in HTML and have to be replaced with character entities. In my dataset the most commonly used reserved characters were:

  • “&”, which showed up as “&amp;” and occurred 320,713 times,

  • “>”, which showed up as “&gt;” and occurred 55,565 times,

  • “<”, which showed up as “&lt;” and occurred 20,283 times,

  • and “–”, which showed up as “&amp#8211;” and occurred 5,324 times.

To replace these character entities with the actual characters, I again used a lambda expression with regex’s substitute function. Below you can see the code and how an “&” symbol shows up in my unprocessed dataset vs. in the original tweet.
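A sketch of that replacement using a small lookup table and re.sub (the entity list mirrors the bullets above; Python’s built-in html.unescape would also handle the standard entities):

```python
import re

# Map the most common HTML character entities in the scrape back to
# the characters they stand for.
HTML_ENTITIES = {
    "&amp;": "&",
    "&gt;": ">",
    "&lt;": "<",
    "&amp#8211;": "–",  # the en dash, as it showed up in the raw data
}
# Sort by length so longer entities are matched before their prefixes.
ENTITY_RE = re.compile(
    "|".join(re.escape(e) for e in sorted(HTML_ENTITIES, key=len, reverse=True))
)

def decode_entities(text: str) -> str:
    return ENTITY_RE.sub(lambda m: HTML_ENTITIES[m.group(0)], text)

print(decode_entities("Impfung &amp; Maskenpflicht &gt; Lockdown"))
# Impfung & Maskenpflicht > Lockdown
```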

Repeated Tweets:

In my original scrape I had checked for duplicates within the raw data, but the amount of repeats was negligible compared to the total size of the dataset. However, out of curiosity I checked for duplicates in the cleaned data and found that approximately 10% of all the tweets were in fact repeats. The reason I didn’t see these before is that the vast majority of them come from news bots that post the same message with slightly different links. An example and the code that I used to remove the duplicates are below.
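A minimal sketch of the deduplication using pandas’ drop_duplicates (the column name and example tweets are made up):

```python
import pandas as pd

# After cleaning, bot tweets that differed only in their links collapse
# into exact duplicates and can simply be dropped.
df = pd.DataFrame({"text": [
    "Corona-News: Inzidenz steigt weiter",
    "Corona-News: Inzidenz steigt weiter",
    "Impfpflicht wird weiter diskutiert",
]})

before = len(df)
df = df.drop_duplicates(subset="text", keep="first").reset_index(drop=True)
print(f"Removed {before - len(df)} duplicate tweets")  # Removed 1 duplicate tweets
```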

 
 

Special characters:

I decided to remove repeating special characters because their main purpose is to catch the reader’s attention, and they don’t provide the BERT model with any contextual information. However, special characters used on their own do provide valid information and thus shouldn’t be removed. Below are the special characters I focused on and how I dealt with them.

  • “+” and “*”: if repeated, remove all of them; if they are not repeated, leave them in.

  • “|”: if “|” occurs inside a word, remove it; otherwise only collapse repeated ones, since a single “|” is used as a full stop.

  • “!”, “(”, “)”, “-”, and “?”: collapse repeated occurrences of these characters into a single one (e.g. “!!!!” → “!”).

Below is an example tweet that uses special characters, along with the code I used to remove them.
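Here is a sketch of those three rules as regex substitutions (the patterns are my reconstruction of the rules above rather than the exact original code):

```python
import re

def clean_special_chars(text: str) -> str:
    # "+" and "*": remove them entirely, but only when they are repeated.
    text = re.sub(r"\+{2,}|\*{2,}", "", text)
    # "|" inside a word is noise; a lone "|" acts like a full stop,
    # so only runs of them are collapsed.
    text = re.sub(r"(?<=\w)\|(?=\w)", "", text)
    text = re.sub(r"\|{2,}", "|", text)
    # "!", "(", ")", "-", "?": collapse repeats down to a single character.
    text = re.sub(r"([!()\-?])\1+", r"\1", text)
    return text

print(clean_special_chars("Endlich Impftermin!!!! ***heute*** gute|Nachricht"))
# Endlich Impftermin! heute guteNachricht
```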

 
 

Code

Below is the full code that I used for pre-processing my data.
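As a rough sketch, the individual steps fit together like this (the file and column names are placeholders and the entity list is abbreviated):

```python
import re
import pandas as pd

# Condensed cleaning pipeline: each (pattern, replacement) pair
# implements one of the steps described above.
CLEANING_STEPS = [
    (re.compile("|".join(re.escape(e) for e in ["\\o/", "¯\\_(ツ)_/¯"])), ""),  # character emojis
    (re.compile(r"&amp;"), "&"),            # HTML character entities
    (re.compile(r"&gt;"), ">"),
    (re.compile(r"&lt;"), "<"),
    (re.compile(r"\+{2,}|\*{2,}"), ""),     # repeated "+" and "*"
    (re.compile(r"(?<=\w)\|(?=\w)"), ""),   # "|" inside a word
    (re.compile(r"([!()\-?|])\1+"), r"\1"), # collapse repeated special characters
]

def clean_tweet(text: str) -> str:
    for pattern, replacement in CLEANING_STEPS:
        text = pattern.sub(replacement, text)
    return text

df = pd.read_csv("tweets.csv")  # placeholder file name
df["text"] = df["text"].apply(clean_tweet)
df = df.drop_duplicates(subset="text", keep="first").reset_index(drop=True)
```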

Vocabulary

In KhudaBukhsh’s paper, roughly 75% of the words used in his dataset were out of vocabulary (OOV), meaning that they weren’t present in the BERT tokenizer’s vocabulary list. In contrast, only about 16% of the words in my dataset are not present in the German BERT tokenizer’s vocabulary. Although KhudaBukhsh’s dataset is clearly much more complex, it is still necessary for me to supplement my tokenizer with frequently occurring words, since these words are typically related to Covid and are thus the ones that I actually care about (e.g. corona, impfung, querdenker, etc.). If I don’t add these words, the model will learn a much worse representation of the data and will be unable to use these tokens during prediction.

The words that are added to the model’s vocabulary are the most frequently occurring OOV words. As you might imagine, iterating through every example in a dataset in order to find these words is very computationally expensive, and without an efficient implementation your algorithm might take hours to execute. I learned this the hard way after realizing that my code had only reached a quarter of the total examples after running for an hour. A little tip: don’t use for-loops to iterate through pandas DataFrame objects. After converting my DataFrame to a Python dictionary using pandas’ to_dict() function, my code was able to run in a couple of minutes. If you are interested in learning more about efficiently iterating through DataFrames, check out this link.
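A sketch of the counting step, assuming a German BERT tokenizer from Hugging Face’s transformers (the model name bert-base-german-cased and the example tweets are assumptions, and checking whole words against the WordPiece vocabulary is a simplification):

```python
from collections import Counter

import pandas as pd
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-german-cased")
vocab = set(tokenizer.get_vocab())

df = pd.DataFrame({"text": [
    "Corona Impfung heute",
    "Querdenker demonstrieren gegen die Impfpflicht",
]})

# Convert the column to a plain dict once instead of looping over the
# DataFrame rows directly.
tweets = df["text"].to_dict()

counts = Counter()
for tweet in tweets.values():
    for word in tweet.split():
        if word not in vocab:
            counts[word] += 1

# The most frequent out-of-vocabulary words are the candidates to add.
oov_words = [word for word, _ in counts.most_common(900)]
```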

Below you can see the 20 most common words in the entire dataset and how often they occurred, as well as the 20 most common words that weren’t already present in the vocabulary list (the ones I actually added). The first 900 words from the list on the right were added to the tokenizer’s vocabulary list.

 

The weird symbol in the 13th row is how Excel interprets the umlaut “ü”.

 

As you can see, all of the most common words that weren’t already present in the vocabulary list are in some way related to Covid.

 

Below is an example of how this affects the actual tokenizer in operation:
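Roughly, adding the new words looks like this (a sketch using Hugging Face transformers; the masked-LM head, the model name, and the example words are just for illustration):

```python
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-german-cased")
model = BertForMaskedLM.from_pretrained("bert-base-german-cased")

print(tokenizer.tokenize("querdenker demo heute"))
# before: an unknown word like "querdenker" is typically split into word pieces

# Add the most frequent OOV words and resize the embedding matrix so the
# model has a (randomly initialised) vector for each new token.
new_words = ["corona", "impfung", "querdenker"]  # illustrative subset
tokenizer.add_tokens(new_words)
model.resize_token_embeddings(len(tokenizer))

print(tokenizer.tokenize("querdenker demo heute"))
# after: "querdenker" is now a single token
```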


Stay tuned for the next post when I finally (hopefully) discuss the results!
