Preprocessing
3/18/22
Welcome back to the Odyssey! In this post I am going to discuss the preprocessing that I am using to prepare the data set for my BERT model.
Preprocessing is a crucial part of any machine learning or AI pipeline. The phrase “garbage in, garbage out” captures the intuition pretty well. Real-world data can be really messy, and to help our model learn better it makes sense to transform raw data into a more understandable format.
In general, there are five main tasks: data cleaning, data integration, data transformation, data reduction, and data discretization, plus tokenization, which is unique to language modeling. For my purposes, the only applicable tasks are data cleaning, data reduction, and tokenization. This post will only focus on the first two. I will discuss tokenization in more depth when I start using the BERT model.
Data Cleaning
Data cleaning involves removing errors and inconsistencies and smoothing out noisy data, with the goal of converting “dirty” data into “clean” data. It is important to note that for language modeling you shouldn’t actually do too much data cleaning, because the goal is to learn a representation of the language as it is really used. By removing a lot of messy data, you can end up removing the intricate patterns in the text that provide meaning. For example, if KhudaBukhsh had corrected all of the typos and misspellings present in his dataset for the paper We Don’t Speak the Same Language: Interpreting Polarization Through Machine Translation, he wouldn’t have discovered derogatory misaligned pairs and would have ended up with significantly different results. Basically, I only want to remove text that provides no contextual information to the language model and is therefore more likely to confuse the model than to help it learn a meaningful representation.
Tweets in particular contain a lot of these confusing strings of text, and I need to remove them in order to use the BERT model appropriately. I will be removing hashtags, mentions, links, and emojis.
Hashtags
Hashtags are a bit unique because some hashtags provide useful contextual information while others don’t. After manually combing through the data, I noticed that hashtags used within sentences typically stand in for actual words, so only the hashtag character itself (#) should be removed. Hashtags used at the very beginning or end of a tweet, on the other hand, are simply useless noise and should be completely deleted (see example).
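The rule above could be sketched in Python roughly as follows. This is only a minimal illustration, not the exact code used for this project: the function name clean_hashtags and the regular expressions are assumptions, and a hashtag is treated as "at the beginning or end" if it belongs to a run of hashtags touching either edge of the tweet.

```python
import re

# A run of one or more hashtags at the very start or very end of the tweet.
LEADING_TAGS = re.compile(r"^(?:\s*#\w+)+\s*")
TRAILING_TAGS = re.compile(r"(?:\s*#\w+)+\s*$")
# Any remaining hashtag sits inside the sentence: keep the word, drop the '#'.
INLINE_TAG = re.compile(r"#(\w+)")


def clean_hashtags(tweet: str) -> str:
    """Delete edge hashtags entirely; strip only the '#' from in-sentence ones."""
    text = LEADING_TAGS.sub("", tweet)
    text = TRAILING_TAGS.sub("", text)
    text = INLINE_TAG.sub(r"\1", text)
    return text.strip()


print(clean_hashtags("#covid Wear a #mask when you go out #staysafe #health"))
```

On this made-up tweet the sketch prints "Wear a mask when you go out". One limitation of the heuristic: a hashtag that ends a sentence but is really used as a word (as in "I love #NLP") is also deleted.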
Mentions
Much like hashtags, mentions can provide useful context information when the user mentioned represents an entity that has a greater cultural significance (e.g. government branches or political figures). Typically though, this isn’t the case. Since differentiating these kinds of mentions is very difficult and would likely have a negligible impact on performance, I decided to simply remove all mentions.
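Removing every mention could look something like the sketch below. The function name remove_mentions and the pattern are illustrative assumptions; the pattern treats an @ followed by letters, digits, or underscores as a mention, and skips an @ that comes directly after a word character so that email addresses are left alone.

```python
import re

# '@' followed by word characters, but not when it is glued to a preceding
# word character (which would indicate something like an email address).
MENTION = re.compile(r"(?<!\w)@\w+")


def remove_mentions(tweet: str) -> str:
    """Drop every @mention and tidy up the whitespace left behind."""
    text = MENTION.sub("", tweet)
    return re.sub(r"\s+", " ", text).strip()


print(remove_mentions("@WHO thanks for the update @CDCgov"))
```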
Links
To us humans, links are a useful way to share information and are interpreted as representing some idea or argument. However, for a language model, links are just a string of seemingly random characters that somehow impact the words used in a tweet. Because of this, links tend to confuse language models and should be removed.
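A minimal sketch of link removal, assuming that a link is any whitespace-delimited token starting with http://, https://, or www. (the function name remove_links and the pattern are hypothetical, and real tweets may contain link formats this does not catch):

```python
import re

# Anything that starts like a URL and runs until the next whitespace.
URL = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def remove_links(tweet: str) -> str:
    """Delete URLs and collapse the whitespace left behind."""
    text = URL.sub("", tweet)
    return re.sub(r"\s+", " ", text).strip()


print(remove_links("Read this https://t.co/abc123 before you go out"))
```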
Emojis
Whether to include or exclude emojis for language modeling is debated in the academic literature. Since I didn’t want my models to predict emojis when I am using them to fill in cloze statements, I decided to remove them.
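One way to strip emojis with only the standard library is to match the Unicode ranges where most of them live. The sketch below is an approximation under that assumption: the ranges are not an exhaustive emoji definition, they also catch some non-emoji symbols, and the name remove_emojis is illustrative.

```python
import re

EMOJI = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (used for flags)
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport and related symbol blocks
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\u200D"                 # zero-width joiner used in emoji sequences
    "\uFE0F"                 # variation selector that requests emoji presentation
    "]+"
)


def remove_emojis(tweet: str) -> str:
    """Delete characters in the emoji ranges and collapse leftover whitespace."""
    text = EMOJI.sub("", tweet)
    return re.sub(r"\s+", " ", text).strip()


# Face with medical mask, folded hands, and a red heart, written as escapes.
print(remove_emojis("Stay safe \U0001F637\U0001F64F everyone \u2764\uFE0F"))
```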
Data Reduction
Data reduction involves reducing the number of attributes and the actual number of examples in the data set.
If you remember, the data from my scrape had the following attributes: Tweet ID, Date, Username, User Location, Language, Tweet URL, Tweet Text. However, every attribute besides the actual text was only used to identify tweets or to check their validity. Additionally, the language model only needs the actual text for fine-tuning. Therefore, we can remove every column except for the tweet text.
I also mentioned in a previous post that the number of tweets scraped every month varied significantly. Since the amount of data available will impact the performance of a language model, it is necessary to standardize the dataset size for each month in order to accurately compare the cloze statements from month to month. I decided that I will be using 150,000 tweets per month. Since January 2020 and February 2020 each have fewer tweets than this, I combined the two datasets.
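Keeping only the text column can be sketched with Python's built-in csv module. This assumes the scrape is stored as a CSV file whose header row uses the attribute names listed above; the function name keep_text_column and the sample row are made up for illustration.

```python
import csv
import io


def keep_text_column(csv_file, text_column="Tweet Text"):
    """Read a CSV with a header row and return only the tweet texts."""
    reader = csv.DictReader(csv_file)
    return [row[text_column] for row in reader]


# Hypothetical one-row file standing in for a month of scraped tweets.
sample = io.StringIO(
    "Tweet ID,Date,Username,User Location,Language,Tweet URL,Tweet Text\n"
    "1,2020-03-01,user_a,Somewhere,en,https://example.com/1,first tweet\n"
)
print(keep_text_column(sample))
```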
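Standardizing each month to a fixed size could be done along these lines. The post does not say how the 150,000 tweets are chosen, so uniform random sampling with a fixed seed is an assumption here, as are the names sample_month and TWEETS_PER_MONTH.

```python
import random

TWEETS_PER_MONTH = 150_000  # target size stated in the post


def sample_month(tweets, target=TWEETS_PER_MONTH, seed=0):
    """Return `target` tweets drawn without replacement (assumed sampling scheme)."""
    if len(tweets) < target:
        raise ValueError(f"only {len(tweets)} tweets available, need {target}")
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    return rng.sample(tweets, target)


# Toy data: two months that are too small alone are pooled before sampling.
january = [f"jan tweet {i}" for i in range(3)]
february = [f"feb tweet {i}" for i in range(4)]
combined = sample_month(january + february, target=5)
print(len(combined))
```

The toy lists are tiny so the example runs instantly; with real data the default target of 150,000 would apply.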