Benjamin Pusch

Update on New Scrape

5/10/22


Welcome back to the Odyssey! Today I am going to provide a little update on the new scrape of the Twitter data.


Updates

As I mentioned in my last post, I am scraping the replies to tweets from news organizations rather than simply scraping all tweets containing certain keywords. The problem is that snscrape (the scraping library I am using) doesn’t have any built-in functionality for scraping the replies to a tweet. In fact, the only efficient way to obtain the replies to a tweet is through Twitter’s API, which, as I have previously discussed, I am not able to use because of its access restrictions.

There are two main ways that I can try to scrape the replies. In the first approach, I iterate through all of the tweets posted by one of the news organizations and, for each tweet, iterate through its replies using snscrape’s recursive function. However, the recursive function does not scrape nested replies, so some tweets will be left out. The bigger issue with this approach, which I found out the hard way, is the amount of time it takes. In an initial test, scraping one week’s worth of tweets produced roughly 40,000 tweets and took around two hours, which means that two years’ worth of data would require upwards of 200 hours of scraping. The second approach, which I am testing as I write this post, is to use Twitter’s search functionality and scrape all tweets that were posted in the time frame I am interested in and that mention one of the news accounts.

The benefit of the first method is that I can control when the tweets from the news organizations were posted, meaning that the replies will be reactions to events that happened during that time frame. The second approach doesn’t have this property: I can only control the date on which the reply itself was posted, so in theory my data could be full of tweets reacting to a statistic, policy, or event from a different time frame, which would defeat the purpose of the experiment. I will do some rigorous testing on the data to check whether this is actually the case.
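To make the second approach concrete, here is a minimal sketch of what the mention-based scrape could look like with snscrape. The handles, date range, and cap are illustrative, and snscrape’s tweet attribute names (for example content vs. rawContent) vary between versions.

import pandas as pd
import snscrape.modules.twitter as sntwitter

# Illustrative query: German-language tweets mentioning one of the news accounts in a one-week window
query = "(@tagesschau OR @derspiegel) lang:de since:2021-09-01 until:2021-09-08"

rows = []
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
    if i >= 10000:  # cap the test run
        break
    rows.append([tweet.id, tweet.date, tweet.content])

replies_df = pd.DataFrame(rows, columns=['Tweet ID', 'Date', 'Tweet Text'])
replies_df.to_csv('mention_scrape_test.csv', index=False)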

Stay tuned!


Code and News Organizations

News Organizations

  • Frankfurter Allgemeine Zeitung

  • Sueddeutsche Zeitung

  • Die Welt

  • Handelsblatt

  • Der Spiegel

  • Die Zeit

  • Bild

  • Tagesschau

  • ZDFheute Nachrichten

  • n-tv

  • Deutsche Welle (DW)

Code

Benjamin Pusch

Covid: Analysis, Reflection, and Moving Forward

5/4/22


Welcome back to the Odyssey! My exams are finally over, so I will start posting to the blog more regularly again. Today, I want to go a bit more in depth into the results I obtained from my Covid project, how the project went overall, what I learned, and how I will be applying these skills to my research on the German elections.


Reference Information

In order to evaluate the model’s findings I need to check how the hypothesized changes in perception correlate with real-world events. It will be interesting to see which events had an impact on the perception of Covid-related issues and how strong these impacts were. I am using the vaccination rates and Covid cases over the span of the pandemic as well as a general timeline of major Covid-related events in Germany. Ideally, I would have conducted a survey of the German population to check the accuracy and quality of the model’s findings; however, since I do not have the resources to do that, I will be relying on KhudaBukhsh’s findings (his model was successful in tracking community perception).

Covid Cases and Vaccinations

Timeline

  • March 8, 2020: first death linked to COVID-19 is registered and the virus is reported in all of Germany's 16 federal states

  • March 2020: The German government issues worldwide travel warnings, and borders are closed to people from non-EU countries

  • March 22, 2020: The first partial lockdown, international travel restrictions, and remote working begin. Reaction: praise from Germans and acceptance of the restrictions.

  • April 2020: economy stagnates, 156 billion euro relief package is created, and panic buying ensues

  • May 4, 2020: First lockdown is over after 7 weeks

  • June 2020: The “Querdenker” (“lateral thinker”) movement gets stronger. They protest against the remaining restrictions, which they claim are an infringement of their fundamental civil liberties

  • August 2020: police in Berlin have to break up a demonstration of nearly 40,000 Querdenker and members of far-right militant groups (Reichsbürger)

  • End of August 2020: The second wave begins. 1,000 cases a day in August and 20,000 cases a day in September

  • October, 2020: Berlin imposes a curfew

  • November 2, 2020: Another Lockdown begins. Meetings in public are limited to two households and a maximum of 10 people. Many businesses in the catering, hospitality and tourism sectors again have to close down, as they had in spring

  • January 2021: Vaccine deployment begins but is behind schedule and there are many logistical challenges. Germany no longer stands as a model for how to combat the virus.

  • January 6, 2021: The third wave begins and a strict lockdown is enacted

  • April 2021: Germany reaches 80,000 deaths. A reform of the Infection Protection Act in late April increased federal government powers, allowing it to mandate pandemic measures in hard-hit districts

  • May 2021: The lockdown starts having an impact and infections fall. The third wave is broken. Merkel has her first vaccine dose.

  • June 2021: The government promises that by the autumn all German residents will have had the opportunity to be vaccinated

  • July 2021: While many areas of Germany roll back restrictions amid low COVID-19 case numbers, the country's incidence rate began to rise steadily again due to the delta variant

  • November 2021: Free testing is re-introduced due to a high number of cases after it had been phased out a month earlier.

  • December 2021 and January 2022: Cases rise drastically due to the omicron variant.

Sources:

https://www.dw.com/en/covid-how-germany-battles-the-pandemic-a-chronology/a-58026877

https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Germany


Successful Queries

Overall Opinion on Covid

 

Title: “I think that Covid is [MASK].”

Labels: over, terrible, to blame, gone, harmless (left to right)

 

What I found most interesting about this query is that the main concern among Germans isn’t necessarily the danger or impact of Covid; rather, it is by and large whether the pandemic is over or not. My personal hypothesis is that because Germany’s medical infrastructure was able to more or less handle the pandemic, the biggest worry for citizens is how the virus affects their day-to-day lives (lockdowns, virtual working, etc.). It would be fascinating to extend my research by comparing Germany’s response with that of a country with more deaths per capita to see if words like ‘dangerous’ and ‘life-threatening’ are more prominent. Comparing my findings with new Covid cases per month in Germany highlights a short-term mindset and optimism bias among German citizens: spikes in case numbers coincide with a drop in the perception that Covid is “over”, while low case numbers coincide with spikes in confidence that Covid is over.
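For readers curious how these label scores are produced, here is a minimal sketch of scoring a cloze statement with Hugging Face’s fill-mask pipeline. The checkpoint path is hypothetical, and the German target words are my own back-translations of the English labels shown above, not necessarily the exact tokens used.

from transformers import pipeline

# hypothetical path to one of the monthly fine-tuned models
fill = pipeline('fill-mask', model='models/bert-covid-2021-06')

# "I think that Covid is [MASK]." scored against a fixed set of target words
targets = ['vorbei', 'schlimm', 'schuld', 'weg', 'harmlos']
for result in fill(f"Ich denke, dass Corona {fill.tokenizer.mask_token} ist.", targets=targets):
    print(result['token_str'], round(result['score'], 4))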


Opinions on Lockdowns

 

Title: “I think that this lockdown is [MASK].”

Labels: good, important, over, wrong, shit (left to right)

 

As in the Covid opinion query, the main concern for Germans regarding lockdowns is when they will be over, and this opinion is heavily correlated with the number of cases and the time frames of the lockdowns enacted in Germany. “Lockdown” was not one of the keywords that I used to scrape tweets, so the success of this query highlights the model’s ability to understand concepts that are directly related to the topic at hand but not explicitly in the scope of the research. It is also interesting to note that although the positive and negative opinions of the lockdown (“good”, “important”, “wrong”, and “shit”) were similar in strength throughout the pandemic, the opinion that the lockdown is “wrong” was stronger than the opinion that it is “good”, indicating that the German population doesn’t believe that the benefits of the lockdown outweigh its costs.


Opinions on Masks

 

Title: “I [MASK] Masks.”

Labels: need, hate, wear, love, like (from left to right)

 

Germans’ opinions on masks had the clearest long-term trends. Disregarding fluctuations, Germans became more accustomed to wearing masks over the course of the pandemic. At the same time, both the dislike for and the favorable opinion of masks decreased the longer the pandemic went on. This illustrates that over time, mask wearing became the new normal.


Opinions on the “Querdenker” Movement

 

Title: “I think that the ‘Querdenker’ are [MASK].”

Labels: important, stupid, good, at fault, not important

 

The model shows that there is a clear negative perception of the Querdenkers throughout the pandemic, with initial support in the beginning of Covid waning and eventually dying out. It also seems as though the omicron variant put a major strain on the movement, with Germans blaming the current situation (the increase in cases) on them.


Opinions on the vaccinated and unvaccinated

Title: “The unvaccinated are [MASK].”

Labels: at fault, querdenker, vaccinated, stupid, contagious

Title: “The vaccinated are [MASK].”

Labels: immune, at fault, stupid, dead, contagious

The perception of the vaccinated and unvaccinated didn’t show any noticeable long-term trends; however, the second lockdown (enacted in November) coupled with logistical challenges in vaccine deployment led to the unvaccinated being blamed for the situation at the time. Additionally, the beginning of the vaccine rollout was greeted with a lot of support, as the vaccinated were seen as immune to Covid, but this perception faded over time. The omicron variant also seems to have had a severely polarizing effect on vaccine perception, as both the unvaccinated and the vaccinated were seen as at fault for the rise in cases. It is also important to note that the unvaccinated were associated with the Querdenker movement throughout the pandemic, illustrating the bias and social stigma attached to choosing to remain unvaccinated.


Overall notes: I don’t think going month by month was a good idea, since opinions, especially online, change very quickly. It would be interesting to test one-week intervals, which might also produce a much smoother graph. This would require a lot more training though.

I thought that I would see more long-term trends in opinions towards Covid, but it seems as though people are much more sensitive to the short term. The model is also much more sensitive to the short term because people only tweet about what is currently on their mind. It would be interesting to see whether a different approach would be able to pick up longer-term trends. For example, each model could keep most of its data from the previous months and only add the new tweets for each month, a sort of moving average. It would also be interesting to play around with how the weighting works (i.e. how strongly new data is favored).


Unsuccessful Queries, Overall Takeaways, and Moving Forward

Although a lot of my queries led to very interesting analysis, I was a bit disappointed by the overall quality of the findings. For the queries below there seems to be very little structure indicative of changes in perception; the results look more like very volatile noise. I think that part of this problem is that I am tracking the change by month, a timeframe over which people’s mindsets can drastically change. Since the trends that I have been able to observe are much more meso/micro trends rather than long-term macro trends, analyzing the tweets week by week makes a lot more sense. In addition, I should have realized that opinions are very reactionary, especially on a social media platform like Twitter, and thus my research is much more likely to illustrate how the German population perceives certain policies and events (mask mandates, lockdowns, travel restrictions being enacted, etc.) rather than long-term changes in the perception of concepts related to Covid.

In the queries below the major problem is that the confidence of the model’s outputs is far too low, and thus it doesn’t make sense to make any inferences from the data. This low confidence indicates that the model didn’t fine-tune well to these concepts, which is most clearly demonstrated in the “Spaziergang” query, where the token most likely to fill the mask is a period. Although the query “The biggest problem in Germany is [MASK]” has high confidence and shows long-term trends, the query itself is an issue: all the data is about Covid, so the model will obviously output “covid” as the biggest problem in Germany.

This issue is the tip of the iceberg of a much deeper problem. One of the main reasons for analyzing the perception of Covid-related topics is to be able to compare how Covid is perceived in general public discourse and how it relates in importance to other current issues of the time. However, by training the model solely on data specifically about Covid, I lose this perspective. Essentially, I will always be able to find enough data to train the model, but the amount of data that actually exists is itself indicative of how important the issue is, and by not taking that into account I am unable to perform an accurate analysis.

Originally, I had planned to move on to the election research about Annalena Baerbock. However, I feel like there is a lot more potential in this project and I wouldn’t really be doing it justice if I left it as it is. The plan as it currently stands is to scrape new data from Facebook, Twitter, and/or YouTube using the comments on news media posts rather than a keyword approach. A model will then be trained for every week since the start of the pandemic (I will also be doing this with the data that I currently have to see if it improves results). I will also try to gain more insight into long-term trends by using a sort of moving-average approach, where I fine-tune one model on the first time step of data and make predictions with it, then keep fine-tuning the same model on the next time step of data, and so forth. I think this will yield more stable results and produce a model that behaves more like a human brain in that previous opinions are still taken into account but are less significant than new opinions.
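To make the moving-average idea concrete, here is a minimal sketch of continually fine-tuning a single model week by week. It assumes weekly_datasets is a list of tokenized Hugging Face datasets and collator is a DataCollatorForLanguageModeling; the checkpoint name and output paths are illustrative.

from transformers import AutoModelForMaskedLM, AutoTokenizer, Trainer, TrainingArguments

checkpoint = 'dbmdz/bert-base-german-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

for week, dataset in enumerate(weekly_datasets):
    args = TrainingArguments(output_dir=f'bert-covid-week-{week}',
                             per_device_train_batch_size=4,
                             num_train_epochs=1,
                             learning_rate=2e-5)
    trainer = Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator)
    trainer.train()                                # the same model keeps accumulating weeks of data
    trainer.save_model(f'bert-covid-week-{week}')  # snapshot used for that week's cloze queries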

Luckily, my exams are over which means I have a lot more time on my hands and will be posting a lot more regularly. Stay tuned!

Title: “I think that the covid vaccine is [MASK].”

Labels: good, wrong, sensible, important, ok (left to right)

Title: “I think that the mask mandate is [MASK].”

Labels: sensible, important, wrong, good, gone (left to right)

Title: “The biggest problem in Germany is [MASK].”

Labels: corona, covid, mask mandates, vaccine, the delta variant, the vaccine mandate, the omicron variant (from left to right)

Title: “I think that the vaccine mandate is [MASK].”

Labels: over, good, important, wrong, sensible

Title: “I think that a “Spaziergang” is [MASK].”

Labels: ., corona, important, good, nice

Title: “The Government is [MASK].”

Labels: ready, at fault, through, stupid, vaccinated (left to right)

Benjamin Pusch

Covid: Results!

4/20/22


Welcome back to the Odyssey! The results are in! In this post I will display the graphical results of the temporal queries I am most interested in and in a future post I will analyze the results more closely.


Who is to blame?

 

Title in English: Covid is the fault of [MASK].

Labels in English: government, people, politics, the unvaccinated, the vaccinated (left to right)

 

 

Opinions on Masks

Title: “I [MASK] Masks.”

Labels: need, hate, wear, love, like (from left to right)

 

Overall thoughts on Covid

 

Title: “I think that Covid is [MASK].”

Labels: over, terrible, to blame, gone, harmless (left to right)

 

Opinions on Vaccines

 

Title: “I think that the covid vaccine is [MASK].”

Labels: good, wrong, sensible, important, ok (left to right)

 

Opinions on Government

 

Title: “The Government is [MASK].”

Labels: ready, at fault, through, stupid, vaccinated (left to right)

 

Opinions on Covid Lockdowns

 

Title: “I think that this lockdown is [MASK].”

Labels: good, important, over, wrong, shit (left to right)

 

Opinions on Mask Mandates

 

Title: “I think that the mask mandate is [MASK].”

Labels: sensible, important, wrong, good, gone (left to right)

 

 

Overall Opinions on the Pandemic

Title: “The Pandemic is [MASK].”

Labels: over, ended, at fault, real, history (left to right)

 

Overall Opinion on Issues in Germany

 

Title: “The biggest problem in Germany is [MASK].”

Labels: corona, covid, mask mandates, vaccine, the delta variant, the vaccine mandate, the omicron variant (from left to right)

 

Opinions on the Querdenker Movement

 

Title: “I think that the ‘Querdenker’ are [MASK].”

Labels: important, stupid, good, at fault, not important

 

Opinions on the “Spaziergang”s in Germany

 

Title: “I think that a “Spaziergang” is [MASK].”

Labels: ., corona, important, good, nice

 

Opinions on the Vaccinated and Unvaccinated

Title: “The unvaccinated are [MASK].”

Labels: at fault, querdenker, vaccinated, stupid, contagious

Title: “The vaccinated are [MASK].”

Labels: immune, at fault, stupid, dead, contagious


Opinions on the Vaccine Mandate

 

Title: “I think that the vaccine mandate is [MASK].”

Labels: over, good, important, wrong, sensible

 
Benjamin Pusch

Regular Expressions

4/12/22


Welcome back to the Odyssey! Unfortunately, end of term assignments have kept me from finishing my Covid research. However, I have continued to use a lot of what I have learned from my research in my assignments and I am going to use this post to briefly talk about one of the tools that has proved to be incredibly useful.


What are Regular Expressions?

Regular expressions (aka regex) are sequences of characters that specify a search pattern in text. Basically, instead of matching a string exactly, you can use regex to match a whole set of strings that you are interested in. Regex is a universal tool: most major languages either have regex directly built in or allow you to import a library that provides regex functionality.

Although the concept itself is pretty straightforward, the syntax can look pretty daunting. For example, this is one massive regular expression that I used to filter user input for a project I have been working on:

 
 

I am not going to talk about how to create regular expressions, but if you are interested in learning more I suggest you check out regexr.com which is the website that I used to learn about and test out regular expressions.

How have I used Regex?

This project

As you may have noticed if you have been following my blog, I use regex a lot to clean data. DataFrame objects make this especially easy because the DataFrame apply function allows you to efficiently iterate through a column and apply a lambda function. In my case the lambda function uses re.sub (from Python’s re module) to replace the string in question (a tweet) with the same tweet but with a specified pattern removed. For example, to remove all the links from my tweets I used the following expression:

clean_df['Tweet Text'] = clean_df['Tweet Text'].apply(lambda x: re.sub(r"www\.[a-z]?\.?(com)+|[a-z]+\.(com)", '', x)) #remove links

Orbit Visualization

Regex is also incredibly powerful if you want to filter user input. I am currently working on a data visualization project for one of my college classes, and in order to let the user query data, I needed to be able to check whether their input matched the syntax that I had specified. For one part of the project, I created an interactive 3D orbit visualizer, and instead of writing a complicated function to check each separate case of input, regex allowed me to filter user input in just a couple of lines. The regex expression I used is the same one that I displayed above:

 
 

So now the program only updates which orbits are displayed when I input a valid string that follows the rules described on the right.
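As an illustration of the general pattern (the actual expression and query syntax are in the screenshots above and not reproduced here), validating user input with regex can be as simple as compiling a pattern and calling fullmatch. The field names below are made up.

import re

# hypothetical query syntax: "<field> <operator> <number>", optionally chained with and/or
query_pattern = re.compile(
    r"^(apogee|perigee|inclination)\s*[<>=]\s*\d+(\.\d+)?"
    r"(\s+(and|or)\s+(apogee|perigee|inclination)\s*[<>=]\s*\d+(\.\d+)?)*$",
    re.IGNORECASE)

def is_valid_query(user_input: str) -> bool:
    return query_pattern.fullmatch(user_input.strip()) is not None

print(is_valid_query('apogee > 500 and inclination < 90'))  # True
print(is_valid_query('launch me to the moon'))               # False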

 

In case you are interested, the orbit visualizer is astronomically accurate (for April 2nd, 2022) and below are some of the possible visualizations that I found interesting. These visualizations show all active satellites that match the query I entered.

Thanks for reading! Next post will be about the Covid results! (coming soon)

Benjamin Pusch

Preprocessing 2.0

4/1/22


Welcome back to the Odyssey! In this post I will discuss the additional preprocessing that I will be applying to my dataset as well as the changes to the tokenizer’s vocabulary.


Data Cleaning

As I mentioned in my last post, a closer look at my data revealed that I needed more preprocessing to properly clean it. The specific changes that I implemented were filtering out additional emojis, fixing errors caused by the UTF-8 encoding, deleting repeated tweets, and dealing with special characters. I also noticed that my code was taking an unusual amount of time. After doing some research, I found out that iterating through DataFrame objects using for-loops is very computationally expensive, especially with the amount of preprocessing I am performing and the amount of data I am using. To account for this I have used lambda functions with regex expressions to clean my data, which is a lot less computationally expensive. This is a useful website that I found myself using a lot if you are keen on working with regex expressions.

Cleaning

Additional Emojis:

In addition to regular emojis, the character emojis “\o/” and “¯\_(ツ)_/¯” are very commonly used, and because of this they make it much more difficult for the model to learn a proper representation. I removed them using the code below.

Example of the character emoji “\o/”.

Code to remove these emojis.
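A minimal sketch of how these character emojis can be stripped out with re.sub, using a tiny example DataFrame with the same ‘Tweet Text’ column as my earlier code:

import re
import pandas as pd

clean_df = pd.DataFrame({'Tweet Text': ['Endlich Wochenende \\o/', 'Keine Ahnung ¯\\_(ツ)_/¯']})

# match the literal strings "\o/" and "¯\_(ツ)_/¯"
char_emoji_pattern = r"\\o/|¯\\_\(ツ\)_/¯"
clean_df['Tweet Text'] = clean_df['Tweet Text'].apply(lambda x: re.sub(char_emoji_pattern, '', x))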

Errors in UTF-8 encoding:

Even though both HTML and my code use the UTF-8 encoding, when my tweets are scraped from Twitter, a few symbols are not properly stored. The problem is that some special characters (e.g. “&”, “<”, and “>”) are reserved in HTML and have to be replaced with character entities. In my dataset the most commonly used reserved characters were:

  • “&”, which showed up as “&amp;” and occurred 320,713 times,

  • “>”, which showed up as “&gt;” and occurred 55,565 times,

  • “<”, which showed up as “&lt;” and occurred 20,283 times,

  • and “–”, which showed up as “&amp#8211;” and occurred 5,324 times.

To replace these character entities with actual characters I again used a lambda expression with the regex substitute function. Below you can see the code and how an “&” symbol shows up in my unprocessed dataset vs in a tweet.
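A minimal sketch of this replacement (the entity list mirrors the counts above; clean_df and the ‘Tweet Text’ column are assumed from the earlier code):

import re

html_entities = {'&amp#8211;': '–', '&amp;': '&', '&gt;': '>', '&lt;': '<'}
entity_pattern = re.compile('|'.join(re.escape(e) for e in html_entities))
clean_df['Tweet Text'] = clean_df['Tweet Text'].apply(
    lambda x: entity_pattern.sub(lambda m: html_entities[m.group(0)], x))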

Repeated Tweets:

In my original scrape I had checked for duplicates within the raw data, but the amount of repeats was negligible compared to the total size of the dataset. However, out of curiosity I checked for duplicates in the cleaned data and found that approximately 10% of all the tweets were in fact repeats. The reason I didn’t see these before is that the vast majority of them come from news bots that post the same message with slightly different links; once the links are stripped, the tweets become identical. I put an example and the code that I used to remove the duplicates below.
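A one-line version of the duplicate removal, assuming the cleaned tweets are in clean_df['Tweet Text']:

# keep the first occurrence of each identical tweet text and drop the rest
clean_df = clean_df.drop_duplicates(subset='Tweet Text', keep='first').reset_index(drop=True)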

 
 

Special characters:

I decided to remove repeated special characters because their main purpose is to catch the viewer’s attention and they don’t provide the BERT model any contextual information. However, special characters used on their own do provide valid information and thus shouldn’t be removed. Below are the special characters I focused on and how I dealt with them.

  • “+” and “*”: if repeated, remove all of them; if not repeated, leave them.

  • “|”: if it occurs inside a word, remove it; otherwise only collapse repeated ones, since a single “|” is often used as a full stop.

  • “!”, “(”, “)”, “-”, and “?”: collapse repeated occurrences of these characters into one (e.g. “!!!!” → “!”)

Below is an example tweet using special characters and the code I used to remove the special characters.
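A minimal sketch of these rules with re.sub (clean_df is assumed from earlier; the rules in my real code are a bit longer):

import re

def clean_special_chars(text):
    text = re.sub(r"[+*]{2,}", '', text)             # repeated "+" or "*": remove entirely
    text = re.sub(r"(?<=\w)\|(?=\w)", '', text)      # "|" inside a word: remove
    text = re.sub(r"\|{2,}", '|', text)              # repeated "|": collapse to one
    text = re.sub(r"([!?()\-])\1+", r"\1", text)     # "!!!!" -> "!", "((((" -> "(", etc.
    return text

clean_df['Tweet Text'] = clean_df['Tweet Text'].apply(clean_special_chars)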

 
 

Code

Below is the full code that I used for pre-processing my data.

Vocabulary

In KhudaBukhsh’s paper, roughly 75% of the words used in his data set were out of vocabulary (OOV), meaning that they weren’t present in the BERT tokenizer’s vocabulary list. In contrast, only about 16% of words in my dataset are not present in the German BERT tokenizer. Although KhudaBukhsh’s dataset is clearly much more complex, it is still necessary for me to supplement my tokenizer with frequently occurring words since these words are typically related to Covid and thus are the ones that I actually care about (e.g. corona, impfung, querdenker, etc.). If I don’t add these words the model will learn a much worse representation of the data and will be unable to use these tokens during prediction.

The words that are added to the model’s vocabulary are the most frequently occurring OOV words. As you might imagine, iterating through every example in a dataset in order to find these words is very computationally expensive, and without an efficient implementation your algorithm might take hours to execute. I learned this the hard way after realizing that my code had only reached a quarter of the total examples after running for an hour. A little tip: don’t use for-loops to iterate through pandas DataFrame objects. After converting my DataFrame to a Python dictionary using pandas’ to_dict() function, my code was able to run in a couple of minutes. If you are interested in learning more about efficiently iterating through DataFrames, check out this link.

Below you can see the 20 most common words in the entire dataset and how often they occurred, as well as the 20 most common words that weren’t already present in the vocabulary (the ones I actually added). The first 900 words from the list on the right were added to the tokenizer’s vocabulary list.
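A minimal sketch of this step (the model name and the 900-word cutoff follow what I described above; the file name and everything else are illustrative):

from collections import Counter
import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('dbmdz/bert-base-german-uncased')
vocab = tokenizer.get_vocab()

clean_df = pd.read_csv('clean_tweets.csv')             # hypothetical file name
counts = Counter()
for row in clean_df.to_dict('records'):                # much faster than looping over the DataFrame itself
    for word in row['Tweet Text'].lower().split():
        if word not in vocab:                          # word would otherwise be split into sub-words
            counts[word] += 1

new_tokens = [word for word, _ in counts.most_common(900)]
tokenizer.add_tokens(new_tokens)
# the model's embedding matrix then has to grow to match:
# model.resize_token_embeddings(len(tokenizer))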

 

The weird symbol in the 13th row is how Excel displays the umlaut “ü”.

 

As you can see, all of the most common words that weren’t already present in the vocabulary list are in some way related to covid.

 

Below is an example of how this affects the actual tokenizer in operation:


Stay tuned for the next post when I finally (hopefully) discuss the results!

Benjamin Pusch

Preliminary Results

3/23/22


Welcome back to the Odyssey! In this post I will discuss my first attempt at training and using the BERT model.


After scraping all of the tweets needed to train my models and performing the preprocessing described in my last post, I wanted to get a rough idea of what the model is capable of and of the quality of the learned representation. This post will be split up into the actual training, the problems I ran into, a brief discussion of the results, and how I am moving forward.

Training

Code

 

The code was originally in a Colab notebook, which is where I have been running the finetuning, and is only in a regular file here because it makes it easier to screenshot.

 

Discussion

The code above can be broken down into 4 main parts: loading the data, tokenizing the data, creating the model, and training (fine-tuning) the model. As you can see, Hugging Face really streamlines the process and the whole program doesn’t take that much code. To get a bit of a better understanding of what is actually going on, let’s look at the Hugging Face classes that I am using.

Datasets

This library is pretty self-explanatory. It essentially allows me to convert my CSV file into a Hugging Face dataset object. This object is the only way to pass data into a Hugging Face model for fine-tuning.

AutoTokenizer

I mentioned in my previous post that one major part of pre-processing for language models is tokenization, which consists of breaking up sentences into smaller units called tokens that represent different entities. This might seem pretty easy: just split the text up into words. However, this is easier said than done. For example, splitting text by whitespace doesn’t allow the program to represent entities such as “New York” or “rock ‘n’ roll” as a single token. Further, in the real world the number of different words used is enormous, and creating a unique token for each word leads to the model getting a bad understanding of rarely used words. A way around this, and a method that is currently very widely used, is subword-based tokenization, which splits rare words into smaller meaningful sub-words. The tokenization process that was used by the authors of German BERT (and the tokenizer that I am using) is WordPiece, which is the current state-of-the-art sub-word tokenizer. Hugging Face lets you import and use this tokenizer through the AutoTokenizer class.

You can read more about tokenization here.

AutoModelForMaskedLM

The models available to download on Hugging Face are the website’s main attraction. In order to load a pretrained language model, all I have to do is call an AutoModel class’s from_pretrained function with a valid model name. Since I am performing masked language modeling, I am using the AutoModelForMaskedLM class. To load the German BERT model, I simply call the function AutoModelForMaskedLM.from_pretrained and pass it the parameter “bert-base-german-cased”.

Trainer and TrainingArguments

Training is the most complicated and computationally expensive part of my code. However, Hugging Face has streamlined training so that only two classes have to be used in order to fine-tune a state-of-the-art language model. The Trainer class performs the actual training, taking a TrainingArguments object as a parameter to set the hyperparameters. In order to maintain similarity with KhudaBukhsh’s paper I used the same hyperparameters he did: batch size = 4 (in the paper it was 16, but I had to lower it to 4 because of memory constraints), maximum sequence length = 128, maximum predictions per sequence = 20, fine-tuning steps = 20,000 (I did an equivalent amount in epochs), warmup steps = 10, and learning rate = 2e-5.
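For reference, here is a minimal sketch of the four parts described above. The file name and column name are assumed from my dataset; this mirrors the steps rather than reproducing my notebook exactly.

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# 1. load the data
dataset = load_dataset('csv', data_files={'train': 'june_2020.csv'})

# 2. tokenize the data
tokenizer = AutoTokenizer.from_pretrained('bert-base-german-cased')
def tokenize(batch):
    return tokenizer(batch['Tweet Text'], truncation=True, max_length=128)
tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset['train'].column_names)

# 3. create the model
model = AutoModelForMaskedLM.from_pretrained('bert-base-german-cased')

# 4. train (fine-tune) the model; the collator applies the random masking for MLM
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir='bert-covid-june-2020',
                         per_device_train_batch_size=4,
                         num_train_epochs=4,
                         learning_rate=2e-5,
                         warmup_steps=10)
trainer = Trainer(model=model, args=args, train_dataset=tokenized['train'], data_collator=collator)
trainer.train()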

Results

Below you can see the results from some sample cloze statements I fed to the June 2020 model.

Overall, I was impressed with the quality of the learned representation. When compared to the non-fine-tuned model (German BERT), it is clear that the new model gained an understanding of Germany’s perception of Covid and its related issues. This is evident when we look at the results from cloze statements 2-6. The robustness of the model is especially apparent in examples 2-4, as the different words used for Covid (i.e. Covid, Corona, and Covid-19) all have the same meaning from the model’s point of view.

However, there is also clearly room for improvement. For cloze statements 1, 9, and 14, the fine-tuned model’s output is very similar to the base model’s output despite the fact that the pandemic has had a large impact on the perception of these entities.

Problems and Moving Forward

Training

One of the biggest problems I ran into while trying to actually run the code was the lack of memory on the Google Colab notebooks. As I mentioned above, this prevented me from using the same hyperparameters as the KhudaBukhsh paper (I lowered the batch size from 16 to 4). After noticing that the model had learned an awful representation, I realized that changing the batch size impacts the amount of data the model is trained on when training for a fixed number of iterations. For example, KhudaBukhsh had 20,000 iterations with a batch size of 16, so the total number of data points that the model sees is 20,000 * 16 = 320,000. When I reduced the batch size to 4, I was only training on a total of 80,000 data points (roughly 1 epoch). To account for this, and to have the number of iterations stay consistent if I change the batch size in the future, I trained each model for 4 epochs.

Another major issue was the size of the model configurations. Originally I had planned to save multiple versions of a model during fine-tuning so that I would be able to select the best performing model to use in evaluation. However, after noticing that each model takes roughly a gigabyte of storage space I had to abandon this plan (I am training a total of 24 models) and instead am now using just the final model for each month. This of course prevents me from addressing overfitting, so in the future I will have to find another solution.

Data

After going back to KhudaBukhsh’s paper I also realized that I had miscalculated the amount of data he used. For his temporal tracking, he had a total of 6 million comments that were split up to fine-tune 15 different models, meaning that he had about 400,000 comments per model. In contrast, I had limited myself to 150,000 tweets per model. Moving forward, I am going to increase the dataset size to 300,000 tweets (which is the most I can increase it to while ensuring that each month has the same number of tweets). If this amount of data isn’t enough I will merge datasets and track community perception on a bi-monthly basis.

Taking a much closer look at my data after this first test, I noticed that the tweets have much more complex variation than I had accounted for in my data cleaning. In my second attempt I will perform more preprocessing to get rid of noisy text, and in a future post I will go into more depth about how I am doing this.

Vocabulary and Model Choice

One major part of KhudaBukhsh’s process that I omitted in this test was supplementing the tokenizer’s vocabulary. The vocabulary of a sub-word tokenizer consists of the sub-words that the text will be split up into. This vocabulary is determined during pre-training, and the problem is that because it has a limited size (30,000) and my data is very different from the pre-training data, a lot of the words present in my dataset are not in the vocabulary. This means that the model will have a much harder time understanding the meaning of words that are frequently used in my dataset but aren’t stored in the vocabulary as full words. For example, BERT will find it more difficult to understand words such as “Masken” and “Maskenpflicht”, which explains why the predictions for the cloze statements containing these words weren’t as good as others. Additionally, the model is only able to output a single token. This creates a problem when using cloze statements like “The biggest problem in Germany is [MASK].” because the model will be unable to output words comprised of multiple sub-words, like “Impfung” and “Maskenpflicht”. Moving forward I will supplement the tokenizer’s vocabulary with the 900 most frequently occurring out-of-vocabulary words in my corpus.

I have also decided that I will be using the DBMDZ uncased German BERT model in the future because recent research has shown that it outperforms all other German BERT models in almost every benchmark.

Stay tuned for the next post!

Benjamin Pusch

Preprocessing

3/18/22


Welcome back to the Odyssey! In this post I am going to discuss the preprocessing that I am using to prepare the data set for my BERT model.


Preprocessing is a crucial part of any machine learning and AI pipeline. The phrase “garbage in, garbage out” explains the intuition pretty well. Real-world data can be really messy, and in order to help our model learn better it makes sense to transform the raw data into a more understandable format.

In general, there are five main tasks: data cleaning, data integration, data transformation, data reduction, and data discretization, plus tokenization (unique to language modeling). For my purposes, the only applicable tasks are data cleaning, data reduction, and tokenization. This post will only focus on the first two tasks. I will discuss tokenization in more depth when I start using the BERT model.

Data Cleaning

Data cleaning involves removing errors and inconsistencies and smoothing out noisy data, with the goal of converting “dirty” data into “clean” data. It is important to note that for language modeling you shouldn’t actually do too much data cleaning, because the goal is to learn a representation of the language. By removing a lot of messy data, you can actually be removing the intricate patterns present in the text that provide meaning. For example, if KhudaBukhsh had corrected all of the typos and misspellings present in his dataset for the paper We Don’t Speak the Same Language: Interpreting Polarization Through Machine Translation, he wouldn’t have discovered derogatory misaligned pairs and hence would have had significantly different results. Basically, I only want to remove text that provides no contextual information to the language model and hence is more likely to confuse the model than help it learn a meaningful representation.

Tweets in particular contain a lot of these confusing strings of text and it is necessary to remove them for me to use the BERT model appropriately. I will be removing hashtags, mentions, links, and emojis.


Hashtags

Hashtags are a bit unique because some hashtags provide useful context information while others don’t. After manually combing through the data, I noticed that hashtags used within sentences are typically used in place of actual words and hence only the actual hashtag character should be removed. Hashtags used at the very beginning or end of a tweet on the other hand are simply useless noise and should be completely deleted (see example).



Useful hashtag within a sentence.

Noisy hashtags at the end of tweet that provide no useful information


Mentions

Much like hashtags, mentions can provide useful context information when the user mentioned represents an entity that has a greater cultural significance (e.g. government branches or political figures). Typically though, this isn’t the case. Since differentiating these kinds of mentions is very difficult and would likely have a negligible impact on performance, I decided to simply remove all mentions.




Typical mentions (usernames removed for privacy)

Potentially helpful mentions


Links

To us humans, links are a useful way to share information and are interpreted as representing some idea or argument. However, for a language model, links are just a string of seemingly random characters that somehow impact the words used in a tweet. Because of this, links tend to confuse language models and should be removed.

Tweet with a link


Emojis

Whether to include or exclude emojis for language modeling is debated in the academic literature. Since I didn’t want my models to predict emojis when filling in cloze statements, I decided to remove them.
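Putting the four rules together, a minimal sketch of the cleaning step could look like the following. The hashtag, mention, and link patterns are simplified versions of what I described above, and the emoji ranges are only a rough approximation.

import re

EMOJI_RANGES = re.compile('[\U0001F300-\U0001FAFF\U00002600-\U000027BF]')

def clean_tweet(text):
    text = re.sub(r"(?:\s*#\w+)+\s*$", '', text)    # hashtags trailing the tweet: delete completely
    text = re.sub(r"#(\w+)", r"\1", text)           # in-sentence hashtags: keep the word, drop the '#'
    text = re.sub(r"@\w+", '', text)                # mentions: remove entirely
    text = re.sub(r"https?://\S+", '', text)        # links: remove entirely
    text = EMOJI_RANGES.sub('', text)               # rough emoji removal by unicode range
    return text.strip()

print(clean_tweet('Die neue #Maskenpflicht gilt ab Montag @tagesschau https://t.co/abc #corona #covid'))
# -> 'Die neue Maskenpflicht gilt ab Montag'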


Data Reduction

Data reduction involves reducing the number of attributes and the actual number of examples in the data set.

If you can remember, the data from my scrape had the following attributes: Tweet ID, Date, Username, User Location, Language, Tweet URL, Tweet Text. However, every attribute besides the actual text was only used to identify the tweets or analyze their validity. Additionally, the language model only needs the actual text for fine-tuning. Therefore, we can remove every column except for Tweet Text.

I also mentioned in a previous post that the number of tweets scraped every month varied significantly. Since the amount of data available will impact the performance of a language model, it is necessary to standardize the dataset size for each month in order to be able to accurately compare the cloze statements from month to month. I decided that I will be using 150,000 tweets per month. Since January 2020 and February 2020 both have fewer tweets than this, I combined the two datasets.
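Before the full code below, here is a minimal sketch of both steps (file names are illustrative):

import pandas as pd

df = pd.read_csv('march_2020_clean.csv')

# keep only the text column needed for fine-tuning
df = df[['Tweet Text']]

# standardize the dataset size across months by random sampling
df = df.sample(n=150000, random_state=42)
df.to_csv('march_2020_sampled.csv', index=False)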

Code

Data Cleaning and Attribute Reduction Code

 
 

Random Sampling Code

 
 

Results

Benjamin Pusch

Software and Getting into AI

3/8/22


Welcome back to the Odyssey! I’ve had a couple of friends in the past few weeks ask me about what software I use and how to get into AI, so I want to use this post to briefly talk about what I use.


Software

PyCharm

Pycharm is an integrated development environment for python. I started using PyCharm because it was the software I had to use for a school class, but I stuck with it because of its advanced code editor. PyCharm makes coding a lot more streamlined by having intelligent code completion, on-the-fly error checking, easy project navigation, and a plethora of productivity features. One feature that I find myself using often is the ability to have multiple windows and tabs of code open at the same time. This proves especially useful when I am just updating or modifying a piece of code, rather than starting from scratch. Also, since PyCharm is so widely used, diagnosing and solving problems is a lot easier than if you were using a more obscure IDE. The main downside to using Pycharm is that all of the features it provides can make it run slower than other IDEs like Visual Studio Code.

You can download PyCharm here.

Google Colab

Google Colab is the other main piece of software that I use to code in python. I have actually discussed Google Colab in a previous post, so just to summarize:

“Colaboratory, or “Colab” for short, is a product from Google Research. Colab allows anybody to write and execute arbitrary python code through the browser, and is especially well suited to machine learning, data analysis and education. More technically, Colab is a hosted Jupyter notebook service that requires no setup to use, while providing access free of charge to computing resources including GPUs.” - Google Research

I use Colab instead of Pycharm when I have programs that need to run for several hours. For these programs, running the code on my local machine isn’t feasible and running it on the cloud through Colab makes a lot more sense. I have considered fully transitioning from PyCharm to Colab, but I have noticed that because Colab is a hosted Jupyter notebook, there are a few unique issues that can be incredibly difficult to solve, making it impractical to be my main coding environment.

You can check out Colab here.

How do I learn more about AI?

I have dedicated a whole section of my website just to learning more about AI. If you’re interested I encourage you to check out the Want More? page. I talk about some of my favorite journal articles as well as some really helpful online courses that I have used.

Benjamin Pusch

Success!

3/3/22


Welcome back to the Odyssey! After 70 hours, my program was able to scrape all of the tweets related to Covid and vaccinations in Germany.


Testing the Waters

I mentioned in one of my earlier posts that I had a few concerns about the tweets I would be scraping, namely how well the German language parameter would filter for tweets from people who culturally identify as German, how long the scraping would take, and how much space I would need to store the data. Before I started the actual scrape, I checked whether these concerns would become issues by analyzing a scrape from a random month. I took a random sample of 100 tweets from the scrape and found that only 6 tweets were from users outside Germany and only 7 tweets were not directly related to Covid.

In KhudaBukhsh’s paper, approximately 30% of his data was not related to the topic he was analyzing, so the noise level in my data was comparatively low. Additionally, the test scrape only took a bit over two hours and needed only 114 megabytes of storage. Assuming that the number of tweets in this month (June 2020) was roughly equal to the number of tweets in other months, I estimated that the full scrape would take about 2 days, at most 3, and need only about 3 gigabytes of storage. I decided that neither the duration nor the storage would be a problem.

Verification for the test scrape

The Scrape

The actual scrape took 70 hours to complete and a total of 4.1 gigabytes to store, a bit over my initial estimate.

The Code

 

As you can see, the code to scrape the tweets isn’t that complicated. I have a string variable that holds the base query that is used to search, as well as two lists that hold the date part of the search query and the file name for each scrape, which changes for every month.

twitter_query_no_dates: This is the base query. It ensures that every tweet pulled contains at least one of the keywords (which are separated by ORs) and has been written in German.

dates: I iterate through this list in my code and append each entry to the end of twitter_query_no_dates so that the query only searches within the month specified. This allows me to save my data in multiple, more organized files rather than one massive file. (You can only see the last 7 months here.)

file_names: This list holds the names for each file that will be saved

The loop: This loop consists of an inner and an outer loop. The outer loop iterates through each month and creates a list to store all of the tweets for that month. For each iteration of the outer loop, the inner loop iterates through every tweet matching the corresponding query using snscrape’s TwitterSearchScraper method. Once the inner loop finishes, I convert the list into a pandas.DataFrame object and save it as a CSV file. You can also save the DataFrame as a JSON file, which is computationally more efficient, but CSV files are more human-friendly, which is why I opted to use them.
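Since the screenshots don’t copy well, here is a minimal sketch of the structure described above. The keywords and months are only a small illustrative subset of the real query, and snscrape’s tweet attribute names (for example content vs. rawContent) vary between versions.

import pandas as pd
import snscrape.modules.twitter as sntwitter

twitter_query_no_dates = '(corona OR covid OR impfung) lang:de'
dates = ['since:2020-03-01 until:2020-04-01', 'since:2020-04-01 until:2020-05-01']
file_names = ['march_2020.csv', 'april_2020.csv']

for date_range, file_name in zip(dates, file_names):
    tweets = []                                       # all tweets for this month
    query = f'{twitter_query_no_dates} {date_range}'
    for tweet in sntwitter.TwitterSearchScraper(query).get_items():
        tweets.append([tweet.id, tweet.date, tweet.user.username, tweet.user.location,
                       tweet.lang, tweet.url, tweet.content])
    df = pd.DataFrame(tweets, columns=['Tweet ID', 'Date', 'Username', 'User Location',
                                       'Language', 'Tweet URL', 'Tweet Text'])
    df.to_csv(file_name, index=False)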

The Data

Visualization of the dataset: This image is from the raw March 2020 dataset. I have removed personal identifying information because of privacy concerns.

My program scraped a total of 12 million tweets. The distribution across months wasn’t uniform, with January 2020 having only 10K tweets and November 2021 having 1.1 million tweets. As I mentioned in an earlier post, I will be using a random subset of tweets to address the variability across months. However, I didn’t expect such an extreme difference in the number of tweets. January and February 2020 are the main problem, as all other months are in a similar range. To address this I will either have to merge the data for January and February or remove them altogether, depending on the sample size that I use.

Ethics

All tweet attributes.

I want to talk a bit more about what I actually saved from each tweet, because it touches on some broader ethical concerns regarding scraping. You might have noticed that when I store my list as a DataFrame object, I pass a second parameter, columns. This parameter allows me to specify which attributes from each tweet I want to save (Tweet ID, Date, Username, User Location, Language, Tweet URL, Tweet Text). What’s important to note is that these tweets reflect the opinions of real people, and therefore scraping any information that isn’t strictly necessary is highly unethical. Further, as it currently stands, my dataset contains identifying information such as the tweet ID, the username of the author, and the tweet URL. While I need this information to clean my data, if I were to share this dataset, these personal identifiers could easily be used to target and harass specific users based on their views. Therefore, especially when scraping massive amounts of data like I have done, it is crucial to be aware of the privacy concerns and act accordingly. For example, during my cleaning procedure I will remove everything, including mentions, except for the actual body of the tweet.

Now that I have the tweets, the next step is to process the data. Stay tuned for one of the upcoming posts to see how I clean the tweets.

Benjamin Pusch

Servers and Remote Control

2/23/22


Welcome back to the Odyssey! Today I want to share what environment I am using to scrape my tweets.


Scraping takes a long time. If you read journal articles where the authors had to create a dataset of tweets, you will often see that it took multiple days for their program to execute.

This is a problem. As someone who has to use their computer every day for college work, I can’t just have a program running in the background for several days. It would kill my battery and I wouldn’t be able to close my laptop. Luckily there are several alternatives which I will discuss in this post.

Google Colaboratory

“Colaboratory, or “Colab” for short, is a product from Google Research. Colab allows anybody to write and execute arbitrary python code through the browser, and is especially well suited to machine learning, data analysis and education. More technically, Colab is a hosted Jupyter notebook service that requires no setup to use, while providing access free of charge to computing resources including GPUs.” - Google Research

I use Colab pretty frequently because of its flexibility. By having all of my Python files on my Drive, I can access them on any computer and don’t have to worry about the hardware. Colab also allows you to have multiple sessions open at once, which is useful if you are running longer programs (such as a scraper) and want to work on something else in the meantime. The main downside of Colab is that resources are not guaranteed, which means that you might not get the best CPUs or GPUs available every time you connect to Google’s servers. A second downside is that the maximum runtime for a Colab notebook is 12 hours. This makes it annoying to use for scraping, which can often last much longer than 12 hours. You can upgrade your Colab account to get longer runtimes, but that costs money.

Despite the runtime issue, I actually tried using Colab as my environment for this project; however, I ran into an issue unique to the snscrape library. Snscrape requires Python 3.8, but Colab currently only supports up to Python 3.6. So if you were planning on using snscrape on Colab, you will have to resort to one of the alternatives.

 
 

Cloud Computing

Cloud computing is another excellent way to scrape data. Companies that offer cloud computing infrastructure allow you to rent GPUs or CPUs, which in my case could then be used to scrape tweets or train my language model. The major downside is that these servers cost money to use, which I am not willing to do. However, if you are interested in using a cloud server, I would recommend that you take a look at one of the following:

Amazon’s AWS

Linode

Remote Control

Remote desktop control is a very unconventional way of getting around the problem of using your own computer for computationally intensive tasks. It allows you to control another computer through your own. I am currently living in Dublin, but I have a desktop back in the US that no one is using. Basically, since Colab didn’t work and I don’t want to pay for a cloud server, I thought it would be fun to mine tweets on my desktop in the US. The remote control software that I am using is free and you can download it at the website below:

TeamViewer

P.S. The next post will be about the scraping results, so stay tuned!

Benjamin Pusch

Brick Walls

2/16/22


Welcome back to the Odyssey! I finally have an update on the project; unfortunately, it wasn’t what I was hoping for.


The Problem: Twitter API

In the past two weeks, I have constantly been hitting brick walls trying to scrape the Covid tweets. From the very beginning, using the Twitter API was a struggle. Immediately after I created my account, I received the following email:

 
 

I was confused about what I had done to deserve a suspension and appealed it through Twitter’s support system. Four days later my account was reinstated. Apparently, Twitter’s automated system had accidentally flagged my project as abusive API use:

 
 

The next problem wasn’t far around the corner. As I mentioned in my previous posts, the standard access to the Twitter API is very constraining and doesn’t have the functionality I need to scrape tweets for this project. Twitter does have an application process that allows you to request access to either the Elevated or Academic Research version of the API. I applied first for Elevated access, which was given to me only after I explained several times what my project was and what my intentions with the API were. However, I quickly realized that the Elevated API adds barely any functionality, and the exhaustive search endpoint that I need is only accessible through the Academic Research access level. I once again went through the application process to try to get access to the Academic Research API. This time, though, I was rejected, and no amount of follow-up emails was able to change Twitter’s mind. It turns out that the Academic Research API is not available to undergraduate students who aren’t officially affiliated with a research department.

 
 

The Solution: Snscrape

I reached the conclusion that I wouldn’t be able to use Twitter’s API, leaving me in a bit of a pickle because I wasn’t sure how else I would be able to get tweets. I have discussed some possible alternatives to Twitter (e.g. Facebook) in one of my earlier posts, but I didn’t want to give up so easily. After doing some research I stumbled across open-source web scrapers, which provide source code that allows you to extract data from web pages. I landed on snscrape, an open-source scraper that specializes solely in social network services. It provides a module specifically for Twitter that allows me to input a search query and extract all the tweets matching that query, which is the same functionality that I would have gotten with the Twitter API’s exhaustive search endpoint.

I think this experience highlights an important part of getting involved with AI. These projects are very complex and there will always be issues. If you treat these setbacks as crises that can only set you back, you will not be successful. If you instead treat each setback as an opportunity to find a better way forward, you will find that the problems you run into are a blessing and not a curse. For me specifically, using snscrape will now allow me to scrape more tweets than I would have been able to with Twitter’s API while maintaining the exact same functionality. In fact, I have already run a successful test scrape, so stay tuned for the results soon!

Benjamin Pusch

Ethics and AI

2/7/22


Welcome back to the Odyssey! In this post I am going to discuss the impact AI has on our society using three fascinating articles from the MIT Technology Review.


MIT Technology Review

MIT Technology Review is a magazine owned by MIT that publishes news about the world’s newest and most innovative technologies. What I love about them is that they focus both on the technical aspect of tech, as well as its greater impact in society. You can subscribe and get access to their magazine for $50 a year. I’ve been subscribed for a bit over 3 years now and I think it’s totally worth it. If you’re unwilling to pay the fee, you can also always just delete your cookies after you run out of free articles.

Of course technology perpetuates racism. It was designed that way.

This article isn’t specifically about AI, but its major takeaways are incredibly relevant to AI. McIlwain discusses how one of the first uses of predictive modeling was when President Johnson wanted to create a surveillance program in “riot affected areas” to discover the causes of the “ghetto riots” in the long hot summer of 1967. The information gathered was used to trace information flow during protests and decapitate the protests’ leadership. This laid the foundation for racial profiling, predictive policing, and racially targeted surveillance.

We’ve already started going down the same path with AI. Contact tracing and surveillance during the pandemic employ AI systems and are once again making Black and Latinx people the threat. Automated risk-profiling systems disproportionately identify Latinx people as illegal immigrants, and facial recognition technologies help convict people on the basis of skin color. The academic community is aware of AI’s propensity to pick up bias, yet the impact this might have never seems to be considered by researchers. Moving forward, AI development and implementation need to be seen through an ethical lens rather than a results-driven one.

You can find the article here.

AI has exacerbated racial bias in housing. Could it help eliminate it instead?

“Few problems are longer-term or more intractable than America’s systemic racial inequality. And a particularly entrenched form of it is housing discrimination.”

This article discusses how, even though automated mortgage lending systems are not explicitly built to discriminate, they still end up learning unfair policies that disproportionately hurt Black and Hispanic borrowers. These systems are designed to maximize profit, but their designers did not account for the racial consequences of that focus. A study mentioned in the article found that the pricing of approved loans differed by roughly $800 million a year because of race. The article also discusses how far behind regulators are in understanding how these systems even work. To fix this problem, the article argues that we need educated regulators who understand how AI works, as well as more diversity and foresight in the teams developing the algorithms.

The issue raised by automated lending algorithms applies to every application of AI. There is an industry-wide lack of consideration for the complexity of the problem that the system being implemented is trying to address. If this practice continues, there will be more and more severe consequences disproportionately affecting vulnerable groups. To fix this, project teams have to be interdisciplinary and focus on the implicit consequences of their decisions, not just the explicit ones.

You can find the article here.

An AI saw a cropped photo of AOC. It autocompleted her wearing a bikini.

“Feed one a photo of a man cropped right below his neck, and 43% of the time, it will autocomplete him wearing a suit. Feed the same one a cropped photo of a woman, even a famous woman like US Representative Alexandria Ocasio-Cortez, and 53% of the time, it will autocomplete her wearing a low-cut top or bikini.”

Bias in data is a serious problem. Virtually all high-performing AI systems require massive amounts of data to be trained properly, and that data naturally contains biases that the model then picks up and exploits to improve performance. In the case of image generation models, these biases sexualize women, leading to the scenario described above.

Not being able to control what a model learns, and the consequences that lack of control has, is what inspired me to start this blog in the first place. Currently, there is a serious lack of concern in the industry about AI’s opacity and bias problems, even though the consequences are devastating. It seems as though we are so focused on making artificial intelligence exactly like human intelligence that we haven’t taken a step back to question whether that is the best path to go down. Humans are plagued by cognitive biases. Systems that emulate our behavior will have the same problem and ultimately end up making pervasive societal issues like racism and sexism worse.

You can find the article here.

Benjamin Pusch

Time Travel

2/2/22


Welcome back to the Odyssey! Today I am going to deviate a bit from the regular program to talk about one of my favorite, and probably the most interesting, journal articles I have ever read.


When I was younger I was obsessed with ancient civilizations — the Sumerians, Ancient Greeks, Ancient Egyptians, the Chinese dynasties, you name it. I was enthralled by their magical cultures and legendary stories. If I had one wish back then, it would have been to travel back in time to talk to the ancient Greeks or Egyptians and learn what their life was like. In a way though, we can do this with AI. If you were to train a language model on the texts of an ancient civilization, it would learn the values, beliefs, and norms that define its society, allowing us to ‘talk’ with the civilization.

The journal article that this post is about does something similar. The authors process historical texts and use the word embeddings generated to study historical trends and social change regarding ethnic and gender stereotypes in the last 100 years.

The Paper

Word Embeddings

I have discussed word embeddings in a previous post. To summarize, a word embedding is a learned representation of the text in your data set where each word is represented by a unique vector in some vector space of predefined size. This representation encodes the meaning of each word such that the words that are closer in the vector space are expected to be similar in meaning.

This context-based representation means that embeddings can be used to measure the relative strength of association between words by comparing the Euclidean distances between their corresponding vectors. For example, you would expect the distance between the vectors for ‘ocean’ and ‘water’ to be smaller than the distance between the vectors for ‘desert’ and ‘water’, meaning that ‘water’ is more strongly associated with ‘ocean’ than with ‘desert’.

Methods

The paper studies the changes in ethnic and gender stereotypes over time by measuring the strength of association between occupations and adjectives. To be more exact:

“As an example, we overview the steps we use to quantify the occupational embedding bias for women. We first compute the average embedding distance between words that represent women—e.g., she, female—and words for occupations—e.g. teacher, lawyer. For comparison, we also compute the average embedding distance between words that represent men and the same occupation words. A natural metric for the embedding bias is the average distance for women minus the average distance for men. If this value is negative, then the embedding more closely associates the occupations with men. More generally, we compute the representative group vector by taking the average of the vectors for each word in the given gender/ethnicity group. Then we compute the average Euclidean distance between each representative group vector and each vector in the neutral word list of interest, which could be occupations or adjectives. The difference of the average distances is our metric for bias—we call this the relative norm difference or simply embedding bias.”
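To make sure I understand the metric, here is a minimal sketch of how I read that procedure, written with generic group and neutral word lists and a hypothetical `vectors` dictionary that maps each word to its embedding as a NumPy array:

```python
# Relative norm difference / embedding bias, as I understand it from the quoted passage.
import numpy as np

def embedding_bias(group_a, group_b, neutral_words, vectors):
    """Average distance from the neutral words to group A's representative vector,
    minus the average distance to group B's representative vector."""
    rep_a = np.mean([vectors[w] for w in group_a], axis=0)  # representative group vectors
    rep_b = np.mean([vectors[w] for w in group_b], axis=0)
    dist_a = np.mean([np.linalg.norm(vectors[w] - rep_a) for w in neutral_words])
    dist_b = np.mean([np.linalg.norm(vectors[w] - rep_b) for w in neutral_words])
    # A negative value means the neutral words (e.g. occupations) sit closer, in Euclidean
    # distance, to group A's representative vector than to group B's.
    return dist_a - dist_b
```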

Results - Gender Stereotypes

Through their use of word embeddings, the authors of this paper found that language today is even more biased than traditional methods like occupational data analysis show. Further, the embedding bias captures stereotypes in a far more nuanced and accurate fashion than occupational statistics.

Their results also show that bias, as seen through adjectives associated with men and women, has decreased over time, and that the women’s movement in the 1960s and 1970s in particular had a systemic and drastic effect on women’s portrayals in literature and culture.

Using word embeddings to analyze biases in adjectives is especially important because, while the effects of the women’s movement on inclusive language are well documented, the literature currently lacks systematic and quantitative metrics for adjective biases.

The paper presents the change in these biases, as seen through adjectives associated with men and women, both quantitatively and qualitatively.

Results - Ethnic Stereotypes

The paper also illustrates the effectiveness of word embeddings for studying ethnic biases over time.

Word embeddings allowed the researchers to better understand how broad trends in the 20th century influenced the view of Asians in the United States. The embedding bias showed that prior to 1950, strongly negative words, especially those often used to describe outsiders, were among the words most associated with Asians: barbaric, hateful, monstrous, bizarre, and cruel.

However, a rising Asian population in the United States after 1950 led to these words being largely replaced by words often considered stereotypic of Asian Americans today: sensitive, passive, complacent, active, and hearty, for example. Word embeddings allowed the researchers to quantify this shift, illustrating a remarkable change in attitudes towards Asian Americans as words related to outsiders steadily decreased in their association with Asians over time (the exception being WWII).

The researchers also found that word embeddings serve as an effective tool for analyzing finer-grained trends, using the stereotypes towards Islam, Russians, and Hispanics as their foci. Word embeddings were able to capture how global events, such as 9/11 and the Cold War, led to sharp changes in the stereotypes towards these groups, while more frequent but less salient events had a more gradual impact.

Results - Validation

To validate the effectiveness of this approach, the researchers compared their results to the occupational differences between the groups in question using US census data, which supported their findings.

Why Should You Care?

Accurately and quantitatively measuring changes in a societal characteristic as subtle as bias is a very difficult, but vitally important, task. Typically, metrics like poverty, GDP, and inequality are used to measure the improvement or progression of society. If a country is richer and more equal, it is better off. And while that is generally true, these metrics miss the intricate details of human behavior that have a major impact on our lives. If we are not able to quantify and accurately measure the change in social biases, we won’t know what direction to head in. Hence, AI’s ability to pick up these subtleties allows us to map a path forward, make better decisions, and evaluate the effectiveness of initiatives that try to tackle these issues.

Benjamin Pusch

Scraping Tweets

1/27/22


Welcome back to the Odyssey! I am still waiting to hear back from Twitter about the higher access levels, so in the meantime I thought I would use this post to go more in depth about my Twitter scraping.


I’ve decided to start with just the Covid vaccine focus of my project for now, since the politician aspect is a bit more complex. In order to scrape all of the tweets that are talking about Covid, I have created a set of keywords that must be present in the tweets (they appear in the full query below). In addition, I am using Twitter’s language filter so that only tweets in German are scraped. Finally, I will be using Twitter’s time and date parameters to split up the data I scrape into datasets for each month.

The actual query that will be fed into the exhaustive search endpoint function is:

“Impfung OR Impfpflicht OR Querdenker OR Omicron OR Delta OR Corona OR covid OR Maskenpflicht OR (Verschwörung AND Corona) OR (Spaziergang AND Corona) lang:de until: ‘date2’ since: ‘date1’ “

In English:

“Vaccine OR Vaccine Mandate OR Lateral Thinker OR Omicron OR Delta OR Corona OR Covid OR Mask Mandate OR (Conspiracy AND Corona) OR (Stroll AND Corona) lang:de until: ‘date2’ since: ‘date1’ “
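Since I need a separate dataset for each month, I will probably generate these queries programmatically rather than typing the dates out by hand. A rough sketch of what I have in mind (the helper name and date range are just illustrative; the keyword string is the one above):

```python
# Build one search query per month, each bounded by since/until dates (a sketch).
from datetime import date

KEYWORDS = ("Impfung OR Impfpflicht OR Querdenker OR Omicron OR Delta OR Corona OR covid "
            "OR Maskenpflicht OR (Verschwörung AND Corona) OR (Spaziergang AND Corona)")

def monthly_queries(start_year=2020, start_month=1, months=24):
    """Yield one query per month covering the full month from the 1st to the next 1st."""
    year, month = start_year, start_month
    for _ in range(months):
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        since, until = date(year, month, 1), date(next_year, next_month, 1)
        yield f"{KEYWORDS} lang:de since:{since} until:{until}"
        year, month = next_year, next_month

for query in monthly_queries(months=2):
    print(query)
```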

Some example results from the query are shown below.


Potential Failure Points

Since this is my first time scraping tweets, it is important to acknowledge some potential failure points that I might run into:

Effectiveness of Twitter

Do the tweets actually capture the opinion of the population? By using Twitter, I am assuming the tweets I scrape provide a sample that accurately represents the whole population. While the amount of literature using Twitter for this purpose is a pretty good reason to believe that this is in fact the case, there is definitely a possibility that the people who frequently use Twitter are just a small, but particularly outspoken, subset of the population. It will be difficult to confirm or reject my assumption because I can’t detect this error by simply looking at the data.

Reflection of German Opinion

I need to filter the tweets so that they only reflect the opinions of Germans. I will be using the language parameter in Twitter’s search query to ensure that this is the case. However, by doing so I am assuming that a tweet being in German implies that its author is culturally German, which isn’t necessarily true. To account for this, I will sample some tweets from the scrape that have location tags. Going off KhudaBukhsh’s paper, as long as 70% or more of the tweets are from Germany, the robustness of the BERT model will ensure that the representation learned reflects the opinions of Germans. The reason I am not using location tags to begin with is that they are optional and most users don’t enable them. Hence, I wouldn’t be able to scrape enough tweets to train the models.

Time and Storage Issues

I have no idea how many tweets will be scraped per month. Since Covid has been a major talking point for the last two years, it is likely that this number is in the millions. This presents two problems. First, I could run out of storage on my computer due to the volume of tweets. The bigger concern, though, is the time it will take to scrape and train the models. Depending on how fast my computer can pull the tweets, it could take upwards of one week. Further, the access level of my Twitter App will determine the maximum number of tweets I can scrape, which might be well below the number of tweets that I need. Fine-tuning the models on this quantity of tweets is another problem due to the time-consuming nature of training BERT. To avoid this, I will take a random sample of equal size for every month to standardize the number of tweets used and to reduce the time of fine-tuning. I am also going to estimate the overall number of tweets that I will be scraping by testing the API on only one month’s worth of data first, allowing me to then evaluate the total time needed and storage necessary.
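For the standardization step, I am picturing something like the following (a sketch only, assuming the scraped tweets end up in a CSV with hypothetical `date` and `text` columns and that every month has at least `SAMPLE_SIZE` tweets):

```python
# Draw an equally sized random sample of tweets from every month (a sketch, not final code).
import pandas as pd

df = pd.read_csv("scraped_tweets.csv", parse_dates=["date"])  # hypothetical file and columns
df["month"] = df["date"].dt.to_period("M")

SAMPLE_SIZE = 50_000  # placeholder; the real number depends on the test scrape
sampled = df.groupby("month").sample(n=SAMPLE_SIZE, random_state=42)

print(sampled.groupby("month").size())  # should show the same count for every month
```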

Stay tuned for the next update!

 
 
Benjamin Pusch

API Setup and Tweepy

1/20/22


Welcome back to the Odyssey! Today I wanted to make a bit more of an informational post to talk about using the Twitter API to scrape tweets through Python.


Twitter API

Social media platforms know how useful they are for researchers and offer APIs that allow you to interact with their services. Twitter in particular is very widely used by researchers because almost every tweet is public and its search functionality allows for very specific queries. This makes knowing how to scrape tweets a pretty useful skill to have. In order to get started you have to go to Twitter’s developer portal and create a new app. After you finish the form to create the app, you will receive a pair of consumer keys and a pair of access keys. These keys are necessary to use the API later in Python, so write them down somewhere.

The Twitter API lets you access Twitter through various endpoints, which are essentially functions that let you read information from Twitter or post to it. Some of the most useful are the search and timeline endpoints, which return the tweets matching a certain query and the tweets from a specified user, respectively. I will primarily be using the search endpoint, but the timeline endpoint is a handy alternative in case I just want to scrape replies to posts by news agencies. You can check out the other endpoints here.

Tweepy

Tweepy is a library that allows you to access the Twitter API through Python. Using it is pretty straightforward, and if you have any experience with Python you shouldn’t really run into any problems. You can use pip to install it — ‘pip install tweepy’.

In order to use the API, you have to create a client object using the keys you wrote down earlier.

After that, all you have to do is call the endpoints on the client object. In my example, I use the client object to scrape the 20 most recent tweets from a user’s timeline.
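Something along these lines (a sketch rather than my exact code, assuming tweepy’s v2 `Client` interface; the key strings and the username are placeholders):

```python
# Create a client with the keys from the developer portal and pull a user's recent tweets.
import tweepy

client = tweepy.Client(
    bearer_token="BEARER_TOKEN",
    consumer_key="CONSUMER_KEY",
    consumer_secret="CONSUMER_SECRET",
    access_token="ACCESS_TOKEN",
    access_token_secret="ACCESS_TOKEN_SECRET",
)

# Look up a user by handle, then hit the timeline endpoint for their 20 most recent tweets.
user = client.get_user(username="tagesschau")
timeline = client.get_users_tweets(id=user.data.id, max_results=20)

for tweet in timeline.data:
    print(tweet.id, tweet.text)
```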

That’s about it - it’s pretty straightforward to scrape tweets from Twitter using the API. That being said, I actually can’t scrape any tweets currently since I only have standard access. Twitter offers 3 levels of access (standard, elevated, and academic research), and in order to use the exhaustive search endpoint you need the academic research level of access. Additionally, the standard access level only allows each Twitter App to pull at most 500,000 tweets per month at a rate of 300 requests per 15 minutes. This means that it would take me 41 hours to scrape 500,000 tweets, which won’t be enough for all of my models. Luckily, you can apply for the higher access levels, so hopefully by the time I post my next update, I’ll have the academic research access.

Benjamin Pusch

Fine Tuning

1/16/22


Welcome back to the Odyssey! Today I will be discussing how exactly I am going to be fine-tuning the German BERT model.


In order to study Germany’s perception of vaccines and female politicians, I need a dataset that captures the opinions of a subset of Germans whose views are representative of the country as a whole. In KhudaBukhsh’s case, he needed a dataset that would capture India’s perception of political entities. His approach was to scrape comments from national and regional Indian YouTube news channels and use them to fine-tune his model. While this method worked well for him, it won’t work for me for two main reasons:

  1. India’s young population uses social media as its primary way to consume news, with both regional and national news channels having millions of subscribers. In contrast, Germany has a much older population that primarily consumes news through TV or the newspaper, and as a result a lot of major news outlets either don’t have YouTube channels or have very low engagement. Therefore, using YouTube comments would not provide enough data, and the data collected wouldn’t accurately represent the population’s views.

  2. In KhudaBukhsh’s study, around 70% of the videos scraped were about politics, meaning that the model would be able to get a good understanding of the political discourse. German news media’s YouTube content, on the other hand, focuses mainly on clickbait videos discussing celebrity gossip, as those are the only types of videos that get views. Since filtering videos based on the topic they discuss is a pretty challenging task, it makes a lot more sense to look for a different data source.

Twitter is an excellent alternative, as its search functionality allows me to filter for tweets that contain certain words or phrases. This lets me scrape tweets that express an opinion on the topics of interest and hence study the aggregate community perception of these issues. Twitter isn’t the only alternative, though; I could also try a hybrid approach with Facebook where I scrape the replies to posts from news outlets, similar to KhudaBukhsh’s YouTube approach, but only use the posts that contain keywords relating to the topic of interest.

The next step moving forward is to learn about the Twitter API and teach myself to scrape tweets with it. Stay tuned!

Benjamin Pusch

BERT

1/13/22


Welcome back to the Odyssey! As I’ve mentioned earlier, for this project I will be using the language model BERT. But, what is a language model and how exactly does BERT work? In this post, I want to briefly answer those questions.


What is language modeling?

In essence, it is the problem of modeling the probability distribution of natural language text. To put this in plain English, we are trying to figure out how likely it is for a certain sentence to be said. A language model is able to do this because during training it is given a sequence of words and tries to predict the next word using the representation it has learned. This lets the model obtain ‘meanings’ of words through the context they are used in, allowing it to determine how probable it is for each word to occur. The literature on language modeling is pretty diverse, so I will only highlight a few concepts relevant to my project. However, if you are interested in learning more, check out this course by the University of Toronto.
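One compact way to write this down is the standard chain-rule factorization: the probability of a whole sentence is the product of next-word probabilities, each conditioned on everything that came before.

```latex
P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
```

Each factor is exactly the “predict the next word from its context” task described above.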

The first thing to note is the usage of Transformer architectures in language modeling. As I mentioned above, language models try to generate an output sequence given an input sequence, with longer input sequences providing more context and information for the model to make better predictions. Traditionally, architectures like RNNs, LSTMs, and GRUs have been used to capture long-range dependencies. All of these models employ recurrence, which is very computationally expensive and suffers from problems like vanishing gradients. Transformers revolutionized language modeling by using an attention mechanism and removing recurrence, meaning that the model processes the whole input at once and focuses on the most salient details.

Secondly, all state-of-the-art language models employ pre-training. Pre-training means that when you start training your model on your own task, it already has a preliminary representation of the language. The benefit is that during fine-tuning only a few parameters have to be learned from scratch to specialize to the downstream task, making training a lot less computationally expensive. This allows for much bigger and more complex models to be used.

Finally, it is important to note the widespread usage of unidirectional architectures. This intuitively makes sense: when a model is deployed it will only know the preceding context, so it seems natural for training to mirror that. However, the authors of BERT argue that this practice restricts the power of the pre-trained representations by limiting the choice of architectures that can be used during pre-training.

What is BERT?

BERT is a language model that was developed by Google in 2018. The base model boasts around 110 million parameters, and at the time it achieved new state-of-the-art results on 11 natural language processing tasks. BERT stands for Bidirectional Encoder Representations from Transformers. It sounds like a mouthful, but the first and last words are the only ones you really have to pay attention to.

Like most top-performing language models, BERT uses the transformer architecture and performs pre-training, the benefits of which I discuss above. The major difference, and reason for its success, is that it learns deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context. The authors illustrate that this allows the model to learn a better representation of the corpus trained on than a unidirectional method.

BERT alleviates the unidirectional constraint by using a masked language modeling objective. Essentially, during training roughly 15% of tokens are selected and (mostly) replaced with a mask token (i.e. [MASK]), which the model then tries to fill in. As discussed above, language models ‘understand’ words by the company they keep (i.e. the surrounding words), and the masked language modeling objective allows the model to use both the left and right contexts of words.

Hugging Face and German BERT

You might have been wondering how I am going to be using BERT in German if the model was pre-trained by Google in English. Luckily, Google open sourced their code, and German NLP researchers have pre-trained BERT models on German corpora that they claim perform as well as Google’s English BERT. These models have been published on Hugging Face, a company that seeks to democratize artificial intelligence through open source. Additionally, Hugging Face provides a Python library, which I will be using for this project, that greatly simplifies the process of working with these complex transformer models by providing functions to preprocess your data and train your model. If you are interested in working with transformers, I highly recommend that you check out their website.

BERT has two different variations, uncased and cased, that I need to choose from. The cased version treats capitalized words differently from uncapitalized ones, whereas the uncased version doesn’t. Although it has been concluded that the uncased model performs better in English, the situation in German is a bit more ambiguous because of the language’s greater use of capitalization. Because of this, I will be experimenting with both the cased and uncased versions of the German BERT model to see which performs better. Additionally, while there are several different German BERT models, this paper demonstrates that the models created by DBMDZ performed the best across the board.
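As a quick way to compare the two variants, I plan to run the same cloze statement through both checkpoints, something like the sketch below (the checkpoint names are the DBMDZ models as I believe they appear on Hugging Face, and the sentence is just an example):

```python
# Compare the cased and uncased DBMDZ German BERT checkpoints on a single cloze statement.
from transformers import pipeline

CLOZE = "Ich denke, dass Impfungen [MASK] sind."  # "I think that vaccinations are [MASK]."

for checkpoint in ["dbmdz/bert-base-german-uncased", "dbmdz/bert-base-german-cased"]:
    fill_mask = pipeline("fill-mask", model=checkpoint)
    print(checkpoint)
    for prediction in fill_mask(CLOZE):
        print(f"  {prediction['token_str']:<15} {prediction['score']:.3f}")
```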

Tune in for my next post where I talk more about finetuning!

Benjamin Pusch

The Project

1/9/22


Welcome back to the Odyssey and Happy New Year! In today’s post I will go in more depth about the methodology used in KhudaBukhsh’s paper and discuss some potential difficulties of my research.


What will I be doing?

As I mentioned in one of my earlier posts, I am researching the change in community perception of vaccinations in Germany during the pandemic as well as the sexist perceptions of female politicians such as Annalena Baerbock. That is a pretty vague statement though, so let me go a bit more in depth.

The Paper

In his paper, KhudaBukhsh uses a BERT[1] model to fill in cloze statements in order to mine the perception of topics related to Indian politics. Cloze statements are essentially sentences where one of the words has been removed. The language model (in my case BERT) that has been trained on the dataset then fills in what it thinks the removed word should be. For example, the model might be fed a phrase like “Apples are [MASK]”, and it would then use the representation it has learned to fill in a word that it thinks makes sense.

Depending on the data used, that word could be ‘fruit’, if the model was trained on a dictionary, or ‘delicious’, if the model was trained on social media comments reflecting people’s opinions. By using this method, KhudaBukhsh shows that he is able to essentially ask the Indian population what their aggregate opinion is on certain topics. For example, for the phrases “Hindus are [MASK]” and “Muslims are [MASK]”, his model returned the results below. They illustrate that there is a very polarized political landscape in India, with both religions being viewed very negatively.

[1] I will go into more depth about how BERT works and what a language model is in one of the upcoming posts

My Project

I will be using the same technique of filling in cloze statements using BERT to track the perception of vaccines and female politicians in Germany. For the Covid and vaccinations part of the project, I am interested in how the perception of the pandemic and the vaccine have changed over the course of the last two years.

To do this, I will train a BERT model on social media comments discussing Covid from each month, starting in January 2020 up until January 2022 (24 different models in total). I will feed each model the same cloze statements, like “I think that covid is [MASK]”, and the model will then use the representation it has learned to fill in the mask. This will capture the aggregate opinion of the German population during that month, and by comparing the probability of the filled-in word for each phrase across months, I will not only quantitatively demonstrate how the perception of Covid and vaccines has changed over the course of the pandemic, but also how the perception of other contentious issues not directly related to Covid has been impacted by it.
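Once the monthly checkpoints exist, the comparison itself should be fairly mechanical. A hypothetical sketch (the per-month model directories, the cloze statement, and the target word are all placeholders):

```python
# Track the probability of one completion for the same cloze statement across monthly models.
from transformers import pipeline

CLOZE = "Ich denke, dass Covid [MASK] ist."   # "I think that Covid is [MASK]."
TARGET = "gefährlich"                          # "dangerous"
MONTHS = ["2020-01", "2020-02", "2020-03"]     # ... through the final month scraped

for month in MONTHS:
    fill_mask = pipeline("fill-mask", model=f"models/bert-covid-{month}")  # hypothetical paths
    # Restricting the pipeline to one target word returns that word's probability directly.
    score = fill_mask(CLOZE, targets=[TARGET])[0]["score"]
    print(month, f"{score:.4f}")
```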

My analysis of the perceptions of female politicians will be more similar to KhudaBukhsh’s example above. I will compare comments about Annalena Baerbock, who has received misogynistic criticism for having two very young children whilst in office, to equivalent male counterparts to illustrate the sexism present in German politics. I will also feed the model statements specifically about the election to illustrate the impact her gender had on her chances of winning votes.

The Plan

There are 3 main components to my project: data collection, data cleaning/preprocessing, and model fine-tuning.

  • Data Collection

    • In order for my model to make predictions that reflect the German population’s perception of these topics, I need data in which Germans express their opinions on these issues. To obtain it, I will be scraping social media comments and using them to fine-tune the BERT model. I have never scraped data from social media, so I will need to learn how to use the available APIs as well as their limitations.

  • Data Cleaning/preprocessing

    • Social media comments are inherently messy and as a result will require preprocessing and cleaning to make them viable. Additionally, I need to make sure that the comments actually discuss the topic at hand, since otherwise the model will not learn a significant representation. This will require me to make use of the literature in the field and convert that theory into code.

  • Model Finetuning

    • This is the actual fun part, but also probably the most difficult. Massive language models like BERT are not meant to be built from scratch. Rather, I will use an existing implementation that has already been pretrained on generic data. My job is to fine-tune the model on the data I have collected. In order to do this I will need to get a better understanding of transformers and learn how to use the HuggingFace Transformers library, which has the Python implementation of BERT that I will be using (a rough fine-tuning sketch follows this list).
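To make that last bullet a bit more concrete, here is a rough sketch of what fine-tuning one monthly model might look like with the Transformers and Datasets libraries. The checkpoint name is the DBMDZ German BERT as I believe it appears on Hugging Face, the input file of one cleaned tweet per line is hypothetical, and my real training script will almost certainly look different.

```python
# Masked-language-model fine-tuning of German BERT on one month of scraped tweets (sketch).
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

CHECKPOINT = "dbmdz/bert-base-german-uncased"   # assumed Hugging Face model name
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForMaskedLM.from_pretrained(CHECKPOINT)

# One cleaned tweet per line (hypothetical file name).
dataset = load_dataset("text", data_files={"train": "tweets_2020_01.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# The collator randomly masks 15% of tokens, reproducing BERT's masked-LM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(output_dir="bert-covid-2020-01",
                         num_train_epochs=1,
                         per_device_train_batch_size=16)

trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"], data_collator=collator)
trainer.train()
```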

Potential Difficulties

Given that this is my first time using a state of the art language model, let alone conducting research with one, I am expecting to run into some issues along the way. In my previous projects I built neural networks from scratch and as a result knew the architecture inside and out, which allowed me to easily figure out any error that came up. In contrast, BERT is a very complex and opaque model that I didn’t build, making it a lot more difficult to understand errors.

The model I will be using also isn’t the original BERT model that was used by KhudaBukhsh; rather, it is a version of that model that was pre-trained on German text. The inability to directly compare the performance of these models means that there is a chance the model I use is unable to learn a representation of the text that is equivalent, in terms of accuracy, to the one KhudaBukhsh achieved.

Essentially, I am throwing myself into the deep end. But that’s the fun in it. I am looking forward to learning a lot of crucial skills that are necessary for my progression as a researcher. Further, by being in this inexperienced position, I hope that you as a reader will find it useful to learn alongside me as I conduct my research.


Benjamin Pusch

What’s going on in Germany?

1/5/22


Welcome back to the Odyssey! Today I wanted to go in more depth about the actual problems I will be analyzing.


Covid-19 and Antivaxxers

In March 2020, the German government introduced restrictions on free movement and social/physical distancing provisions in response to the COVID-19 pandemic. This was the first time ever that such strict measures affecting individual freedoms were implemented in the country. At that time, these measures received wide support across the German public, and the fact that Germany fared relatively well with low infection rates and deaths during the first wave (compared to many of its European neighbors) was widely attributed to these measures. This support broadly continued throughout the second, third, and fourth COVID waves.

Starting in April 2020, a loud minority created the ‘Querdenker’ (lateral thinkers) movement, which claims that the COVID-19 pandemic response, namely the federal and regional laws to contain the spread of the virus, infringes on citizens' liberties.

Protests started in the southern part of Germany but quickly spread throughout the country and often turned violent, with police officers being injured. Protesters turned out even when demonstrations weren’t permitted, social distancing rules were not followed, and protesters didn’t wear face masks. Many of them believed in conspiracy theories and claimed that the government was hiding the truth about the pandemic.

Vaccines became available in spring 2021, but vaccination rates remained low in late 2021 and the government considered a vaccine mandate. This brought new energy to the Querdenker movement’s anti-vaxxers (‘Impfgegner’) and led to renewed protests. When new emergency regulations kicked in because of rising infection rates in some parts of the country, protesters gathered in what they called ‘Spaziergaenge’ (strolls). The atmosphere at these strolls is just as tense and heated as at official demonstrations. I want to see how this loud minority has impacted the general population’s perception of COVID and vaccines.

Gender in Politics

Despite the fact that Germany had a female chancellor, Angela Merkel, for 16 years, the representation of women in German politics is very low. Currently, only about one third (34.8%) of the Bundestag (German Parliament) are women, a slight increase from 30.7% in the previous Bundestag (2017-2021). In comparison, many other European countries have much higher percentages (Spain: 43.3%; France: 41.6%; Denmark: 41.3%; Finland: 46%; Sweden: 46.6%). You can find out more here.

Even worse, of the “Ministerpraesidenten” who head the 16 German Laender (States), only 3 are women. However, under the new government of Chancellor Scholz, Germany, for the first time, has the same number of male (8) and female (8) ministers.

Annalena Baerbock is the first female German Foreign Minister, and at 41 also the youngest. Since the beginning of the election campaign, in which she ran as her party’s candidate for chancellor, she has been under enormous scrutiny and has had to deal with what seemed like an overwhelming amount of sexist commentary.

Women politicians continue to be at a disadvantage in the way they are covered by the media because of gendered reporting. This ranges from how a candidate’s background is viewed to the mention of gender, family, leadership, physical appearance, qualifications, etc. I want to see how language models can allow me to quantifiably analyze the difference between the perception of male and female politicians, and how this divide hurt Annalena Baerbock’s chances of winning the election for German Chancellor.

Benjamin Pusch

Happy Holidays!

1/1/22


Welcome back to the Odyssey! Today I wanted to talk about the specific project I have decided to work on, how I got to that decision, and the beauty of AI’s interdisciplinary nature.


I hope you’re enjoying your winter holidays. I had a fantastic time celebrating Christmas with my extended family here in Germany. It’s the first time that I have been able to see them since 2019 (thanks, Covid), which made the celebration even more special. Covid will actually play a significant role in today’s post, because a particularly fiery discussion at dinner a couple of days ago is one of the reasons why I am changing my original plan for this project.

Originally, I wanted to expand on KhudaBukhsh’s paper by researching how the growth of Telegram, an encrypted social media platform similar to WhatsApp, has led to increased polarization in Brazil ahead of their elections this coming October. While I still want to work on this project, after looking into it I realized that I would need a stepping stone to get there. My lack of familiarity with Brazilian Portuguese and Brazilian culture, coupled with my lack of experience with machine translation models, makes it a difficult project to start with. I plan on starting the Brazilian election project this summer, so stay tuned!

As an alternative, I will be researching the community perception of contentious issues in Germany. I will be using another one of Ashiqur R. KhudaBukhsh’s papers, namely Mining Insights from Large-scale Corpora Using Fine-tuned Language Models. The overarching focus is the same: to use AI, in this case language models, as a scalpel with which to study social behavior and societal issues. Specifically, I will be using Google’s BERT language model to track how Covid has impacted the perception of vaccines in Germany, as well as sexist differences in how German politicians are perceived. Since I am culturally German and fluent in the language, this project provides a better entrance into research than my previous idea. I will go into more depth about the paper and the specific project in my next post. Stay tuned!

You might be wondering why I am focusing on a different language at all as I could just as easily have done a similar project in English, where I would have had the added benefit of being able to access a lot more resources. There are two main reasons why I chose Germany:

The first reason is personal. While I am culturally German, I have never had the chance to intertwine my culture and heritage with my academic pursuits. Despite how important being German is to me, I rarely ever engage with German issues as most, if not all, of my information about these problems comes through conversations with my parents. The other night, I was embarrassed by my lack of knowledge about German issues when an argument broke out about Covid and Annalena Baerbock, a German politician, and I had no points to argue.

I knew then that my research was the perfect opportunity to study these issues, allowing me to further my passion whilst exploring a cornerstone of my identity. This interdisciplinary nature is something that I honestly find so incredibly beautiful about AI because it allows you to simultaneously pursue diverse interests. No matter what your passion is, I guarantee you that there is some way or another AI can be intertwined with that passion. Also, this flexibility allows you to work with people who might have completely different interests, making it one of the most collaborative and fascinating areas to work in.

The second reason leans more to the ethical side. I want to focus on a language other than English due to the linguistic inequality in Natural Language Processing. I have discussed the ethical challenges that the AI field faces before, but one of the problems I haven’t touched on yet is the lack of diversity in the development of AI systems. Specifically, I am going to focus on the fact that research done in AI is concentrated in rich countries (North America, Western Europe, and East Asia).

While this can be said of academic research in any field, the impact of concentrated development in machine learning and AI is especially devastating due to the field’s data-centric approach. The massive amounts of data these models require tend to be at least somewhat unique to the country they were created in, making it infeasible for less advanced countries to implement these models due to cultural, societal, and physical differences.

Take for example autonomous driving, where in recent years Tesla has outperformed competitors due to the vast amount of data they have available to them. However, if a company in another country, say Colombia, wants to train their own autonomous vehicles, Tesla’s data would not be sufficient and might lead to dangerous errors due to the differences in the appearance of roads, street signs, and the natural environment.

It can easily be argued that this problem is the most severe in Natural Language Processing and Language Modeling. Stark linguistic differences make it difficult to create standardized benchmarks across languages, preventing researchers from accurately evaluating how model architectures trained in different languages compare to each other. This causes researchers to focus on the language that is the most widely studied (English), incentivizing the majority of datasets and future benchmarks to be created in that language and creating unequal access to a technology that has immense potential to improve our society. In case you’re curious, this journal article discusses the issue in more depth.

Tune in for the next post, where I will discuss Covid and the sexist perceptions of politicians in Germany in more depth!
