The Project

1/9/22


Welcome back to the Odyssey and Happy New Year! In today’s post I will go into more depth about the methodology used in KhudaBukhsh’s paper and discuss some potential difficulties of my research.


What will I be doing?

As I mentioned in one of my earlier posts, I am researching the change in community perception of vaccinations in Germany during the pandemic, as well as the sexist perceptions of female politicians such as Annalena Baerbock. That is a pretty vague statement, though, so let me go into a bit more depth.

The Paper

In his paper, KhudaBukhsh uses a BERT[1] model to fill in cloze statements in order to mine perceptions of topics related to Indian politics. Cloze statements are essentially sentences where one of the words has been removed. A language model, in my case BERT, that has been trained on the data set then fills in what it thinks the removed word should be. For example, the model might be fed a phrase like “Apples are [MASK]”, and it would then use the representation it has learned to fill in a word that it thinks makes sense.
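To make this concrete, here is a minimal sketch of what filling in a cloze statement looks like with the HuggingFace Transformers library. The checkpoint name and the example sentence are just illustrative placeholders, not the setup I will actually use:

```python
from transformers import pipeline

# Load a pretrained masked language model as a fill-mask pipeline.
# "bert-base-uncased" is only an illustrative choice of checkpoint.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The pipeline returns the most likely tokens for the [MASK] position,
# each with the probability the model assigns to it.
for prediction in fill_mask("Apples are [MASK]."):
    print(f'{prediction["token_str"]:>12}  {prediction["score"]:.3f}')
```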

Depending on the data used, that word could be ‘fruit’, if the model was trained on a dictionary, or ‘delicious’, if the model was trained on social media comments reflecting people’s opinions. Using this method, KhudaBukhsh shows that he is essentially able to ask the Indian population for its aggregate opinion on certain topics. For example, for the phrases “Hindus are [MASK]” and “Muslims are [MASK]”, his model returned the results below. They illustrate that the political landscape in India is very polarized, with both religions being viewed very negatively.

[1] I will go into more depth about how BERT works and what a language model is in one of the upcoming posts.

My Project

I will be using the same technique of filling in cloze statements with BERT to track the perception of vaccines and female politicians in Germany. For the Covid and vaccination part of the project, I am interested in how the perception of the pandemic and the vaccine has changed over the course of the last two years.

To do this, I will train a BERT model on social media comments discussing Covid from each month, starting in January 2020 and running up until January 2022 (24 different models in total). I will feed each model the same cloze statements, such as “I think that covid is [MASK]”, and each model will then use the representation it has learned to fill in the mask. This captures the aggregate opinion of the German population during that month. By comparing the probability of the filled-in word for each phrase across the months, I can not only quantitatively demonstrate how the perception of Covid and vaccines has changed over the course of the pandemic, but also how the perception of other contentious issues not directly related to Covid has been affected by it.
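As a rough sketch of how that month-by-month comparison could work, the snippet below scores the same target word under each monthly model and tracks its probability over time. The model names, the German cloze statement, and the target word are all hypothetical placeholders, and the sketch assumes the target word is a single token in the model’s vocabulary:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM


def mask_probability(model_name: str, template: str, target_word: str) -> float:
    """Probability the model assigns to `target_word` at the masked position.

    `template` must contain a single "{mask}" placeholder.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    sentence = template.format(mask=tokenizer.mask_token)
    inputs = tokenizer(sentence, return_tensors="pt")
    # Find the position of the [MASK] token in the input.
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits[0, mask_pos].softmax(dim=-1)
    # Assumes the target word maps to a single vocabulary entry.
    target_id = tokenizer.convert_tokens_to_ids(target_word)
    return probs[0, target_id].item()


# Hypothetical names for the monthly fine-tuned checkpoints.
monthly_models = ["covid-bert-2020-01", "covid-bert-2020-02"]  # ... up to 2022-01
for name in monthly_models:
    p = mask_probability(name, "Ich denke, dass Covid {mask} ist.", "gefährlich")
    print(name, p)
```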

My analysis of the perceptions of female politicians will be more similar to KhudaBukhsh’s example above. I will compare comments about Annalena Baerbock, who has received misogynistic criticism for having two very young children whilst in office, with comments about equivalent male counterparts in order to illustrate the sexism present in German politics. I will also feed the model statements specifically about the election to illustrate the impact her gender had on her chances of winning votes.

The Plan

There are three main components to my project: data collection, data cleaning/preprocessing, and model fine-tuning.

  • Data Collection

    • In order for my model to make predictions that reflect the German population’s perception of these topics, I need data in which Germans express their opinions on these issues. To obtain it, I will scrape social media comments and use them to fine-tune the BERT model. I have never scraped data from social media before, so I will need to learn how to use the available APIs, as well as their limitations.

  • Data Cleaning/preprocessing

    • Social media comments are inherently messy, and as a result they will require preprocessing and cleaning to make them usable. Additionally, I need to make sure that the comments actually discuss the topic at hand, since otherwise the model will not learn a meaningful representation. This will require me to draw on the literature in the field and convert that theory into code.

  • Model Finetuning

    • This is the fun part, but probably also the most difficult. Massive language models like BERT are not meant to be built from scratch. Rather, I will use an existing implementation that has already been pretrained on generic data, and my job is to fine-tune that model on the data I have collected. To do this I will need to get a better understanding of transformers and learn how to use the HuggingFace Transformers library, which has the Python implementation of BERT that I will be using. (A rough sketch of what this step might look like follows this list.)
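Here is that sketch: a minimal masked-language-model fine-tuning loop with the Transformers library. The checkpoint name, the comments file, and the hyperparameters are assumptions for illustration, not my final setup.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# A plausible German BERT checkpoint; the final choice of model may differ.
checkpoint = "bert-base-german-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Hypothetical file with one cleaned comment per line for a given month.
dataset = load_dataset("text", data_files={"train": "comments_2020_01.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# The collator randomly masks tokens, which is the standard MLM training objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(output_dir="covid-bert-2020-01",
                         num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"], data_collator=collator)
trainer.train()
```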

Potential Difficulties

Given that this is my first time using a state-of-the-art language model, let alone conducting research with one, I am expecting to run into some issues along the way. In my previous projects I built neural networks from scratch and as a result knew their architecture inside and out, which allowed me to easily track down any error that came up. In contrast, BERT is a very complex and opaque model that I didn’t build, which will make errors a lot more difficult to understand.

The model I will be using also isn’t the original BERT model used by KhudaBukhsh; rather, it is a version of BERT that was pre-trained on German text. Since I cannot directly compare the performance of these models, there is a chance that the model I use will not achieve a representation of the text that is as accurate as the one KhudaBukhsh achieved.

Essentially, I am throwing myself into the deep end. But that’s the fun of it. I am looking forward to learning a lot of skills that are crucial to my progression as a researcher. Further, since I am starting from this inexperienced position, I hope that you as a reader will find it useful to learn along with me as I conduct my research.

