Fine Tuning

1/16/22


Welcome back to the Odyssey! Today I will be discussing exactly how I am going to fine-tune the German BERT model.


To study Germany’s perception of vaccines and female politicians, I need a dataset that captures the opinions of a group of Germans whose views are representative of the country as a whole. KhudaBukhsh faced a similar problem: he needed a dataset that would capture India’s perception of political entities. His approach was to scrape comments from national and regional Indian YouTube news channels and use them to fine-tune his model. While this method worked well for him, it won’t work for me for two main reasons:

  1. India’s young population uses social media as its primary source of news, and both regional and national news channels have millions of subscribers. In contrast, Germany has a much older population that primarily consumes news through TV or newspapers. As a result, a lot of major German news outlets either don’t have YouTube channels or have very low engagement. YouTube comments would therefore not provide enough data, and the data I could collect wouldn’t accurately represent the population’s views.

  2. In KhudaBukhsh’s study, around 70% of the videos scraped were about politics, so the model was able to get a good understanding of the political discourse. German news media’s YouTube content, on the other hand, focuses mainly on clickbait videos discussing celebrity gossip, as those are the only types of videos that get views. Since filtering videos by topic is a pretty challenging task, it makes a lot more sense to look for a different data source.

Twitter is an excellent alternative, as its search functionality allows me to filter for tweets that contain certain words or phrases. This lets me scrape tweets that express an opinion on the topics of interest and hence study the aggregate community perception of these issues. Twitter isn’t the only alternative, though. I could also try a hybrid approach with Facebook, where I scrape the replies to posts from news outlets, similar to KhudaBukhsh’s YouTube approach, but only use the posts that contain keywords relating to the topic of interest.
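To make the keyword idea a bit more concrete, here is a rough sketch of what the filtering could look like in Python. Nothing here is final: the keyword list, the function names `build_query` and `mentions_topic`, and the sample posts are all placeholders I made up for illustration. The query string assumes the Twitter API v2 search operators (`OR`, `lang:`, `-is:retweet`), which I still need to confirm once I get into the API itself.

```python
# Hypothetical keyword list for the vaccine topic (illustration only).
VACCINE_KEYWORDS = ["Impfung", "Impfstoff", "impfen"]


def build_query(keywords, lang="de"):
    """Build a search query string that matches any of the keywords.

    Assumes Twitter API v2 style operators: OR, lang:, -is:retweet.
    Multi-word phrases are wrapped in double quotes.
    """
    terms = [f'"{k}"' if " " in k else k for k in keywords]
    return f"({' OR '.join(terms)}) lang:{lang} -is:retweet"


def mentions_topic(text, keywords):
    """Return True if the text contains any keyword, ignoring case.

    Plain substring matching is used so that German compound words
    containing a keyword are also caught.
    """
    lowered = text.casefold()
    return any(k.casefold() in lowered for k in keywords)


# Query string for the Twitter route.
query = build_query(VACCINE_KEYWORDS)

# Keyword filter for the hybrid Facebook route, on made-up sample posts.
posts = ["Der neue Impfstoff ist da", "Promi-Klatsch der Woche"]
relevant = [p for p in posts if mentions_topic(p, VACCINE_KEYWORDS)]
```

The same `mentions_topic` check would work for scraped replies from any platform, so the keyword list only has to be maintained in one place.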

The next step moving forward is to learn about the Twitter API and teach myself to scrape tweets with it. Stay tuned!
