Scraping Tweets
1/27/22
Welcome back to the Odyssey! I am still waiting to hear back from Twitter about the higher access levels, so in the meantime I thought I would use this post to go into more depth about my Twitter scraping.
I’ve decided to start with just the Covid vaccine focus of my project for now, since the politician aspect is a bit more complex. To scrape all of the tweets that talk about Covid, I have created a set of keywords, at least one of which must be present in a tweet (image to the right). In addition, I am using Twitter’s language filter so that only tweets in German are scraped. Finally, I will be using Twitter’s time and date parameters to split the scraped data into one dataset per month.
The actual query that will be fed into the exhaustive search endpoint function is:
“Impfung OR Impfpflicht OR Querdenker OR Omicron OR Delta OR Corona OR covid OR Maskenpflicht OR (Verschwörung AND Corona) OR (Spaziergang AND Corona) lang:de until: ‘date2’ since: ‘date1’ “
In English:
“Vaccine OR Vaccine Mandate OR Lateral Thinker OR Omicron OR Delta OR Corona OR Covid OR Mask Mandate OR (Conspiracy AND Corona) OR (Stroll AND Corona) lang:de until: ‘date2’ since: ‘date1’ “
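As a rough illustration, here is a minimal Python sketch of how the query string above could be assembled for each month. The function names `month_windows` and `build_query` are my own placeholders, not part of any Twitter library, and the `since:`/`until:` syntax simply mirrors the query as written in this post. Depending on the endpoint version, the dates may instead need to be passed as separate request parameters.

```python
from datetime import date

# Keywords taken from the query in this post.
KEYWORDS = [
    "Impfung", "Impfpflicht", "Querdenker", "Omicron", "Delta",
    "Corona", "covid", "Maskenpflicht",
    "(Verschwörung AND Corona)", "(Spaziergang AND Corona)",
]


def month_windows(start, n_months):
    """Yield (since, until) date pairs, one per calendar month.

    `until` is the first day of the following month.
    """
    year, month = start.year, start.month
    for _ in range(n_months):
        since = date(year, month, 1)
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
        until = date(year, month, 1)
        yield since, until


def build_query(since, until, keywords=KEYWORDS):
    """Join the keywords with OR and append the language and date filters."""
    terms = " OR ".join(keywords)
    return f"{terms} lang:de until:{until.isoformat()} since:{since.isoformat()}"


if __name__ == "__main__":
    # Hypothetical date range, chosen only for illustration.
    for since, until in month_windows(date(2021, 11, 1), 3):
        print(build_query(since, until))
```

Each printed line is one month's query, so each month can be scraped and stored as its own dataset.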
Some example results from the query are shown below.
Potential Failure Points
Since this is my first time scraping tweets, it is important to acknowledge some potential failure points that I might run into:
Effectiveness of Twitter
Do the tweets actually capture the opinion of the population? By using Twitter, I am assuming that the tweets I scrape provide a sample that accurately represents the whole population. While the amount of literature using Twitter for this purpose is a pretty good reason to believe that this is in fact the case, there is definitely a possibility that the people who frequently use Twitter are just a small, but particularly outspoken, subset of the population. It will be difficult to confirm or reject my assumption because I can’t detect this error by simply looking at the data.
Reflection of German Opinion
I need to filter the tweets so that they only reflect the opinions of Germans. I will be using the language parameter in Twitter’s search query to do this. However, by doing so I am assuming that a tweet written in German implies that its author is culturally German, which isn’t necessarily true. To account for this, I will sample some tweets from the scrape that have location tags. Going off KhudaBukhsh’s paper, as long as 70% or more of the tweets are from Germany, the robustness of the BERT model will ensure that the representation learned reflects the opinions of Germans. The reason I am not filtering by location tags to begin with is that they are optional and most users don’t enable them. Hence, I wouldn’t be able to scrape enough tweets to train the models.
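The location check described above could look something like the following sketch. It assumes each scraped tweet is a Python dict and that tagged tweets carry a `country_code` field; that field name, the function name `german_share`, and the default sample size are assumptions made for illustration, so the real field will depend on how the scraped data is stored.

```python
import random

THRESHOLD = 0.70  # the 70% cutoff mentioned in this post


def german_share(tweets, sample_size=1000, seed=0):
    """Estimate the fraction of location-tagged tweets that come from Germany.

    Returns None when no tweet in the input has a location tag.
    """
    # Keep only tweets that actually have a location tag (hypothetical field).
    tagged = [t for t in tweets if t.get("country_code")]
    if not tagged:
        return None
    rng = random.Random(seed)  # fixed seed so the check is reproducible
    sample = rng.sample(tagged, min(sample_size, len(tagged)))
    from_germany = sum(t["country_code"] == "DE" for t in sample)
    return from_germany / len(sample)


if __name__ == "__main__":
    # Tiny made-up input, only to show the call; not real data.
    toy = [{"country_code": "DE"}, {"country_code": "AT"}, {"text": "no tag"}]
    share = german_share(toy)
    print(share, share is not None and share >= THRESHOLD)
```

Tweets without a tag are ignored rather than counted as non-German, so the estimate only describes the tagged subset.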
Time and Storage Issues
I have no idea how many tweets will be scraped per month. Since Covid has been a major talking point for the last two years, it is likely that this number is in the millions. This presents two problems. First, I could run out of storage on my computer due to the volume of tweets. The bigger concern, though, is the time it will take to scrape the tweets and train the model. Depending on how fast my computer can pull the tweets, scraping could take upwards of one week. Further, the access level of my Twitter App will determine the maximum number of tweets I can scrape, which might be well below the number of tweets that I need. Fine-tuning the models on this quantity of tweets is another problem because training BERT is so time-consuming. To address this, I will take a random sample of equal size for every month, which standardizes the number of tweets used and reduces the fine-tuning time. I am also going to estimate the overall number of tweets that I will be scraping by testing the API on only one month’s worth of data first, which will let me evaluate the total time and storage needed.
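Both ideas can be sketched in a few lines of Python. The names `equal_monthly_sample` and `extrapolate` are hypothetical helpers, and the extrapolation assumes every month looks roughly like the pilot month, which is a simplification since tweet volume will likely vary from month to month.

```python
import random


def equal_monthly_sample(tweets_by_month, k=None, seed=0):
    """Draw the same number of tweets at random from every month.

    If k is None or larger than the smallest month, the size of the
    smallest month is used so that every month can supply k tweets.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    smallest = min(len(tweets) for tweets in tweets_by_month.values())
    k = smallest if k is None else min(k, smallest)
    return {month: rng.sample(tweets, k) for month, tweets in tweets_by_month.items()}


def extrapolate(pilot_tweets, pilot_bytes, pilot_seconds, n_months):
    """Scale one pilot month linearly to n_months (a rough estimate only)."""
    return {
        "tweets": pilot_tweets * n_months,
        "gigabytes": pilot_bytes * n_months / 1e9,
        "hours": pilot_seconds * n_months / 3600,
    }


if __name__ == "__main__":
    # Placeholder numbers, not measurements.
    print(extrapolate(pilot_tweets=1_000_000, pilot_bytes=2_000_000_000,
                      pilot_seconds=7200, n_months=24))
```

Capping `k` at the smallest month keeps the monthly samples equal in size even if one month turns out to have far fewer tweets than the others.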
Stay tuned for the next update!