Update on New Scrape

5/10/22


Welcome back to the Odyssey! Today I have a short update on the new scrape of the Twitter data.


Updates

As I mentioned in my last post, I am now scraping the replies to tweets from news organizations rather than simply scraping all tweets that contain certain keywords. The problem is that snscrape (the scraping library I am using) doesn’t have any built-in functionality for scraping replies to tweets. In fact, the only efficient way to obtain the replies to a tweet is through Twitter’s API, which, as I have previously discussed, I will not be able to use because of its restrictions.

There are two main ways that I can try to scrape the replies. In the first approach (below), I iterate through all of the tweets posted by one of the news organizations, and for each of those tweets I iterate through its replies using snscrape’s recursive function. However, the recursive function does not scrape nested replies, so some tweets will be left out. The bigger issue with this approach, which I found out the hard way, is the amount of time it takes. In an initial test, scraping one week’s worth of tweets produced roughly 40,000 tweets and took around 2 hours. At that rate, 2 years’ worth of data (about 104 weeks at 2 hours each) means upwards of 200 hours of scraping.

The second approach, which I am testing as I write this post, is to use Twitter’s search functionality and scrape all tweets that were posted in the time frame I am interested in and that mention one of the news accounts.

The benefit of the first method is that I can control when the news organizations’ tweets were posted, so the replies will be reactions to some event that happened during the time frame. The second approach doesn’t have this feature: I can only control the date on which the reply itself was posted. In theory, my data could therefore be full of tweets reacting to some statistic, policy, or event from a different time frame, which would defeat the purpose of the experiment. I will test the data rigorously to check whether this actually happens.
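The runtime figure for the first approach can be checked with a quick back-of-the-envelope calculation. This is only a rough sketch: it assumes the one-week test is representative and that both the tweet volume and the scraping time grow linearly with the length of the window, which may not hold in practice. The function name `estimate_scrape` is made up for this illustration.

```python
# Figures from the one-week test run described above.
TWEETS_PER_WEEK = 40_000
HOURS_PER_WEEK_SCRAPED = 2
WEEKS_PER_YEAR = 52


def estimate_scrape(years):
    """Linearly extrapolate the one-week test to a window of `years` years.

    Returns (estimated number of tweets, estimated scraping hours).
    Assumption: volume and runtime scale linearly with the window length.
    """
    weeks = years * WEEKS_PER_YEAR
    return round(weeks * TWEETS_PER_WEEK), weeks * HOURS_PER_WEEK_SCRAPED


tweets, hours = estimate_scrape(2)
print(f"{tweets:,} tweets, about {hours:.0f} hours")
# prints "4,160,000 tweets, about 208 hours"
```

Under these assumptions, two years come out at a little over 200 hours, which matches the estimate above.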

Stay tuned!


Code and News Organizations

News Organizations

  • Frankfurter Allgemeine Zeitung

  • Sueddeutsche Zeitung

  • Die Welt

  • Handelsblatt

  • Der Spiegel

  • Die Zeit

  • Bild

  • Tagesschau

  • ZDFheute Nachrichten

  • n-tv

  • Deutsche Welle (DW)

Code
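For the second approach, here is a minimal sketch of how the search queries could be put together. It is an illustration and not the exact script I am running: the helper `mention_query`, the handles, and the dates are all placeholders. It relies on Twitter’s `since:` and `until:` search operators, where, as far as I know, `until:` excludes the given day.

```python
from datetime import date


def mention_query(handle, since, until):
    """Build a Twitter search query for tweets that mention `handle`.

    Uses the `since:` and `until:` search operators with ISO dates.
    Hypothetical helper written for this illustration.
    """
    handle = handle.lstrip("@")
    return f"@{handle} since:{since.isoformat()} until:{until.isoformat()}"


# Placeholder handles, NOT the real account names of the news organizations.
HANDLES = ["example_news_a", "example_news_b"]

# Placeholder two-year window.
START = date(2020, 3, 1)
END = date(2022, 3, 1)

queries = [mention_query(h, START, END) for h in HANDLES]
for q in queries:
    print(q)
# prints:
# @example_news_a since:2020-03-01 until:2022-03-01
# @example_news_b since:2020-03-01 until:2022-03-01
```

Each query string could then be handed to snscrape’s `TwitterSearchScraper`. Keeping one query per account makes it easy to rerun a single account if a scrape gets interrupted.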
