Servers and Remote Control

2/23/22


Welcome back to the Odyssey! Today I want to share the environment I am using to scrape my tweets.


Scraping takes a long time. If you read journal articles where the authors had to build a dataset of tweets, you will often see that the collection alone took their program multiple days to run.

This is a problem. As someone who uses their computer every day for college work, I can’t just leave a program running in the background for several days. It would kill my battery, and I wouldn’t be able to close my laptop. Luckily, there are several alternatives, which I will discuss in this post.

Google Colaboratory

“Colaboratory, or ‘Colab’ for short, is a product from Google Research. Colab allows anybody to write and execute arbitrary python code through the browser, and is especially well suited to machine learning, data analysis and education. More technically, Colab is a hosted Jupyter notebook service that requires no setup to use, while providing access free of charge to computing resources including GPUs.” - Google Research

I use Colab pretty frequently because of its flexibility. By keeping all of my Python files in my Drive, I can access them on any computer and don’t have to worry about the hardware. Colab also allows you to have multiple sessions open at once, which is useful if you are running longer programs (such as a scraper) and want to work on something else in the meantime. The main downside of Colab is that resources are not guaranteed, which means you might not get the best CPUs or GPUs available every time you connect to Google’s servers. A second downside is that the maximum runtime for a Colab notebook is 12 hours. This makes it annoying to use for scraping, which can often last much longer than 12 hours. You can upgrade your Colab account to get longer runtimes, but that costs money.

Despite the runtime issue, I actually tried using Colab as my environment for this project; however, I ran into an issue unique to the snscrape library. Snscrape requires Python 3.8, but Colab currently only supports up to Python 3.6. So if you were planning on using snscrape on Colab, you will have to resort to one of the alternatives.
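If you want to catch this mismatch before kicking off a long run, you can check the interpreter version up front. A minimal sketch — `check_snscrape_compat` is just a name I made up, not part of snscrape itself:

```python
import sys

def check_snscrape_compat(version_info=sys.version_info):
    """snscrape needs Python 3.8+; return True if this interpreter qualifies."""
    return tuple(version_info[:2]) >= (3, 8)

# Fail fast with a clear message instead of a cryptic
# ImportError from somewhere deep inside the library.
if not check_snscrape_compat():
    raise RuntimeError("snscrape requires Python 3.8 or newer")
```

On Colab today this would raise immediately, which is exactly the point: better one clear error than a half-finished scrape.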


Cloud Computing

Cloud computing is another excellent way to scrape data. Companies that offer cloud computing infrastructure allow you to rent GPUs or CPUs, which in my case could then be used to scrape tweets or train my language model. The major downside is that these servers cost money, which I am not willing to spend. However, if you are interested in using a cloud server, I would recommend taking a look at one of the following:

Amazon’s AWS

Linode

Remote Control

Remote desktop control is a very unconventional way of getting around the problem of tying up your own computer with computationally intensive tasks. Remote control software lets you operate another computer through your own. I am currently living in Dublin, but I have a desktop back in the US that no one is using. Basically, since Colab didn’t work and I don’t want to pay for a cloud server, I thought it would be fun to mine tweets on that machine in the US. The remote control software that I am using is free, and you can download it at the website below:

TeamViewer

P.S. The next post will be about the scraping results, so stay tuned!
