Regular Expressions

4/12/22


Welcome back to the Odyssey! Unfortunately, end-of-term assignments have kept me from finishing my Covid research. However, I have continued to use a lot of what I learned from that research in my assignments, and I am going to use this post to briefly talk about one tool that has proved incredibly useful.


What are Regular Expressions?

A regular expression (aka regex) is a sequence of characters that specifies a search pattern in text. Basically, instead of matching one string exactly, you can use regex to match a whole set of strings that you are interested in. Regex is close to universal: most major languages either have regex built in or let you import a library that provides it.
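To make that concrete, here is a minimal sketch using Python's built-in re module. The pattern and the sample strings are made up purely for illustration; the point is that one pattern describes a whole set of strings.

```python
import re

# Hypothetical pattern: the letters "sat" followed by one or more digits.
pattern = re.compile(r"sat\d+")

candidates = ["sat1", "sat42", "satellite", "sat", "SAT7"]

# fullmatch succeeds only when the entire string fits the pattern.
matches = [s for s in candidates if pattern.fullmatch(s)]
print(matches)  # ['sat1', 'sat42']
```

Note that "SAT7" is rejected because matching is case-sensitive unless you pass a flag such as re.IGNORECASE.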

Although the concept itself is pretty straightforward, the syntax can look pretty daunting. For example, this is one massive regular expression that I used to filter user input for a project I have been working on:

 
 

I am not going to talk about how to create regular expressions, but if you are interested in learning more, I suggest you check out regexr.com, which is the website that I used to learn about and test regular expressions.

How have I used Regex?

This project

As you may have noticed if you have been following my blog, I use regex a lot to clean data. DataFrame objects make this especially easy because the DataFrame apply function lets you run a lambda function over every value in a column. In my case, the lambda function uses sub (from Python's re module) to replace each string (a tweet) with the same tweet minus a specified pattern. For example, to remove links from my tweets I used the following expression:

clean_df['Tweet Text'] = clean_df['Tweet Text'].apply(lambda x: re.sub(r"(?:www\.)?[a-z]+\.com", '', x))  # remove .com links, with or without a leading "www."
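If you want to try this idea outside of a DataFrame, here is a small self-contained sketch. The pattern below is an assumption on my part and is broader than the one-liner above (it also catches http(s) URLs and a few other top-level domains), and the sample tweets are invented.

```python
import re

# Assumed pattern with three alternatives:
#   1. anything starting with http:// or https://
#   2. anything starting with "www."
#   3. a bare host name ending in .com, .org or .net, plus any trailing path
LINK_PATTERN = re.compile(
    r"https?://\S+|www\.\S+|\b[a-z0-9-]+\.(?:com|org|net)\b\S*",
    re.IGNORECASE,
)

def remove_links(text: str) -> str:
    """Return text with anything that looks like a link removed."""
    without_links = LINK_PATTERN.sub("", text)
    # Collapse the extra whitespace a removed link leaves behind.
    return re.sub(r"\s{2,}", " ", without_links).strip()

# Made-up sample tweets.
tweets = [
    "Stay safe! https://example.com/covid-info",
    "More at www.example.com today",
    "No links here",
]
cleaned = [remove_links(t) for t in tweets]
print(cleaned)
```

With a DataFrame, the same function could be passed straight to apply, for example clean_df['Tweet Text'].apply(remove_links).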

Orbit Visualization

Regex is also incredibly powerful if you want to filter user input. I am currently working on a data visualization project for one of my college classes, and in order to let the user query data, I needed to check whether their input matched the syntax that I had specified. For one part of the project, I created an interactive 3D orbit visualizer. Instead of writing a complicated function to check each separate case of my input, regex allowed me to validate user input in just a couple of lines. The regular expression I used is the same one that I displayed above:

 
 

So now my program only updates which orbits are displayed when I input a valid string that follows the rules described on the right.
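To give a rough idea of the approach, here is a minimal sketch of validating input with a single pattern. The query syntax, field names and operators below are invented for illustration and are not the actual rules from my visualizer.

```python
import re

# Hypothetical query syntax: <field> <operator> <number>, e.g. "altitude > 500".
QUERY_PATTERN = re.compile(
    r"""
    \s*(?P<field>altitude|inclination|period)   # what to filter on
    \s*(?P<op><=|>=|<|>|=)                      # comparison operator
    \s*(?P<value>\d+(?:\.\d+)?)\s*              # non-negative number
    """,
    re.VERBOSE,
)

def parse_query(user_input: str):
    """Return (field, op, value) for a valid query, or None otherwise."""
    match = QUERY_PATTERN.fullmatch(user_input)
    if match is None:
        return None
    return match["field"], match["op"], float(match["value"])

print(parse_query("altitude > 500"))       # ('altitude', '>', 500.0)
print(parse_query("period = 90 minutes"))  # None
```

The display would then only be refreshed when parse_query returns something other than None, so malformed input is simply ignored.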

 

In case you are interested, the orbit visualizer is astronomically accurate (for April 2nd, 2022) and below are some of the possible visualizations that I found interesting. These visualizations show all active satellites that match the query I entered.

Thanks for reading! Next post will be about the Covid results! (coming soon)
