REGEX VS SPACY

Chiedu Mokwunye
4 min read · Feb 20, 2024


Regular expressions, often abbreviated as regex, are a powerful tool for pattern matching in text. A regex is a string of characters that defines a search pattern, allowing for flexible and precise matching. For example, suppose you have a chatbot that asks users for their order numbers. You have to take into account that users will not all respond the same way. User A’s response might be “Order number is NA1234”, User B’s might be “My order number is NA1234”, and User C’s might be “NA1234”. Whatever the wording, the one thing all the responses have in common is the order number. Now imagine that, as the frontend engineer, the only thing you can send to the backend is the exact order number, ignoring everything else the user typed. Regex can be used to extract the order number efficiently, disregarding any surrounding text.
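As a rough sketch of the chatbot scenario, the snippet below pulls the order number out of the three example responses. The pattern assumes order numbers always look like “NA” followed by four digits, and the function name extract_order_number is made up for illustration.

```python
import re

# Assumed order-number format: the letters "NA" followed by exactly four digits.
ORDER_PATTERN = re.compile(r"\bNA\d{4}\b")


def extract_order_number(message):
    """Return the first order number found in the message, or None."""
    match = ORDER_PATTERN.search(message)
    return match.group(0) if match else None


responses = [
    "Order number is NA1234",
    "My order number is NA1234",
    "NA1234",
]

# Every response yields the same order number, whatever the surrounding text.
order_numbers = [extract_order_number(response) for response in responses]
print(order_numbers)
```

A real order-number format would probably need a different pattern, so treat this one as a placeholder.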

On the other hand, spaCy is an open-source library for Natural Language Processing (NLP), a branch of Artificial Intelligence focused on understanding and processing human language. spaCy offers various functionalities, including a tokenizer, which breaks text down into individual words or tokens. Each token also carries attributes such as “like_email”, which flags tokens that resemble email addresses. To see which attributes and methods are available on a token, Python’s built-in dir function can be used to list them.

For the purposes of this article, we are interested in “like_email”.

In a practical demonstration using a dataset stored in a CSV file, both regex and spaCy are applied to extract valid email addresses. The CSV file has 3,000 records with a column for emails. Now let’s dive into the code and results.

First, the file named “employee_data.csv” is opened in read-only mode using Python’s with statement and the open function. The as file syntax assigns the opened file to a variable named file, which serves as a reference for subsequent operations. Within this context, each line of the CSV file is read and stored in a variable named data, resulting in a list of lines.
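A minimal sketch of this step is shown below. So that it runs on its own, it first writes a tiny stand-in for “employee_data.csv” to a temporary folder. The sample rows and column names are invented and are not the real dataset.

```python
import os
import tempfile

# Stand-in for the article's employee_data.csv (hypothetical sample rows),
# written to a temporary folder so the sketch can run by itself.
sample_rows = "name,email\nAda,ada@example.com\nBob,bob@example.org\n"
path = os.path.join(tempfile.mkdtemp(), "employee_data.csv")
with open(path, "w") as file:
    file.write(sample_rows)

# Open the file in read-only mode; it is closed automatically
# when the with block ends.
with open(path, "r") as file:
    data = file.readlines()  # a list with one string per line

print(len(data))
```

With the real file, only the second with block would be needed, pointing at the actual path.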

Next, we join these lines into a single string. This makes the file’s contents easier to manipulate and process.
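Joining the lines could look roughly like this. The list stands in for the lines read from the file, and using a single space as the separator is an assumption, since any separator that keeps the records apart would do.

```python
# Stand-in for the list of lines read from the CSV file (hypothetical rows).
data = ["name,email\n", "Ada,ada@example.com\n", "Bob,bob@example.org\n"]

# Concatenate all lines into one string, separated by a single space
# (the separator is an assumption made for this sketch).
text = " ".join(data)

print(type(text).__name__)
```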

After joining the lines, we import the spaCy library and load the small English model chosen for this task. We then use it to tokenize the text into individual words.
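Here is a small sketch of the tokenization step. To avoid a model download it uses spacy.blank("en"), a blank English pipeline that contains only the tokenizer, which is an assumption made for this example. Loading the small English model would instead be spacy.load("en_core_web_sm"). The sample sentence is made up.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is used here so the
# sketch runs without downloading a trained model.
nlp = spacy.blank("en")

# Hypothetical sample text standing in for the joined CSV contents.
text = "My email is ada@example.com"

# Calling the pipeline on a string returns a Doc, which is a sequence of tokens.
doc = nlp(text)
tokens = [token.text for token in doc]

print(tokens)
```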

Result of the word tokenizer.

Next, we iterate through the list of tokens and check whether each one resembles an email address using the “like_email” attribute. We then extract the identified emails and store them in a separate list.
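The filtering step might be sketched as follows. As an assumption for this example, it again uses a blank English pipeline rather than the small English model, and the sample sentence and addresses are invented.

```python
import spacy

# Assumption: blank English pipeline; like_email is a lexical attribute
# and does not need a trained model.
nlp = spacy.blank("en")

# Hypothetical sample text containing two email addresses.
text = "Reach Ada at ada@example.com or Bob at bob.smith@example.org today"
doc = nlp(text)

# Keep only the tokens that spaCy flags as looking like an email address.
spacy_emails = [token.text for token in doc if token.like_email]

print(len(spacy_emails))
```

Note that like_email works on one token at a time, so an address that the tokenizer splits into several tokens would not be flagged.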

spaCy says we have only 2,993 valid emails.

Let’s try another approach and find all valid emails using a regular expression.

The initial step involves importing the “re” module for regular expressions and defining a pattern to match email addresses. Subsequently, we utilize the “findall” method to locate all instances where the specified pattern matches within the concatenated text. For creating and testing regular expressions, I recommend https://regex101.com/.
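A sketch of the regex approach is shown below. The pattern is a common simplified email pattern chosen for illustration, not necessarily the one used in the notebook, and the sample text is made up.

```python
import re

# A simplified email pattern (an assumption for this sketch): a local part,
# an "@", a domain, a dot, and a top-level domain of at least two letters.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical sample text standing in for the joined CSV contents.
text = "1,Ada,ada@example.com\n 2,Bob,bob-smith@example.org\n 3,Cy,not-an-email\n"

# findall returns every non-overlapping match as a list of strings.
regex_emails = EMAIL_PATTERN.findall(text)

print(len(regex_emails))
```

This pattern is deliberately loose. Fully validating email addresses with a regex is much harder, so it is worth testing any pattern against your own data.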

The regex says we have 3,000 valid email addresses.
Weird, right? Let’s compare the spaCy list of email addresses with the regex list.

The comparison shows the 7 email addresses that the spaCy library doesn’t recognise.
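One way to do such a comparison is sketched below. Both lists are invented purely to show the technique and say nothing about which addresses spaCy actually missed in the dataset.

```python
# Hypothetical results from the two approaches (made-up addresses).
regex_emails = ["ada@example.com", "bob-smith@example.org", "cy@example.net"]
spacy_emails = ["ada@example.com", "cy@example.net"]

# A set gives fast membership checks; the list comprehension keeps
# the original order of the regex results.
found_by_spacy = set(spacy_emails)
missing = [email for email in regex_emails if email not in found_by_spacy]

print(missing)
```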

In conclusion, spaCy’s primary function is not recognizing email addresses, whereas regex excels at precisely defining and matching patterns, which makes it better suited to identifying email addresses in text. The aim of this article is to show you the options available for pattern matching, and it is worth noting that other libraries are available too.
The notebook is available on GitHub: https://github.com/Anniez94/spacy_vs_regex/blob/main/Spacy_vs_Regex.ipynb

Written by Chiedu Mokwunye

Web and Mobile Developer, Tech Lover.
