Searching Text with grep

csul · September 22, 2022, 10:16pm

When you’re working with a large number of files, there will be times when you need to find where a certain word or phrase appears. The grep utility, short for “global regular expression print”, makes that search possible.

grep is often used in conjunction with other command-line utilities to filter text and display only the information that is needed. For example, when you run the command ps aux | grep "$USER", only the process entries that contain your username are printed. The ps aux command prints detailed information about all running processes, and | grep "$USER" pipes that output to grep, which filters out any line that doesn’t have your username in it.
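The same pipe-and-filter idea can be tried without touching real process data. The sketch below uses a few made-up lines as a stand-in for ps aux output (the usernames, PIDs, and commands are invented for illustration) and keeps only the lines that contain one name.

```shell
# Hypothetical stand-in for `ps aux` output: user, PID, command.
listing=$(printf '%s\n' \
  'alice  101 vim notes.txt' \
  'bob    102 bash' \
  'alice  103 python3 haystack_generator.py')

# Pipe the text into grep; only lines containing "alice" pass through.
filtered=$(printf '%s\n' "$listing" | grep 'alice')
printf '%s\n' "$filtered"
```

On a real system, swapping the printf stand-in for ps aux and 'alice' for "$USER" gives the command described above.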

With grep you can specify a “regular expression” or a pattern to search for. To give you an idea of the ways that regular expressions work, searching for gr[ae]y will match either ‘gray’ or ‘grey’.
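A quick way to try the gr[ae]y pattern is to feed grep a handful of words on standard input. This is a minimal sketch; the word list is made up for illustration.

```shell
# Candidate words, one per line.
words=$(printf '%s\n' gray grey groy greedy)

# [ae] matches exactly one character that is either "a" or "e".
matched=$(printf '%s\n' "$words" | grep 'gr[ae]y')
printf '%s\n' "$matched"
```

Only gray and grey are printed: groy has the wrong vowel, and greedy never has "y" directly after "gr" plus one vowel.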

To demonstrate the usefulness of grep, let’s look at the script haystack_generator.py from the needle_in_haystack repository, which generates a set of files containing random words. We will attempt to determine how many times the words “needle” or “needles” appear in a set of text files, and grep will tell us in which file and on which line to find them.

To run the script, you will first have to download haystack_generator.py and install the random-word Python package.

wget https://raw.githubusercontent.com/CeSul/needle_in_haystack/main/haystack_generator.py

pip install random-word

Note: If you want to run this on the Discovery cluster, you will also have to run module load python before installing random-word.

After you run the script you should see a directory named haystack_dir with 10 files, each with 100 words.

$ python3 haystack_generator.py
Generating file 0
Generating file 1
Generating file 2
Generating file 3
Generating file 4
Generating file 5
Generating file 6
Generating file 7
Generating file 8
Generating file 9
$ ls haystack_dir/
haystack00.txt haystack02.txt haystack04.txt haystack06.txt haystack08.txt
haystack01.txt haystack03.txt haystack05.txt haystack07.txt haystack09.txt

To start we can run

grep -rn 'need' haystack_dir

The -r flag enables a recursive search.

The -n flag prints the line number the match appears on.

All together, this tells us the location of anything that has “need” in it. You should see something similar to:

$ grep -rn 'need' haystack_dir
haystack_dir/haystack08.txt:21:need
haystack_dir/haystack08.txt:52:need
haystack_dir/haystack09.txt:45:needle
haystack_dir/haystack04.txt:48:needless
haystack_dir/haystack04.txt:77:need
haystack_dir/haystack04.txt:90:needle
haystack_dir/haystack04.txt:96:needles
haystack_dir/haystack04.txt:98:needle
haystack_dir/haystack07.txt:82:need
haystack_dir/haystack06.txt:18:needle
haystack_dir/haystack06.txt:26:need
haystack_dir/haystack06.txt:64:needles
haystack_dir/haystack03.txt:25:needle
haystack_dir/haystack03.txt:44:needless
haystack_dir/haystack01.txt:62:need
haystack_dir/haystack00.txt:2:needless
haystack_dir/haystack00.txt:72:needless
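To see what -r and -n each contribute, the same search can be run on a tiny throwaway directory. This is only a sketch: the directory layout, file names, and contents below are invented, and the results are piped through sort because grep lists files in whatever order the file system returns them.

```shell
# Build a small directory tree (hypothetical names and contents).
demo_dir=$(mktemp -d)
mkdir "$demo_dir/sub"
printf '%s\n' hay need straw > "$demo_dir/a.txt"
printf '%s\n' straw hay needle needless > "$demo_dir/sub/b.txt"

# -r descends into sub/; -n adds the line number after each file name.
results=$(cd "$demo_dir" && grep -rn 'need' . | sort)
printf '%s\n' "$results"

# Clean up the temporary directory.
rm -r "$demo_dir"
```

Each output line has the form file:line:text, and the entries for sub/b.txt appear only because of -r. As with haystack_dir, the pattern 'need' also picks up longer words that contain it.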

The haystack_dir output contains words we want (“needle”, “needles”) and words we don’t want (“need”, “needless”). Let’s see what happens when we change the pattern to “needle” to filter out “need”.

$ grep -rn 'needle' haystack_dir
haystack_dir/haystack09.txt:45:needle
haystack_dir/haystack04.txt:48:needless
haystack_dir/haystack04.txt:90:needle
haystack_dir/haystack04.txt:96:needles
haystack_dir/haystack04.txt:98:needle
haystack_dir/haystack06.txt:18:needle
haystack_dir/haystack06.txt:64:needles
haystack_dir/haystack03.txt:25:needle
haystack_dir/haystack03.txt:44:needless
haystack_dir/haystack00.txt:2:needless
haystack_dir/haystack00.txt:72:needless

We get rid of “need”, but because “needle” can be found in “needles” and “needless”, those show up as well. Regular expressions are a whole topic unto themselves, so we will only discuss them here superficially. We can add the special sequence \b, which means “word boundary”, to our regular expression. As the name implies, the word boundary matches at either the beginning or the end of a word. (\b is an extension to the POSIX syntax; GNU grep supports it.) By searching for needle\b, grep will only show lines containing a word that ends with “needle”.

$ grep -rn 'needle\b' haystack_dir
haystack_dir/haystack09.txt:45:needle
haystack_dir/haystack04.txt:90:needle
haystack_dir/haystack04.txt:98:needle
haystack_dir/haystack06.txt:18:needle
haystack_dir/haystack03.txt:25:needle

This has eliminated “needless”, but also “needles”. Let’s adjust our regular expression to include the ? character: needles?\b. The ? matches zero or one occurrence of the preceding character, “s”. In plain English, our regular expression matches “needle” followed by nothing, or “needle” followed by one “s”; either option must then be followed by a word boundary.

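In grep’s default basic regular expression syntax, ? is treated as a literal character, so we will use egrep, which supports extended regular expressions. egrep is equivalent to grep -E, and recent GNU grep releases print a warning that egrep is obsolescent, so grep -E is the safer spelling in new scripts.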

$ egrep -rn 'needles?\b' haystack_dir
haystack_dir/haystack09.txt:45:needle
haystack_dir/haystack04.txt:90:needle
haystack_dir/haystack04.txt:96:needles
haystack_dir/haystack04.txt:98:needle
haystack_dir/haystack06.txt:18:needle
haystack_dir/haystack06.txt:64:needles
haystack_dir/haystack03.txt:25:needle
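The difference between the basic and extended syntaxes is easy to check on a few sample words. The sketch below assumes GNU grep (for \b), uses a made-up word list, and uses grep -E, the option form equivalent to egrep.

```shell
words=$(printf '%s\n' need needle needles needless)

# Basic syntax: "?" is an ordinary character, so this looks for the
# literal text "needles?" and finds nothing. grep exits non-zero when
# nothing matches, hence the "|| true".
basic=$(printf '%s\n' "$words" | grep 'needles?\b' || true)

# Extended syntax (-E): "s?" means an optional "s".
extended=$(printf '%s\n' "$words" | grep -E 'needles?\b')

printf 'basic: [%s]\n' "$basic"
printf 'extended:\n%s\n' "$extended"
```

The basic search comes back empty, while the extended search returns needle and needles.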

It’s possible that there may be a word or pattern in your text like “abcneedles”, in which case, that will match too. To be safe, we should add another word boundary at the beginning of the word to make our regular expression \bneedles?\b. This matches a word boundary, then “needle”, then zero or one “s”, and then another word boundary.
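Here is a small sketch of the two-sided pattern in action, again assuming GNU grep for \b. The sample lines, including “abcneedles”, are invented for illustration.

```shell
words=$(printf '%s\n' needle needles needless abcneedles 'a needle here')

# \b on both sides: "needle" or "needles" must stand alone as a word.
whole=$(printf '%s\n' "$words" | grep -E '\bneedles?\b')
printf '%s\n' "$whole"
```

Both needless and abcneedles are dropped, while a line that contains needle as a separate word is kept. For the files in this walkthrough, the corresponding command would be grep -E -rn '\bneedles?\b' haystack_dir.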

With this one command we can see that in this example, there are 5 occurrences of “needle” and 2 of “needles”. We also know where they occur in each file, so if we have to make changes, we know where to go. This is just a small taste of what you can do, but hopefully this has shown you the power of the grep command and has given you some ideas about how it can be used in your work.
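If you would rather have the tally computed than count lines by eye, grep’s -o and -h options can be combined with sort and uniq -c. This is a sketch on a throwaway directory; the file names and contents are made up, and GNU grep is assumed for \b.

```shell
demo_dir=$(mktemp -d)
printf '%s\n' hay needle straw needles > "$demo_dir/one.txt"
printf '%s\n' needle needless need needle > "$demo_dir/two.txt"

# -o prints only the matched text; -h drops the file name prefix.
# sort groups identical matches and uniq -c counts each group.
counts=$(grep -rohE '\bneedles?\b' "$demo_dir" | sort | uniq -c)
printf '%s\n' "$counts"

# Clean up the temporary directory.
rm -r "$demo_dir"
```

Each output line is a count followed by the matched word. Pointing the same pipeline at haystack_dir should give the per-word totals for the generated files.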

Even without learning regular expressions, you can still get a lot done with grep. If you are interested in learning more, we strongly recommend Web Dev Simplified’s tutorial Learn Regular Expressions in 20 Minutes.

Or, if you prefer reading, https://www.regular-expressions.info is another great resource.