When you’re working with a large number of files, there will be times when you need to find the location where a certain word or phrase is saved. The
grep—global regular expression print—utility makes it possible to perform said search.
grep is often used in conjunction with other command line utilities to filter text and display only the information that is needed. For example, when you run the command ps aux | grep "$USER", only the lines of process information that contain your username are printed. The ps aux command prints detailed information about all running processes, and | grep "$USER" funnels that output to grep, which filters out any line that doesn’t have your username in it.
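As a rough illustration of the same filtering idea, the sketch below pipes a few made-up lines through grep in place of real ps aux output. The user names, PIDs, and the variable names sample_ps and alice_procs are invented for this example.

```shell
# Made-up stand-in for `ps aux` output; the users and PIDs are hypothetical.
sample_ps='USER PID COMMAND
alice 101 bash
bob 202 vim
alice 303 python3'

# Keep only the lines that contain "alice", the way
# `ps aux | grep "$USER"` keeps only lines containing your username.
alice_procs=$(printf '%s\n' "$sample_ps" | grep 'alice')
printf '%s\n' "$alice_procs"
```

Only the two lines belonging to alice should be printed. The header line and bob's process are dropped because they do not contain the search text.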
With grep you can specify a “regular expression”, that is, a pattern to search for. To give you an idea of how regular expressions work, searching for gr[ae]y will match either ‘gray’ or ‘grey’.
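To see the bracket pattern in action, here is a minimal sketch that feeds a handful of words to grep. The word list and the variable names are made up for illustration.

```shell
# Five candidate words, one per line (invented test data).
words='gray
grey
groy
graey
gry'

# [ae] matches exactly one character, which must be "a" or "e".
matches=$(printf '%s\n' "$words" | grep 'gr[ae]y')
printf '%s\n' "$matches"
```

Only gray and grey should be printed. The other three words fail because the bracket expression stands for exactly one character from the set, not zero and not two.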
To demonstrate the usefulness of grep, let’s look at the script haystack_generator.py from the needle_in_haystack repository, which generates a set of files containing random words. We will attempt to determine how many times the words “needle” or “needles” appear in a set of text files, and grep will tell us in which file and on which line to find them.
To run the script, you will first have to download
haystack_generator.py and install the random-word python package.
wget https://raw.githubusercontent.com/CeSul/needle_in_haystack/main/haystack_generator.py
pip install random-word
Note: If you want to run this on the Discovery cluster, you will also have to run module load python before installing.
After you run the script you should see a directory named
haystack_dir with 10 files, each with 100 words.
$ python3 haystack_generator.py
Generating file 0
Generating file 1
Generating file 2
Generating file 3
Generating file 4
Generating file 5
Generating file 6
Generating file 7
Generating file 8
Generating file 9
$ ls haystack_dir/
haystack00.txt haystack02.txt haystack04.txt haystack06.txt haystack08.txt
haystack01.txt haystack03.txt haystack05.txt haystack07.txt haystack09.txt
To start, we can run
grep -rn 'need' haystack_dir
The -r flag enables a recursive search.
The -n flag prints the line number the match appears on.
All together, this tells us the location of anything that has “need” in it. You should see something similar to:
$ grep -rn 'need' haystack_dir
haystack_dir/haystack08.txt:21:need
haystack_dir/haystack08.txt:52:need
haystack_dir/haystack09.txt:45:needle
haystack_dir/haystack04.txt:48:needless
haystack_dir/haystack04.txt:77:need
haystack_dir/haystack04.txt:90:needle
haystack_dir/haystack04.txt:96:needles
haystack_dir/haystack04.txt:98:needle
haystack_dir/haystack07.txt:82:need
haystack_dir/haystack06.txt:18:needle
haystack_dir/haystack06.txt:26:need
haystack_dir/haystack06.txt:64:needles
haystack_dir/haystack03.txt:25:needle
haystack_dir/haystack03.txt:44:needless
haystack_dir/haystack01.txt:62:need
haystack_dir/haystack00.txt:2:needless
haystack_dir/haystack00.txt:72:needless
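If you would like to try the -r and -n flags without generating the full haystack, the sketch below builds a tiny throwaway directory instead. The directory layout, file names, and file contents are all made up for this example. The output is piped through sort only to make the order predictable.

```shell
# Build a small temporary directory so the example is repeatable.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/sub"
printf 'hay\nneedle\nhay\n' > "$demo_dir/a.txt"
printf 'hay\nhay\nneedless\n' > "$demo_dir/sub/b.txt"

# -r descends into sub/, and -n prefixes each match with its line number.
found=$(cd "$demo_dir" && grep -rn 'need' . | sort)
printf '%s\n' "$found"

# Clean up the temporary files.
rm -r "$demo_dir"
```

Each output line has the form file:line:text, so you should see ./a.txt:2:needle followed by ./sub/b.txt:3:needless.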
So, we can see that there are words we want and words we don’t want. Let’s see what happens when we change the pattern to “needle” to filter out “need”.
$ grep -rn 'needle' haystack_dir
haystack_dir/haystack09.txt:45:needle
haystack_dir/haystack04.txt:48:needless
haystack_dir/haystack04.txt:90:needle
haystack_dir/haystack04.txt:96:needles
haystack_dir/haystack04.txt:98:needle
haystack_dir/haystack06.txt:18:needle
haystack_dir/haystack06.txt:64:needles
haystack_dir/haystack03.txt:25:needle
haystack_dir/haystack03.txt:44:needless
haystack_dir/haystack00.txt:2:needless
haystack_dir/haystack00.txt:72:needless
We get rid of “need”, but because “needle” can be found in “needles” and “needless”, those show up as well. Regular expressions are a whole topic unto themselves, so we will only discuss them superficially here. We can add the special sequence \b, which means “word boundary”, to our regular expression. As the name implies, the word boundary matches at either the beginning or the end of a word. By searching for needle\b, grep will only show text if it’s a word that ends with “needle”.
$ grep -rn 'needle\b' haystack_dir
haystack_dir/haystack09.txt:45:needle
haystack_dir/haystack04.txt:90:needle
haystack_dir/haystack04.txt:98:needle
haystack_dir/haystack06.txt:18:needle
haystack_dir/haystack03.txt:25:needle
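The effect of the word boundary is easier to see side by side. Here is a small sketch using a made-up word list. It assumes GNU grep, where \b is available as an extension; other grep implementations may behave differently.

```shell
# Invented test words, one per line.
words='need
needle
needles
needless'

# 'needle' alone also matches inside longer words.
plain=$(printf '%s\n' "$words" | grep 'needle')

# \b requires a word boundary immediately after "needle".
bounded=$(printf '%s\n' "$words" | grep 'needle\b')

printf 'plain:\n%s\nbounded:\n%s\n' "$plain" "$bounded"
```

The plain search should list needle, needles, and needless, while the bounded search should list only needle.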
This has eliminated “needless”, but also “needles”. Let’s adjust our regular expression to needles?\b. The ? signifies that it will match zero or one occurrence of the preceding character, “s”. So in plain English, our regular expression says to match any word that has “needle” followed by nothing or “needle” followed by one “s”. Either option is then followed by a word boundary.
The ? character is not treated as special by grep’s default (basic) regular expressions, so we will use egrep, which supports extended regular expressions and is equivalent to grep -E.
$ egrep -rn 'needles?\b' haystack_dir
haystack_dir/haystack09.txt:45:needle
haystack_dir/haystack04.txt:90:needle
haystack_dir/haystack04.txt:96:needles
haystack_dir/haystack04.txt:98:needle
haystack_dir/haystack06.txt:18:needle
haystack_dir/haystack06.txt:64:needles
haystack_dir/haystack03.txt:25:needle
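To make the difference between basic and extended mode concrete, the sketch below runs the same pattern both ways on a made-up word list. It uses grep -E, the equivalent of egrep, and again assumes GNU grep for \b.

```shell
# Invented test words, one per line.
words='needle
needles
needless'

# In basic mode an unescaped ? is a literal question mark, so nothing
# matches here. `|| true` keeps grep's "no match" exit status from
# stopping a script that runs with `set -e`.
basic=$(printf '%s\n' "$words" | grep 'needles?\b' || true)

# In extended mode, ? makes the preceding "s" optional.
extended=$(printf '%s\n' "$words" | grep -E 'needles?\b')

printf 'basic: [%s]\nextended:\n%s\n' "$basic" "$extended"
```

The basic search should come back empty, while the extended search should return needle and needles but not needless.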
It’s possible that your text contains a word or pattern like “abcneedles”, in which case that will match too. To be safe, we should add another word boundary at the beginning of the word to make our regular expression
\bneedles?\b. This means a word that starts with “needle”, is followed by zero or one “s”, and then ends at another word boundary.
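Here is a short sketch of what the leading word boundary changes, using a made-up word list that includes “abcneedles”. As before, it assumes GNU grep for \b.

```shell
# Invented test words, one per line.
words='needle
needles
abcneedles
needless'

# With only a trailing \b, "abcneedles" still matches.
trailing_only=$(printf '%s\n' "$words" | grep -E 'needles?\b')

# With \b on both sides, only the whole words match.
whole_word=$(printf '%s\n' "$words" | grep -E '\bneedles?\b')

printf 'trailing only:\n%s\nboth sides:\n%s\n' "$trailing_only" "$whole_word"
```

The first search should return needle, needles, and abcneedles, and the second only needle and needles. grep also has a -w flag that restricts matches to whole words, which may be a convenient alternative to writing both boundaries by hand.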
With this one command we can see that in this example, there are 5 occurrences of “needle” and 2 of “needles”. We also know where they occur in each file, so if we have to make changes, we know where to go. This is just a small taste of what you can do, but hopefully this has shown you the power of the grep command and has given you some ideas about how it can be used in your work.
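If you would rather have the computer do the counting, one possible approach is to combine grep with sort and uniq. The sketch below is only an illustration: the directory, file names, and contents are made up, and it assumes a grep that supports the -o (print only the matched text) and -h (omit file names) flags, as GNU grep does.

```shell
# Build a small temporary haystack (invented data).
demo_dir=$(mktemp -d)
printf 'hay\nneedle\nneedless\nneedles\n' > "$demo_dir/one.txt"
printf 'needle\nhay\nabcneedles\n' > "$demo_dir/two.txt"

# -o prints only the matched text and -h omits file names, so each
# output line is just "needle" or "needles". sort | uniq -c tallies
# them, and awk rearranges each tally into word=count form.
tally=$(grep -rohE '\bneedles?\b' "$demo_dir" | sort | uniq -c | awk '{print $2 "=" $1}')
printf '%s\n' "$tally"

# Clean up the temporary files.
rm -r "$demo_dir"
```

For this sample data the tally should read needle=2 and needles=1. Pointing the same pipeline at haystack_dir would count the matches in the generated files.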
Even without learning regular expressions, you can still get a lot done with
grep. If you are interested in learning more, we strongly recommend Web Dev Simplified’s tutorial Learn Regular Expressions in 20 Minutes.
Or, if you prefer reading, https://www.regular-expressions.info is another great resource.