Running 'wget' downloads in the background on Data Transfer Nodes and how to avoid using too many resources

The 'wget' utility is a tried-and-true tool, up there with 'rsync' and 'scp/sftp'. It's been around a long time and you can find lots of information online about how to use it. Just about any website or public repository that offers files for download will offer a way to 'wget' or 'sftp' them, or maybe even FTP if it's really old school. The data transfer nodes and login nodes at the Center for Advanced Research Computing all support these traditional tools (except FTP :slight_smile: ), and they're fine for certain file transfer tasks.

There are better, more modern tools for transferring the enormous research datasets that our researchers work with every day, and Research Computing is either testing them or getting ready to roll them out to the community. Today we support Globus and other high-performance tools like 'bbcp', and we're evaluating 'aria2c'. Please stand by for announcements and offers to help us test better, faster ways to streamline your research. In the meantime, you're welcome to use familiar tools like 'wget' for files that aren't too big (under the 100 GB range, better yet the low tens of GB) and as long as you don't run too many concurrent downloads (8-10 at a time, maybe; it depends). The Research Computing data transfer and login nodes are resources shared among hundreds of researchers and many, many projects, so our research staff monitors the systems to make sure your download goals are achieved without jeopardizing the usability of the systems for other users.
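
One 'wget' option worth knowing in that spirit is '--limit-rate', which caps how much bandwidth a single download will use. The flag is standard wget, but the rate and placeholder URL below are purely illustrative, not an official limit for our nodes:

# cap this download at roughly 50 megabytes per second (the figure is just an example)
wget --limit-rate=50m https://<some-repository>/<big-file>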

So let’s say you have a manageable dataset to download from a public repository, like some genome sequences. You could log into hpc-transfer and run a command something like

wget http://<file1>; wget http://<file2>; wget http://<file3>

Experienced users might start up a 'screen' session, kick off a command like this, log out for the day, and check on it when they get home. But there are some options built into 'wget' that might make things a little easier for you.
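
Before we get to those options, if you haven't used 'screen' before, the basic pattern looks something like this (the session name is arbitrary and the URLs are placeholders; Ctrl-a d is the standard detach keystroke):

# start a named screen session
screen -S downloads

# inside the session, kick off your downloads as usual
wget http://<file1>; wget http://<file2>

# detach with Ctrl-a d, log out, and reattach later with:
screen -r downloads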

The ‘wget’ utility can take an input file of links to download. If you open a text editor and cut and paste a list of links, something like this:

https://ftp.ncbi.nlm.nih.gov/genomes/genbank/plant/Capsicum_chinense/latest_assembly_versions/GCA_002271895.2_ASM227189v2/GCA_002271895.2_ASM227189v2_genomic.fna.gz
https://ftp.ncbi.nlm.nih.gov/genomes/genbank/plant/Capsicum_chinense/latest_assembly_versions/GCA_002271895.2_ASM227189v2/GCA_002271895.2_ASM227189v2_genomic.gbff.gz
https://ftp.ncbi.nlm.nih.gov/genomes/genbank/plant/Capsicum_chinense/latest_assembly_versions/GCA_002271895.2_ASM227189v2/GCA_002271895.2_ASM227189v2_genomic.gff.gz
https://ftp.ncbi.nlm.nih.gov/genomes/genbank/plant/Capsicum_chinense/latest_assembly_versions/GCA_002271895.2_ASM227189v2/GCA_002271895.2_ASM227189v2_genomic.gtf.gz

… and save it as, say, "download-files1.txt", you can run your 'wget' process in the background and have it take the list of files from your input file, like this:

wget --no-verbose --background --input-file=download-files1.txt

Again, don't put more than 10-15 links in this file until you get an idea of how big the files are and how long they take to download; shoot for under an hour to complete the run (one quick way to check sizes ahead of time is sketched below). It's hard to give exact guidelines because everybody's download jobs are different, but if you kick off a job that takes a month and a half to download 200 TB you run the risk of exhausting system resources! Reach out to us in a Jira ticket if you have any questions.
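
If you want a rough idea of how big a remote file is before you commit, wget's '--spider' option checks the file without downloading it and reports a 'Length:' line, assuming the server provides that information. The URL here is just the first file from the list above:

# ask the server about the file without downloading it; look for the 'Length:' line
wget --spider https://ftp.ncbi.nlm.nih.gov/genomes/genbank/plant/Capsicum_chinense/latest_assembly_versions/GCA_002271895.2_ASM227189v2/GCA_002271895.2_ASM227189v2_genomic.fna.gz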

Anyway, the command above with '--background' runs in the background and logs everything it does to a text file named 'wget-log'. You can log out, go home, eat dinner, walk the dog and check on it later. The '--no-verbose' option tells wget to log the successes and failures but skip the exhaustive byte-by-byte statistics. If you're troubleshooting a throughput problem with a colleague, leave '--no-verbose' off and you'll get the full transfer statistics to look over.

If you really like watching the progress of your download, do a 'tail -f wget-log' and watch it run, then hit CTRL-C when you see the "FINISHED" line (assuming your transfer works).
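
Spelled out, that looks like the lines below; the 'grep' check is just one quick way to scan the log afterward, not an official procedure:

# follow the log as wget writes to it; CTRL-C stops 'tail', not the background download
tail -f wget-log

# or scan the finished log for anything that failed (e.g. HTTP errors)
grep -i error wget-log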

But that list of links in 'download-files1.txt' is really long and unwieldy. What if you have a list of file names, and you already know the URL path, say from an earlier download run? The 'wget' utility has a handy '--base=URL' option to condense its input slightly.

Say you create a file that looks like this:

$ cat download-files2.txt
GCA_002271895.2_ASM227189v2_genomic.fna.gz
GCA_002271895.2_ASM227189v2_genomic.gbff.gz
GCA_002271895.2_ASM227189v2_genomic.gff.gz
GCA_002271895.2_ASM227189v2_genomic.gtf.gz
GCA_002271895.2_ASM227189v2_genomic_gaps.txt.gz
GCA_002271895.2_ASM227189v2_protein.faa.gz
GCA_002271895.2_ASM227189v2_protein.gpff.gz

You can run a command like this, setting the '--base' argument to everything up to the file name (because it's the same for every file):

wget --base='https://ftp.ncbi.nlm.nih.gov/genomes/genbank/plant/Capsicum_chinense/latest_assembly_versions/GCA_002271895.2_ASM227189v2/' \
--input-file=download-files2.txt \
--no-verbose --background

Here I split the line with a trailing '\' to continue it on the next line, but it's the same as if you ran it on one line or executed it from a simple shell script (a minimal example of that is below). Okay, it's not that much prettier than the explicit link on every line, but it might save you a typo when you plunk the names in from a spreadsheet or a Word doc someone sent you.
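
If you'd rather keep the command in a script so you can rerun or tweak it, a bare-bones version might look like this; the script name is made up, and the command is the same one shown above:

#!/bin/bash
# download-genomes.sh (hypothetical name): fetch the files listed in download-files2.txt
wget --base='https://ftp.ncbi.nlm.nih.gov/genomes/genbank/plant/Capsicum_chinense/latest_assembly_versions/GCA_002271895.2_ASM227189v2/' \
     --input-file=download-files2.txt \
     --no-verbose --background

Make it executable with 'chmod +x download-genomes.sh' and kick it off with './download-genomes.sh'.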

This quick post can't touch on every feature of 'wget', and like I said, your Research Computing staff is working on faster, more modern tools to make your work easier. Effort is going into facilitating large research data transfers on campus and beyond, activities that exceed the abilities of our friends 'wget', 'rsync' and their ilk, so stay tuned for exciting announcements. I hope this post helps you use the 'wget' command for your traditional, less resource-intensive downloads.
