Say you need to md5sum 46 files, all ending in “.foo” in a single directory. You might use your standard
`md5sum *.foo > md5sum.txt` command to checksum them all in one process. Get coffee, it’s done, move on.
Oh, I should mention those 46 files are a little under 6 terabytes in total size. The standard command might take a while. I drink a lot of coffee, but whoa.
Now imagine you have a 16-core server connected via modern InfiniBand to an otherwise idle pre-production parallel filesystem with several hundred disks and multiple controllers, each with its own cache. The odds are tilting in your favor. This is especially true if you read up on this pair of options in xargs(1), which inexplicably, shamefully, I Did Not Know:
--max-args=max-args, -n max-args
    Use at most max-args arguments per command line. Fewer than max-args arguments will be used if the size (see the -s option) is exceeded, unless the -x option is given, in which case xargs will exit.

--max-procs=max-procs, -P max-procs
    Run up to max-procs processes at a time; the default is 1. If max-procs is 0, xargs will run as many processes as possible at a time. Use the -n option with -P; otherwise chances are that only one exec will be done.
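To see what -n does before pointing it at real data, here is a toy sketch with echo standing in for the real command. The five letters and the `batches` variable are made up for illustration.

```shell
# -n 2: at most two arguments per echo invocation, so five inputs need three runs
batches=$(printf '%s\n' a b c d e | xargs -n 2 echo)
printf '%s\n' "$batches"
# prints:
#   a b
#   c d
#   e
# With -P 2 added, up to two of those runs execute at once,
# so the order of the output lines is no longer guaranteed.
```

Without -n, xargs would happily pack all five arguments into a single echo, which is why the man page says to pair -n with -P.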
Sure, I could have run this on more than one such machine connected to the same filesystem. There are a number of tools that can split up work across multiple child processes on a single machine, none of which were installed in this environment. I wanted to see what I could get this single server to do with basic commands.
46 files / 16 cores = 2.875, so let’s give this a shot:
`find . -type f -name "*.foo" | xargs -P 16 -n 3 md5sum | tee md5sum.out`
English: For every file ending in “.foo” in this directory (and any subdirectories, since find recurses), run md5sum up to 16 times in parallel with up to three files per run, show results as they happen, and save the output.
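If you would rather not do the division by hand, something like the following might work. It is only a sketch: `parallel_md5` is a hypothetical helper name, not anything that ships with xargs, and it assumes a find and xargs that support `-print0` and `-0` (GNU and BSD versions do), which also keeps filenames with spaces intact.

```shell
# Hypothetical helper: checksum every *.foo file under a directory in parallel.
# $1 = directory to search, $2 = number of parallel md5sum processes
parallel_md5() {
    dir=$1
    procs=$2
    # Count matching files by counting the NUL terminators find emits
    files=$(find "$dir" -type f -name '*.foo' -print0 | tr -cd '\0' | wc -c)
    # Nothing to do if there are no matches
    [ "$files" -gt 0 ] || return 0
    # Ceiling division: 46 files over 16 processes gives 3 files per run
    per_run=$(( (files + procs - 1) / procs ))
    # NUL-delimited names survive spaces and other odd characters
    find "$dir" -type f -name '*.foo' -print0 |
        xargs -0 -P "$procs" -n "$per_run" md5sum
}
```

Usage would look like `parallel_md5 . 16 | tee md5sum.out`. Rounding up means each process gets one batch, so a run that draws the biggest files will finish last; a smaller -n trades more md5sum startups for better balance.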
Please Note: This will absolutely not help unless you have the storage infrastructure to handle it. Your Best Buy hard drive will not handle it. It has a strong chance of making your machine unhappy.
In this case, I got something lovely from top:
    PID   S %CPU  COMMAND
    29394 R 100.0 md5sum
    29396 R 100.0 md5sum
    29397 R 100.0 md5sum
    29398 R 100.0 md5sum
    29399 R 100.0 md5sum
    29400 R 100.0 md5sum
    29401 R 100.0 md5sum
    29402 R 100.0 md5sum
    29403 R 100.0 md5sum
    29391 R  99.6 md5sum
    29392 R  99.6 md5sum
    29393 R  99.6 md5sum
    29395 R  99.6 md5sum
    29404 R  99.6 md5sum
    29405 R  99.6 md5sum
    29406 R  99.6 md5sum
Early on, there were some D states waiting for cache to warm up, and CPU dropped below 70% for one or two processes, but I’ll take it. I’ll especially take this:
Right on, xargs. Easy parallelization on one system for single file tasks driven from a file list or search.
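For the file-list case, a rough sketch might look like this. `md5_from_list` is a hypothetical helper name, and the `-d '\n'` option is a GNU xargs feature, so this assumes GNU findutils. Parallel runs finish in whatever order they like, so the sketch sorts the results by filename to keep the output stable between runs.

```shell
# Hypothetical helper: checksum the paths listed one per line in a file.
# $1 = list file, $2 = max parallel md5sum processes
md5_from_list() {
    # -d '\n' (GNU xargs) treats each input line as one argument, spaces included
    # -n 1 hands each md5sum a single file so work spreads evenly
    # sort -k 2 orders the "hash  filename" lines by filename
    xargs -d '\n' -P "$2" -n 1 md5sum < "$1" | sort -k 2
}
```

Usage would be along the lines of `md5_from_list filelist.txt 16 > md5sum.out`, where filelist.txt is whatever list you already have.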