Sales Call

I get a lot of cold calls. Today I heard a voicemail from someone I’ve never spoken with, at a consulting firm I’ve never heard of, wanting to tell me all about the “race horses” they have available at their firm.

I am really used to sales jargon, but this was a new one for me. Like they have all these hotshot thoroughbred geniuses of IT consulting (a.k.a. extracting your money), and they can’t wait to tell me all about their prowess at rocking out and recontextualizing my paradigms. I adjusted my Cheese Tolerator yet another regrettable notch higher and moved on. 

I then accidentally heard the message again and realized that the person said “resources.” Oh, right. Dehumanizing, but certainly more mainstream. 

The fact that I thought “race horses” was even a possible term is sad, and it indicates that the Tolerator may have to stay at the new setting. 

I Did Not Know: xargs -n and -P

Say you need to md5sum 46 files, all ending in “.foo” in a single directory. You might use your standard `md5sum *.foo > md5sum.txt` command to checksum them all in one process. Get coffee, it’s done, move on.

Oh, I should mention those 46 files are a little under 6 terabytes in total size. The standard command might take a while. I drink a lot of coffee, but whoa.

Now imagine you have a 16-core server connected via modern InfiniBand to an otherwise idle pre-production parallel filesystem with several hundred disks and multiple controllers, each with their own cache. The odds are tilting in your favor. This is especially true if you read up on this pair of options in xargs(1), which inexplicably, shamefully, I Did Not Know:

--max-args=max-args, -n max-args
       Use at most max-args  arguments  per  command  line.
       Fewer  than  max-args  arguments will be used if the
       size (see the -s option) is exceeded, unless the  -x
       option is given, in which case xargs will exit.
--max-procs=max-procs, -P max-procs
       Run up to max-procs processes at a time; the default
       is 1.  If max-procs is 0, xargs  will  run  as  many
       processes  as possible at a time.  Use the -n option
       with -P; otherwise chances are that  only  one  exec
       will be done.

Sure, I could have run this on more than one such machine connected to the same filesystem. There are a number of tools that can split up work across multiple child processes on a single machine, none of which were installed in this environment. I wanted to see what I could get this single server to do with basic commands.

46 files / 16 cores = 2.875, so let’s give this a shot:

find . -type f -name "*.foo" | xargs -P 16 -n 3 md5sum | tee md5sum.out

English: For the files ending in “.foo” in this directory, run md5sum up to 16 times in parallel with up to three files per run, show results as they happen, and save the output.

Please Note: This will absolutely not help unless you have the storage infrastructure to handle it. Your Best Buy hard drive will not handle it. It has a strong chance of making your machine unhappy.
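One more caveat: my files had plain names, so the simple pipeline was fine. If your filenames might contain spaces or newlines, the null-delimited variant is safer. Here’s a self-contained sketch of that version in a scratch directory, with made-up names and much smaller counts than my 46-file run:

```shell
# Demo in a scratch directory so the example stands alone.
dir=$(mktemp -d)
for i in 1 2 3 4 5 6; do echo "data $i" > "$dir/file $i.foo"; done

# -print0 / -0 keeps filenames with spaces or newlines intact;
# -P 4 runs up to four md5sum processes at once, and -n 2 hands
# each invocation up to two files.
find "$dir" -type f -name "*.foo" -print0 \
    | xargs -0 -P 4 -n 2 md5sum \
    | tee "$dir/md5sum.out"

# Six files in, six checksum lines out, however xargs batched them.
wc -l < "$dir/md5sum.out"
```

Note that GNU xargs also has `-r` (`--no-run-if-empty`) if there’s any chance the find produces nothing.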

In this case, I got something lovely from top:

  PID S %CPU COMMAND
29394 R 100.0 md5sum
29396 R 100.0 md5sum
29397 R 100.0 md5sum
29398 R 100.0 md5sum
29399 R 100.0 md5sum
29400 R 100.0 md5sum
29401 R 100.0 md5sum
29402 R 100.0 md5sum
29403 R 100.0 md5sum
29391 R 99.6 md5sum 
29392 R 99.6 md5sum 
29393 R 99.6 md5sum 
29395 R 99.6 md5sum 
29404 R 99.6 md5sum 
29405 R 99.6 md5sum 
29406 R 99.6 md5sum

Early on, there were some D states waiting for cache to warm up, and CPU dropped below 70% for one or two processes, but I’ll take it. I’ll especially take this:

real    31m33.147s

Right on, xargs. Easy parallelization on one system for single-file tasks driven from a file list or search.

Good Software Practices Scale Down

Today I revisited some scripts I last touched on December 5, 2011 for very very carefully archiving research data with checksums, an audit trail, and other very very careful things like that.

One of the requirements for this project is that the first phase of my processing needs to accept input data from a provider. Unfortunately, this input format has never been the same twice. Grr.

Upon receipt of the second variation on July 12, 2011 (six days after I started the project), I took the time to make the script somewhat configurable with an external file.

This was handy in November 2011 when I needed to do a similar set of work for a second research dataset. I put everything in a configuration file stored alongside the input data. Date format strings, headers, fields of interest, key/values for data types, etc. That meant I could share code between datasets as they emerged from the wild.

So last week, I got another set of input data. Yep, another unique format. I haven’t thought about this in over a year, and I have a terrible memory. Today, I got the input data parsed and validated in five minutes after editing a config file, because:

  1. I had one place to do customization
  2. I took steps to encourage code reuse
  3. I wrote good comments and gave myself a -h option

All this despite knowing that I was probably the only one who would ever look at this again. And I have those dates because everything is in a Subversion repository. Did I mention that I wrote it in a language I don’t know very well?
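Those scripts aren’t public and I won’t pretend to remember their internals, so here is only a minimal sketch of the pattern in shell: a script that sources a per-dataset config file kept alongside the data. Every file name and variable below is invented for illustration.

```shell
# A hypothetical per-dataset config, stored next to the input data.
# All names here are made up.
dir=$(mktemp -d)
cat > "$dir/dataset.conf" <<'EOF'
DATE_FORMAT='%Y-%m-%d'
HEADER_LINES=2
FIELDS='accession,collected_on,checksum'
EOF

# The processing script sources the config, so a new input format
# means editing one small file instead of the script itself.
. "$dir/dataset.conf"
echo "skip $HEADER_LINES header lines; keep fields: $FIELDS"
```

The point isn’t the mechanism (sourcing, an ini parser, whatever your language offers); it’s that the per-dataset quirks live in one small file.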

Granted, this is a tiny little thing in the universe of computer things, but my point here is that it’s often worth doing the right thing for the next guy, even for small things, even if the next guy is you. Perhaps especially if it’s you.

I Did Not Know: multitail

I think I once knew this, but forgot about it. Still, the fact remains:

I Did Not Know about multitail.

I’m watching eight different log files from a Windows server through CIFS on my Windows 7 desktop where I’m running Cygwin and multitail. This is both pleasant and awesome-looking, which is not normally the case for watching eight different log files, especially on Windows.
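If you haven’t seen it either: the invocation is just a list of files, each getting its own pane. Since multitail is interactive, this sketch only shows the shape of the command (paths are made up; mine were CIFS-mounted Windows logs) and checks availability without opening the UI:

```shell
# Each file gets its own pane:
#
#   multitail app1.log app2.log app3.log app4.log \
#             app5.log app6.log app7.log app8.log
#
# Interactive tool, so just probe for it here:
command -v multitail >/dev/null 2>&1 && echo "multitail available" \
    || echo "multitail not installed"
```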

New Job: St. Jude

I have a personal re-org announcement for you:

Yesterday was my last official day as a system administrator at The University of Memphis. I’m starting at the end of this month as an Enterprise Network Storage Architect at St. Jude Children’s Research Hospital.

I was at the U of M for a little over five years, and for the most part, I really enjoyed it. I worked with some good people in the IT Division and throughout the University. I did interesting technical work and had my share of successes. I developed professional and personal relationships that I truly appreciate. I started and finished a graduate degree in my spare time.

Most importantly, I met Molly.

I’m looking forward to St. Jude. It’ll be fun to work on projects where you can say “petabyte” with a straight face. I’m excited to contribute to the work the researchers are doing, and a quick walk through the halls of the main buildings makes it obvious that this work is important and often miraculous.

Most importantly, I can occasionally have lunch with Molly. And Simon.

– Yes, this is a mouthful.