If you’re on this turf, chances are you already know that de-NISTing is a technique used in e-discovery and computer forensics to reduce the number of files requiring review by excluding standard components of the computer’s operating system and off-the-shelf software applications like Word, Excel and other parts of Microsoft Office.  Everyone has this digital detritus on their systems: things like Windows screen saver images, document templates, clip art, system sound files and so forth.  It’s the stuff that comes straight off the installation disks, and it’s just noise to a document review.

It’s called “de-NISTing” because those noise files are identified by matching their hash values (i.e., digital fingerprints) to a huge list of software hash values maintained and published by the National Software Reference Library, a branch of the National Institute of Standards and Technology (NIST).  The NIST list is free to download, and pretty much everyone who processes data for e-discovery and computer forensic examination uses it.  If you’re paying a vendor to de-NIST, you probably think you’re getting value for the service.  I expect nearly everybody who de-NISTs believes that they’re culling the most common operating system and application files.  I mean, that’s the whole point, right?
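
For readers who like to see the mechanics, here is a minimal sketch of hash-based culling in Python. It is an illustration, not anyone's production tool: the function names are hypothetical, it assumes the known hashes are SHA-1 values supplied as hex strings, and it leaves out the step of loading the hash list from whatever file format your copy of the NIST list uses.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the SHA-1 of a file as an uppercase hex string, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def de_nist(root, known_hashes):
    """Split the files under root into (kept, culled) lists.

    known_hashes is assumed to be an iterable of SHA-1 hex strings taken
    from a reference list; how that list is loaded is up to the caller.
    """
    known = {h.upper() for h in known_hashes}  # normalize case once
    kept, culled = [], []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        if sha1_of_file(path) in known:
            culled.append(path)  # matches a known file: noise
        else:
            kept.append(path)    # no match: stays in the review set
    return kept, culled
```

The point to notice is that a file is culled only when its hash appears in the reference set. Any noise file the list has never seen sails straight through to review.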

Sorry to burst your bubble.

Earlier this summer, I began to wonder why de-NISTing was doing such a poor job reducing the volume of files in systems I’d collected for review.  These were late-model systems running Windows Vista or Windows 7 and the latest release of Microsoft Office.  That is, they were the sort of machine one is likely to encounter in millions of homes and businesses today.

The NIST list is updated four times a year, and I was using the very latest release; but, most of the noise files I expected would be excluded by de-NISTing weren’t going away.  So, I ran a test.  I created a pristine install of Windows 7 on a sterile hard drive.  The pristine install consisted of 47,690 files, and everything on the drive that wasn’t fashioned on the fly as part of the install process came straight off the Windows installation disk.

But, do you know how many of those 47,690 files were on the latest NIST list? Just 7,277!  That’s right, the NIST list misses 85% of the files in a pristine Windows 7 installation.
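
If you want to check the arithmetic, a few lines of Python will do it. The function name is just a convenience for illustration; the two counts are the ones from my test.

```python
def match_rates(matched, total):
    """Return (share matched, share missed) as fractions of total."""
    coverage = matched / total
    return coverage, 1 - coverage


coverage, missed = match_rates(7_277, 47_690)
print(f"{coverage:.1%} matched, {missed:.1%} missed")
# prints "15.3% matched, 84.7% missed"
```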

Some of you surely share my astonishment.  The rest of you are rightly thinking, “Craig really needs to get out more.”

But seriously folks, that’s a terrible performance, and it translates into real, honest-to-goodness wasted wampum for litigants when the noise files that should have been culled pass through one of the pay-by-the-gigabyte tolls downstream.

I did some exploring and found that one reason the NIST list missed so many noise files is that NIST hasn’t yet processed Windows 7 for addition to the list.  More than 350 million machines run Windows 7, but apparently none at NIST.  Arrrgh!  What’s more, the NIST list doesn’t include the components of Microsoft Office 2010 either.  Only 100 million machines run Office 2010.

The purpose of this post isn’t to disparage some overworked government technician trying to catch up with last year’s work.  Instead, I’m asking whether some vendors are using hash lists that they call NIST lists but that are actually cobbled together on their own.  If you can trace and defend your process for supplementing the NIST list, great.  There’s nothing wrong with making your own exclusionary hash set; but, don’t try to pass it off as the official, government-issued NIST list.  Your Prada knockoffs may be pretty, but they aren’t the real McCoy.

As with Prada bags and Rolex watches, authenticity is a key component of value and a source of confidence.  We don’t quarrel about de-NISTing because the roster of items excluded derives from a government agency through a controlled, transparent process anyone can validate or test, as I did.  When vendors employ proprietary, undocumented exclusion mechanisms for ESI under the rubric of de-NISTing, it may be a better process (or not); but, it’s not a trustworthy process.
