I once wrote a column titled “Page Equivalency and Other Fables.” It lambasted lawyers who larded their burden arguments with bogus page equivalencies like, “everyone knows a gigabyte of data equates to a pile of printed pages that would reach from Uranus to Earth.” We still see wacky page equivalencies, and “from Uranus” still aptly describes their provenance.
Back in 2007, I wrote, “It’s comforting to quantify electronically stored information as some number of pieces of paper or bankers’ boxes. Paper and lawyers are old friends. But you can’t reliably equate a volume of data with a number of pages unless you know the composition of the data. Even then, it’s a leap of faith.”
So, I’m happy to point you to some notable work by my friend, John Tredennick. I’ve known John since the emerging technology was fire and watched with awe and admiration as John transitioned from old-school trial lawyer to visionary forensic technology entrepreneur running e-discovery service provider, Catalyst. John is as close to a Renaissance man as anyone I know in e-discovery, and when John speaks, I listen.
Lately, John Tredennick shared some revealing metrics on the Catalyst blog looking at the relationship between data and document volumes, an update to his 2011 article called “How Many Documents in a Gigabyte?” John again examines document volumes seen in the data that Catalyst receives and processes for its customers and, crucially, parses the data by file type. As the results bear out, the forms of the data still make an enormous difference in terms of data volume. Even between documents we think of as being “the same” (like Word .doc and .docx formats), the differences are striking.
For example, John’s data suggests that there are almost 60% more documents in a gigabyte of Word files in the .docx format (7,085) than in a gigabyte of files stored in the predecessor .doc format (4,472). This makes sense because the newer .docx format incorporates zip compression, and text is highly compressible data.
[One exercise I require of the law students in my E-discovery class is to look at the file header of a Word .docx file to note its binary signature, PK, characteristic of a zip-compressed file and short for Phil Katz, creator of the zip file format. For grins, you can change the file extension of a .docx file to .zip and open it to see what a Word document really looks like under the hood. Hint: it’s in XML.]
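If you'd rather script that exercise than poke at a hex viewer, here is a minimal Python sketch. The helper names (`has_zip_signature`, `list_parts`, `make_stand_in_docx`) are my own inventions, and the sample is a tiny stand-in archive built in memory, not a real Word document. It only shows the idea: the first bytes are PK, and the container opens like any zip.

```python
import io
import zipfile

# Signature that opens a non-empty zip archive: "PK" followed by 0x03 0x04.
ZIP_SIGNATURE = b"PK\x03\x04"


def has_zip_signature(data):
    """Return True if the bytes begin with the zip local-file-header signature."""
    return data[:4] == ZIP_SIGNATURE


def list_parts(data):
    """Open the bytes as a zip archive and return the names of the parts inside."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()


def make_stand_in_docx():
    """Build a tiny zip in memory (hypothetical sample, NOT a valid Word file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<w:document/>")
    return buffer.getvalue()


sample = make_stand_in_docx()
print(sample[:2])                 # the two-byte "PK" marker
print(has_zip_signature(sample))  # whether the zip signature is present
print(list_parts(sample))         # names of the parts inside the container
```

To try it on a real file, read the bytes with `open(path, "rb").read()` and pass them to the same two functions.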
John reports a similar discrepancy between new and old Excel spreadsheet formats (1,883 .xlsx files per gigabyte versus 1,307 for .xls). Here again, the .xlsx format builds in zip compression.
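For anyone who wants to check the arithmetic behind these comparisons, here is a short sketch using only the per-gigabyte counts quoted above. The function name `percent_more` is mine, and the script does nothing beyond computing a simple percentage difference.

```python
# Documents per gigabyte, as quoted above from John's figures.
docs_per_gb = {
    "Word": {"old": 4472, "new": 7085},   # .doc versus .docx
    "Excel": {"old": 1307, "new": 1883},  # .xls versus .xlsx
}


def percent_more(new_count, old_count):
    """Percentage by which new_count exceeds old_count."""
    return (new_count - old_count) / old_count * 100


for family, counts in docs_per_gb.items():
    gain = percent_more(counts["new"], counts["old"])
    print(f"{family}: {gain:.1f}% more files per gigabyte in the newer format")
```

By this reckoning, the Word gap works out to roughly 58 percent and the Excel gap to roughly 44 percent.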
But the results are reversed when it comes to PowerPoint presentations, with John finding that there are marginally fewer of the newer .pptx files in a gigabyte (505) than the older .ppt format files (580). This makes sense to me because Microsoft phased out the .ppt format ten years ago. Since then, presenters have gotten better about adding visual enhancements to deadly-dull PowerPoints, and they tend to add ‘fatter’ components like video clips. The biggest factor is that pictures are highly incompressible, and common image formats (e.g., .jpg images) have always been compressed. Compressing data that’s already compressed tends to increase, not decrease, its size.
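A rough way to see the compression point for yourself is sketched below, with two loud assumptions. First, the "text" is a deliberately repetitive sample, which exaggerates how well real prose compresses. Second, pseudo-random bytes stand in for already-compressed image data, since a real .jpg is not perfectly random but resists further compression in much the same way. The sketch uses Python's standard `zlib` module, which implements the same DEFLATE method commonly used inside zip containers.

```python
import random
import zlib

# Repetitive sample text (an exaggerated stand-in for a text-heavy document).
text = ("Paper and lawyers are old friends. " * 3000).encode("utf-8")

# Pseudo-random bytes as a stand-in for already-compressed data such as a JPEG.
# Seeded so the run is repeatable.
rng = random.Random(0)
incompressible = bytes(rng.getrandbits(8) for _ in range(len(text)))

compressed_text = zlib.compress(text)
compressed_noise = zlib.compress(incompressible)

print("text:  ", len(text), "->", len(compressed_text), "bytes")
print("noise: ", len(incompressible), "->", len(compressed_noise), "bytes")
```

The text shrinks dramatically, while the stand-in for compressed image data comes out slightly larger than it went in because the compressor has to add its own bookkeeping bytes.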
Wisely, John speaks only of document volumes and makes no effort to project page equivalencies, not even by extrapolating some postulated ‘average-pages-per-file type.’ Anything like that would be as insupportable today as it was when I wrote about it in 2007. Also, when you look at John’s post, note that there is no data supplied concerning TIFF images. I’m not sure why, but I can promise you this: TIFF images are MUCH fatter files, costing far more in terms of storage space and ingestion costs than their native counterparts. Had John added TIFF to the mix, I’m confident his weighted averages would have been much different…and far less useful–much like TIFF images as a form of production. ;-)