An employee of an e-discovery service provider asked me to help him explain to his boss why deduplication works well for native files but frequently fails when applied to TIFF images. The question intrigued me because it requires we dip our toes into the shallow end of cryptographic hashing and dispel a common misconception about electronic documents.
Most people regard a Word document file, a PDF or TIFF image made from the document file, a printout of the file and a scan of the printout as being essentially “the same thing.” Understandably, they focus on content and pay little heed to form. But when it comes to electronically stored information, the form of the data—the structure, encoding and medium employed to store and deliver content–matters a great deal. As data, a Word document and its imaged counterpart are radically different data streams from one-another and from a digital scan of a paper printout. Visually, they are alike when viewed as an image or printout; but digitally, they bear not the slightest resemblance. Continue reading