I got a call from a lawyer I don’t know on Sunday evening.  He reported that he’d received production of ESI from a financial institution and spent the weekend going through it.  He’d found TIFF images of the pages of electronic documents, but couldn’t search them.  He also found a lot of “Notepad documents.”  He’d sought native production, so thought it odd that they produced so many pictures of documents and plain text files.

As it’s unlikely a bank would rely on Windows Notepad as its word processor, I probed further and learned that the production included folders of TIFF images, folders of .TXT files (those “Notepad documents”) and folders of files with odd extensions like .DAT and .OPT.  My caller didn’t know what to do with these.

By now, you’ve doubtlessly figured out that my caller received an imaged production from an opponent who blew off his demand for native forms and simply printed to electronic paper.  The producing party expected the requesting party to buy or own an old-fashioned review tool capable of cobbling together page images with extracted text and metadata in load files.  Without such a tool, the production would be wholly unsearchable and largely unusable.  When my caller protests, the other side will tell him how all those other files represent the very great expense and trouble they’ve gone to in order to make the page images searchable, as if furnishing load files to add crude searchability to page images of inherently searchable electronic documents constitutes some great favor.

It brings to mind that classic Texas comeback, “Don’t piss in my boot and tell me it’s raining.”

It also reminds me that not everyone knows about load files, those unsung digital sherpas tasked to tote metadata and searchable text otherwise lost when ESI is converted to TIFF images.  Grasping the fundamentals of load files is important to fashioning a workable electronic production protocol, whether you’re dealing with TIFF images, native file formats or a mix of the two.  I’ve been wanting to write about load files for a long time, but avoided it because I just hate the damn things!  So, this post is a load (file) off my mind.

In simplest terms, load files carry data that has nowhere else to go.  They are called load files because they are used to load data into (i.e., to “populate”) a database.  They first appeared in discovery in the 1980s to add a crude level of electronic searchability to paper documents.  Then as now, paper documents were scanned to TIFF image formats and the images subjected to optical character recognition (OCR).  Unlike Adobe PDF images, TIFF images weren’t designed to integrate searchable text; consequently, the text garnered using OCR was stored in simple ASCII[1] text files named with the Bates number of the corresponding page image.  Compared to paper documents alone, imaging and OCR added functionality.  It was 20th century computer technology improving upon 19th century printing technology, and if you were a lawyer in the Reagan era, this was Star Wars stuff.

Metadata is “data about data.”  While we tend to think of metadata as a feature unique to electronic documents, paper documents have metadata, too.  They come from custodians, offices, files, folders, boxes and other physical locations that must be tracked.  Still more metadata takes the form of codes, tags and abstracts reflecting reviewers’ assessments of documents.  Then as now, all of this metadata needs somewhere to lodge as it accompanies page images on their journey to document review database tools (a/k/a “review platforms”) like Concordance or Summation, venerable products that survive to this day.  This data goes into load files.

Finally, we employ load files as a sort of road map and as assembly instructions laying out, inter alia, where document images and their various load files are located on disks or other media used to store and deliver productions and how the various pieces relate to one another.

So, to review, some load files carry extracted text to facilitate search, some carry metadata about the documents and some carry information about how the pieces of the production are stored and how they fit together.  Load files are used because neither paper nor TIFF images are suited to carrying the same electronic content; and if it weren’t supplied electronically, you couldn’t load it into review tools or search it using computers.


Before we move on, let’s spend a moment on the composition of load files.  If you were going to record many different pieces of information about a group of documents, you might create a table for that purpose.  Possibly, you’d use the first column of your table to give each document a number, then the next column for the document’s name and then each succeeding column would carry particular pieces of information about the document.  You might make it easier to tell one column from the next by drawing lines to delineate the rows and columns, like so:

[Image: load file table]

Those lines separating rows and columns serve as delimiters; that is, as a means to (literally) delineate one item of data from the next.  Vertical and horizontal lines serve as excellent visual delimiters for humans, whereas computers work well with characters like commas, tabs and such.  So, if the data from the table were contained in a load file, it might appear as follows:

BEGDOC,ENDDOC,FILENAME,MODDATE,AUTHOR,DOCTYPE
0000001,0000004,Contract,01/12/2013,J. Smith,docx
0000005,0000005,Memo,02/03/2013,R. Jones,docx
0000006,0000073,Taxes_2013,04/14/2013,H. Block,xlsx
0000074,0000089,Policy,05/25/2013,A. Dobey,pdf

Note how each comma replaces a column divider and each line signifies another row.  Note also that the first or “header” row is used to define the type of data that will follow and the manner in which it is delimited.  When commas are used to separate values in a load file, it’s called (not surprisingly) a “comma separated values” or CSV file.  CSV files are just one example of standard forms used for load files.  More commonly, load files adhere to formats compatible with the Concordance and Summation review tools.  Concordance load files typically use the file extension DAT and the þ¶þ characters as delimiters, e.g.:

Concordance Load File
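
As a rough sketch of that layout, the Python below rewrites the first rows of the comma-delimited sample in a Concordance-style form.  It assumes the customary Concordance defaults, where each value is wrapped in the thorn (þ, character 254) and values are separated by ASCII character 20, which review tools usually display as a pilcrow (¶).  The names CSV_TEXT, to_dat_line and csv_to_dat are made up for illustration, and any real production protocol may specify different delimiters.

```python
import csv
import io

# Assumed Concordance defaults; a production protocol may call for others.
FIELD_SEP = "\x14"  # ASCII 20, usually displayed as the pilcrow
QUOTE = "\xfe"      # character 254, the thorn, wrapped around every value

# First rows of the comma-delimited sample from the post.
CSV_TEXT = """BEGDOC,ENDDOC,FILENAME,MODDATE,AUTHOR,DOCTYPE
0000001,0000004,Contract,01/12/2013,J. Smith,docx
0000005,0000005,Memo,02/03/2013,R. Jones,docx
"""


def to_dat_line(values):
    """Wrap each value in the quote character and join with the separator."""
    return FIELD_SEP.join(f"{QUOTE}{value}{QUOTE}" for value in values)


def csv_to_dat(csv_text):
    """Convert comma-delimited text to Concordance-style delimited text."""
    rows = csv.reader(io.StringIO(csv_text))
    return "\n".join(to_dat_line(row) for row in rows)


dat = csv_to_dat(CSV_TEXT)

# For human eyes only: swap the invisible separator for a visible pilcrow.
display = dat.replace(FIELD_SEP, "\u00b6")
```

Viewed this way, the junction between any two values reads þ¶þ: the closing thorn of one value, the separator, and the opening thorn of the next.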

Summation load files typically use the file extension DII, but do not structure content in the same way as Concordance load files; instead, Summation load files separate each record like so:

Summation Load File

; Record 1
@T 0000001
@DOCID 0000001
@MEDIA eDoc
@C ENDDOC 0000004
@C PGCOUNT 4
@C AUTHOR J. Smith
@DATESAVED 01/12/2013
@EDOC \NATIVE\Contract.docx
 
; Record 2
@T 0000005
@DOCID 0000005
@MEDIA eDoc
@C ENDDOC 0000005
@C PGCOUNT 1
@C AUTHOR R. Jones
@DATESAVED 02/03/2013
@EDOC \NATIVE\Memo.docx
 
; Record 3
@T 0000006
@DOCID 0000006
@MEDIA eDoc
@C ENDDOC 0000073
@C PGCOUNT 68
@C AUTHOR H. Block
@DATESAVED 04/14/2013
@EDOC \NATIVE\Taxes_2013.xlsx
 
; Record 4
@T 0000074
@DOCID 0000074
@MEDIA eDoc
@C ENDDOC 0000089
@C PGCOUNT 16
@C AUTHOR A. Dobey
@DATESAVED 05/25/2013
@EDOC \NATIVE\Policy.pdf
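
To see how a review tool might turn records like these back into rows of a table, here is a minimal Python sketch.  It infers its rules only from the sample above: a line starting with a semicolon is a comment, an @T line opens a new record, an @C line carries a field name followed by its value, and any other @ line is a token followed by a value.  The function names parse_dii and bates_page_count are hypothetical, and a real DII file may use tokens this sketch does not handle.

```python
# Two records copied from the sample; a raw string keeps the backslashes literal.
SAMPLE_DII = r"""; Record 1
@T 0000001
@DOCID 0000001
@MEDIA eDoc
@C ENDDOC 0000004
@C PGCOUNT 4
@C AUTHOR J. Smith
@DATESAVED 01/12/2013
@EDOC \NATIVE\Contract.docx

; Record 2
@T 0000005
@DOCID 0000005
@MEDIA eDoc
@C ENDDOC 0000005
@C PGCOUNT 1
@C AUTHOR R. Jones
@DATESAVED 02/03/2013
@EDOC \NATIVE\Memo.docx
"""


def parse_dii(text):
    """Return one dict per record, keyed by token or coded-field name."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments and anything that is not a token line.
        if not line or line.startswith(";") or not line.startswith("@"):
            continue
        token, _, rest = line[1:].partition(" ")
        if token == "T":
            # Assumption drawn from the sample: @T begins a new record.
            current = {"T": rest.strip()}
            records.append(current)
        elif current is None:
            continue  # token line before any @T; ignored in this sketch
        elif token == "C":
            # Coded field: first word is the field name, the rest its value.
            name, _, value = rest.partition(" ")
            current[name] = value.strip()
        else:
            current[token] = rest.strip()
    return records


def bates_page_count(record):
    """Pages spanned, assuming one sequential Bates number per page."""
    return int(record["ENDDOC"]) - int(record["DOCID"]) + 1


records = parse_dii(SAMPLE_DII)
for record in records:
    # Cross-check the stated page count against the Bates range.
    assert bates_page_count(record) == int(record["PGCOUNT"])
```

The cross-check at the end is the sort of cheap sanity test worth running before loading anything: the beginning and ending numbers should account for exactly the stated number of pages.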

Just as placing data in the wrong row or column of a table renders the table unreliable and potentially unusable, errors in a load file render it unreliable and any database it populates potentially unusable.  Just a single absent, misplaced or malformed delimiter can result in numerous data fields being incorrectly populated.  Load files have always been an irritant and a hazard, but the upside was that they supplied a measure of searchability to unsearchable paper documents.
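
To make the hazard concrete, the following Python sketch drops a single comma from the comma-delimited sample and shows both the damage and a basic check that would catch it.  The function name check_load_file and the two checks it performs (field count against the header, and an ending number no lower than the beginning number) are illustrative assumptions, not a standard validation routine.

```python
import csv
import io

GOOD = """BEGDOC,ENDDOC,FILENAME,MODDATE,AUTHOR,DOCTYPE
0000001,0000004,Contract,01/12/2013,J. Smith,docx
0000005,0000005,Memo,02/03/2013,R. Jones,docx
"""

# The same data with the comma after "Memo" missing.
BAD = GOOD.replace("Memo,02/03/2013", "Memo02/03/2013")


def check_load_file(text, delimiter=","):
    """Return a list of (line_number, problem) tuples for a delimited file."""
    problems = []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    for row in reader:
        line_no = reader.line_num
        if len(row) != len(header):
            problems.append(
                (line_no, f"expected {len(header)} fields, found {len(row)}")
            )
            continue
        record = dict(zip(header, row))
        beg, end = record["BEGDOC"], record["ENDDOC"]
        if not (beg.isdigit() and end.isdigit()):
            problems.append((line_no, "BEGDOC or ENDDOC is not numeric"))
        elif int(end) < int(beg):
            problems.append((line_no, "ENDDOC precedes BEGDOC"))
    return problems


# What a naive loader would store for the damaged row: every value after
# the missing comma lands one column to the left.
header, _, bad_row = csv.reader(io.StringIO(BAD))
shifted = dict(zip(header, bad_row))
# shifted["MODDATE"] is now "R. Jones" and shifted["AUTHOR"] is "docx"

good_problems = check_load_file(GOOD)  # no problems found
bad_problems = check_load_file(BAD)    # flags line 3 for its field count
```

A field-count check only catches delimiters that go missing or multiply.  A delimiter that lands in the wrong place while the count stays the same needs field-level checks, such as confirming that date fields actually hold dates.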

Fast forward to a post-personal computer, post-Internet era. 

The overwhelming majority of documents and communications are created and stored electronically, and only the tiniest fraction of these will ever be printed.  Electronic documents are inherently searchable and do things that paper documents can’t, like dynamically apply formulas to numbers (spreadsheets), animate text and images (presentations) or carry messages and tracked changes made visible or invisible at will (word processed documents).  Electronic documents also have complements of information within and without called metadata that tend to be lost when electronic documents are printed or imaged.  Some of this metadata has evidentiary value (e.g., date and time information) and some has organizational value (e.g., file names).

Because electronic documents are inherently electronically searchable, there’s no need to image them or use optical character recognition to extract searchable text.  Moreover, there’s less need for error-prone load files to populate review tools.  Despite these advantages, many lawyers prefer to approach electronic documents in the same way they handled paper documents.  That is, they convert searchable electronic documents to non-searchable, non-functional TIFF images and then attempt to graft on electronic searchability by extracting text and metadata to load files.

So, why is an old, error-prone method of data transfer still used in electronic discovery?  Good question, because it’s not cheaper, and it’s certainly not better.  Mostly, it’s just familiar, and those who use it have a sunk cost in outmoded tools and techniques.  Why do some people still use thermal fax paper (for that matter, why do they still use fax machines)?

To be fair, there’s a lingering need for load files in e-discovery, too.  Even native electronic documents have outside-the-file or “system” metadata that must be loaded into review tools; plus, there’s still a need to keep track of such things as the original monikers of renamed native files and the layout of the production set on the production media.  In e-discovery, load files, and the headaches they bring, will be with us for a while; understanding load files helps ease the pain.


[1] ASCII is an acronym for American Standard Code for Information Interchange and describes one of the oldest and simplest standardized ways to use numbers (particularly binary numbers expressed as ones and zeroes) to denote a basic set of English language alphanumeric and punctuation characters.
