The Light is Better Over Here - A Look at Predictive Coding
A man is looking for his lost keys under a street light. A cop walking by bends down to help. After a few minutes with no luck, he asks the man if he is sure about where he dropped them, and the man replies,
“I’m pretty sure I dropped them in the grass over there.”
“Then why aren’t you looking where you think you dropped them?”
“Because the light is better here.”
Far too many people are still looking for documents "where the light is better" instead of where they actually are. Why? Because as much as we would like to think that we (experts, attorneys, and paralegals) are unmatched in our investigative prowess, our ability to construct the "killer" Boolean query, and our ability to assess documents at warp speed, the reality is something else.
In today’s e-discovery environment, there is just too much information. One of my favorite t-shirts has a picture of an old man and says “When I Was Your Age a Gigabyte Was A Lot!” Gigabytes have rolled into terabytes, which have rolled into petabytes. Combine lots of inexpensive storage, the speed of today’s processors, and the wide-open bandwidth of today’s internet, and you can see why the definition of a “monster” case has evolved as well. Matters that might have had 50,000 documents for review 25 years ago have more than a million today. Headline cases can easily fly past the 100,000,000-document mark. As much as we’d like to, there are simply not enough people, hours, or dollars to do a manual review. Yet, like our friend above, we spend all of our time under the street light creating keyword lists, Boolean searches, and linear review batches in the vain hope that we’ll find our keys.
From Spreadsheets to OCR to TAR
Enter technology. First came document “coding” - human data entry of bibliographic information, summaries, and issue codes into a table. Next came Optical Character Recognition (OCR), with its much-heralded promise to eliminate the need for expensive coding. “You’ll be able to find whatever you need simply by searching the text!” was the claim. Then attorneys started using it and discovered that the quality of the OCR text was so bad that they were missing far more than they were finding - and getting burned in court for it.
I believe that much of the reluctance to use current technologies, like predictive coding and concept searching, stems from those problems with OCR in the late 90s. Many of the up-and-coming associates who got burned then are now senior partners.
And so, with a bit of detached bemusement, I watched a few years ago as technology assisted review (or TAR) sprinted to the front of the pack as the newest in “must have” technology. The claim was: review a few thousand documents and predictive coding will take care of the other 5 million. It worked - right up to the point where it didn’t.
The Case for Predictive Coding
For the right case, I am a proponent of technology-assisted review. In the following circumstances, it can be a lifesaver:
- If the size of the collection warrants its use.
- If the nature of the documents in the collection (the richness and variety of the documents) is such that the math driving this technology is allowed to work.
- If the text fed into the processing engine is “clean.” Extracted text from native files is great. OCR text from poor-quality scanned documents is a different story.
- If the attorneys and other Subject Matter Experts (SMEs) tasked with coding the initial “seed sets” the machine needs in order to learn are willing to dedicate themselves to the effort. Sample, code, learn, and repeat as necessary.
- If, once the technology is applied, we QC, QC again, and QC again.
And lastly, if everyone understands that there are limits to what predictive coding can accomplish. There will be instances where the machine misses documents that belong in a particular category (Responsive, Privileged, Non-Responsive), or places documents in the wrong category. There will be documents that “fool” the engine.
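The "sample, code, learn, and repeat" seed-set workflow above is essentially a relevance-feedback loop. Below is a minimal, hypothetical sketch of that loop in Python - the term-weighting model and the `SeedSetReviewer`/`review_round` names are my own illustration, not the proprietary machinery any real predictive-coding engine uses:

```python
import math
from collections import Counter

def tokenize(text):
    """Crude tokenizer: lowercase alphabetic words only."""
    return [w for w in text.lower().split() if w.isalpha()]

class SeedSetReviewer:
    """Toy stand-in for a predictive-coding engine: scores documents
    by smoothed log-odds that their terms appear in responsive docs."""

    def __init__(self):
        self.resp = Counter()      # term counts in responsive seed docs
        self.nonresp = Counter()   # term counts in non-responsive seed docs
        self.n_resp = 0
        self.n_nonresp = 0

    def code(self, text, responsive):
        """An SME codes one document; the model learns from it."""
        terms = set(tokenize(text))
        if responsive:
            self.resp.update(terms)
            self.n_resp += 1
        else:
            self.nonresp.update(terms)
            self.n_nonresp += 1

    def score(self, text):
        """Higher score = more likely responsive (Laplace-smoothed log-odds)."""
        total = 0.0
        for t in set(tokenize(text)):
            p = (self.resp[t] + 1) / (self.n_resp + 2)
            q = (self.nonresp[t] + 1) / (self.n_nonresp + 2)
            total += math.log(p / q)
        return total

def review_round(model, unreviewed, oracle, batch=2):
    """Sample, code, learn: pull the top-scoring unreviewed documents,
    have the SME (here, the `oracle` function) code them, and retrain."""
    ranked = sorted(unreviewed, key=model.score, reverse=True)
    for doc in ranked[:batch]:
        model.code(doc, oracle(doc))
        unreviewed.remove(doc)
```

After a few rounds of `review_round`, the model's scores can rank the remaining collection, and QC sampling (the "QC, QC again" step above) checks how often the ranking is wrong - which, as noted, it sometimes will be.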
Does this mean that we should abandon technology-assisted review? No. Humans are no better - and, in fact, often worse. After all, we’ve been quoting the Blair and Maron study for 29 years now, and newer studies, including the TREC Legal Track, are in progress or already published.
Want more information specific to predictive coding? For a detailed and comprehensive study, I recommend that you take a look at Maura R. Grossman and Gordon V. Cormack’s paper in the April 2011 issue of the Richmond Journal of Law and Technology.