Inventus Blog


How to Choose Your eDiscovery Bracket

Posted 03/31/16 5:27 PM by Megan Rowland

Recently a co-worker asked me, “How do you pick the teams that will move forward in your March Madness bracket?” I replied, “A lot of it is good guessing mixed with strategy based on prior performance.” I realized this approach could also be applied to data culling in eDiscovery.

One never truly knows what a data set contains until the analysis begins. For instance, what if the custodian (i.e., the person from whom the data originated) used a work computer for both business and personal purposes? What if a considerable number of the documents were scanned and yield poor optical character recognition (OCR) results? For documents with poor OCR, relying on search terms to cull a data set may not capture all relevant information, because the extracted text may be missing characters or entire words. Before we get too far ahead of ourselves, let us back up and review some of the preliminary culling steps that best narrow a data set.

How does one cull down to the most complete and, hopefully, least cumbersome data set? One of the most valuable early decisions is deduplication. In general, deduplication is “a method of replacing multiple identical copies of a [document] by a single instance of that [document].” There are two options to consider: global deduplication and within-custodian deduplication. Global deduplication removes all duplicates across a data set at the project level, whereas within-custodian deduplication removes duplicates only within each custodian’s data. For example, if three custodians were on the same email and each copy was collected, processing without deduplication would leave three instances of that email, which would add processing charges and increase the time and cost of review. With within-custodian deduplication, all three instances would still remain, one per custodian. With global deduplication, only one instance would reach the review set. However, many clients note that it is very important to know which custodians had access to a certain document or possibly saw a certain email. In that case, within-custodian deduplication could be the better choice, especially if the parties agreed at the initial “meet-and-confer” contemplated by Federal Rule of Civil Procedure 26(f) to process and/or produce by custodian.
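The difference between the two options can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Doc` record, the `dedupe` function, and the use of a SHA-256 hash of the text as the duplicate test are all hypothetical, and real processing tools decide what counts as a duplicate in their own ways.

```python
import hashlib
from collections import namedtuple

# Hypothetical record layout for illustration; not any processing tool's schema.
Doc = namedtuple("Doc", ["custodian", "source_path", "content"])

def content_hash(doc):
    """Fingerprint a document (assumption: identical text means duplicate)."""
    return hashlib.sha256(doc.content.encode("utf-8")).hexdigest()

def dedupe(docs, scope="global"):
    """Keep the first instance of each document within the chosen scope.

    scope="global"    -> one instance per project
    scope="custodian" -> one instance per custodian
    """
    seen = set()
    kept = []
    for doc in docs:
        h = content_hash(doc)
        # Within-custodian deduplication only compares a document
        # against other documents from the same custodian.
        key = h if scope == "global" else (doc.custodian, h)
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

# The same email collected from three custodians, plus a second copy for Alice.
docs = [
    Doc("Alice", "Alice/Inbox/budget.msg", "Q3 budget attached"),
    Doc("Bob", "Bob/Inbox/budget.msg", "Q3 budget attached"),
    Doc("Carol", "Carol/Inbox/budget.msg", "Q3 budget attached"),
    Doc("Alice", "Alice/Deleted Items/budget.msg", "Q3 budget attached"),
]

global_set = dedupe(docs, scope="global")
custodian_set = dedupe(docs, scope="custodian")
print(len(global_set), len(custodian_set))
```

In this made-up example, global deduplication sends one instance to review, while within-custodian deduplication sends three, one for each custodian.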

Many clients ask, “If global deduplication is performed, can we still determine which document is associated with which custodian and, furthermore, how many instances of the document existed in the data set?” The simple answer is yes, depending on the processing tool used. Inventus uses LexisNexis LAW, which can populate deduplication fields that preserve this information without sending multiple instances of the same document to the review phase.

Two fields have been especially helpful to me when working with data that was deduplicated, whether globally or within custodian. The first, commonly known as the All Custodians field, lists every custodian whose data originally contained the document. The second, sometimes referred to as the All Source Paths field, lists the source path for each instance of the document. This field is helpful not only for a document found in multiple custodians’ data but also for a document found multiple times for the same custodian. For example, an email could be saved to a folder with “Important” in its name as well as in a custodian’s deleted items, which may provide significant context for the document. The All Source Paths field can also help with a document whose subject matter is cryptic: one custodian may have saved it in a folder titled with a topic relevant to the case, while for another custodian it appears only in the inbox. Overall, global deduplication does not mean you lose the ability to analyze a document at the custodian level or to know which custodians had access to it.
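As a rough sketch of how such fields could be assembled during global deduplication, the Python below collapses every collected instance into one record per unique document while accumulating the custodians and source paths. The function name, the field names, and the SHA-256 duplicate test are assumptions made for illustration; they do not describe how LAW or any other tool builds these fields.

```python
import hashlib

def build_dedupe_fields(instances):
    """Collapse instances into one record per unique document.

    instances: iterable of (custodian, source_path, content) tuples.
    Assumption: identical text (same SHA-256 hash) means duplicate.
    """
    records = {}
    for custodian, source_path, content in instances:
        key = hashlib.sha256(content.encode("utf-8")).hexdigest()
        rec = records.setdefault(
            key,
            {"content": content, "all_custodians": [], "all_source_paths": []},
        )
        # List each custodian once, but keep every source path,
        # including repeats for the same custodian.
        if custodian not in rec["all_custodians"]:
            rec["all_custodians"].append(custodian)
        rec["all_source_paths"].append(source_path)
    return list(records.values())

# Hypothetical instances of one email collected from two custodians.
instances = [
    ("Alice", "Alice/Important/merger.msg", "Call me about the merger"),
    ("Alice", "Alice/Deleted Items/merger.msg", "Call me about the merger"),
    ("Bob", "Bob/Inbox/merger.msg", "Call me about the merger"),
]

records = build_dedupe_fields(instances)
for rec in records:
    print("; ".join(rec["all_custodians"]))
    print("; ".join(rec["all_source_paths"]))
```

Only one record reaches review here, yet it still shows that both custodians held the email and that one of them kept copies in two different folders.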

Furthermore, even if there is a requirement to produce by custodian, one can use the All Custodians field to make sure that requirement is met. Moreover, it is now even more defensible to deduplicate globally under the amended FRCP 26(b)(1), which was recently updated as follows:

Unless otherwise limited by court order, the scope of discovery is as follows: Parties may obtain discovery regarding any nonprivileged matter that is relevant to any party’s claim or defense and proportional to the needs of the case, considering the importance of the issues at stake in the action, the amount in controversy, the parties’ relative access to relevant information, the parties’ resources, the importance of the discovery in resolving the issues, and whether the burden or expense of the proposed discovery outweighs its likely benefit. Information within this scope of discovery need not be admissible in evidence to be discoverable.

[Language deleted by the amendment:] including the existence, description, nature, custody, condition, and location of any documents or other tangible things and the identity and location of persons who know of any discoverable matter. For good cause, the court may order discovery of any matter relevant to the subject matter involved in the action. Relevant information need not be admissible at the trial if the discovery appears reasonably calculated to lead to the discovery of admissible evidence. All discovery is subject to the limitations imposed by Rule 26(b)(2)(C).

Essentially, the express reference to the custody of documents has been removed from the rule, and having to process and review at a custodian level may be found unreasonably burdensome because it can increase the expense associated with eDiscovery. Global deduplication also improves coding consistency: reviewing one instance of a document, rather than three instances of the same document, means a single coding decision applies for every custodian, which removes the possibility of producing inconsistently between custodians.

In the end, the right choices in your eDiscovery approach are best made at the beginning, with as much foresight as possible about how they will affect the end result. There will always be some educated guessing at the start, but if those guesses are based on past eDiscovery experience and fall within the agreed-upon eDiscovery protocols, you will be in a good place once you start reviewing data that has been deduplicated down to a more reasonable review set. Data culling does not end here, but it is certainly a great place to start.


Maura R. Grossman and Gordon V. Cormack, EDRM page and The Grossman-Cormack Glossary of Technology-Assisted Review, with Foreword by John M. Facciola, U.S. Magistrate Judge, 2013 Fed. Cts. L. Rev. 7 (January 2013), page 14.




Megan Rowland

About The Author

Megan has been working in the eDiscovery industry since 2010. Prior to becoming a Project Manager at Inventus, she worked at Huron Consulting Group as a Discovery Services Consultant. Megan studied at DePaul University College of Law and is admitted to the Illinois State Bar.

