File Deduplication Software Tips for Ediscovery Document Collections

Nextpoint, Inc.

Big Data is making life increasingly complicated for lawyers. In ediscovery, a big part of the challenge is simply eliminating duplicate copies of emails, Word documents, spreadsheets, and other files, along with their metadata, which often appear in duplicate, triplicate, or far greater numbers in any collection. File deduplication software is the answer.

File Deduplication

As the name suggests, file deduplication removes exact copies of files, and sometimes near matches, from a data set. For near matches, deduplication typically removes files whose content replicates that of another file beyond a given percentage.
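To make the idea of a percentage threshold concrete, here is a minimal Python sketch. It is an illustration only, not how any particular product scores near matches: the function names and the 0.9 threshold are assumptions, and the standard library's difflib stands in for whatever similarity measure a real tool uses.

```python
from difflib import SequenceMatcher


def similarity(text_a: str, text_b: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, text_a, text_b).ratio()


def is_near_duplicate(text_a: str, text_b: str, threshold: float = 0.9) -> bool:
    """Treat two texts as near duplicates when their similarity meets the threshold.

    The 0.9 default is a hypothetical value chosen for illustration.
    """
    return similarity(text_a, text_b) >= threshold


if __name__ == "__main__":
    original = "The quarterly report is attached."
    variant = "The quarterly report is attached!"
    print(is_near_duplicate(original, variant))
    print(is_near_duplicate(original, "Lunch on Friday?"))
```

Raising the threshold keeps more files in the collection; lowering it removes more aggressively and increases the risk of dropping a document that differs in a way that matters.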

File Deduplication Software

Fortunately, there are a couple of basic tools that eliminate unnecessary copies of files. However, each of these tools has its own purposes and limitations, which affect how well it detects and removes unnecessary files.

Nextpoint provides deduping technology for eliminating redundant copies of documents. When you open a new Nextpoint case instance, file deduplication is turned on by default.

Deduplication settings in the application allow your data to be deduped by MD5 hash value. Documents with the same content hash are always considered an Exact Match. Container files, such as email containers (.pst, .mbox) or .zip archives, are considered duplicates only if their contained (child) files are also duplicates.
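For readers who want to see what hash-based matching looks like in practice, the following Python sketch groups files by MD5 content hash. It is not Nextpoint's implementation: the function names are hypothetical, and the container check is only one possible reading of the rule, in which two containers match when their children have the same set of hashes.

```python
import hashlib
from collections import defaultdict


def md5_hex(data: bytes) -> str:
    """Return the MD5 content hash of a file's bytes as a hex string."""
    return hashlib.md5(data).hexdigest()


def group_exact_matches(docs: dict[str, bytes]) -> list[list[str]]:
    """Group document names that share the same content hash.

    Only groups with more than one member (actual duplicates) are returned.
    """
    groups = defaultdict(list)
    for name, data in docs.items():
        groups[md5_hex(data)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def containers_are_duplicates(children_a: list[bytes], children_b: list[bytes]) -> bool:
    """Assumed container rule: duplicates only if the child hashes all match."""
    hashes_a = sorted(md5_hex(child) for child in children_a)
    hashes_b = sorted(md5_hex(child) for child in children_b)
    return hashes_a == hashes_b


if __name__ == "__main__":
    collection = {"a.docx": b"hello", "b.docx": b"hello", "c.docx": b"world"}
    print(group_exact_matches(collection))
```

Because the hash is computed from content alone, two files with different names or locations still match, while a single changed byte produces a different hash.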

You can also choose an additional level of deduplication by adding “Email-Message-ID” to your dedupe criteria. When this component is added, documents/emails with the same “Email-Message-ID” are also considered to be an Exact Match, even if their content hashes and headers do not exactly match. You can review how file deduplication software settings work in Nextpoint.
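The effect of adding the message ID as a second criterion can be sketched as a simple comparison. This is a hypothetical illustration: the dictionary keys "md5" and "message_id" and the use_message_id flag are names invented here, not Nextpoint field names or settings.

```python
def is_exact_match(doc_a: dict, doc_b: dict, use_message_id: bool = False) -> bool:
    """Decide whether two documents count as an exact match.

    A matching content hash is always enough. When use_message_id is on,
    a shared (non-empty) message ID is also enough, even if hashes differ.
    """
    if doc_a["md5"] == doc_b["md5"]:
        return True
    if use_message_id:
        message_id = doc_a.get("message_id")
        return message_id is not None and message_id == doc_b.get("message_id")
    return False


if __name__ == "__main__":
    # Same email collected from two mailboxes; the hashes differ, the ID does not.
    sent_copy = {"md5": "aaa111", "message_id": "<123@example.com>"}
    received_copy = {"md5": "bbb222", "message_id": "<123@example.com>"}
    print(is_exact_match(sent_copy, received_copy))
    print(is_exact_match(sent_copy, received_copy, use_message_id=True))
```

The extra criterion helps when the same message was collected from both the sender's and a recipient's mailbox, where small header differences can change the content hash.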

Deduplication Is Unique in Every Case

But one common question we hear is, “Why do I still have duplicate copies in my document collection if I already deduped?”

The answer is that a single document is often introduced into a document collection multiple ways. Different people will attach the same document to an email and send it to different recipients.

Once those separate emails are collected and introduced into a document collection, the attached document is also captured along with each email. Those emails are kept in different families, and deduplication will not eliminate copies in different families. That is because most customers need to keep duplicates in a collection to establish how the content was distributed and who may have been privy to the information.
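A short Python sketch shows why the attachment survives. It assumes, purely for illustration, that each family is a parent email plus its attachments and that a family is dropped only when the whole family matches one already kept; the function and key names are hypothetical.

```python
def dedupe_families(families: list[dict]) -> list[dict]:
    """Keep the first copy of each distinct family.

    Each family is a dict with "parent_md5" (the email's hash) and
    "attachment_md5s" (hashes of its attachments). A family is removed
    only if both the parent and all attachments match a kept family.
    """
    seen = set()
    kept = []
    for family in families:
        key = (family["parent_md5"], tuple(sorted(family["attachment_md5s"])))
        if key in seen:
            continue
        seen.add(key)
        kept.append(family)
    return kept


def count_attachment_copies(families: list[dict], attachment_md5: str) -> int:
    """Count how many kept families still carry a given attachment."""
    return sum(family["attachment_md5s"].count(attachment_md5) for family in families)


if __name__ == "__main__":
    families = [
        {"parent_md5": "email-1", "attachment_md5s": ["contract"]},
        {"parent_md5": "email-2", "attachment_md5s": ["contract"]},
        {"parent_md5": "email-1", "attachment_md5s": ["contract"]},  # true duplicate
    ]
    kept = dedupe_families(families)
    print(len(kept), count_attachment_copies(kept, "contract"))
```

In this sketch the repeated family is removed, but the same attachment remains under both distinct emails, which preserves the record of who sent it to whom.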

The Importance of Metadata

Metadata is also used to either reject duplicates (in which case they are “near duplicates”) or merge them into a single copy. An example of merged metadata would be when two identical emails have been collected from different sources. Since the documents are exact duplicates, the values for coding fields such as “Email Datetime,” “Mailbox Path” and “Batch” will be concatenated and merged into a single document.
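Merging can be pictured as combining the coding fields of two exact duplicates into one record. The sketch below is an assumption-laden illustration: the "; " separator, the choice to skip values that are already present, and the function name are all invented here, and the sample values are hypothetical.

```python
def merge_metadata(first: dict[str, str], second: dict[str, str], sep: str = "; ") -> dict[str, str]:
    """Merge the coding fields of two exact duplicates into one record.

    Differing values for the same field are concatenated with `sep`;
    identical values are kept once (an assumption made for this sketch).
    """
    merged = dict(first)
    for field, value in second.items():
        if field not in merged:
            merged[field] = value
        elif value not in merged[field].split(sep):
            merged[field] = merged[field] + sep + value
    return merged


if __name__ == "__main__":
    copy_a = {"Mailbox Path": "/Inbox", "Batch": "1"}
    copy_b = {"Mailbox Path": "/Sent Items", "Batch": "1"}
    print(merge_metadata(copy_a, copy_b))
```

The single surviving document then carries the history of every place the email was found.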

An example of metadata rejecting a duplicate would be where Document 1a was coded with a value of “John’s Desktop” in a field called “source” and an identical Document 1b was imported. Because of the different values in the “source” coding field, you would end up with two unique documents. Within Nextpoint, reviewers can see in the sidebar how many copies of a document exist in a collection.
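One way to picture this rejection is as a dedupe key that includes the coding field alongside the hash. The sketch below is hypothetical: the key layout, the blocking_fields parameter, and the second source value are assumptions for illustration, not documented Nextpoint behavior.

```python
from collections import Counter


def dedupe_key(doc: dict, blocking_fields: tuple = ("source",)) -> tuple:
    """Build a key from the content hash plus any fields that block a merge."""
    return (doc["md5"],) + tuple(doc.get(field) for field in blocking_fields)


def unique_documents(docs: list[dict], blocking_fields: tuple = ("source",)) -> list[dict]:
    """Keep the first document seen for each distinct key."""
    seen = {}
    for doc in docs:
        seen.setdefault(dedupe_key(doc, blocking_fields), doc)
    return list(seen.values())


def copies_by_hash(docs: list[dict]) -> Counter:
    """Count documents per content hash, similar in spirit to a copy count."""
    return Counter(doc["md5"] for doc in docs)


if __name__ == "__main__":
    doc_1a = {"id": "1a", "md5": "abc", "source": "John's Desktop"}
    doc_1b = {"id": "1b", "md5": "abc", "source": "Mary's Laptop"}  # hypothetical value
    print(len(unique_documents([doc_1a, doc_1b])))
    print(copies_by_hash([doc_1a, doc_1b])["abc"])
```

With the field in the key, both documents survive as unique; counting by hash alone still shows that two copies of the same content exist.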

In this example, the documents would be linked as “Documents with matching MD5 hashes” in the Related Documents section. Once a document is marked as responsive, non-responsive, privileged, or otherwise, reviewers will know that it has been reviewed. Beyond that, there is no need to keep large numbers of copies.

Make Mine Custom Please

If the standard deduplication settings do not quite fit your particular document review needs, Nextpoint offers custom deduplication services as well. Document populations can be deduped by custodian, or across an entire population, and rules can be implemented to determine how to handle duplicate documents with different coding. That can include criteria like whether to merge metadata by removing one of the duplicates or to leave them separate.
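The difference between deduping by custodian and deduping across an entire population comes down to what goes into the comparison key. The Python sketch below is a hypothetical illustration; the scope names "global" and "custodian" and the dictionary keys are assumptions, not actual service options.

```python
def dedupe(docs: list[dict], scope: str = "global") -> list[dict]:
    """Remove duplicates either across the whole population or per custodian.

    scope="global": one copy per content hash in the entire collection.
    scope="custodian": one copy per content hash for each custodian.
    """
    if scope not in ("global", "custodian"):
        raise ValueError(f"unknown scope: {scope!r}")
    seen = set()
    kept = []
    for doc in docs:
        key = doc["md5"] if scope == "global" else (doc["custodian"], doc["md5"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(doc)
    return kept


if __name__ == "__main__":
    docs = [
        {"id": 1, "custodian": "A", "md5": "abc"},
        {"id": 2, "custodian": "B", "md5": "abc"},
        {"id": 3, "custodian": "A", "md5": "abc"},
    ]
    print([d["id"] for d in dedupe(docs, "global")])
    print([d["id"] for d in dedupe(docs, "custodian")])
```

Custodian-level deduplication keeps more documents but preserves the fact that each custodian held a copy, which can matter when possession is at issue.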

Deduplication is often one of the first and most important steps in narrowing a document collection. Sadly, many document reviewers are enamored of expensive ‘new’ technologies for winnowing large data sets and often forget the simple but effective deduping tools available.

Taking advantage of file deduplication software (like Nextpoint) will help eliminate redundant copies of documents and streamline your review process.

Written by:

Nextpoint, Inc.