Complex Document Information Processing (CDIP) dataset

Metadata Updated: September 30, 2025

This dataset is called the "IIT CDIP collection". "CDIP" stands for "Complex Document Information Processing" and "IIT" stands for "Illinois Institute of Technology", which originally built the dataset. The dataset consists of documents from the states' lawsuit against the tobacco industry in the 1990s. As a result of the settlement of that lawsuit (the "Master Settlement Agreement"), the companies had to make all the documents public in an archive, which currently resides at UCSF, the University of California, San Francisco. IIT used this data to build a dataset of "messy" documents that were challenging for existing systems to process: the documents carry handwriting, stains, and similar artifacts. TREC used an automatically converted text version of this dataset in the TREC Legal Track. The dataset consists of around 7 million documents, preprocessed with 1990s-era OCR, together with the original page scans in TIFF format. See the contact information in this record for access to this dataset.

Access & Use Information

Public: This dataset is intended for public access and use. License: See this page for license information.

Dates

Metadata Created Date September 30, 2025
Metadata Updated Date September 30, 2025
Data Update Frequency irregular

Metadata Source

Harvested from Commerce Non Spatial Data.json Harvest Source

Additional Metadata

Resource Type Dataset
Publisher National Institute of Standards and Technology
Maintainer
Identifier ark:/88434/mds2-2531
Language en
Data Last Modified 1996-01-01 00:00:00
Category Information Technology:Data and informatics
Public Access Level public
Bureau Code 006:55
Metadata Context https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld
Schema Version https://project-open-data.cio.gov/v1.1/schema
Catalog Describedby https://project-open-data.cio.gov/v1.1/schema/catalog.json
Harvest Object Id d9f6d41a-07de-4f51-a478-a2d3e720b69a
Harvest Source Id bce99b55-29c1-47be-b214-b8e71e9180b1
Harvest Source Title Commerce Non Spatial Data.json Harvest Source
Homepage URL https://data.nist.gov/od/id/mds2-2531
License https://www.nist.gov/open/license
Program Code 006:045
Source Datajson Identifier True
Source Hash a4a3420e2fe32968ba2d126a0a83be9829b75cd3b15f7d445fc87009963f12dc
Source Schema Version 1.1