Sparse Machine Learning Methods for Understanding Large Text Corpora

Metadata Updated: July 17, 2020

Sparse machine learning has recently emerged as powerful tool to obtain models of high-dimensional data with high degree of interpretability, at low computational cost. This paper posits that these methods can be extremely useful for understanding large collections of text documents, without requiring user expertise in machine learning. Our approach relies on three main ingredients: (a) multi-document text summarization and (b) comparative summarization of two corpora, both using parse regression or classifi cation; (c) sparse principal components and sparse graphical models for unsupervised analysis and visualization of large text corpora. We validate our approach using a corpus of Aviation Safety Reporting System (ASRS) reports and demonstrate that the methods can reveal causal and contributing factors in runway incursions. Furthermore, we show that the methods automatically discover four main tasks that pilots perform during flight, which can aid in further understanding the causal and contributing factors to runway incursions and other drivers for aviation safety incidents.

Citation: L. El Ghaoui, G. C. Li, V. Duong, V. Pham, A. N. Srivastava, and K. Bhaduri, “Sparse Machine Learning Methods for Understanding Large Text Corpora,” Proceedings of the Conference on Intelligent Data Understanding, 2011.

Access & Use Information

Public: This dataset is intended for public access and use. License: No license information was provided. If this work was prepared by an officer or employee of the United States government as part of that person's official duties it is considered a U.S. Government Work.

Downloads & Resources

Dates

Metadata Created Date August 1, 2018
Metadata Updated Date July 17, 2020
Data Update Frequency irregular

Metadata Source

Harvested from NASA Data.json

Additional Metadata

Resource Type Dataset
Metadata Created Date August 1, 2018
Metadata Updated Date July 17, 2020
Publisher Dashlink
Unique Identifier DASHLINK_513
Maintainer
Ashok Srivastava
Maintainer Email
Public Access Level public
Data Update Frequency irregular
Bureau Code 026:00
Metadata Context https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld
Metadata Catalog ID https://data.nasa.gov/data.json
Schema Version https://project-open-data.cio.gov/v1.1/schema
Catalog Describedby https://project-open-data.cio.gov/v1.1/schema/catalog.json
Harvest Object Id dee4ddcf-acbe-452d-adca-b8531d221179
Harvest Source Id 39e4ad2a-47ca-4507-8258-852babd0fd99
Harvest Source Title NASA Data.json
Data First Published 2012-01-27
Homepage URL https://c3.nasa.gov/dashlink/resources/513/
Data Last Modified 2020-01-29
Program Code 026:029
Source Datajson Identifier True
Source Hash a7aeb4118d212d1064cb7909dc057db48b3736db
Source Schema Version 1.1

Didn't find what you're looking for? Suggest a dataset here.