Sparse Machine Learning Methods for Understanding Large Text Corpora

Published by Dashlink | National Aeronautics and Space Administration | Metadata Last Checked: January 10, 2026 | Last Modified: 2025-03-31

Sparse machine learning has recently emerged as a powerful tool to obtain models of high-dimensional data with a high degree of interpretability, at low computational cost. This paper posits that these methods can be extremely useful for understanding large collections of text documents, without requiring user expertise in machine learning. Our approach relies on three main ingredients: (a) multi-document text summarization and (b) comparative summarization of two corpora, both using sparse regression or classification; (c) sparse principal components and sparse graphical models for unsupervised analysis and visualization of large text corpora. We validate our approach using a corpus of Aviation Safety Reporting System (ASRS) reports and demonstrate that the methods can reveal causal and contributing factors in runway incursions. Furthermore, we show that the methods automatically discover four main tasks that pilots perform during flight, which can aid in further understanding the causal and contributing factors to runway incursions and other drivers for aviation safety incidents. Citation: L. El Ghaoui, G. C. Li, V. Duong, V. Pham, A. N. Srivastava, and K. Bhaduri, “Sparse Machine Learning Methods for Understanding Large Text Corpora,” Proceedings of the Conference on Intelligent Data Understanding, 2011.
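As a rough illustration of ingredient (b), comparative summarization of two corpora through sparse regression, the sketch below fits an L1-penalized least-squares model (a LASSO, solved here by plain coordinate descent) that tries to predict which corpus a document came from using word presence alone. It is not the authors' implementation. The six one-line reports are invented for the example and are not ASRS data; the penalty value `lam` and the helper names `soft_threshold` and `lasso_coordinate_descent` are assumptions made for illustration.

```python
# Toy sketch: comparative summarization of two corpora via sparse regression.
# Assumptions: invented reports (NOT ASRS data), binary word-presence features,
# squared loss with an L1 penalty, arbitrary penalty value lam.

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exactly zero inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, sweeps=100):
    """Minimize (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # residual = y - Xw, and w starts at zero
    for _ in range(sweeps):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            z = sum(c * c for c in col) / n
            if z == 0.0:
                continue
            # Correlation of column j with the residual that excludes word j.
            rho = sum(col[i] * (residual[i] + col[i] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, lam) / z
            delta = new_w - w[j]
            for i in range(n):
                residual[i] -= col[i] * delta
            w[j] = new_w
    return w


# Hypothetical corpus A (labelled +1) and corpus B (labelled -1).
corpus_a = [
    "aircraft crossed runway without clearance",
    "pilot entered runway before clearance",
    "vehicle on runway during taxi",
]
corpus_b = [
    "aircraft encountered turbulence at cruise",
    "pilot reported turbulence and icing",
    "severe turbulence during descent",
]

docs = [d.split() for d in corpus_a + corpus_b]
labels = [1.0] * len(corpus_a) + [-1.0] * len(corpus_b)
vocab = sorted({word for doc in docs for word in doc})
X = [[1.0 if word in doc else 0.0 for word in vocab] for doc in docs]

weights = lasso_coordinate_descent(X, labels, lam=0.1)

# The "summary" is the short list of words with nonzero weight:
# positive weights point to corpus A, negative weights to corpus B.
selected = {word: w for word, w in zip(vocab, weights) if w != 0.0}
for word, w in sorted(selected.items(), key=lambda item: -item[1]):
    print(f"{word:12s} {w:+.3f}")
```

On this toy input the penalty drives every weight to zero except those of "runway" (positive) and "turbulence" (negative), so the two corpora are each summarized by a single distinguishing word. Raising `lam` gives shorter word lists and lowering it gives longer ones, which is the interpretability trade-off the abstract alludes to.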

Resources

1 resource available

cidu2011-dashlink.pdf

PDF

Complete Metadata

@type	dcat:Dataset
accessLevel	public
accrualPeriodicity	irregular
bureauCode	[ "026:00" ]
contactPoint	{ "fn": "Ashok Srivastava", "@type": "vcard:Contact", "hasEmail": "mailto:ashok.n.srivastava@gmail.com" }
description	Sparse machine learning has recently emerged as a powerful tool to obtain models of high-dimensional data with a high degree of interpretability, at low computational cost. This paper posits that these methods can be extremely useful for understanding large collections of text documents, without requiring user expertise in machine learning. Our approach relies on three main ingredients: (a) multi-document text summarization and (b) comparative summarization of two corpora, both using sparse regression or classification; (c) sparse principal components and sparse graphical models for unsupervised analysis and visualization of large text corpora. We validate our approach using a corpus of Aviation Safety Reporting System (ASRS) reports and demonstrate that the methods can reveal causal and contributing factors in runway incursions. Furthermore, we show that the methods automatically discover four main tasks that pilots perform during flight, which can aid in further understanding the causal and contributing factors to runway incursions and other drivers for aviation safety incidents. Citation: L. El Ghaoui, G. C. Li, V. Duong, V. Pham, A. N. Srivastava, and K. Bhaduri, “Sparse Machine Learning Methods for Understanding Large Text Corpora,” Proceedings of the Conference on Intelligent Data Understanding, 2011.
distribution	[ { "@type": "dcat:Distribution", "title": "cidu2011-dashlink.pdf", "format": "PDF", "mediaType": "application/pdf", "description": "cidu2011-dashlink.pdf", "downloadURL": "https://c3.nasa.gov/dashlink/static/media/publication/cidu2011-dashlink.pdf" } ]
identifier	DASHLINK_513
issued	2012-01-27
keyword	[ "ames", "dashlink", "nasa" ]
landingPage	https://c3.nasa.gov/dashlink/resources/513/
modified	2025-03-31
programCode	[ "026:029" ]
publisher	{ "name": "Dashlink", "@type": "org:Organization" }
title	Sparse Machine Learning Methods for Understanding Large Text Corpora