Skip to main content
U.S. flag

An official website of the United States government

Official websites use .gov
A .gov website belongs to an official government organization in the United States.

Secure .gov websites use HTTPS
A lock ( ) or https:// means you’ve safely connected to the .gov website. Share sensitive information only on official, secure websites.

Skip to content

Judson_Mansouri_Automated_Chemical_Curation_QSAREnvRes_Data

Metadata Updated: May 2, 2021

Here we describe the development of an automated KNIME workflow to curate and correct errors in the structure and identity of chemicals using the publically available PHYSPROP physico-chemical properties and environmental fate datasets. The workflow first assembles structure-identity pairs using up to four provided chemical identifiers, including chemical name, CASRNs, SMILES, and MolBlock. Problems detected included errors and mismatches in chemical structure formats, identifiers, and various structure validation issues, including hypervalency and stereochemistry descriptions. Subsequently, a machine learning procedure was applied to evaluate the impact of this curation process. The performance of QSAR models built on only the highest quality subset of the original dataset was compared to the larger curated and corrected data set. The latter showed statistically improved predictive performance. The final workflow was used to curate the full list of PHYSPROP datasets, and is being made publically available for further usage and integration by the scientific community.

This dataset is associated with the following publication: Mansouri, K., C. Grulke, A. Richard, R. Judson, and A. Williams. (SAR AND QSAR IN ENVIRONMENTAL RESEARCH) An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modeling. SAR AND QSAR IN ENVIRONMENTAL RESEARCH. Taylor & Francis, Inc., Philadelphia, PA, USA, 27(11): 911-937, (2016).

Access & Use Information

Public: This dataset is intended for public access and use. License: See this page for license information.

Downloads & Resources

References

https://doi.org/10.1080/1062936x.2016.1253611

Dates

Metadata Created Date November 12, 2020
Metadata Updated Date May 2, 2021

Metadata Source

Harvested from EPA ScienceHub

Additional Metadata

Resource Type Dataset
Metadata Created Date November 12, 2020
Metadata Updated Date May 2, 2021
Publisher U.S. EPA Office of Research and Development (ORD)
Maintainer
Identifier https://doi.org/10.23719/1394621
Data Last Modified 2017-04-13
Public Access Level public
Bureau Code 020:00
Schema Version https://project-open-data.cio.gov/v1.1/schema
Harvest Object Id 1e02a823-9817-423a-9233-f2dd30c6c948
Harvest Source Id 04b59eaf-ae53-4066-93db-80f2ed0df446
Harvest Source Title EPA ScienceHub
License https://pasteur.epa.gov/license/sciencehub-license.html
Program Code 020:095
Publisher Hierarchy U.S. Government > U.S. Environmental Protection Agency > U.S. EPA Office of Research and Development (ORD)
Related Documents https://doi.org/10.1080/1062936x.2016.1253611
Source Datajson Identifier True
Source Hash 1b56eafbfd97657c6d71a62b239204582aea6966
Source Schema Version 1.1

Didn't find what you're looking for? Suggest a dataset here.