Full-length messenger RNA sequences greatly improve genome annotation

Published by National Institutes of Health | U.S. Department of Health & Human Services | Catalog Last Checked: September 07, 2025 at 08:14 PM | Dataset Last Updated: September 06, 2025

Background: Annotation of eukaryotic genomes is a complex endeavor that requires the integration of evidence from multiple, often contradictory, sources. With the ever-increasing amount of genome sequence data now available, methods for accurate identification of large numbers of genes have become urgently needed. In an effort to create a set of very high-quality gene models, we used the sequence of 5,000 full-length gene transcripts from Arabidopsis to re-annotate its genome. We have mapped these transcripts to their exact chromosomal locations and, using alignment programs, have created gene models that provide a reference set for this organism.

Results: Approximately 35% of the transcripts indicated that previously annotated genes needed modification, and 5% of the transcripts represented newly discovered genes. We also discovered that multiple transcription initiation sites appear to be much more common than previously known, and we report numerous cases of alternative mRNA splicing. We include a comparison of different alignment software and an analysis of how the transcript data improved the previously published annotation.

Conclusions: Our results demonstrate that sequencing of large numbers of full-length transcripts followed by computational mapping greatly improves identification of the complete exon structures of eukaryotic genes. In addition, we are able to find numerous introns in the untranslated regions of the genes.

Resources

1 resource available

Official Government Data Source

TEXT/HTML

Complete Metadata

@type	dcat:Dataset
accessLevel	public
bureauCode	[ "009:25" ]
contactPoint	{ "fn": "NIH", "@type": "vcard:Contact", "hasEmail": "mailto:info@nih.gov" }
distribution	[ { "@type": "dcat:Distribution", "title": "Official Government Data Source", "mediaType": "text/html", "description": "Visit the original government dataset for complete information, documentation, and data access.", "downloadURL": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC116726/" } ]
identifier	https://healthdata.gov/api/views/sddt-4rnt
issued	2025-07-14
keyword	[ "gene-models", "genome-annotation", "mrna-sequences", "nih", "transcription-initiation" ]
landingPage	https://healthdata.gov/d/sddt-4rnt
modified	2025-09-06
programCode	[ "009:033" ]
publisher	{ "name": "National Institutes of Health", "@type": "org:Organization" }
theme	[ "NIH" ]
title	Full-length messenger RNA sequences greatly improve genome annotation
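Several of the metadata values above (distribution, keyword, contactPoint, publisher) are JSON fragments, so they can be read with a standard JSON parser. The following is a minimal sketch, not part of the catalog itself: the two strings are copied from the distribution and keyword rows, and the helper name download_urls is hypothetical, introduced only for illustration.

```python
import json

# Values copied verbatim from the "distribution" and "keyword" rows above.
distribution_field = (
    '[ { "@type": "dcat:Distribution", '
    '"title": "Official Government Data Source", '
    '"mediaType": "text/html", '
    '"description": "Visit the original government dataset for complete '
    'information, documentation, and data access.", '
    '"downloadURL": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC116726/" } ]'
)
keyword_field = (
    '[ "gene-models", "genome-annotation", "mrna-sequences", '
    '"nih", "transcription-initiation" ]'
)


def download_urls(field, media_type=None):
    """Hypothetical helper: list the downloadURL of each distribution entry.

    If media_type is given, only entries whose mediaType matches are kept.
    """
    entries = json.loads(field)
    return [
        entry["downloadURL"]
        for entry in entries
        if "downloadURL" in entry
        and (media_type is None or entry.get("mediaType") == media_type)
    ]


urls = download_urls(distribution_field, media_type="text/html")
keywords = json.loads(keyword_field)

print(urls)
print(keywords)
```

For this record the only distribution entry is the text/html link to the article, so filtering on any other media type yields an empty list.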
