Data from polishCLR: Example input genome assemblies

Metadata Updated: March 30, 2024

[ NOTE - Data files added 2022-11-01:

Test long reads - test.1.filtered.bam_.gz
Test short reads R1 - testpolish_R1.fastq
Test short reads R2 - testpolish_R2.fastq
Chromosome 30 of H. zea - GCF_022581195.2_ilHelZeax1.1_chr30.fasta ]

To produce the best possible de novo, chromosome-scale genome assembly from error-prone Pacific Biosciences (PacBio) continuous long reads (CLR), we developed polishCLR, a publicly available, flexible, and reproducible workflow that is containerized so it can be run on any conventional HPC system. This dataset provides example input primary contig assemblies for testing the workflow and reproducing its demonstrated utility.

The polishCLR workflow can be initiated from three input cases:
Case 1: An unresolved primary assembly with associated contigs, the output of FALCON 2-asm: p_ctg.fasta and a_ctg.fasta
Case 2: A haplotype-resolved but unpolished set, the output of FALCON-Unzip 3-unzip: all_p_ctg.fasta and all_h_ctg.fasta
Case 3: A haplotype-resolved, CLR long-read, Arrow-polished set of primary and alternate contigs, the output of FALCON-Unzip 4-polish: cns_p_ctg.fasta and cns_h_ctg.fasta

These example data are the input contig assemblies for the pest Helicoverpa zea. The contigs were built from 49.89 Gb of raw PacBio CLR data generated from a single male of the H. zea strain HzStark_Cry1AcR.

Adult H. zea were collected near the USDA-ARS Genetics and Sustainability Agricultural Research Unit, Starkville, MS, USA in 2011, then transported to and maintained in a colony at the USDA Southern Insect Management Research Unit (SIMRU), Stoneville, MS, USA as described previously. Larvae were selected on a diagnostic dose of 2.0 μg ml⁻¹ purified Cry1Ac, and the survivors were used to create the strain HzStark_Cry1AcR. HzStark_Cry1AcR was back-crossed every 5 generations to a susceptible line maintained at USDA-ARS SIMRU. A single male pupa (homogametic, ZZ sex chromosomes) from HzStark_Cry1AcR was dissected laterally into eight ~20 μg sections. High molecular weight DNA was extracted.
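
The three input cases above differ only in which pair of contig files is available, so a directory of downloads can be classified by file name. The following is a minimal sketch under that assumption; the function name `detect_input_cases` and the `INPUT_CASES` mapping are hypothetical helpers for illustration and are not part of polishCLR itself. Some files in the resource list carry slightly different names (for example a_ctg_all.fasta and cns_pctg.fasta), so the tuples may need adjusting to match the files actually downloaded.

```python
from pathlib import Path

# File names come from the dataset description. The mapping is an
# illustrative assumption, not polishCLR's own input-detection logic.
INPUT_CASES = {
    "case1": ("p_ctg.fasta", "a_ctg.fasta"),          # FALCON 2-asm
    "case2": ("all_p_ctg.fasta", "all_h_ctg.fasta"),  # FALCON-Unzip 3-unzip
    "case3": ("cns_p_ctg.fasta", "cns_h_ctg.fasta"),  # FALCON-Unzip 4-polish
}


def detect_input_cases(directory):
    """Return the cases whose primary and alternate files are both present."""
    present = {p.name for p in Path(directory).iterdir() if p.is_file()}
    return [
        case
        for case, files in INPUT_CASES.items()
        if all(name in present for name in files)
    ]
```

Calling `detect_input_cases("downloads")` on a hypothetical `downloads` directory returns a list such as `["case2"]` when only the 3-unzip pair is present, or an empty list when no complete pair is found.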
PacBio libraries were generated from unsheared DNA using a SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, Menlo Park, CA, USA), and 20-hour run-time movies were generated on a single SMRT Cell 1M v3 using the Sequel I system (Pacific Biosciences). The raw CLR subread BAM files were converted to FASTQ format using bamtools v. 2.5.1 (Barnett et al. 2011) and then used as input for the FALCON assembler (Chin et al. 2016) via the pb-assembly conda environment v. 0.0.8.1 (Pacific Biosciences; default parameters). FALCON-Unzip created primary and alternate contigs with one round of haplotype-aware polishing by Arrow (Pacific Biosciences).

Resources in this dataset:

Resource Title: Associated assembly contigs output from FALCON/2-asm-falcon. File Name: a_ctg_all.fasta

Resource Title: Primary assembly contigs output from FALCON/2-asm-falcon. File Name: p_ctg.fasta

Resource Title: Alternate haplotype assembly contigs output from FALCON Unzip 3-unzip. File Name: all_h_ctg.fasta

Resource Title: Primary assembly contigs output from FALCON Unzip 3-unzip. File Name: all_p_ctg.fasta

Resource Title: Alternate assembly contigs output from FALCON Unzip 4-polish. File Name: cns_h_ctg.fasta

Resource Title: Primary assembly contigs output from FALCON Unzip 4-polish. File Name: cns_pctg.fasta

Resource Title: Test long reads. File Name: test.1.filtered.bam.gz
Resource Description: For testing the pipeline, long reads that map to H. zea chromosome 30

Resource Title: Test short reads R1. File Name: testpolish_R1.fastq
Resource Description: Short reads aligned to Chromosome 30 of H. zea

Resource Title: Test short reads R2. File Name: testpolish_R2.fastq
Resource Description: Reverse pair (R2) short reads aligned to Chromosome 30 of H. zea

Resource Title: Chromosome 30 of H. zea. File Name: GCF_022581195.2_ilHelZeax1.1_chr30.fasta
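
Before passing any of the FASTA resources listed above to a workflow, it can be useful to confirm that a download is intact by counting its records and bases. The sketch below is one way to do that in plain Python; `summarize_fasta` is a hypothetical helper, it assumes an uncompressed text FASTA file, and it is not part of polishCLR or FALCON.

```python
def summarize_fasta(path):
    """Return (number_of_records, total_bases) for an uncompressed FASTA file."""
    n_records = 0
    total_bases = 0
    with open(path, "r", encoding="ascii", errors="replace") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                n_records += 1  # each header line starts a new contig
            else:
                total_bases += len(line)  # sequence lines may be wrapped
    return n_records, total_bases
```

For example, `summarize_fasta("p_ctg.fasta")` returns the contig count and the summed contig length for the Case 1 primary assembly, which can be compared between the primary and alternate files of each case.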

Access & Use Information

Public: This dataset is intended for public access and use. License: Creative Commons CCZero

Dates

Metadata Created Date March 30, 2024
Metadata Updated Date March 30, 2024
Data Update Frequency irregular

Metadata Source

Harvested from USDA JSON

Additional Metadata

Resource Type Dataset
Publisher Agricultural Research Service
Maintainer
Identifier 10.15482/USDA.ADC/1524676
Data Last Modified 2024-02-13
Public Access Level public
Bureau Code 005:18
Metadata Context https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld
Schema Version https://project-open-data.cio.gov/v1.1/schema
Catalog Describedby https://project-open-data.cio.gov/v1.1/schema/catalog.json
Harvest Object Id cadc7ac3-2fdf-4c23-9589-c9d75e0fc8c3
Harvest Source Id d3fafa34-0cb9-48f1-ab1d-5b5fdc783806
Harvest Source Title USDA JSON
License https://creativecommons.org/publicdomain/zero/1.0/
Old Spatial {"type": "Polygon", "coordinates": -88.857421875, 33.408798646313, -88.857421875, 33.486144342565, -88.737258911133, 33.486144342565, -88.737258911133, 33.408798646313, -88.857421875, 33.408798646313}
Program Code 005:040
Source Datajson Identifier True
Source Hash bb760554f06f6f78f8ccc721a31dbf3ce1f4ca0d68d03d159adc78e395b1c650
Source Schema Version 1.1
Spatial {"type": "Polygon", "coordinates": -88.857421875, 33.408798646313, -88.857421875, 33.486144342565, -88.737258911133, 33.486144342565, -88.737258911133, 33.408798646313, -88.857421875, 33.408798646313}
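
The Spatial value above lists its polygon coordinates as a flat run of numbers and not as nested GeoJSON arrays, so a standard JSON parser will reject it. As a minimal sketch, the bounding box can still be recovered by reading the numbers in pairs; this assumes the pairs are ordered longitude then latitude, as in GeoJSON, and `bounding_box` is a hypothetical helper name.

```python
import re


def bounding_box(spatial_text):
    """Return (min_lon, min_lat, max_lon, max_lat) from a flattened Spatial value."""
    # Only look at the text after the "coordinates" key.
    _, _, tail = spatial_text.partition('"coordinates"')
    numbers = [float(x) for x in re.findall(r"-?\d+(?:\.\d+)?", tail)]
    if len(numbers) < 2 or len(numbers) % 2 != 0:
        raise ValueError("expected an even, non-empty list of coordinates")
    lons = numbers[0::2]  # assumed longitude-first ordering
    lats = numbers[1::2]
    return min(lons), min(lats), max(lons), max(lats)


spatial = (
    '{"type": "Polygon", "coordinates": -88.857421875, 33.408798646313, '
    "-88.857421875, 33.486144342565, -88.737258911133, 33.486144342565, "
    "-88.737258911133, 33.408798646313, -88.857421875, 33.408798646313}"
)
print(bounding_box(spatial))
# (-88.857421875, 33.408798646313, -88.737258911133, 33.486144342565)
```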