After submitters deposit genome assemblies at NCBI, NCBI submission software processes the assemblies and runs quality checks on them. GenBank curators also manually review the submitted records and assign an accession number to each individual sequence record in the assembly. Once the records are publicly released, you can find the individual sequence records of an assembly in the Nucleotide database and/or in the Sequence Set Browser.
Because assemblies are products of large projects, the projects' metadata in the BioProject database accompany the sequence records. Moreover, NCBI staff create a record in the Assembly database for each individual assembly. Assembly records contain metadata and other information about the assemblies. Importantly, an Assembly database record aggregates the entire collection of sequences that make up the assembly under a single, unique assembly accession number. Once you find the assembly you are interested in, you can use its Assembly database record as your portal to the sequence data (sequence records on the web or on the FTP site). Completing the picture, the Assembly database records also link to records in the Genome database. While each record in the Assembly database is dedicated to an individual assembly, each Genome record focuses on an individual sequenced organism (species). If there is more than one assembly for a species, the Genome database aggregates this information within the organism's record.
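As an illustration of using the Assembly database as a portal to sequence records, the sketch below builds request URLs for the NCBI E-utilities: one that searches the Assembly database and one that asks for links from an Assembly record to the Nucleotide database (nuccore). It only constructs the URLs and sends nothing over the network. The function names and the example search term are assumptions made for this sketch and do not come from the text above.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_assembly_search_url(term, retmax=20):
    """Return an ESearch URL that queries the Assembly database.

    `term` is an Entrez query string, for example an organism name
    followed by the [Organism] field tag.
    """
    params = {
        "db": "assembly",
        "term": term,
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def build_assembly_to_nucleotide_url(assembly_uid):
    """Return an ELink URL from one Assembly record to the Nucleotide database.

    `assembly_uid` is the Entrez UID returned by ESearch, not the
    assembly accession itself.
    """
    params = {
        "dbfrom": "assembly",
        "db": "nuccore",
        "id": assembly_uid,
        "retmode": "json",
    }
    return f"{EUTILS_BASE}/elink.fcgi?{urlencode(params)}"
```

A typical flow would be to fetch the first URL, read the UIDs from the response, and then fetch the second URL for the UID of interest. For bulk sequence downloads the FTP site is usually the more practical route.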
Finally, there is one more important step in the processing of a submitted assembly: NCBI RefSeq curators check the assembly, confirm that there are no issues that would warrant exclusion, and create a RefSeq version of the assembly. NCBI owns these derived assemblies, and this allows us to (1) remove certain low-quality or contaminating sequence records from the assembly*, (2) add the non-nuclear genome (mitochondrion and/or plastid) sequence to the assembly, (3) annotate the assembly, and (4) regularly update the annotation. While a large proportion of GenBank assemblies are included in RefSeq, a much smaller number of these are categorized as RefSeq reference assemblies or RefSeq representative assemblies.
*Note that NCBI does not edit the sequence of the assembly records.
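The GenBank and RefSeq versions of an assembly can be told apart by their assembly accessions. The text above does not describe the accession format; the sketch below relies on the NCBI convention that GenBank assembly accessions start with GCA_ and RefSeq assembly accessions start with GCF_, followed by nine digits and a version number. The class and function names are hypothetical, chosen only for this example.

```python
import re
from dataclasses import dataclass

# Assembly accession pattern: GCA_ or GCF_, nine digits, a dot, a version.
_ACCESSION_RE = re.compile(r"^(GCA|GCF)_(\d{9})\.(\d+)$")

_COLLECTION_BY_PREFIX = {"GCA": "GenBank", "GCF": "RefSeq"}


@dataclass(frozen=True)
class AssemblyAccession:
    collection: str  # "GenBank" or "RefSeq"
    number: str      # the nine-digit numeric part, kept as text
    version: int     # the version after the dot


def parse_assembly_accession(accession):
    """Split an assembly accession into collection, number, and version.

    Raises ValueError if the string does not look like an assembly
    accession (for example, if it is an individual sequence accession).
    """
    match = _ACCESSION_RE.match(accession.strip())
    if match is None:
        raise ValueError(f"not an assembly accession: {accession!r}")
    prefix, number, version = match.groups()
    return AssemblyAccession(_COLLECTION_BY_PREFIX[prefix], number, int(version))
```

Calling `parse_assembly_accession` on a string that starts with GCF_ yields a record whose `collection` is "RefSeq", which is a quick way to check whether a given accession refers to the submitter's GenBank assembly or to the NCBI-derived RefSeq assembly.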
Keywords: NCBI, genome assembly, metadata, NCBI curation, reference sequences, NCBI RefSeq