What is genome annotation?

Genome annotation is the process of locating and labeling individual genes and other features on raw DNA sequences, called assemblies. Annotation gives meaning to a sequence and makes it much easier for researchers to view and analyze its contents. To see what annotation adds to our understanding of a sequence, compare the raw sequence (in FASTA format) with the GenBank or Graphics formats, both of which contain annotations. In each of the annotated formats, note the placement of individual genes and other features on the sequence.
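As a rough illustration of the idea in the paragraph above, the sketch below treats an annotation as a list of labeled coordinate ranges laid over a raw sequence. Everything in it is hypothetical: the `Feature` class, the `extract` helper, the toy sequence, and the gene names `geneA` and `geneB` are invented for illustration and do not come from any NCBI record or format specification. Coordinates are assumed to be 1-based and inclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """One annotated feature: a labeled range on a sequence (hypothetical model)."""
    kind: str          # e.g. "gene"
    name: str          # label assigned by the annotator
    start: int         # 1-based, inclusive (assumed convention)
    end: int           # 1-based, inclusive
    strand: str = "+"  # "+" or "-"


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def extract(seq: str, feature: Feature) -> str:
    """Return the bases a feature covers, read along the feature's own strand."""
    sub = seq[feature.start - 1:feature.end]
    return reverse_complement(sub) if feature.strand == "-" else sub


# Raw sequence alone: just a string of bases with no meaning attached.
raw_sequence = "TTGACATGGCCTAAGGCTTTACATCC"

# The annotation: locations and labels laid over that same string.
features = [
    Feature(kind="gene", name="geneA", start=6, end=14, strand="+"),
    Feature(kind="gene", name="geneB", start=18, end=23, strand="-"),
]

for f in features:
    print(f.kind, f.name, f"{f.start}..{f.end}", f.strand, extract(raw_sequence, f))
```

The raw string corresponds loosely to what a FASTA record holds, while the feature list plays the role of the annotation that the GenBank and Graphics formats display alongside the sequence.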

When a group of researchers assembles a genome, they may also annotate it at the same time, using processes they establish themselves. In the past, an assembly with annotation was known as a build. These days, the term build is rarely used, because genome assembly and annotation are often completely uncoupled: they can be conducted at different times by different parties.

For example, the Genome Reference Consortium (GRC) maintains and updates the human reference assembly. GRC releases assembly (sequence) updates and deposits them in the International Nucleotide Sequence Database Collaboration (INSDC) databases without annotation. GRC prepared the latest major assembly update (the major release designated GRCh38) in December 2013 and has since followed it with several minor updates (patches). In further processing of an assembly update, NCBI staff create a RefSeq version of the submitted INSDC assembly and then annotate that RefSeq version. Each annotation release has its own designation and time stamp. For example, the latest (as of August 2023) NCBI annotation release is designated GCF_000001405.40-RS_2023_03.
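The designation quoted above can be split into its parts with a short script. This is a minimal sketch, and the field interpretation is an assumption inferred from that single example rather than an official NCBI specification: it assumes a RefSeq assembly accession (`GCF_` plus nine digits), a dot and an assembly version number, then `-RS_` followed by a four-digit year and a two-digit month. The names `AnnotationRelease` and `parse_release` are hypothetical.

```python
import re
from typing import NamedTuple


class AnnotationRelease(NamedTuple):
    """Parts of an annotation release designation (assumed layout)."""
    assembly_accession: str
    assembly_version: int
    year: int
    month: int


# Assumed pattern, inferred from "GCF_000001405.40-RS_2023_03" only.
_DESIGNATION = re.compile(r"^(GCF_\d{9})\.(\d+)-RS_(\d{4})_(\d{2})$")


def parse_release(designation: str) -> AnnotationRelease:
    """Split a designation into accession, version, year, and month."""
    match = _DESIGNATION.match(designation)
    if match is None:
        raise ValueError(f"unrecognized designation: {designation!r}")
    accession, version, year, month = match.groups()
    return AnnotationRelease(accession, int(version), int(year), int(month))


release = parse_release("GCF_000001405.40-RS_2023_03")
print(release)
```

Designations that do not follow the assumed layout raise a `ValueError` instead of being guessed at, which seems the safer choice given that the pattern is inferred from one example.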

In addition to the human reference genome, NCBI staff annotate numerous eukaryotic genomes with the Eukaryotic Genome Annotation Pipeline. Visit the Eukaryotic Genome Annotation at NCBI page to explore the documentation on the annotation process and to follow the progress of individual genome annotations.

NCBI staff have also developed the Prokaryotic Genome Annotation Pipeline, which is available both as a service to GenBank submitters and as a stand-alone software package.