What is nucleotide sequence/genome annotation?

Article: KA-03574

Annotation, including genome annotation, is the process of finding and designating locations of individual genes and other biological features on nucleotide sequences. A researcher may annotate a short sequence manually by comparing their sequence to other sequences in the database with tools like BLAST. However, annotating an entire prokaryotic/eukaryotic genome requires computational approaches. NCBI has developed computational annotation pipelines for:

All prokaryotic genomes: PGAP (NCBI Prokaryotic Genome Annotation Pipeline)
Eukaryotic genomes (excluding fungi, protozoans, and most Protostomia): EGAP (NCBI Eukaryotic Genome Annotation Pipeline)
Selected viral genomes (Caliciviridae and Flaviviridae): VADR (Viral Annotation DefineR)

What is the purpose of annotation?

Annotation gives meaning to a sequence and makes it much easier for researchers to view and analyze its contents. To see what annotation adds to our understanding of a sequence, first check the graphic display of a Bacillus siamensis contig sequence (accession AJVF01000001.1) that does not carry any annotation. All you see is a gray bar representing the DNA sequence. Now compare that to the graphic display of the annotated RefSeq counterpart (accession NZ_AJVF01000001.1). The underlying sequence is the same in both records. However, because of the RefSeq annotation, you can now tell that the sequence contains several genes. Annotation also provides the exact location of each gene on the sequence, which can be seen in the GenBank display* of the same record.
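The comparison above can also be made programmatically. The following is a minimal sketch, not an official NCBI tool: it assumes Python 3, builds a request URL for NCBI's public E-utilities EFetch service, and reads gene locations out of the FEATURES table of a GenBank-format flat file. The function names efetch_url and list_features are hypothetical, and the SAMPLE record is made up for illustration; it is not real sequence data. The parser only reads the first line of each feature location, so long locations that wrap onto a second line would be truncated.

```python
import re
from urllib.parse import urlencode

# Public E-utilities EFetch endpoint; no request is sent in this sketch.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession):
    """Build a URL that returns a nucleotide record as GenBank flat-file text."""
    params = {"db": "nuccore", "id": accession, "rettype": "gb", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


# A feature line starts with exactly five spaces, then the feature key,
# then padding, then the location. Qualifier lines are indented further.
FEATURE_LINE = re.compile(r"^ {5}(\S+)\s+(\S+)$")


def list_features(flatfile_text):
    """Return (feature_key, location) pairs from the FEATURES table."""
    features = []
    in_table = False
    for line in flatfile_text.splitlines():
        if line.startswith("FEATURES"):
            in_table = True
            continue
        if in_table and line and not line.startswith(" "):
            break  # a new top-level keyword (e.g. ORIGIN or //) ends the table
        if in_table:
            match = FEATURE_LINE.match(line.rstrip())
            if match:
                features.append((match.group(1), match.group(2)))
    return features


# Made-up toy record (hypothetical, for illustration only).
SAMPLE = """\
LOCUS       TOY_CONTIG               900 bp    DNA     linear   BCT
FEATURES             Location/Qualifiers
     source          1..900
                     /organism="hypothetical organism"
     gene            10..300
                     /locus_tag="TOY_0001"
     CDS             10..300
                     /locus_tag="TOY_0001"
     gene            complement(450..800)
                     /locus_tag="TOY_0002"
ORIGIN
//
"""

genes = [location for key, location in list_features(SAMPLE) if key == "gene"]
print(genes)
print(efetch_url("NZ_AJVF01000001.1"))
```

Fetching the text behind efetch_url("AJVF01000001.1") and efetch_url("NZ_AJVF01000001.1") and passing each result to list_features should, under these assumptions, show the difference described above: the unannotated record would list few or no gene features, while the RefSeq record would list each gene with its location.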

Who annotates nucleotide/genome sequences, the submitter or GenBank/NCBI?

It depends on the data type:

Standard GenBank sequences, which include individual gene sequences and organelle and virus genomes but exclude prokaryotic and eukaryotic genomes:
- GenBank requires submitters to annotate their sequences, except for some selected data (see below).
- GenBank offers forms and instructions for annotating these sequences within its submission tools.
Selected standard GenBank data that GenBank annotates automatically on behalf of submitters:
- Ribosomal RNA (rRNA) or rRNA-ITS
- Metazoan (multicellular animal) COX1
- Virus sequences/genomes for: SARS-CoV-2, Influenza virus, Norovirus, and Dengue virus
GenBank prokaryotic genomes:
- Submitters can optionally annotate the genomes that they are submitting.
- Submitters can optionally request PGAP annotation from NCBI during genome submission.
- Researchers can use stand-alone PGAP software to annotate their genomes outside the submission process.
GenBank eukaryotic genomes:
- Submitters can optionally annotate the genomes that they are submitting.
- (NCBI is working on offering a stand-alone version of EGAP.)
RefSeq prokaryotic genomes:
- NCBI uses PGAP to annotate all prokaryotic genomes that are selected for inclusion in RefSeq.
RefSeq eukaryotic genomes:
- RefSeq generally selects, with input from the research community, one eukaryotic genome per species for inclusion in RefSeq and annotation with EGAP.
- RefSeq staff also manually curate individual genes and transcripts for selected species.
RefSeq genomes for fungi, protozoans, Protostomia, viruses, and viroids:
- RefSeq generally selects one genome per species for inclusion, then copies its GenBank annotation into RefSeq in standardized form.

Where can you learn more?

GenBank home:

The Documentation tab on the GenBank home page provides access to several submission guides and other information.
Other tabs on the GenBank home page provide access to submission instructions for separate data types.

Knowledge articles:

NCBI Datasets documentation:

GenBank and RefSeq:

To learn more about how NCBI collaborates with other archives to exchange nucleotide sequence data, visit the International Nucleotide Sequence Database Collaboration (INSDC) site.
To access information on various NCBI RefSeq projects, visit the RefSeq home page.

*The GenBank display is a display format for sequence records; it is used for both GenBank and RefSeq records.