What are GenBank accession numbers and what information is embedded in their format?


GenBank (INSDC) accession numbers (or GenBank accessions) uniquely identify GenBank (INSDC) sequence records. INSDC stands for the International Nucleotide Sequence Database Collaboration. It currently comprises three databases in which researchers can deposit their nucleotide sequence data: GenBank, DDBJ, and ENA. The INSDC collaborators share the deposited data across the three databases. Although each database assigns accessions to its own deposited records, the INSDC members share rules on accession formats and usage.
INSDC uses a variety of formats and accession prefixes. The following document provides the entire list of current formats and prefixes and those used in the past:

Accession Number prefixes: Where did the data originate?

Although that document does not specifically address versioning, keep in mind that sequence accession numbers (excluding SRA) also include version numbers. The generic format can therefore be written as follows:

[alphabetical prefix] [series of digits] [.] [version number]

Here are two examples of accession numbers for standard* GenBank records that you will find in the Nucleotide database: PP750791.1 (a partial-gene sequence) and KT896233.1 (a complete viral genome sequence). Two examples of Protein accessions for proteins that originate from the annotated coding regions (CDS) on the above KT896233.1 record are APA37253.1 and APA37254.1. All of these records have version number 1, which indicates that the sequences have not changed since their original public release.
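As a minimal sketch of the generic format above, the following Python snippet splits an accession into its prefix, digits, and version. The names parse_accession and Accession are illustrative only and are not part of any NCBI tool; the pattern covers only the generic prefix + digits + "." + version shape described here, not every identifier you may encounter.

```python
import re
from typing import NamedTuple

# Generic shape from the text: [alphabetical prefix][series of digits].[version]
ACCESSION_RE = re.compile(r"^([A-Z]+)(\d+)\.(\d+)$")


class Accession(NamedTuple):
    prefix: str
    digits: str   # kept as a string so leading zeros survive
    version: int


def parse_accession(text: str) -> Accession:
    """Split an accession.version string into its three parts (hypothetical helper)."""
    match = ACCESSION_RE.match(text.strip().upper())
    if match is None:
        raise ValueError(f"not in prefix+digits.version form: {text!r}")
    prefix, digits, version = match.groups()
    return Accession(prefix, digits, int(version))


for acc in ("PP750791.1", "KT896233.1", "APA37253.1"):
    print(parse_accession(acc))
```

Keeping the digits as a string is a deliberate choice in this sketch: converting them to an integer would drop leading zeros and make it impossible to rebuild the original accession.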

In yet another example, the ABKACV000000000.3 accession represents a Whole Genome Shotgun (WGS) master record. Its version “3” tells us that the project has been updated twice. The same accession can also be written in its shorthand form, ABKACV03. WGS master records tie together the projects’ contig** and scaffold** sequences. For example, the last of the 77 contig sequences belonging to the ABKACV03 project carries the ABKACV030000077.1 accession number. Note the two types of versioning in this contig accession: “03”, which follows the project prefix, is the version of the project, and the “1” after the period is the version of the contig itself.
Assigning accessions in Transcriptome (TSA) and Targeted Loci Sequences (TLS) sequencing projects follows the same principles as those we outlined for WGS.
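The two levels of versioning described above can be sketched in Python as follows. This is an illustration under stated assumptions, not NCBI's implementation: parse_wgs, WgsAccession, and shorthand are hypothetical names, and the pattern assumes a project prefix of four or six letters followed by a two-digit project version and at least six further digits, which matches the examples in this article but should be checked against the INSDC prefix document. The caller is also assumed to know already that the accession belongs to a WGS, TSA, or TLS project.

```python
import re
from typing import NamedTuple

# Assumed layout: [project prefix][2-digit project version][serial digits].[record version]
WGS_RE = re.compile(r"^([A-Z]{4}|[A-Z]{6})(\d{2})(\d{6,})\.(\d+)$")


class WgsAccession(NamedTuple):
    project: str          # alphabetical project prefix, e.g. ABKACV
    project_version: int  # version of the project as a whole
    serial: int           # contig number; 0 on a master record
    record_version: int   # version after the period
    is_master: bool


def parse_wgs(text: str) -> WgsAccession:
    """Split a WGS-style accession into project and record parts (hypothetical helper)."""
    match = WGS_RE.match(text.strip().upper())
    if match is None:
        raise ValueError(f"not a WGS-style accession: {text!r}")
    prefix, embedded, serial, record = match.groups()
    is_master = int(embedded) == 0 and int(serial) == 0
    # On a master record the digits are all zeros and the project version
    # is the number after the period; on a contig it is the embedded pair.
    project_version = int(record) if is_master else int(embedded)
    return WgsAccession(prefix, project_version, int(serial), int(record), is_master)


def shorthand(acc: WgsAccession) -> str:
    """Project shorthand such as ABKACV03."""
    return f"{acc.project}{acc.project_version:02d}"


master = parse_wgs("ABKACV000000000.3")
contig = parse_wgs("ABKACV030000077.1")
print(shorthand(master), shorthand(contig), contig.serial, contig.record_version)
```

Both the master record and the contig resolve to the same project shorthand, which is how a contig can be traced back to the project version it belongs to.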

How do prefixes work within INSDC?

INSDC assigns each alphabetical prefix to a single member, so that only one of the databases can use a given prefix. For example, the INSDC members assigned the “KT” prefix to GenBank but not to DDBJ or ENA. During a certain timespan in the past, GenBank indexers used the “KT” prefix to assign accessions to all standard records that they processed at the time. Once all digit combinations for the “KT” prefix were exhausted, they moved on to the next assigned prefix.

What information is embedded in INSDC prefixes?

GenBank (INSDC) prefixes can help you differentiate between standard and WGS/TSA/TLS records. If you consult the above document on data origin, you can also determine the database where the data was originally submitted. The prefixes, however, do not embed any biological information. For example, the “KT” prefix would have been used for any organism, not just viruses. It would have been assigned to DNA sequences as well as mRNAs and so on.
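As a rough illustration of telling record types apart by prefix, the sketch below looks only at the number of letters in the prefix. The length rules used here (one or two letters for standard nucleotide records, three for proteins, four or six for WGS/TSA/TLS) are an assumption based on common INSDC conventions and are consistent with the examples in this article; the prefix document referenced above remains the authoritative source, and record_class is a hypothetical helper name. Determining the originating database would require the actual prefix table, which is not reproduced here.

```python
import re


def record_class(accession: str) -> str:
    """Guess the record class from the prefix length alone (assumed rules)."""
    match = re.match(r"^([A-Z]+)\d", accession.strip().upper())
    if match is None:
        raise ValueError(f"no alphabetical prefix followed by digits: {accession!r}")
    letters = len(match.group(1))
    if letters in (1, 2):
        return "standard nucleotide"
    if letters == 3:
        return "protein"
    if letters in (4, 6):
        return "WGS/TSA/TLS"
    return "other"  # not covered by the assumed rules; consult the prefix list


for acc in ("KT896233.1", "APA37253.1", "ABKACV030000077.1"):
    print(acc, "->", record_class(acc))
```

Note that the function says nothing about the organism or molecule type, in line with the point above that prefixes carry no biological information.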

Where can you learn more?

Knowledge articles:

GenBank documentation on:

*Standard records exclude WGS, TSA, and TLS.

**See Datasets Glossary for contig and scaffold definitions.

Keywords: GenBank accession numbers, GenBank accessions, accessions, accession format, accession prefix, version numbers, versions, INSDC, International Nucleotide Sequence Databases Collaboration, DDBJ, ENA, NCBI, NCBI RefSeq
