What are the sources of the NCBI Protein database sequences?


The sequences in the NCBI Protein database originate from several different sources:

Translations of coding regions (CDS) annotated on GenBank (INSDC) sequence records, which are archived in the Nucleotide database. These protein records are designated by accession numbers in the following format:

[three-letter alphabetical prefix][five digits][.][version number]

NCBI staff curate many of the GenBank (INSDC) protein records into the Reference Sequence (RefSeq) collection. RefSeq protein accessions are recognizable by a two-letter prefix followed by an underscore, such as NP_ or XP_.
NCBI also imports records from the Universal Protein Resource Knowledgebase (UniProtKB). The UniProt help documentation describes the UniProt accession number format.
Protein Data Bank (PDB) records are the protein sequences that accompany the three-dimensional protein structures available in the NCBI Structure database. Each record is designated by a unique PDB ID.
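
The accession formats above can be told apart mechanically. The following is a minimal Python sketch, not an official NCBI tool: the function name guess_source is hypothetical, the GenBank pattern follows the three-letter, five-digit format given above, the RefSeq prefix list (AP_, NP_, WP_, XP_, YP_) is assumed to cover the common protein prefixes, the UniProtKB pattern follows the one published in the UniProt help documentation, and the PDB pattern assumes a four-character ID with an optional chain suffix such as _A.

```python
import re

# Patterns are checked in order. The RefSeq prefix list, the optional version
# suffix on UniProtKB accessions, and the PDB chain suffix are assumptions
# made for this sketch, not an exhaustive specification.
ACCESSION_PATTERNS = [
    ("RefSeq", re.compile(r"(?:AP|NP|WP|XP|YP)_\d+(?:\.\d+)?")),
    ("GenBank (INSDC)", re.compile(r"[A-Z]{3}\d{5}(?:\.\d+)?")),
    (
        "UniProtKB",
        re.compile(
            r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
            r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})(?:\.\d+)?"
        ),
    ),
    ("PDB", re.compile(r"[0-9][A-Za-z0-9]{3}(?:_[A-Za-z0-9]+)?")),
]


def guess_source(accession: str) -> str:
    """Return the likely source database for a protein accession string."""
    acc = accession.strip()
    for source, pattern in ACCESSION_PATTERNS:
        # fullmatch requires the whole string to fit the pattern.
        if pattern.fullmatch(acc):
            return source
    return "unknown"


if __name__ == "__main__":
    for acc in ["AAA98665.1", "NP_000537.3", "P04637", "1TUP_A"]:
        print(acc, "->", guess_source(acc))
```

A pattern match only suggests where a record probably came from; the record itself remains the authority on its source.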

To limit your search results to a single source database, first search the Protein database with a text term. Then, on the search results page, click the desired option in the Source databases facet (on the left side of the screen).
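
The same restriction can likely be applied programmatically through the NCBI E-utilities ESearch endpoint. The sketch below only builds the request URL and does not contact NCBI. The function name build_protein_search_url and the SOURCE_FILTERS table are hypothetical, and the srcdb_...[PROP] terms are assumed to correspond to the Source databases options; check them against the Entrez help before relying on them.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Assumed Entrez property terms for each source database (verify before use).
SOURCE_FILTERS = {
    "genbank": "srcdb_genbank[PROP]",
    "refseq": "srcdb_refseq[PROP]",
    "uniprot": "srcdb_swiss-prot[PROP]",
    "pdb": "srcdb_pdb[PROP]",
}


def build_protein_search_url(text_term: str, source: str) -> str:
    """Build an ESearch URL for the Protein database limited to one source."""
    try:
        source_filter = SOURCE_FILTERS[source.lower()]
    except KeyError:
        raise ValueError(f"unknown source database: {source!r}") from None
    # Combine the text term with the source filter using a Boolean AND.
    query = {"db": "protein", "term": f"{text_term} AND {source_filter}"}
    return f"{ESEARCH_URL}?{urlencode(query)}"


if __name__ == "__main__":
    print(build_protein_search_url("p53", "refseq"))
```

Building the URL separately from sending it keeps the query easy to inspect, and the same combined term can be pasted directly into the Protein search box.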

Keywords: NCBI Protein database, source sequences, GenBank, RefSeq, Universal Protein Resource (UniProtKB), Protein Data Bank (PDB)
