How do I obtain the current human proteome sequences from NCBI?

NCBI represents the human proteome with two overlapping* sets of Reference Sequence (RefSeq) protein sequences.

Set 1: Sequences currently annotated on the latest human reference genome; accessible from the Assembly database:

Search the Assembly database for the human organism. The following two search terms perform identical searches:

human[orgn]
Homo sapiens[Organism]

On the search results page, use the RefSeq category facet on the left side to select Reference.
Use the blue Download Assembly button at the top of the page and select the protein format of your choice.
Note the estimated size of the data (uncompressed). The data will download as a tar archive.
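
For scripted lookups, the same two search terms can also be sent to the Assembly database through the E-utilities ESearch endpoint. The sketch below only builds the request URLs; it does not reproduce the RefSeq category facet or the Download Assembly button, which are features of the web page. The function name assembly_search_url and the retmax default are illustrative choices, not part of the article.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def assembly_search_url(term, retmax=20):
    """Build an ESearch URL that queries the Assembly database for `term`.

    `assembly_search_url` and the default `retmax` are hypothetical helpers
    chosen for this sketch.
    """
    params = {
        "db": "assembly",   # search the Assembly database
        "term": term,       # e.g. "human[orgn]"
        "retmode": "json",  # ask for a JSON response
        "retmax": retmax,   # maximum number of record IDs to return
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    # The two search terms given in the steps above.
    for term in ("human[orgn]", "Homo sapiens[Organism]"):
        print(assembly_search_url(term))
```

Opening either URL in a browser or fetching it with an HTTP client should return the matching Assembly record IDs, which can then be compared with the results shown on the web page.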

As of October 2019, the latest human reference assembly release is GRCh38.p13, and the most recent full annotation of that assembly is the updated annotation release 109.20190905. Your download will include predicted protein models (the XP_ accessions) in addition to the known RefSeq proteins (the NP_ accessions). If you are interested in obtaining data for (1) an interim annotation release that followed release 108 and excludes predicted models, or (2) earlier assemblies and/or annotations, see the article on accessing such data on the Genomes FTP site.
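
Because the download mixes known (NP_) and predicted (XP_) records, it can help to separate them by accession prefix once the protein FASTA file is unpacked. The following is a minimal sketch that assumes each FASTA header starts with ">" followed directly by the accession; the function name and the accessions in the usage example are made up for illustration.

```python
def split_known_and_predicted(fasta_lines):
    """Sort accessions from FASTA header lines into known (NP_) and predicted (XP_).

    Assumes the accession is the first whitespace-delimited token after ">".
    Headers with any other prefix are ignored.
    """
    known, predicted = [], []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        if not fields:
            continue  # empty header, nothing to classify
        accession = fields[0]
        if accession.startswith("NP_"):
            known.append(accession)
        elif accession.startswith("XP_"):
            predicted.append(accession)
    return known, predicted


if __name__ == "__main__":
    # Hypothetical records, not real RefSeq entries.
    example = [
        ">NP_000001.1 example known protein [Homo sapiens]",
        "MKT",
        ">XP_000002.1 PREDICTED: example model [Homo sapiens]",
        "MAA",
    ]
    print(split_known_and_predicted(example))
```

In practice the lines would come from the unpacked protein FASTA file, for example by iterating over an open file handle.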

Set 2: Cumulative sequence data, updated weekly, including those that are not annotated on the reference genome assembly:

Access the human mRNA_Prot directory at the RefSeq FTP site, which contains transcript and protein records as sets of compressed files.
Choose between the two available formats: FASTA (the faa file extension) and GenPept (flat file) format (the gpff file extension).
Recursively download all of the files of the chosen format (for example all of the human.#.protein.faa.gz files to get the FASTA format).
For more information on the content of the mRNA_Prot directory refer to the README file in the directory.
Use the README file for the entire RefSeq FTP site to see how it is organized.
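
The download step can be scripted. The sketch below selects the human.#.protein.faa.gz files from a directory listing and fetches them with Python's ftplib. The host name and directory path in the constants are assumptions about the RefSeq FTP layout and should be checked against the README files; the function names are illustrative.

```python
import re
from ftplib import FTP

# Assumed location of the human mRNA_Prot directory; verify before use.
FTP_HOST = "ftp.ncbi.nlm.nih.gov"
FTP_DIR = "/refseq/H_sapiens/mRNA_Prot"

# Matches human.<number>.protein.faa.gz and captures the number.
PROTEIN_FASTA = re.compile(r"^human\.(\d+)\.protein\.faa\.gz$")


def select_protein_fasta(names):
    """Return the protein FASTA file names, ordered by their numeric index."""
    matched = []
    for name in names:
        m = PROTEIN_FASTA.match(name)
        if m:
            matched.append((int(m.group(1)), name))
    return [name for _, name in sorted(matched)]


def download_protein_fasta(host=FTP_HOST, directory=FTP_DIR):
    """Download every matching file into the current working directory."""
    with FTP(host) as ftp:
        ftp.login()  # anonymous login
        ftp.cwd(directory)
        for name in select_protein_fasta(ftp.nlst()):
            with open(name, "wb") as out:
                ftp.retrbinary(f"RETR {name}", out.write)


if __name__ == "__main__":
    download_protein_fasta()
```

To fetch the GenPept files instead, the pattern would match the gpff extension rather than faa.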

See the article on obtaining various subsets of the proteome, such as the subset that excludes the predicted models.

*A protein will not be included in set 1 if it is not annotated on the reference genome. Further differences between the two sets stem from different update/release dates. RefSeq staff update the transcripts and protein records daily and combine these in weekly releases for human (and some other organisms) and in releases that occur every two months for all organisms that are included in the RefSeq project. Annotation of eukaryotic genomes follows a different schedule. RefSeq staff do not archive data of previous weekly or bi-monthly RefSeq releases.