How do I obtain subsets of the human proteome at NCBI?

NCBI provides the human proteome as Reference Sequence (RefSeq) protein records. You can retrieve the current human RefSeq proteins on the web in the Protein database with the following search term:

Homo sapiens[Organism] AND refseq[filter]

Be aware that downloading the entire proteome in FASTA or GenPept format from the web is prohibitively slow, especially on a slow internet connection. To obtain the entire human proteome, consider alternative downloading options instead. If you want only a subset of the proteome, use the web or the Entrez Programming Utilities (E-utilities)/Entrez Direct (EDirect). Here are some examples:


Example 1: Exclude the predicted models (those designated with the XP_ accession format) to obtain only known proteins (the NP_ accession format)

Use the following search term:

human[orgn] AND refseq[filter] NOT "srcdb refseq model"[Properties]


This is equivalent to:

human[orgn] AND refseq[filter] NOT XP_000000001:XP_999999999[pacc]
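For readers using E-utilities rather than the web, the Example 1 search term can be passed to ESearch unchanged. The following is a minimal sketch, not an official client: the helper name esearch_url is hypothetical, and it only builds the request URL (it does not contact NCBI). Setting retmax to 0 is assumed here as a common way to ask only for the hit count.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Public E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Search term copied verbatim from Example 1.
KNOWN_PROTEINS = 'human[orgn] AND refseq[filter] NOT "srcdb refseq model"[Properties]'


def esearch_url(term, retmax=0):
    """Return an ESearch URL for the Protein database (hypothetical helper)."""
    params = {
        "db": "protein",
        "term": term,          # urlencode escapes spaces, quotes, and brackets
        "retmax": retmax,      # assumption: 0 requests the count without IDs
        "retmode": "json",
    }
    return f"{ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    print(esearch_url(KNOWN_PROTEINS))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the search result for the same set of records that the web search shows.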


Example 2: Retrieve the records for an individual RefSeq curation status

  • MODEL (XP_; provided by the NCBI Genome Annotation pipeline and not subject to individual review or revision between annotation runs):
human[organism] AND "srcdb refseq model"[Properties]
  • INFERRED (NP_; predicted by genome sequence analysis but not yet supported by experimental evidence. The record may be partially supported by homology data):
human[organism] AND "srcdb refseq inferred"[Properties]
  • PREDICTED (NP_; has not yet been subject to individual review, and some aspect of the RefSeq record is predicted):
human[organism] AND "srcdb refseq predicted"[Properties]
  • PROVISIONAL (NP_; has not yet been subject to individual review. The initial sequence-to-gene association has been established by outside collaborators or NCBI staff):
human[organism] AND "srcdb refseq provisional"[Properties]
  • VALIDATED (NP_; has undergone an initial review to provide the preferred sequence standard. The record has not yet been subject to final review, at which time additional functional information may be provided):
human[organism] AND "srcdb refseq validated"[Properties]
  • REVIEWED (NP_; has been reviewed by NCBI staff or by a collaborator. The NCBI review process includes assessing available sequence data and the literature. Some RefSeq records may incorporate expanded sequence and annotation information):
human[organism] AND "srcdb refseq reviewed"[Properties]
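Because the six status queries differ only in the status keyword, they can be generated from a single template. This is an illustrative sketch; the names STATUSES and status_query are hypothetical, and the query text follows the pattern of the terms listed above.

```python
# RefSeq status keywords from the list above, lowercase as used in the queries.
STATUSES = ("model", "inferred", "predicted", "provisional", "validated", "reviewed")


def status_query(status, organism="human"):
    """Build the Example 2 search term for one RefSeq status (hypothetical helper)."""
    key = status.lower()
    if key not in STATUSES:
        raise ValueError(f"unknown RefSeq status: {status!r}")
    return f'{organism}[organism] AND "srcdb refseq {key}"[Properties]'


if __name__ == "__main__":
    # Print one search term per status, ready to paste into the Protein database.
    for s in STATUSES:
        print(status_query(s))
```

Rejecting unknown keywords up front avoids sending a misspelled property value to the search.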

 

Example 3: Combine queries to retrieve both VALIDATED and REVIEWED entries


human[organism] AND ("srcdb refseq validated"[Properties] OR "srcdb refseq reviewed"[Properties])
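The same OR pattern extends to any set of statuses. As a hedged sketch (the function name combined_status_query is hypothetical), the status clauses can be joined with OR and wrapped in parentheses so that the organism restriction applies to all of them:

```python
def combined_status_query(statuses, organism="human"):
    """Join several RefSeq status clauses with OR (hypothetical helper)."""
    if not statuses:
        raise ValueError("at least one status is required")
    clauses = [f'"srcdb refseq {s.lower()}"[Properties]' for s in statuses]
    # Parentheses keep the OR group together before it is ANDed with the organism.
    return f"{organism}[organism] AND ({' OR '.join(clauses)})"


if __name__ == "__main__":
    print(combined_status_query(["validated", "reviewed"]))
```

Called with validated and reviewed, this reproduces the Example 3 search term.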


In all three examples, use the "Send to" link to download the records (see downloading details and troubleshooting). The same search terms work with E-utilities/EDirect.
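With E-utilities, one common approach to downloading a large result set is to run ESearch with usehistory=y and then fetch the records from the history server in batches with EFetch. The sketch below only builds the batch URLs; the function name efetch_fasta_urls and the batch size of 500 are assumptions, and the WebEnv and query_key values must come from the ESearch response for your own query.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Public E-utilities EFetch endpoint.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_urls(webenv, query_key, total, batch_size=500):
    """Yield one EFetch URL per batch of FASTA records (hypothetical helper).

    webenv and query_key identify a search stored on the history server;
    total is the hit count reported by ESearch.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        params = {
            "db": "protein",
            "WebEnv": webenv,
            "query_key": query_key,
            "rettype": "fasta",
            "retmode": "text",
            "retstart": start,      # zero-based offset of the first record
            "retmax": batch_size,   # maximum records returned by this request
        }
        yield f"{EFETCH}?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder values for illustration only; they are not a real session.
    for url in efetch_fasta_urls("EXAMPLE_WEBENV", 1, total=1200):
        print(url)
```

Fetching in batches keeps each request small, so an interrupted download can resume from the last completed offset instead of starting over.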