How do I navigate the NCBI Genomes FTP site? · NLM Customer Support Center

To access current and actively updated genome assembly data, use the following three directories on the NCBI Genomes FTP site: genbank, refseq, and all.

genbank is a directory of primary genome assembly data. It contains assembled genome sequences and associated annotations (if available) that sequencing centers or individual investigators submitted to GenBank or to another member of the International Nucleotide Sequence Database Collaboration (INSDC). Use this directory if you want all submitted genome assemblies and your main focus is not genome annotation. The directory is organized by taxonomic group, and you can browse it directly.
refseq is a directory of NCBI-derived genome assembly data containing assembled genomes that NCBI RefSeq staff selected from the primary INSDC data. Use the refseq directory if you want annotation data that are of high quality and regularly maintained. The sequences of a RefSeq genomic assembly are a copy of those present in the corresponding INSDC assembly. In some cases the copy is not completely identical, because RefSeq staff may (1) remove smaller pieces of a sequence (known as contigs) or reported contaminants, or (2) add non-nuclear genome sequences (for example, the mitochondrion) to the assembly. To find the primary GenBank (INSDC) assemblies used to create the RefSeq assemblies, use the assembly report files. All RefSeq genome assemblies have annotations that RefSeq staff either propagated from the primary records or generated with the NCBI prokaryotic or eukaryotic genome annotation pipelines. The refseq directory contains fewer genomic assemblies than the genbank directory. The directory is organized by taxonomic group, and you can browse it directly.
all is a directory that combines the contents of the genbank and refseq directories. It has two root directories: GCA for GenBank assembly data and GCF for RefSeq assembly data. Each root directory is partitioned into a hierarchy of subdirectories that follows the digits of the assembly accession number, three digits per level. For example, starting from the GCF root, the “000” directory contains subdirectories for those assemblies whose accession numbers begin with the digits 0, 0, 0. The “000/001/405” path ends in a series of GCF_000001405 directories whose names reflect the version and name of each individual assembly (GCF_000001405.36_GRCh38.p10 is version 36 of the human reference assembly, named GRCh38.p10).
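
The accession-to-path pattern described for the all directory can be sketched in a few lines of Python. This is a minimal illustration, not an NCBI tool: the function name `assembly_parent_path` is hypothetical, and the code assumes that an accession consists of a GCA or GCF prefix, an underscore, and nine digits, optionally followed by a version and assembly name, as in the example above.

```python
import re

# Assumption: accessions look like GCA_/GCF_ plus nine digits, optionally
# followed by ".<version>" and "_<assembly name>".
_ACCESSION = re.compile(r"^(GCA|GCF)_(\d{9})(?:\.\d+)?")


def assembly_parent_path(accession: str) -> str:
    """Return the path under 'all' that holds the assembly's directories.

    The nine digits are split into three groups of three, one per
    directory level, below the GCA or GCF root.
    """
    match = _ACCESSION.match(accession)
    if match is None:
        raise ValueError(f"not a recognizable assembly accession: {accession!r}")
    root, digits = match.groups()
    # "000001405" becomes ["000", "001", "405"]
    triplets = [digits[i:i + 3] for i in range(0, len(digits), 3)]
    return "/".join([root, *triplets])


if __name__ == "__main__":
    name = "GCF_000001405.36_GRCh38.p10"
    # Path relative to the 'all' directory on the Genomes FTP site
    print(f"{assembly_parent_path(name)}/{name}")
```

For the human reference assembly named above, the function returns GCF/000/001/405, which matches the “000/001/405” path under the GCF root. The versioned directory name itself (GCF_000001405.36_GRCh38.p10) cannot be derived from the accession alone, so it has to be read from a listing of that path.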

All other directories on the NCBI Genomes FTP site are legacy directories, and we will be archiving them one after another. If you are using any of these directories, pay attention to their update dates to ensure that you are obtaining current data. If you find a directory missing, check whether it has already been moved into the archive directory, which you will also find on the Genomes FTP site. Read more about the FTP genomes site structure and learn details on the site reorganization, content, file formats, downloading instructions, and future plans.

Keywords: NCBI Genomes FTP site, FTP organization, FTP directory structure
