Selecting Ensembl Data

ensembl-tui requires a config file to specify the data that you want to select from Ensembl.

Create a template config file to edit

eti can write a template config file to a directory you specify. This step requires Internet access.

$ eti demo-config --outpath demo
Contents written to demo

The config template file (sample.cfg) is written to the specified directory along with species.tsv, a listing of Ensembl species from the main site. The latest species listing from Ensembl is also downloaded and written to species-full.tsv.

$ ls demo/
sample.cfg
species-full.tsv
species.tsv

Note

Use eti demo-config --help to see the currently supported Ensembl domains to download species data from.

The species contents

$ head -n 5 demo/species-full.tsv
abbrev  common_name genome_name division    taxonomy_id assembly    assembly_accession  genebuild   variation   microarray  pan_compara peptide_compara genome_alignments   other_alignments    db_prefix   species_id
aca-poly    Spiny chromis   acanthochromis_polyacanthus EnsemblVertebrates  80966   ASM210954v1 GCA_002109545.1 2018-05-Ensembl/2020-03 N   N   N   Y   Y   Y   acanthochromis_polyacanthus_core_115_1  1
acc-nisu    Eurasian sparrowhawk    accipiter_nisus EnsemblVertebrates  211598  Accipiter_nisus_ver1.0  GCA_004320145.1 2019-07-Ensembl/2019-09 N   N   N   N   N   Y   accipiter_nisus_core_115_1  1
ail-mela    Giant panda ailuropoda_melanoleuca  EnsemblVertebrates  9646    ASM200744v2 GCA_002007445.2 2020-05-Ensembl/2020-06 N   N   N   Y   Y   Y   ailuropoda_melanoleuca_core_115_2   1
ama-coll    Yellow-billed parrot    amazona_collaria    EnsemblVertebrates  241587  ASM394721v1 GCA_003947215.1 2019-07-Ensembl/2019-09 N   N   N   N   N   Y   amazona_collaria_core_115_1 1
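
To find the abbreviation for a species without scanning the file by eye, the listing can be searched with a few lines of Python. This is a minimal sketch that is not part of ensembl-tui: it assumes the file is tab-delimited with the header row shown above, and `find_species` is a hypothetical helper name. The sample data is two rows from the listing above, trimmed to the first four columns.

```python
import csv
import io

# Two rows copied from the species-full.tsv listing above, re-typed with
# explicit tab separators and trimmed to the first four columns for brevity.
SAMPLE_TSV = (
    "abbrev\tcommon_name\tgenome_name\tdivision\n"
    "aca-poly\tSpiny chromis\tacanthochromis_polyacanthus\tEnsemblVertebrates\n"
    "ail-mela\tGiant panda\tailuropoda_melanoleuca\tEnsemblVertebrates\n"
)


def find_species(handle, query):
    """Return rows whose common or genome name contains query (case-insensitive)."""
    query = query.lower()
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row
        for row in reader
        if query in row["common_name"].lower() or query in row["genome_name"].lower()
    ]


matches = find_species(io.StringIO(SAMPLE_TSV), "panda")
for row in matches:
    print(row["abbrev"], row["genome_name"])  # prints: ail-mela ailuropoda_melanoleuca
```

To search the real file, pass `open("demo/species-full.tsv", newline="")` in place of the `io.StringIO` object.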

The config contents

$ head -n 15 demo/sample.cfg
[remote path]
# Specify which Ensembl domain to get the data from.
# Available domains: main, vertebrates, metazoa, protists
domain=main
[local path]
# Local paths correspond to where the data will be downloaded
# and where the installation will be placed.
# The paths can be absolute (begin with /), as in this case,
# or relative to the location of this cfg file (begin with ./).
staging_path=ensembl_download
install_path=ensembl_install
[release]
# The release of Ensembl that you want to sample data from.
release=115
[Saccharomyces cerevisiae]

The config format

The config file uses the .ini format. A section is denoted by square brackets surrounding the section name, e.g. [remote path]. Variables within a section are written as the name followed by = and the value.
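
Because the file is plain .ini, it can be inspected with Python's standard `configparser` module before handing it to eti. The sketch below only illustrates the section and variable layout; how ensembl-tui itself reads the file is not shown here, and the `db=core` line under the species section is added on the assumption that it follows in the full template.

```python
import configparser

# The opening lines of sample.cfg (comments omitted), plus a db=core entry
# assumed to follow the species section header.
SAMPLE_CFG = """\
[remote path]
domain=main
[local path]
staging_path=ensembl_download
install_path=ensembl_install
[release]
release=115
[Saccharomyces cerevisiae]
db=core
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_CFG)

# Section names may contain spaces; variables are looked up by section and name.
domain = parser.get("remote path", "domain")
release = parser.getint("release", "release")
sections = parser.sections()
print(domain, release, sections)
```

For a file on disk, `parser.read("demo/sample.cfg")` replaces the `read_string` call.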

[remote path]

This section defines which Ensembl domain hosts the data you want, e.g. main, metazoa, protists.

[local path]

Specify where to download the data to (staging_path) on your machine and where to write your installation (install_path).

[release]

Specify the Ensembl release.

[<species name>]

Select a species by adding a section whose name is the species name. You can instead use an abbreviation of the species name, which can be found in the species-full.tsv file written by the demo-config command.

For now, you must include db=core under the species section.
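
A quick pre-flight check can catch a species section that is missing db=core. The following is a rough sketch and not an ensembl-tui feature: `species_missing_core` is a hypothetical helper, and it assumes that any section other than the ones described on this page is a species section.

```python
import configparser

# Section names described on this page; everything else is treated as a
# species section. This split is an assumption made for illustration.
RESERVED = {"remote path", "local path", "release", "species_map", "compara"}


def species_missing_core(cfg_text):
    """Return species section names that lack db=core."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return [
        name
        for name in parser.sections()
        if name not in RESERVED and parser.get(name, "db", fallback=None) != "core"
    ]


example = """\
[release]
release=115
[Saccharomyces cerevisiae]
db=core
[Caenorhabditis elegans]
"""
missing = species_missing_core(example)
print(missing)  # prints: ['Caenorhabditis elegans']
```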

[species_map]

This section holds the species naming information used to map between a genome name, abbreviation, common name and the Ensembl database prefix. A variant of it is written to the downloaded.cfg and installed.cfg files. You can edit the abbreviations to make them more memorable and easier to type.

[species_map]
header = genome_name    abbrev  common_name db_prefix
caenorhabditis_elegans = worm   caenorhabditis elegans (nematode, n2)   caenorhabditis_elegans
saccharomyces_cerevisiae = yeast    saccharomyces cerevisiae    saccharomyces_cerevisiae

Note

This is an optional section. If missing, a species map file will need to be provided.
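
To make the [species_map] layout concrete, the sketch below reads the example section and rebuilds one record per genome. It assumes the fields within each value are separated by tabs (the page rendering does not make the separator explicit) and it is not how ensembl-tui parses the section internally.

```python
import configparser

# The [species_map] example above, re-typed with explicit tab separators
# (an assumption about the real delimiter).
SPECIES_MAP = (
    "[species_map]\n"
    "header = genome_name\tabbrev\tcommon_name\tdb_prefix\n"
    "caenorhabditis_elegans = worm\tcaenorhabditis elegans (nematode, n2)\tcaenorhabditis_elegans\n"
    "saccharomyces_cerevisiae = yeast\tsaccharomyces cerevisiae\tsaccharomyces_cerevisiae\n"
)

parser = configparser.ConfigParser()
parser.read_string(SPECIES_MAP)
section = parser["species_map"]

# The header names four columns; the first (genome_name) is the key of each entry.
columns = section["header"].split("\t")
records = {}
for genome_name, value in section.items():
    if genome_name == "header":
        continue
    fields = [genome_name, *value.split("\t")]
    records[genome_name] = dict(zip(columns, fields))

print(records["saccharomyces_cerevisiae"]["abbrev"])  # prints: yeast
```

Changing `worm` or `yeast` in the config is the kind of abbreviation edit this section is meant to support.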

[compara]

Sources from the Compara database are indicated here. Including homologies = (without a value) indicates that you want to get the homology information for all the selected species.

You can indicate the alignments you want as comma separated values assigned to the variable align_names.

Note

To select the alignments that you want, you will need to navigate the Ensembl FTP site for the release you are interested in.

Implicit selection of genomes

If you specify a whole genome alignment and do not list any species, then all of the species present in that whole genome alignment will be downloaded.

For example, the following config would download 10 primate genomes along with whole genome alignments and homology data.

[remote path]
domain=main
[local path]
staging_path=download_115
install_path=install_115
[release]
release=115
[compara]
align_names=10_primates.epo
homologies=
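
The [compara] settings in this config can be read back with Python's `configparser` to confirm what is being requested. This is an illustrative sketch only: the set of reserved section names is an assumption drawn from the sections described on this page, and resolving which species belong to an alignment is left to eti.

```python
import configparser

# The implicit-selection config shown above.
CFG = """\
[remote path]
domain=main
[local path]
staging_path=download_115
install_path=install_115
[release]
release=115
[compara]
align_names=10_primates.epo
homologies=
"""

parser = configparser.ConfigParser()
parser.read_string(CFG)
compara = parser["compara"]

# align_names is a comma-separated list; drop blanks and surrounding spaces.
align_names = [n.strip() for n in compara.get("align_names", "").split(",") if n.strip()]
# homologies has no value, so its presence alone signals the request.
want_homologies = "homologies" in compara
# Species sections are whatever remains after the sections described on this page.
reserved = {"remote path", "local path", "release", "species_map", "compara"}
explicit_species = [s for s in parser.sections() if s not in reserved]

print(align_names, want_homologies, explicit_species)
```

An empty `explicit_species` list alongside a non-empty `align_names` list corresponds to the implicit selection case described above.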