Selecting Ensembl Data
ensembl-tui requires a config file to specify the data that you want to select from Ensembl.
Create a template config file to edit
eti can write a template config file to a directory you specify. This step requires Internet access.
$ eti demo-config --outpath demo
Contents written to demo
The config template file is written to the specified directory along with a species.tsv file, which lists the Ensembl species from the main site. The latest species listing from Ensembl is also downloaded and written to species-full.tsv.
$ ls demo/
sample.cfg
species-full.tsv
species.tsv
Note
Use eti demo-config --help to see the currently supported Ensembl domains to download species data from.
The species contents
$ head -n 5 demo/species-full.tsv
abbrev common_name genome_name division taxonomy_id assembly assembly_accession genebuild variation microarray pan_compara peptide_compara genome_alignments other_alignments db_prefix species_id
aca-poly Spiny chromis acanthochromis_polyacanthus EnsemblVertebrates 80966 ASM210954v1 GCA_002109545.1 2018-05-Ensembl/2020-03 N N N Y Y Y acanthochromis_polyacanthus_core_115_1 1
acc-nisu Eurasian sparrowhawk accipiter_nisus EnsemblVertebrates 211598 Accipiter_nisus_ver1.0 GCA_004320145.1 2019-07-Ensembl/2019-09 N N N N N Y accipiter_nisus_core_115_1 1
ail-mela Giant panda ailuropoda_melanoleuca EnsemblVertebrates 9646 ASM200744v2 GCA_002007445.2 2020-05-Ensembl/2020-06 N N N Y Y Y ailuropoda_melanoleuca_core_115_2 1
ama-coll Yellow-billed parrot amazona_collaria EnsemblVertebrates 241587 ASM394721v1 GCA_003947215.1 2019-07-Ensembl/2019-09 N N N N N Y amazona_collaria_core_115_1 1
The config contents
$ head -n 15 demo/sample.cfg
[remote path]
# Specify which Ensembl domain to get the data from.
# Available domains: main, vertebrates, metazoa, protists
domain=main
[local path]
# Local paths correspond to where the data will be downloaded
# and where the installation will be placed.
# The paths can be absolute (begin with /), as in this case,
# or relative to the location of this cfg file (begin with ./).
staging_path=ensembl_download
install_path=ensembl_install
[release]
# The release of Ensembl that you want to sample data from.
release=115
[Saccharomyces cerevisiae]
The config format
The config file uses the .ini format. A section is denoted by square brackets surrounding the section name, e.g. [remote path]. Variables within a section are written as the variable name followed by = and its value.
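As an illustration of the format only, the following minimal sketch reads a config like the sample with Python's standard-library configparser. It is not necessarily how eti itself parses the file. The inlined text is an abridged copy of the sample config, with db=core added under the species section.

```python
import configparser

# Abridged copy of sample.cfg, inlined so the example is self-contained.
CFG_TEXT = """
[remote path]
domain=main
[local path]
staging_path=ensembl_download
install_path=ensembl_install
[release]
release=115
[Saccharomyces cerevisiae]
db=core
"""

parser = configparser.ConfigParser()
parser.read_string(CFG_TEXT)

# Section names are the text inside the square brackets.
print(parser.sections())

# Variables are looked up by section name, then by variable name.
domain = parser["remote path"]["domain"]
release = parser.getint("release", "release")
print(domain, release)
```

Reading your edited file this way before running eti can be a quick check that it is well-formed .ini.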
[remote path]
This is the section defining which Ensembl domain hosts the data you want, e.g. main, metazoa, protists.
[local path]
Specify where to download the data to (staging_path) on your machine and where to write your installation (install_path).
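The comments in the sample config say that these paths may be absolute or relative to the location of the cfg file. A minimal sketch of that rule follows. The helper name resolve_cfg_path and the example path /data/ensembl_install are hypothetical, and eti's own resolution logic may differ.

```python
from pathlib import PurePosixPath


def resolve_cfg_path(cfg_file: str, value: str) -> PurePosixPath:
    """Hypothetical helper: interpret `value` relative to the cfg file's directory.

    Absolute paths (beginning with /) are returned unchanged.
    """
    path = PurePosixPath(value)
    if path.is_absolute():
        return path
    return PurePosixPath(cfg_file).parent / path


# A relative path is placed next to the config file.
staging = resolve_cfg_path("demo/sample.cfg", "ensembl_download")
# An absolute path (illustrative value) is kept as given.
install = resolve_cfg_path("demo/sample.cfg", "/data/ensembl_install")
print(staging)
print(install)
```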
[release]
Specify the Ensembl release.
[<species name>]
To select a species, add a section whose name is the species name. You can use an abbreviation of the species name instead; the abbreviations are listed in the species-full.tsv file written out by the demo-config command.
For now, you must include db=core under the species section.
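To find the abbreviation for a species, you can search species-full.tsv. The sketch below uses Python's csv module and assumes the file is tab-delimited, as its extension suggests. It inlines three columns from rows shown in the listing so that it runs on its own. The function name find_species is hypothetical.

```python
import csv
import io

# Three columns taken from rows of the species-full.tsv listing;
# the tab delimiter is an assumption based on the .tsv extension.
TSV_TEXT = (
    "abbrev\tcommon_name\tgenome_name\n"
    "aca-poly\tSpiny chromis\tacanthochromis_polyacanthus\n"
    "ail-mela\tGiant panda\tailuropoda_melanoleuca\n"
)


def find_species(tsv_file, query):
    """Return rows whose common or genome name contains `query` (case-insensitive)."""
    query = query.lower()
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return [
        row
        for row in reader
        if query in row["common_name"].lower() or query in row["genome_name"].lower()
    ]


matches = find_species(io.StringIO(TSV_TEXT), "panda")
for row in matches:
    print(row["abbrev"], row["genome_name"])
```

With the real file, pass open("demo/species-full.tsv", newline="") in place of the io.StringIO object.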
[species_map]
This section holds the species naming information used to map between a genome name, abbreviation, common name and the Ensembl database prefix. A variant of this will be written out to the downloaded.cfg and installed.cfg files. You can edit the abbreviations to make them more memorable and easier to type.
[species_map]
header = genome_name abbrev common_name db_prefix
caenorhabditis_elegans = worm caenorhabditis elegans (nematode, n2) caenorhabditis_elegans
saccharomyces_cerevisiae = yeast saccharomyces cerevisiae saccharomyces_cerevisiae
Note
This is an optional section. If missing, a species map file will need to be provided.
[compara]
Sources from the Compara database are indicated here. Including homologies = (without a value) indicates that you want to get the homology information for all the selected species.
You can indicate the alignments you want as comma separated values assigned to the variable align_names.
Note
To select the alignments that you want, you will need to navigate the Ensembl FTP site for the release you are interested in.
Implicit selection of genomes
If you specify a whole genome alignment and do not list any species, then all of the species present in that whole genome alignment will be downloaded.
For example, the following config would download 10 primate genomes along with whole genome alignments and homology data.
[remote path]
domain=main
[local path]
staging_path=download_115
install_path=install_115
[release]
release=115
[compara]
align_names=10_primates.epo
homologies=
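The following minimal sketch shows how such a config can be read to spot implicit selection: any section that is not one of the named sections described earlier is treated as a species section. This is an illustration under that assumption, not eti's implementation, and the RESERVED set is a hypothetical name.

```python
import configparser

# Section names with a fixed meaning, taken from the sections described in this page.
RESERVED = {"remote path", "local path", "release", "species_map", "compara"}

CFG_TEXT = """
[remote path]
domain=main
[local path]
staging_path=download_115
install_path=install_115
[release]
release=115
[compara]
align_names=10_primates.epo
homologies=
"""

parser = configparser.ConfigParser()
parser.read_string(CFG_TEXT)

# Any remaining section is assumed to name a species.
species = [name for name in parser.sections() if name not in RESERVED]

# align_names holds comma separated values; empty entries are dropped.
raw_names = parser.get("compara", "align_names", fallback="")
align_names = [name.strip() for name in raw_names.split(",") if name.strip()]

# "homologies=" has no value; its presence alone is what matters.
wants_homologies = parser.has_option("compara", "homologies")

# Implicit selection: an alignment is requested and no species are listed.
implicit = bool(align_names) and not species
print(species, align_names, wants_homologies, implicit)
```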