Installing Ensembl Data

The install step converts the downloaded data into more efficient data structures. These are written to the install_path as specified in the config file.

The install command requires the path to the download directory.

$ eti install -d <dirname>

Note

You can run the installation step across multiple processes on your machine with the -np # argument, where # is the number of processes. We recommend one process per genome, e.g. -np 10 for ten genomes.

Warning

At present, installation is not interruptible. If you restart an installation, you will need to overwrite the existing one using the --force_overwrite argument.
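
For scripted installs, the options described above can be combined into a single command. The following is a minimal sketch in Python that only assembles the argument list; build_install_command is a hypothetical helper (it is not part of ensembl_tui), and the directory name downloads is a placeholder. It uses only the -d, -np and --force_overwrite options mentioned in this section.

```python
import shlex


def build_install_command(download_dir, num_genomes=1, force_overwrite=False):
    """Return the argument list for an ``eti install`` invocation.

    Hypothetical helper for illustration; it only builds the command
    and does not run it.
    """
    if num_genomes < 1:
        raise ValueError("num_genomes must be at least 1")

    cmd = ["eti", "install", "-d", str(download_dir)]
    if num_genomes > 1:
        # one process per genome, following the recommendation above
        cmd += ["-np", str(num_genomes)]
    if force_overwrite:
        # needed when restarting over an existing installation
        cmd.append("--force_overwrite")
    return cmd


cmd = build_install_command("downloads", num_genomes=3, force_overwrite=True)
print(shlex.join(cmd))
# prints: eti install -d downloads -np 3 --force_overwrite
```

Assuming eti is on your PATH, the resulting list could be passed to subprocess.run(cmd, check=True) to run the installation.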

What Is Installed

$ ls data/apes-115
compara
genomes
installed.cfg

The installed.cfg file specifies the Ensembl release, the software versions used during installation and, under the species_map section, the mapping of each genome name to its alternative names, such as the abbreviation. It is a plain text file which you can edit.

Note

Changing the abbreviations to something that you find easier to type can be useful, as the command line interface uses them.

$ cat data/apes-115/installed.cfg
[release]
release = 115

[software versions]
ensembl_tui = 0.4.3
cogent3_h5seqs = 0.7.0
numpy = 2.3.3
polars = 1.33.1
typing_extensions = 4.15.0
trogon = 0.6.0
duckdb = 1.3.1
unsync = 1.4.0
rich = 14.1.0
click = 8.2.1
scitrack = 2024.10.8
pyarrow = 21.0.0
numba = 0.62.1
cogent3 = 2025.9.8a3

[species_map]
header = genome_name    abbrev  common_name db_prefix
pan_troglodytes = chimp chimpanzee  pan_troglodytes
gorilla_gorilla = gorilla   gorilla gorilla_gorilla
homo_sapiens = human    human   homo_sapiens

The software versions section lists the versions of the software dependencies that were present at the time of installation. This is intended for debugging purposes.
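
Because installed.cfg uses an INI-style layout, it can also be inspected programmatically. The sketch below reads a shortened copy of the file shown above using Python's standard configparser module. It is an illustration, not the way ensembl_tui itself loads the file: read_installed_cfg is a hypothetical helper, and the code assumes the species_map values are tab-separated, falling back to splitting on any whitespace.

```python
import configparser

# Shortened copy of the installed.cfg shown above.
# Assumption: fields in [species_map] are separated by tabs.
SAMPLE = """\
[release]
release = 115

[software versions]
ensembl_tui = 0.4.3
numpy = 2.3.3

[species_map]
header = genome_name\tabbrev\tcommon_name\tdb_prefix
pan_troglodytes = chimp\tchimpanzee\tpan_troglodytes
homo_sapiens = human\thuman\thomo_sapiens
"""


def read_installed_cfg(text):
    """Return (release, software versions, species map) from cfg text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)

    release = parser.get("release", "release")
    versions = dict(parser["software versions"])

    species = {}
    for genome, value in parser["species_map"].items():
        if genome == "header":
            continue  # column names, not a genome
        # prefer tabs; fall back to whitespace if none are present
        fields = value.split("\t") if "\t" in value else value.split()
        abbrev, common_name, db_prefix = fields
        species[genome] = {
            "abbrev": abbrev,
            "common_name": common_name,
            "db_prefix": db_prefix,
        }
    return release, versions, species


release, versions, species = read_installed_cfg(SAMPLE)
print(release)
print(versions["ensembl_tui"])
print(species["pan_troglodytes"]["abbrev"])
```

To read a real installation, the file contents could be passed in with pathlib.Path("data/apes-115/installed.cfg").read_text().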

Check Your Installation

Once you have finished your installation, you can check its contents using the installed command. The output includes the software versions at the time of installation (useful for troubleshooting) plus species names, abbreviations, etc.

$ eti installed -i data/apes-115
Ensembl release: 115
Installed genomes:                         
┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ abbrev  ┃ genome          ┃ common name ┃
┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ gorilla │ gorilla_gorilla │ gorilla     │
│ human   │ homo_sapiens    │ human       │
│ chimp   │ pan_troglodytes │ chimpanzee  │
└─────────┴─────────────────┴─────────────┘
Installed homologies: ✅
Installed alignments: ✅
Installation software versions:   
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ package           ┃ version    ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ click             │ 8.2.1      │
│ cogent3           │ 2025.9.8a3 │
│ cogent3_h5seqs    │ 0.7.0      │
│ duckdb            │ 1.3.1      │
│ ensembl_tui       │ 0.4.3      │
│ numba             │ 0.62.1     │
│ numpy             │ 2.3.3      │
│ polars            │ 1.33.1     │
│ pyarrow           │ 21.0.0     │
│ rich              │ 14.1.0     │
│ scitrack          │ 2024.10.8  │
│ trogon            │ 0.6.0      │
│ typing_extensions │ 4.15.0     │
│ unsync            │ 1.4.0      │
└───────────────────┴────────────┘

Note

From this point on, we specify the installation directory using the -i option. It is required for all commands that reference an installation.