
Querying Genomes

Summary of this installation

$ eti installed -i data/apes-115
Ensembl release: 115
Installed genomes:                         
┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ abbrev  ┃ genome          ┃ common name ┃
┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ gorilla │ gorilla_gorilla │ gorilla     │
│ human   │ homo_sapiens    │ human       │
│ chimp   │ pan_troglodytes │ chimpanzee  │
└─────────┴─────────────────┴─────────────┘
Installed homologies: ✅
Installed alignments: ✅
Installation software versions:   
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ package           ┃ version    ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ click             │ 8.2.1      │
│ cogent3           │ 2025.9.8a3 │
│ cogent3_h5seqs    │ 0.7.0      │
│ duckdb            │ 1.3.1      │
│ ensembl_tui       │ 0.4.3      │
│ numba             │ 0.62.1     │
│ numpy             │ 2.3.3      │
│ polars            │ 1.33.1     │
│ pyarrow           │ 21.0.0     │
│ rich              │ 14.1.0     │
│ scitrack          │ 2024.10.8  │
│ trogon            │ 0.6.0      │
│ typing_extensions │ 4.15.0     │
│ unsync            │ 1.4.0      │
└───────────────────┴────────────┘

Note

Download all the data (zip, ~212 MB).

Summary for a species

$ eti species-summary -i data/apes-115 --species human
Homo sapiens features                         
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ biotype                            ┃ count ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ snoRNA                             │    11 │
│ miRNA                              │    46 │
│ unprocessed_pseudogene             │    52 │
│ IG_V_pseudogene                    │    43 │
│ scaRNA                             │     3 │
│ lncRNA                             │   747 │
│ protein_coding                     │   447 │
│ rRNA                               │     1 │
│ snRNA                              │    26 │
│ processed_pseudogene               │   142 │
│ TEC                                │    10 │
│ IG_C_pseudogene                    │     5 │
│ IG_V_gene                          │    37 │
│ IG_J_gene                          │     7 │
│ rRNA_pseudogene                    │     5 │
│ misc_RNA                           │    62 │
│ transcribed_unprocessed_pseudogene │    69 │
│ transcribed_unitary_pseudogene     │     7 │
│ transcribed_processed_pseudogene   │    21 │
│ IG_C_gene                          │     4 │
│ unitary_pseudogene                 │     2 │
└────────────────────────────────────┴───────┘
Homo sapiens repeat                                     
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ repeat_type             ┃ repeat_class      ┃  count ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ Centromere              │ centromere        │      1 │
│ Dust                    │ dust              │ 39,843 │
│ LTRs                    │ LTR/ERV1?         │     36 │
│ LTRs                    │ LTR               │     39 │
│ LTRs                    │ LTR/ERVL?         │     52 │
│ LTRs                    │ LTR/Gypsy?        │     71 │
│ LTRs                    │ LTR?              │    102 │
│ LTRs                    │ LTR/Gypsy         │    181 │
│ LTRs                    │ LTR/ERVK          │    280 │
│ LTRs                    │ LTR/ERVL          │  3,163 │
│ LTRs                    │ LTR/ERV1          │  4,193 │
│ LTRs                    │ LTR/ERVL-MaLR     │  7,897 │
│ Low complexity regions  │ Low_complexity    │  3,341 │
│ RNA repeats             │ scRNA             │      4 │
│ RNA repeats             │ rRNA              │     14 │
│ RNA repeats             │ RNA               │     24 │
│ RNA repeats             │ srpRNA            │     85 │
│ Satellite repeats       │ Satellite/acro    │     22 │
│ Satellite repeats       │ Satellite/telo    │     73 │
│ Satellite repeats       │ Satellite/centr   │    186 │
│ Satellite repeats       │ Satellite         │    295 │
│ Simple repeats          │ Simple_repeat     │  5,802 │
│ Tandem repeats          │ trf               │ 22,623 │
│ Type I Transposons/LINE │ LINE/Dong-R4      │      6 │
│ Type I Transposons/LINE │ LINE/Penelope     │     20 │
│ Type I Transposons/LINE │ LINE/RTE-BovB     │    104 │
│ Type I Transposons/LINE │ LINE/RTE-X        │    246 │
│ Type I Transposons/LINE │ LINE/CR1          │  1,541 │
│ Type I Transposons/LINE │ LINE/L2           │ 13,974 │
│ Type I Transposons/LINE │ LINE/L1           │ 23,179 │
│ Type I Transposons/SINE │ SINE/tRNA-Deu     │      6 │
│ Type I Transposons/SINE │ SINE/tRNA         │     42 │
│ Type I Transposons/SINE │ SINE/tRNA-RTE     │     68 │
│ Type I Transposons/SINE │ SINE/5S-Deu-L2    │     80 │
│ Type I Transposons/SINE │ SINE/MIR          │ 22,446 │
│ Type I Transposons/SINE │ SINE/Alu          │ 49,780 │
│ Type II Transposons     │ DNA/TcMar-Pogo    │      2 │
│ Type II Transposons     │ DNA/PIF-Harbinger │      4 │
│ Type II Transposons     │ DNA/TcMar         │      6 │
│ Type II Transposons     │ DNA/hAT-Tag1      │      6 │
│ Type II Transposons     │ DNA/PiggyBac?     │      6 │
│ Type II Transposons     │ DNA/hAT?          │      8 │
│ Type II Transposons     │ DNA               │     27 │
│ Type II Transposons     │ DNA?              │     28 │
│ Type II Transposons     │ DNA/MuDR          │     37 │
│ Type II Transposons     │ DNA/MULE-MuDR     │     37 │
│ Type II Transposons     │ DNA/hAT-Tip100?   │     43 │
│ Type II Transposons     │ DNA/PiggyBac      │     46 │
│ Type II Transposons     │ DNA/hAT-Ac        │     84 │
│ Type II Transposons     │ DNA/TcMar-Tc2     │    202 │
│ Type II Transposons     │ DNA/hAT           │    231 │
│ Type II Transposons     │ DNA/TcMar-Mariner │    269 │
│ Type II Transposons     │ DNA/hAT-Blackjack │    444 │
│ Type II Transposons     │ DNA/hAT-Tip100    │    976 │
│ Type II Transposons     │ DNA/TcMar-Tigger  │  2,513 │
│ Type II Transposons     │ DNA/hAT-Charlie   │  6,140 │
│ Unknown                 │ RC?/Helitron?     │      2 │
│ Unknown                 │ RC/Helitron       │     44 │
│ Unknown                 │ Unknown           │     88 │
│ Unknown                 │ Retroposon/SVA    │    213 │
└─────────────────────────┴───────────────────┴────────┘

Export gene metadata for a species

Note

The list of data from this query only covers human chromosome 22 because we are using a custom subset of the original Ensembl data.

$ eti dump-genes -i data/apes-115 --species human -od human_data
Finished: wrote 'human_data/homo_sapiens-115-gene_metadata.tsv'!
$ head human_data/homo_sapiens-115-gene_metadata.tsv
species seqid   source  biotype transcript_biotypes num_transcripts start   stop    strand  symbol  description
homo_sapiens    22      lncRNA  lncRNA,retained_intron  44  24632819    24653362    1   BCRP3   BCR pseudogene 3 [Source:NCBI gene (formerly Entrezgene);Acc:644165]
homo_sapiens    22      lncRNA  lncRNA  1   22559342    22566602    1   LL22NC03-63E9.3 uncharacterized LOC648691 [Source:NCBI gene (formerly Entrezgene);Acc:648691]
homo_sapiens    22      lncRNA  lncRNA  3   21991098    22043934    -1  PRAMENP PRAME N-terminal like, pseudogene [Source:NCBI gene (formerly Entrezgene);Acc:649179]
homo_sapiens    22      lncRNA  lncRNA  18  18533645    18578894    -1  PI4KAP1 phosphatidylinositol 4-kinase alpha pseudogene 1 [Source:NCBI gene (formerly Entrezgene);Acc:728233]
homo_sapiens    22      lncRNA  lncRNA,retained_intron  29  25448052    25540803    1   CRYBB2P1    crystallin beta B2 pseudogene 1 [Source:NCBI gene (formerly Entrezgene);Acc:1416]
homo_sapiens    22      lncRNA  lncRNA  8   32121930    32143398    1   AP1B1P1 AP1B1 pseudogene 1 [Source:NCBI gene (formerly Entrezgene);Acc:23782]
homo_sapiens    22      lncRNA  retained_intron,lncRNA  29  20701113    20704617    1   TMEM191A    transmembrane protein 191A (pseudogene) [Source:NCBI gene (formerly Entrezgene);Acc:84222]
homo_sapiens    22      lncRNA  lncRNA  4   37339564    37427445    -1  ELFN2   extracellular leucine rich repeat and fibronectin type III domain containing 2 [Source:NCBI gene (formerly Entrezgene);Acc:114794]
homo_sapiens    22      lncRNA  lncRNA  1   36965247    36968172    1   LL22NC01-81G9.3 uncharacterized protein FLJ39582-like [Source:NCBI gene (formerly Entrezgene);Acc:100506241]
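
The exported file can be inspected with nothing more than the Python standard library. The sketch below tallies genes per biotype. It assumes only what the output above shows: a tab-delimited file with a header row that includes a `biotype` column. The helper name `count_biotypes` and the in-memory sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_biotypes(handle):
    """Count rows per value of the 'biotype' column in a tab-delimited file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["biotype"] for row in reader)


# a tiny in-memory stand-in for the exported file (values are made up)
sample = io.StringIO(
    "species\tseqid\tbiotype\tstart\tstop\n"
    "homo_sapiens\t22\tlncRNA\t10\t20\n"
    "homo_sapiens\t22\tprotein_coding\t30\t40\n"
    "homo_sapiens\t22\tlncRNA\t50\t60\n"
)
counts = count_biotypes(sample)
for biotype, count in counts.most_common():
    print(biotype, count)
```

To run it on the real export, pass a handle opened on human_data/homo_sapiens-115-gene_metadata.tsv (with `newline=""`, as the csv module recommends) instead of the sample.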

Defining intergenic regions

To sample non-genic regions with ensembl-tui, you need to write code that produces a coordinate file. We do that here using cogent3. In brief, the algorithmic steps are:

  1. Load the metadata file into a cogent3 table
  2. Sort the table by the genomic coordinate columns seqid, start, stop
  3. For each seqid (e.g. "22" for chromosome 22):
     a. Get the start, stop coordinates for all genes and merge overlapping ones
     b. Define each intergenic region as the span from the previous gene's stop to the current gene's start
  4. Write these out to a tab-delimited file with the correct column headings

from cogent3 import load_table, make_table
from cogent3.util.misc import get_merged_overlapping_coords

table = load_table("human_data/homo_sapiens-115-gene_metadata.tsv")
# make sure the seqid column is a string type
table.columns["seqid"] = table.columns["seqid"].astype(str)
table = table.sorted(columns=["seqid", "start", "stop"])

# for a full analysis, we would repeat the steps below for each
# unique seqid
seqids = table.distinct_values("seqid")

# here we only process one seqid, chromosome 22
seqid = "22"
chrom22 = table.filtered(lambda x: x == seqid, columns="seqid")
start_stop = chrom22.to_list(["start", "stop"])

# we use the utility function to merge overlapping gene coordinates
start_stop = get_merged_overlapping_coords(start_stop)

# define the remaining constants to be output into the tsv file
# required by eti
species = "homo_sapiens"
strand = 1
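# the first region runs from the start of the chromosome to the first gene
inter_genic = [(species, seqid, 0, start_stop[0][0], strand)]
last_end = start_stop[0][1]
# skip the first merged gene, it was handled above; each remaining
# region runs from the previous gene's stop to the current gene's start
# (the region after the last gene is not included)
for start, stop in start_stop[1:]:
    inter_genic.append((species, seqid, last_end, start, strand))
    last_end = stop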

intergen_tab = make_table(
    header=["species", "seqid", "start", "stop", "strand"], data=inter_genic
)
intergen_tab.write("data/chrom22-intergenic.tsv")
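
The comments above note that a full analysis would handle every seqid, not just chromosome 22. A minimal pure-Python sketch of that loop follows. It is not part of ensembl-tui or cogent3: `merge_overlapping` is a hypothetical stand-in for the cogent3 utility, `intergenic_rows` is a made-up helper, and the coordinates are toy values. As in the example above, the region after the last gene is omitted because the sequence length is not known here.

```python
from collections import defaultdict


def merge_overlapping(spans):
    """Merge overlapping (start, stop) spans; returns a sorted list."""
    merged = []
    for start, stop in sorted(spans):
        if merged and start <= merged[-1][1]:
            # overlaps (or touches) the previous span, so extend it
            merged[-1][1] = max(merged[-1][1], stop)
        else:
            merged.append([start, stop])
    return merged


def intergenic_rows(genes, species, strand=1):
    """Return (species, seqid, start, stop, strand) rows for each seqid.

    genes is an iterable of (seqid, start, stop) tuples.
    """
    by_seqid = defaultdict(list)
    for seqid, start, stop in genes:
        by_seqid[seqid].append((start, stop))

    rows = []
    for seqid in sorted(by_seqid):
        last_end = 0
        for start, stop in merge_overlapping(by_seqid[seqid]):
            if start > last_end:  # skip empty regions
                rows.append((species, seqid, last_end, start, strand))
            last_end = stop
    return rows


# made-up coordinates, for illustration only
toy_genes = [("22", 100, 200), ("22", 150, 300), ("22", 500, 600), ("21", 50, 80)]
rows = intergenic_rows(toy_genes, "homo_sapiens")
for row in rows:
    print(row)
```

The resulting rows use the same column order as the table written above, so they could be passed to make_table with the same header.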