Querying Genomes
Summary of this installation
$ eti installed -i data/apes-115
Ensembl release: 115
Installed genomes:
┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ abbrev ┃ genome ┃ common name ┃
┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ gorilla │ gorilla_gorilla │ gorilla │
│ human │ homo_sapiens │ human │
│ chimp │ pan_troglodytes │ chimpanzee │
└─────────┴─────────────────┴─────────────┘
Installed homologies: ✅
Installed alignments: ✅
Installation software versions:
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ package ┃ version ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ click │ 8.2.1 │
│ cogent3 │ 2025.9.8a3 │
│ cogent3_h5seqs │ 0.7.0 │
│ duckdb │ 1.3.1 │
│ ensembl_tui │ 0.4.3 │
│ numba │ 0.62.1 │
│ numpy │ 2.3.3 │
│ polars │ 1.33.1 │
│ pyarrow │ 21.0.0 │
│ rich │ 14.1.0 │
│ scitrack │ 2024.10.8 │
│ trogon │ 0.6.0 │
│ typing_extensions │ 4.15.0 │
│ unsync │ 1.4.0 │
└───────────────────┴────────────┘
Note
Download all the data (zip, ~212 MB).
Summary for a species
$ eti species-summary -i data/apes-115 --species human
Homo sapiens features
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ biotype ┃ count ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ snoRNA │ 11 │
│ miRNA │ 46 │
│ unprocessed_pseudogene │ 52 │
│ IG_V_pseudogene │ 43 │
│ scaRNA │ 3 │
│ lncRNA │ 747 │
│ protein_coding │ 447 │
│ rRNA │ 1 │
│ snRNA │ 26 │
│ processed_pseudogene │ 142 │
│ TEC │ 10 │
│ IG_C_pseudogene │ 5 │
│ IG_V_gene │ 37 │
│ IG_J_gene │ 7 │
│ rRNA_pseudogene │ 5 │
│ misc_RNA │ 62 │
│ transcribed_unprocessed_pseudogene │ 69 │
│ transcribed_unitary_pseudogene │ 7 │
│ transcribed_processed_pseudogene │ 21 │
│ IG_C_gene │ 4 │
│ unitary_pseudogene │ 2 │
└────────────────────────────────────┴───────┘
Homo sapiens repeat
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ repeat_type ┃ repeat_class ┃ count ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ Centromere │ centromere │ 1 │
│ Dust │ dust │ 39,843 │
│ LTRs │ LTR/ERV1? │ 36 │
│ LTRs │ LTR │ 39 │
│ LTRs │ LTR/ERVL? │ 52 │
│ LTRs │ LTR/Gypsy? │ 71 │
│ LTRs │ LTR? │ 102 │
│ LTRs │ LTR/Gypsy │ 181 │
│ LTRs │ LTR/ERVK │ 280 │
│ LTRs │ LTR/ERVL │ 3,163 │
│ LTRs │ LTR/ERV1 │ 4,193 │
│ LTRs │ LTR/ERVL-MaLR │ 7,897 │
│ Low complexity regions │ Low_complexity │ 3,341 │
│ RNA repeats │ scRNA │ 4 │
│ RNA repeats │ rRNA │ 14 │
│ RNA repeats │ RNA │ 24 │
│ RNA repeats │ srpRNA │ 85 │
│ Satellite repeats │ Satellite/acro │ 22 │
│ Satellite repeats │ Satellite/telo │ 73 │
│ Satellite repeats │ Satellite/centr │ 186 │
│ Satellite repeats │ Satellite │ 295 │
│ Simple repeats │ Simple_repeat │ 5,802 │
│ Tandem repeats │ trf │ 22,623 │
│ Type I Transposons/LINE │ LINE/Dong-R4 │ 6 │
│ Type I Transposons/LINE │ LINE/Penelope │ 20 │
│ Type I Transposons/LINE │ LINE/RTE-BovB │ 104 │
│ Type I Transposons/LINE │ LINE/RTE-X │ 246 │
│ Type I Transposons/LINE │ LINE/CR1 │ 1,541 │
│ Type I Transposons/LINE │ LINE/L2 │ 13,974 │
│ Type I Transposons/LINE │ LINE/L1 │ 23,179 │
│ Type I Transposons/SINE │ SINE/tRNA-Deu │ 6 │
│ Type I Transposons/SINE │ SINE/tRNA │ 42 │
│ Type I Transposons/SINE │ SINE/tRNA-RTE │ 68 │
│ Type I Transposons/SINE │ SINE/5S-Deu-L2 │ 80 │
│ Type I Transposons/SINE │ SINE/MIR │ 22,446 │
│ Type I Transposons/SINE │ SINE/Alu │ 49,780 │
│ Type II Transposons │ DNA/TcMar-Pogo │ 2 │
│ Type II Transposons │ DNA/PIF-Harbinger │ 4 │
│ Type II Transposons │ DNA/TcMar │ 6 │
│ Type II Transposons │ DNA/hAT-Tag1 │ 6 │
│ Type II Transposons │ DNA/PiggyBac? │ 6 │
│ Type II Transposons │ DNA/hAT? │ 8 │
│ Type II Transposons │ DNA │ 27 │
│ Type II Transposons │ DNA? │ 28 │
│ Type II Transposons │ DNA/MuDR │ 37 │
│ Type II Transposons │ DNA/MULE-MuDR │ 37 │
│ Type II Transposons │ DNA/hAT-Tip100? │ 43 │
│ Type II Transposons │ DNA/PiggyBac │ 46 │
│ Type II Transposons │ DNA/hAT-Ac │ 84 │
│ Type II Transposons │ DNA/TcMar-Tc2 │ 202 │
│ Type II Transposons │ DNA/hAT │ 231 │
│ Type II Transposons │ DNA/TcMar-Mariner │ 269 │
│ Type II Transposons │ DNA/hAT-Blackjack │ 444 │
│ Type II Transposons │ DNA/hAT-Tip100 │ 976 │
│ Type II Transposons │ DNA/TcMar-Tigger │ 2,513 │
│ Type II Transposons │ DNA/hAT-Charlie │ 6,140 │
│ Unknown │ RC?/Helitron? │ 2 │
│ Unknown │ RC/Helitron │ 44 │
│ Unknown │ Unknown │ 88 │
│ Unknown │ Retroposon/SVA │ 213 │
└─────────────────────────┴───────────────────┴────────┘
Export gene meta-data for a species
Note
The output of this query covers only human chromosome 22 because we are using a custom subset of the original Ensembl data.
$ eti dump-genes -i data/apes-115 --species human -od human_data
Finished: wrote 'human_data/homo_sapiens-115-gene_metadata.tsv'!
$ head human_data/homo_sapiens-115-gene_metadata.tsv
species seqid source biotype transcript_biotypes num_transcripts start stop strand symbol description
homo_sapiens 22 lncRNA lncRNA,retained_intron 44 24632819 24653362 1 BCRP3 BCR pseudogene 3 [Source:NCBI gene (formerly Entrezgene);Acc:644165]
homo_sapiens 22 lncRNA lncRNA 1 22559342 22566602 1 LL22NC03-63E9.3 uncharacterized LOC648691 [Source:NCBI gene (formerly Entrezgene);Acc:648691]
homo_sapiens 22 lncRNA lncRNA 3 21991098 22043934 -1 PRAMENP PRAME N-terminal like, pseudogene [Source:NCBI gene (formerly Entrezgene);Acc:649179]
homo_sapiens 22 lncRNA lncRNA 18 18533645 18578894 -1 PI4KAP1 phosphatidylinositol 4-kinase alpha pseudogene 1 [Source:NCBI gene (formerly Entrezgene);Acc:728233]
homo_sapiens 22 lncRNA lncRNA,retained_intron 29 25448052 25540803 1 CRYBB2P1 crystallin beta B2 pseudogene 1 [Source:NCBI gene (formerly Entrezgene);Acc:1416]
homo_sapiens 22 lncRNA lncRNA 8 32121930 32143398 1 AP1B1P1 AP1B1 pseudogene 1 [Source:NCBI gene (formerly Entrezgene);Acc:23782]
homo_sapiens 22 lncRNA retained_intron,lncRNA 29 20701113 20704617 1 TMEM191A transmembrane protein 191A (pseudogene) [Source:NCBI gene (formerly Entrezgene);Acc:84222]
homo_sapiens 22 lncRNA lncRNA 4 37339564 37427445 -1 ELFN2 extracellular leucine rich repeat and fibronectin type III domain containing 2 [Source:NCBI gene (formerly Entrezgene);Acc:114794]
homo_sapiens 22 lncRNA lncRNA 1 36965247 36968172 1 LL22NC01-81G9.3 uncharacterized protein FLJ39582-like [Source:NCBI gene (formerly Entrezgene);Acc:100506241]
Defining intergenic regions
To use ensembl-tui for sampling non-genic regions, you need to write code that produces a coordinate file. We do that here using cogent3. In brief, the algorithmic steps are:
- Load the metadata file into a cogent3 table
- Sort the table by the genomic coordinate columns seqid, start, stop
- For each seqid (e.g. "22" for chromosome 22)
  - Get the start, stop coordinates for all genes and merge the overlapping ones
  - Define each intergenic region as the span from the previous gene's stop to the current gene's start
- Write these out to a tab-delimited file with the correct column headings
from cogent3 import load_table, make_table
from cogent3.util.misc import get_merged_overlapping_coords
table = load_table("human_data/homo_sapiens-115-gene_metadata.tsv")
# make sure the seqid column is a string type
table.columns["seqid"] = table.columns["seqid"].astype(str)
table = table.sorted(columns=["seqid", "start", "stop"])
# if we were doing this for real, we would work on each unique
# seqid at a time
seqids = table.distinct_values("seqid")
# but we just do one seqid here, chrom 22
seqid = "22"
chrom22 = table.filtered(lambda x: x == seqid, columns="seqid")
start_stop = chrom22.to_list(["start", "stop"])
# we use the utility function to merge overlapping gene coordinates
start_stop = get_merged_overlapping_coords(start_stop)
# define the remaining constants to be output into the tsv file
# required by eti
species = "homo_sapiens"
strand = 1
# the first intergenic region runs from the sequence start to the first gene
inter_genic = [(species, seqid, 0, start_stop[0][0], strand)]
last_end = start_stop[0][1]
# the first gene was handled above, so start from the second one
for start, stop in start_stop[1:]:
    inter_genic.append((species, seqid, last_end, start, strand))
    last_end = stop
intergen_tab = make_table(
    header=["species", "seqid", "start", "stop", "strand"], data=inter_genic
)
intergen_tab.write("data/chrom22-intergenic.tsv")
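
The comments above note that a real analysis would process every unique seqid rather than just chromosome 22. The following is a minimal sketch of that loop, written in plain Python so it runs without cogent3. The names `merge_overlapping` and `intergenic_by_seqid` are hypothetical helpers introduced only for illustration; `merge_overlapping` is a stand-in for `get_merged_overlapping_coords` and treats touching spans as overlapping, which may differ from the cogent3 function. As in the listing above, the region between the last gene and the end of the sequence is not emitted, because the gene metadata file does not record sequence lengths.

```python
from collections import defaultdict


def merge_overlapping(coords):
    """Merge overlapping (start, stop) pairs.

    Hypothetical stand-in for cogent3's get_merged_overlapping_coords.
    """
    merged = []
    for start, stop in sorted(coords):
        if merged and start <= merged[-1][1]:
            # overlaps (or touches) the previous span, so extend it
            merged[-1][1] = max(merged[-1][1], stop)
        else:
            merged.append([start, stop])
    return [tuple(span) for span in merged]


def intergenic_by_seqid(genes, species, strand=1):
    """Return (species, seqid, start, stop, strand) rows for gaps between genes.

    genes is an iterable of (seqid, start, stop) tuples.
    """
    by_seqid = defaultdict(list)
    for seqid, start, stop in genes:
        by_seqid[str(seqid)].append((start, stop))

    rows = []
    for seqid in sorted(by_seqid):
        last_end = 0  # each sequence is assumed to start at coordinate 0
        for start, stop in merge_overlapping(by_seqid[seqid]):
            if start > last_end:
                rows.append((species, seqid, last_end, start, strand))
            last_end = stop
    return rows


# toy coordinates, not taken from the Ensembl data
toy_genes = [("22", 100, 200), ("22", 150, 300), ("22", 500, 600), ("21", 50, 80)]
rows = intergenic_by_seqid(toy_genes, "homo_sapiens")
```

The resulting rows use the same column order as the table written above, so they could be passed as `data` to `make_table` with the same header.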