API Reference

This page documents the ePLACE Python API for programmatic access.

Core Modules

eplace_lib.blast_analysis

BLAST analysis module for sequence comparison.

This module provides functionality for running BLAST searches and filtering results based on sequence identity and coverage criteria.

class eplace_lib.blast_analysis.BlastHit(query_id: str, subject_id: str, percent_identity: float, alignment_length: int, query_length: int, subject_length: int, query_start: int, query_end: int, subject_start: int, subject_end: int, evalue: float, bit_score: float, query_coverage: float, subject_taxid: str, subject_taxids: str, subject_taxonomy: Dict[str, Tuple[str, str]] | None = None)[source]

Bases: object

Represents a single BLAST hit result.

query_id

Query sequence identifier

Type:

str

subject_id

Subject (database) sequence identifier

Type:

str

percent_identity

Percentage of identical matches

Type:

float

alignment_length

Length of alignment

Type:

int

query_length

Length of query sequence

Type:

int

subject_length

Length of subject sequence

Type:

int

query_start

Start position in query

Type:

int

query_end

End position in query

Type:

int

subject_start

Start position in subject

Type:

int

subject_end

End position in subject

Type:

int

evalue

Expectation value

Type:

float

bit_score

Bit score

Type:

float

query_coverage

Percentage of query covered by alignment

Type:

float

subject_taxonomy

The subject's taxonomy information. A dictionary with rank as key and a tuple of (taxid, name) as value.

Type:

Dict[str, Tuple[str, str]] | None

alignment_length: int
bit_score: float
evalue: float
get_accession() str[source]

Extract the accession number from the subject_id.

BLAST IDs can be in various formats:

  • gi|2273658778|gb|MZ387488.1| -> MZ387488.1

  • ref|NZ_CP123456.1| -> NZ_CP123456.1

  • gb|MZ387488.1| -> MZ387488.1

  • MZ387488.1 -> MZ387488.1 (already in accession format)

Note: gnl|database|identifier format is handled by returning the identifier, but these may not be standard accessions.

Returns:

The accession number extracted from subject_id, or the full subject_id if no standard format is detected
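The ID formats listed for get_accession() can be illustrated with a small standalone function. This is only a sketch of the documented behaviour, not the method's actual implementation: the name extract_accession and the set of recognised database tags are assumptions made for the example.

```python
def extract_accession(subject_id: str) -> str:
    """Return an accession-like token from a BLAST subject ID (illustrative only)."""
    if "|" not in subject_id:
        return subject_id  # already a bare accession
    fields = subject_id.split("|")
    # gnl|database|identifier -> identifier (may not be a standard accession)
    if fields[0] == "gnl" and len(fields) >= 3 and fields[2]:
        return fields[2]
    # Assumed set of database tags; the real module may recognise others.
    known_tags = {"gb", "ref", "emb", "dbj"}
    for tag, value in zip(fields, fields[1:]):
        if tag in known_tags and value:
            return value
    return subject_id  # no standard format detected, return the ID unchanged
```

Scanning tag/value pairs handles both the gi|…|gb|ACC| form and the shorter gb|ACC| form with the same loop.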

get_subject_taxonomy(rank: str) tuple[str, str] | None[source]

Return the taxonomy information as a tuple of (taxid, name) for the given rank. If the rank is not found, return None.

Parameters:

rank – The rank to return the taxonomy information for.

Returns:

tuple of (taxid, name) for the given rank, or None if the rank is not found.

percent_identity: float
query_coverage: float
query_end: int
query_id: str
query_length: int
query_start: int
subject_end: int
subject_id: str
subject_length: int
subject_start: int
subject_taxid: str
subject_taxids: str
subject_taxonomy: Dict[str, Tuple[str, str]] | None = None
class eplace_lib.blast_analysis.BlastRunner(blastdb_path: Path | None = None)[source]

Bases: object

Class for running BLAST searches and parsing results.

__init__(blastdb_path: Path | None = None)[source]

Initialize the BlastRunner.

Parameters:

blastdb_path – Path to BLAST database directory. If None, uses BLASTDB env var.

check_blastn_available() bool[source]

Check if blastn is available in the system.

Returns:

True if blastn is available, False otherwise

filter_blast_hits(hits: list[BlastHit], min_identity: float = 90.0, min_coverage: float = 80.0, min_alignment_length: int | None = None) list[BlastHit][source]

Filter BLAST hits based on identity and coverage thresholds.

Parameters:
  • hits – list of BlastHit objects

  • min_identity – Minimum percent identity (default: 90.0)

  • min_coverage – Minimum query coverage percentage (default: 80.0)

  • min_alignment_length – Minimum alignment length (optional)

Returns:

Filtered list of BlastHit objects
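The thresholds above can be pictured with the following self-contained sketch. The Hit dataclass is a hypothetical stand-in that carries only the fields the filter reads, and treating each threshold as inclusive (>=) is an assumption; the source does not state how boundary values are handled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """Hypothetical stand-in for BlastHit with only the filtered fields."""
    subject_id: str
    percent_identity: float
    query_coverage: float
    alignment_length: int


def filter_hits(
    hits: List[Hit],
    min_identity: float = 90.0,
    min_coverage: float = 80.0,
    min_alignment_length: Optional[int] = None,
) -> List[Hit]:
    kept = []
    for hit in hits:
        if hit.percent_identity < min_identity:
            continue
        if hit.query_coverage < min_coverage:
            continue
        # The length check only applies when a minimum was requested.
        if min_alignment_length is not None and hit.alignment_length < min_alignment_length:
            continue
        kept.append(hit)
    return kept
```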

parse_blast_results(blast_output: Path, query_lengths: dict[str, int] | None = None) list[BlastHit][source]

Parse BLAST tabular output.

Parameters:
  • blast_output – Path to BLAST output file (tabular format)

  • query_lengths – dictionary of query sequence lengths. If None, uses qlen from results.

Returns:

list of BlastHit objects
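As a rough illustration of what parsing the tabular output involves, the sketch below reads tab-separated rows using the column order of the default outfmt of run_blastn. It returns plain dictionaries rather than BlastHit objects, and the query coverage formula (aligned query span divided by qlen) is an assumption; the source does not say how query_coverage is derived.

```python
import csv
import io

# Column order of the default outfmt string used by run_blastn.
COLUMNS = (
    "qseqid sseqid pident length qlen slen qstart qend "
    "sstart send evalue bitscore staxid staxids"
).split()
INT_COLUMNS = ("length", "qlen", "slen", "qstart", "qend", "sstart", "send")
FLOAT_COLUMNS = ("pident", "evalue", "bitscore")


def parse_tabular(handle):
    """Parse BLAST tabular rows from an open text handle into dictionaries."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(COLUMNS, fields))
        for key in INT_COLUMNS:
            row[key] = int(row[key])
        for key in FLOAT_COLUMNS:
            row[key] = float(row[key])
        # Assumed definition: percentage of the query spanned by the alignment.
        span = abs(row["qend"] - row["qstart"]) + 1
        row["query_coverage"] = 100.0 * span / row["qlen"] if row["qlen"] else 0.0
        rows.append(row)
    return rows


example = io.StringIO(
    "q1\tgb|MZ387488.1|\t98.5\t200\t250\t1500\t1\t200\t301\t500\t1e-50\t350\t9606\t9606\n"
)
parsed = parse_tabular(example)
```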

Raises:
run_blastn(query_fasta: Path, output_file: Path, database: str = 'core_nt', num_threads: int = 1, max_target_seqs: int = 100, evalue: float = 1e-05, outfmt: str = '6 qseqid sseqid pident length qlen slen qstart qend sstart send evalue bitscore staxid staxids') bool[source]

Run blastn search.

Parameters:
  • query_fasta – Path to query FASTA file

  • output_file – Path to output file

  • database – Name of BLAST database (default: “core_nt”)

  • num_threads – Number of threads to use

  • max_target_seqs – Maximum number of target sequences to report

  • evalue – E-value threshold

  • outfmt – Output format string

Returns:

True if BLAST ran successfully, False otherwise

Raises:
class eplace_lib.blast_analysis.FastaReader[source]

Bases: object

Class for reading FASTA files.

static get_sequence_lengths(fasta_path: Path) dict[str, int][source]

Get the length of each sequence in a FASTA file.

Parameters:

fasta_path – Path to the FASTA file

Returns:

dictionary mapping sequence IDs to their lengths

static read_fasta(fasta_path: Path) dict[str, str][source]

Read sequences from a FASTA file.

Parameters:

fasta_path – Path to the FASTA file

Returns:

dictionary mapping sequence IDs to sequences

Raises:
class eplace_lib.blast_analysis.MMseqs2Runner(db_path: Path | None = None)[source]

Bases: object

Class for running MMseqs2 searches and parsing results.

MMseqs2 (Many-against-Many sequence searching) is an alternative to BLAST for sequence similarity searching, offering improved speed and sensitivity. Results are parsed into BlastHit objects for compatibility with the rest of the ePLACE pipeline.

The target database can be either a pre-built MMseqs2 database (created with mmseqs createdb) or a FASTA file that MMseqs2 indexes automatically. Taxonomy fields (taxid) are populated only when the database was built with taxonomy information (mmseqs createtaxdb); otherwise they default to “0”.

Database selection: To keep results comparable with the BLAST workflow (which uses NCBI core_nt), the recommended MMseqs2 database should be built from the same underlying sequence collection as core_nt. This means creating an MMseqs2 database from the FASTA sequences that make up NCBI core_nt (e.g. by exporting them with blastdbcmd -db core_nt -entry all and then running mmseqs createdb). Using a different nucleotide collection will change the search space and may produce classification differences that reflect the database rather than the search algorithm. There is no official pre-built MMseqs2 core_nt database; users must provide their own.

__init__(db_path: Path | None = None)[source]

Initialize the MMseqs2Runner.

Parameters:

db_path – Path to the MMseqs2 database directory. If None the MMSEQS_DB_DIR environment variable is used; if unset, MMSEQS2DB is used as a legacy fallback; if both are unset, the directory ~/mmseqs2db is used.
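The lookup order described for db_path can be sketched as a small function. The name resolve_db_dir and the env parameter (added so the example can be exercised without touching the real environment) are hypothetical, and treating an empty variable as unset is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_db_dir(
    db_path: Optional[Path] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Path:
    """Pick the MMseqs2 database directory using the documented precedence."""
    env = os.environ if env is None else env
    if db_path is not None:
        return Path(db_path)  # an explicit argument always wins
    # MMSEQS_DB_DIR first, then the legacy MMSEQS2DB fallback.
    for variable in ("MMSEQS_DB_DIR", "MMSEQS2DB"):
        value = env.get(variable)
        if value:
            return Path(value)
    return Path.home() / "mmseqs2db"
```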

check_mmseqs_available() bool[source]

Check if mmseqs is available in the system PATH.

Returns:

True if mmseqs is available, False otherwise

filter_hits(hits: list[BlastHit], min_identity: float = 90.0, min_coverage: float = 80.0, min_alignment_length: int | None = None) list[BlastHit][source]

Filter MMseqs2 hits based on identity and coverage thresholds.

Parameters:
  • hits – list of BlastHit objects

  • min_identity – Minimum percent identity (default: 90.0)

  • min_coverage – Minimum query coverage percentage (default: 80.0)

  • min_alignment_length – Minimum alignment length (optional)

Returns:

Filtered list of BlastHit objects

parse_mmseqs_results(mmseqs_output: Path, query_lengths: dict[str, int] | None = None) list[BlastHit][source]

Parse MMseqs2 tabular output into BlastHit objects.

Expects output generated with --format-output set to: query,target,pident,alnlen,qlen,tlen,qstart,qend,tstart,tend,evalue,bits,taxid,taxlineage

The taxid and taxlineage columns are optional; if absent or set to “N/A” / “0”, subject_taxid will be stored as “0”.

Parameters:
  • mmseqs_output – Path to MMseqs2 output file

  • query_lengths – Unused; kept for API compatibility with BlastRunner.parse_blast_results.

Returns:

list of BlastHit objects

Raises:

Run an MMseqs2 easy-search.

The output is written in a tab-separated format with the following columns (in order): query, target, pident, alnlen, qlen, tlen, qstart, qend, tstart, tend, evalue, bits, taxid, taxlineage

Parameters:
  • query_fasta – Path to query FASTA file

  • output_file – Path to output file

  • database – Name of the MMseqs2 database inside db_path (default: “core_nt”). There is no official pre-built MMseqs2 core_nt database; users must build their own from the same sequence collection as BLAST core_nt to keep results comparable across backends.

  • num_threads – Number of threads to use

  • max_target_seqs – Maximum number of target sequences to report

  • evalue – E-value threshold

  • sensitivity – MMseqs2 sensitivity (1–7.5, default: 5.7)

  • tmp_dir – Temporary directory for MMseqs2 intermediate files. Defaults to a mmseqs_tmp subdirectory next to output_file.

  • search_type – MMseqs2 search type passed as --search-type to easy-search. Commonly used values: 2 (translated), 3 (nucleotide), 4 (translated nucleotide backtrace). Default is 3 (nucleotide). See MMseqs2 documentation for all valid values.

  • split_memory_limit – Maximum RAM for the MMseqs2 prefilter/index step, passed as --split-memory-limit to easy-search (e.g. "400G"). When None the flag is omitted and MMseqs2 uses its own default.

  • timeout – Maximum runtime for the MMseqs2 search in seconds (default: 3600).

Returns:

True if MMseqs2 ran successfully, False otherwise
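To make the mapping from these parameters to the MMseqs2 command line concrete, the sketch below assembles an easy-search argument list. It is not the runner's actual code: the function name is hypothetical, and the defaults shown for max_target_seqs and evalue are placeholders borrowed from run_blastn because the source does not state them for MMseqs2. The flags themselves (--format-output, --threads, --max-seqs, -e, -s, --search-type, --split-memory-limit) are standard mmseqs easy-search options.

```python
from pathlib import Path
from typing import List, Optional

FORMAT_OUTPUT = (
    "query,target,pident,alnlen,qlen,tlen,qstart,qend,"
    "tstart,tend,evalue,bits,taxid,taxlineage"
)


def build_easy_search_command(
    query_fasta: Path,
    target_db: Path,
    output_file: Path,
    tmp_dir: Path,
    num_threads: int = 1,
    max_target_seqs: int = 100,   # placeholder default, not confirmed by the source
    evalue: float = 1e-5,         # placeholder default, not confirmed by the source
    sensitivity: float = 5.7,
    search_type: int = 3,
    split_memory_limit: Optional[str] = None,
) -> List[str]:
    if not 1.0 <= sensitivity <= 7.5:
        raise ValueError(f"sensitivity must be between 1 and 7.5, got {sensitivity}")
    command = [
        "mmseqs", "easy-search",
        str(query_fasta), str(target_db), str(output_file), str(tmp_dir),
        "--format-output", FORMAT_OUTPUT,
        "--threads", str(num_threads),
        "--max-seqs", str(max_target_seqs),
        "-e", str(evalue),
        "-s", str(sensitivity),
        "--search-type", str(search_type),
    ]
    # The memory flag is omitted entirely when no limit is given.
    if split_memory_limit is not None:
        command += ["--split-memory-limit", split_memory_limit]
    return command
```

The resulting list could be handed to subprocess.run together with a timeout argument, which is one way the documented timeout parameter might be enforced.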

Raises:
eplace_lib.blast_analysis.normalize_sequence_id(seq_id: str) str[source]

Normalize an arbitrary sequence or tree label to a canonical accession-like identifier.

This is used to compare IDs from different sources (BLAST subject IDs, FASTA headers, tree leaf labels) that may be formatted differently but refer to the same sequence.

Normalization steps:

  1. Strip a leading ‘>’ (FASTA header prefix).

  2. Take only the first whitespace-delimited token.

  3. Remove MAFFT reverse-complement markers: a leading ‘_R_’ prefix or a trailing ‘_R_’ suffix.

  4. If the token contains pipes (‘|’), extract the accession via _extract_accession_from_pipe_id() (gi|…|gb|ACC|, ref|ACC|, gb|ACC|, etc.).

  5. Otherwise return the token unchanged.

Parameters:

seq_id – Raw sequence identifier from any source.

Returns:

Canonical accession string suitable for exact comparison.
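The normalization steps can be sketched as follows. The function name normalize_id is hypothetical, and the pipe handling is a deliberate simplification (it takes the last non-empty field) standing in for _extract_accession_from_pipe_id(), whose exact rules are not shown in this reference.

```python
def normalize_id(seq_id: str) -> str:
    """Illustrative version of the documented normalization steps."""
    token = seq_id.strip()
    # Step 1: drop the FASTA header prefix.
    if token.startswith(">"):
        token = token[1:]
    # Step 2: keep only the first whitespace-delimited token.
    parts = token.split()
    token = parts[0] if parts else ""
    # Step 3: remove MAFFT reverse-complement markers.
    if token.startswith("_R_"):
        token = token[len("_R_"):]
    if token.endswith("_R_"):
        token = token[:-len("_R_")]
    # Step 4 (simplified stand-in): last non-empty pipe-separated field.
    if "|" in token:
        fields = [field for field in token.split("|") if field]
        return fields[-1] if fields else token
    # Step 5: nothing else to do.
    return token
```

With this, a BLAST subject ID, a FASTA header, and a MAFFT-oriented tree label for the same record all reduce to one string that can be compared exactly.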

Convenience function to run BLAST search and return filtered hits.

Parameters:
  • query_fasta – Path to query FASTA file

  • output_file – Path to output file

  • min_identity – Minimum percent identity (default: 90.0)

  • min_coverage – Minimum query coverage percentage (default: 80.0)

  • database – Name of BLAST database (default: “core_nt”)

  • blastdb_path – Path to BLAST database directory

  • num_threads – Number of threads to use

  • skip_existing – Skip search if output file already exists (default: True)

Returns:

Tuple of (success: bool, filtered_hits: list[BlastHit])

Convenience function to run an MMseqs2 search and return filtered hits.

To keep results comparable with the BLAST workflow (which searches NCBI core_nt), the MMseqs2 database should be built from the same underlying sequence collection as core_nt. There is no official pre-built MMseqs2 core_nt database; users must create one from the relevant FASTA sequences (e.g. exported from BLAST core_nt with blastdbcmd). Using a different nucleotide collection changes the search space and may produce classification differences unrelated to the choice of search engine.

Parameters:
  • query_fasta – Path to query FASTA file

  • output_file – Path to output file

  • min_identity – Minimum percent identity (default: 90.0)

  • min_coverage – Minimum query coverage percentage (default: 80.0)

  • database – Name of MMseqs2 database inside db_path (default: “core_nt”)

  • db_path – Path to the MMseqs2 database directory

  • num_threads – Number of threads to use

  • sensitivity – MMseqs2 sensitivity (1–7.5, default: 5.7)

  • skip_existing – Skip search if output file already exists (default: True)

  • search_type – MMseqs2 search type passed as --search-type to easy-search. Commonly used values: 2 (translated), 3 (nucleotide), 4 (translated nucleotide backtrace). Default is 3 (nucleotide). See MMseqs2 documentation for all valid values.

  • memory_limit – Maximum RAM for the MMseqs2 prefilter/index step, passed as --split-memory-limit to easy-search (e.g. "400G"). When None the flag is omitted.

  • timeout – Maximum runtime for the MMseqs2 search in seconds (default: 3600).

Returns:

Tuple of (success: bool, filtered_hits: list[BlastHit])

Raises:

ValueError – If sensitivity is outside the valid range (1–7.5)

eplace_lib.blast_analysis.validate_mmseqs_memory_limit(value: str) str[source]

Validate an MMseqs2-style memory limit string.

Accepts a positive integer (no leading zeros) followed by a single unit suffix K, M, G, or T (case-sensitive, no space). Examples of valid values:

64G   128G   400G   1T   512M
Parameters:

value – The memory limit string to validate.

Returns:

The validated string unchanged.

Raises:

ValueError – If the string is empty, missing units, has an invalid unit suffix, or is otherwise malformed.
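The accepted format maps naturally onto a regular expression. The following is a minimal sketch of such a check under the rules stated above, with a hypothetical function name; the library's own error messages and internals may differ.

```python
import re

# Positive integer without leading zeros, then exactly one of K, M, G, T.
_MEMORY_LIMIT = re.compile(r"[1-9][0-9]*[KMGT]")


def validate_memory_limit(value: str) -> str:
    if not _MEMORY_LIMIT.fullmatch(value):
        raise ValueError(
            f"invalid memory limit {value!r}: expected a value such as '64G' or '512M'"
        )
    return value
```

Using fullmatch rather than match rejects trailing characters such as the "B" in "64GB".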

eplace_lib.taxonomy

Taxonomy extraction and sequence retrieval module.

This module provides functionality for extracting taxonomic information from BLAST results, selecting representative sequences per taxonomic rank, and extracting sequences from databases.

class eplace_lib.taxonomy.SequenceExtractor(blastdb_path: Path | None = None)[source]

Bases: object

Class for extracting sequences from BLAST databases.

__init__(blastdb_path: Path | None = None)[source]

Initialize the SequenceExtractor.

Parameters:

blastdb_path – Path to BLAST database directory. If None, uses BLASTDB env var.

check_blastdbcmd_available() bool[source]

Check if blastdbcmd is available in the system.

Returns:

True if blastdbcmd is available, False otherwise

extract_representatives_for_query(query_id: str, representative_hits: list[BlastHit], output_dir: Path, database: str = 'core_nt') Path | None[source]

Extract representative sequences for a single query to a FASTA file.

Parameters:
  • query_id – Query sequence identifier

  • representative_hits – list of representative BlastHit objects

  • output_dir – Output directory for FASTA files

  • database – Name of BLAST database

Returns:

Path to output FASTA file if successful, None otherwise

extract_sequences(sequence_ids: list[str], output_fasta: Path, database: str = 'core_nt') bool[source]

Extract sequences from BLAST database using blastdbcmd.

Parameters:
  • sequence_ids – list of sequence IDs to extract

  • output_fasta – Path to output FASTA file

  • database – Name of BLAST database (default: “core_nt”)

Returns:

True if extraction was successful, False otherwise

Raises:

RuntimeError – If blastdbcmd is not available

class eplace_lib.taxonomy.TaxonomyExtractor[source]

Bases: object

Class for extracting taxonomic information from sequence IDs.

group_hits_by_query(hits: list[BlastHit]) dict[str, list[BlastHit]][source]

Group BLAST hits by query sequence.

Parameters:

hits – list of BlastHit objects

Returns:

dictionary mapping query IDs to lists of hits

parse_taxids(tax_ids: list[str]) dict[str, dict[str, tuple[str, str]]][source]

Parse taxonomic information for the taxonomy IDs reported in the BLAST hits.

Parameters:

tax_ids – the taxonomy IDs reported by BLAST

Returns:

dictionary of dictionaries keyed by rank, where each value is a tuple of (taxonomy ID, name)

select_representatives_by_rank(hits: list[BlastHit], rank: str, max_per_rank: int = 1, preferred_representatives: Dict[str, str] | None = None) list[BlastHit][source]

Select representative sequences per taxonomic rank.

Parameters:
  • hits – list of BlastHit objects for a single query

  • rank – Taxonomic rank for representative selection

  • max_per_rank – Maximum number of representatives per rank (default: 1)

  • preferred_representatives – Optional dictionary mapping rank_tid to preferred subject_id to ensure consistent representatives across queries

Returns:

list of representative BlastHit objects
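One plausible reading of representative selection is sketched below: hits are grouped by their taxid at the requested rank, a preferred subject (if any) is placed first, and the remaining hits are ordered by bit score. The ordering criterion, the skipping of hits that lack the rank, and the Hit stand-in class are all assumptions made for illustration; the source does not specify them.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Hit:
    """Hypothetical stand-in for BlastHit."""
    subject_id: str
    bit_score: float
    subject_taxonomy: Optional[Dict[str, Tuple[str, str]]] = None


def select_representatives(
    hits: List[Hit],
    rank: str,
    max_per_rank: int = 1,
    preferred: Optional[Dict[str, str]] = None,
) -> List[Hit]:
    preferred = preferred or {}
    by_taxid: Dict[str, List[Hit]] = defaultdict(list)
    for hit in hits:
        entry = (hit.subject_taxonomy or {}).get(rank)
        if entry is None:
            continue  # assumed: hits without this rank are ignored
        by_taxid[entry[0]].append(hit)

    selected: List[Hit] = []
    for taxid, group in by_taxid.items():
        # Preferred subject first (False sorts before True), then best bit score.
        ordered = sorted(
            group,
            key=lambda h: (h.subject_id != preferred.get(taxid), -h.bit_score),
        )
        selected.extend(ordered[:max_per_rank])
    return selected
```

Passing the same preferred mapping for every query is what keeps the chosen representative stable across queries.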

eplace_lib.taxonomy.generate_classification_summary(sequences: dict[str, str], blast_hits: List[BlastHit], output_file: Path, rank: str = 'genus', group_rank: str = 'class', tree_label_rank: str = 'genus', tree_files: dict[str, Path] | None = None) bool[source]

Generate a classification summary TSV file for each query sequence.

This function creates a TSV file that reports:

  • Query sequence ID

  • Closest organism at the classification rank (--rank)

  • Closest organism at the grouping rank (--group-rank)

  • Closest organism at the tree labeling rank (--tree-label-rank)

  • Whether the sequence appears in multiple groups

  • Whether the sequence has no appropriate classification

The classification is based on the phylogenetically nearest neighbor in the tree when a tree is available; otherwise it falls back to the best BLAST hit by bit score.

Parameters:
  • sequences – dictionary of query sequences read from the FASTA file

  • blast_hits – List of BlastHit objects with taxonomy information

  • output_file – Path to output TSV file

  • rank – Taxonomic rank for classification (default: genus)

  • group_rank – Taxonomic rank for grouping (default: class)

  • tree_label_rank – Taxonomic rank for tree labeling (default: genus)

  • tree_files – Optional dict mapping query_id to tree file paths for finding nearest neighbors

Returns:

True if successful, False otherwise

eplace_lib.taxonomy.process_blast_results_for_taxonomy(blast_hits: List[BlastHit], output_dir: Path, rank: str = 'genus', database: str = 'core_nt', blastdb_path: Path | None = None) Dict[str, Path | None][source]

Process BLAST hits to extract representative sequences per taxonomic rank.

Parameters:
  • blast_hits – list of BlastHit objects

  • output_dir – Output directory for FASTA files

  • rank – Taxonomic rank for representative selection

  • database – Name of BLAST database

  • blastdb_path – Path to BLAST database directory

Returns:

dictionary mapping query IDs to output FASTA file paths

eplace_lib.taxonomy.rewrite_blast_hits(blast_hits: List[BlastHit], output_file: Path, header: bool = True) bool[source]

Write the BLAST hits back to a file after they have been annotated.

Parameters:
  • blast_hits – list of BlastHit objects

  • output_file – the file to write to

  • header – whether to include a header line in the file

Returns:

True on success

eplace_lib.taxonomy.sort_strings_and_numbers(s: str)[source]

Extract text and numbers from strings for proper sorting.

Parameters:

s – string to extract the number from

Returns:

A tuple (text_part, num_part) that can be used as a sort key. For strings matching the pattern <non-digits><digits>, this is the non-digit prefix and the trailing integer. For non-matching strings, returns (s, 0).
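The sort key described here can be approximated with a regular expression. In this sketch the function name is hypothetical, and requiring the whole string to match with a non-empty prefix is an assumption about what <non-digits><digits> means.

```python
import re
from typing import Tuple

_TEXT_THEN_DIGITS = re.compile(r"(\D+)(\d+)")


def text_number_key(s: str) -> Tuple[str, int]:
    """Split 'seq10' into ('seq', 10) so numeric suffixes sort numerically."""
    match = _TEXT_THEN_DIGITS.fullmatch(s)
    if match:
        return match.group(1), int(match.group(2))
    return s, 0


ordered = sorted(["seq10", "seq2", "seq1"], key=text_number_key)
```

A plain string sort would place "seq10" before "seq2"; the key avoids that by comparing the numeric part as an integer.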

eplace_lib.sequences

Sequence analysis module for eDNA data.

This module provides basic functionality for analyzing environmental DNA sequences.

class eplace_lib.sequences.SequenceAnalyzer[source]

Bases: object

A class for analyzing eDNA sequences.

This class provides methods for basic sequence analysis operations commonly used in environmental DNA studies.

calculate_gc_content(sequence: str) float[source]

Calculate the GC content of a DNA sequence.

Parameters:

sequence – DNA sequence string

Returns:

GC content as a percentage (0-100)

count_bases(sequence: str) Dict[str, int][source]

Count the occurrence of each base in a DNA sequence.

Parameters:

sequence – DNA sequence string

Returns:

Dictionary with base counts

reverse_complement(sequence: str) str[source]

Calculate the reverse complement of a DNA sequence.

Parameters:

sequence – DNA sequence string

Returns:

Reverse complement of the input sequence
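The three operations of SequenceAnalyzer are simple enough to sketch as standalone functions. These are illustrations, not the class's code: the case handling, the treatment of an empty sequence (reported as 0.0), and the pass-through of ambiguous bases such as N are assumptions.

```python
from collections import Counter
from typing import Dict

# Translation table for the four unambiguous bases in both cases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def gc_content(sequence: str) -> float:
    """GC content as a percentage between 0 and 100."""
    if not sequence:
        return 0.0  # assumed behaviour for empty input
    upper = sequence.upper()
    return 100.0 * (upper.count("G") + upper.count("C")) / len(upper)


def count_bases(sequence: str) -> Dict[str, int]:
    """Count each character after upper-casing the sequence."""
    return dict(Counter(sequence.upper()))


def reverse_complement(sequence: str) -> str:
    """Complement each base, then reverse; other characters are left as they are."""
    return sequence.translate(_COMPLEMENT)[::-1]
```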

eplace_lib.alignment

Sequence alignment and phylogenetic tree building module.

This module provides functionality for trimming sequences based on BLAST alignments, aligning sequences using MAFFT, and building phylogenetic trees using IQTree.

class eplace_lib.alignment.IQTreeBuilder[source]

Bases: object

Class for building phylogenetic trees using IQTree.

static build_tree(alignment_fasta: Path, output_prefix: Path, model: str = 'MFP', num_threads: int = None) bool[source]

Build a phylogenetic tree using IQTree.

Parameters:
  • alignment_fasta – Path to aligned FASTA file

  • output_prefix – Prefix for output files

  • model – Substitution model (default: “MFP” for automatic ModelFinder Plus selection)

  • num_threads – Number of threads to use (default: None, which uses AUTO)

Returns:

True if tree building was successful, False otherwise

static build_tree_background(alignment_fasta: Path, output_prefix: Path, model: str = 'MFP') Dict | None[source]

Start building a phylogenetic tree using IQTree in the background.

This method starts IQTree as a background process and returns immediately, allowing multiple trees to be built in parallel.

Parameters:
  • alignment_fasta – Path to aligned FASTA file

  • output_prefix – Prefix for output files

  • model – Substitution model (default: “MFP” for automatic ModelFinder Plus selection)

Returns:

Dictionary with process information if successful, None otherwise. The dictionary contains:

  • ‘process’: subprocess.Popen object

  • ‘output_prefix’: output prefix path

  • ‘alignment_fasta’: input alignment file path

  • ‘tree_file’: expected tree file path

static check_iqtree_available() Tuple[bool, str | None][source]

Check if IQTree is available in the system.

Returns:

Tuple of (available: bool, command: str or None)

static relabel_tree_with_taxonomy(tree_file: Path, blast_hits: List[BlastHit], output_tree: Path, taxonomic_rank: str) bool[source]

Relabel tree nodes with taxonomic names.

This reads a Newick tree file and replaces sequence IDs with taxonomic names from the BLAST hits.

Parameters:
  • tree_file – Path to input tree file (Newick format)

  • blast_hits – List of BlastHit objects with taxonomic information

  • output_tree – Path to output tree file with relabeled nodes

  • taxonomic_rank – the taxonomic rank to use for relabeling (e.g., “genus”)

Returns:

True if successful, False otherwise

static wait_for_tree_jobs(jobs: List[Dict], timeout: int = 14400) Dict[str, bool][source]

Wait for multiple IQTree jobs to complete.

This method polls all running processes and waits for them to complete. Since the processes are already running in parallel (started with Popen), this method just collects their results as they finish.

Parameters:
  • jobs – List of job dictionaries returned by build_tree_background()

  • timeout – Maximum time to wait for each individual job in seconds (default: 14400, i.e. 4 hours). The default is deliberately large because the combined tree built at the end of the pipeline can take much longer than the individual trees.

Returns:

Dictionary mapping tree_file path to success status (True/False)

class eplace_lib.alignment.MAFFTAligner[source]

Bases: object

Class for running MAFFT sequence alignments.

static align_sequences(input_fasta: Path, output_fasta: Path, auto_orient: bool = True, num_threads: int = 1, strategy: str = 'default') bool[source]

Align sequences using MAFFT.

Parameters:
  • input_fasta – Path to input FASTA file with sequences to align

  • output_fasta – Path to output aligned FASTA file

  • auto_orient – Use MAFFT’s auto-orient feature (default: True)

  • num_threads – Number of threads to use

  • strategy – MAFFT alignment strategy (default: ‘default’). Options: ‘default’, ‘auto’ (let MAFFT choose the best strategy automatically), ‘retree2’ (fast progressive method, good for large datasets), and ‘fftns’ (fastest method for very large datasets).

Returns:

True if alignment was successful, False otherwise

static check_mafft_available() bool[source]

Check if MAFFT is available in the system.

Returns:

True if MAFFT is available, False otherwise

class eplace_lib.alignment.SequenceTrimmer[source]

Bases: object

Class for trimming sequences based on BLAST alignment coordinates.

static trim_sequence_by_coordinates(sequence: str, start: int, end: int) str[source]

Trim a sequence to extract the region between start and end coordinates.

BLAST coordinates are 1-indexed, so we need to adjust for Python’s 0-indexing.

Parameters:
  • sequence – The full sequence string

  • start – Start position (1-indexed, inclusive)

  • end – End position (1-indexed, inclusive)

Returns:

Trimmed sequence string
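The coordinate adjustment is a one-line slice. The sketch below uses a hypothetical function name and assumes start <= end; how reversed coordinates from minus-strand hits are handled is not stated in this reference.

```python
def trim_by_coordinates(sequence: str, start: int, end: int) -> str:
    """Return the 1-indexed, inclusive region [start, end] of a sequence."""
    # Subtract 1 from start for 0-indexing; end stays as is because
    # Python slices exclude the end position.
    return sequence[start - 1:end]
```

For example, positions 3 to 6 of "ACGTACGT" correspond to the slice [2:6].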

static trim_sequences_from_blast_hits(fasta_path: Path, blast_hits: List[BlastHit], output_fasta: Path, query_id: str, taxonomic_rank: str) bool[source]

Trim sequences in a FASTA file based on BLAST hit coordinates.

This reads the representative sequences, trims them to the aligned regions, and writes them to a new FASTA file along with the query sequence.

Parameters:
  • fasta_path – Path to input FASTA file with full-length sequences

  • blast_hits – List of BlastHit objects for this query

  • output_fasta – Path to output FASTA file with trimmed sequences

  • query_id – The query sequence ID to include in output

  • taxonomic_rank – the taxonomic rank to use for taxonomic labels (e.g., “genus”)

Returns:

True if successful, False otherwise

class eplace_lib.alignment.SimpleNewickNode(name: str = '', distance: float = 0.0)[source]

Bases: object

Simple Newick tree node representation for finding nearest neighbors.

get_leaves() List[SimpleNewickNode][source]

Get all leaf nodes under this node.

is_leaf() bool[source]

Check if this node is a leaf.

eplace_lib.alignment.check_alignment_consistency(blast_hits: List[BlastHit], tolerance: int = 50) Dict[str, bool][source]

Check if BLAST hits align to similar locations on reference sequences.

For each reference sequence that appears in multiple hits, check if the alignment coordinates are consistent (within tolerance).

Parameters:
  • blast_hits – List of BlastHit objects to check

  • tolerance – Maximum allowed difference in coordinates (default: 50 bp)

Returns:

Dictionary mapping subject_id to consistency status (True if consistent)
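One way to read "consistent within tolerance" is that, for a given subject, all alignment starts lie within tolerance of each other and likewise all ends. The sketch below implements that reading with a hypothetical Hit stand-in; the exact criterion, the normalisation of reversed coordinates, and reporting single-hit subjects as consistent are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Hit:
    """Hypothetical stand-in for BlastHit with only the subject coordinates."""
    subject_id: str
    subject_start: int
    subject_end: int


def check_consistency(hits: List[Hit], tolerance: int = 50) -> Dict[str, bool]:
    spans: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for hit in hits:
        # Order the pair so minus-strand hits compare on the same footing.
        low, high = sorted((hit.subject_start, hit.subject_end))
        spans[hit.subject_id].append((low, high))

    result = {}
    for subject_id, coords in spans.items():
        starts = [low for low, _ in coords]
        ends = [high for _, high in coords]
        result[subject_id] = (
            max(starts) - min(starts) <= tolerance
            and max(ends) - min(ends) <= tolerance
        )
    return result
```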

eplace_lib.alignment.concatenate_all_groups_and_build_tree(output_dir: Path, query_fasta: Path, classification_file: Path, blast_hits: List[BlastHit], combined_tree_label_rank: str = 'genus', num_threads: int = 1, alignment_strategy: str = 'auto') Dict[str, Path | None][source]

Concatenate all group _trimmed.fasta files, add queries with no BLAST hits, and build a final alignment and tree.

This function:

  1. Finds all *_trimmed.fasta files in group directories

  2. Reads the classification file to identify queries with no BLAST hits

  3. Concatenates all sequences into a single file

  4. Uses MAFFT to build an alignment (with optimal parameters for many sequences)

  5. Uses IQTree to build a phylogenetic tree

  6. Relabels tree nodes with taxonomic names

Parameters:
  • output_dir – Output directory containing group subdirectories

  • query_fasta – Original query FASTA file

  • classification_file – Path to classifications.tsv file

  • blast_hits – List of all BlastHit objects with taxonomy information

  • combined_tree_label_rank – Taxonomic rank for tree labeling (default: genus)

  • num_threads – Number of threads for alignment and tree building (default: 1)

  • alignment_strategy – MAFFT alignment strategy (default: ‘auto’) Options: ‘default’, ‘auto’, ‘retree2’, ‘fftns’

Returns:

Dictionary with paths to generated files:

  • ‘combined_fasta’: Combined sequences from all groups + zero-hit queries

  • ‘alignment’: Aligned sequences

  • ‘tree’: Phylogenetic tree

  • ‘labeled_tree’: Tree with taxonomic labels

eplace_lib.alignment.create_grouped_fasta_with_queries(group_tid: str, group_name: str, query_hits_map: Dict[str, List[BlastHit]], labeling_rank: str, query_fasta: Path, output_fasta: Path, database: str = 'core_nt', blastdb_path: Path | None = None) bool[source]

Create a FASTA file for a taxonomic group containing all queries and unique references.

Parameters:
  • group_tid – Taxonomy ID of the group

  • group_name – Name of the taxonomic group

  • query_hits_map – Dictionary mapping query_id to list of BlastHit objects

  • labeling_rank – Taxonomic rank to use for labeling (e.g., “genus”)

  • query_fasta – Path to original query FASTA file

  • output_fasta – Path to output grouped FASTA file

  • database – Name of BLAST database

  • blastdb_path – Path to BLAST database directory

Returns:

True if successful, False otherwise

eplace_lib.alignment.find_nearest_neighbor_in_tree(tree_file: Path, query_id: str) str | None[source]

Find the nearest neighbor (closest leaf) to a query sequence in a phylogenetic tree.

This function parses the Newick tree and finds the leaf node that is phylogenetically closest to the query sequence based on tree topology and branch lengths.

Parameters:
  • tree_file – Path to the Newick tree file (.treefile)

  • query_id – Query sequence identifier to find neighbors for

Returns:

Name of the nearest neighbor leaf node, or None if not found or error

eplace_lib.alignment.group_hits_by_group_rank(blast_hits: List[BlastHit], group_rank: str) Dict[str, Dict[str, List[BlastHit]]][source]

Group BLAST hits by group_rank across all queries.

Parameters:
  • blast_hits – List of BlastHit objects with group taxonomy information

  • group_rank – Taxonomic rank used to group the hits

Returns:

Dictionary mapping group_rank_name (taxonomy name) to another dict mapping query_id to list of hits. Format: {group_rank_name: {query_id: [hits]}}

eplace_lib.alignment.parse_simple_newick(newick_str: str) SimpleNewickNode | None[source]

Parse a simple Newick tree string into a tree structure.

This is a lightweight parser that handles basic Newick format with branch lengths. Format: ((A:0.1,B:0.2):0.3,C:0.4);

Parameters:

newick_str – Newick format tree string

Returns:

Root node of the parsed tree, or None if parsing fails
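A lightweight parser of this kind, together with a nearest-neighbor lookup in the spirit of find_nearest_neighbor_in_tree, can be sketched in a few dozen lines. Everything here is illustrative: the Node class is a stand-in for SimpleNewickNode, quoted labels and comments are not handled, and using the summed branch length between leaves (patristic distance) as the notion of "closest" is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    name: str = ""
    distance: float = 0.0
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def parse_newick(text: str) -> Optional[Node]:
    """Recursive-descent parser for basic Newick with branch lengths."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node() -> Node:
        nonlocal pos
        node = Node()
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            while True:
                node.children.append(parse_node())
                if pos < len(text) and text[pos] == ",":
                    pos += 1
                    continue
                break
            if pos >= len(text) or text[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1  # consume ")"
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        node.name = text[start:pos]
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            node.distance = float(text[start:pos])  # ValueError if malformed
        return node

    try:
        root = parse_node()
    except ValueError:
        return None
    return root if pos == len(text) else None


def nearest_leaf(root: Node, query: str) -> Optional[str]:
    """Return the leaf with the smallest summed branch length to the query leaf."""
    paths: Dict[str, List[Tuple[int, float]]] = {}

    def walk(node: Node, depth: float, chain: List[Tuple[int, float]]) -> None:
        depth += node.distance
        chain = chain + [(id(node), depth)]  # ancestors with distance from root
        if node.is_leaf():
            paths[node.name] = chain
        for child in node.children:
            walk(child, depth, chain)

    walk(root, 0.0, [])
    if query not in paths:
        return None
    query_chain = paths[query]
    best: Optional[Tuple[float, str]] = None
    for name, chain in paths.items():
        if name == query:
            continue
        shared_depth = 0.0  # depth of the last common ancestor
        for (a, depth_a), (b, _) in zip(query_chain, chain):
            if a != b:
                break
            shared_depth = depth_a
        dist = query_chain[-1][1] + chain[-1][1] - 2 * shared_depth
        if best is None or dist < best[0]:
            best = (dist, name)
    return best[1] if best else None
```

On the example tree ((A:0.1,B:0.2):0.3,C:0.4); the nearest leaf to A is B, since they share the internal node and are separated by only 0.1 + 0.2.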

eplace_lib.alignment.process_grouped_alignment_and_tree(group_name: str, group_dir: Path, taxonomic_rank: str, blast_hits: List[BlastHit], query_ids: List[str], num_threads: int = 1) Dict[str, Path | None][source]

Complete pipeline for a taxonomic group: trim, align, and build tree.

Parameters:
  • group_name – The name of the group, used for file naming

  • group_dir – Directory containing group-specific files

  • taxonomic_rank – Taxonomic rank to use for labeling the tree

  • blast_hits – List of BlastHit objects for all queries in the group

  • query_ids – List of query sequence IDs in this group

  • num_threads – Number of threads to use

Returns:

Dictionary with paths to generated files:

  • ‘combined_fasta’: Combined sequences (queries + references)

  • ‘trimmed_fasta’: Trimmed sequences

  • ‘alignment’: Aligned sequences

  • ‘tree’: Phylogenetic tree

  • ‘labeled_tree’: Tree with taxonomic labels

eplace_lib.alignment.process_grouped_alignment_and_tree_parallel(group_name: str, group_dir: Path, taxonomic_rank: str, blast_hits: List[BlastHit], query_ids: List[str], num_threads: int = 1, background_tree: bool = False) Dict[str, Path | None][source]

Complete pipeline for a taxonomic group: trim, align, and optionally build tree in background.

This is similar to process_grouped_alignment_and_tree, but with an option to start tree building in the background and return immediately without waiting for completion.

Parameters:
  • group_name – The name of the group, used for file naming

  • group_dir – Directory containing group-specific files

  • taxonomic_rank – Taxonomic rank to use for labeling the tree

  • blast_hits – List of BlastHit objects for all queries in the group

  • query_ids – List of query sequence IDs in this group

  • num_threads – Number of threads to use

  • background_tree – If True, start tree building in background and return immediately

Returns:

Dictionary with paths to generated files:

  • 'combined_fasta': Combined sequences (queries + references)

  • 'trimmed_fasta': Trimmed sequences

  • 'alignment': Aligned sequences

  • 'tree_job': Background job info if background_tree=True, None otherwise

  • 'tree_file': Expected tree file path

  • 'blast_hits': BLAST hits for later tree relabeling

  • 'taxonomic_rank': Taxonomic rank for later tree relabeling
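The background option follows a submit-now, collect-later pattern. The sketch below illustrates only that pattern, using concurrent.futures and a stub in place of the tree builder. The names run_pipeline_sketch and build_tree_stub, the .treefile suffix, and the use of a Future as 'tree_job' are all assumptions; the job object ePLACE actually returns may be something else, such as a subprocess handle.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Any, Dict

_executor = ThreadPoolExecutor(max_workers=1)


def build_tree_stub(alignment: Path, tree_file: Path) -> Path:
    """Stand-in for a slow tree-building step (hypothetical)."""
    time.sleep(0.1)
    return tree_file


def run_pipeline_sketch(alignment: Path, background_tree: bool = False) -> Dict[str, Any]:
    """Return result paths; optionally leave tree building running."""
    tree_file = alignment.with_suffix(".treefile")  # assumed naming scheme
    result: Dict[str, Any] = {
        "alignment": alignment,
        "tree_file": tree_file,
        "tree_job": None,
    }
    if background_tree:
        # Return immediately; the caller collects the tree later.
        result["tree_job"] = _executor.submit(build_tree_stub, alignment, tree_file)
    else:
        build_tree_stub(alignment, tree_file)  # blocks until the tree is built
    return result


result = run_pipeline_sketch(Path("group.aln.fasta"), background_tree=True)
# ... other groups can be trimmed and aligned here ...
finished_tree = result["tree_job"].result()  # blocks until the job completes
_executor.shutdown(wait=True)
```

The caller keeps 'tree_file' so it knows where to look once the job finishes, which is why the documented return value carries the hits and rank needed for relabeling later.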

eplace_lib.alignment.process_query_alignment_and_tree(query_id: str, query_dir: Path, blast_hits: List[BlastHit], query_fasta: Path, taxonomic_rank: str, num_threads: int = 1) Dict[str, Path | None][source]

Complete pipeline for a single query: trim, align, and build tree.

Parameters:
  • query_id – Query sequence identifier

  • query_dir – Directory containing query-specific files

  • blast_hits – List of BlastHit objects for this query (with taxonomy info)

  • query_fasta – Path to original query FASTA file

  • taxonomic_rank – The taxonomic rank to use for relabeling the tree

  • num_threads – Number of threads to use

Returns:

Dictionary with paths to generated files:

  • 'trimmed_fasta': Trimmed sequences

  • 'alignment': Aligned sequences

  • 'tree': Phylogenetic tree

  • 'labeled_tree': Tree with taxonomic labels

eplace_lib.alignment.process_query_alignment_and_tree_parallel(query_id: str, query_dir: Path, blast_hits: List[BlastHit], query_fasta: Path, taxonomic_rank: str, num_threads: int = 1, background_tree: bool = False) Dict[str, Path | None][source]

Complete pipeline for a single query: trim, align, and optionally build tree in background.

This is similar to process_query_alignment_and_tree, but with an option to start tree building in the background and return immediately without waiting for completion.

Parameters:
  • query_id – Query sequence identifier

  • query_dir – Directory containing query-specific files

  • blast_hits – List of BlastHit objects for this query (with taxonomy info)

  • query_fasta – Path to original query FASTA file

  • taxonomic_rank – The taxonomic rank to use for relabeling the tree

  • num_threads – Number of threads to use

  • background_tree – If True, start tree building in background and return immediately

Returns:

Dictionary with paths to generated files:

  • 'trimmed_fasta': Trimmed sequences

  • 'alignment': Aligned sequences

  • 'tree_job': Background job info if background_tree=True, None otherwise

  • 'tree_file': Expected tree file path

  • 'blast_hits': BLAST hits for later tree relabeling

  • 'taxonomic_rank': Taxonomic rank for later tree relabeling

eplace_lib.alignment.trim_grouped_sequences(input_fasta: Path, blast_hits: List[BlastHit], output_fasta: Path, query_ids: List[str]) bool[source]

Trim sequences in a grouped FASTA file based on BLAST hit coordinates.

This is similar to trim_sequences_from_blast_hits but handles multiple queries.

Parameters:
  • input_fasta – Path to input FASTA file with full-length sequences

  • blast_hits – List of BlastHit objects for all queries in the group

  • output_fasta – Path to output FASTA file with trimmed sequences

  • query_ids – List of query sequence IDs to include (untrimmed)

Returns:

True if successful, False otherwise
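To make the trimming step concrete, the sketch below cuts each reference sequence down to its hit region and passes query sequences through unchanged. It works on in-memory dictionaries rather than FASTA files, and the helper names are hypothetical. It assumes BLAST's 1-based inclusive coordinates and that a minus-strand hit is reported with subject_start greater than subject_end. Reverse-complementing such hits is an assumption about the desired behavior, not a statement of what ePLACE does. It also assumes a single coordinate pair per subject, whereas a grouped run may have several hits on the same subject.

```python
from typing import Dict, Set, Tuple

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def trim_to_hit(sequence: str, start: int, end: int) -> str:
    """Return the 1-based inclusive hit region of a subject sequence."""
    if start <= end:
        return sequence[start - 1:end]
    # Minus-strand hit: take the region, then reverse-complement it.
    return sequence[end - 1:start].translate(_COMPLEMENT)[::-1]


def trim_group_sketch(
    sequences: Dict[str, str],
    hit_coords: Dict[str, Tuple[int, int]],
    query_ids: Set[str],
) -> Dict[str, str]:
    """Trim references to their hit coordinates; keep queries untrimmed."""
    trimmed: Dict[str, str] = {}
    for seq_id, seq in sequences.items():
        if seq_id in query_ids:
            trimmed[seq_id] = seq  # queries are included as-is
        elif seq_id in hit_coords:
            start, end = hit_coords[seq_id]
            trimmed[seq_id] = trim_to_hit(seq, start, end)
        # Sequences with no hit and no query ID are dropped.
    return trimmed


trimmed = trim_group_sketch(
    sequences={"q1": "ACGTACGT", "s1": "AACCGGTT", "s2": "TTTT"},
    hit_coords={"s1": (3, 6)},
    query_ids={"q1"},
)
```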

eplace_lib.ncbi_download

NCBI database download module.

This module provides functionality for downloading and managing NCBI BLAST databases, specifically the core nucleotide (core_nt) database, and for setting up MMseqs2 NT databases with their taxonomy sidecar files.

class eplace_lib.ncbi_download.MMseqsDownloader(db_dir: Path | None = None)[source]

Bases: object

Download and configure MMseqs2 NT databases and taxonomy sidecar files.

Directory resolution for MMseqs2 databases prefers $MMSEQS_DB_DIR, then $MMSEQS2DB (legacy), then ~/mmseqs2db.

Workflow:

  1. Download NT with mmseqs databases.

  2. Optionally fetch accession2taxid files and build a mapping TSV.

  3. Attach taxonomy sidecars with mmseqs createtaxdb.

ACC2TAXID_BASE = 'https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/'
LEGACY_MMSEQS_DB_DIR_ENV = 'MMSEQS2DB'
MMSEQS_DB_DIR_ENV = 'MMSEQS_DB_DIR'
NUCLEOTIDE_DB_NAME = 'NT'
__init__(db_dir: Path | None = None)[source]

Initialize the MMseqs downloader.

add_taxonomy_to_database(mmseqs_db: Path, ncbi_taxonomy: Path, threads: int = 1, acc2taxid_dir: Path | None = None, taxonomy_workdir: Path | None = None) Tuple[bool, str][source]

Add NCBI taxonomy sidecar files to an MMseqs2 NT database.

Parameters:
  • mmseqs_db – Path to MMseqs2 NT database base file (e.g. .../NT).

  • ncbi_taxonomy – Directory with NCBI taxonomy dump files.

  • threads – Number of threads for mmseqs createtaxdb.

  • acc2taxid_dir – Optional directory with accession2taxid files.

  • taxonomy_workdir – Optional working directory for mapping files.

Returns:

Tuple (success, message).

Side effects:

Creates mapping files in taxonomy_workdir and writes MMseqs2 taxonomy sidecar files adjacent to mmseqs_db.

download_nt_database(force_download: bool = False, threads: int = 1) Tuple[bool, str, Path | None][source]

Download MMseqs2 NT database using mmseqs databases.

get_mmseqsdb_directory() Path[source]

Get or determine the MMseqs2 database directory.

Resolution order:

  1. Explicit path passed at initialization.

  2. $MMSEQS_DB_DIR.

  3. $MMSEQS2DB (legacy fallback).

  4. ~/mmseqs2db.
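The resolution order above can be sketched as a short fallback chain. The function name resolve_mmseqs_db_dir and its env parameter (added so the logic can be exercised without touching the real environment) are illustrative and not part of the documented API. Whether the library also creates the directory is not covered here.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_mmseqs_db_dir(
    explicit: Optional[Path] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Path:
    """Pick the database directory: explicit path, env vars, then default."""
    env = os.environ if env is None else env
    if explicit is not None:
        return Path(explicit)
    # Preferred variable first, legacy variable second.
    for var in ("MMSEQS_DB_DIR", "MMSEQS2DB"):
        value = env.get(var)
        if value:
            return Path(value).expanduser()
    return Path.home() / "mmseqs2db"


chosen = resolve_mmseqs_db_dir(env={"MMSEQS2DB": "/legacy", "MMSEQS_DB_DIR": "/new"})
```

With both variables set, the preferred $MMSEQS_DB_DIR wins over the legacy $MMSEQS2DB.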

class eplace_lib.ncbi_download.NCBIDownloader[source]

Bases: object

A class for managing NCBI BLAST database downloads.

This class handles checking for existing databases, downloading from NCBI FTP, verifying checksums, and extracting database files.

CORE_NT_PREFIX = 'core_nt'
NCBI_FTP_BASE = 'https://ftp.ncbi.nlm.nih.gov/blast/db/'
__init__()[source]

Initialize the NCBIDownloader.

check_database_exists(db_dir: Path | None = None) bool[source]

Check if NCBI core_nt database files exist in the specified directory.

Parameters:

db_dir – Directory to check. If None, uses the default BLASTDB directory.

Returns:

True if at least one core_nt database file exists, False otherwise

download_and_setup_database(force_download: bool = False, verbose: bool = True) Tuple[bool, str][source]

Main function to download and setup the NCBI core_nt database.

This function:

  1. Determines the BLASTDB directory

  2. Checks if database already exists (unless force_download is True)

  3. Downloads all core_nt.* files from NCBI FTP

  4. Verifies MD5 checksums

  5. Extracts the database files

Parameters:
  • force_download – If True, downloads even if database exists

  • verbose – If True, logs progress information (default: True)

Returns:

Tuple of (success: bool, message: str)

download_file(filename: str, dest_dir: Path, show_progress: bool = True) Path[source]

Download a file from NCBI FTP server.

Parameters:
  • filename – Name of the file to download

  • dest_dir – Destination directory

  • show_progress – Whether to show download progress (not implemented yet)

Returns:

Path to the downloaded file

Raises:
  • URLError – If download fails

  • ValueError – If filename contains path traversal sequences
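The ValueError case guards against filenames that would escape the destination directory. One way such a check could look is sketched below; validate_download_filename is a hypothetical helper and the exact rules ePLACE applies are not documented here, so treat the rejected patterns as assumptions.

```python
from pathlib import Path


def validate_download_filename(filename: str) -> str:
    """Accept only a bare file name with no directory components."""
    if (
        not filename
        or filename in (".", "..")
        or "/" in filename
        or "\\" in filename
        or "\x00" in filename
    ):
        raise ValueError(f"unsafe filename: {filename!r}")
    return filename


# A valid name can be joined to the destination without leaving it.
dest_dir = Path("blastdb")
target = dest_dir / validate_download_filename("core_nt.00.tar.gz")
```

Validating before building either the URL or the local path keeps both inside their intended roots.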

extract_tarball(tarball_path: Path, dest_dir: Path) None[source]

Extract a tar.gz file to the specified directory.

Parameters:
  • tarball_path – Path to the tar.gz file

  • dest_dir – Destination directory for extraction

Raises:
get_available_files() List[str][source]

Get list of available core_nt files from NCBI FTP server.

Returns:

List of filenames matching core_nt pattern

Raises:

URLError – If unable to connect to FTP server

get_blastdb_directory() Path[source]

Get or determine the BLASTDB directory.

Checks if the BLASTDB environment variable is set. If it exists and points to a valid directory, uses that. Otherwise, creates and returns a path to ~/blastdb.

Returns:

Path object pointing to the BLASTDB directory

verify_md5(file_path: Path, md5_file_path: Path) bool[source]

Verify the MD5 checksum of a file.

Parameters:
  • file_path – Path to the file to verify

  • md5_file_path – Path to the MD5 checksum file

Returns:

True if checksum matches, False otherwise

Raises:

ValueError – If MD5 file format is invalid
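A checksum comparison of this kind can be sketched with hashlib. The sketch assumes the checksum file uses the md5sum layout (a 32-character hex digest, whitespace, then the file name); verify_md5_sketch is an illustrative name, and the real method may parse or report errors differently.

```python
import hashlib
import string
from pathlib import Path


def verify_md5_sketch(file_path: Path, md5_file_path: Path) -> bool:
    """Compare a file's MD5 digest with the one recorded in an .md5 file."""
    fields = md5_file_path.read_text().split()
    if (
        not fields
        or len(fields[0]) != 32
        or any(c not in string.hexdigits for c in fields[0])
    ):
        raise ValueError(f"invalid MD5 file format: {md5_file_path}")
    expected = fields[0].lower()

    digest = hashlib.md5()
    with file_path.open("rb") as handle:
        # Read in 1 MiB chunks so large archives are not loaded into memory.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected
```

Reading in chunks matters here because database archives can be many gigabytes.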

eplace_lib.ncbi_download.check_available_memory_gb(required_gb: float) Tuple[bool, float][source]

Check whether total system memory meets a required threshold in GiB.

eplace_lib.ncbi_download.get_total_memory_gb() float[source]

Get total system memory in GiB.

On Linux this first reads /proc/meminfo (MemTotal). If that is not available, it falls back to POSIX os.sysconf. Returns 0.0 when both strategies fail.
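The two-step strategy described above can be sketched as follows. The function names and the meminfo parameter (added so the parser can be pointed at a test file) are illustrative. The sketch also assumes that the tuple returned by check_available_memory_gb is (meets_requirement, total_gib), which the signature suggests but the documentation does not state.

```python
import os
from pathlib import Path
from typing import Tuple


def total_memory_gib_sketch(meminfo: Path = Path("/proc/meminfo")) -> float:
    """Total memory in GiB: /proc/meminfo first, then os.sysconf, else 0.0."""
    try:
        for line in meminfo.read_text().splitlines():
            if line.startswith("MemTotal:"):
                kib = int(line.split()[1])  # the kernel reports this in kB
                return kib / (1024 ** 2)
    except (OSError, ValueError, IndexError):
        pass  # file missing or malformed; try the POSIX fallback
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
        return page_size * page_count / (1024 ** 3)
    except (AttributeError, ValueError, OSError):
        return 0.0  # both strategies failed


def check_memory_sketch(required_gb: float) -> Tuple[bool, float]:
    """Return (meets_requirement, total_gib); tuple meaning is assumed."""
    total = total_memory_gib_sketch()
    return total >= required_gb, total
```

Because a failed lookup yields 0.0, a caller that treats the boolean as a hard gate will refuse to run on platforms where neither strategy works.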

eplace_lib.ncbi_download.setup_mmseqs_database(force_download: bool = False, threads: int = 1, db_dir: Path | None = None) Tuple[bool, str, Path | None][source]

Convenience function to download MMseqs2 NT database.

eplace_lib.ncbi_download.setup_mmseqs_taxonomy(mmseqs_db: Path, ncbi_taxonomy: Path, threads: int = 1, acc2taxid_dir: Path | None = None, taxonomy_workdir: Path | None = None, db_dir: Path | None = None) Tuple[bool, str][source]

Convenience function to add taxonomy to an MMseqs2 database.

eplace_lib.ncbi_download.setup_ncbi_database(force_download: bool = False, verbose: bool = True) Tuple[bool, str][source]

Convenience function to setup the NCBI core_nt database.

Parameters:
  • force_download – If True, downloads even if database exists

  • verbose – If True, logs progress information (default: True)

Returns:

Tuple of (success: bool, message: str)

eplace_lib.cli

ePLACE: environmental Phylogenetic Localisation and Clade Estimation

Main command-line interface for ePLACE toolkit. Provides unified access to database download, BLAST analysis, and grouped workflows.

eplace_lib.cli.blast_command(args)[source]

Handle the blast subcommand - individual workflow.

eplace_lib.cli.download_command(args)[source]

Handle the download subcommand.

eplace_lib.cli.grouped_command(args)[source]

Handle the grouped subcommand - grouped workflow.

eplace_lib.cli.main()[source]

Main entry point for the ePLACE CLI.

eplace_lib.cli.relabel_command(args)[source]

Handle the relabel subcommand - relabel tree with taxonomy.
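The handlers above each take a parsed args object, which is the usual shape for argparse subcommand dispatch. The sketch below shows that dispatch pattern only. The subcommand names come from the handler descriptions, while the program name "eplace", the stub handlers, and the absence of any options are assumptions; the real parser defines its own arguments.

```python
import argparse
from typing import Callable, Dict


def _stub(name: str) -> Callable[[argparse.Namespace], str]:
    """Build a placeholder handler; the real handlers run the workflows."""
    def handler(args: argparse.Namespace) -> str:
        return name
    return handler


# One handler per documented subcommand.
HANDLERS: Dict[str, Callable[[argparse.Namespace], str]] = {
    name: _stub(name) for name in ("download", "blast", "grouped", "relabel")
}


def build_parser_sketch() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="eplace")  # program name is assumed
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name, handler in HANDLERS.items():
        # Attach the handler so main() can call args.func(args).
        subparsers.add_parser(name).set_defaults(func=handler)
    return parser


args = build_parser_sketch().parse_args(["relabel"])
outcome = args.func(args)
```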

Quick Examples

BLAST Analysis

from pathlib import Path
from eplace_lib import run_blast_search, process_blast_results_for_taxonomy

# Run BLAST search with filtering
success, filtered_hits = run_blast_search(
    query_fasta=Path("query.fasta"),
    output_file=Path("blast_results.txt"),
    min_identity=90.0,
    min_coverage=80.0
)

# Extract representative sequences
results = process_blast_results_for_taxonomy(
    blast_hits=filtered_hits,
    output_dir=Path("output"),
    rank="genus"
)

Database Download

from eplace_lib import setup_ncbi_database

# Download the core_nt database
success, message = setup_ncbi_database()
print(f"Success: {success}, Message: {message}")

FASTA Reading

from pathlib import Path
from eplace_lib.blast_analysis import FastaReader

# Read sequences
sequences = FastaReader.read_fasta(Path("input.fasta"))

# Get sequence lengths
lengths = FastaReader.get_sequence_lengths(Path("input.fasta"))

Sequence Alignment

from pathlib import Path
from eplace_lib.alignment import align_sequences, build_phylogenetic_tree

# Align sequences
success = align_sequences(
    input_fasta=Path("sequences.fasta"),
    output_fasta=Path("aligned.fasta"),
    num_threads=4
)

# Build tree
success = build_phylogenetic_tree(
    alignment_fasta=Path("aligned.fasta"),
    output_prefix=Path("tree"),
    num_threads=4
)

Data Structures

BlastHit

Represents a single BLAST hit with the following attributes:

  • query_id: Query sequence identifier

  • subject_id: Subject (database) sequence identifier

  • percent_identity: Percentage of identical matches

  • alignment_length: Length of alignment

  • query_length: Length of query sequence

  • subject_length: Length of subject sequence

  • query_start: Start position in query

  • query_end: End position in query

  • subject_start: Start position in subject

  • subject_end: End position in subject

  • evalue: Expectation value

  • bit_score: Bit score

  • query_coverage: Percentage of query covered by alignment

  • subject_taxonomy: Dictionary of taxonomic information, keyed by rank (for example phylum, class, order, family, genus, species), with a (taxid, name) tuple as each value; may be None

Example usage:

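from eplace_lib.blast_analysis import BlastHit

# Create a BlastHit (all values are illustrative)
hit = BlastHit(
    query_id="query1",
    subject_id="NC_001234.5",
    percent_identity=95.5,
    alignment_length=500,
    query_length=550,
    subject_length=5000,
    query_start=1,
    query_end=500,
    subject_start=100,
    subject_end=599,
    evalue=1e-100,
    bit_score=900.0,
    query_coverage=90.9,
    # Required by the constructor signature; the exact string format
    # expected for subject_taxids is assumed here
    subject_taxid="562",
    subject_taxids="562",
    # Maps rank -> (taxid, name), matching the documented type
    subject_taxonomy={
        "genus": ("561", "Escherichia"),
        "species": ("562", "Escherichia coli"),
    },
)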

Common Workflows

Complete BLAST to Tree Workflow

from pathlib import Path
from eplace_lib import (
    run_blast_search,
    process_blast_results_for_taxonomy,
)
from eplace_lib.sequences import trim_sequences_to_blast_coordinates
from eplace_lib.alignment import align_sequences, build_phylogenetic_tree

# Step 1: BLAST search
success, filtered_hits = run_blast_search(
    query_fasta=Path("query.fasta"),
    output_file=Path("blast_results.txt"),
    min_identity=90.0,
    min_coverage=80.0,
    num_threads=4
)

# Step 2: Extract representatives
results = process_blast_results_for_taxonomy(
    blast_hits=filtered_hits,
    output_dir=Path("output"),
    rank="genus"
)

# Step 3: Process each query
for query_id, fasta_path in results.items():
    # Trim sequences
    trimmed_path = fasta_path.parent / f"{query_id}_trimmed.fasta"
    trim_sequences_to_blast_coordinates(
        input_fasta=fasta_path,
        output_fasta=trimmed_path,
        blast_hits=filtered_hits
    )

    # Align sequences
    aligned_path = fasta_path.parent / f"{query_id}_aligned.fasta"
    align_sequences(
        input_fasta=trimmed_path,
        output_fasta=aligned_path,
        num_threads=4
    )

    # Build tree
    tree_prefix = fasta_path.parent / f"{query_id}_tree"
    build_phylogenetic_tree(
        alignment_fasta=aligned_path,
        output_prefix=tree_prefix,
        num_threads=4
    )

Custom BLAST Parameters

from pathlib import Path
from eplace_lib.blast_analysis import BlastRunner

runner = BlastRunner()

# Run BLAST with custom parameters
success = runner.run_blastn(
    query_fasta=Path("query.fasta"),
    output_file=Path("blast_results.txt"),
    database="core_nt",
    num_threads=8,
    max_target_seqs=500,
    evalue=1e-10,
    word_size=11
)

# Parse and filter results
hits = runner.parse_blast_results(Path("blast_results.txt"))
filtered_hits = runner.filter_blast_hits(
    hits,
    min_identity=95.0,
    min_coverage=90.0
)

Working with Taxonomic Data

from eplace_lib.taxonomy import TaxonomyExtractor

extractor = TaxonomyExtractor()

# Group hits by query
grouped_hits = extractor.group_hits_by_query(blast_hits)

# Select representatives at different ranks
for query_id, query_hits in grouped_hits.items():
    # At genus level
    genus_reps = extractor.select_representatives_by_rank(
        hits=query_hits,
        rank="genus",
        max_per_rank=1
    )

    # At species level
    species_reps = extractor.select_representatives_by_rank(
        hits=query_hits,
        rank="species",
        max_per_rank=2
    )

Error Handling

Most functions return success indicators and provide error messages:

from pathlib import Path
from eplace_lib import run_blast_search

success, result = run_blast_search(
    query_fasta=Path("query.fasta"),
    output_file=Path("output.txt"),
    min_identity=90.0,
    min_coverage=80.0
)

if not success:
    print(f"BLAST failed: {result}")
else:
    print(f"Found {len(result)} hits")

For functions that return a single bool rather than a tuple, check the return value directly:

from pathlib import Path
from eplace_lib.alignment import align_sequences

success = align_sequences(
    input_fasta=Path("sequences.fasta"),
    output_fasta=Path("aligned.fasta")
)

if not success:
    print("Alignment failed")

Type Hints

ePLACE uses type hints throughout the codebase for better IDE support:

from pathlib import Path
from typing import List, Dict, Tuple
from eplace_lib.blast_analysis import BlastHit

def process_hits(
    hits: List[BlastHit],
    min_identity: float = 90.0
) -> Tuple[bool, List[BlastHit]]:
    """Process BLAST hits with type hints."""
    filtered = [h for h in hits if h.percent_identity >= min_identity]
    return True, filtered

Logging

ePLACE uses Python’s logging module. Configure logging in your scripts:

import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

# Now run ePLACE functions
from eplace_lib import run_blast_search

Advanced Usage

Custom Database Management

from eplace_lib.ncbi_download import NCBIDownloader

downloader = NCBIDownloader()

# Get database directory
db_dir = downloader.get_blastdb_directory()

# Check if database exists
exists = downloader.check_database_exists()

# Get available files
files = downloader.get_available_files()

# Download specific file
downloader.download_file('core_nt.00.tar.gz', db_dir)

Sequence Extraction

from pathlib import Path
from eplace_lib.taxonomy import SequenceExtractor

extractor = SequenceExtractor()

# Extract specific sequences
success = extractor.extract_sequences(
    sequence_ids=["NC_001234.5", "NC_005678.9"],
    output_fasta=Path("extracted.fasta"),
    database="core_nt"
)

See Also