API Reference

This page documents the ePLACE Python API for programmatic access.

Core Modules

eplace_lib.blast_analysis

BLAST analysis module for sequence comparison.

This module provides functionality for running BLAST searches and filtering results based on sequence identity and coverage criteria.

class eplace_lib.blast_analysis.BlastHit(query_id: str, subject_id: str, percent_identity: float, alignment_length: int, query_length: int, subject_length: int, query_start: int, query_end: int, subject_start: int, subject_end: int, evalue: float, bit_score: float, query_coverage: float, subject_taxid: str, subject_taxids: str, subject_taxonomy: Dict[str, Tuple[str, str]] | None = None)[source]

Bases: object

Represents a single BLAST hit result.

query_id

Query sequence identifier

Type:: str

subject_id

Subject (database) sequence identifier

Type:: str

percent_identity

Percentage of identical matches

Type:: float

alignment_length

Length of alignment

Type:: int

query_length

Length of query sequence

Type:: int

subject_length

Length of subject sequence

Type:: int

query_start

Start position in query

Type:: int

query_end

End position in query

Type:: int

subject_start

Start position in subject

Type:: int

subject_end

End position in subject

Type:: int

evalue

Expectation value

Type:: float

bit_score

Bit score

Type:: float

query_coverage

Percentage of query covered by alignment

Type:: float

subject_taxonomy

The subject's taxonomy information. A dictionary with rank as key and a tuple of (taxid, name) as value.

Type:: Dict[str, Tuple[str, str]] | None

alignment_length: int

bit_score: float

evalue: float

get_accession() → str[source]

Extract the accession number from the subject_id.

BLAST IDs can be in various formats:

- gi|2273658778|gb|MZ387488.1| -> MZ387488.1
- ref|NZ_CP123456.1| -> NZ_CP123456.1
- gb|MZ387488.1| -> MZ387488.1
- MZ387488.1 -> MZ387488.1 (already in accession format)

Note: gnl|database|identifier format is handled by returning the identifier, but these may not be standard accessions.

Returns:: The accession number extracted from subject_id, or the full subject_id if no standard format is detected
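
The format handling described above can be illustrated with a small standalone sketch. It is not the library's implementation: the function name extract_accession is hypothetical, and the choice to look only for the gb and ref tags (the ones listed above) is an assumption.

```python
def extract_accession(subject_id: str) -> str:
    """Return the accession part of a BLAST subject ID (illustrative sketch)."""
    if "|" not in subject_id:
        # Already a bare accession such as MZ387488.1.
        return subject_id

    # Split on pipes and drop the empty field left by a trailing "|".
    fields = [field for field in subject_id.split("|") if field]

    # Assumption: the accession directly follows a "gb" or "ref" tag.
    for tag in ("gb", "ref"):
        if tag in fields:
            idx = fields.index(tag)
            if idx + 1 < len(fields):
                return fields[idx + 1]

    # gnl|database|identifier: return the identifier.
    if fields and fields[0] == "gnl" and len(fields) >= 3:
        return fields[2]

    # No recognised format: fall back to the full ID.
    return subject_id


examples = [
    "gi|2273658778|gb|MZ387488.1|",
    "ref|NZ_CP123456.1|",
    "gb|MZ387488.1|",
    "MZ387488.1",
]
accessions = [extract_accession(subject_id) for subject_id in examples]
```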

get_subject_taxonomy(rank: str) → tuple[str, str] | None[source]

Return the taxonomy information as a tuple of (taxid, name) for the given rank. If the rank is not found, return None.

Parameters:: rank – The rank to return the taxonomy information for.
Returns:: tuple of (taxid, name) for the given rank, or None if the rank is not found.

percent_identity: float

query_coverage: float

query_end: int

query_id: str

query_length: int

query_start: int

subject_end: int

subject_id: str

subject_length: int

subject_start: int

subject_taxid: str

subject_taxids: str

subject_taxonomy: Dict[str, Tuple[str, str]] | None = None

class eplace_lib.blast_analysis.BlastRunner(blastdb_path: Path | None = None)[source]

Bases: object

Class for running BLAST searches and parsing results.

__init__(blastdb_path: Path | None = None)[source]

Initialize the BlastRunner.

Parameters:: blastdb_path – Path to BLAST database directory. If None, uses BLASTDB env var.

check_blastn_available() → bool[source]

Check if blastn is available in the system.

Returns:: True if blastn is available, False otherwise

filter_blast_hits(hits: list[BlastHit], min_identity: float = 90.0, min_coverage: float = 80.0, min_alignment_length: int | None = None) → list[BlastHit][source]

Filter BLAST hits based on identity and coverage thresholds.

Parameters:

hits – list of BlastHit objects
min_identity – Minimum percent identity (default: 90.0)
min_coverage – Minimum query coverage percentage (default: 80.0)
min_alignment_length – Minimum alignment length (optional)

Returns:

Filtered list of BlastHit objects
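
As a rough illustration of the thresholds, the sketch below filters a few hand-written hits. The Hit dataclass is a hypothetical stand-in that carries only the fields the filter reads, and treating the thresholds as inclusive (>=) is an assumption, not something this page states.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """Hypothetical stand-in for BlastHit with only the filtered fields."""
    subject_id: str
    percent_identity: float
    query_coverage: float
    alignment_length: int


def filter_hits(
    hits: List[Hit],
    min_identity: float = 90.0,
    min_coverage: float = 80.0,
    min_alignment_length: Optional[int] = None,
) -> List[Hit]:
    kept = []
    for hit in hits:
        # Assumption: a hit exactly at a threshold is kept.
        if hit.percent_identity < min_identity:
            continue
        if hit.query_coverage < min_coverage:
            continue
        if min_alignment_length is not None and hit.alignment_length < min_alignment_length:
            continue
        kept.append(hit)
    return kept


hits = [
    Hit("A", 99.0, 95.0, 300),
    Hit("B", 85.0, 99.0, 310),  # fails identity
    Hit("C", 97.0, 60.0, 180),  # fails coverage
    Hit("D", 92.0, 88.0, 120),  # fails only a length cut-off of 200
]
default_ids = [h.subject_id for h in filter_hits(hits)]
long_ids = [h.subject_id for h in filter_hits(hits, min_alignment_length=200)]
```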

parse_blast_results(blast_output: Path, query_lengths: dict[str, int] | None = None) → list[BlastHit][source]

Parse BLAST tabular output.

Parameters:

blast_output – Path to BLAST output file (tabular format)
query_lengths – dictionary of query sequence lengths. If None, uses qlen from results.

Returns:

list of BlastHit objects

Raises:

FileNotFoundError – If BLAST output file doesn’t exist
ValueError – If BLAST output is malformed
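
To make the tabular format concrete, here is a minimal sketch that parses one line using the column order of the default outfmt of run_blastn. The helper name parse_line and the coverage formula (aligned query span divided by query length) are assumptions; the library may compute query_coverage differently.

```python
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "qlen", "slen", "qstart",
    "qend", "sstart", "send", "evalue", "bitscore", "staxid", "staxids",
]
INT_COLUMNS = ("length", "qlen", "slen", "qstart", "qend", "sstart", "send")
FLOAT_COLUMNS = ("pident", "evalue", "bitscore")


def parse_line(line: str) -> dict:
    """Parse one tab-separated BLAST line into a dictionary (sketch)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    row = dict(zip(COLUMNS, fields))
    for key in INT_COLUMNS:
        row[key] = int(row[key])
    for key in FLOAT_COLUMNS:
        row[key] = float(row[key])
    # Assumed formula: percentage of the query spanned by the alignment.
    span = abs(row["qend"] - row["qstart"]) + 1
    row["query_coverage"] = 100.0 * span / row["qlen"]
    return row


line = "q1\tgb|MZ387488.1|\t98.5\t200\t250\t1500\t1\t200\t301\t500\t1e-50\t350.0\t9606\t9606"
row = parse_line(line)
```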

run_blastn(query_fasta: Path, output_file: Path, database: str = 'core_nt', num_threads: int = 1, max_target_seqs: int = 100, evalue: float = 1e-05, outfmt: str = '6 qseqid sseqid pident length qlen slen qstart qend sstart send evalue bitscore staxid staxids') → bool[source]

Run blastn search.

Parameters:

query_fasta – Path to query FASTA file
output_file – Path to output file
database – Name of BLAST database (default: “core_nt”)
num_threads – Number of threads to use
max_target_seqs – Maximum number of target sequences to report
evalue – E-value threshold
outfmt – Output format string

Returns:

True if BLAST ran successfully, False otherwise

Raises:

FileNotFoundError – If query file doesn’t exist
RuntimeError – If blastn is not available
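
The parameters above map naturally onto standard blastn command-line options. The sketch below only builds the argument list and does not run anything; build_blastn_command is a hypothetical helper, and whether run_blastn assembles its command in exactly this way is an assumption.

```python
from pathlib import Path
from typing import List

DEFAULT_OUTFMT = (
    "6 qseqid sseqid pident length qlen slen qstart qend "
    "sstart send evalue bitscore staxid staxids"
)


def build_blastn_command(
    query_fasta: Path,
    output_file: Path,
    database: str = "core_nt",
    num_threads: int = 1,
    max_target_seqs: int = 100,
    evalue: float = 1e-5,
    outfmt: str = DEFAULT_OUTFMT,
) -> List[str]:
    """Build a blastn argument list suitable for subprocess.run (sketch)."""
    return [
        "blastn",
        "-query", str(query_fasta),
        "-db", database,
        "-out", str(output_file),
        "-num_threads", str(num_threads),
        "-max_target_seqs", str(max_target_seqs),
        "-evalue", str(evalue),
        # Passed as a single argument, so no shell quoting is needed.
        "-outfmt", outfmt,
    ]


cmd = build_blastn_command(Path("queries.fasta"), Path("hits.tsv"), num_threads=4)
```

Passing the list to subprocess.run without shell=True keeps the multi-word outfmt string intact as one argument.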

class eplace_lib.blast_analysis.FastaReader[source]

Bases: object

Class for reading FASTA files.

static get_sequence_lengths(fasta_path: Path) → dict[str, int][source]

Get the length of each sequence in a FASTA file.

Parameters:: fasta_path – Path to the FASTA file
Returns:: dictionary mapping sequence IDs to their lengths

static read_fasta(fasta_path: Path) → dict[str, str][source]

Read sequences from a FASTA file.

Parameters:

fasta_path – Path to the FASTA file

Returns:

dictionary mapping sequence IDs to sequences

Raises:

FileNotFoundError – If FASTA file doesn’t exist
ValueError – If FASTA file is malformed
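
For orientation, a minimal FASTA reader might look like the sketch below. It reads from any open text handle rather than a Path so that it can be tried without a file on disk. The function name read_fasta_handle and the use of the first whitespace-delimited header token as the sequence ID are assumptions.

```python
import io
from typing import Dict, Iterable


def read_fasta_handle(handle: Iterable[str]) -> Dict[str, str]:
    """Read FASTA records from an iterable of lines (sketch)."""
    sequences: Dict[str, str] = {}
    current_id = None
    chunks = []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if current_id is not None:
                sequences[current_id] = "".join(chunks)
            header = line[1:].split()
            if not header:
                raise ValueError("FASTA header without an identifier")
            current_id = header[0]  # assumption: ID is the first token
            chunks = []
        else:
            if current_id is None:
                raise ValueError("sequence data before the first header")
            chunks.append(line)
    if current_id is not None:
        sequences[current_id] = "".join(chunks)
    return sequences


example = io.StringIO(">seq1 sample A\nACGT\nACGT\n>seq2\nGGCC\n")
seqs = read_fasta_handle(example)
# Same shape as the result of get_sequence_lengths: ID -> length.
lengths = {seq_id: len(seq) for seq_id, seq in seqs.items()}
```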

class eplace_lib.blast_analysis.MMseqs2Runner(db_path: Path | None = None)[source]

Bases: object

Class for running MMseqs2 searches and parsing results.

MMseqs2 (Many-against-Many sequence searching) is an alternative to BLAST for sequence similarity searching, offering improved speed and sensitivity. Results are parsed into BlastHit objects for compatibility with the rest of the ePLACE pipeline.

The target database can be either a pre-built MMseqs2 database (created with mmseqs createdb) or a FASTA file that MMseqs2 indexes automatically. Taxonomy fields (taxid) are populated only when the database was built with taxonomy information (mmseqs createtaxdb); otherwise they default to “0”.

Database selection: To keep results comparable with the BLAST workflow (which uses NCBI core_nt), the recommended MMseqs2 database should be built from the same underlying sequence collection as core_nt. This means creating an MMseqs2 database from the FASTA sequences that make up NCBI core_nt (e.g. by exporting them with blastdbcmd -db core_nt -entry all and then running mmseqs createdb). Using a different nucleotide collection will change the search space and may produce classification differences that reflect the database rather than the search algorithm. There is no official pre-built MMseqs2 core_nt database; users must provide their own.

__init__(db_path: Path | None = None)[source]

Initialize the MMseqs2Runner.

Parameters:: db_path – Path to the MMseqs2 database directory. If None the MMSEQS_DB_DIR environment variable is used; if unset, MMSEQS2DB is used as a legacy fallback; if both are unset, the directory ~/mmseqs2db is used.
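
The fallback order described for db_path can be sketched as follows. The helper resolve_db_dir is hypothetical, it takes the environment as an argument so the logic can be exercised without modifying os.environ, and treating an empty variable as unset is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_db_dir(
    db_path: Optional[Path] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Path:
    """Resolve the MMseqs2 database directory (illustrative sketch)."""
    env = os.environ if env is None else env
    if db_path is not None:
        return Path(db_path)
    # MMSEQS_DB_DIR takes priority; MMSEQS2DB is the legacy fallback.
    for variable in ("MMSEQS_DB_DIR", "MMSEQS2DB"):
        value = env.get(variable)
        if value:  # assumption: an empty string counts as unset
            return Path(value)
    return Path.home() / "mmseqs2db"


explicit = resolve_db_dir(Path("/data/db"), env={"MMSEQS_DB_DIR": "/ignored"})
preferred = resolve_db_dir(env={"MMSEQS_DB_DIR": "/new", "MMSEQS2DB": "/legacy"})
legacy = resolve_db_dir(env={"MMSEQS2DB": "/legacy"})
default = resolve_db_dir(env={})
```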

check_mmseqs_available() → bool[source]

Check if mmseqs is available in the system PATH.

Returns:: True if mmseqs is available, False otherwise

filter_hits(hits: list[BlastHit], min_identity: float = 90.0, min_coverage: float = 80.0, min_alignment_length: int | None = None) → list[BlastHit][source]

Filter MMseqs2 hits based on identity and coverage thresholds.

Parameters:

hits – list of BlastHit objects
min_identity – Minimum percent identity (default: 90.0)
min_coverage – Minimum query coverage percentage (default: 80.0)
min_alignment_length – Minimum alignment length (optional)

Returns:

Filtered list of BlastHit objects

parse_mmseqs_results(mmseqs_output: Path, query_lengths: dict[str, int] | None = None) → list[BlastHit][source]

Parse MMseqs2 tabular output into BlastHit objects.

Expects output generated with --format-output set to: query,target,pident,alnlen,qlen,tlen,qstart,qend,tstart,tend,evalue,bits,taxid,taxlineage

The taxid and taxlineage columns are optional; if absent or set to “N/A” / “0”, subject_taxid will be stored as “0”.

Parameters:

mmseqs_output – Path to MMseqs2 output file
query_lengths – Unused; kept for API compatibility with BlastRunner.parse_blast_results.

Returns:

list of BlastHit objects

Raises:

FileNotFoundError – If the output file doesn’t exist
ValueError – If the output is malformed

run_easy_search(query_fasta: Path, output_file: Path, database: str = 'core_nt', num_threads: int = 1, max_target_seqs: int = 100, evalue: float = 1e-05, sensitivity: float = 5.7, tmp_dir: Path | None = None, search_type: int = 3, split_memory_limit: str | None = None, timeout: int = 3600) → bool[source]

Run an MMseqs2 easy-search.

The output is written in a tab-separated format with the following columns (in order): query, target, pident, alnlen, qlen, tlen, qstart, qend, tstart, tend, evalue, bits, taxid, taxlineage

Parameters:

query_fasta – Path to query FASTA file
output_file – Path to output file
database – Name of the MMseqs2 database inside db_path (default: “core_nt”). There is no official pre-built MMseqs2 core_nt database; users must build their own from the same sequence collection as BLAST core_nt to keep results comparable across backends.
num_threads – Number of threads to use
max_target_seqs – Maximum number of target sequences to report
evalue – E-value threshold
sensitivity – MMseqs2 sensitivity (1–7.5, default: 5.7)
tmp_dir – Temporary directory for MMseqs2 intermediate files. Defaults to a mmseqs_tmp subdirectory next to output_file.
search_type – MMseqs2 search type passed as --search-type to easy-search. Commonly used values: 2 (translated), 3 (nucleotide), 4 (translated nucleotide backtrace). Default is 3 (nucleotide). See MMseqs2 documentation for all valid values.
split_memory_limit – Maximum RAM for the MMseqs2 prefilter/index step, passed as --split-memory-limit to easy-search (e.g. "400G"). When None the flag is omitted and MMseqs2 uses its own default.
timeout – Maximum runtime for the MMseqs2 search in seconds (default: 3600).

Returns:

True if MMseqs2 ran successfully, False otherwise

Raises:

FileNotFoundError – If query file doesn’t exist
RuntimeError – If mmseqs is not available
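
As a sketch of how these parameters could translate into an mmseqs easy-search invocation, the helper below builds the argument list without running it. build_easy_search_command is a hypothetical name, and the exact command that run_easy_search assembles is an assumption; the flags themselves (--threads, --max-seqs, -e, -s, --search-type, --format-output, --split-memory-limit) are standard MMseqs2 options.

```python
from pathlib import Path
from typing import List, Optional

FORMAT_OUTPUT = (
    "query,target,pident,alnlen,qlen,tlen,qstart,qend,"
    "tstart,tend,evalue,bits,taxid,taxlineage"
)


def build_easy_search_command(
    query_fasta: Path,
    target_db: Path,
    output_file: Path,
    tmp_dir: Path,
    num_threads: int = 1,
    max_target_seqs: int = 100,
    evalue: float = 1e-5,
    sensitivity: float = 5.7,
    search_type: int = 3,
    split_memory_limit: Optional[str] = None,
) -> List[str]:
    """Build an mmseqs easy-search argument list (sketch)."""
    cmd = [
        "mmseqs", "easy-search",
        str(query_fasta), str(target_db), str(output_file), str(tmp_dir),
        "--threads", str(num_threads),
        "--max-seqs", str(max_target_seqs),
        "-e", str(evalue),
        "-s", str(sensitivity),
        "--search-type", str(search_type),
        "--format-output", FORMAT_OUTPUT,
    ]
    # The flag is omitted entirely when no limit is given.
    if split_memory_limit is not None:
        cmd += ["--split-memory-limit", split_memory_limit]
    return cmd


cmd = build_easy_search_command(
    Path("queries.fasta"), Path("db/core_nt"), Path("hits.tsv"), Path("mmseqs_tmp")
)
with_limit = build_easy_search_command(
    Path("queries.fasta"), Path("db/core_nt"), Path("hits.tsv"), Path("mmseqs_tmp"),
    split_memory_limit="400G",
)
```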

eplace_lib.blast_analysis.normalize_sequence_id(seq_id: str) → str[source]

Normalize an arbitrary sequence or tree label to a canonical accession-like identifier.

This is used to compare IDs from different sources (BLAST subject IDs, FASTA headers, tree leaf labels) that may be formatted differently but refer to the same sequence.

Normalization steps:

1. Strip a leading ‘>’ (FASTA header prefix).
2. Take only the first whitespace-delimited token.
3. Remove MAFFT reverse-complement markers: a leading ‘_R_’ prefix or a trailing ‘_R_’ suffix.
4. If the token contains pipes (‘|’), extract the accession via _extract_accession_from_pipe_id() (gi|…|gb|ACC|, ref|ACC|, gb|ACC|, etc.). Otherwise return the token unchanged.

Parameters:: seq_id – Raw sequence identifier from any source.
Returns:: Canonical accession string suitable for exact comparison.
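
The normalization steps can be followed in a short standalone sketch. normalize_id is a hypothetical name, and the pipe handling shown here (take the field after a gb or ref tag, otherwise leave the token unchanged) is an assumption standing in for the private helper _extract_accession_from_pipe_id().

```python
def normalize_id(seq_id: str) -> str:
    """Normalize a sequence label to an accession-like ID (sketch)."""
    token = seq_id.strip()
    # Step 1: strip a leading '>' from FASTA headers.
    if token.startswith(">"):
        token = token[1:]
    # Step 2: keep only the first whitespace-delimited token.
    parts = token.split()
    token = parts[0] if parts else ""
    # Step 3: remove MAFFT reverse-complement markers.
    if token.startswith("_R_"):
        token = token[3:]
    if token.endswith("_R_"):
        token = token[:-3]
    # Step 4: extract the accession from pipe-delimited IDs.
    if "|" in token:
        fields = [field for field in token.split("|") if field]
        for tag in ("gb", "ref"):  # assumption: only these tags are handled
            if tag in fields and fields.index(tag) + 1 < len(fields):
                return fields[fields.index(tag) + 1]
    return token


same = {
    normalize_id(">_R_gi|2273658778|gb|MZ387488.1| Homo sapiens"),
    normalize_id("gb|MZ387488.1|"),
    normalize_id("MZ387488.1_R_"),
    normalize_id("MZ387488.1"),
}
```

All four labels collapse to the same string, which is what allows BLAST subject IDs, FASTA headers, and tree leaf labels to be compared exactly.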

eplace_lib.blast_analysis.run_blast_search(query_fasta: Path, output_file: Path, min_identity: float = 90.0, min_coverage: float = 80.0, database: str = 'core_nt', blastdb_path: Path | None = None, num_threads: int = 1, skip_existing: bool = True) → tuple[bool, list[BlastHit]][source]

Convenience function to run BLAST search and return filtered hits.

Parameters:

query_fasta – Path to query FASTA file
output_file – Path to output file
min_identity – Minimum percent identity (default: 90.0)
min_coverage – Minimum query coverage percentage (default: 80.0)
database – Name of BLAST database (default: “core_nt”)
blastdb_path – Path to BLAST database directory
num_threads – Number of threads to use
skip_existing – Skip search if output file already exists (default: True)

Returns:

Tuple of (success: bool, filtered_hits: list[BlastHit])

eplace_lib.blast_analysis.run_mmseqs_search(query_fasta: Path, output_file: Path, min_identity: float = 90.0, min_coverage: float = 80.0, database: str = 'core_nt', db_path: Path | None = None, num_threads: int = 1, sensitivity: float = 5.7, skip_existing: bool = True, search_type: int = 3, memory_limit: str | None = None, timeout: int = 3600) → tuple[bool, list[BlastHit]][source]

Convenience function to run an MMseqs2 search and return filtered hits.

To keep results comparable with the BLAST workflow (which searches NCBI core_nt), the MMseqs2 database should be built from the same underlying sequence collection as core_nt. There is no official pre-built MMseqs2 core_nt database; users must create one from the relevant FASTA sequences (e.g. exported from BLAST core_nt with blastdbcmd). Using a different nucleotide collection changes the search space and may produce classification differences unrelated to the choice of search engine.

Parameters:

query_fasta – Path to query FASTA file
output_file – Path to output file
min_identity – Minimum percent identity (default: 90.0)
min_coverage – Minimum query coverage percentage (default: 80.0)
database – Name of MMseqs2 database inside db_path (default: “core_nt”)
db_path – Path to the MMseqs2 database directory
num_threads – Number of threads to use
sensitivity – MMseqs2 sensitivity (1–7.5, default: 5.7)
skip_existing – Skip search if output file already exists (default: True)
search_type – MMseqs2 search type passed as --search-type to easy-search. Commonly used values: 2 (translated), 3 (nucleotide), 4 (translated nucleotide backtrace). Default is 3 (nucleotide). See MMseqs2 documentation for all valid values.
memory_limit – Maximum RAM for the MMseqs2 prefilter/index step, passed as --split-memory-limit to easy-search (e.g. "400G"). When None the flag is omitted.
timeout – Maximum runtime for the MMseqs2 search in seconds (default: 3600).

Returns:

Tuple of (success: bool, filtered_hits: list[BlastHit])

Raises:

ValueError – If sensitivity is outside the valid range (1–7.5)

eplace_lib.blast_analysis.validate_mmseqs_memory_limit(value: str) → str[source]

Validate a MMseqs2-style memory limit string.

Accepts a positive integer (no leading zeros) followed by a single unit suffix K, M, G, or T (case-sensitive, no space). Examples of valid values:

64G   128G   400G   1T   512M

Parameters:: value – The memory limit string to validate.
Returns:: The validated string unchanged.
Raises:: ValueError – If the string is empty, missing units, has an invalid unit suffix, or is otherwise malformed.
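
The accepted format can be expressed as a single regular expression. The sketch below is one way to write such a check; the names validate_memory_limit and is_valid are hypothetical, and the library's actual error messages are not reproduced.

```python
import re

# A positive integer without leading zeros, then exactly one of K, M, G, T.
_MEMORY_LIMIT = re.compile(r"[1-9][0-9]*[KMGT]")


def validate_memory_limit(value: str) -> str:
    """Return value unchanged if it looks like '400G', else raise (sketch)."""
    if not _MEMORY_LIMIT.fullmatch(value):
        raise ValueError(f"invalid memory limit: {value!r}")
    return value


def is_valid(value: str) -> bool:
    """Convenience wrapper used for the examples below."""
    try:
        validate_memory_limit(value)
    except ValueError:
        return False
    return True


accepted = [v for v in ("64G", "128G", "400G", "1T", "512M") if is_valid(v)]
rejected = [v for v in ("", "64", "064G", "64g", "64 G", "6.4G") if not is_valid(v)]
```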

eplace_lib.taxonomy

Taxonomy extraction and sequence retrieval module.

This module provides functionality for extracting taxonomic information from BLAST results, selecting representative sequences per taxonomic rank, and extracting sequences from databases.

class eplace_lib.taxonomy.SequenceExtractor(blastdb_path: Path | None = None)[source]

Bases: object

Class for extracting sequences from BLAST databases.

__init__(blastdb_path: Path | None = None)[source]

Initialize the SequenceExtractor.

Parameters:: blastdb_path – Path to BLAST database directory. If None, uses BLASTDB env var.

check_blastdbcmd_available() → bool[source]

Check if blastdbcmd is available in the system.

Returns:: True if blastdbcmd is available, False otherwise

extract_representatives_for_query(query_id: str, representative_hits: list[BlastHit], output_dir: Path, database: str = 'core_nt') → Path | None[source]

Extract representative sequences for a single query to a FASTA file.

Parameters:

query_id – Query sequence identifier
representative_hits – list of representative BlastHit objects
output_dir – Output directory for FASTA files
database – Name of BLAST database

Returns:

Path to output FASTA file if successful, None otherwise

extract_sequences(sequence_ids: list[str], output_fasta: Path, database: str = 'core_nt') → bool[source]

Extract sequences from BLAST database using blastdbcmd.

Parameters:

sequence_ids – list of sequence IDs to extract
output_fasta – Path to output FASTA file
database – Name of BLAST database (default: “core_nt”)

Returns:

True if extraction was successful, False otherwise

Raises:

RuntimeError – If blastdbcmd is not available

class eplace_lib.taxonomy.TaxonomyExtractor[source]

Bases: object

Class for extracting taxonomic information from sequence IDs.

group_hits_by_query(hits: list[BlastHit]) → dict[str, list[BlastHit]][source]

Group BLAST hits by query sequence.

Parameters:: hits – list of BlastHit objects
Returns:: dictionary mapping query IDs to lists of hits

parse_taxids(tax_ids: list[str]) → dict[str, dict[str, tuple[str, str]]][source]

Parse taxonomic information for the taxonomy IDs reported in the BLAST hits.

Parameters:: tax_ids – the taxonomy IDs reported by BLAST
Returns:: dictionary containing the rank and a tuple of the taxonomy ID and the name

select_representatives_by_rank(hits: list[BlastHit], rank: str, max_per_rank: int = 1, preferred_representatives: Dict[str, str] | None = None) → list[BlastHit][source]

Select representative sequences per taxonomic rank.

Parameters:

hits – list of BlastHit objects for a single query
rank – Taxonomic rank for representative selection
max_per_rank – Maximum number of representatives per rank (default: 1)
preferred_representatives – Optional dictionary mapping rank_tid to preferred subject_id to ensure consistent representatives across queries

Returns:

list of representative BlastHit objects

eplace_lib.taxonomy.generate_classification_summary(sequences: dict[str, str], blast_hits: List[BlastHit], output_file: Path, rank: str = 'genus', group_rank: str = 'class', tree_label_rank: str = 'genus', tree_files: dict[str, Path] | None = None) → bool[source]

Generate a classification summary TSV file for each query sequence.

This function creates a TSV file that reports:

- Query sequence ID
- Closest organism at the classification rank (--rank)
- Closest organism at the grouping rank (--group-rank)
- Closest organism at the tree labeling rank (--tree-label-rank)
- Whether the sequence appears in multiple groups
- Whether the sequence has no appropriate classification

The classification is based on the phylogenetically nearest neighbor in the tree (if available), otherwise falls back to the best BLAST hit by bit score.

Parameters:

sequences – dictionary of sequences that we read from the fasta file
blast_hits – List of BlastHit objects with taxonomy information
output_file – Path to output TSV file
rank – Taxonomic rank for classification (default: genus)
group_rank – Taxonomic rank for grouping (default: class)
tree_label_rank – Taxonomic rank for tree labeling (default: genus)
tree_files – Optional dict mapping query_id to tree file paths for finding nearest neighbors

Returns:

True if successful, False otherwise

eplace_lib.taxonomy.process_blast_results_for_taxonomy(blast_hits: List[BlastHit], output_dir: Path, rank: str = 'genus', database: str = 'core_nt', blastdb_path: Path | None = None) → Dict[str, Path | None][source]

Process BLAST hits to extract representative sequences per taxonomic rank.

Parameters:

blast_hits – list of BlastHit objects
output_dir – Output directory for FASTA files
rank – Taxonomic rank for representative selection
database – Name of BLAST database
blastdb_path – Path to BLAST database directory

Returns:

dictionary mapping query IDs to output FASTA file paths

eplace_lib.taxonomy.rewrite_blast_hits(blast_hits: List[BlastHit], output_file: Path, header: bool = True) → bool[source]

Write the BLAST hits back to a file once they have been annotated.

Parameters:

blast_hits – list of BlastHit objects
output_file – the file to write to
header – whether to include a header line in the file

Returns:

True on success

eplace_lib.taxonomy.sort_strings_and_numbers(s: str)[source]

Extract text and numbers from strings for proper sorting.

Parameters:: s – string to extract the number from
Returns:: A tuple (text_part, num_part) that can be used as a sort key. For strings matching the pattern <non-digits><digits>, this is the non-digit prefix and the trailing integer. For non-matching strings, returns (s, 0).
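
A sort key of this kind can be sketched with one regular expression. The function name sort_key is hypothetical, and allowing an empty non-digit prefix (so that "123" yields ("", 123)) is an assumption.

```python
import re


def sort_key(s: str):
    """Split '<non-digits><digits>' into (text, number) for sorting (sketch)."""
    match = re.fullmatch(r"(\D*)(\d+)", s)
    if match:
        return (match.group(1), int(match.group(2)))
    # Strings that do not match the pattern sort by their text alone.
    return (s, 0)


labels = ["seq10", "seq2", "seq1", "control"]
plain = sorted(labels)
natural = sorted(labels, key=sort_key)
```

A plain string sort places "seq10" before "seq2"; with the key, the numeric suffix is compared as an integer.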

eplace_lib.sequences

Sequence analysis module for eDNA data.

This module provides basic functionality for analyzing environmental DNA sequences.

class eplace_lib.sequences.SequenceAnalyzer[source]

Bases: object

A class for analyzing eDNA sequences.

This class provides methods for basic sequence analysis operations commonly used in environmental DNA studies.

calculate_gc_content(sequence: str) → float[source]

Calculate the GC content of a DNA sequence.

Parameters:: sequence – DNA sequence string
Returns:: GC content as a percentage (0-100)

count_bases(sequence: str) → Dict[str, int][source]

Count the occurrence of each base in a DNA sequence.

Parameters:: sequence – DNA sequence string
Returns:: Dictionary with base counts

reverse_complement(sequence: str) → str[source]

Calculate the reverse complement of a DNA sequence.

Parameters:: sequence – DNA sequence string
Returns:: Reverse complement of the input sequence
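
The three operations are simple enough to sketch as standalone functions. These are not the SequenceAnalyzer methods themselves; case handling, the treatment of N, and returning 0.0 for an empty sequence are assumptions.

```python
from collections import Counter
from typing import Dict

# Complement table for upper- and lower-case bases; N maps to itself.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def gc_content(sequence: str) -> float:
    """GC content as a percentage between 0 and 100 (sketch)."""
    seq = sequence.upper()
    if not seq:
        return 0.0  # assumption: empty input yields 0 rather than an error
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def count_bases(sequence: str) -> Dict[str, int]:
    """Count each character in the upper-cased sequence (sketch)."""
    return dict(Counter(sequence.upper()))


def reverse_complement(sequence: str) -> str:
    """Complement every base, then reverse the string (sketch)."""
    return sequence.translate(_COMPLEMENT)[::-1]


gc = gc_content("ACGT")
counts = count_bases("AACG")
revcomp = reverse_complement("AACG")
```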

eplace_lib.alignment

Sequence alignment and phylogenetic tree building module.

This module provides functionality for trimming sequences based on BLAST alignments, aligning sequences using MAFFT, and building phylogenetic trees using IQTree.

class eplace_lib.alignment.IQTreeBuilder[source]

Bases: object

Class for building phylogenetic trees using IQTree.

static build_tree(alignment_fasta: Path, output_prefix: Path, model: str = 'MFP', num_threads: int = None) → bool[source]

Build a phylogenetic tree using IQTree.

Parameters:

alignment_fasta – Path to aligned FASTA file
output_prefix – Prefix for output files
model – Substitution model (default: “MFP” for automatic ModelFinder Plus selection)
num_threads – Number of threads to use (default: None, which uses AUTO)

Returns:

True if tree building was successful, False otherwise

static build_tree_background(alignment_fasta: Path, output_prefix: Path, model: str = 'MFP') → Dict | None[source]

Start building a phylogenetic tree using IQTree in the background.

This method starts IQTree as a background process and returns immediately, allowing multiple trees to be built in parallel.

Parameters:

alignment_fasta – Path to aligned FASTA file
output_prefix – Prefix for output files
model – Substitution model (default: “MFP” for automatic ModelFinder Plus selection)

Returns:

Dictionary with process information if successful, None otherwise. Keys:

‘process’: subprocess.Popen object
‘output_prefix’: output prefix path
‘alignment_fasta’: input alignment file path
‘tree_file’: expected tree file path

static check_iqtree_available() → Tuple[bool, str | None][source]

Check if IQTree is available in the system.

Returns:: Tuple of (available: bool, command: str or None)

static relabel_tree_with_taxonomy(tree_file: Path, blast_hits: List[BlastHit], output_tree: Path, taxonomic_rank: str) → bool[source]

Relabel tree nodes with taxonomic names.

This reads a Newick tree file and replaces sequence IDs with taxonomic names from the BLAST hits.

Parameters:

tree_file – Path to input tree file (Newick format)
blast_hits – List of BlastHit objects with taxonomic information
output_tree – Path to output tree file with relabeled nodes
taxonomic_rank – the taxonomic rank to use for relabeling (e.g., “genus”)

Returns:

True if successful, False otherwise

static wait_for_tree_jobs(jobs: List[Dict], timeout: int = 14400) → Dict[str, bool][source]

Wait for multiple IQTree jobs to complete.

This method polls all running processes and waits for them to complete. Since the processes are already running in parallel (started with Popen), this method just collects their results as they finish.

Parameters:

jobs – List of job dictionaries returned by build_tree_background()
timeout – Maximum time to wait for each individual job in seconds (default: 14400 = 4 hours). The default is generous because the large combined tree built at the end of the pipeline can take a long time.

Returns:

Dictionary mapping tree_file path to success status (True/False)
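
The polling approach can be sketched with plain subprocess.Popen objects. Here short Python child processes stand in for IQTree, wait_for_jobs is a hypothetical name, and measuring the timeout from the moment waiting starts is an assumption about how the per-job limit is applied.

```python
import subprocess
import sys
import time
from typing import Dict, List


def wait_for_jobs(jobs: List[dict], timeout: float = 14400, poll_interval: float = 0.05) -> Dict[str, bool]:
    """Poll background jobs until all have finished or timed out (sketch)."""
    results: Dict[str, bool] = {}
    started = time.monotonic()
    pending = list(jobs)
    while pending:
        still_running = []
        for job in pending:
            code = job["process"].poll()  # None while the process is running
            if code is not None:
                results[str(job["tree_file"])] = code == 0
            elif time.monotonic() - started > timeout:
                job["process"].kill()
                job["process"].wait()
                results[str(job["tree_file"])] = False
            else:
                still_running.append(job)
        pending = still_running
        if pending:
            time.sleep(poll_interval)
    return results


# Stand-in jobs: one exits with status 0, the other with status 1.
jobs = [
    {"process": subprocess.Popen([sys.executable, "-c", "raise SystemExit(0)"]),
     "tree_file": "ok.treefile"},
    {"process": subprocess.Popen([sys.executable, "-c", "raise SystemExit(1)"]),
     "tree_file": "failed.treefile"},
]
status = wait_for_jobs(jobs, timeout=60)
```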

class eplace_lib.alignment.MAFFTAligner[source]

Bases: object

Class for running MAFFT sequence alignments.

static align_sequences(input_fasta: Path, output_fasta: Path, auto_orient: bool = True, num_threads: int = 1, strategy: str = 'default') → bool[source]

Align sequences using MAFFT.

Parameters:

input_fasta – Path to input FASTA file with sequences to align
output_fasta – Path to output aligned FASTA file
auto_orient – Use MAFFT’s auto-orient feature (default: True)
num_threads – Number of threads to use
strategy – MAFFT alignment strategy (default: ‘default’). Options: ‘default’, ‘auto’, ‘retree2’, ‘fftns’.
    ‘auto’: Let MAFFT choose the best strategy automatically
    ‘retree2’: Fast progressive method, good for large datasets
    ‘fftns’: Fastest method for very large datasets

Returns:

True if alignment was successful, False otherwise

static check_mafft_available() → bool[source]

Check if MAFFT is available in the system.

Returns:: True if MAFFT is available, False otherwise

class eplace_lib.alignment.SequenceTrimmer[source]

Bases: object

Class for trimming sequences based on BLAST alignment coordinates.

static trim_sequence_by_coordinates(sequence: str, start: int, end: int) → str[source]

Trim a sequence to extract the region between start and end coordinates.

BLAST coordinates are 1-indexed, so we need to adjust for Python’s 0-indexing.

Parameters:

sequence – The full sequence string
start – Start position (1-indexed, inclusive)
end – End position (1-indexed, inclusive)

Returns:

Trimmed sequence string
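
The index adjustment is easy to get wrong by one, so a small sketch may help. The function name trim is hypothetical, and swapping reversed coordinates (as BLAST reports for minus-strand hits) is an assumption about how such input should be handled.

```python
def trim(sequence: str, start: int, end: int) -> str:
    """Extract the 1-indexed, inclusive region start..end (sketch)."""
    # Assumption: reversed coordinates describe the same region.
    low, high = min(start, end), max(start, end)
    # 1-indexed inclusive [low, high] is the Python slice [low - 1, high).
    return sequence[low - 1:high]


sequence = "ACGTACGT"
middle = trim(sequence, 3, 6)
first = trim(sequence, 1, 1)
whole = trim(sequence, 1, len(sequence))
```

Only the start is shifted by one: the inclusive 1-indexed end equals the exclusive 0-indexed end.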

static trim_sequences_from_blast_hits(fasta_path: Path, blast_hits: List[BlastHit], output_fasta: Path, query_id: str, taxonomic_rank: str) → bool[source]

Trim sequences in a FASTA file based on BLAST hit coordinates.

This reads the representative sequences, trims them to the aligned regions, and writes them to a new FASTA file along with the query sequence.

Parameters:

fasta_path – Path to input FASTA file with full-length sequences
blast_hits – List of BlastHit objects for this query
output_fasta – Path to output FASTA file with trimmed sequences
query_id – The query sequence ID to include in output
taxonomic_rank – the taxonomic rank to use for taxonomic labels (e.g., “genus”)

Returns:

True if successful, False otherwise

class eplace_lib.alignment.SimpleNewickNode(name: str = '', distance: float = 0.0)[source]

Bases: object

Simple Newick tree node representation for finding nearest neighbors.

get_leaves() → List[SimpleNewickNode][source]: Get all leaf nodes under this node.

is_leaf() → bool[source]: Check if this node is a leaf.

eplace_lib.alignment.check_alignment_consistency(blast_hits: List[BlastHit], tolerance: int = 50) → Dict[str, bool][source]

Check if BLAST hits align to similar locations on reference sequences.

For each reference sequence that appears in multiple hits, check if the alignment coordinates are consistent (within tolerance).

Parameters:

blast_hits – List of BlastHit objects to check
tolerance – Maximum allowed difference in coordinates (default: 50 bp)

Returns:

Dictionary mapping subject_id to consistency status (True if consistent)
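
One plausible reading of this check is sketched below using plain (subject_id, subject_start, subject_end) tuples instead of BlastHit objects. The exact criterion is an assumption: a subject is treated as consistent when both its start and its end coordinates vary by no more than the tolerance across hits, and a subject with a single hit is reported as consistent.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def check_consistency(
    hits: Iterable[Tuple[str, int, int]],
    tolerance: int = 50,
) -> Dict[str, bool]:
    """Flag subjects whose hits align to similar coordinates (sketch)."""
    spans: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for subject_id, s_start, s_end in hits:
        # Normalize minus-strand hits so that low <= high.
        low, high = sorted((s_start, s_end))
        spans[subject_id].append((low, high))

    result = {}
    for subject_id, coords in spans.items():
        starts = [low for low, _ in coords]
        ends = [high for _, high in coords]
        result[subject_id] = (
            max(starts) - min(starts) <= tolerance
            and max(ends) - min(ends) <= tolerance
        )
    return result


hits = [
    ("ref1", 100, 400), ("ref1", 120, 430),   # within 50 bp of each other
    ("ref2", 100, 400), ("ref2", 900, 1200),  # far apart
    ("ref3", 400, 100),                       # single minus-strand hit
]
consistency = check_consistency(hits)
```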

eplace_lib.alignment.concatenate_all_groups_and_build_tree(output_dir: Path, query_fasta: Path, classification_file: Path, blast_hits: List[BlastHit], combined_tree_label_rank: str = 'genus', num_threads: int = 1, alignment_strategy: str = 'auto') → Dict[str, Path | None][source]

Concatenate all group _trimmed.fasta files, add queries with no BLAST hits, and build a final alignment and tree.

This function:

1. Finds all *_trimmed.fasta files in group directories
2. Reads the classification file to identify queries with 0 blast hits
3. Concatenates all sequences into a single file
4. Uses MAFFT to build an alignment (with optimal parameters for many sequences)
5. Uses IQTree to build a phylogenetic tree
6. Relabels tree nodes with taxonomic names

Parameters:

output_dir – Output directory containing group subdirectories
query_fasta – Original query FASTA file
classification_file – Path to classifications.tsv file
blast_hits – List of all BlastHit objects with taxonomy information
combined_tree_label_rank – Taxonomic rank for tree labeling (default: genus)
num_threads – Number of threads for alignment and tree building (default: 1)
alignment_strategy – MAFFT alignment strategy (default: ‘auto’). Options: ‘default’, ‘auto’, ‘retree2’, ‘fftns’.

Returns:

Dictionary with paths to generated files. Keys:

‘combined_fasta’: Combined sequences from all groups + zero-hit queries
‘alignment’: Aligned sequences
‘tree’: Phylogenetic tree
‘labeled_tree’: Tree with taxonomic labels

eplace_lib.alignment.create_grouped_fasta_with_queries(group_tid: str, group_name: str, query_hits_map: Dict[str, List[BlastHit]], labeling_rank: str, query_fasta: Path, output_fasta: Path, database: str = 'core_nt', blastdb_path: Path | None = None) → bool[source]

Create a FASTA file for a taxonomic group containing all queries and unique references.

Parameters:

group_tid – Taxonomy ID of the group
group_name – Name of the taxonomic group
query_hits_map – Dictionary mapping query_id to list of BlastHit objects
labeling_rank – Taxonomic rank to use for labeling (e.g., “genus”)
query_fasta – Path to original query FASTA file
output_fasta – Path to output grouped FASTA file
database – Name of BLAST database
blastdb_path – Path to BLAST database directory

Returns:

True if successful, False otherwise

eplace_lib.alignment.find_nearest_neighbor_in_tree(tree_file: Path, query_id: str) → str | None[source]

Find the nearest neighbor (closest leaf) to a query sequence in a phylogenetic tree.

This function parses the Newick tree and finds the leaf node that is phylogenetically closest to the query sequence based on tree topology and branch lengths.

Parameters:

tree_file – Path to the Newick tree file (.treefile)
query_id – Query sequence identifier to find neighbors for

Returns:

Name of the nearest neighbor leaf node, or None if not found or error

eplace_lib.alignment.group_hits_by_group_rank(blast_hits: List[BlastHit], group_rank: str) → Dict[str, Dict[str, List[BlastHit]]][source]

Group BLAST hits by group_rank across all queries.

Parameters:: blast_hits – List of BlastHit objects with group taxonomy information
Returns:: Dictionary mapping group_rank_name (taxonomy name) to another dict mapping query_id to list of hits. Format: {group_rank_name: {query_id: [hits]}}
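
The nested return format can be illustrated with a short sketch that uses plain dictionaries in place of BlastHit objects. The function name group_by_rank is hypothetical, and skipping hits that lack taxonomy at the requested rank is an assumption.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_rank(hits: List[dict], group_rank: str) -> Dict[str, Dict[str, List[dict]]]:
    """Build {group_rank_name: {query_id: [hits]}} (illustrative sketch)."""
    grouped = defaultdict(lambda: defaultdict(list))
    for hit in hits:
        taxonomy = hit.get("subject_taxonomy") or {}
        entry = taxonomy.get(group_rank)
        if entry is None:
            continue  # assumption: hits without this rank are skipped
        _taxid, name = entry
        grouped[name][hit["query_id"]].append(hit)
    # Convert to plain dictionaries for a predictable return type.
    return {name: dict(by_query) for name, by_query in grouped.items()}


hits = [
    {"query_id": "q1", "subject_id": "s1",
     "subject_taxonomy": {"class": ("40674", "Mammalia")}},
    {"query_id": "q1", "subject_id": "s2",
     "subject_taxonomy": {"class": ("8782", "Aves")}},
    {"query_id": "q2", "subject_id": "s3",
     "subject_taxonomy": {"class": ("40674", "Mammalia")}},
    {"query_id": "q3", "subject_id": "s4", "subject_taxonomy": None},
]
groups = group_by_rank(hits, "class")
```

Note that a query such as q1 can appear under more than one group when its hits span several taxa at the grouping rank.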

eplace_lib.alignment.parse_simple_newick(newick_str: str) → SimpleNewickNode | None[source]

Parse a simple Newick tree string into a tree structure.

This is a lightweight parser that handles basic Newick format with branch lengths. Format: ((A:0.1,B:0.2):0.3,C:0.4);

Parameters:: newick_str – Newick format tree string
Returns:: Root node of the parsed tree, or None if parsing fails
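
A lightweight parser for this basic format can be written as a small recursive descent routine. The sketch below is not the library's parser: Node is a hypothetical stand-in for SimpleNewickNode, its children attribute is an assumption, and quoted labels and comments are not handled.

```python
from typing import List, Optional


class Node:
    """Hypothetical stand-in for SimpleNewickNode."""

    def __init__(self, name: str = "", distance: float = 0.0):
        self.name = name
        self.distance = distance
        self.children: List["Node"] = []

    def is_leaf(self) -> bool:
        return not self.children

    def get_leaves(self) -> List["Node"]:
        if self.is_leaf():
            return [self]
        leaves = []
        for child in self.children:
            leaves.extend(child.get_leaves())
        return leaves


def parse_newick(text: str) -> Optional[Node]:
    """Parse a basic Newick string; return None if parsing fails (sketch)."""
    text = text.strip()
    if not text.endswith(";"):
        return None
    text = text[:-1]
    pos = 0

    def parse_subtree() -> Node:
        nonlocal pos
        node = Node()
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            while True:
                node.children.append(parse_subtree())
                if pos < len(text) and text[pos] == ",":
                    pos += 1
                    continue
                break
            if pos >= len(text) or text[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1  # consume ")"
        # Optional label, read up to the next structural character.
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        node.name = text[start:pos].strip()
        # Optional branch length after ":".
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            node.distance = float(text[start:pos])
        return node

    try:
        root = parse_subtree()
    except ValueError:
        return None
    # Leftover characters mean the string was not a single valid tree.
    return root if pos == len(text) else None


root = parse_newick("((A:0.1,B:0.2):0.3,C:0.4);")
leaf_names = [leaf.name for leaf in root.get_leaves()]
```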

eplace_lib.alignment.process_grouped_alignment_and_tree(group_name: str, group_dir: Path, taxonomic_rank: str, blast_hits: List[BlastHit], query_ids: List[str], num_threads: int = 1) → Dict[str, Path | None][source]

Complete pipeline for a taxonomic group: trim, align, and build tree.

Parameters:

group_name – The name of the group, used for file naming
group_dir – Directory containing group-specific files
taxonomic_rank – Taxonomic rank to use for labeling the tree
blast_hits – List of BlastHit objects for all queries in the group
query_ids – List of query sequence IDs in this group
num_threads – Number of threads to use

Returns:

Dictionary with paths to generated files. Keys:

‘combined_fasta’: Combined sequences (queries + references)
‘trimmed_fasta’: Trimmed sequences
‘alignment’: Aligned sequences
‘tree’: Phylogenetic tree
‘labeled_tree’: Tree with taxonomic labels

eplace_lib.alignment.process_grouped_alignment_and_tree_parallel(group_name: str, group_dir: Path, taxonomic_rank: str, blast_hits: List[BlastHit], query_ids: List[str], num_threads: int = 1, background_tree: bool = False) → Dict[str, Path | None][source]

Complete pipeline for a taxonomic group: trim, align, and optionally build tree in background.

This is similar to process_grouped_alignment_and_tree, but with an option to start tree building in the background and return immediately without waiting for completion.

Parameters:

group_name – The name of the group, used for file naming
group_dir – Directory containing group-specific files
taxonomic_rank – Taxonomic rank to use for labeling the tree
blast_hits – List of BlastHit objects for all queries in the group
query_ids – List of query sequence IDs in this group
num_threads – Number of threads to use
background_tree – If True, start tree building in background and return immediately

Returns:

Dictionary with paths to generated files and the information needed for later tree relabeling:

'combined_fasta': Combined sequences (queries + references)
'trimmed_fasta': Trimmed sequences
'alignment': Aligned sequences
'tree_job': Background job info if background_tree=True, None otherwise
'tree_file': Expected tree file path
'blast_hits': BLAST hits for later tree relabeling
'taxonomic_rank': Taxonomic rank for later tree relabeling

eplace_lib.alignment.process_query_alignment_and_tree(query_id: str, query_dir: Path, blast_hits: List[BlastHit], query_fasta: Path, taxonomic_rank: str, num_threads: int = 1) → Dict[str, Path | None][source]

Complete pipeline for a single query: trim, align, and build tree.

Parameters:

query_id – Query sequence identifier
query_dir – Directory containing query-specific files
blast_hits – List of BlastHit objects for this query (with taxonomy info)
query_fasta – Path to original query FASTA file
taxonomic_rank – The taxonomic rank to use for relabeling the tree
num_threads – Number of threads to use

Returns:

Dictionary with paths to generated files:

'trimmed_fasta': Trimmed sequences
'alignment': Aligned sequences
'tree': Phylogenetic tree
'labeled_tree': Tree with taxonomic labels

eplace_lib.alignment.process_query_alignment_and_tree_parallel(query_id: str, query_dir: Path, blast_hits: List[BlastHit], query_fasta: Path, taxonomic_rank: str, num_threads: int = 1, background_tree: bool = False) → Dict[str, Path | None][source]

Complete pipeline for a single query: trim, align, and optionally build tree in background.

This is similar to process_query_alignment_and_tree, but with an option to start tree building in the background and return immediately without waiting for completion.

Parameters:

query_id – Query sequence identifier
query_dir – Directory containing query-specific files
blast_hits – List of BlastHit objects for this query (with taxonomy info)
query_fasta – Path to original query FASTA file
taxonomic_rank – The taxonomic rank to use for relabeling the tree
num_threads – Number of threads to use
background_tree – If True, start tree building in background and return immediately

Returns:

Dictionary with paths to generated files and the information needed for later tree relabeling:

'trimmed_fasta': Trimmed sequences
'alignment': Aligned sequences
'tree_job': Background job info if background_tree=True, None otherwise
'tree_file': Expected tree file path
'blast_hits': BLAST hits for later tree relabeling
'taxonomic_rank': Taxonomic rank for later tree relabeling

eplace_lib.alignment.trim_grouped_sequences(input_fasta: Path, blast_hits: List[BlastHit], output_fasta: Path, query_ids: List[str]) → bool[source]

Trim sequences in a grouped FASTA file based on BLAST hit coordinates.

This is similar to trim_sequences_from_blast_hits but handles multiple queries.

Parameters:

input_fasta – Path to input FASTA file with full-length sequences
blast_hits – List of BlastHit objects for all queries in the group
output_fasta – Path to output FASTA file with trimmed sequences
query_ids – List of query sequence IDs to include (untrimmed)

Returns:

True if successful, False otherwise
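
BLAST reports coordinates as 1-based, inclusive positions, so trimming a subject sequence to its hit region needs an off-by-one adjustment when slicing. The sketch below shows that conversion on a plain string. The function name trim_to_hit is hypothetical, and reverse-complementing minus-strand hits (subject_start greater than subject_end) is an assumption; this page does not say how the library treats them.

```python
# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def trim_to_hit(sequence: str, start: int, end: int) -> str:
    """Return the region covered by a hit, given 1-based inclusive coordinates.

    If start > end the hit lies on the minus strand, and the region is
    reverse-complemented (an assumption made for this sketch).
    """
    if start <= end:
        return sequence[start - 1:end]
    region = sequence[end - 1:start]
    return region.translate(_COMPLEMENT)[::-1]


subject = "ACGTTGCA"
print(trim_to_hit(subject, 2, 4))  # plus-strand hit
print(trim_to_hit(subject, 4, 2))  # minus-strand hit
```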

eplace_lib.ncbi_download

NCBI database download module.

This module provides functionality for downloading and managing NCBI BLAST databases, specifically the core nucleotide (core_nt) database, and for setting up MMseqs2 NT databases with taxonomy.

class eplace_lib.ncbi_download.MMseqsDownloader(db_dir: Path | None = None)[source]

Bases: object

Download and configure MMseqs2 NT databases and taxonomy sidecar files.

Directory resolution for MMseqs2 databases prefers $MMSEQS_DB_DIR, then $MMSEQS2DB (legacy), then ~/mmseqs2db.

Workflow:

1. Download NT with mmseqs databases.
2. Optionally fetch accession2taxid files and build mapping TSV.
3. Attach taxonomy sidecars with mmseqs createtaxdb.

ACC2TAXID_BASE = 'https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/'

LEGACY_MMSEQS_DB_DIR_ENV = 'MMSEQS2DB'

MMSEQS_DB_DIR_ENV = 'MMSEQS_DB_DIR'

NUCLEOTIDE_DB_NAME = 'NT'

__init__(db_dir: Path | None = None)[source]: Initialize the MMseqs downloader.

add_taxonomy_to_database(mmseqs_db: Path, ncbi_taxonomy: Path, threads: int = 1, acc2taxid_dir: Path | None = None, taxonomy_workdir: Path | None = None) → Tuple[bool, str][source]

Add NCBI taxonomy sidecar files to an MMseqs2 NT database.

Parameters:

mmseqs_db – Path to MMseqs2 NT database base file (e.g. .../NT).
ncbi_taxonomy – Directory with NCBI taxonomy dump files.
threads – Number of threads for mmseqs createtaxdb.
acc2taxid_dir – Optional directory with accession2taxid files.
taxonomy_workdir – Optional working directory for mapping files.

Returns:

Tuple (success, message).

Side effects:: Creates mapping files in taxonomy_workdir and writes MMseqs2 taxonomy sidecar files adjacent to mmseqs_db.

download_nt_database(force_download: bool = False, threads: int = 1) → Tuple[bool, str, Path | None][source]: Download MMseqs2 NT database using mmseqs databases.

get_mmseqsdb_directory() → Path[source]

Get or determine the MMseqs2 database directory.

Resolution order:

1. Explicit path passed at initialization.
2. $MMSEQS_DB_DIR.
3. $MMSEQS2DB (legacy fallback).
4. ~/mmseqs2db.
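
The resolution order can be sketched as a small standalone function. This is an illustration of the documented precedence only; the function name and its env and home parameters are hypothetical and exist so the logic can be exercised without touching the real environment.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_mmseqs_db_dir(
    explicit: Optional[Path] = None,
    env: Mapping[str, str] = os.environ,
    home: Optional[Path] = None,
) -> Path:
    """Pick the database directory using the documented precedence."""
    if explicit is not None:
        return Path(explicit)  # 1. explicit path wins
    for variable in ("MMSEQS_DB_DIR", "MMSEQS2DB"):  # 2. current, 3. legacy
        value = env.get(variable)
        if value:
            return Path(value)
    return (home or Path.home()) / "mmseqs2db"  # 4. default under the home dir


print(resolve_mmseqs_db_dir(env={"MMSEQS2DB": "/data/legacy"}))
```

Treating an empty environment variable as unset is a choice made in this sketch; the library may behave differently.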

class eplace_lib.ncbi_download.NCBIDownloader[source]

Bases: object

A class for managing NCBI BLAST database downloads.

This class handles checking for existing databases, downloading from NCBI FTP, verifying checksums, and extracting database files.

CORE_NT_PREFIX = 'core_nt'

NCBI_FTP_BASE = 'https://ftp.ncbi.nlm.nih.gov/blast/db/'

__init__()[source]: Initialize the NCBIDownloader.

check_database_exists(db_dir: Path | None = None) → bool[source]

Check if NCBI core_nt database files exist in the specified directory.

Parameters:: db_dir – Directory to check. If None, uses the default BLASTDB directory.
Returns:: True if at least one core_nt database file exists, False otherwise

download_and_setup_database(force_download: bool = False, verbose: bool = True) → Tuple[bool, str][source]

Main function to download and setup the NCBI core_nt database.

This function:

1. Determines the BLASTDB directory
2. Checks if database already exists (unless force_download is True)
3. Downloads all core_nt.* files from NCBI FTP
4. Verifies MD5 checksums
5. Extracts the database files

Parameters:

force_download – If True, downloads even if database exists
verbose – If True, logs progress information (default: True)

Returns:

Tuple of (success: bool, message: str)

download_file(filename: str, dest_dir: Path, show_progress: bool = True) → Path[source]

Download a file from NCBI FTP server.

Parameters:

filename – Name of the file to download
dest_dir – Destination directory
show_progress – Whether to show download progress (not implemented yet)

Returns:

Path to the downloaded file

Raises:

URLError – If download fails
ValueError – If filename contains path traversal sequences
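
The ValueError case guards against filenames that would escape dest_dir. A minimal sketch of such a check is shown below; the function name and the exact rules are assumptions, and the library's own validation may be stricter or looser.

```python
def validate_filename(filename: str) -> str:
    """Return filename unchanged if it is a bare file name, else raise ValueError."""
    has_separator = "/" in filename or "\\" in filename
    if not filename or filename in {".", ".."} or has_separator:
        raise ValueError(f"Unsafe filename: {filename!r}")
    return filename


print(validate_filename("core_nt.00.tar.gz"))
```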

extract_tarball(tarball_path: Path, dest_dir: Path) → None[source]

Extract a tar.gz file to the specified directory.

Parameters:

tarball_path – Path to the tar.gz file
dest_dir – Destination directory for extraction

Raises:

tarfile.TarError – If extraction fails
ValueError – If tarball contains unsafe paths
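
One common way to detect unsafe paths before extraction is to resolve every member name against the destination and reject anything that lands outside it. The sketch below follows that approach; the helper names are hypothetical, link members are not inspected, and this is not necessarily how extract_tarball is implemented.

```python
import tarfile
from pathlib import Path
from typing import List


def unsafe_members(tar: tarfile.TarFile, dest_dir: Path) -> List[str]:
    """List member names that would resolve outside dest_dir."""
    dest = dest_dir.resolve()
    bad = []
    for member in tar.getmembers():
        # Absolute names and ".." components both end up outside dest here.
        target = (dest / member.name).resolve()
        if target != dest and dest not in target.parents:
            bad.append(member.name)
    return bad


def extract_checked(tarball_path: Path, dest_dir: Path) -> None:
    """Extract a .tar.gz only if every member stays inside dest_dir."""
    with tarfile.open(tarball_path, "r:gz") as tar:
        bad = unsafe_members(tar, dest_dir)
        if bad:
            raise ValueError(f"Unsafe paths in archive: {bad}")
        tar.extractall(dest_dir)
```

Checking all members before extracting anything means a rejected archive leaves no partial output behind.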

get_available_files() → List[str][source]

Get list of available core_nt files from NCBI FTP server.

Returns:: List of filenames matching core_nt pattern
Raises:: URLError – If unable to connect to FTP server

get_blastdb_directory() → Path[source]

Get or determine the BLASTDB directory.

Checks if the BLASTDB environment variable is set. If it exists and points to a valid directory, uses that. Otherwise, creates and returns a path to ~/blastdb.

Returns:: Path object pointing to the BLASTDB directory

verify_md5(file_path: Path, md5_file_path: Path) → bool[source]

Verify the MD5 checksum of a file.

Parameters:

file_path – Path to the file to verify
md5_file_path – Path to the MD5 checksum file

Returns:

True if checksum matches, False otherwise

Raises:

ValueError – If MD5 file format is invalid
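
As an illustration of the check described here, the sketch below hashes a file in chunks and compares the digest with the first field of a checksum file. It assumes the checksum file uses the common "<hex digest>  <filename>" layout; the function name md5_matches is hypothetical and the library's parsing rules may differ.

```python
import hashlib
import re
from pathlib import Path


def md5_matches(file_path: Path, md5_file_path: Path) -> bool:
    """Compare a file's MD5 digest with the one recorded in md5_file_path."""
    fields = md5_file_path.read_text().split()
    if not fields or not re.fullmatch(r"[0-9a-fA-F]{32}", fields[0]):
        raise ValueError(f"Invalid MD5 file format: {md5_file_path}")
    expected = fields[0].lower()

    digest = hashlib.md5()
    with file_path.open("rb") as handle:
        # Read in 1 MiB chunks so large archives are never loaded whole.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected
```

Chunked reading matters for database archives, which can be far larger than available memory.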

eplace_lib.ncbi_download.check_available_memory_gb(required_gb: float) → Tuple[bool, float][source]: Check whether total system memory meets a required threshold in GiB.

eplace_lib.ncbi_download.get_total_memory_gb() → float[source]

Get total system memory in GiB.

On Linux this first reads /proc/meminfo (MemTotal). If that is not available, it falls back to POSIX os.sysconf. Returns 0.0 when both strategies fail.
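
The two-step strategy can be sketched as follows. This is an approximation written from the description above, not the library's code; the helper names are hypothetical, and the sketch relies on the kernel reporting MemTotal in kB, where one kB is 1024 bytes.

```python
import os
from pathlib import Path


def parse_memtotal_gib(meminfo_text: str) -> float:
    """Extract MemTotal from /proc/meminfo content and convert it to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])  # e.g. "MemTotal:  16777216 kB"
            return kib / (1024 ** 2)
    return 0.0


def total_memory_gib() -> float:
    """Try /proc/meminfo first, then os.sysconf; return 0.0 if both fail."""
    try:
        value = parse_memtotal_gib(Path("/proc/meminfo").read_text())
        if value > 0:
            return value
    except (OSError, ValueError, IndexError):
        pass
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
        return page_size * page_count / (1024 ** 3)
    except (AttributeError, OSError, ValueError):
        return 0.0  # e.g. platforms without os.sysconf


print(f"{total_memory_gib():.1f} GiB")
```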

eplace_lib.ncbi_download.setup_mmseqs_database(force_download: bool = False, threads: int = 1, db_dir: Path | None = None) → Tuple[bool, str, Path | None][source]: Convenience function to download MMseqs2 NT database.

eplace_lib.ncbi_download.setup_mmseqs_taxonomy(mmseqs_db: Path, ncbi_taxonomy: Path, threads: int = 1, acc2taxid_dir: Path | None = None, taxonomy_workdir: Path | None = None, db_dir: Path | None = None) → Tuple[bool, str][source]: Convenience function to add taxonomy to an MMseqs2 database.

eplace_lib.ncbi_download.setup_ncbi_database(force_download: bool = False, verbose: bool = True) → Tuple[bool, str][source]

Convenience function to setup the NCBI core_nt database.

Parameters:

force_download – If True, downloads even if database exists
verbose – If True, logs progress information (default: True)

Returns:

Tuple of (success: bool, message: str)

eplace_lib.cli

ePLACE: environmental Phylogenetic Localisation and Clade Estimation

Main command-line interface for ePLACE toolkit. Provides unified access to database download, BLAST analysis, and grouped workflows.

eplace_lib.cli.blast_command(args)[source]: Handle the blast subcommand - individual workflow.

eplace_lib.cli.download_command(args)[source]: Handle the download subcommand.

eplace_lib.cli.grouped_command(args)[source]: Handle the grouped subcommand - grouped workflow.

eplace_lib.cli.main()[source]: Main entry point for the ePLACE CLI.

eplace_lib.cli.relabel_command(args)[source]: Handle the relabel subcommand - relabel tree with taxonomy.

Quick Examples

BLAST Analysis

from pathlib import Path
from eplace_lib import run_blast_search, process_blast_results_for_taxonomy

# Run BLAST search with filtering
success, filtered_hits = run_blast_search(
    query_fasta=Path("query.fasta"),
    output_file=Path("blast_results.txt"),
    min_identity=90.0,
    min_coverage=80.0
)

# Extract representative sequences
results = process_blast_results_for_taxonomy(
    blast_hits=filtered_hits,
    output_dir=Path("output"),
    rank="genus"
)

Database Download

from eplace_lib import setup_ncbi_database

# Download the core_nt database
success, message = setup_ncbi_database()
print(f"Success: {success}, Message: {message}")

FASTA Reading

from pathlib import Path
from eplace_lib.blast_analysis import FastaReader

# Read sequences
sequences = FastaReader.read_fasta(Path("input.fasta"))

# Get sequence lengths
lengths = FastaReader.get_sequence_lengths(Path("input.fasta"))

Sequence Alignment

from pathlib import Path
from eplace_lib.alignment import align_sequences, build_phylogenetic_tree

# Align sequences
success = align_sequences(
    input_fasta=Path("sequences.fasta"),
    output_fasta=Path("aligned.fasta"),
    num_threads=4
)

# Build tree
success = build_phylogenetic_tree(
    alignment_fasta=Path("aligned.fasta"),
    output_prefix=Path("tree"),
    num_threads=4
)

Data Structures

BlastHit

Represents a single BLAST hit with the following attributes:

query_id: Query sequence identifier
subject_id: Subject (database) sequence identifier
percent_identity: Percentage of identical matches
alignment_length: Length of alignment
query_length: Length of query sequence
subject_length: Length of subject sequence
query_start: Start position in query
query_end: End position in query
subject_start: Start position in subject
subject_end: End position in subject
evalue: Expectation value
bit_score: Bit score
query_coverage: Percentage of query covered by alignment
subject_taxonomy: Taxonomic information for the subject, as a dictionary mapping each rank (phylum, class, order, family, genus, species) to a (taxid, name) tuple, or None

Example usage:

from eplace_lib.blast_analysis import BlastHit

# Create a BlastHit
hit = BlastHit(
    query_id="query1",
    subject_id="NC_001234.5",
    percent_identity=95.5,
    alignment_length=500,
    query_length=550,
    subject_length=5000,
    query_start=1,
    query_end=500,
    subject_start=100,
    subject_end=599,
    evalue=1e-100,
    bit_score=900,
    query_coverage=90.9,
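    # subject_taxid and subject_taxids are required by the constructor signature
    subject_taxid="562",
    subject_taxids="562",
    # Each rank maps to a (taxid, name) tuple
    subject_taxonomy={
        "genus": ("561", "Escherichia"),
        "species": ("562", "Escherichia coli"),
    }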
)
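
The query_coverage value in this example is consistent with dividing the aligned query span by the query length. The short sketch below shows that arithmetic; the formula is an assumption about how the percentage is derived, and the function is hypothetical rather than part of eplace_lib.

```python
def query_coverage(query_start: int, query_end: int, query_length: int) -> float:
    """Percentage of the query covered, using 1-based inclusive coordinates."""
    aligned_span = abs(query_end - query_start) + 1
    return 100.0 * aligned_span / query_length


# Values from the BlastHit example: positions 1..500 of a 550 bp query.
print(round(query_coverage(1, 500, 550), 1))  # 90.9
```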

Common Workflows

Complete BLAST to Tree Workflow

from pathlib import Path
from eplace_lib import (
    run_blast_search,
    process_blast_results_for_taxonomy,
)
from eplace_lib.sequences import trim_sequences_to_blast_coordinates
from eplace_lib.alignment import align_sequences, build_phylogenetic_tree

# Step 1: BLAST search
success, filtered_hits = run_blast_search(
    query_fasta=Path("query.fasta"),
    output_file=Path("blast_results.txt"),
    min_identity=90.0,
    min_coverage=80.0,
    num_threads=4
)

# Step 2: Extract representatives
results = process_blast_results_for_taxonomy(
    blast_hits=filtered_hits,
    output_dir=Path("output"),
    rank="genus"
)

# Step 3: Process each query
for query_id, fasta_path in results.items():
    # Trim sequences
    trimmed_path = fasta_path.parent / f"{query_id}_trimmed.fasta"
    trim_sequences_to_blast_coordinates(
        input_fasta=fasta_path,
        output_fasta=trimmed_path,
        blast_hits=filtered_hits
    )

    # Align sequences
    aligned_path = fasta_path.parent / f"{query_id}_aligned.fasta"
    align_sequences(
        input_fasta=trimmed_path,
        output_fasta=aligned_path,
        num_threads=4
    )

    # Build tree
    tree_prefix = fasta_path.parent / f"{query_id}_tree"
    build_phylogenetic_tree(
        alignment_fasta=aligned_path,
        output_prefix=tree_prefix,
        num_threads=4
    )

Custom BLAST Parameters

from pathlib import Path
from eplace_lib.blast_analysis import BlastRunner

runner = BlastRunner()

# Run BLAST with custom parameters
success = runner.run_blastn(
    query_fasta=Path("query.fasta"),
    output_file=Path("blast_results.txt"),
    database="core_nt",
    num_threads=8,
    max_target_seqs=500,
    evalue=1e-10,
    word_size=11
)

# Parse and filter results
hits = runner.parse_blast_results(Path("blast_results.txt"))
filtered_hits = runner.filter_blast_hits(
    hits,
    min_identity=95.0,
    min_coverage=90.0
)

Working with Taxonomic Data

from eplace_lib.taxonomy import TaxonomyExtractor

extractor = TaxonomyExtractor()

# Group hits by query
grouped_hits = extractor.group_hits_by_query(blast_hits)

# Select representatives at different ranks
for query_id, query_hits in grouped_hits.items():
    # At genus level
    genus_reps = extractor.select_representatives_by_rank(
        hits=query_hits,
        rank="genus",
        max_per_rank=1
    )

    # At species level
    species_reps = extractor.select_representatives_by_rank(
        hits=query_hits,
        rank="species",
        max_per_rank=2
    )

Error Handling

Most functions return a success indicator, and those that return a tuple pair it with either the result or an error message:

from pathlib import Path
from eplace_lib import run_blast_search

success, result = run_blast_search(
    query_fasta=Path("query.fasta"),
    output_file=Path("output.txt"),
    min_identity=90.0,
    min_coverage=80.0
)

if not success:
    print(f"BLAST failed: {result}")
else:
    print(f"Found {len(result)} hits")

For functions that return a plain bool rather than a tuple, check the return value directly:

from pathlib import Path
from eplace_lib.alignment import align_sequences

success = align_sequences(
    input_fasta=Path("sequences.fasta"),
    output_fasta=Path("aligned.fasta")
)

if not success:
    print("Alignment failed")

Type Hints

ePLACE uses type hints throughout the codebase for better IDE support:

from pathlib import Path
from typing import List, Dict, Tuple
from eplace_lib.blast_analysis import BlastHit

def process_hits(
    hits: List[BlastHit],
    min_identity: float = 90.0
) -> Tuple[bool, List[BlastHit]]:
    """Process BLAST hits with type hints."""
    filtered = [h for h in hits if h.percent_identity >= min_identity]
    return True, filtered

Logging

ePLACE uses Python’s logging module. Configure logging in your scripts:

import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

# Now run ePLACE functions
from eplace_lib import run_blast_search

Advanced Usage

Custom Database Management

from eplace_lib.ncbi_download import NCBIDownloader

downloader = NCBIDownloader()

# Get database directory
db_dir = downloader.get_blastdb_directory()

# Check if database exists
exists = downloader.check_database_exists()

# Get available files
files = downloader.get_available_files()

# Download specific file
downloader.download_file('core_nt.00.tar.gz', db_dir)

Sequence Extraction

from pathlib import Path
from eplace_lib.taxonomy import SequenceExtractor

extractor = SequenceExtractor()

# Extract specific sequences
success = extractor.extract_sequences(
    sequence_ids=["NC_001234.5", "NC_005678.9"],
    output_fasta=Path("extracted.fasta"),
    database="core_nt"
)
