API Reference
This page documents the ePLACE Python API for programmatic access.
Core Modules
eplace_lib.blast_analysis
BLAST analysis module for sequence comparison.
This module provides functionality for running BLAST searches and filtering results based on sequence identity and coverage criteria.
- class eplace_lib.blast_analysis.BlastHit(query_id: str, subject_id: str, percent_identity: float, alignment_length: int, query_length: int, subject_length: int, query_start: int, query_end: int, subject_start: int, subject_end: int, evalue: float, bit_score: float, query_coverage: float, subject_taxid: str, subject_taxids: str, subject_taxonomy: Dict[str, Tuple[str, str]] | None = None)[source]
Bases: object
Represents a single BLAST hit result.
- subject_taxonomy
The subject's taxonomy information. A dictionary with rank as key and a tuple of (taxid, name) as value.
- get_accession() str[source]
Extract the accession number from the subject_id.
BLAST IDs can be in various formats:
- gi|2273658778|gb|MZ387488.1| -> MZ387488.1
- ref|NZ_CP123456.1| -> NZ_CP123456.1
- gb|MZ387488.1| -> MZ387488.1
- MZ387488.1 -> MZ387488.1 (already in accession format)
Note: gnl|database|identifier format is handled by returning the identifier, but these may not be standard accessions.
- Returns:
The accession number extracted from subject_id, or the full subject_id if no standard format is detected
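As a rough illustration of the formats listed above, the sketch below pulls an accession out of a subject ID. The name extract_accession and the tag set KNOWN_TAGS are assumptions made for this example; get_accession() itself may cover more cases.

```python
# Database tags taken from the formats listed above (assumed set, not exhaustive).
KNOWN_TAGS = ("gb", "ref")


def extract_accession(subject_id: str) -> str:
    """Hypothetical helper: return the accession embedded in a BLAST subject ID."""
    if "|" not in subject_id:
        return subject_id  # already in accession format
    fields = subject_id.split("|")
    # gnl|database|identifier: return the identifier (third field).
    if fields[0] == "gnl" and len(fields) >= 3 and fields[2]:
        return fields[2]
    # Otherwise look for a known tag and return the field that follows it.
    for tag, value in zip(fields, fields[1:]):
        if tag in KNOWN_TAGS and value:
            return value
    return subject_id  # no standard format detected
```

Unrecognised pipe-delimited IDs fall through to the full subject_id, matching the documented fallback.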
- get_subject_taxonomy(rank: str) tuple[str, str] | None[source]
Return the taxonomy information as a tuple of (taxid, name) for the given rank. If the rank is not found, return None.
- Parameters:
rank – The rank to return the taxonomy information for.
- Returns:
tuple of (taxid, name) for the given rank, or None if the rank is not found.
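The lookup described above can be pictured as a plain dictionary access. This is a sketch only: lookup_rank is a hypothetical stand-in for get_subject_taxonomy(), and the taxonomy dictionary holds illustrative data in the documented rank -> (taxid, name) shape.

```python
from typing import Dict, Optional, Tuple

# Illustrative data only: rank -> (taxid, name).
example_taxonomy: Dict[str, Tuple[str, str]] = {
    "genus": ("9605", "Homo"),
    "species": ("9606", "Homo sapiens"),
}


def lookup_rank(
    taxonomy: Optional[Dict[str, Tuple[str, str]]], rank: str
) -> Optional[Tuple[str, str]]:
    """Hypothetical helper: return (taxid, name) for a rank, or None if missing."""
    if not taxonomy:
        return None  # subject_taxonomy defaults to None in the BlastHit signature
    return taxonomy.get(rank)
```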
- class eplace_lib.blast_analysis.BlastRunner(blastdb_path: Path | None = None)[source]
Bases: object
Class for running BLAST searches and parsing results.
- __init__(blastdb_path: Path | None = None)[source]
Initialize the BlastRunner.
- Parameters:
blastdb_path – Path to BLAST database directory. If None, uses BLASTDB env var.
- check_blastn_available() bool[source]
Check if blastn is available in the system.
- Returns:
True if blastn is available, False otherwise
- filter_blast_hits(hits: list[BlastHit], min_identity: float = 90.0, min_coverage: float = 80.0, min_alignment_length: int | None = None) list[BlastHit][source]
Filter BLAST hits based on identity and coverage thresholds.
- Parameters:
hits – list of BlastHit objects
min_identity – Minimum percent identity (default: 90.0)
min_coverage – Minimum query coverage percentage (default: 80.0)
min_alignment_length – Minimum alignment length (optional)
- Returns:
Filtered list of BlastHit objects
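The thresholds above can be sketched as a simple pass over the hits. The Hit dataclass is a reduced stand-in for BlastHit, and treating the thresholds as inclusive (>=) is an assumption of this example, not something the reference states.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """Reduced stand-in for BlastHit with only the fields used for filtering."""
    subject_id: str
    percent_identity: float
    query_coverage: float
    alignment_length: int


def filter_hits_sketch(
    hits: List[Hit],
    min_identity: float = 90.0,
    min_coverage: float = 80.0,
    min_alignment_length: Optional[int] = None,
) -> List[Hit]:
    """Keep hits that meet every threshold (inclusive comparison assumed)."""
    kept = []
    for hit in hits:
        if hit.percent_identity < min_identity:
            continue
        if hit.query_coverage < min_coverage:
            continue
        if min_alignment_length is not None and hit.alignment_length < min_alignment_length:
            continue
        kept.append(hit)
    return kept
```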
- parse_blast_results(blast_output: Path, query_lengths: dict[str, int] | None = None) list[BlastHit][source]
Parse BLAST tabular output.
- Parameters:
blast_output – Path to BLAST output file (tabular format)
query_lengths – dictionary of query sequence lengths. If None, uses qlen from results.
- Returns:
list of BlastHit objects
- Raises:
FileNotFoundError – If BLAST output file doesn’t exist
ValueError – If BLAST output is malformed
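For orientation, here is a minimal sketch of parsing one line of tabular output with the 14 columns from the default outfmt of run_blastn. The function name and the coverage formula (aligned query span divided by qlen) are assumptions of this example; parse_blast_results may compute coverage differently.

```python
from typing import Dict, Union

COLUMNS = (
    "qseqid sseqid pident length qlen slen qstart qend "
    "sstart send evalue bitscore staxid staxids"
).split()
INT_COLUMNS = {"length", "qlen", "slen", "qstart", "qend", "sstart", "send"}
FLOAT_COLUMNS = {"pident", "evalue", "bitscore"}


def parse_tabular_line(line: str) -> Dict[str, Union[str, int, float]]:
    """Hypothetical helper: turn one tab-separated BLAST line into a record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    record: Dict[str, Union[str, int, float]] = {}
    for name, raw in zip(COLUMNS, fields):
        if name in INT_COLUMNS:
            record[name] = int(raw)      # raises ValueError on malformed numbers
        elif name in FLOAT_COLUMNS:
            record[name] = float(raw)
        else:
            record[name] = raw
    # Assumed definition: percentage of the query covered by the aligned span.
    aligned = abs(record["qend"] - record["qstart"]) + 1
    record["query_coverage"] = 100.0 * aligned / record["qlen"]
    return record
```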
- run_blastn(query_fasta: Path, output_file: Path, database: str = 'core_nt', num_threads: int = 1, max_target_seqs: int = 100, evalue: float = 1e-05, outfmt: str = '6 qseqid sseqid pident length qlen slen qstart qend sstart send evalue bitscore staxid staxids') bool[source]
Run blastn search.
- Parameters:
query_fasta – Path to query FASTA file
output_file – Path to output file
database – Name of BLAST database (default: “core_nt”)
num_threads – Number of threads to use
max_target_seqs – Maximum number of target sequences to report
evalue – E-value threshold
outfmt – Output format string
- Returns:
True if BLAST ran successfully, False otherwise
- Raises:
FileNotFoundError – If query file doesn’t exist
RuntimeError – If blastn is not available
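The parameters above map naturally onto standard blastn command-line options. The sketch below only builds the argument list and checks the PATH; the helper names are hypothetical, and the exact command that run_blastn assembles is not shown in this reference.

```python
import shutil
from pathlib import Path
from typing import List

DEFAULT_OUTFMT = (
    "6 qseqid sseqid pident length qlen slen qstart qend "
    "sstart send evalue bitscore staxid staxids"
)


def build_blastn_command(
    query_fasta: Path,
    output_file: Path,
    database: str = "core_nt",
    num_threads: int = 1,
    max_target_seqs: int = 100,
    evalue: float = 1e-05,
    outfmt: str = DEFAULT_OUTFMT,
) -> List[str]:
    """Hypothetical helper: assemble a blastn argument list (nothing is executed)."""
    return [
        "blastn",
        "-query", str(query_fasta),
        "-db", database,
        "-out", str(output_file),
        "-outfmt", outfmt,
        "-num_threads", str(num_threads),
        "-max_target_seqs", str(max_target_seqs),
        "-evalue", str(evalue),
    ]


def blastn_on_path() -> bool:
    """True if a blastn executable can be found on the PATH."""
    return shutil.which("blastn") is not None
```

Passing the list to subprocess.run without a shell keeps the multi-word outfmt string as a single argument.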
- class eplace_lib.blast_analysis.FastaReader[source]
Bases: object
Class for reading FASTA files.
- static get_sequence_lengths(fasta_path: Path) dict[str, int][source]
Get the length of each sequence in a FASTA file.
- Parameters:
fasta_path – Path to the FASTA file
- Returns:
dictionary mapping sequence IDs to their lengths
- static read_fasta(fasta_path: Path) dict[str, str][source]
Read sequences from a FASTA file.
- Parameters:
fasta_path – Path to the FASTA file
- Returns:
dictionary mapping sequence IDs to sequences
- Raises:
FileNotFoundError – If FASTA file doesn’t exist
ValueError – If FASTA file is malformed
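A minimal sketch of what the two FastaReader methods return, under stated assumptions: the sequence ID is taken to be the first whitespace-delimited token of the header, and later duplicates overwrite earlier ones. The function names are hypothetical and the library's own parsing rules may differ.

```python
from pathlib import Path
from typing import Dict, List, Optional


def read_fasta_sketch(fasta_path: Path) -> Dict[str, str]:
    """Hypothetical reader: map sequence IDs to sequences."""
    fasta_path = Path(fasta_path)
    if not fasta_path.exists():
        raise FileNotFoundError(fasta_path)
    chunks: Dict[str, List[str]] = {}
    current_id: Optional[str] = None
    with fasta_path.open() as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                header = line[1:].split()
                if not header:
                    raise ValueError("empty FASTA header")
                current_id = header[0]  # assumed: ID is the first token
                chunks[current_id] = []
            elif current_id is None:
                raise ValueError("sequence data before first header")
            else:
                chunks[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in chunks.items()}


def sequence_lengths_sketch(fasta_path: Path) -> Dict[str, int]:
    """Hypothetical counterpart of get_sequence_lengths()."""
    return {seq_id: len(seq) for seq_id, seq in read_fasta_sketch(fasta_path).items()}
```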
- class eplace_lib.blast_analysis.MMseqs2Runner(db_path: Path | None = None)[source]
Bases: object
Class for running MMseqs2 searches and parsing results.
MMseqs2 (Many-against-Many sequence searching) is an alternative to BLAST for sequence similarity searching, offering improved speed and sensitivity. Results are parsed into BlastHit objects for compatibility with the rest of the ePLACE pipeline.
The target database can be either a pre-built MMseqs2 database (created with mmseqs createdb) or a FASTA file that MMseqs2 indexes automatically. Taxonomy fields (taxid) are populated only when the database was built with taxonomy information (mmseqs createtaxdb); otherwise they default to “0”.
Database selection: To keep results comparable with the BLAST workflow (which uses NCBI core_nt), the recommended MMseqs2 database should be built from the same underlying sequence collection as core_nt. This means creating an MMseqs2 database from the FASTA sequences that make up NCBI core_nt (e.g. by exporting them with blastdbcmd -db core_nt -entry all and then running mmseqs createdb). Using a different nucleotide collection will change the search space and may produce classification differences that reflect the database rather than the search algorithm. There is no official pre-built MMseqs2 core_nt database; users must provide their own.
- __init__(db_path: Path | None = None)[source]
Initialize the MMseqs2Runner.
- Parameters:
db_path – Path to the MMseqs2 database directory. If None, the MMSEQS_DB_DIR environment variable is used; if unset, MMSEQS2DB is used as a legacy fallback; if both are unset, the directory ~/mmseqs2db is used.
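The fallback order described for db_path can be sketched as follows. The function name and the environ parameter (added so the logic can be exercised without touching the real environment) are assumptions of this example, as is treating an empty variable as unset.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_mmseqs_db_dir(
    db_path: Optional[Path] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> Path:
    """Hypothetical helper: explicit path, then MMSEQS_DB_DIR, then MMSEQS2DB, then ~/mmseqs2db."""
    env = os.environ if environ is None else environ
    if db_path is not None:
        return Path(db_path)
    for variable in ("MMSEQS_DB_DIR", "MMSEQS2DB"):
        value = env.get(variable)
        if value:  # assumed: an empty value counts as unset
            return Path(value)
    return Path("~/mmseqs2db").expanduser()
```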
- check_mmseqs_available() bool[source]
Check if mmseqs is available in the system PATH.
- Returns:
True if mmseqs is available, False otherwise
- filter_hits(hits: list[BlastHit], min_identity: float = 90.0, min_coverage: float = 80.0, min_alignment_length: int | None = None) list[BlastHit][source]
Filter MMseqs2 hits based on identity and coverage thresholds.
- Parameters:
hits – list of BlastHit objects
min_identity – Minimum percent identity (default: 90.0)
min_coverage – Minimum query coverage percentage (default: 80.0)
min_alignment_length – Minimum alignment length (optional)
- Returns:
Filtered list of BlastHit objects
- parse_mmseqs_results(mmseqs_output: Path, query_lengths: dict[str, int] | None = None) list[BlastHit][source]
Parse MMseqs2 tabular output into BlastHit objects.
Expects output generated with --format-output set to: query,target,pident,alnlen,qlen,tlen,qstart,qend,tstart,tend,evalue,bits,taxid,taxlineage
The taxid and taxlineage columns are optional; if absent or set to “N/A” / “0”, subject_taxid will be stored as “0”.
- Parameters:
mmseqs_output – Path to MMseqs2 output file
query_lengths – Unused; kept for API compatibility with BlastRunner.parse_blast_results.
- Returns:
list of BlastHit objects
- Raises:
FileNotFoundError – If the output file doesn’t exist
ValueError – If the output is malformed
- run_easy_search(query_fasta: Path, output_file: Path, database: str = 'core_nt', num_threads: int = 1, max_target_seqs: int = 100, evalue: float = 1e-05, sensitivity: float = 5.7, tmp_dir: Path | None = None, search_type: int = 3, split_memory_limit: str | None = None, timeout: int = 3600) bool[source]
Run an MMseqs2 easy-search.
The output is written in a tab-separated format with the following columns (in order): query, target, pident, alnlen, qlen, tlen, qstart, qend, tstart, tend, evalue, bits, taxid, taxlineage
- Parameters:
query_fasta – Path to query FASTA file
output_file – Path to output file
database – Name of the MMseqs2 database inside db_path (default: “core_nt”). There is no official pre-built MMseqs2 core_nt database; users must build their own from the same sequence collection as BLAST core_nt to keep results comparable across backends.
num_threads – Number of threads to use
max_target_seqs – Maximum number of target sequences to report
evalue – E-value threshold
sensitivity – MMseqs2 sensitivity (1–7.5, default: 5.7)
tmp_dir – Temporary directory for MMseqs2 intermediate files. Defaults to a mmseqs_tmp subdirectory next to output_file.
search_type – MMseqs2 search type passed as --search-type to easy-search. Commonly used values: 2 (translated), 3 (nucleotide), 4 (translated nucleotide backtrace). Default is 3 (nucleotide). See MMseqs2 documentation for all valid values.
split_memory_limit – Maximum RAM for the MMseqs2 prefilter/index step, passed as --split-memory-limit to easy-search (e.g. "400G"). When None, the flag is omitted and MMseqs2 uses its own default.
timeout – Maximum runtime for the MMseqs2 search in seconds (default: 3600).
- Returns:
True if MMseqs2 ran successfully, False otherwise
- Raises:
FileNotFoundError – If query file doesn’t exist
RuntimeError – If mmseqs is not available
- eplace_lib.blast_analysis.normalize_sequence_id(seq_id: str) str[source]
Normalize an arbitrary sequence or tree label to a canonical accession-like identifier.
This is used to compare IDs from different sources (BLAST subject IDs, FASTA headers, tree leaf labels) that may be formatted differently but refer to the same sequence.
Normalization steps:
1. Strip a leading ‘>’ (FASTA header prefix).
2. Take only the first whitespace-delimited token.
3. Remove MAFFT reverse-complement markers: a leading ‘_R_’ prefix or a trailing ‘_R_’ suffix.
4. If the token contains pipes (‘|’), extract the accession via _extract_accession_from_pipe_id() (gi|…|gb|ACC|, ref|ACC|, gb|ACC|, etc.). Otherwise return the token unchanged.
- Parameters:
seq_id – Raw sequence identifier from any source.
- Returns:
Canonical accession string suitable for exact comparison.
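The normalization steps can be sketched as below. This is an illustration under assumptions, not the library's code: normalize_id_sketch is a hypothetical name, the pipe handling recognises only the gb and ref tags, and unrecognised pipe-delimited tokens are returned unchanged.

```python
def normalize_id_sketch(seq_id: str) -> str:
    """Hypothetical re-implementation of the four documented steps."""
    token = seq_id.strip()
    # 1. Strip a leading '>' (FASTA header prefix).
    if token.startswith(">"):
        token = token[1:]
    # 2. Keep only the first whitespace-delimited token.
    parts = token.split()
    if not parts:
        return ""
    token = parts[0]
    # 3. Remove MAFFT reverse-complement markers.
    if token.startswith("_R_"):
        token = token[3:]
    if token.endswith("_R_"):
        token = token[:-3]
    # 4. Extract the accession from pipe-delimited IDs (assumed tag set).
    if "|" in token:
        fields = token.split("|")
        for tag, value in zip(fields, fields[1:]):
            if tag in ("gb", "ref") and value:
                return value
    return token
```

With this in place, a tree leaf label and a BLAST subject ID that refer to the same record compare equal after normalization.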
- eplace_lib.blast_analysis.run_blast_search(query_fasta: Path, output_file: Path, min_identity: float = 90.0, min_coverage: float = 80.0, database: str = 'core_nt', blastdb_path: Path | None = None, num_threads: int = 1, skip_existing: bool = True) tuple[bool, list[BlastHit]][source]
Convenience function to run BLAST search and return filtered hits.
- Parameters:
query_fasta – Path to query FASTA file
output_file – Path to output file
min_identity – Minimum percent identity (default: 90.0)
min_coverage – Minimum query coverage percentage (default: 80.0)
database – Name of BLAST database (default: “core_nt”)
blastdb_path – Path to BLAST database directory
num_threads – Number of threads to use
skip_existing – Skip search if output file already exists (default: True)
- Returns:
Tuple of (success: bool, filtered_hits: list[BlastHit])
- eplace_lib.blast_analysis.run_mmseqs_search(query_fasta: Path, output_file: Path, min_identity: float = 90.0, min_coverage: float = 80.0, database: str = 'core_nt', db_path: Path | None = None, num_threads: int = 1, sensitivity: float = 5.7, skip_existing: bool = True, search_type: int = 3, memory_limit: str | None = None, timeout: int = 3600) tuple[bool, list[BlastHit]][source]
Convenience function to run an MMseqs2 search and return filtered hits.
To keep results comparable with the BLAST workflow (which searches NCBI core_nt), the MMseqs2 database should be built from the same underlying sequence collection as core_nt. There is no official pre-built MMseqs2 core_nt database; users must create one from the relevant FASTA sequences (e.g. exported from BLAST core_nt with blastdbcmd). Using a different nucleotide collection changes the search space and may produce classification differences unrelated to the choice of search engine.
- Parameters:
query_fasta – Path to query FASTA file
output_file – Path to output file
min_identity – Minimum percent identity (default: 90.0)
min_coverage – Minimum query coverage percentage (default: 80.0)
database – Name of MMseqs2 database inside db_path (default: “core_nt”)
db_path – Path to the MMseqs2 database directory
num_threads – Number of threads to use
sensitivity – MMseqs2 sensitivity (1–7.5, default: 5.7)
skip_existing – Skip search if output file already exists (default: True)
search_type – MMseqs2 search type passed as --search-type to easy-search. Commonly used values: 2 (translated), 3 (nucleotide), 4 (translated nucleotide backtrace). Default is 3 (nucleotide). See MMseqs2 documentation for all valid values.
memory_limit – Maximum RAM for the MMseqs2 prefilter/index step, passed as --split-memory-limit to easy-search (e.g. "400G"). When None, the flag is omitted.
timeout – Maximum runtime for the MMseqs2 search in seconds (default: 3600).
- Returns:
Tuple of (success: bool, filtered_hits: list[BlastHit])
- Raises:
ValueError – If sensitivity is outside the valid range (1–7.5)
- eplace_lib.blast_analysis.validate_mmseqs_memory_limit(value: str) str[source]
Validate a MMseqs2-style memory limit string.
Accepts a positive integer (no leading zeros) followed by a single unit suffix K, M, G, or T (case-sensitive, no space). Examples of valid values: 64G, 128G, 400G, 1T, 512M.
- Parameters:
value – The memory limit string to validate.
- Returns:
The validated string unchanged.
- Raises:
ValueError – If the string is empty, missing units, has an invalid unit suffix, or is otherwise malformed.
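The rule above (positive integer without leading zeros, then exactly one of K, M, G, T) fits a single regular expression. The sketch below is one way to express it; the function name and error message are assumptions, and the library may report more specific errors.

```python
import re

# Positive integer without leading zeros, followed by one upper-case unit suffix.
_MEMORY_LIMIT_PATTERN = re.compile(r"[1-9][0-9]*[KMGT]")


def validate_memory_limit_sketch(value: str) -> str:
    """Hypothetical validator: return the string unchanged or raise ValueError."""
    if not _MEMORY_LIMIT_PATTERN.fullmatch(value):
        raise ValueError(f"invalid memory limit: {value!r}")
    return value
```

Using fullmatch rather than match means trailing characters, including a newline, are rejected.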
eplace_lib.taxonomy
Taxonomy extraction and sequence retrieval module.
This module provides functionality for extracting taxonomic information from BLAST results, selecting representative sequences per taxonomic rank, and extracting sequences from databases.
- class eplace_lib.taxonomy.SequenceExtractor(blastdb_path: Path | None = None)[source]
Bases: object
Class for extracting sequences from BLAST databases.
- __init__(blastdb_path: Path | None = None)[source]
Initialize the SequenceExtractor.
- Parameters:
blastdb_path – Path to BLAST database directory. If None, uses BLASTDB env var.
- check_blastdbcmd_available() bool[source]
Check if blastdbcmd is available in the system.
- Returns:
True if blastdbcmd is available, False otherwise
- extract_representatives_for_query(query_id: str, representative_hits: list[BlastHit], output_dir: Path, database: str = 'core_nt') Path | None[source]
Extract representative sequences for a single query to a FASTA file.
- Parameters:
query_id – Query sequence identifier
representative_hits – list of representative BlastHit objects
output_dir – Output directory for FASTA files
database – Name of BLAST database
- Returns:
Path to output FASTA file if successful, None otherwise
- extract_sequences(sequence_ids: list[str], output_fasta: Path, database: str = 'core_nt') bool[source]
Extract sequences from BLAST database using blastdbcmd.
- Parameters:
sequence_ids – list of sequence IDs to extract
output_fasta – Path to output FASTA file
database – Name of BLAST database (default: “core_nt”)
- Returns:
True if extraction was successful, False otherwise
- Raises:
RuntimeError – If blastdbcmd is not available
- class eplace_lib.taxonomy.TaxonomyExtractor[source]
Bases: object
Class for extracting taxonomic information from sequence IDs.
- group_hits_by_query(hits: list[BlastHit]) dict[str, list[BlastHit]][source]
Group BLAST hits by query sequence.
- Parameters:
hits – list of BlastHit objects
- Returns:
dictionary mapping query IDs to lists of hits
- parse_taxids(tax_ids: list[str]) dict[str, dict[str, tuple[str, str]]][source]
Parse taxonomic information from the taxonomy IDs reported in the BLAST hits.
- Parameters:
tax_ids – the taxonomy IDs reported by BLAST
- Returns:
dictionary containing the rank and a tuple of the taxonomy ID and the name
- select_representatives_by_rank(hits: list[BlastHit], rank: str, max_per_rank: int = 1, preferred_representatives: Dict[str, str] | None = None) list[BlastHit][source]
Select representative sequences per taxonomic rank.
- Parameters:
hits – list of BlastHit objects for a single query
rank – Taxonomic rank for representative selection
max_per_rank – Maximum number of representatives per rank (default: 1)
preferred_representatives – Optional dictionary mapping rank_tid to preferred subject_id to ensure consistent representatives across queries
- Returns:
list of representative BlastHit objects
- eplace_lib.taxonomy.generate_classification_summary(sequences: dict[str, str], blast_hits: List[BlastHit], output_file: Path, rank: str = 'genus', group_rank: str = 'class', tree_label_rank: str = 'genus', tree_files: dict[str, Path] | None = None) bool[source]
Generate a classification summary TSV file for each query sequence.
This function creates a TSV file that reports:
- Query sequence ID
- Closest organism at the classification rank (--rank)
- Closest organism at the grouping rank (--group-rank)
- Closest organism at the tree labeling rank (--tree-label-rank)
- Whether the sequence appears in multiple groups
- Whether the sequence has no appropriate classification
The classification is based on the phylogenetically nearest neighbor in the tree (if available), otherwise falls back to the best BLAST hit by bit score.
- Parameters:
sequences – dictionary of sequences read from the query FASTA file
blast_hits – List of BlastHit objects with taxonomy information
output_file – Path to output TSV file
rank – Taxonomic rank for classification (default: genus)
group_rank – Taxonomic rank for grouping (default: class)
tree_label_rank – Taxonomic rank for tree labeling (default: genus)
tree_files – Optional dict mapping query_id to tree file paths for finding nearest neighbors
- Returns:
True if successful, False otherwise
- eplace_lib.taxonomy.process_blast_results_for_taxonomy(blast_hits: List[BlastHit], output_dir: Path, rank: str = 'genus', database: str = 'core_nt', blastdb_path: Path | None = None) Dict[str, Path | None][source]
Process BLAST hits to extract representative sequences per taxonomic rank.
- Parameters:
blast_hits – list of BlastHit objects
output_dir – Output directory for FASTA files
rank – Taxonomic rank for representative selection
database – Name of BLAST database
blastdb_path – Path to BLAST database directory
- Returns:
dictionary mapping query IDs to output FASTA file paths
- eplace_lib.taxonomy.rewrite_blast_hits(blast_hits: List[BlastHit], output_file: Path, header: bool = True) bool[source]
Write the BLAST hits to a file after they have been annotated.
- Parameters:
blast_hits – list of BlastHit objects
output_file – the file to write to
header – whether to include a header line in the file
- Returns:
True on success
- eplace_lib.taxonomy.sort_strings_and_numbers(s: str)[source]
Extract text and numbers from strings for proper sorting.
- Parameters:
s – string to extract the number from
- Returns:
A tuple (text_part, num_part) that can be used as a sort key. For strings matching the pattern <non-digits><digits>, this is the non-digit prefix and the trailing integer. For non-matching strings, returns (s, 0).
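A sort key of this kind makes labels such as seq2 sort before seq10. The sketch below is an illustration; the name natural_key and the choice to require a non-empty non-digit prefix are assumptions of this example.

```python
import re
from typing import Tuple

# <non-digits><digits>; a non-empty prefix is assumed here.
_TEXT_THEN_NUMBER = re.compile(r"(\D+)(\d+)")


def natural_key(s: str) -> Tuple[str, int]:
    """Hypothetical sort key: (text_part, num_part), or (s, 0) when no match."""
    match = _TEXT_THEN_NUMBER.fullmatch(s)
    if match is None:
        return (s, 0)
    return (match.group(1), int(match.group(2)))


labels = ["seq10", "seq2", "seq1"]
ordered = sorted(labels, key=natural_key)
```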
eplace_lib.sequences
Sequence analysis module for eDNA data.
This module provides basic functionality for analyzing environmental DNA sequences.
- class eplace_lib.sequences.SequenceAnalyzer[source]
Bases: object
A class for analyzing eDNA sequences.
This class provides methods for basic sequence analysis operations commonly used in environmental DNA studies.
- calculate_gc_content(sequence: str) float[source]
Calculate the GC content of a DNA sequence.
- Parameters:
sequence – DNA sequence string
- Returns:
GC content as a percentage (0-100)
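GC content is the share of G and C bases in the sequence, expressed as a percentage. The sketch below assumes case-insensitive counting, division by the full sequence length, and 0.0 for an empty sequence; the library method may treat ambiguous bases or empty input differently.

```python
def gc_content_sketch(sequence: str) -> float:
    """Hypothetical helper: percentage of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0  # assumed behaviour for empty input
    gc_count = sequence.count("G") + sequence.count("C")
    return 100.0 * gc_count / len(sequence)
```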
eplace_lib.alignment
Sequence alignment and phylogenetic tree building module.
This module provides functionality for trimming sequences based on BLAST alignments, aligning sequences using MAFFT, and building phylogenetic trees using IQTree.
- class eplace_lib.alignment.IQTreeBuilder[source]
Bases: object
Class for building phylogenetic trees using IQTree.
- static build_tree(alignment_fasta: Path, output_prefix: Path, model: str = 'MFP', num_threads: int = None) bool[source]
Build a phylogenetic tree using IQTree.
- Parameters:
alignment_fasta – Path to aligned FASTA file
output_prefix – Prefix for output files
model – Substitution model (default: “MFP” for automatic ModelFinder Plus selection)
num_threads – Number of threads to use (default: None, which uses AUTO)
- Returns:
True if tree building was successful, False otherwise
- static build_tree_background(alignment_fasta: Path, output_prefix: Path, model: str = 'MFP') Dict | None[source]
Start building a phylogenetic tree using IQTree in the background.
This method starts IQTree as a background process and returns immediately, allowing multiple trees to be built in parallel.
- Parameters:
alignment_fasta – Path to aligned FASTA file
output_prefix – Prefix for output files
model – Substitution model (default: “MFP” for automatic ModelFinder Plus selection)
- Returns:
Dictionary with process information if successful, None otherwise:
‘process’: subprocess.Popen object
‘output_prefix’: output prefix path
‘alignment_fasta’: input alignment file path
‘tree_file’: expected tree file path
- static check_iqtree_available() Tuple[bool, str | None][source]
Check if IQTree is available in the system.
- Returns:
Tuple of (available: bool, command: str or None)
- static relabel_tree_with_taxonomy(tree_file: Path, blast_hits: List[BlastHit], output_tree: Path, taxonomic_rank: str) bool[source]
Relabel tree nodes with taxonomic names.
This reads a Newick tree file and replaces sequence IDs with taxonomic names from the BLAST hits.
- Parameters:
tree_file – Path to input tree file (Newick format)
blast_hits – List of BlastHit objects with taxonomic information
output_tree – Path to output tree file with relabeled nodes
taxonomic_rank – the taxonomic rank to use for relabeling (e.g., “genus”)
- Returns:
True if successful, False otherwise
- static wait_for_tree_jobs(jobs: List[Dict], timeout: int = 14400) Dict[str, bool][source]
Wait for multiple IQTree jobs to complete.
This method polls all running processes and waits for them to complete. Since the processes are already running in parallel (started with Popen), this method just collects their results as they finish.
- Parameters:
jobs – List of job dictionaries returned by build_tree_background()
timeout – Maximum time to wait for each individual job in seconds (default: 14400 = 4 hours). The default is generous because the large combined tree built at the end of the pipeline can take a long time.
- Returns:
Dictionary mapping tree_file path to success status (True/False)
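The polling behaviour described above can be sketched with subprocess.Popen.poll(). This is an illustration under assumptions: the function name, the poll_interval parameter, and measuring the timeout from the start of the wait are choices made for this example, and success is taken to mean a zero exit code.

```python
import subprocess
import time
from typing import Dict, List


def wait_for_jobs_sketch(
    jobs: List[dict], timeout: float = 14400, poll_interval: float = 0.1
) -> Dict[str, bool]:
    """Hypothetical collector: map each job's 'tree_file' to True/False."""
    results: Dict[str, bool] = {}
    started = time.monotonic()
    pending = list(jobs)
    while pending:
        still_running = []
        for job in pending:
            process: subprocess.Popen = job["process"]
            return_code = process.poll()  # None while the process is running
            if return_code is not None:
                results[str(job["tree_file"])] = return_code == 0
            elif time.monotonic() - started > timeout:
                process.kill()  # give up on this job
                process.wait()
                results[str(job["tree_file"])] = False
            else:
                still_running.append(job)
        pending = still_running
        if pending:
            time.sleep(poll_interval)
    return results
```

Because the processes were started earlier with Popen, the loop only observes them; it does not serialise their execution.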
- class eplace_lib.alignment.MAFFTAligner[source]
Bases: object
Class for running MAFFT sequence alignments.
- static align_sequences(input_fasta: Path, output_fasta: Path, auto_orient: bool = True, num_threads: int = 1, strategy: str = 'default') bool[source]
Align sequences using MAFFT.
- Parameters:
input_fasta – Path to input FASTA file with sequences to align
output_fasta – Path to output aligned FASTA file
auto_orient – Use MAFFT’s auto-orient feature (default: True)
num_threads – Number of threads to use
strategy – MAFFT alignment strategy (default: ‘default’). Options: ‘default’, ‘auto’, ‘retree2’, ‘fftns’.
‘auto’: Let MAFFT choose the best strategy automatically
‘retree2’: Fast progressive method, good for large datasets
‘fftns’: Fastest method for very large datasets
- Returns:
True if alignment was successful, False otherwise
- class eplace_lib.alignment.SequenceTrimmer[source]
Bases: object
Class for trimming sequences based on BLAST alignment coordinates.
- static trim_sequence_by_coordinates(sequence: str, start: int, end: int) str[source]
Trim a sequence to extract the region between start and end coordinates.
BLAST coordinates are 1-indexed, so we need to adjust for Python’s 0-indexing.
- Parameters:
sequence – The full sequence string
start – Start position (1-indexed, inclusive)
end – End position (1-indexed, inclusive)
- Returns:
Trimmed sequence string
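Converting 1-indexed inclusive coordinates to a Python slice means subtracting one from the start and leaving the end as it is. The sketch below shows that arithmetic; the function name and the ValueError for invalid coordinates are assumptions of this example.

```python
def trim_by_coordinates_sketch(sequence: str, start: int, end: int) -> str:
    """Hypothetical helper: extract positions start..end (1-indexed, inclusive)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")  # assumed validation
    # 1-indexed inclusive [start, end] -> 0-indexed half-open [start - 1, end)
    return sequence[start - 1:end]
```

For example, positions 2 through 5 of ACGTACGT are the four bases CGTA.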
- static trim_sequences_from_blast_hits(fasta_path: Path, blast_hits: List[BlastHit], output_fasta: Path, query_id: str, taxonomic_rank: str) bool[source]
Trim sequences in a FASTA file based on BLAST hit coordinates.
This reads the representative sequences, trims them to the aligned regions, and writes them to a new FASTA file along with the query sequence.
- Parameters:
fasta_path – Path to input FASTA file with full-length sequences
blast_hits – List of BlastHit objects for this query
output_fasta – Path to output FASTA file with trimmed sequences
query_id – The query sequence ID to include in output
taxonomic_rank – the taxonomic rank to use for taxonomic labels (e.g., “genus”)
- Returns:
True if successful, False otherwise
- class eplace_lib.alignment.SimpleNewickNode(name: str = '', distance: float = 0.0)[source]
Bases: object
Simple Newick tree node representation for finding nearest neighbors.
- get_leaves() List[SimpleNewickNode][source]
Get all leaf nodes under this node.
- eplace_lib.alignment.check_alignment_consistency(blast_hits: List[BlastHit], tolerance: int = 50) Dict[str, bool][source]
Check if BLAST hits align to similar locations on reference sequences.
For each reference sequence that appears in multiple hits, check if the alignment coordinates are consistent (within tolerance).
- Parameters:
blast_hits – List of BlastHit objects to check
tolerance – Maximum allowed difference in coordinates (default: 50 bp)
- Returns:
Dictionary mapping subject_id to consistency status (True if consistent)
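One plausible reading of "consistent within tolerance" is that, for each subject, the alignment starts differ by at most the tolerance and so do the ends. The sketch below implements that reading on (subject_id, subject_start, subject_end) tuples; both the function name and this definition are assumptions, and the library may compare coordinates differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def check_consistency_sketch(
    hits: Iterable[Tuple[str, int, int]], tolerance: int = 50
) -> Dict[str, bool]:
    """Hypothetical check: do hits on the same subject cover similar coordinates?"""
    by_subject: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for subject_id, start, end in hits:
        # Minus-strand hits report start > end, so order each pair first.
        by_subject[subject_id].append((min(start, end), max(start, end)))
    result: Dict[str, bool] = {}
    for subject_id, coords in by_subject.items():
        starts = [s for s, _ in coords]
        ends = [e for _, e in coords]
        result[subject_id] = (
            max(starts) - min(starts) <= tolerance
            and max(ends) - min(ends) <= tolerance
        )
    return result
```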
- eplace_lib.alignment.concatenate_all_groups_and_build_tree(output_dir: Path, query_fasta: Path, classification_file: Path, blast_hits: List[BlastHit], combined_tree_label_rank: str = 'genus', num_threads: int = 1, alignment_strategy: str = 'auto') Dict[str, Path | None][source]
Concatenate all group _trimmed.fasta files, add queries with zero BLAST hits, and build a final alignment and tree.
This function:
1. Finds all *_trimmed.fasta files in group directories
2. Reads the classification file to identify queries with zero BLAST hits
3. Concatenates all sequences into a single file
4. Uses MAFFT to build an alignment (with optimal parameters for many sequences)
5. Uses IQTree to build a phylogenetic tree
6. Relabels tree nodes with taxonomic names
- Parameters:
output_dir – Output directory containing group subdirectories
query_fasta – Original query FASTA file
classification_file – Path to classifications.tsv file
blast_hits – List of all BlastHit objects with taxonomy information
combined_tree_label_rank – Taxonomic rank for tree labeling (default: genus)
num_threads – Number of threads for alignment and tree building (default: 1)
alignment_strategy – MAFFT alignment strategy (default: ‘auto’). Options: ‘default’, ‘auto’, ‘retree2’, ‘fftns’.
- Returns:
Dictionary with paths to generated files:
‘combined_fasta’: Combined sequences from all groups + zero-hit queries
‘alignment’: Aligned sequences
‘tree’: Phylogenetic tree
‘labeled_tree’: Tree with taxonomic labels
- eplace_lib.alignment.create_grouped_fasta_with_queries(group_tid: str, group_name: str, query_hits_map: Dict[str, List[BlastHit]], labeling_rank: str, query_fasta: Path, output_fasta: Path, database: str = 'core_nt', blastdb_path: Path | None = None) bool[source]
Create a FASTA file for a taxonomic group containing all queries and unique references.
- Parameters:
group_tid – Taxonomy ID of the group
group_name – Name of the taxonomic group
query_hits_map – Dictionary mapping query_id to list of BlastHit objects
labeling_rank – Taxonomic rank to use for labeling (e.g., “genus”)
query_fasta – Path to original query FASTA file
output_fasta – Path to output grouped FASTA file
database – Name of BLAST database
blastdb_path – Path to BLAST database directory
- Returns:
True if successful, False otherwise
- eplace_lib.alignment.find_nearest_neighbor_in_tree(tree_file: Path, query_id: str) str | None[source]
Find the nearest neighbor (closest leaf) to a query sequence in a phylogenetic tree.
This function parses the Newick tree and finds the leaf node that is phylogenetically closest to the query sequence based on tree topology and branch lengths.
- Parameters:
tree_file – Path to the Newick tree file (.treefile)
query_id – Query sequence identifier to find neighbors for
- Returns:
Name of the nearest neighbor leaf node, or None if not found or error
- eplace_lib.alignment.group_hits_by_group_rank(blast_hits: List[BlastHit], group_rank: str) Dict[str, Dict[str, List[BlastHit]]][source]
Group BLAST hits by group_rank across all queries.
- Parameters:
blast_hits – List of BlastHit objects with group taxonomy information
- Returns:
Dictionary mapping group_rank_name (taxonomy name) to another dict mapping query_id to list of hits. Format: {group_rank_name: {query_id: [hits]}}
- eplace_lib.alignment.parse_simple_newick(newick_str: str) SimpleNewickNode | None[source]
Parse a simple Newick tree string into a tree structure.
This is a lightweight parser that handles basic Newick format with branch lengths. Format: ((A:0.1,B:0.2):0.3,C:0.4);
- Parameters:
newick_str – Newick format tree string
- Returns:
Root node of the parsed tree, or None if parsing fails
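To make the supported format concrete, here is a small recursive-descent parser for Newick strings with branch lengths. It is a sketch, not the library's parser: Node is a hypothetical stand-in for SimpleNewickNode, and quoted labels and comments are deliberately not handled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for SimpleNewickNode."""
    name: str = ""
    distance: float = 0.0
    children: List["Node"] = field(default_factory=list)

    def get_leaves(self) -> List["Node"]:
        if not self.children:
            return [self]
        leaves: List["Node"] = []
        for child in self.children:
            leaves.extend(child.get_leaves())
        return leaves


def parse_newick_sketch(newick_str: str) -> Optional[Node]:
    """Parse e.g. '((A:0.1,B:0.2):0.3,C:0.4);' and return the root, or None on failure."""
    text = newick_str.strip()
    if not text.endswith(";"):
        return None
    body = text[:-1]
    pos = 0

    def parse_node() -> Node:
        nonlocal pos
        node = Node()
        if pos < len(body) and body[pos] == "(":
            pos += 1  # consume "("
            while True:
                node.children.append(parse_node())
                if pos >= len(body):
                    raise ValueError("unbalanced parentheses")
                if body[pos] == ",":
                    pos += 1
                elif body[pos] == ")":
                    pos += 1
                    break
                else:
                    raise ValueError(f"unexpected character {body[pos]!r}")
        start = pos  # optional label
        while pos < len(body) and body[pos] not in ",():":
            pos += 1
        node.name = body[start:pos]
        if pos < len(body) and body[pos] == ":":  # optional branch length
            pos += 1
            start = pos
            while pos < len(body) and body[pos] not in ",()":
                pos += 1
            node.distance = float(body[start:pos])
        return node

    try:
        root = parse_node()
    except ValueError:
        return None
    return root if pos == len(body) else None
```

Calling get_leaves() on the returned root yields the leaf nodes in left-to-right order.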
- eplace_lib.alignment.process_grouped_alignment_and_tree(group_name: str, group_dir: Path, taxonomic_rank: str, blast_hits: List[BlastHit], query_ids: List[str], num_threads: int = 1) Dict[str, Path | None][source]
Complete pipeline for a taxonomic group: trim, align, and build tree.
- Parameters:
group_name – The name of the group, used for file naming
group_dir – Directory containing group-specific files
taxonomic_rank – Taxonomic rank to use for labeling the tree
blast_hits – List of BlastHit objects for all queries in the group
query_ids – List of query sequence IDs in this group
num_threads – Number of threads to use
- Returns:
Dictionary with paths to generated files:
‘combined_fasta’: Combined sequences (queries + references)
‘trimmed_fasta’: Trimmed sequences
‘alignment’: Aligned sequences
‘tree’: Phylogenetic tree
‘labeled_tree’: Tree with taxonomic labels
- eplace_lib.alignment.process_grouped_alignment_and_tree_parallel(group_name: str, group_dir: Path, taxonomic_rank: str, blast_hits: List[BlastHit], query_ids: List[str], num_threads: int = 1, background_tree: bool = False) Dict[str, Path | None][source]
Complete pipeline for a taxonomic group: trim, align, and optionally build tree in background.
This is similar to process_grouped_alignment_and_tree, but with an option to start tree building in the background and return immediately without waiting for completion.
- Parameters:
group_name – The name of the group, used for file naming
group_dir – Directory containing group-specific files
taxonomic_rank – Taxonomic rank to use for labeling the tree
blast_hits – List of BlastHit objects for all queries in the group
query_ids – List of query sequence IDs in this group
num_threads – Number of threads to use
background_tree – If True, start tree building in background and return immediately
- Returns:
Dictionary with paths to generated files:
‘combined_fasta’: Combined sequences (queries + references)
‘trimmed_fasta’: Trimmed sequences
‘alignment’: Aligned sequences
‘tree_job’: Background job info if background_tree=True, None otherwise
‘tree_file’: Expected tree file path
‘blast_hits’: BLAST hits for later tree relabeling
‘taxonomic_rank’: Taxonomic rank for later tree relabeling
- eplace_lib.alignment.process_query_alignment_and_tree(query_id: str, query_dir: Path, blast_hits: List[BlastHit], query_fasta: Path, taxonomic_rank: str, num_threads: int = 1) Dict[str, Path | None][source]
Complete pipeline for a single query: trim, align, and build tree.
- Parameters:
query_id – Query sequence identifier
query_dir – Directory containing query-specific files
blast_hits – List of BlastHit objects for this query (with taxonomy info)
query_fasta – Path to original query FASTA file
taxonomic_rank – The taxonomic rank to use for relabeling the tree
num_threads – Number of threads to use
- Returns:
Dictionary with paths to generated files:
'trimmed_fasta': Trimmed sequences
'alignment': Aligned sequences
'tree': Phylogenetic tree
'labeled_tree': Tree with taxonomic labels
- Return type:
Dict[str, Path | None]
- eplace_lib.alignment.process_query_alignment_and_tree_parallel(query_id: str, query_dir: Path, blast_hits: List[BlastHit], query_fasta: Path, taxonomic_rank: str, num_threads: int = 1, background_tree: bool = False) Dict[str, Path | None][source]
Complete pipeline for a single query: trim, align, and optionally build tree in background.
This is similar to process_query_alignment_and_tree, but with an option to start tree building in the background and return immediately without waiting for completion.
- Parameters:
query_id – Query sequence identifier
query_dir – Directory containing query-specific files
blast_hits – List of BlastHit objects for this query (with taxonomy info)
query_fasta – Path to original query FASTA file
taxonomic_rank – The taxonomic rank to use for relabeling the tree
num_threads – Number of threads to use
background_tree – If True, start tree building in background and return immediately
- Returns:
Dictionary with paths to generated files:
'trimmed_fasta': Trimmed sequences
'alignment': Aligned sequences
'tree_job': Background job info if background_tree=True, None otherwise
'tree_file': Expected tree file path
'blast_hits': BLAST hits for later tree relabeling
'taxonomic_rank': Taxonomic rank for later tree relabeling
- Return type:
Dict[str, Path | None]
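The background option can be pictured as starting the tree builder in a child process and returning a handle instead of waiting for it. The sketch below is not ePLACE's implementation: `start_background_job` and the layout of the returned dictionary are hypothetical, and a trivial Python one-liner stands in for the real tree-building command.

```python
import subprocess
import sys
from pathlib import Path
from typing import Any, Dict, List, Optional


def start_background_job(
    cmd: List[str], expected_output: Path, background: bool = False
) -> Dict[str, Any]:
    """Run cmd either blocking or in the background (hypothetical helper)."""
    job: Optional[subprocess.Popen] = None
    if background:
        # Popen returns immediately; the caller waits on the handle later.
        job = subprocess.Popen(cmd)
    else:
        # Blocking variant: wait for completion, so there is no handle to return.
        subprocess.run(cmd, check=True)
    return {"tree_job": job, "tree_file": expected_output}


# Stand-in command; a real pipeline would launch its tree builder here.
stand_in_cmd = [sys.executable, "-c", "pass"]
result = start_background_job(stand_in_cmd, Path("query1_tree.nwk"), background=True)

# Later, before relabeling the tree, wait for the background job to finish.
exit_code = result["tree_job"].wait()
```

Returning the expected output path together with the job handle lets the caller continue with other queries and come back to relabel the tree once the job has exited.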
- eplace_lib.alignment.trim_grouped_sequences(input_fasta: Path, blast_hits: List[BlastHit], output_fasta: Path, query_ids: List[str]) bool[source]
Trim sequences in a grouped FASTA file based on BLAST hit coordinates.
This is similar to trim_sequences_from_blast_hits but handles multiple queries.
- Parameters:
input_fasta – Path to input FASTA file with full-length sequences
blast_hits – List of BlastHit objects for all queries in the group
output_fasta – Path to output FASTA file with trimmed sequences
query_ids – List of query sequence IDs to include (untrimmed)
- Returns:
True if successful, False otherwise
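Trimming by BLAST hit coordinates amounts to slicing each reference sequence to the aligned region. BLAST tabular coordinates are 1-based and inclusive, and blastn reports minus-strand hits with the subject start greater than the subject end. The sketch below illustrates that slicing; `trim_to_hit` is a hypothetical helper, and reverse-complementing minus-strand hits is an assumption about the desired behavior, not something this page states about ePLACE.

```python
# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def trim_to_hit(sequence: str, subject_start: int, subject_end: int) -> str:
    """Return the part of sequence covered by a hit (1-based, inclusive)."""
    low, high = sorted((subject_start, subject_end))
    # Convert 1-based inclusive coordinates to a 0-based, end-exclusive slice.
    region = sequence[low - 1:high]
    if subject_start > subject_end:
        # Minus-strand hit: reverse-complement so it reads in query orientation.
        region = region.translate(_COMPLEMENT)[::-1]
    return region


plus_strand = trim_to_hit("AACCGGTT", 3, 6)
minus_strand = trim_to_hit("ACGTTT", 4, 2)
```

In a grouped FASTA, a helper like this would be applied only to reference sequences; the query sequences listed in query_ids are written untrimmed.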
eplace_lib.ncbi_download
NCBI database download module.
This module provides functionality for downloading and managing NCBI BLAST databases, specifically the core nucleotide (nt) database.
- class eplace_lib.ncbi_download.MMseqsDownloader(db_dir: Path | None = None)[source]
Bases:
object
Download and configure MMseqs2 NT databases and taxonomy sidecar files.
Directory resolution for MMseqs2 databases prefers $MMSEQS_DB_DIR, then $MMSEQS2DB (legacy), then ~/mmseqs2db.
Workflow:
1. Download NT with mmseqs databases.
2. Optionally fetch accession2taxid files and build mapping TSV.
3. Attach taxonomy sidecars with mmseqs createtaxdb.
- ACC2TAXID_BASE = 'https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/'
- LEGACY_MMSEQS_DB_DIR_ENV = 'MMSEQS2DB'
- MMSEQS_DB_DIR_ENV = 'MMSEQS_DB_DIR'
- NUCLEOTIDE_DB_NAME = 'NT'
- add_taxonomy_to_database(mmseqs_db: Path, ncbi_taxonomy: Path, threads: int = 1, acc2taxid_dir: Path | None = None, taxonomy_workdir: Path | None = None) Tuple[bool, str][source]
Add NCBI taxonomy sidecar files to an MMseqs2 NT database.
- Parameters:
mmseqs_db – Path to MMseqs2 NT database base file (e.g. .../NT).
ncbi_taxonomy – Directory with NCBI taxonomy dump files.
threads – Number of threads for mmseqs createtaxdb.
acc2taxid_dir – Optional directory with accession2taxid files.
taxonomy_workdir – Optional working directory for mapping files.
- Returns:
Tuple (success, message).
- Side effects:
Creates mapping files in taxonomy_workdir and writes MMseqs2 taxonomy sidecar files adjacent to mmseqs_db.
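The directory resolution order described for MMseqsDownloader ($MMSEQS_DB_DIR, then the legacy $MMSEQS2DB, then ~/mmseqs2db) can be sketched as follows. `resolve_mmseqs_db_dir` is a hypothetical helper, and giving an explicit db_dir argument priority over the environment is an assumption based on the constructor signature.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_mmseqs_db_dir(
    db_dir: Optional[Path] = None, env: Optional[Mapping[str, str]] = None
) -> Path:
    """Pick the MMseqs2 database directory (hypothetical helper)."""
    env = os.environ if env is None else env
    if db_dir is not None:
        # Assumption: an explicit argument wins over environment variables.
        return Path(db_dir)
    # Preferred variable first, then the legacy name.
    for name in ("MMSEQS_DB_DIR", "MMSEQS2DB"):
        value = env.get(name)
        if value:
            return Path(value)
    # Final fallback in the user's home directory.
    return Path.home() / "mmseqs2db"


chosen = resolve_mmseqs_db_dir(env={"MMSEQS2DB": "/data/legacy", "MMSEQS_DB_DIR": "/data/new"})
```

Passing the environment as a mapping keeps the function easy to test without modifying os.environ.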
- class eplace_lib.ncbi_download.NCBIDownloader[source]
Bases:
object
A class for managing NCBI BLAST database downloads.
This class handles checking for existing databases, downloading from NCBI FTP, verifying checksums, and extracting database files.
- CORE_NT_PREFIX = 'core_nt'
- NCBI_FTP_BASE = 'https://ftp.ncbi.nlm.nih.gov/blast/db/'
- check_database_exists(db_dir: Path | None = None) bool[source]
Check if NCBI core_nt database files exist in the specified directory.
- Parameters:
db_dir – Directory to check. If None, uses the default BLASTDB directory.
- Returns:
True if at least one core_nt database file exists, False otherwise
- download_and_setup_database(force_download: bool = False, verbose: bool = True) Tuple[bool, str][source]
Main function to download and set up the NCBI core_nt database.
This function:
1. Determines the BLASTDB directory
2. Checks if database already exists (unless force_download is True)
3. Downloads all core_nt.* files from NCBI FTP
4. Verifies MD5 checksums
5. Extracts the database files
- Parameters:
force_download – If True, downloads even if database exists
verbose – If True, logs progress information (default: True)
- Returns:
Tuple of (success: bool, message: str)
- Return type:
Tuple[bool, str]
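The checksum step in the workflow can be illustrated with the standard library. This is a minimal sketch, assuming the expected hex digest has already been read from a checksum file; `md5_matches` is a hypothetical helper and not the method ePLACE uses internally.

```python
import hashlib
from pathlib import Path


def md5_matches(path: Path, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Compare a file's MD5 digest with an expected hex string."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read in chunks so multi-gigabyte archives never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()
```

A caller would typically delete the archive and retry the download when this returns False, rather than extracting a corrupted file.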
- download_file(filename: str, dest_dir: Path, show_progress: bool = True) Path[source]
Download a file from NCBI FTP server.
- Parameters:
filename – Name of the file to download
dest_dir – Destination directory
show_progress – Whether to show download progress (not implemented yet)
- Returns:
Path to the downloaded file
- Raises:
URLError – If download fails
ValueError – If filename contains path traversal sequences
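The ValueError for path traversal suggests that the filename is validated before it is joined to the destination directory. One conservative way to do that is to accept bare file names only, as sketched below; `validate_download_filename` is a hypothetical helper and the exact rules ePLACE applies may differ.

```python
def validate_download_filename(filename: str) -> str:
    """Reject names that could escape the destination directory."""
    if not filename or filename in (".", ".."):
        raise ValueError(f"Invalid filename: {filename!r}")
    if "/" in filename or "\\" in filename:
        # Any path separator means the name is not a bare file name.
        raise ValueError(f"Filename must not contain path separators: {filename!r}")
    return filename


safe_name = validate_download_filename("core_nt.00.tar.gz")
```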
- extract_tarball(tarball_path: Path, dest_dir: Path) None[source]
Extract a tar.gz file to the specified directory.
- Parameters:
tarball_path – Path to the tar.gz file
dest_dir – Destination directory for extraction
- Raises:
tarfile.TarError – If extraction fails
ValueError – If tarball contains unsafe paths
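The unsafe-path check can be sketched by resolving every archive member against the destination directory before extracting anything. This is an illustration only: `safe_extract` is a hypothetical helper, and rejecting links outright is a stricter policy than the page describes.

```python
import tarfile
from pathlib import Path


def safe_extract(tarball_path: Path, dest_dir: Path) -> None:
    """Extract a tar.gz archive, refusing members that leave dest_dir."""
    dest_root = dest_dir.resolve()
    with tarfile.open(tarball_path, "r:gz") as archive:
        for member in archive.getmembers():
            target = (dest_root / member.name).resolve()
            # Absolute names and ".." components resolve outside dest_root.
            if target != dest_root and dest_root not in target.parents:
                raise ValueError(f"Unsafe path in tarball: {member.name}")
            if member.issym() or member.islnk():
                # Assumption: links are rejected rather than followed.
                raise ValueError(f"Link not allowed in tarball: {member.name}")
        # Only reached when every member passed the checks above.
        archive.extractall(dest_root)
```

Validating all members before calling extractall means a bad archive leaves the destination directory untouched.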
- get_available_files() List[str][source]
Get list of available core_nt files from NCBI FTP server.
- Returns:
List of filenames matching core_nt pattern
- Raises:
URLError – If unable to connect to FTP server
- eplace_lib.ncbi_download.check_available_memory_gb(required_gb: float) Tuple[bool, float][source]
Check whether total system memory meets a required threshold in GiB.
- eplace_lib.ncbi_download.get_total_memory_gb() float[source]
Get total system memory in GiB.
On Linux this first reads /proc/meminfo (MemTotal). If that is not available, it falls back to POSIX os.sysconf. Returns 0.0 when both strategies fail.
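The two-step strategy can be sketched with the standard library alone. The function names below are hypothetical stand-ins, and returning the total alongside the boolean in the threshold check is an assumption about what the second tuple element of check_available_memory_gb holds.

```python
import os
from pathlib import Path
from typing import Tuple

_GIB = 1024 ** 3


def total_memory_gib(meminfo_path: Path = Path("/proc/meminfo")) -> float:
    """Total system memory in GiB, or 0.0 if it cannot be determined."""
    try:
        for line in meminfo_path.read_text().splitlines():
            if line.startswith("MemTotal:"):
                # The kernel reports this value in kB (units of 1024 bytes).
                return int(line.split()[1]) * 1024 / _GIB
    except (OSError, ValueError, IndexError):
        pass  # file missing or malformed: try the POSIX fallback
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
        if pages > 0 and page_size > 0:
            return pages * page_size / _GIB
    except (AttributeError, OSError, ValueError):
        pass  # os.sysconf is missing or the names are unsupported
    return 0.0


def memory_meets_requirement(required_gib: float) -> Tuple[bool, float]:
    """Return (meets threshold, total GiB); the tuple layout is assumed."""
    total = total_memory_gib()
    return total >= required_gib, total
```

Because 0.0 signals "unknown", a caller may prefer to warn instead of abort when the total comes back as zero.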
- eplace_lib.ncbi_download.setup_mmseqs_database(force_download: bool = False, threads: int = 1, db_dir: Path | None = None) Tuple[bool, str, Path | None][source]
Convenience function to download MMseqs2 NT database.
- eplace_lib.ncbi_download.setup_mmseqs_taxonomy(mmseqs_db: Path, ncbi_taxonomy: Path, threads: int = 1, acc2taxid_dir: Path | None = None, taxonomy_workdir: Path | None = None, db_dir: Path | None = None) Tuple[bool, str][source]
Convenience function to add taxonomy to an MMseqs2 database.
- eplace_lib.ncbi_download.setup_ncbi_database(force_download: bool = False, verbose: bool = True) Tuple[bool, str][source]
Convenience function to set up the NCBI core_nt database.
- Parameters:
force_download – If True, downloads even if database exists
verbose – If True, logs progress information (default: True)
- Returns:
Tuple of (success: bool, message: str)
- Return type:
Tuple[bool, str]
eplace_lib.cli
ePLACE: environmental Phylogenetic Localisation and Clade Estimation
Main command-line interface for ePLACE toolkit. Provides unified access to database download, BLAST analysis, and grouped workflows.
Quick Examples
BLAST Analysis
from pathlib import Path
from eplace_lib import run_blast_search, process_blast_results_for_taxonomy
# Run BLAST search with filtering
success, filtered_hits = run_blast_search(
    query_fasta=Path("query.fasta"),
    output_file=Path("blast_results.txt"),
    min_identity=90.0,
    min_coverage=80.0
)
# Extract representative sequences
results = process_blast_results_for_taxonomy(
    blast_hits=filtered_hits,
    output_dir=Path("output"),
    rank="genus"
)
Database Download
from eplace_lib import setup_ncbi_database
# Download the core_nt database
success, message = setup_ncbi_database()
print(f"Success: {success}, Message: {message}")
FASTA Reading
from pathlib import Path
from eplace_lib.blast_analysis import FastaReader
# Read sequences
sequences = FastaReader.read_fasta(Path("input.fasta"))
# Get sequence lengths
lengths = FastaReader.get_sequence_lengths(Path("input.fasta"))
Sequence Alignment
from pathlib import Path
from eplace_lib.alignment import align_sequences, build_phylogenetic_tree
# Align sequences
success = align_sequences(
    input_fasta=Path("sequences.fasta"),
    output_fasta=Path("aligned.fasta"),
    num_threads=4
)
# Build tree
success = build_phylogenetic_tree(
    alignment_fasta=Path("aligned.fasta"),
    output_prefix=Path("tree"),
    num_threads=4
)
Data Structures
BlastHit
Represents a single BLAST hit with the following attributes:
query_id: Query sequence identifier
subject_id: Subject (database) sequence identifier
percent_identity: Percentage of identical matches
alignment_length: Length of alignment
query_length: Length of query sequence
subject_length: Length of subject sequence
query_start: Start position in query
query_end: End position in query
subject_start: Start position in subject
subject_end: End position in subject
evalue: Expectation value
bit_score: Bit score
query_coverage: Percentage of query covered by alignment
subject_taxid: Taxonomy ID of the subject sequence (str)
subject_taxids: Taxonomy IDs of the subject sequence (str)
subject_taxonomy: Optional dictionary mapping each taxonomic rank (phylum, class, order, family, genus, species) to a tuple of (taxid, name)
Example usage:
from eplace_lib.blast_analysis import BlastHit
# Create a BlastHit
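hit = BlastHit(
    query_id="query1",
    subject_id="NC_001234.5",
    percent_identity=95.5,
    alignment_length=500,
    query_length=550,
    subject_length=5000,
    query_start=1,
    query_end=500,
    subject_start=100,
    subject_end=599,
    evalue=1e-100,
    bit_score=900,
    query_coverage=90.9,
    # subject_taxid and subject_taxids have no defaults in the constructor
    subject_taxid="562",
    subject_taxids="562",
    # Each rank maps to a (taxid, name) tuple
    subject_taxonomy={
        "genus": ("561", "Escherichia"),
        "species": ("562", "Escherichia coli"),
    },
)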
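The query_coverage value in this example (90.9) is consistent with dividing the aligned span on the query by the query length. The page does not state the formula ePLACE uses, so the function below is an assumption that treats each hit independently and ignores overlapping alignments.

```python
def query_coverage_percent(query_start: int, query_end: int, query_length: int) -> float:
    """Percentage of the query spanned by one alignment (1-based, inclusive)."""
    aligned_span = abs(query_end - query_start) + 1
    return 100.0 * aligned_span / query_length


# Values from the BlastHit example: positions 1..500 of a 550 bp query.
coverage = query_coverage_percent(1, 500, 550)
print(round(coverage, 1))  # 90.9
```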
Common Workflows
Complete BLAST to Tree Workflow
from pathlib import Path
from eplace_lib import (
    run_blast_search,
    process_blast_results_for_taxonomy,
)
from eplace_lib.sequences import trim_sequences_to_blast_coordinates
from eplace_lib.alignment import align_sequences, build_phylogenetic_tree
# Step 1: BLAST search
success, filtered_hits = run_blast_search(
    query_fasta=Path("query.fasta"),
    output_file=Path("blast_results.txt"),
    min_identity=90.0,
    min_coverage=80.0,
    num_threads=4
)
# Step 2: Extract representatives
results = process_blast_results_for_taxonomy(
    blast_hits=filtered_hits,
    output_dir=Path("output"),
    rank="genus"
)
# Step 3: Process each query
for query_id, fasta_path in results.items():
    # Trim sequences
    trimmed_path = fasta_path.parent / f"{query_id}_trimmed.fasta"
    trim_sequences_to_blast_coordinates(
        input_fasta=fasta_path,
        output_fasta=trimmed_path,
        blast_hits=filtered_hits
    )
    # Align sequences
    aligned_path = fasta_path.parent / f"{query_id}_aligned.fasta"
    align_sequences(
        input_fasta=trimmed_path,
        output_fasta=aligned_path,
        num_threads=4
    )
    # Build tree
    tree_prefix = fasta_path.parent / f"{query_id}_tree"
    build_phylogenetic_tree(
        alignment_fasta=aligned_path,
        output_prefix=tree_prefix,
        num_threads=4
    )
Custom BLAST Parameters
from pathlib import Path
from eplace_lib.blast_analysis import BlastRunner
runner = BlastRunner()
# Run BLAST with custom parameters
success = runner.run_blastn(
    query_fasta=Path("query.fasta"),
    output_file=Path("blast_results.txt"),
    database="core_nt",
    num_threads=8,
    max_target_seqs=500,
    evalue=1e-10,
    word_size=11
)
# Parse and filter results
hits = runner.parse_blast_results(Path("blast_results.txt"))
filtered_hits = runner.filter_blast_hits(
    hits,
    min_identity=95.0,
    min_coverage=90.0
)
Working with Taxonomic Data
from eplace_lib.taxonomy import TaxonomyExtractor
extractor = TaxonomyExtractor()
# Group hits by query
grouped_hits = extractor.group_hits_by_query(blast_hits)
# Select representatives at different ranks
for query_id, query_hits in grouped_hits.items():
    # At genus level
    genus_reps = extractor.select_representatives_by_rank(
        hits=query_hits,
        rank="genus",
        max_per_rank=1
    )
    # At species level
    species_reps = extractor.select_representatives_by_rank(
        hits=query_hits,
        rank="species",
        max_per_rank=2
    )
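To make the max_per_rank idea concrete, the sketch below groups hits by their taxon at a given rank and keeps the best few per taxon. It does not reproduce TaxonomyExtractor: the `Hit` dataclass is a minimal stand-in for BlastHit, `pick_representatives` is hypothetical, and ranking by bit score is an assumed selection criterion.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Hit:
    """Minimal stand-in for BlastHit (hypothetical, illustration only)."""
    subject_id: str
    bit_score: float
    # rank -> (taxid, name), mirroring the documented subject_taxonomy layout
    subject_taxonomy: Dict[str, Tuple[str, str]] = field(default_factory=dict)


def pick_representatives(hits: List[Hit], rank: str, max_per_rank: int = 1) -> List[Hit]:
    """Keep up to max_per_rank hits per taxon at the given rank."""
    by_taxon: Dict[str, List[Hit]] = defaultdict(list)
    for hit in hits:
        entry = hit.subject_taxonomy.get(rank)
        if entry is not None:  # skip hits with no assignment at this rank
            taxid, _name = entry
            by_taxon[taxid].append(hit)
    selected: List[Hit] = []
    for taxon_hits in by_taxon.values():
        # Assumed criterion: the highest bit score wins within each taxon.
        ranked = sorted(taxon_hits, key=lambda h: h.bit_score, reverse=True)
        selected.extend(ranked[:max_per_rank])
    return selected


example_hits = [
    Hit("A", 900.0, {"genus": ("561", "Escherichia")}),
    Hit("B", 800.0, {"genus": ("561", "Escherichia")}),
    Hit("C", 700.0, {"genus": ("590", "Salmonella")}),
    Hit("D", 950.0, {}),  # no genus assignment, so it is skipped
]
genus_ids = [h.subject_id for h in pick_representatives(example_hits, "genus", 1)]
```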
Error Handling
Most functions return success indicators and provide error messages:
from pathlib import Path
from eplace_lib import run_blast_search
success, result = run_blast_search(
    query_fasta=Path("query.fasta"),
    output_file=Path("output.txt"),
    min_identity=90.0,
    min_coverage=80.0
)
if not success:
    print(f"BLAST failed: {result}")
else:
    print(f"Found {len(result)} hits")
For functions that return a single success flag instead of a tuple, check the return value directly:
from pathlib import Path
from eplace_lib.alignment import align_sequences
success = align_sequences(
    input_fasta=Path("sequences.fasta"),
    output_fasta=Path("aligned.fasta")
)
if not success:
    print("Alignment failed")
Type Hints
ePLACE uses type hints throughout the codebase for better IDE support:
from pathlib import Path
from typing import List, Dict, Tuple
from eplace_lib.blast_analysis import BlastHit
def process_hits(
    hits: List[BlastHit],
    min_identity: float = 90.0
) -> Tuple[bool, List[BlastHit]]:
    """Process BLAST hits with type hints."""
    filtered = [h for h in hits if h.percent_identity >= min_identity]
    return True, filtered
Logging
ePLACE uses Python’s logging module. Configure logging in your scripts:
import logging
# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
# Now run ePLACE functions
from eplace_lib import run_blast_search
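Verbosity can also be raised for ePLACE alone while other libraries stay quiet. The sketch below assumes that ePLACE modules create their loggers with logging.getLogger(__name__), so that they sit under the "eplace_lib" namespace; this page does not confirm that naming.

```python
import logging

# Root logger stays at WARNING so third-party libraries remain quiet.
logging.basicConfig(
    level=logging.WARNING,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)

# Assumed namespace: child loggers such as "eplace_lib.blast_analysis"
# inherit this level unless they set their own.
logging.getLogger("eplace_lib").setLevel(logging.DEBUG)
```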
Advanced Usage
Custom Database Management
from eplace_lib.ncbi_download import NCBIDownloader
downloader = NCBIDownloader()
# Get database directory
db_dir = downloader.get_blastdb_directory()
# Check if database exists
exists = downloader.check_database_exists()
# Get available files
files = downloader.get_available_files()
# Download specific file
downloader.download_file('core_nt.00.tar.gz', db_dir)
Sequence Extraction
from pathlib import Path
from eplace_lib.taxonomy import SequenceExtractor
extractor = SequenceExtractor()
# Extract specific sequences
success = extractor.extract_sequences(
    sequence_ids=["NC_001234.5", "NC_005678.9"],
    output_fasta=Path("extracted.fasta"),
    database="core_nt"
)
See Also
Quick Start Guide - Quick start guide with examples
Workflows - Workflow documentation
BLAST Sequence Comparison Module - Detailed BLAST workflow guide