BLAST Sequence Comparison Module
This module provides functionality for running BLAST searches, filtering results, and extracting representative sequences by taxonomic rank.
Features
Read FASTA files with validation
Run blastn searches against NCBI databases
Parse and filter BLAST results by identity and coverage
Select representative sequences per taxonomic rank
Extract sequences from BLAST databases
Save results to separate FASTA files per query
Quick Start
Complete Workflow
from pathlib import Path
from eplace_lib.blast_analysis import run_blast_search
from eplace_lib.taxonomy import process_blast_results_for_taxonomy
# Step 1: Run BLAST search with filtering
success, filtered_hits = run_blast_search(
query_fasta=Path("query.fasta"),
output_file=Path("blast_results.txt"),
min_identity=90.0, # 90% identity threshold
min_coverage=80.0, # 80% query coverage threshold
database="core_nt",
num_threads=4
)
# Step 2: Extract representative sequences by taxonomic rank
results = process_blast_results_for_taxonomy(
blast_hits=filtered_hits,
output_dir=Path("output"),
rank="species" # Can be: phylum, class, order, family, genus, species
)
# Results contains mapping of query IDs to output FASTA files
for query_id, output_fasta in results.items():
print(f"{query_id}: {output_fasta}")
Read FASTA Files
from eplace_lib.blast_analysis import FastaReader
# Read sequences from FASTA file
sequences = FastaReader.read_fasta(Path("input.fasta"))
# Get sequence lengths
lengths = FastaReader.get_sequence_lengths(Path("input.fasta"))
for seq_id, length in lengths.items():
print(f"{seq_id}: {length} bp")
Run BLAST Search Only
from eplace_lib.blast_analysis import BlastRunner
runner = BlastRunner()
# Check if blastn is available
if runner.check_blastn_available():
# Run BLAST
success = runner.run_blastn(
query_fasta=Path("query.fasta"),
output_file=Path("blast_results.txt"),
database="core_nt",
num_threads=4,
max_target_seqs=100,
evalue=1e-5
)
Parse and Filter BLAST Results
from eplace_lib.blast_analysis import BlastRunner
runner = BlastRunner()
# Parse BLAST tabular output
hits = runner.parse_blast_results(Path("blast_results.txt"))
# Filter hits by identity and coverage
filtered_hits = runner.filter_blast_hits(
hits,
min_identity=90.0,
min_coverage=80.0
)
print(f"Filtered {len(hits)} hits to {len(filtered_hits)} hits")
Select Representatives by Taxonomic Rank
from eplace_lib.taxonomy import TaxonomyExtractor
# Initialize extractor with desired rank
extractor = TaxonomyExtractor()
# Group hits by query
grouped_hits = extractor.group_hits_by_query(blast_hits)
# Select representatives for each query
for query_id, query_hits in grouped_hits.items():
representatives = extractor.select_representatives_by_rank(
hits=query_hits,
rank="genus",
max_per_rank=1 # Number per rank
)
print(f"{query_id}: {len(representatives)} representatives")
Extract Sequences from Database
from eplace_lib.taxonomy import SequenceExtractor
extractor = SequenceExtractor()
# Check if blastdbcmd is available
if extractor.check_blastdbcmd_available():
# Extract sequences
success = extractor.extract_sequences(
sequence_ids=["NC_001234.5", "NC_005678.9"],
output_fasta=Path("extracted.fasta"),
database="core_nt"
)
Command-Line Usage
The package includes an example script for command-line usage:
# Basic usage
python examples/blast_workflow_example.py query.fasta output_dir
# With custom parameters
python examples/blast_workflow_example.py query.fasta output_dir \
--rank genus \
--min-identity 95 \
--min-coverage 85 \
--num-threads 4
# Show help
python examples/blast_workflow_example.py --help
Command-Line Options
query_fasta: Path to query FASTA file (required)output_dir: Directory for output files (required)--rank: Taxonomic rank (phylum, class, order, family, genus, species) [default: species]--min-identity: Minimum percent identity [default: 90.0]--min-coverage: Minimum query coverage percentage [default: 80.0]--database: BLAST database name [default: core_nt]--blastdb-path: Path to BLAST database directory [default: $BLASTDB or ~/blastdb]--num-threads: Number of threads for BLAST [default: 1]--dry-run: Display what would be done without running BLAST
Output Structure
The workflow creates the following output structure:
output_dir/
├── blast_results.txt # Raw BLAST results (tabular format)
├── query1_id/
│ └── query1_id_representatives.fasta
├── query2_id/
│ └── query2_id_representatives.fasta
└── ...
Each query sequence gets its own directory with a FASTA file containing representative sequences.
Data Classes
BlastHit
Represents a single BLAST hit with the following attributes:
query_id: Query sequence identifiersubject_id: Subject (database) sequence identifierpercent_identity: Percentage of identical matchesalignment_length: Length of alignmentquery_length: Length of query sequencesubject_length: Length of subject sequencequery_start: Start position in queryquery_end: End position in querysubject_start: Start position in subjectsubject_end: End position in subjectevalue: Expectation valuebit_score: Bit scorequery_coverage: Percentage of query covered by alignment
Taxonomic information on BlastHit
The BlastHit dataclass includes the subject_taxonomy field, which is a dictionary containing the following keys:
phylum, class, order, family, genus, species
These fields are used when filtering BLAST results by taxonomy and selecting representative sequences per rank.
Requirements
System Requirements
BLAST+ tools must be installed:
blastn: For running BLAST searchesblastdbcmd: For extracting sequences from databases
NCBI BLAST database (e.g., core_nt) must be downloaded
Python Requirements
Uses Python standard library modules only:
os,subprocess,logging,pathlibtyping,collections,dataclasses
No external dependencies required.
Testing
Run the test suite:
# Test BLAST analysis module
pytest tests/test_blast_analysis.py -v
# Test taxonomy module
pytest tests/test_taxonomy.py -v
# Test complete workflow
pytest tests/test_workflow.py -v
# Run all tests
pytest tests/ -v
Examples
See examples/blast_workflow_example.py for a comprehensive example demonstrating the complete workflow.
Performance Considerations
BLAST searches can be slow for large query files or databases
Use
--num-threadsto parallelize BLAST searchesConsider splitting large query files into smaller batches
Ensure sufficient disk space for BLAST output and extracted sequences
Limitations
Taxonomic information extraction is simplified in this version
Full taxonomy resolution would require querying NCBI taxonomy database
Representative selection uses sequence ID patterns as a proxy for taxonomic grouping
For production use, consider integrating with NCBI Entrez or taxonomy databases
Security
The module includes security features:
Input validation for FASTA files
Safe subprocess execution with timeouts
Path validation to prevent directory traversal
Sanitized filenames for output files
Troubleshooting
“blastn is not available”
Install BLAST+ tools:
# Ubuntu/Debian
sudo apt-get install ncbi-blast+
# macOS with Homebrew
brew install blast
# Or download from NCBI
# https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
“blastdbcmd is not available”
blastdbcmd is included with BLAST+ tools. Install BLAST+ as shown above.
“Database already exists”
If you want to use a different database location, set the BLASTDB environment variable:
export BLASTDB=/path/to/your/blastdb
“No hits found”
Check your filtering parameters:
Try lowering
--min-identitythresholdTry lowering
--min-coveragethresholdEnsure your query sequences are appropriate for the database
Future Enhancements
Potential improvements for future versions:
Integration with NCBI Entrez API for full taxonomy information
Support for additional BLAST programs (blastp, blastx, etc.)
Multiple sequence alignment of representative sequences
Phylogenetic tree construction from representatives
Web interface for the workflow
Support for custom databases beyond core_nt