BLAST Sequence Comparison Module

This module provides functionality for running BLAST searches, filtering results, and extracting representative sequences by taxonomic rank.

Features

Read FASTA files with validation
Run blastn searches against NCBI databases
Parse and filter BLAST results by identity and coverage
Select representative sequences per taxonomic rank
Extract sequences from BLAST databases
Save results to separate FASTA files per query

Quick Start

Complete Workflow

from pathlib import Path
from eplace_lib.blast_analysis import run_blast_search
from eplace_lib.taxonomy import process_blast_results_for_taxonomy

# Step 1: Run BLAST search with filtering
success, filtered_hits = run_blast_search(
    query_fasta=Path("query.fasta"),
    output_file=Path("blast_results.txt"),
    min_identity=90.0,  # 90% identity threshold
    min_coverage=80.0,   # 80% query coverage threshold
    database="core_nt",
    num_threads=4
)

# Step 2: Extract representative sequences by taxonomic rank
results = process_blast_results_for_taxonomy(
    blast_hits=filtered_hits,
    output_dir=Path("output"),
    rank="species"  # Can be: phylum, class, order, family, genus, species
)

# Results contains mapping of query IDs to output FASTA files
for query_id, output_fasta in results.items():
    print(f"{query_id}: {output_fasta}")

Read FASTA Files

from eplace_lib.blast_analysis import FastaReader

# Read sequences from FASTA file
sequences = FastaReader.read_fasta(Path("input.fasta"))

# Get sequence lengths
lengths = FastaReader.get_sequence_lengths(Path("input.fasta"))

for seq_id, length in lengths.items():
    print(f"{seq_id}: {length} bp")

Run BLAST Search Only

from eplace_lib.blast_analysis import BlastRunner

runner = BlastRunner()

# Check if blastn is available
if runner.check_blastn_available():
    # Run BLAST
    success = runner.run_blastn(
        query_fasta=Path("query.fasta"),
        output_file=Path("blast_results.txt"),
        database="core_nt",
        num_threads=4,
        max_target_seqs=100,
        evalue=1e-5
    )

Parse and Filter BLAST Results

from eplace_lib.blast_analysis import BlastRunner

runner = BlastRunner()

# Parse BLAST tabular output
hits = runner.parse_blast_results(Path("blast_results.txt"))

# Filter hits by identity and coverage
filtered_hits = runner.filter_blast_hits(
    hits,
    min_identity=90.0,
    min_coverage=80.0
)

print(f"Filtered {len(hits)} hits to {len(filtered_hits)} hits")

Select Representatives by Taxonomic Rank

from eplace_lib.taxonomy import TaxonomyExtractor

# Initialize extractor with desired rank
extractor = TaxonomyExtractor()

# Group hits by query
grouped_hits = extractor.group_hits_by_query(blast_hits)

# Select representatives for each query
for query_id, query_hits in grouped_hits.items():
    representatives = extractor.select_representatives_by_rank(
        hits=query_hits,
        rank="genus",
        max_per_rank=1   # Number per rank
    )
    print(f"{query_id}: {len(representatives)} representatives")

Extract Sequences from Database

from eplace_lib.taxonomy import SequenceExtractor

extractor = SequenceExtractor()

# Check if blastdbcmd is available
if extractor.check_blastdbcmd_available():
    # Extract sequences
    success = extractor.extract_sequences(
        sequence_ids=["NC_001234.5", "NC_005678.9"],
        output_fasta=Path("extracted.fasta"),
        database="core_nt"
    )

Command-Line Usage

The package includes an example script for command-line usage:

# Basic usage
python examples/blast_workflow_example.py query.fasta output_dir

# With custom parameters
python examples/blast_workflow_example.py query.fasta output_dir \
    --rank genus \
    --min-identity 95 \
    --min-coverage 85 \
    --num-threads 4

# Show help
python examples/blast_workflow_example.py --help

Command-Line Options

query_fasta: Path to query FASTA file (required)
output_dir: Directory for output files (required)
--rank: Taxonomic rank (phylum, class, order, family, genus, species) [default: species]
--min-identity: Minimum percent identity [default: 90.0]
--min-coverage: Minimum query coverage percentage [default: 80.0]
--database: BLAST database name [default: core_nt]
--blastdb-path: Path to BLAST database directory [default: $BLASTDB or ~/blastdb]
--num-threads: Number of threads for BLAST [default: 1]
--dry-run: Display what would be done without running BLAST

Output Structure

The workflow creates the following output structure:

output_dir/
├── blast_results.txt              # Raw BLAST results (tabular format)
├── query1_id/
│   └── query1_id_representatives.fasta
├── query2_id/
│   └── query2_id_representatives.fasta
└── ...

Each query sequence gets its own directory with a FASTA file containing representative sequences.

Data Classes

BlastHit

Represents a single BLAST hit with the following attributes:

query_id: Query sequence identifier
subject_id: Subject (database) sequence identifier
percent_identity: Percentage of identical matches
alignment_length: Length of alignment
query_length: Length of query sequence
subject_length: Length of subject sequence
query_start: Start position in query
query_end: End position in query
subject_start: Start position in subject
subject_end: End position in subject
evalue: Expectation value
bit_score: Bit score
query_coverage: Percentage of query covered by alignment

Taxonomic information on `BlastHit`

The BlastHit dataclass includes the subject_taxonomy field, which is a dictionary containing the following keys: phylum, class, order, family, genus, species

These fields are used when filtering BLAST results by taxonomy and selecting representative sequences per rank.

Requirements

System Requirements

BLAST+ tools must be installed:
- blastn: For running BLAST searches
- blastdbcmd: For extracting sequences from databases
NCBI BLAST database (e.g., core_nt) must be downloaded

Python Requirements

Uses Python standard library modules only:

os, subprocess, logging, pathlib
typing, collections, dataclasses

No external dependencies required.

Testing

Run the test suite:

# Test BLAST analysis module
pytest tests/test_blast_analysis.py -v

# Test taxonomy module
pytest tests/test_taxonomy.py -v

# Test complete workflow
pytest tests/test_workflow.py -v

# Run all tests
pytest tests/ -v

Examples

See examples/blast_workflow_example.py for a comprehensive example demonstrating the complete workflow.

Performance Considerations

BLAST searches can be slow for large query files or databases
Use --num-threads to parallelize BLAST searches
Consider splitting large query files into smaller batches
Ensure sufficient disk space for BLAST output and extracted sequences

Limitations

Taxonomic information extraction is simplified in this version
Full taxonomy resolution would require querying NCBI taxonomy database
Representative selection uses sequence ID patterns as a proxy for taxonomic grouping
For production use, consider integrating with NCBI Entrez or taxonomy databases

Security

The module includes security features:

Input validation for FASTA files
Safe subprocess execution with timeouts
Path validation to prevent directory traversal
Sanitized filenames for output files

Troubleshooting

“blastn is not available”

Install BLAST+ tools:

# Ubuntu/Debian
sudo apt-get install ncbi-blast+

# macOS with Homebrew
brew install blast

# Or download from NCBI
# https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/

“blastdbcmd is not available”

blastdbcmd is included with BLAST+ tools. Install BLAST+ as shown above.

“Database already exists”

If you want to use a different database location, set the BLASTDB environment variable:

export BLASTDB=/path/to/your/blastdb

“No hits found”

Check your filtering parameters:

Try lowering --min-identity threshold
Try lowering --min-coverage threshold
Ensure your query sequences are appropriate for the database

Future Enhancements

Potential improvements for future versions:

Integration with NCBI Entrez API for full taxonomy information
Support for additional BLAST programs (blastp, blastx, etc.)
Multiple sequence alignment of representative sequences
Phylogenetic tree construction from representatives
Web interface for the workflow
Support for custom databases beyond core_nt