BLAST Sequence Comparison Module

This module provides functionality for running BLAST searches, filtering results, and extracting representative sequences by taxonomic rank.

Features

  • Read FASTA files with validation

  • Run blastn searches against NCBI databases

  • Parse and filter BLAST results by identity and coverage

  • Select representative sequences per taxonomic rank

  • Extract sequences from BLAST databases

  • Save results to separate FASTA files per query

Quick Start

Complete Workflow

from pathlib import Path
from eplace_lib.blast_analysis import run_blast_search
from eplace_lib.taxonomy import process_blast_results_for_taxonomy

# Step 1: Run BLAST search with filtering
success, filtered_hits = run_blast_search(
    query_fasta=Path("query.fasta"),
    output_file=Path("blast_results.txt"),
    min_identity=90.0,  # 90% identity threshold
    min_coverage=80.0,   # 80% query coverage threshold
    database="core_nt",
    num_threads=4
)

# Step 2: Extract representative sequences by taxonomic rank
results = process_blast_results_for_taxonomy(
    blast_hits=filtered_hits,
    output_dir=Path("output"),
    rank="species"  # Can be: phylum, class, order, family, genus, species
)

# Results contains mapping of query IDs to output FASTA files
for query_id, output_fasta in results.items():
    print(f"{query_id}: {output_fasta}")

Read FASTA Files

from eplace_lib.blast_analysis import FastaReader

# Read sequences from FASTA file
sequences = FastaReader.read_fasta(Path("input.fasta"))

# Get sequence lengths
lengths = FastaReader.get_sequence_lengths(Path("input.fasta"))

for seq_id, length in lengths.items():
    print(f"{seq_id}: {length} bp")

Run BLAST Search Only

from eplace_lib.blast_analysis import BlastRunner

runner = BlastRunner()

# Check if blastn is available
if runner.check_blastn_available():
    # Run BLAST
    success = runner.run_blastn(
        query_fasta=Path("query.fasta"),
        output_file=Path("blast_results.txt"),
        database="core_nt",
        num_threads=4,
        max_target_seqs=100,
        evalue=1e-5
    )

Parse and Filter BLAST Results

from eplace_lib.blast_analysis import BlastRunner

runner = BlastRunner()

# Parse BLAST tabular output
hits = runner.parse_blast_results(Path("blast_results.txt"))

# Filter hits by identity and coverage
filtered_hits = runner.filter_blast_hits(
    hits,
    min_identity=90.0,
    min_coverage=80.0
)

print(f"Filtered {len(hits)} hits to {len(filtered_hits)} hits")

Select Representatives by Taxonomic Rank

from eplace_lib.taxonomy import TaxonomyExtractor

# Initialize extractor with desired rank
extractor = TaxonomyExtractor()

# Group hits by query
grouped_hits = extractor.group_hits_by_query(blast_hits)

# Select representatives for each query
for query_id, query_hits in grouped_hits.items():
    representatives = extractor.select_representatives_by_rank(
        hits=query_hits,
        rank="genus",
        max_per_rank=1   # Number per rank
    )
    print(f"{query_id}: {len(representatives)} representatives")

Extract Sequences from Database

from eplace_lib.taxonomy import SequenceExtractor

extractor = SequenceExtractor()

# Check if blastdbcmd is available
if extractor.check_blastdbcmd_available():
    # Extract sequences
    success = extractor.extract_sequences(
        sequence_ids=["NC_001234.5", "NC_005678.9"],
        output_fasta=Path("extracted.fasta"),
        database="core_nt"
    )

Command-Line Usage

The package includes an example script for command-line usage:

# Basic usage
python examples/blast_workflow_example.py query.fasta output_dir

# With custom parameters
python examples/blast_workflow_example.py query.fasta output_dir \
    --rank genus \
    --min-identity 95 \
    --min-coverage 85 \
    --num-threads 4

# Show help
python examples/blast_workflow_example.py --help

Command-Line Options

  • query_fasta: Path to query FASTA file (required)

  • output_dir: Directory for output files (required)

  • --rank: Taxonomic rank (phylum, class, order, family, genus, species) [default: species]

  • --min-identity: Minimum percent identity [default: 90.0]

  • --min-coverage: Minimum query coverage percentage [default: 80.0]

  • --database: BLAST database name [default: core_nt]

  • --blastdb-path: Path to BLAST database directory [default: $BLASTDB or ~/blastdb]

  • --num-threads: Number of threads for BLAST [default: 1]

  • --dry-run: Display what would be done without running BLAST

Output Structure

The workflow creates the following output structure:

output_dir/
├── blast_results.txt              # Raw BLAST results (tabular format)
├── query1_id/
│   └── query1_id_representatives.fasta
├── query2_id/
│   └── query2_id_representatives.fasta
└── ...

Each query sequence gets its own directory with a FASTA file containing representative sequences.

Data Classes

BlastHit

Represents a single BLAST hit with the following attributes:

  • query_id: Query sequence identifier

  • subject_id: Subject (database) sequence identifier

  • percent_identity: Percentage of identical matches

  • alignment_length: Length of alignment

  • query_length: Length of query sequence

  • subject_length: Length of subject sequence

  • query_start: Start position in query

  • query_end: End position in query

  • subject_start: Start position in subject

  • subject_end: End position in subject

  • evalue: Expectation value

  • bit_score: Bit score

  • query_coverage: Percentage of query covered by alignment

Taxonomic information on BlastHit

The BlastHit dataclass includes the subject_taxonomy field, which is a dictionary containing the following keys: phylum, class, order, family, genus, species

These fields are used when filtering BLAST results by taxonomy and selecting representative sequences per rank.

Requirements

System Requirements

  • BLAST+ tools must be installed:

    • blastn: For running BLAST searches

    • blastdbcmd: For extracting sequences from databases

  • NCBI BLAST database (e.g., core_nt) must be downloaded

Python Requirements

Uses Python standard library modules only:

  • os, subprocess, logging, pathlib

  • typing, collections, dataclasses

No external dependencies required.

Testing

Run the test suite:

# Test BLAST analysis module
pytest tests/test_blast_analysis.py -v

# Test taxonomy module
pytest tests/test_taxonomy.py -v

# Test complete workflow
pytest tests/test_workflow.py -v

# Run all tests
pytest tests/ -v

Examples

See examples/blast_workflow_example.py for a comprehensive example demonstrating the complete workflow.

Performance Considerations

  • BLAST searches can be slow for large query files or databases

  • Use --num-threads to parallelize BLAST searches

  • Consider splitting large query files into smaller batches

  • Ensure sufficient disk space for BLAST output and extracted sequences

Limitations

  • Taxonomic information extraction is simplified in this version

  • Full taxonomy resolution would require querying NCBI taxonomy database

  • Representative selection uses sequence ID patterns as a proxy for taxonomic grouping

  • For production use, consider integrating with NCBI Entrez or taxonomy databases

Security

The module includes security features:

  • Input validation for FASTA files

  • Safe subprocess execution with timeouts

  • Path validation to prevent directory traversal

  • Sanitized filenames for output files

Troubleshooting

“blastn is not available”

Install BLAST+ tools:

# Ubuntu/Debian
sudo apt-get install ncbi-blast+

# macOS with Homebrew
brew install blast

# Or download from NCBI
# https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/

“blastdbcmd is not available”

blastdbcmd is included with BLAST+ tools. Install BLAST+ as shown above.

“Database already exists”

If you want to use a different database location, set the BLASTDB environment variable:

export BLASTDB=/path/to/your/blastdb

“No hits found”

Check your filtering parameters:

  • Try lowering --min-identity threshold

  • Try lowering --min-coverage threshold

  • Ensure your query sequences are appropriate for the database

Future Enhancements

Potential improvements for future versions:

  • Integration with NCBI Entrez API for full taxonomy information

  • Support for additional BLAST programs (blastp, blastx, etc.)

  • Multiple sequence alignment of representative sequences

  • Phylogenetic tree construction from representatives

  • Web interface for the workflow

  • Support for custom databases beyond core_nt