# BLAST Sequence Comparison Module

This module provides functionality for running BLAST searches, filtering results, and extracting representative sequences by taxonomic rank.

## Features

- Read FASTA files with validation
- Run blastn searches against NCBI databases
- Parse and filter BLAST results by identity and coverage
- Select representative sequences per taxonomic rank
- Extract sequences from BLAST databases
- Save results to separate FASTA files per query

## Quick Start

### Complete Workflow

```python
from pathlib import Path
from eplace_lib.blast_analysis import run_blast_search
from eplace_lib.taxonomy import process_blast_results_for_taxonomy

# Step 1: Run BLAST search with filtering
success, filtered_hits = run_blast_search(
    query_fasta=Path("query.fasta"),
    output_file=Path("blast_results.txt"),
    min_identity=90.0,  # 90% identity threshold
    min_coverage=80.0,   # 80% query coverage threshold
    database="core_nt",
    num_threads=4
)

# Step 2: Extract representative sequences by taxonomic rank
results = process_blast_results_for_taxonomy(
    blast_hits=filtered_hits,
    output_dir=Path("output"),
    rank="species"  # Can be: phylum, class, order, family, genus, species
)

# Results contains mapping of query IDs to output FASTA files
for query_id, output_fasta in results.items():
    print(f"{query_id}: {output_fasta}")
```

### Read FASTA Files

```python
from eplace_lib.blast_analysis import FastaReader

# Read sequences from FASTA file
sequences = FastaReader.read_fasta(Path("input.fasta"))

# Get sequence lengths
lengths = FastaReader.get_sequence_lengths(Path("input.fasta"))

for seq_id, length in lengths.items():
    print(f"{seq_id}: {length} bp")
```

### Run BLAST Search Only

```python
from eplace_lib.blast_analysis import BlastRunner

runner = BlastRunner()

# Check if blastn is available
if runner.check_blastn_available():
    # Run BLAST
    success = runner.run_blastn(
        query_fasta=Path("query.fasta"),
        output_file=Path("blast_results.txt"),
        database="core_nt",
        num_threads=4,
        max_target_seqs=100,
        evalue=1e-5
    )
```

### Parse and Filter BLAST Results

```python
from eplace_lib.blast_analysis import BlastRunner

runner = BlastRunner()

# Parse BLAST tabular output
hits = runner.parse_blast_results(Path("blast_results.txt"))

# Filter hits by identity and coverage
filtered_hits = runner.filter_blast_hits(
    hits,
    min_identity=90.0,
    min_coverage=80.0
)

print(f"Filtered {len(hits)} hits to {len(filtered_hits)} hits")
```

### Select Representatives by Taxonomic Rank

```python
from eplace_lib.taxonomy import TaxonomyExtractor

# Initialize extractor with desired rank
extractor = TaxonomyExtractor()

# Group hits by query
grouped_hits = extractor.group_hits_by_query(blast_hits)

# Select representatives for each query
for query_id, query_hits in grouped_hits.items():
    representatives = extractor.select_representatives_by_rank(
        hits=query_hits,
        rank="genus",
        max_per_rank=1   # Number per rank
    )
    print(f"{query_id}: {len(representatives)} representatives")
```

### Extract Sequences from Database

```python
from eplace_lib.taxonomy import SequenceExtractor

extractor = SequenceExtractor()

# Check if blastdbcmd is available
if extractor.check_blastdbcmd_available():
    # Extract sequences
    success = extractor.extract_sequences(
        sequence_ids=["NC_001234.5", "NC_005678.9"],
        output_fasta=Path("extracted.fasta"),
        database="core_nt"
    )
```

## Command-Line Usage

The package includes an example script for command-line usage:

```bash
# Basic usage
python examples/blast_workflow_example.py query.fasta output_dir

# With custom parameters
python examples/blast_workflow_example.py query.fasta output_dir \
    --rank genus \
    --min-identity 95 \
    --min-coverage 85 \
    --num-threads 4

# Show help
python examples/blast_workflow_example.py --help
```

### Command-Line Options

- `query_fasta`: Path to query FASTA file (required)
- `output_dir`: Directory for output files (required)
- `--rank`: Taxonomic rank (phylum, class, order, family, genus, species) [default: species]
- `--min-identity`: Minimum percent identity [default: 90.0]
- `--min-coverage`: Minimum query coverage percentage [default: 80.0]
- `--database`: BLAST database name [default: core_nt]
- `--blastdb-path`: Path to BLAST database directory [default: $BLASTDB or ~/blastdb]
- `--num-threads`: Number of threads for BLAST [default: 1]
- `--dry-run`: Display what would be done without running BLAST

## Output Structure

The workflow creates the following output structure:

```
output_dir/
├── blast_results.txt              # Raw BLAST results (tabular format)
├── query1_id/
│   └── query1_id_representatives.fasta
├── query2_id/
│   └── query2_id_representatives.fasta
└── ...
```

Each query sequence gets its own directory with a FASTA file containing representative sequences.

## Data Classes

### BlastHit

Represents a single BLAST hit with the following attributes:

- `query_id`: Query sequence identifier
- `subject_id`: Subject (database) sequence identifier
- `percent_identity`: Percentage of identical matches
- `alignment_length`: Length of alignment
- `query_length`: Length of query sequence
- `subject_length`: Length of subject sequence
- `query_start`: Start position in query
- `query_end`: End position in query
- `subject_start`: Start position in subject
- `subject_end`: End position in subject
- `evalue`: Expectation value
- `bit_score`: Bit score
- `query_coverage`: Percentage of query covered by alignment

### Taxonomic information on `BlastHit`

The `BlastHit` dataclass includes the subject_taxonomy field, which is a dictionary containing the following keys:
`phylum`, `class`, `order`, `family`, `genus`, `species`


These fields are used when filtering BLAST results by taxonomy and selecting representative sequences per rank.

## Requirements

### System Requirements

- BLAST+ tools must be installed:
  - `blastn`: For running BLAST searches
  - `blastdbcmd`: For extracting sequences from databases
- NCBI BLAST database (e.g., core_nt) must be downloaded

### Python Requirements

Uses Python standard library modules only:
- `os`, `subprocess`, `logging`, `pathlib`
- `typing`, `collections`, `dataclasses`

No external dependencies required.

## Testing

Run the test suite:

```bash
# Test BLAST analysis module
pytest tests/test_blast_analysis.py -v

# Test taxonomy module
pytest tests/test_taxonomy.py -v

# Test complete workflow
pytest tests/test_workflow.py -v

# Run all tests
pytest tests/ -v
```

## Examples

See `examples/blast_workflow_example.py` for a comprehensive example demonstrating the complete workflow.

## Performance Considerations

- BLAST searches can be slow for large query files or databases
- Use `--num-threads` to parallelize BLAST searches
- Consider splitting large query files into smaller batches
- Ensure sufficient disk space for BLAST output and extracted sequences

## Limitations

- Taxonomic information extraction is simplified in this version
- Full taxonomy resolution would require querying NCBI taxonomy database
- Representative selection uses sequence ID patterns as a proxy for taxonomic grouping
- For production use, consider integrating with NCBI Entrez or taxonomy databases

## Security

The module includes security features:

- Input validation for FASTA files
- Safe subprocess execution with timeouts
- Path validation to prevent directory traversal
- Sanitized filenames for output files

## Troubleshooting

### "blastn is not available"

Install BLAST+ tools:
```bash
# Ubuntu/Debian
sudo apt-get install ncbi-blast+

# macOS with Homebrew
brew install blast

# Or download from NCBI
# https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
```

### "blastdbcmd is not available"

blastdbcmd is included with BLAST+ tools. Install BLAST+ as shown above.

### "Database already exists"

If you want to use a different database location, set the `BLASTDB` environment variable:

```bash
export BLASTDB=/path/to/your/blastdb
```

### "No hits found"

Check your filtering parameters:
- Try lowering `--min-identity` threshold
- Try lowering `--min-coverage` threshold
- Ensure your query sequences are appropriate for the database

## Future Enhancements

Potential improvements for future versions:

- Integration with NCBI Entrez API for full taxonomy information
- Support for additional BLAST programs (blastp, blastx, etc.)
- Multiple sequence alignment of representative sequences
- Phylogenetic tree construction from representatives
- Web interface for the workflow
- Support for custom databases beyond core_nt