Sequence Alignment

ePLACE uses MAFFT for multiple sequence alignment of query sequences and their taxonomic representatives.

Overview

After extracting representative sequences from BLAST results, ePLACE aligns them together with the query sequence to prepare for phylogenetic analysis. The alignment step:

Combines query and reference sequences
Uses MAFFT with auto-orientation (--adjustdirection)
Handles sequences in different orientations
Produces aligned FASTA files suitable for tree building

MAFFT Integration

ePLACE uses MAFFT’s automatic alignment mode with several key features:

Auto-orientation

The --adjustdirection flag automatically detects and corrects reverse-complemented sequences:

mafft --auto --adjustdirection --thread N input.fasta > output.fasta

This is crucial for environmental DNA sequences which may be in either orientation.

Automatic Algorithm Selection

MAFFT’s --auto flag selects the most appropriate algorithm based on:

Number of sequences
Sequence lengths
Available memory

This provides a good balance of speed and accuracy for most datasets.

Using the API

Basic Alignment

from pathlib import Path
from eplace_lib.alignment import align_sequences

# Align sequences
success = align_sequences(
    input_fasta=Path("sequences.fasta"),
    output_fasta=Path("aligned.fasta"),
    num_threads=4
)

if success:
    print("Alignment successful")
else:
    print("Alignment failed")

Checking MAFFT Availability

from eplace_lib.alignment import check_mafft_available

if check_mafft_available():
    print("MAFFT is available")
else:
    print("MAFFT not found - install MAFFT to enable alignment")

Alignment in Workflows

Individual Workflow

In the individual workflow (eplace search):

Query sequence is combined with its representatives
Combined FASTA is trimmed to aligned regions
Trimmed sequences are aligned with MAFFT
Aligned sequences are used for tree building

eplace search query.fasta output_dir --num-threads 4

Grouped Workflow

In the grouped workflow (eplace grouped):

Multiple queries and their unique references are combined
Combined FASTA is trimmed to aligned regions
All sequences in the group are aligned together
Aligned sequences are used for phylogenetic tree

eplace grouped query.fasta output_dir --num-threads 4

Skipping Alignment

If you only need BLAST results without alignment:

eplace search query.fasta output_dir --skip-alignment

This will:

Perform BLAST search
Extract representative sequences
Skip MAFFT alignment
Skip tree building

Output Files

Alignment produces the following files:

*_aligned.fasta - Multiple sequence alignment in FASTA format
Sequences are in the same order as input
Gap characters (-) indicate alignment positions
Reversed sequences are marked with _R_ prefix (MAFFT convention)

Example aligned FASTA:

>query_sequence_1
ATGC-ATGCATGC
>representative_1
ATGCGATGCATGC
>_R_representative_2
ATGC-ATGCATGC

Tree Labeling

When phylogenetic trees are labeled with taxonomic names, sequences that were reverse-complemented by MAFFT (marked with _R_ prefix) will have _R appended to their taxonomic label. This allows you to identify which sequences were reverse-complemented during alignment.

Example tree before relabeling:

(MZ387488.1:0.1,_R_CP123456.1:0.2,query:0.0);

Example tree after relabeling:

(Salmonella:0.1,Escherichia_R:0.2,query:0.0);

The _R suffix indicates that the Escherichia sequence was reverse-complemented during alignment.

Troubleshooting

MAFFT not found

If you get “MAFFT is not available”:

# Ubuntu/Debian
sudo apt-get install mafft

# macOS
brew install mafft

# Conda
conda install -c bioconda mafft

Alignment takes too long

For large datasets:

Increase --num-threads for parallelization
Consider filtering to fewer representative sequences
Use grouped workflow to reduce number of alignments
Pre-filter sequences by length or quality

Out of memory

If MAFFT runs out of memory:

Process fewer sequences at a time
Reduce number of representatives per rank
Use a machine with more RAM
Consider sequence length limits

Poor alignment quality

If alignments appear poor:

Check sequence quality and length
Ensure sequences are from related organisms
Verify BLAST filtering parameters are appropriate
Consider manual curation of representatives

Advanced Usage

Custom MAFFT Parameters

While ePLACE uses sensible defaults, you can modify the alignment code to use custom MAFFT parameters:

import subprocess
from pathlib import Path

def custom_align(input_fasta: Path, output_fasta: Path):
    """Custom alignment with specific MAFFT parameters."""
    cmd = [
        "mafft",
        "--maxiterate", "1000",
        "--localpair",  # L-INS-i algorithm
        "--adjustdirection",
        "--thread", "8",
        str(input_fasta)
    ]

    with open(output_fasta, 'w') as f:
        subprocess.run(cmd, stdout=f, check=True)

Alignment Quality Assessment

Assess alignment quality:

from Bio import AlignIO

# Read alignment
alignment = AlignIO.read("aligned.fasta", "fasta")

# Calculate statistics
num_seqs = len(alignment)
align_length = alignment.get_alignment_length()

# Count gaps
gap_counts = [seq.seq.count('-') for seq in alignment]
avg_gaps = sum(gap_counts) / num_seqs

print(f"Sequences: {num_seqs}")
print(f"Alignment length: {align_length}")
print(f"Average gaps: {avg_gaps:.1f}")

Post-processing Alignments

Trim poorly aligned regions:

from Bio import AlignIO
from Bio.Align import MultipleSeqAlignment

# Read alignment
alignment = AlignIO.read("aligned.fasta", "fasta")

# Trim columns with >50% gaps
trimmed_alignment = []
for i in range(alignment.get_alignment_length()):
    column = alignment[:, i]
    gap_fraction = column.count('-') / len(column)
    if gap_fraction < 0.5:
        trimmed_alignment.append(column)

# Save trimmed alignment
# (implementation depends on your needs)

Best Practices

Quality Control: Check input sequences before alignment
Threading: Use multiple threads (--num-threads) for speed
Memory: Monitor memory usage for large alignments
Validation: Visually inspect alignments when possible
Documentation: Record alignment parameters for reproducibility

Performance Considerations

Alignment speed depends on:

Number of sequences
Sequence lengths
Algorithm selected by MAFFT
Number of threads
Available memory

Typical timings:

10 sequences × 500bp: < 1 second
50 sequences × 1000bp: ~5-10 seconds
100 sequences × 2000bp: ~30-60 seconds