Sequence Alignment

ePLACE uses MAFFT for multiple sequence alignment of query sequences and their taxonomic representatives.

Overview

After extracting representative sequences from BLAST results, ePLACE aligns them together with the query sequence to prepare for phylogenetic analysis. The alignment step:

  • Combines query and reference sequences

  • Uses MAFFT with auto-orientation (--adjustdirection)

  • Handles sequences in different orientations

  • Produces aligned FASTA files suitable for tree building

MAFFT Integration

ePLACE uses MAFFT’s automatic alignment mode with several key features:

Auto-orientation

The --adjustdirection flag automatically detects and corrects reverse-complemented sequences:

mafft --auto --adjustdirection --thread N input.fasta > output.fasta

This is crucial for environmental DNA sequences which may be in either orientation.

Automatic Algorithm Selection

MAFFT’s --auto flag selects the most appropriate algorithm based on:

  • Number of sequences

  • Sequence lengths

  • Available memory

This provides a good balance of speed and accuracy for most datasets.

Using the API

Basic Alignment

from pathlib import Path
from eplace_lib.alignment import align_sequences

# Align sequences
success = align_sequences(
    input_fasta=Path("sequences.fasta"),
    output_fasta=Path("aligned.fasta"),
    num_threads=4
)

if success:
    print("Alignment successful")
else:
    print("Alignment failed")

Checking MAFFT Availability

from eplace_lib.alignment import check_mafft_available

if check_mafft_available():
    print("MAFFT is available")
else:
    print("MAFFT not found - install MAFFT to enable alignment")

Alignment in Workflows

Individual Workflow

In the individual workflow (eplace search):

  1. Query sequence is combined with its representatives

  2. Combined FASTA is trimmed to aligned regions

  3. Trimmed sequences are aligned with MAFFT

  4. Aligned sequences are used for tree building

eplace search query.fasta output_dir --num-threads 4

Grouped Workflow

In the grouped workflow (eplace grouped):

  1. Multiple queries and their unique references are combined

  2. Combined FASTA is trimmed to aligned regions

  3. All sequences in the group are aligned together

  4. Aligned sequences are used for phylogenetic tree

eplace grouped query.fasta output_dir --num-threads 4

Skipping Alignment

If you only need BLAST results without alignment:

eplace search query.fasta output_dir --skip-alignment

This will:

  • Perform BLAST search

  • Extract representative sequences

  • Skip MAFFT alignment

  • Skip tree building

Output Files

Alignment produces the following files:

  • *_aligned.fasta - Multiple sequence alignment in FASTA format

  • Sequences are in the same order as input

  • Gap characters (-) indicate alignment positions

  • Reversed sequences are marked with _R_ prefix (MAFFT convention)

Example aligned FASTA:

>query_sequence_1
ATGC-ATGCATGC
>representative_1
ATGCGATGCATGC
>_R_representative_2
ATGC-ATGCATGC

Tree Labeling

When phylogenetic trees are labeled with taxonomic names, sequences that were reverse-complemented by MAFFT (marked with _R_ prefix) will have _R appended to their taxonomic label. This allows you to identify which sequences were reverse-complemented during alignment.

Example tree before relabeling:

(MZ387488.1:0.1,_R_CP123456.1:0.2,query:0.0);

Example tree after relabeling:

(Salmonella:0.1,Escherichia_R:0.2,query:0.0);

The _R suffix indicates that the Escherichia sequence was reverse-complemented during alignment.

Troubleshooting

MAFFT not found

If you get “MAFFT is not available”:

# Ubuntu/Debian
sudo apt-get install mafft

# macOS
brew install mafft

# Conda
conda install -c bioconda mafft

Alignment takes too long

For large datasets:

  1. Increase --num-threads for parallelization

  2. Consider filtering to fewer representative sequences

  3. Use grouped workflow to reduce number of alignments

  4. Pre-filter sequences by length or quality

Out of memory

If MAFFT runs out of memory:

  1. Process fewer sequences at a time

  2. Reduce number of representatives per rank

  3. Use a machine with more RAM

  4. Consider sequence length limits

Poor alignment quality

If alignments appear poor:

  1. Check sequence quality and length

  2. Ensure sequences are from related organisms

  3. Verify BLAST filtering parameters are appropriate

  4. Consider manual curation of representatives

Advanced Usage

Custom MAFFT Parameters

While ePLACE uses sensible defaults, you can modify the alignment code to use custom MAFFT parameters:

import subprocess
from pathlib import Path

def custom_align(input_fasta: Path, output_fasta: Path):
    """Custom alignment with specific MAFFT parameters."""
    cmd = [
        "mafft",
        "--maxiterate", "1000",
        "--localpair",  # L-INS-i algorithm
        "--adjustdirection",
        "--thread", "8",
        str(input_fasta)
    ]

    with open(output_fasta, 'w') as f:
        subprocess.run(cmd, stdout=f, check=True)

Alignment Quality Assessment

Assess alignment quality:

from Bio import AlignIO

# Read alignment
alignment = AlignIO.read("aligned.fasta", "fasta")

# Calculate statistics
num_seqs = len(alignment)
align_length = alignment.get_alignment_length()

# Count gaps
gap_counts = [seq.seq.count('-') for seq in alignment]
avg_gaps = sum(gap_counts) / num_seqs

print(f"Sequences: {num_seqs}")
print(f"Alignment length: {align_length}")
print(f"Average gaps: {avg_gaps:.1f}")

Post-processing Alignments

Trim poorly aligned regions:

from Bio import AlignIO
from Bio.Align import MultipleSeqAlignment

# Read alignment
alignment = AlignIO.read("aligned.fasta", "fasta")

# Trim columns with >50% gaps
trimmed_alignment = []
for i in range(alignment.get_alignment_length()):
    column = alignment[:, i]
    gap_fraction = column.count('-') / len(column)
    if gap_fraction < 0.5:
        trimmed_alignment.append(column)

# Save trimmed alignment
# (implementation depends on your needs)

Best Practices

  1. Quality Control: Check input sequences before alignment

  2. Threading: Use multiple threads (--num-threads) for speed

  3. Memory: Monitor memory usage for large alignments

  4. Validation: Visually inspect alignments when possible

  5. Documentation: Record alignment parameters for reproducibility

Performance Considerations

Alignment speed depends on:

  • Number of sequences

  • Sequence lengths

  • Algorithm selected by MAFFT

  • Number of threads

  • Available memory

Typical timings:

  • 10 sequences × 500bp: < 1 second

  • 50 sequences × 1000bp: ~5-10 seconds

  • 100 sequences × 2000bp: ~30-60 seconds

See Also