Sequence Alignment
ePLACE uses MAFFT for multiple sequence alignment of query sequences and their taxonomic representatives.
Overview
After extracting representative sequences from BLAST results, ePLACE aligns them together with the query sequence to prepare for phylogenetic analysis. The alignment step:
Combines query and reference sequences
Uses MAFFT with auto-orientation (
--adjustdirection)Handles sequences in different orientations
Produces aligned FASTA files suitable for tree building
MAFFT Integration
ePLACE uses MAFFT’s automatic alignment mode with several key features:
Auto-orientation
The --adjustdirection flag automatically detects and corrects reverse-complemented sequences:
mafft --auto --adjustdirection --thread N input.fasta > output.fasta
This is crucial for environmental DNA sequences which may be in either orientation.
Automatic Algorithm Selection
MAFFT’s --auto flag selects the most appropriate algorithm based on:
Number of sequences
Sequence lengths
Available memory
This provides a good balance of speed and accuracy for most datasets.
Using the API
Basic Alignment
from pathlib import Path
from eplace_lib.alignment import align_sequences
# Align sequences
success = align_sequences(
input_fasta=Path("sequences.fasta"),
output_fasta=Path("aligned.fasta"),
num_threads=4
)
if success:
print("Alignment successful")
else:
print("Alignment failed")
Checking MAFFT Availability
from eplace_lib.alignment import check_mafft_available
if check_mafft_available():
print("MAFFT is available")
else:
print("MAFFT not found - install MAFFT to enable alignment")
Alignment in Workflows
Individual Workflow
In the individual workflow (eplace search):
Query sequence is combined with its representatives
Combined FASTA is trimmed to aligned regions
Trimmed sequences are aligned with MAFFT
Aligned sequences are used for tree building
eplace search query.fasta output_dir --num-threads 4
Grouped Workflow
In the grouped workflow (eplace grouped):
Multiple queries and their unique references are combined
Combined FASTA is trimmed to aligned regions
All sequences in the group are aligned together
Aligned sequences are used for phylogenetic tree
eplace grouped query.fasta output_dir --num-threads 4
Skipping Alignment
If you only need BLAST results without alignment:
eplace search query.fasta output_dir --skip-alignment
This will:
Perform BLAST search
Extract representative sequences
Skip MAFFT alignment
Skip tree building
Output Files
Alignment produces the following files:
*_aligned.fasta- Multiple sequence alignment in FASTA formatSequences are in the same order as input
Gap characters (
-) indicate alignment positionsReversed sequences are marked with
_R_prefix (MAFFT convention)
Example aligned FASTA:
>query_sequence_1
ATGC-ATGCATGC
>representative_1
ATGCGATGCATGC
>_R_representative_2
ATGC-ATGCATGC
Tree Labeling
When phylogenetic trees are labeled with taxonomic names, sequences that were reverse-complemented by MAFFT (marked with _R_ prefix) will have _R appended to their taxonomic label. This allows you to identify which sequences were reverse-complemented during alignment.
Example tree before relabeling:
(MZ387488.1:0.1,_R_CP123456.1:0.2,query:0.0);
Example tree after relabeling:
(Salmonella:0.1,Escherichia_R:0.2,query:0.0);
The _R suffix indicates that the Escherichia sequence was reverse-complemented during alignment.
Troubleshooting
MAFFT not found
If you get “MAFFT is not available”:
# Ubuntu/Debian
sudo apt-get install mafft
# macOS
brew install mafft
# Conda
conda install -c bioconda mafft
Alignment takes too long
For large datasets:
Increase
--num-threadsfor parallelizationConsider filtering to fewer representative sequences
Use grouped workflow to reduce number of alignments
Pre-filter sequences by length or quality
Out of memory
If MAFFT runs out of memory:
Process fewer sequences at a time
Reduce number of representatives per rank
Use a machine with more RAM
Consider sequence length limits
Poor alignment quality
If alignments appear poor:
Check sequence quality and length
Ensure sequences are from related organisms
Verify BLAST filtering parameters are appropriate
Consider manual curation of representatives
Advanced Usage
Custom MAFFT Parameters
While ePLACE uses sensible defaults, you can modify the alignment code to use custom MAFFT parameters:
import subprocess
from pathlib import Path
def custom_align(input_fasta: Path, output_fasta: Path):
"""Custom alignment with specific MAFFT parameters."""
cmd = [
"mafft",
"--maxiterate", "1000",
"--localpair", # L-INS-i algorithm
"--adjustdirection",
"--thread", "8",
str(input_fasta)
]
with open(output_fasta, 'w') as f:
subprocess.run(cmd, stdout=f, check=True)
Alignment Quality Assessment
Assess alignment quality:
from Bio import AlignIO
# Read alignment
alignment = AlignIO.read("aligned.fasta", "fasta")
# Calculate statistics
num_seqs = len(alignment)
align_length = alignment.get_alignment_length()
# Count gaps
gap_counts = [seq.seq.count('-') for seq in alignment]
avg_gaps = sum(gap_counts) / num_seqs
print(f"Sequences: {num_seqs}")
print(f"Alignment length: {align_length}")
print(f"Average gaps: {avg_gaps:.1f}")
Post-processing Alignments
Trim poorly aligned regions:
from Bio import AlignIO
from Bio.Align import MultipleSeqAlignment
# Read alignment
alignment = AlignIO.read("aligned.fasta", "fasta")
# Trim columns with >50% gaps
trimmed_alignment = []
for i in range(alignment.get_alignment_length()):
column = alignment[:, i]
gap_fraction = column.count('-') / len(column)
if gap_fraction < 0.5:
trimmed_alignment.append(column)
# Save trimmed alignment
# (implementation depends on your needs)
Best Practices
Quality Control: Check input sequences before alignment
Threading: Use multiple threads (
--num-threads) for speedMemory: Monitor memory usage for large alignments
Validation: Visually inspect alignments when possible
Documentation: Record alignment parameters for reproducibility
Performance Considerations
Alignment speed depends on:
Number of sequences
Sequence lengths
Algorithm selected by MAFFT
Number of threads
Available memory
Typical timings:
10 sequences × 500bp: < 1 second
50 sequences × 1000bp: ~5-10 seconds
100 sequences × 2000bp: ~30-60 seconds
See Also
Phylogenetic Trees - Tree building from alignments
Workflows - Complete workflow documentation
BLAST Sequence Comparison Module - BLAST to alignment pipeline