Command-Line Interface
======================

The ``eplace`` command provides a unified interface to all ePLACE functionality through three subcommands.

Installation Verification
--------------------------

After installing the package with ``pip install .`` or ``pip install -e .``, the ``eplace`` command will be available:

.. code-block:: bash

   # Verify installation
   eplace --help

   # Check version
   eplace --version

Commands Overview
-----------------

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Command
     - Description
   * - ``eplace download``
     - Download NCBI BLAST database
   * - ``eplace search``
     - Run individual search workflow (one tree per query; BLAST by default, MMseqs2 via ``--search-tool``)
   * - ``eplace grouped``
     - Run grouped BLAST workflow (one tree per taxonomic group)
   * - ``eplace relabel``
     - Relabel phylogenetic tree with taxonomic names

eplace download
---------------

Download and setup the NCBI core_nt BLAST database and/or MMseqs2 NT database.

Usage
~~~~~

.. code-block:: bash

   eplace download [--target {blast,mmseqs2,both}] [--force] [MMSEQS_OPTIONS]

Options
~~~~~~~

.. option:: --target {blast,mmseqs2,both}

   Database backend(s) to download

   Default: ``blast``

.. option:: --force

   Force redownload even if database exists

.. option:: --mmseqs-db-dir PATH

   Path to MMseqs2 database root directory. Defaults to ``$MMSEQS_DB_DIR``,
   then ``$MMSEQS2DB``, then ``~/mmseqs2db``.

.. option:: --mmseqs-threads INT

   Number of threads for MMseqs2 download/taxonomy commands

.. option:: --add-taxonomy

   Add taxonomy sidecar files to MMseqs2 NT database after download.

.. option:: --ncbi-taxonomy PATH

   Path to NCBI taxonomy dump directory containing ``nodes.dmp``,
   ``names.dmp``, and ``merged.dmp``. Required with ``--add-taxonomy``.

.. option:: --acc2taxid-dir PATH

   Path to accession2taxid files. Defaults to ``$ACC2TAXID_DIR`` or
   ``<ncbi-taxonomy>/accession2taxid``.

.. option:: --taxonomy-workdir PATH

   Working directory for MMseqs taxonomy mapping files.

.. option:: --skip-memory-check

   Skip RAM preflight checks for MMseqs2 download/taxonomy.

Examples
~~~~~~~~

.. code-block:: bash

   # Download BLAST database to default location ($BLASTDB or ~/blastdb)
   eplace download

   # Download MMseqs2 NT database
   eplace download --target mmseqs2 --mmseqs-db-dir /path/to/mmseqs_db

   # Download MMseqs2 NT database and add taxonomy sidecar files
   eplace download --target mmseqs2 --add-taxonomy --ncbi-taxonomy /path/to/ncbi/taxonomy/current

Notes
~~~~~

* BLAST DB location: ``$BLASTDB`` or ``~/blastdb``
* MMseqs2 DB location: ``$MMSEQS_DB_DIR``, then ``$MMSEQS2DB``, or ``~/mmseqs2db``
* MMseqs2 NT download typically requires at least 64 GiB RAM
* MMseqs2 taxonomy integration typically requires at least 128 GiB RAM

eplace search
------------

Run BLAST search with individual taxonomy analysis. Creates one phylogenetic tree per query sequence.

Usage
~~~~~

.. code-block:: bash

   eplace search QUERY_FASTA OUTPUT_DIR [OPTIONS]

Required Arguments
~~~~~~~~~~~~~~~~~~

.. option:: QUERY_FASTA

   Path to query FASTA file containing sequences to search

.. option:: OUTPUT_DIR

   Output directory for results (will be created if it doesn't exist)

Optional Arguments
~~~~~~~~~~~~~~~~~~

Taxonomy Options
^^^^^^^^^^^^^^^^

.. option:: --rank {phylum,class,order,family,genus,species}

   Taxonomic rank for representative selection
   
   Default: ``genus``

.. option:: --tree-label-rank {phylum,class,order,family,genus,species}

   Taxonomic rank for tree labeling
   
   Default: ``genus``

Filtering Options
^^^^^^^^^^^^^^^^^

.. option:: --min-identity FLOAT

   Minimum percent identity for BLAST hits
   
   Default: ``90.0``

.. option:: --min-coverage FLOAT

   Minimum query coverage percentage
   
   Default: ``80.0``

Database Options
^^^^^^^^^^^^^^^^

.. option:: --database NAME

   BLAST database name
   
   Default: ``core_nt``

.. option:: --blastdb-path PATH

   Path to BLAST database directory

Performance Options
^^^^^^^^^^^^^^^^^^^

.. option:: --num-threads INT

   Number of threads for BLAST and alignment
   
   Default: ``1``

Workflow Options
^^^^^^^^^^^^^^^^

.. option:: --overwrite-existing-blast

   Overwrite existing BLAST results

.. option:: --skip-alignment

   Skip alignment and tree building steps

.. option:: --output-classification PATH

   Path to output classification TSV file

MMseqs2 Options
^^^^^^^^^^^^^^^

These options apply only when ``--search-tool mmseqs2`` is specified.

.. option:: --mmseqs-memory-limit LIMIT

   Maximum RAM for the MMseqs2 prefilter/index step, passed as
   ``--split-memory-limit`` to ``mmseqs easy-search``.

   Default: ``400G``

   On smaller hosts (laptops/small VMs), set a lower value explicitly,
   for example ``16G`` or ``32G``.

   Accepts MMseqs2-style memory strings: a positive integer (no leading zeros)
   immediately followed by a
   single-character unit ``K``, ``M``, ``G``, or ``T`` (no extra suffix).
   Invalid values (e.g. ``0G`` not positive, ``01G`` with a leading zero,
   ``400GB`` where ``B`` makes a double unit, or
   ``fourhundredG`` with a non-numeric prefix) will cause an error before the
   search starts.

.. option:: --mmseqs-sensitivity FLOAT

   MMseqs2 sensitivity setting (1–7.5)

   Default: ``5.7``

.. option:: --mmseqs-search-type INT

   MMseqs2 search type (2 = translated, 3 = nucleotide, 4 = translated
   nucleotide backtrace)

   Default: ``3``

.. option:: --mmseqs-database NAME

   MMseqs2 database name (default: same as ``--database``)

.. option:: --mmseqs-db-path PATH

   Path to the MMseqs2 database directory

.. option:: --mmseqs-db-source LABEL

   Provenance label for the MMseqs2 database, recorded in
   ``search_metadata.json``

Examples
~~~~~~~~

.. code-block:: bash

   # Basic usage with default parameters
   eplace search query.fasta output_dir

   # With custom parameters
   eplace search query.fasta output_dir \
       --rank genus \
       --min-identity 95 \
       --min-coverage 85 \
       --num-threads 4

   # Skip alignment and tree building (BLAST only)
   eplace search query.fasta output_dir --skip-alignment

   # Use custom BLAST database location
   eplace search query.fasta output_dir --blastdb-path /path/to/blastdb

   # Use MMseqs2 with memory limit for large NT database
   eplace search query.fasta output_dir \
       --search-tool mmseqs2 \
       --mmseqs-db-path /path/to/mmseqs_db \
       --mmseqs-memory-limit 400G

Output Structure
~~~~~~~~~~~~~~~~

.. code-block:: text

   output_dir/
   ├── blast_results.txt              # Raw BLAST results
   ├── blast_results_annotated.txt    # BLAST results with taxonomic annotations
   ├── query1_id/
   │   ├── query1_id_representatives.fasta          # Representative sequences
   │   ├── query1_id_with_query.fasta              # Query + representatives
   │   ├── query1_id_trimmed.fasta                 # Trimmed to aligned regions
   │   ├── query1_id_aligned.fasta                 # Multiple sequence alignment
   │   ├── query1_id_tree.treefile                 # Phylogenetic tree
   │   ├── query1_id_tree_labeled.treefile         # Tree with taxonomic labels
   │   └── query1_id_tree.* (other IQTree files)
   └── ...

eplace grouped
--------------

Run BLAST search with grouped taxonomy analysis. Groups queries by taxonomic rank and creates one phylogenetic tree per group.

Usage
~~~~~

.. code-block:: bash

   eplace grouped QUERY_FASTA OUTPUT_DIR [OPTIONS]

Required Arguments
~~~~~~~~~~~~~~~~~~

.. option:: QUERY_FASTA

   Path to query FASTA file containing sequences to search

.. option:: OUTPUT_DIR

   Output directory for results (will be created if it doesn't exist)

Optional Arguments
~~~~~~~~~~~~~~~~~~

Taxonomy Options
^^^^^^^^^^^^^^^^

.. option:: --rank {phylum,class,order,family,genus,species}

   Taxonomic rank for representative selection
   
   Default: ``genus``

.. option:: --group-rank {phylum,class,order,family,genus,species}

   Taxonomic rank for grouping sequences
   
   Default: ``class``

.. option:: --tree-label-rank {phylum,class,order,family,genus,species}

   Taxonomic rank for tree labeling
   
   Default: ``genus``

.. option:: --combined-tree-label-rank {phylum,class,order,family,genus,species}

   Taxonomic rank for labeling the combined tree (optional)
   
   Default: Not set (combined tree will not be built)
   
   The grouped workflow can create a combined tree from all groups, but this is optional because it can be very time-consuming with large datasets. If you want to build the combined tree, specify this parameter with the desired taxonomic rank for labeling. If not provided, only individual group trees will be built.

Filtering Options
^^^^^^^^^^^^^^^^^

.. option:: --min-identity FLOAT

   Minimum percent identity for BLAST hits
   
   Default: ``90.0``

.. option:: --min-coverage FLOAT

   Minimum query coverage percentage
   
   Default: ``80.0``

Database Options
^^^^^^^^^^^^^^^^

.. option:: --database NAME

   BLAST database name
   
   Default: ``core_nt``

.. option:: --blastdb-path PATH

   Path to BLAST database directory

Performance Options
^^^^^^^^^^^^^^^^^^^

.. option:: --num-threads INT

   Number of threads for BLAST and alignment
   
   Default: ``1``

Workflow Options
^^^^^^^^^^^^^^^^

.. option:: --overwrite-existing-blast

   Overwrite existing BLAST results

.. option:: --skip-alignment

   Skip alignment and tree building steps

.. option:: --alignment-tolerance INT

   Maximum coordinate difference for alignment consistency
   
   Default: ``50``

.. option:: --output-classification PATH

   Path to output classification TSV file

MMseqs2 Options
^^^^^^^^^^^^^^^

These options apply only when ``--search-tool mmseqs2`` is specified.

.. option:: --mmseqs-memory-limit LIMIT

   Maximum RAM for the MMseqs2 prefilter/index step, passed as
   ``--split-memory-limit`` to ``mmseqs easy-search``.

   Default: ``400G``

   Accepts MMseqs2-style memory strings: a positive integer (no leading zeros)
   immediately followed by a single-character unit ``K``, ``M``, ``G``, or
   ``T`` (no extra suffix, e.g. ``64G``, ``400G``, ``1T``).
   Invalid values (e.g. ``0G`` not positive, ``01G`` with leading zero,
   ``400GB`` with extra ``B`` suffix, ``fourhundredG`` with non-numeric prefix)
   will cause an error before the search starts.

.. option:: --mmseqs-sensitivity FLOAT

   MMseqs2 sensitivity setting (1–7.5)

   Default: ``5.7``

.. option:: --mmseqs-search-type INT

   MMseqs2 search type (2 = translated, 3 = nucleotide, 4 = translated
   nucleotide backtrace)

   Default: ``3``

.. option:: --mmseqs-database NAME

   MMseqs2 database name (default: same as ``--database``)

.. option:: --mmseqs-db-path PATH

   Path to the MMseqs2 database directory

.. option:: --mmseqs-db-source LABEL

   Provenance label for the MMseqs2 database, recorded in
   ``search_metadata.json``

Examples
~~~~~~~~

.. code-block:: bash

   # Basic usage (group by class, default)
   eplace grouped query.fasta output_dir

   # Group by different taxonomic rank
   eplace grouped query.fasta output_dir --group-rank order

   # Specify both representative and grouping ranks
   eplace grouped query.fasta output_dir --rank genus --group-rank family

   # Skip alignment and tree building
   eplace grouped query.fasta output_dir --skip-alignment

   # Use MMseqs2 with memory limit for large NT database
   eplace grouped query.fasta output_dir \
       --search-tool mmseqs2 \
       --mmseqs-db-path /path/to/mmseqs_db \
       --mmseqs-memory-limit 400G

Output Structure
~~~~~~~~~~~~~~~~

.. code-block:: text

   output_dir/
   ├── blast_results.txt              # Raw BLAST results
   ├── blast_results_annotated.txt    # BLAST results with taxonomic annotations
   ├── query1_id/                     # Per-query representative sequences
   │   └── query1_id_representatives.fasta
   ├── Taxonomic_Group_1/             # One directory per taxonomic group
   │   ├── Taxonomic_Group_1_combined.fasta        # All queries + unique references
   │   ├── Taxonomic_Group_1_trimmed.fasta         # Trimmed to aligned regions
   │   ├── Taxonomic_Group_1_aligned.fasta         # Multiple sequence alignment
   │   ├── Taxonomic_Group_1_tree.treefile         # Phylogenetic tree
   │   ├── Taxonomic_Group_1_tree_labeled.treefile # Tree with taxonomic labels
   │   └── Taxonomic_Group_1_tree.* (other IQTree files)
   ├── combined_all_groups_trimmed.fasta           # Combined alignment of all groups
   ├── combined_all_groups_aligned.fasta           # Multiple sequence alignment
   ├── combined_all_groups_tree.treefile           # Combined phylogenetic tree
   ├── combined_all_groups_tree_labeled.treefile   # Combined tree with taxonomic labels
   └── ...

eplace relabel
--------------

Relabel a phylogenetic tree with taxonomic names from BLAST results. This command allows you to replace sequence IDs in an existing tree with taxonomic names at any specified rank.

Usage
~~~~~

.. code-block:: bash

   eplace relabel BLAST_OUTPUT TREE_FILE OUTPUT_TREE [OPTIONS]

Required Arguments
~~~~~~~~~~~~~~~~~~

.. option:: BLAST_OUTPUT

   Path to BLAST output file (tabular format with taxonomy)
   
   The BLAST results file should contain taxonomic information for the sequences in the tree.

.. option:: TREE_FILE

   Path to input tree file (Newick format)
   
   The phylogenetic tree to be relabeled with taxonomic names.

.. option:: OUTPUT_TREE

   Path to output relabeled tree file
   
   The new tree file with taxonomic labels will be written to this path.

Optional Arguments
~~~~~~~~~~~~~~~~~~

Taxonomy Options
^^^^^^^^^^^^^^^^

.. option:: --rank {phylum,class,order,family,genus,species}

   Taxonomic rank for tree labeling
   
   Default: ``genus``
   
   When using ``species``, the output will use binomial nomenclature (genus + species).

Database Options
^^^^^^^^^^^^^^^^

.. option:: --blastdb-path PATH

   Path to BLAST database directory
   
   Optional parameter, not required for relabeling operation.

Examples
~~~~~~~~

.. code-block:: bash

   # Relabel tree with genus names (default)
   eplace relabel blast_results.txt input.treefile output_labeled.treefile
   
   # Relabel tree with species names (genus + species binomial)
   eplace relabel blast_results.txt input.treefile output_species.treefile --rank species
   
   # Relabel tree with family names
   eplace relabel blast_results.txt input.treefile output_family.treefile --rank family
   
   # Relabel tree with order names
   eplace relabel blast_results.txt input.treefile output_order.treefile --rank order

Key Features
~~~~~~~~~~~~

* **Flexible Taxonomic Ranks**: Supports all standard taxonomic ranks from phylum to species
* **Binomial Nomenclature**: Species rank uses "genus species" format for proper scientific names
* **Topology Preservation**: Maintains the original tree structure while updating labels
* **Format Compatibility**: Works with standard Newick format trees
* **Reversed Sequences**: Handles sequences with _R_ prefix (from MAFFT orientation)
* **Label Cleaning**: Automatically cleans labels for Newick format compatibility

Use Cases
~~~~~~~~~

The ``eplace relabel`` command is useful in several scenarios:

1. **Re-labeling at Different Ranks**
   
   Generate multiple versions of the same tree with different taxonomic granularity without rebuilding the tree:
   
   .. code-block:: bash
   
      # Create genus-level tree
      eplace relabel blast_results.txt tree.treefile tree_genus.treefile --rank genus
      
      # Create family-level tree from same data
      eplace relabel blast_results.txt tree.treefile tree_family.treefile --rank family

2. **External Tree Tools**
   
   Add taxonomic labels to trees generated by external phylogenetic tools:
   
   .. code-block:: bash
   
      # After building tree with RAxML, IQTree, or FastTree
      eplace relabel blast_results.txt external_tree.nwk labeled_tree.nwk --rank genus

3. **Visualization Preparation**
   
   Create publication-ready trees with appropriate taxonomic labels:
   
   .. code-block:: bash
   
      # For species-level visualization
      eplace relabel blast_results.txt tree.treefile publication_tree.treefile --rank species

4. **Updating Taxonomy**
   
   Update tree labels when taxonomy databases are updated or corrected:
   
   .. code-block:: bash
   
      # Re-run BLAST with updated database, then relabel
      eplace relabel new_blast_results.txt old_tree.treefile updated_tree.treefile

Notes
~~~~~

* The BLAST results file must contain taxonomic information for the sequences
* Tree file must be in Newick format (standard phylogenetic tree format)
* Sequence IDs in the tree must match accession numbers in the BLAST results
* Labels are automatically cleaned to comply with Newick format requirements (spaces, colons, parentheses are replaced)
* If taxonomy information is missing for a sequence, it will be skipped with a warning

Output
~~~~~~

The output is a Newick format tree file with sequence IDs replaced by taxonomic names at the specified rank.

.. code-block:: text

   Input tree:
   ((NZ_CP012345:0.01,NZ_CP067890:0.02):0.03,(NC_012345:0.01,NC_067890:0.02):0.03);
   
   Output tree (--rank genus):
   ((Escherichia:0.01,Salmonella:0.02):0.03,(Bacillus:0.01,Staphylococcus:0.02):0.03);
   
   Output tree (--rank species):
   ((Escherichia_coli:0.01,Salmonella_enterica:0.02):0.03,(Bacillus_subtilis:0.01,Staphylococcus_aureus:0.02):0.03);

Workflow Comparison
-------------------

Individual Workflow (eplace search)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Best for:** Analyzing each query sequence in its own phylogenetic context

**Process:**

1. Run BLAST search for all queries
2. Extract representative sequences for each query at specified rank
3. Create one directory per query
4. Build one alignment and tree per query

**Output:** Separate phylogenetic trees for each query sequence

Grouped Workflow (eplace grouped)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Best for:** Analyzing multiple related queries together in a single phylogenetic context

**Process:**

1. Run BLAST search for all queries
2. Extract representative sequences for each query
3. Group queries by specified taxonomic rank (e.g., class, order)
4. Combine all queries in a group with unique reference sequences
5. Build one alignment and tree per group

**Output:** Phylogenetic trees with multiple queries grouped by taxonomy

Common Use Cases
----------------

Quick BLAST search without trees
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   eplace search query.fasta results --skip-alignment

Relabel existing trees
~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   # Relabel tree with different taxonomic ranks
   eplace relabel blast_results.txt input.treefile genus_tree.treefile --rank genus
   eplace relabel blast_results.txt input.treefile family_tree.treefile --rank family
   
   # Use with trees from external tools
   eplace relabel blast_results.txt raxml_tree.nwk labeled_tree.nwk --rank species

High stringency search
~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   eplace search query.fasta results \
       --min-identity 95 \
       --min-coverage 90

Multi-threaded analysis
~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   eplace search query.fasta results --num-threads 8

Group related sequences
~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   eplace grouped query.fasta results \
       --rank species \
       --group-rank genus

Troubleshooting
---------------

MMseqs2 memory limits
~~~~~~~~~~~~~~~~~~~~~

When searching against large MMseqs2 databases such as NCBI NT, the MMseqs2
prefilter step may require hundreds of GB of RAM. If insufficient memory is
available, MMseqs2 may fail with an error similar to:

.. code-block:: text

   Cannot fit databases into 22G. Please use a computer with more main memory.
   Error: Prefilter died
   Error: Search step died
   Error: Search died

The workflow exposes the option:

.. code-block:: bash

   --mmseqs-memory-limit 400G

which is passed to MMseqs2 as:

.. code-block:: bash

   --split-memory-limit 400G

For full NCBI NT, we recommend running on a high-memory node. Suggested
starting values are:

.. list-table::
   :header-rows: 1
   :widths: 50 50

   * - Node RAM
     - Suggested ``--mmseqs-memory-limit``
   * - 128 GB
     - ``90G``
   * - 256 GB
     - ``200G``
   * - 512 GB
     - ``400G``
   * - 1 TB
     - ``800G``

For 16S amplicon/ZOTU workflows, consider using a smaller curated 16S database
such as SILVA, GTDB rRNA, RDP, or NCBI 16S rather than full NT. These require
far less RAM and often produce equally good taxonomic assignments for 16S data.

Command not found
~~~~~~~~~~~~~~~~~

If ``eplace`` command is not found after installation:

.. code-block:: bash

   # Check if it's installed
   pip show eplace

   # Reinstall
   pip install --force-reinstall .

   # Or add to PATH
   export PATH="$HOME/.local/bin:$PATH"

Dependencies missing
~~~~~~~~~~~~~~~~~~~~

Some features require external tools:

**BLAST+**: Required for all workflows

.. code-block:: bash

   sudo apt-get install ncbi-blast+  # Ubuntu/Debian
   brew install blast                 # macOS

**TaxonKit**: Required for taxonomy lookups

.. code-block:: bash

   conda install -c bioconda taxonkit

**MAFFT**: Required for alignment (unless --skip-alignment)

.. code-block:: bash

   sudo apt-get install mafft  # Ubuntu/Debian
   brew install mafft          # macOS

**IQTree**: Required for tree building (unless --skip-alignment)

.. code-block:: bash

   sudo apt-get install iqtree  # Ubuntu/Debian
   brew install iqtree          # macOS

See Also
--------

* :doc:`installation` - Installation instructions
* :doc:`workflows` - Detailed workflow documentation
* :doc:`blast_workflow` - Complete workflow guide
* :doc:`ncbi_download` - Database management details