NCBI Database Download Module

This module downloads and manages NCBI BLAST databases, specifically the core nucleotide (core_nt) database.

Features

  • Automatic detection of the BLASTDB environment variable

  • Fallback to the ~/blastdb directory if BLASTDB is not set or is invalid

  • Downloads core_nt.* files from NCBI FTP server

  • Verifies MD5 checksums for data integrity

  • Automatic extraction of tar.gz files

  • Security features to prevent path traversal attacks

  • Configurable logging output

Quick Start

Basic Usage

from eplace_lib.ncbi_download import setup_ncbi_database

# Download and setup the database
success, message = setup_ncbi_database()
print(f"Success: {success}")
print(f"Message: {message}")

Check if Database Exists

from eplace_lib.ncbi_download import NCBIDownloader

downloader = NCBIDownloader()
db_dir = downloader.get_blastdb_directory()
exists = downloader.check_database_exists()

print(f"BLASTDB directory: {db_dir}")
print(f"Database exists: {exists}")

Force Redownload

from eplace_lib.ncbi_download import setup_ncbi_database

# Force redownload even if database exists
success, message = setup_ncbi_database(force_download=True)

Disable Verbose Output

from eplace_lib.ncbi_download import setup_ncbi_database

# Download without progress messages
success, message = setup_ncbi_database(verbose=False)

BLASTDB Environment Variable

The module checks for the BLASTDB environment variable:

  • If BLASTDB is set and points to a valid directory, the module uses that location

  • If BLASTDB is not set or is invalid, the module creates and uses ~/blastdb
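
The lookup order above can be sketched as follows. This is a minimal illustration, not the module's actual code: the helper name resolve_blastdb_dir and its environ/home parameters are hypothetical, and "valid" is assumed to mean "an existing directory".

```python
import os
from pathlib import Path


def resolve_blastdb_dir(environ=None, home=None):
    """Pick a BLAST database directory (hypothetical helper, for illustration)."""
    environ = os.environ if environ is None else environ
    home = Path.home() if home is None else Path(home)

    # Prefer BLASTDB when it names an existing directory (assumed meaning of "valid").
    candidate = environ.get("BLASTDB", "").strip()
    if candidate:
        candidate_path = Path(candidate).expanduser()
        if candidate_path.is_dir():
            return candidate_path

    # Otherwise fall back to ~/blastdb, creating it if needed.
    fallback = home / "blastdb"
    fallback.mkdir(parents=True, exist_ok=True)
    return fallback
```

Passing environ and home explicitly keeps the sketch easy to test without touching the real environment or home directory.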

Setting BLASTDB

# In bash
export BLASTDB=/path/to/your/blastdb

# Or in Python, before creating NCBIDownloader or calling setup_ncbi_database()
import os
os.environ['BLASTDB'] = '/path/to/your/blastdb'

Advanced Usage

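import os

from eplace_lib.ncbi_download import NCBIDownloader

downloader = NCBIDownloader()

# Get database directory
db_dir = downloader.get_blastdb_directory()

# Get list of available files from NCBI
files = downloader.get_available_files()

# Download a specific archive and its checksum file
# (assumes download_file() also accepts the matching .md5 name)
downloader.download_file('core_nt.00.tar.gz', db_dir)
downloader.download_file('core_nt.00.tar.gz.md5', db_dir)

# Local paths; assumes download_file() saves each file in db_dir under its original name
tar_path = os.path.join(db_dir, 'core_nt.00.tar.gz')
md5_path = tar_path + '.md5'

# Verify MD5 checksum
downloader.verify_md5(tar_path, md5_path)

# Extract tarball
downloader.extract_tarball(tar_path, db_dir)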
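
For readers curious what an MD5 check like verify_md5() involves, here is a rough standalone sketch using only hashlib. It is not the module's implementation: the function name md5_matches is hypothetical, and the checksum file is assumed to follow the common "<hex digest>  <filename>" layout.

```python
import hashlib
from pathlib import Path


def md5_matches(data_path, md5_path, chunk_size=1024 * 1024):
    """Return True if data_path hashes to the digest stored in md5_path (sketch)."""
    digest = hashlib.md5()
    # Read in chunks so multi-gigabyte archives are never loaded into memory at once.
    with open(data_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)

    # Assumed file layout: the first whitespace-separated token is the hex digest.
    tokens = Path(md5_path).read_text().split()
    if not tokens:
        return False
    return digest.hexdigest() == tokens[0].lower()
```

Chunked reading matters here because individual archive volumes can be far larger than available RAM.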

Security

The module includes security features to prevent path traversal attacks:

  • Filename validation in download_file()

  • Path validation before extraction in extract_tarball()

  • Safe extraction that validates all member paths
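
The checks listed above might look roughly like the sketch below. It illustrates the general technique rather than the module's actual code: the names is_safe_filename and safe_extract, the filename pattern, and the decision to reject links are all assumptions.

```python
import os
import re
import tarfile

# Assumed naming pattern: core_nt.NN.tar.gz, optionally followed by .md5.
_ALLOWED_NAME = re.compile(r"core_nt\.\d+\.tar\.gz(\.md5)?")


def is_safe_filename(name):
    """Accept only bare filenames that match the expected pattern (sketch)."""
    return os.path.basename(name) == name and _ALLOWED_NAME.fullmatch(name) is not None


def safe_extract(tar_path, dest_dir):
    """Extract a .tar.gz only if every member resolves inside dest_dir (sketch)."""
    dest_root = os.path.realpath(dest_dir)
    with tarfile.open(tar_path, "r:gz") as tar:
        members = tar.getmembers()
        for member in members:
            # Resolve the final on-disk location and require it to stay under dest_root.
            target = os.path.realpath(os.path.join(dest_root, member.name))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"Unsafe path in archive: {member.name}")
            # Links can point outside the destination, so this sketch rejects them.
            if member.issym() or member.islnk():
                raise ValueError(f"Links are not allowed: {member.name}")
        # Extract only after all members have passed validation.
        tar.extractall(dest_root, members=members)
```

Validating every member before extracting anything means a malicious archive leaves no partial output behind. On recent Python versions, tarfile.extractall() also accepts filter="data", which applies similar checks inside the standard library.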

Requirements

Uses Python standard library modules only:

  • os

  • hashlib

  • tarfile

  • logging

  • pathlib

  • urllib.request

No external dependencies required.

Examples

See examples/download_ncbi_example.py for comprehensive usage examples.

Testing

Run the test suite:

pytest tests/test_ncbi_download.py -v

Note

The NCBI core_nt database is large (hundreds of GB). Ensure you have sufficient disk space and bandwidth before downloading.
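
A quick pre-flight check for free disk space can be done with the standard library. This is an illustrative sketch: the helper name has_free_space is hypothetical, and the required size is a value you supply yourself, not a figure published by this module or by NCBI.

```python
import shutil


def has_free_space(directory, required_bytes):
    """Return True if the filesystem holding directory has enough free space (sketch)."""
    return shutil.disk_usage(directory).free >= required_bytes
```

For example, has_free_space(db_dir, 500 * 1024**3) asks whether roughly 500 GiB is free; that number is a placeholder to adjust for the current database size.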