NCBI Database Download Module
This module provides functionality for downloading and managing NCBI BLAST databases, specifically the core nucleotide (core_nt) database.
Features
Automatic detection of BLASTDB environment variable
Fallback to ~/blastdb directory if BLASTDB is not set
Downloads core_nt.* files from NCBI FTP server
Verifies MD5 checksums for data integrity
Automatic extraction of tar.gz files
Security features to prevent path traversal attacks
Configurable logging output
Quick Start
Basic Usage
from eplace_lib.ncbi_download import setup_ncbi_database
# Download and setup the database
success, message = setup_ncbi_database()
print(f"Success: {success}")
print(f"Message: {message}")
Check if Database Exists
from eplace_lib.ncbi_download import NCBIDownloader
downloader = NCBIDownloader()
db_dir = downloader.get_blastdb_directory()
exists = downloader.check_database_exists()
print(f"BLASTDB directory: {db_dir}")
print(f"Database exists: {exists}")
Force Redownload
from eplace_lib.ncbi_download import setup_ncbi_database
# Force redownload even if database exists
success, message = setup_ncbi_database(force_download=True)
Disable Verbose Output
from eplace_lib.ncbi_download import setup_ncbi_database
# Download without progress messages
success, message = setup_ncbi_database(verbose=False)
BLASTDB Environment Variable
The module checks for the BLASTDB environment variable:
If set and points to a valid directory, uses that location
If not set or invalid, creates and uses ~/blastdb
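The lookup rule above can be sketched as follows. This is only an illustration of the described behaviour, not the module's actual code: `resolve_blastdb_dir` and its `environ` and `home` parameters are hypothetical names introduced here so the logic can be tried without touching the real environment.

```python
import os
from pathlib import Path


def resolve_blastdb_dir(environ=None, home=None):
    """Return the directory to use for BLAST databases.

    Hypothetical helper that mirrors the rule described above;
    it is not part of eplace_lib.
    """
    environ = os.environ if environ is None else environ
    home = Path.home() if home is None else Path(home)

    candidate = environ.get("BLASTDB", "").strip()
    if candidate and Path(candidate).is_dir():
        # BLASTDB is set and points to an existing directory
        return Path(candidate)

    # BLASTDB is unset or invalid: create and use <home>/blastdb
    fallback = home / "blastdb"
    fallback.mkdir(parents=True, exist_ok=True)
    return fallback
```

Calling `resolve_blastdb_dir()` with no arguments reads the real environment and home directory; passing a plain dictionary keeps experiments isolated.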
Setting BLASTDB
# In bash
export BLASTDB=/path/to/your/blastdb
# Or in Python
import os
os.environ['BLASTDB'] = '/path/to/your/blastdb'
Advanced Usage
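import os
from eplace_lib.ncbi_download import NCBIDownloader
downloader = NCBIDownloader()
# Get database directory
db_dir = downloader.get_blastdb_directory()
# Get list of available files from NCBI
files = downloader.get_available_files()
# Download a specific file
downloader.download_file('core_nt.00.tar.gz', db_dir)
# Local paths for the steps below (assumes files are saved in db_dir under
# their NCBI names, with the checksum file alongside the tarball)
tar_path = os.path.join(db_dir, 'core_nt.00.tar.gz')
md5_path = tar_path + '.md5'
# Verify MD5 checksum
downloader.verify_md5(tar_path, md5_path)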
# Extract tarball
downloader.extract_tarball(tar_path, db_dir)
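For readers curious what an MD5 check of this kind typically involves, here is a minimal sketch. It is not the implementation of `verify_md5()`; the helper name `md5_matches` is hypothetical, and the checksum file is assumed to follow the common `md5sum` layout (hex digest first, then the file name).

```python
import hashlib
from pathlib import Path


def md5_matches(data_path, md5_path, chunk_size=1024 * 1024):
    """Compare a file's MD5 digest with the one recorded in md5_path."""
    digest = hashlib.md5()
    with open(data_path, "rb") as handle:
        # Read in chunks so large tarballs are never loaded into memory at once
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)

    # Assumption: the first whitespace-separated field is the hex digest
    fields = Path(md5_path).read_text().split()
    return bool(fields) and digest.hexdigest() == fields[0].lower()
```

Reading in fixed-size chunks matters here because individual database volumes can be several gigabytes.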
Security
The module includes security features to prevent path traversal attacks:
Filename validation in download_file()
Path validation before extraction in extract_tarball()
Safe extraction that validates all member paths
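The checks listed above might look roughly like the sketch below. The function names `is_safe_filename` and `safe_extract` are hypothetical and the code is an assumption about how such validation is commonly written, not a copy of what `download_file()` or `extract_tarball()` do internally.

```python
import os
import tarfile


def is_safe_filename(name):
    """Accept only bare file names: no directory parts, no '.' or '..'."""
    return (
        bool(name)
        and name == os.path.basename(name)
        and "\\" not in name
        and name not in (".", "..")
    )


def safe_extract(tar_path, dest_dir):
    """Extract tar_path into dest_dir, refusing members that would escape it."""
    dest_root = os.path.realpath(dest_dir)
    with tarfile.open(tar_path, "r:gz") as archive:
        members = archive.getmembers()
        # Validate every member before anything is written to disk
        for member in members:
            target = os.path.realpath(os.path.join(dest_root, member.name))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"Unsafe path in archive: {member.name}")
            if member.issym() or member.islnk():
                raise ValueError(f"Links are not allowed: {member.name}")
        archive.extractall(dest_root, members=members)
```

Rejecting links outright is a conservative choice made for this sketch. On Python 3.12 and later, passing `filter="data"` to `extractall()` provides comparable built-in protection.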
Requirements
Uses Python standard library modules only:
os, hashlib, tarfile, logging, pathlib, urllib.request
No external dependencies required.
Examples
See examples/download_ncbi_example.py for comprehensive usage examples.
Testing
Run the test suite:
pytest tests/test_ncbi_download.py -v
Note
The NCBI core_nt database is large (hundreds of GB). Ensure you have sufficient disk space and bandwidth before downloading.
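One way to check disk space before starting is sketched below using the standard library. The helper `has_free_space` is a hypothetical name, and the required size is left as a parameter because the actual database size changes over time and is not stated precisely here.

```python
import shutil


def has_free_space(directory, required_bytes):
    """Return True if the filesystem holding directory has enough free space.

    The directory must already exist, otherwise shutil.disk_usage raises.
    """
    return shutil.disk_usage(directory).free >= required_bytes
```

For example, `has_free_space(db_dir, 500 * 1024**3)` would test for 500 GiB free, where 500 GiB is only a placeholder figure chosen for illustration.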