Overview
This page documents the core primitives that form the backbone of DeepChem Server’s machine learning workflow. They provide the essential functionality for molecular machine learning pipelines. DeepChem Server currently provides the following primitives (more are planned):
Featurize: Transform raw molecular data into machine learning features
Train Valid Test Split: Split the dataset into training, validation, and test sets
Train: Build and train machine learning models on featurized datasets
Inference: Run predictions on new data using trained models
Evaluation: Assess model performance using various metrics
Docking: Perform molecular docking to predict protein-ligand binding poses
These primitives are designed to work seamlessly together while also being usable independently for specific tasks.
Featurization
The featurization primitive transforms raw molecular data (like SMILES strings or SDF files) into numerical features that can be used for machine learning.
- deepchem_server.core.feat.featurize(dataset_address: str, featurizer: str, output: str, dataset_column: str, feat_kwargs: Dict = {}, label_column: str | None = None, n_core: int | None = None, single_core_threshold: int | None = 250) -> str | None
Featurize the dataset at given address with specified featurizer.
Writes output to datastore. If the compute node has more than 1 CPU core then the featurization is done by splitting the dataset into parts of equal size and featurizing each part in parallel. The featurized parts are then merged into a single dataset and written to the datastore. The number of parts is equal to the number of cores available on the machine. If the compute node has only 1 CPU core then the featurization will be done in a single process.
Restart support: While running, the featurize primitive saves checkpoints in a (output).partial folder. To resume a failed run, rerun featurize with the same arguments; the checkpoints are restored from the (output).partial folder, and the folder is deleted once featurization completes.
Note: A restart fails if n_core is smaller than the n_core used in the original run. The checkpoints must also belong to the same dataset address as the initial run; otherwise they are ignored on restart.
- Parameters:
dataset_address (str) – The deepchem address of the dataset to featurize.
featurizer (str) – Has to be a featurizer string in mappings.
output (str) – The name of output featurized dataset in your workspace.
dataset_column (str) – Column containing the input for featurizer.
feat_kwargs (dict, optional) – Keyword arguments to pass to featurizer on initialization, by default {}.
label_column (str, optional) – The target column in case this dataset is going to be used for training purposes.
n_core (int, optional) – The number of cores to use for featurization.
single_core_threshold (int, optional) – The dataset size in megabytes above which multicore featurization is used, by default 250.
- Returns:
Deepchem address of the featurized dataset.
- Return type:
str | None
- Raises:
ValueError – If featurizer is not recognized, if input column is not specified for CSV files, or if datastore is not set.
NotImplementedError – If the dataset format is not supported for featurization.
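The dispatch and restart rules described for featurize can be sketched in plain Python. This is a minimal illustration, not the server's implementation: `plan_featurization` and `previous_n_core` are hypothetical names, and two behaviours are assumptions that this page does not state, namely that `n_core=None` falls back to `os.cpu_count()` and that `single_core_threshold=None` disables the size check.

```python
import os
from typing import Optional


def plan_featurization(dataset_size_mb: float,
                       n_core: Optional[int] = None,
                       single_core_threshold: Optional[int] = 250,
                       previous_n_core: Optional[int] = None) -> int:
    """Return the number of partitions; 1 means single-process featurization."""
    # Assumption: when n_core is not given, use every core on the machine.
    cores = n_core if n_core is not None else (os.cpu_count() or 1)
    # A restart with fewer cores than the interrupted run is rejected.
    if previous_n_core is not None and cores < previous_n_core:
        raise ValueError("restart needs at least as many cores as the original run")
    # A single-core node always featurizes in one process.
    if cores == 1:
        return 1
    # Datasets at or below the threshold (in megabytes) stay single-process.
    if single_core_threshold is not None and dataset_size_mb <= single_core_threshold:
        return 1
    # Otherwise use one partition per core.
    return cores
```

With these assumptions, a 500 MB dataset on a 4-core node is split into 4 parts, while a 100 MB dataset on the same node is featurized in a single process.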
Supporting Functions
- deepchem_server.core.feat.split_dataset(dataset_path: str, file_type: str, n_partition: int, available_checkpoints: List[int]) -> List[str]
Split the dataset into n partitions.
- Parameters:
- Returns:
The list of file paths of the partitioned datasets.
- Return type:
List[str]
- Raises:
NotImplementedError – If the file type is not supported for featurization.
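How a dataset might be divided into near-equal partitions while skipping parts that already have a checkpoint can be sketched as follows. The helper `partition_ranges` is hypothetical: the real `split_dataset` returns file paths rather than row ranges, and treating checkpoint ids as zero-based partition indices is an assumption.

```python
from typing import Dict, List, Tuple


def partition_ranges(n_rows: int, n_partition: int,
                     available_checkpoints: List[int]) -> Dict[int, Tuple[int, int]]:
    """Map each pending partition id to its half-open row range [start, stop)."""
    if n_partition < 1:
        raise ValueError("n_partition must be at least 1")
    base, extra = divmod(n_rows, n_partition)
    done = set(available_checkpoints)
    ranges: Dict[int, Tuple[int, int]] = {}
    start = 0
    for part_id in range(n_partition):
        # The first `extra` partitions take one additional row each.
        stop = start + base + (1 if part_id < extra else 0)
        # Partitions with a completed checkpoint are not scheduled again.
        if part_id not in done:
            ranges[part_id] = (start, stop)
        start = stop
    return ranges
```

For 10 rows and 3 partitions this yields row ranges of sizes 4, 3, and 3; passing `available_checkpoints=[1]` drops the middle range from the result.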
- deepchem_server.core.feat.featurize_part(main_dataset_address: str, dataset_path: str, file_type: str, featurizer: deepchem.feat.Featurizer, dataset_column: str, label_column: str | None, checkpoint_output_key: str, nproc: int) -> None
Featurize a part of the dataset.
- Parameters:
main_dataset_address (str) – Address of the main dataset being featurized.
dataset_path (str) – The path to the dataset partition to featurize.
file_type (str) – The type of the dataset (e.g., ‘csv’, ‘sdf’).
featurizer (dc.feat.Featurizer) – The featurizer to use.
dataset_column (str) – The column containing the input for featurizer.
label_column (str, optional) – The target column in case this dataset is going to be used for training purposes.
checkpoint_output_key (str) – The output key for checkpoint ‘.partial’ folder.
nproc (int) – The total number of partitions being processed.
- Return type:
None
- Raises:
ValueError – If datastore is not set.
NotImplementedError – If the file type is not supported for featurization.
- deepchem_server.core.feat.featurize_multi_core(main_dataset_address: str, raw_dataset_path: str, file_type: str, feat: deepchem.feat.Featurizer, dataset_column: str, label_column: str | None, basedir: str, nproc: int, checkpoint_output_key: str, available_checkpoints: List[int]) -> Iterable[List | str]
Featurize the dataset in parallel.
- Parameters:
main_dataset_address (str) – Address of the main dataset being featurized.
raw_dataset_path (str) – The path to the raw dataset.
file_type (str) – The type of the dataset (e.g., ‘csv’, ‘sdf’).
feat (dc.feat.Featurizer) – The featurizer to use.
dataset_column (str) – The column containing the input for featurizer.
label_column (str, optional) – The target column in case this dataset is going to be used for training purposes.
basedir (str) – The base directory where the dataset is stored.
nproc (int) – The number of partitions to split the dataset into.
checkpoint_output_key (str) – The output key for checkpoint ‘.partial’ folder.
available_checkpoints (list of int) – The list of checkpoint ids already completed in the previous run (if any).
- Returns:
A list containing [datasets, merge_dir] where datasets is a list of DiskDataset objects and merge_dir is the directory path for merging.
- Return type:
Iterable[List | str]
Available Featurizers
The featurization primitive supports the following DeepChem featurizers:
ecfp: Extended Connectivity Fingerprints (Circular Fingerprints) - Compatible with scikit-learn models
graphconv: Graph Convolution Featurizer - Compatible with scikit-learn models
weave: Weave Featurizer for molecular graphs - Compatible with scikit-learn models
molgraphconv: Molecular Graph Convolution Featurizer - Required for GCN model
Note
While scikit-learn models (linear_regression, random_forest_*) can work with any featurizer, the GCN model specifically requires the molgraphconv featurizer for proper graph-based processing.
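The compatibility rule in the note above can be written as a small pre-flight check. This is only a sketch: `is_compatible` is a hypothetical helper that is not part of deepchem_server, and the model and featurizer names are simply the strings listed on this page.

```python
# Names as listed on this page; the helper itself is hypothetical.
FEATURIZERS = {"ecfp", "graphconv", "weave", "molgraphconv"}
SKLEARN_MODELS = {"linear_regression", "random_forest_classifier",
                  "random_forest_regressor"}


def is_compatible(model_type: str, featurizer: str) -> bool:
    """Return True if this page describes the pair as usable together."""
    if featurizer not in FEATURIZERS:
        raise ValueError(f"unknown featurizer: {featurizer}")
    if model_type == "gcn":
        # The GCN model requires the molgraphconv featurizer.
        return featurizer == "molgraphconv"
    if model_type in SKLEARN_MODELS:
        # Scikit-learn models are described as working with any featurizer.
        return True
    raise ValueError(f"unknown model type: {model_type}")
```

Running such a check before calling featurize and train avoids spending time on featurization that the chosen model cannot consume.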
Training
The training primitive builds and trains machine learning models on featurized datasets using DeepChem’s extensive model library.
- deepchem_server.core.train.train(model_type: str, dataset_address: str, model_name: str, init_kwargs: Dict | None = None, train_kwargs: Dict | None = None) -> str
Trains a model on the specified dataset and writes output to datastore.
- Parameters:
model_type (str) – A model string recognized by deepchem_server.core.model_mappings
dataset_address (str) – The Deepchem server datastore address of the training dataset. The dataset in the address should be a DeepChem dataset.
model_name (str) – The name under which the output trained model will be stored in the workspace.
init_kwargs (Optional[Dict]) – Keyword arguments to pass to model on initialization.
train_kwargs (Optional[Dict]) – Keyword arguments to pass to model on training.
Examples
>>> from deepchem_server.core.cards import DataCard
>>> from deepchem_server.core import config, featurize
>>> from deepchem_server.core.train import train
>>> from deepchem_server.core.datastore import DiskDataStore
>>> import tempfile
>>> import pandas as pd
>>> disk_datastore = DiskDataStore('profile', 'project', tempfile.mkdtemp())
>>> config.set_datastore(disk_datastore)
>>> df = pd.DataFrame([["CCC", 0], ["CCCCC", 1]], columns=["smiles", "label"])
>>> card = DataCard(address='', file_type='csv', data_type='pandas.DataFrame')
>>> data_address = disk_datastore.upload_data_from_memory(df, "test.csv", card)
>>> dataset_address = featurize(data_address,
...                             featurizer="ecfp",
...                             output="feat_test",
...                             dataset_column="smiles",
...                             label_column="label")
>>> train(model_type="random_forest_regressor",
...       dataset_address=dataset_address,
...       model_name="random_forest_model")
'deepchem://profile/project/random_forest_model'
Available Models
The training primitive supports the following specific model types:
Scikit-learn Models (wrapped in DeepChem SklearnModel):
linear_regression: Linear regression for continuous target variables
random_forest_classifier: Random forest for classification tasks
random_forest_regressor: Random forest for regression tasks
DeepChem Neural Network Models:
gcn: Graph Convolutional Network (requires the molgraphconv featurizer)
Note
The GCN model requires PyTorch to be installed and may not be available if torch dependencies are missing. Each model supports different initialization and training parameters - refer to deepchem_server.core.model_mappings for detailed parameter options.
Inference
The inference primitive runs predictions on new data using previously trained models, handling both featurized and raw input data.
- deepchem_server.core.inference.infer(model_address: str, data_address: str, output: str, dataset_column: str | None = None, shard_size: int | None = 8192, threshold: int | float | None = None)
Runs inference for the specified model against specified dataset and featurization.
- Parameters:
model_address (str) – deepchem_server address of model to run inference for
data_address (str) – deepchem_server address of raw data to run inference on
output (str) – The output file to write results to.
dataset_column (str) – The column in the raw dataset to featurize.
shard_size (Optional[int]) – The shard size for the featurize and inference operation.
threshold (Optional[Union[int, float]]) – Threshold for binarizing the predictions.
Example
>>> import os
>>> import tempfile
>>> import pandas as pd
>>> from deepchem_server.core import config
>>> from deepchem_server.core.feat import featurize
>>> from deepchem_server.core.cards import DataCard
>>> from deepchem_server.core.train import train
>>> from deepchem_server.core.inference import infer
>>> from deepchem_server.core.datastore import DiskDataStore
>>> disk_datastore = DiskDataStore('profile', 'project', tempfile.mkdtemp())
>>> config.set_datastore(disk_datastore)
>>> df = pd.DataFrame([["CCC", 0], ["CCCCC", 1]], columns=["smiles", "label"])
>>> card = DataCard(address='', file_type='csv', data_type='pandas.DataFrame')
>>> data_address = disk_datastore.upload_data_from_memory(df, "test.csv", card)
>>> feat_address = featurize(data_address,
...                          featurizer='ecfp',
...                          output='featurized_data',
...                          dataset_column='smiles',
...                          label_column='label')
>>> model_address = train(model_type='linear_regression',
...                       dataset_address=feat_address,
...                       model_name='ecfp_reg')
>>> infer_address = infer(model_address, feat_address, output='infer.csv')
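The threshold parameter binarizes predictions. One plausible reading is sketched below; the helper `binarize` is hypothetical, and mapping values at or above the threshold to 1 (rather than strictly above) is an assumption that this page does not confirm.

```python
from typing import List, Optional, Union


def binarize(predictions: List[float],
             threshold: Optional[Union[int, float]] = None) -> List[float]:
    """Turn raw scores into 0/1 labels, or pass them through unchanged."""
    # With no threshold, predictions are returned as they are.
    if threshold is None:
        return list(predictions)
    # Assumption: a score equal to the threshold counts as the positive class.
    return [1 if p >= threshold else 0 for p in predictions]
```

For instance, scores 0.2, 0.5, and 0.9 with a threshold of 0.5 become 0, 1, and 1 under this convention.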
Supporting Functions
- deepchem_server.core.inference._infer_with_featurize(model_address: str, data_address: str, dataset_column: str, shard_size: int | None = 8192) -> Callable[[], Iterator[Sequence[numpy.ndarray]]]
Takes in a CSV file and returns a callable iterator that featurizes the data with the featurizer used for the training dataset and yields predictions.
- Parameters:
model_address (str) – deepchem_server address of model to run inference for
data_address (str) – deepchem_server address of raw data to run inference on
dataset_column (str) – The column in the raw dataset to featurize.
shard_size (Optional[int]) – The shard size for the featurize and inference operation.
- Returns:
iterator – iterator function that yields raw inputs and predictions
- Return type:
Callable[[], Iterator[Sequence[np.ndarray]]]
- deepchem_server.core.inference._infer_without_featurize(model_address: str, data_address: str, shard_size: int | None = 8192) -> Callable[[], Iterator[Sequence[numpy.ndarray]]]
Takes in a CSV file and returns a callable iterator that yields predictions on featurized data.
- Parameters:
- Returns:
iterator – iterator function that yields raw inputs and predictions
- Return type:
Callable[[], Iterator[Sequence[np.ndarray]]]
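Both helpers return a callable that produces an iterator rather than the iterator itself, so the same prediction stream can be restarted. A minimal sketch of that pattern with sharding is shown below; `make_shard_predictor` and the `predict` callback are hypothetical stand-ins and do not reflect the server's internals.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def make_shard_predictor(
        rows: Sequence[str],
        predict: Callable[[List[str]], List[float]],
        shard_size: int = 8192,
) -> Callable[[], Iterator[Tuple[List[str], List[float]]]]:
    """Return a zero-argument callable that yields (inputs, predictions) per shard."""

    def iterator() -> Iterator[Tuple[List[str], List[float]]]:
        # Walk the inputs in fixed-size shards so memory use stays bounded.
        for start in range(0, len(rows), shard_size):
            shard = list(rows[start:start + shard_size])
            yield shard, predict(shard)

    return iterator
```

Each call to the returned function starts a fresh pass over the data, which a bare generator object would not allow once it has been consumed.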
Evaluation
The evaluation primitive assesses model performance using various metrics and generates evaluation reports.
- deepchem_server.core.evaluator.model_evaluator(dataset_addresses: List[str], model_address: str, metrics: List[str], output_key: str, is_metric_plots: bool = False)
Evaluate models using featurized datasets
- Parameters:
Supporting Functions
- deepchem_server.core.evaluator.prc_auc_curve(y_true: numpy.ndarray, y_preds: numpy.ndarray) -> pandas.DataFrame
Generate precision recall dataframe
- Parameters:
y_true (np.ndarray) – true values from the dataset
y_preds (np.ndarray) – model predictions based on the dataset
- Returns:
df – precision recall dataframe
- Return type:
pd.DataFrame
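To show what a precision-recall table contains, here is a dependency-free sketch that computes one row per distinct score threshold. It is not the server's `prc_auc_curve`, which returns a pandas DataFrame; the function name, the column names, and the "score at or above threshold is positive" convention are all assumptions.

```python
from typing import Dict, List, Sequence


def precision_recall_rows(y_true: Sequence[int],
                          y_scores: Sequence[float]) -> List[Dict[str, float]]:
    """Compute precision and recall at every distinct score threshold."""
    total_pos = sum(1 for t in y_true if t == 1)
    rows: List[Dict[str, float]] = []
    for thr in sorted(set(y_scores)):
        # A sample is predicted positive when its score reaches the threshold.
        predicted = [s >= thr for s in y_scores]
        tp = sum(1 for t, p in zip(y_true, predicted) if p and t == 1)
        fp = sum(1 for t, p in zip(y_true, predicted) if p and t != 1)
        # tp + fp is never zero here: the sample scoring exactly thr is positive.
        precision = tp / (tp + fp)
        recall = tp / total_pos if total_pos else 0.0
        rows.append({"threshold": thr, "precision": precision, "recall": recall})
    return rows
```

Raising the threshold generally trades recall for precision, which is the relationship that the precision-recall AUC summarizes.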
Molecular Docking
The docking primitive performs molecular docking between proteins and ligands using AutoDock VINA to predict binding poses and affinities.
Key Features:
Generates protein-ligand binding poses using AutoDock VINA
Supports both PDB and PDBQT output formats
Automatically splits PDBQT files for multiple binding modes
Returns DeepChem addresses to all generated files
- deepchem_server.core.docking.generate_pose(protein_address: str, ligand_address: str, output: str, exhaustiveness: int = 10, num_modes: int = 9, save_pdbqt: bool = False) -> str
Generate VINA molecular docking poses.
Performs molecular docking between a protein and ligand using AutoDock VINA to predict binding poses and affinities. Returns DeepChem addresses to all generated files including PDB complexes, optional PDBQT files, and scores.
- Parameters:
protein_address (str) – DeepChem address of the protein PDB file
ligand_address (str) – DeepChem address of the ligand file (PDB or SDF format)
output (str) – Output name for the docking results (used as prefix for all files)
exhaustiveness (int, default=10) – VINA exhaustiveness parameter (higher = more thorough search)
num_modes (int, default=9) – Number of binding modes to generate (1-20 recommended)
save_pdbqt (bool, default=False) – Whether to save PDBQT files in addition to PDB complexes
- Returns:
DeepChem address to results JSON file containing:
- complex_addresses: Dict mapping mode names to PDB complex addresses
- scores_address: DeepChem address to scores JSON file
- pdbqt_addresses: Dict mapping mode names to PDBQT addresses (if save_pdbqt=True)
- docking_method, exhaustiveness, message
- Return type:
str
- Raises:
ImportError – If RDKit or AutoDock VINA are not installed
ValueError – If protein_address or ligand_address is empty, or if no valid docking results are generated.
Examples
Basic docking with default parameters:
>>> import json
>>> result_address = generate_pose(
...     protein_address="deepchem://user/protein.pdb",
...     ligand_address="deepchem://user/ligand.sdf",
...     output="docking_results"
... )
>>> results = json.loads(datastore.get(result_address))
>>> print(f"Generated {len(results['complex_addresses'])} binding modes")
Docking with PDBQT files and custom parameters:
>>> result_address = generate_pose(
...     protein_address="deepchem://user/protein.pdb",
...     ligand_address="deepchem://user/ligand.pdb",
...     output="thorough_docking",
...     exhaustiveness=20,
...     num_modes=5,
...     save_pdbqt=True
... )
>>> results = json.loads(datastore.get(result_address))
>>> scores = json.loads(datastore.get(results['scores_address']))
>>> print(f"Best binding affinity: {scores['mode 1']['affinity (kcal/mol)']} kcal/mol")
Notes
PDB complexes are always generated (one per binding mode)
PDBQT files are only generated when save_pdbqt=True
For multiple modes, PDBQT files are automatically split per mode
Scores are stored in a separate JSON file for easy access
All files are uploaded to the configured datastore
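Once the scores JSON has been loaded, picking the strongest predicted binder is a small dictionary lookup. The sketch below assumes the layout used in the example above (mode names mapping to a dict with an 'affinity (kcal/mol)' entry); `best_mode` is a hypothetical helper.

```python
from typing import Dict, Tuple

AFFINITY_KEY = "affinity (kcal/mol)"


def best_mode(scores: Dict[str, Dict[str, float]]) -> Tuple[str, float]:
    """Return the mode with the lowest (most favourable) affinity."""
    if not scores:
        raise ValueError("scores is empty")
    # More negative affinities indicate stronger predicted binding.
    name = min(scores, key=lambda mode: scores[mode][AFFINITY_KEY])
    return name, scores[name][AFFINITY_KEY]
```

Selecting by value rather than by the 'mode 1' key keeps the lookup correct even if the modes are not stored in ranked order.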
Available Metrics
The evaluation primitive supports the following metrics:
pearson_r2_score: Squared Pearson correlation coefficient
jaccard_score: Jaccard similarity score
prc_auc_score: Precision-Recall AUC score
roc_auc_score: ROC AUC score
rms_score: Root mean squared error (RMSE)
mae_error: Mean Absolute Error
bedroc_score: BEDROC score
accuracy_score: Classification accuracy
balanced_accuracy_score: Balanced classification accuracy
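For orientation, three of the regression metrics can be restated in plain Python. These are reference definitions for illustration only, not the code the evaluator runs; the short function names are hypothetical.

```python
import math
from typing import Sequence


def rms(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Square of the Pearson correlation coefficient."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    # Undefined (division by zero) when either input is constant.
    return cov ** 2 / (var_t * var_p)
```

Note that the squared Pearson correlation measures linear association only: predictions that are exactly twice the true values still score 1.0 while having a large error.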
Workflow Integration
These primitives are designed to work together in typical machine learning workflows:
Data Preparation: Upload raw data to the datastore
Featurization: Use featurize() to transform molecular data into features
Training: Use train() to build models on the featurized data
Inference: Use infer() to make predictions on new data
Evaluation: Use model_evaluator() to assess model performance
Docking: Use generate_pose() to predict protein-ligand binding interactions
Example Workflow
Here’s a typical workflow that chains five of the primitives (featurization, training, inference, evaluation, and docking):
import tempfile

from deepchem_server.core import feat, train, inference, evaluator, docking
from deepchem_server.core import config
from deepchem_server.core.datastore import DiskDataStore

# Setup datastore
datastore = DiskDataStore('profile', 'project', tempfile.mkdtemp())
config.set_datastore(datastore)

# 1. Featurize raw data
dataset_address = feat.featurize(
    dataset_address="raw_data_address",
    featurizer="ecfp",
    output="featurized_dataset",
    dataset_column="smiles",
    label_column="target"
)

# 2. Train a model
model_address = train.train(
    model_type="random_forest_classifier",
    dataset_address=dataset_address,
    model_name="my_classification_model"
)

# 3. Run inference
predictions_address = inference.infer(
    model_address=model_address,
    data_address="new_data_address",
    output="predictions.csv",
    dataset_column="smiles"
)

# 4. Evaluate the model
evaluator.model_evaluator(
    dataset_addresses=[dataset_address],
    model_address=model_address,
    metrics=["roc_auc_score", "accuracy_score"],
    output_key="evaluation_results"
)

# 5. Perform molecular docking
docking_address = docking.generate_pose(
    protein_address="protein_address",
    ligand_address="ligand_address",
    output="docking_results"
)