Overview
The core primitives that form the backbone of DeepChem Server’s machine learning workflow. These primitives provide the essential functionality for molecular machine learning pipelines. Currently, Deepchem Server provides the following primitives: (Other primitives are planned to be added soon)
Featurize: Transform raw molecular data into machine learning features
Partition: Split a dataset into multiple datastore-backed partitions
Train Valid Test Split: Split the dataset into training, validation, and test sets
Train: Build and train machine learning models on featurized datasets
Inference: Run predictions on new data using trained models
Evaluation: Assess model performance using various metrics
Docking: Perform molecular docking to predict protein-ligand binding poses
These primitives are designed to work seamlessly together while also being usable independently for specific tasks.
Featurization
The featurization primitive transforms raw molecular data (like SMILES strings or SDF files) into numerical features that can be used for machine learning.
Supporting Functions
Available Featurizers
The featurization primitive supports the following DeepChem featurizers:
ecfp: Extended Connectivity Fingerprints (Circular Fingerprints) - Compatible with scikit-learn models
graphconv: Graph Convolution Featurizer - Compatible with scikit-learn models
weave: Weave Featurizer for molecular graphs - Compatible with scikit-learn models
molgraphconv: Molecular Graph Convolution Featurizer - Required for GCN model
Note
While scikit-learn models (linear_regression, random_forest_*) can work with any featurizer, the GCN model specifically requires the molgraphconv featurizer for proper graph-based processing.
Partition
The partition primitive splits a dataset into multiple smaller datasets and uploads each partition back to the datastore. Use it when you need to:
Distribute workloads — break a large dataset into chunks so featurization or inference can run in parallel.
Reduce memory pressure — process data in manageable pieces when the full dataset does not fit in memory.
Stage training pipelines — create fixed data splits before training begins.
Supported dataset types:
DeepChem
DiskDataset— supports optional shuffling before partitioning.CSV
DataFrame— partitions rows sequentially (shuffling is not supported).
After partitioning, the parent dataset’s datacard is updated with the number of partitions (n_partition) so downstream primitives can discover the splits.
Training
The training primitive builds and trains machine learning models on featurized datasets using DeepChem’s extensive model library.
Available Models
The training primitive supports the following specific model types:
Scikit-learn Models (wrapped in DeepChem SklearnModel):
linear_regression: Linear regression for continuous target variables
random_forest_classifier: Random forest for classification tasks
random_forest_regressor: Random forest for regression tasks
DeepChem Neural Network Models:
gcn: Graph Convolutional Network (requires
molgraphconvfeaturizer)
Note
The GCN model requires PyTorch to be installed and may not be available if torch dependencies are missing. Each model supports different initialization and training parameters - refer to deepchem_server.core.model_mappings for detailed parameter options.
Inference
The inference primitive runs predictions on new data using previously trained models, handling both featurized and raw input data.
Supporting Functions
Evaluation
The evaluation primitive assesses model performance using various metrics and generates evaluation reports.
Supporting Functions
Molecular Docking
The docking primitive performs molecular docking between proteins and ligands using AutoDock VINA to predict binding poses and affinities.
Key Features:
Generates protein-ligand binding poses using AutoDock VINA
Supports both PDB and PDBQT output formats
Automatically splits PDBQT files for multiple binding modes
Returns DeepChem addresses to all generated files
Supporting Functions
Available Metrics
The evaluation primitive supports the following metrics:
pearson_r2_score: Pearson correlation coefficient
jaccard_score: Jaccard similarity score
prc_auc_score: Precision-Recall AUC score
roc_auc_score: ROC AUC score
rms_score: Root Mean Square score
mae_error: Mean Absolute Error
bedroc_score: BEDROC score
accuracy_score: Classification accuracy
balanced_accuracy_score: Balanced classification accuracy
Workflow Integration
These primitives are designed to work together in typical machine learning workflows:
Data Preparation: Upload raw data to the datastore
Partition (Optional): Use
partition()to split large datasets for parallel or staged workflowsFeaturization: Use
featurize()to transform molecular data into featuresTraining: Use
train()to build models on the featurized dataInference: Use
infer()to make predictions on new dataEvaluation: Use
model_evaluator()to assess model performanceDocking: Use
generate_pose()to predict protein-ligand binding interactions
Example Workflow
Here’s a typical workflow using all five primitives:
from deepchem_server.core import feat, train, inference, evaluator, docking
from deepchem_server.core import config
from deepchem_server.core.datastore import DiskDataStore
import tempfile
# Setup datastore
datastore = DiskDataStore('profile', 'project', tempfile.mkdtemp())
config.set_datastore(datastore)
# 1. Featurize raw data
dataset_address = feat.featurize(
dataset_address="raw_data_address",
featurizer="ecfp",
output="featurized_dataset",
dataset_column="smiles",
label_column="target"
)
# 2. Train a model
model_address = train.train(
model_type="random_forest_classifier",
dataset_address=dataset_address,
model_name="my_classification_model"
)
# 3. Run inference
predictions_address = inference.infer(
model_address=model_address,
data_address="new_data_address",
output="predictions.csv",
dataset_column="smiles"
)
# 4. Evaluate the model
evaluator.model_evaluator(
dataset_addresses=[dataset_address],
model_address=model_address,
metrics=["roc_auc_score", "accuracy_score"],
output_key="evaluation_results"
)
# 5. Perform molecular docking
docking_address = docking.generate_pose(
protein_address="protein_address",
ligand_address="ligand_address",
output="docking_results"
)