Overview

The core primitives described here form the backbone of DeepChem Server's machine learning workflow, providing the essential functionality for molecular machine learning pipelines. DeepChem Server currently provides the following primitives, with more planned:

  • Featurize: Transform raw molecular data into machine learning features

  • Partition: Split a dataset into multiple datastore-backed partitions

  • Train Valid Test Split: Split the dataset into training, validation, and test sets

  • Train: Build and train machine learning models on featurized datasets

  • Inference: Run predictions on new data using trained models

  • Evaluation: Assess model performance using various metrics

  • Docking: Perform molecular docking to predict protein-ligand binding poses

  • DEL Denoise: Score DEL screening data to identify strong binders

These primitives are designed to work seamlessly together while also being usable independently for specific tasks.

Featurization

The featurization primitive transforms raw molecular data (like SMILES strings or SDF files) into numerical features that can be used for machine learning.

Supporting Functions

Available Featurizers

The featurization primitive supports the following DeepChem featurizers:

  • ecfp: Extended Connectivity Fingerprints (Circular Fingerprints) - Compatible with scikit-learn models

  • graphconv: Graph Convolution Featurizer - Compatible with scikit-learn models

  • weave: Weave Featurizer for molecular graphs - Compatible with scikit-learn models

  • molgraphconv: Molecular Graph Convolution Featurizer - Required for the GCN model

Note

While scikit-learn models (linear_regression, random_forest_*) can work with any featurizer, the GCN model specifically requires the molgraphconv featurizer for proper graph-based processing.
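
The compatibility rule above can be expressed as a small validation helper. This is a hypothetical sketch for illustration, not part of the deepchem_server API:

```python
# Hypothetical sketch of the featurizer/model compatibility rule described
# in the Note above -- not part of the deepchem_server API.

SKLEARN_MODELS = {"linear_regression", "random_forest_classifier", "random_forest_regressor"}
FEATURIZERS = {"ecfp", "graphconv", "weave", "molgraphconv"}


def check_compatibility(model_type: str, featurizer: str) -> None:
    """Raise ValueError when a model/featurizer pairing is unsupported."""
    if featurizer not in FEATURIZERS:
        raise ValueError(f"unknown featurizer: {featurizer!r}")
    if model_type == "gcn" and featurizer != "molgraphconv":
        # The GCN model requires graph inputs produced by molgraphconv.
        raise ValueError("gcn requires the molgraphconv featurizer")
    # scikit-learn models accept any of the supported featurizers.
```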

Partition

The partition primitive splits a dataset into multiple smaller datasets and uploads each partition back to the datastore. Use it when you need to:

  • Distribute workloads — break a large dataset into chunks so featurization or inference can run in parallel.

  • Reduce memory pressure — process data in manageable pieces when the full dataset does not fit in memory.

  • Stage training pipelines — create fixed data splits before training begins.

Supported dataset types:

  • DeepChem DiskDataset — supports optional shuffling before partitioning.

  • CSV DataFrame — partitions rows sequentially (shuffling is not supported).

After partitioning, the parent dataset’s datacard is updated with the number of partitions (n_partition) so downstream primitives can discover the splits.
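
The sequential row-chunking idea behind CSV partitioning can be sketched with plain pandas. This illustrates only the splitting step, not the datastore upload or datacard update; the function name is hypothetical:

```python
import pandas as pd


def split_rows(df: pd.DataFrame, n_partitions: int) -> list:
    """Split a DataFrame into sequential, roughly equal row chunks."""
    base, extra = divmod(len(df), n_partitions)
    parts, start = [], 0
    for i in range(n_partitions):
        size = base + (1 if i < extra else 0)  # early chunks absorb the remainder
        parts.append(df.iloc[start:start + size].reset_index(drop=True))
        start += size
    return parts


df = pd.DataFrame({"smiles": ["C" * (i + 1) for i in range(10)]})
parts = split_rows(df, 3)  # 10 rows -> sequential chunks of 4, 3, and 3 rows
```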

Training

The training primitive builds and trains machine learning models on featurized datasets using DeepChem’s extensive model library.

Available Models

The training primitive supports the following specific model types:

Scikit-learn Models (wrapped in DeepChem SklearnModel):

  • linear_regression: Linear regression for continuous target variables

  • random_forest_classifier: Random forest for classification tasks

  • random_forest_regressor: Random forest for regression tasks

DeepChem Neural Network Models:

  • gcn: Graph Convolutional Network (requires molgraphconv featurizer)

Note

The GCN model requires PyTorch to be installed and may not be available if torch dependencies are missing. Each model supports different initialization and training parameters; refer to deepchem_server.core.model_mappings for detailed parameter options.
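
A name-to-class registry in the spirit of model_mappings might look like the sketch below. The registry shown is illustrative (covering only the scikit-learn models; the module paths are the standard scikit-learn locations), not the server's actual mapping:

```python
from importlib import import_module

# Illustrative registry in the spirit of deepchem_server.core.model_mappings;
# covers only the scikit-learn-backed model types listed above.
MODEL_REGISTRY = {
    "linear_regression": ("sklearn.linear_model", "LinearRegression"),
    "random_forest_classifier": ("sklearn.ensemble", "RandomForestClassifier"),
    "random_forest_regressor": ("sklearn.ensemble", "RandomForestRegressor"),
}


def resolve_model(model_type: str):
    """Look up and import a model class lazily, so an unavailable backend
    (e.g. missing torch for gcn) only fails when actually requested."""
    try:
        module_path, class_name = MODEL_REGISTRY[model_type]
    except KeyError:
        raise ValueError(f"unsupported model type: {model_type!r}") from None
    return getattr(import_module(module_path), class_name)
```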

Inference

The inference primitive runs predictions on new data using previously trained models, handling both featurized and raw input data.

Supporting Functions

Evaluation

The evaluation primitive assesses model performance using various metrics and generates evaluation reports.

Supporting Functions

Molecular Docking

The docking primitive performs molecular docking between proteins and ligands using AutoDock VINA to predict binding poses and affinities.

Key Features:

  • Generates protein-ligand binding poses using AutoDock VINA

  • Supports both PDB and PDBQT output formats

  • Automatically splits PDBQT files for multiple binding modes

  • Returns DeepChem addresses to all generated files

Supporting Functions

Available Metrics

The evaluation primitive supports the following metrics:

  • pearson_r2_score: Pearson correlation coefficient

  • jaccard_score: Jaccard similarity score

  • prc_auc_score: Precision-Recall AUC score

  • roc_auc_score: ROC AUC score

  • rms_score: Root Mean Squared Error

  • mae_error: Mean Absolute Error

  • bedroc_score: BEDROC score

  • accuracy_score: Classification accuracy

  • balanced_accuracy_score: Balanced classification accuracy
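
For orientation, two of the simpler metrics can be written out directly. This is a dependency-free sketch; DeepChem's own implementations in deepchem.metrics are the authoritative versions:

```python
import math


def accuracy_score(y_true, y_pred):
    """Fraction of predictions that exactly match the labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


def rms_score(y_true, y_pred):
    """Root mean squared error between labels and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# accuracy: 3 of 4 labels match; RMSE of a constant 0.5 offset is 0.5.
acc = accuracy_score([1, 0, 1, 1], [1, 0, 0, 1])    # 0.75
rmse = rms_score([1.0, 2.0, 3.0], [1.5, 2.5, 3.5])  # 0.5
```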

DEL Denoise

The DEL Denoise primitive scores DEL screening data to identify compounds that are strongly enriched in the target selection relative to background noise.

Key Features:

  • Supports two scoring strategies: Poisson-based (unified) and z-score-based (non_unified)

  • Optional collapsing of trisynthon rows into pairwise disynthon combinations

Scoring Strategies:

  • unified: Computes a Poisson confidence-interval enrichment ratio (target lower bound / control upper bound) across all replicates simultaneously.

  • non_unified: Sums replicates then computes a z-score for each compound independently in the target and control.
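
Under the hood, the two strategies reduce to the formulas sketched below. This is a simplified reimplementation for illustration (the server's versions are poissfit, get_enrichment_ratio, and calculate_normalized_enrichment_score); scipy is assumed for the chi-square quantiles:

```python
import math

from scipy.stats import chi2


def poisson_ci(counts, alpha=0.05):
    """Confidence interval for the per-replicate Poisson rate."""
    k, n = sum(counts), len(counts)
    # Exact Poisson CI via the chi-square relationship, scaled to one replicate.
    lower = chi2.ppf(alpha / 2, 2 * k) / (2 * n)
    upper = chi2.ppf(1 - alpha / 2, 2 * (k + 1)) / (2 * n)
    return lower, upper


def enrichment_ratio(control, target, alpha=0.05):
    """unified strategy: target lower bound over control upper bound."""
    control_upper = poisson_ci(control, alpha)[1]
    target_lower = poisson_ci(target, alpha)[0]
    return target_lower / control_upper if control_upper else 0.0


def z_score(count, total_sum, row_count):
    """non_unified strategy: z-score of a row's share of the summed counts."""
    p0 = count / total_sum   # observed proportion for this row
    p1 = 1.0 / row_count     # expected proportion under uniformity
    return (p0 - p1) / math.sqrt(p1 * (1 - p1))
```

The values these return match the documented supporting-function examples further down, e.g. poisson_ci([10, 12, 11]) gives roughly (7.5719, 15.4481).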

deepchem_server.core.primitives.del_denoising.del_denoise(dataset_address: str, output_key: str, strategy: str = 'unified', control_cols: List[str] | None = None, target_cols: List[str] | None = None, add_hit_labels: bool = False, hit_percentile: float = 90.0, alpha: float = 0.05, drop_duplicates: bool = True, use_disynthon_pairs: bool = False, smiles_cols: List[str] | None = None, aggregate_operation: str = 'sum', min_count_threshold: int = 0) → str[source]

Score DEL screening data to identify strong binders.

Reads a CSV of raw sequencing counts, scores each compound using the chosen enrichment strategy, and writes the result back to the datastore.

Scoring strategies

unified

Applies Poisson confidence intervals across all replicate columns simultaneously. The enrichment score for each row is:

Poisson_Enrichment = target_lower_CI / control_upper_CI

where the CIs are computed via poissfit.

non_unified

Sums replicate counts to form seq_target_sum and seq_control_sum, then computes a z-score for each.

Parameters:
  • dataset_address (str) – Datastore address of the input CSV.

  • output_key (str) – Name for the output CSV in the datastore.

  • strategy (str) – ‘unified’ (Poisson ratio) or ‘non_unified’ (z-score).

  • control_cols (Optional[List[str]]) – Control count column names.

  • target_cols (Optional[List[str]]) – Target count column names.

  • add_hit_labels (bool) – Add binary 0/1 hit columns based on a percentile cutoff.

  • hit_percentile (float) – Percentile cutoff for hits (0–100). Used when add_hit_labels is True.

  • alpha (float) – Significance level for the Poisson confidence intervals (default 0.05 for 95% CI). Used when strategy is ‘unified’.

  • drop_duplicates (bool) – Remove duplicate SMILES rows before scoring.

  • use_disynthon_pairs (bool) – Collapse three-part rows into pairwise combinations before scoring.

  • smiles_cols (Optional[List[str]]) – Three SMILES column names for the pairwise collapse. Used when use_disynthon_pairs is True.

  • aggregate_operation (str) – ‘sum’ or ‘mean’ for combining duplicate pair counts. Used when use_disynthon_pairs is True.

  • min_count_threshold (int) – Drop pair rows with total count below this value. Used when use_disynthon_pairs is True.

Returns:

Datastore address of the output CSV.

Return type:

str

Raises:

ValueError – If strategy is invalid or the datastore is not configured.

References

“DeepChem-DEL: An Open Source Framework for Reproducible DEL Modeling and Benchmarking.” (2025). https://doi.org/10.26434/chemrxiv-2025-f11mk

Examples

Unified scoring:

>>> from deepchem_server.core.common.cards import DataCard
>>> from deepchem_server.core.common import config
>>> from deepchem_server.core.datastore import DiskDataStore
>>> from deepchem_server.core.primitives.del_denoising import del_denoise
>>> import tempfile, pandas as pd
>>> disk_datastore = DiskDataStore('profile', 'project', tempfile.mkdtemp())
>>> config.set_datastore(disk_datastore)
>>> df = pd.DataFrame({
...     "smiles": ["CCO", "CCN", "CCC"],
...     "seq_matrix_1": [10, 20, 5], "seq_matrix_2": [12, 18, 6],
...     "seq_matrix_3": [11, 22, 4],
...     "seq_target_1": [50, 30, 8], "seq_target_2": [55, 28, 7],
...     "seq_target_3": [48, 32, 9],
... })
>>> card = DataCard(address='', file_type='csv', data_type='pandas.DataFrame')
>>> addr = disk_datastore.upload_data_from_memory(df, "raw_del.csv", card)
>>> result_addr = del_denoise(dataset_address=addr, output_key="denoised")
>>> result_addr
'deepchem://profile/project/denoised.csv'

With hit labels:

>>> result_addr = del_denoise(
...     dataset_address=addr,
...     output_key="denoised_hits",
...     strategy="unified",
...     add_hit_labels=True,
...     hit_percentile=90.0,
... )
>>> result_addr
'deepchem://profile/project/denoised_hits.csv'

Non-unified scoring:

>>> result_addr = del_denoise(
...     dataset_address=addr,
...     output_key="denoised_nu",
...     strategy="non_unified",
...     add_hit_labels=True,
... )
>>> result_addr
'deepchem://profile/project/denoised_nu.csv'

Supporting Functions

deepchem_server.core.primitives.del_denoising.poissfit(vec: pandas.Series, alpha: float = 0.05) → Tuple[float, float][source]

Poisson confidence interval for replicate counts.

Parameters:
  • vec (pd.Series) – Replicate counts for one row.

  • alpha (float) – Significance level (default 0.05 for 95% CI).

Returns:

(lower_bound, upper_bound) of the estimated Poisson rate.

Return type:

Tuple[float, float]

Examples

>>> import pandas as pd
>>> from deepchem_server.core.primitives.del_denoising import poissfit
>>> lower, upper = poissfit(pd.Series([10, 12, 11]))
>>> round(lower, 4)
7.5719
>>> round(upper, 4)
15.4481
>>> lower < upper
True

deepchem_server.core.primitives.del_denoising.get_enrichment_ratio(row: pandas.Series, control_cols: List[str], target_cols: List[str], alpha: float = 0.05) → float[source]

Enrichment ratio: target_lower_bound / control_upper_bound.

Parameters:
  • row (pd.Series) – One row with control and target count columns.

  • control_cols (List[str]) – Control count column names.

  • target_cols (List[str]) – Target count column names.

  • alpha (float) – Significance level for the confidence interval.

Returns:

Ratio or 0.0 when the control upper bound is zero.

Return type:

float

Examples

>>> import pandas as pd
>>> from deepchem_server.core.primitives.del_denoising import get_enrichment_ratio
>>> row = pd.Series({"ctrl_1": 5, "ctrl_2": 6, "tgt_1": 50, "tgt_2": 55})
>>> ratio = get_enrichment_ratio(row, ["ctrl_1", "ctrl_2"], ["tgt_1", "tgt_2"])
>>> round(ratio, 4)
4.3633
>>> ratio > 1.0
True

deepchem_server.core.primitives.del_denoising.calculate_poisson_enrichment(df: pandas.DataFrame, control_cols: List[str], target_cols: List[str], alpha: float = 0.05) → pandas.DataFrame[source]

Add a Poisson_Enrichment column to the DataFrame.

Parameters:
  • df (pd.DataFrame) – Input data with control and target count columns.

  • control_cols (List[str]) – Control count column names.

  • target_cols (List[str]) – Target count column names.

  • alpha (float) – Significance level for confidence intervals.

Returns:

Copy of dataframe with a Poisson_Enrichment column added.

Return type:

pd.DataFrame

Examples

>>> import pandas as pd
>>> from deepchem_server.core.primitives.del_denoising import calculate_poisson_enrichment
>>> df = pd.DataFrame({
...     "seq_matrix_1": [10, 20], "seq_matrix_2": [12, 18], "seq_matrix_3": [11, 22],
...     "seq_target_1": [50, 30], "seq_target_2": [55, 28], "seq_target_3": [48, 32],
... })
>>> result = calculate_poisson_enrichment(
...     df,
...     ["seq_matrix_1", "seq_matrix_2", "seq_matrix_3"],
...     ["seq_target_1", "seq_target_2", "seq_target_3"],
... )
>>> "Poisson_Enrichment" in result.columns
True
>>> list(result["Poisson_Enrichment"].round(4))
[2.799, 0.9371]

deepchem_server.core.primitives.del_denoising.calculate_normalized_enrichment_score(row: pandas.Series, total_sum: float, row_count: int, column_name: str) → float[source]

Z-score for one row: (p0 - p1) / sqrt(p1 * (1 - p1)).

Parameters:
  • row (pd.Series) – One DataFrame row.

  • total_sum (float) – Sum of column_name across all rows.

  • row_count (int) – Number of rows in the DataFrame.

  • column_name (str) – Column to read the count from.

Returns:

Normalized score.

Return type:

float

Examples

>>> import pandas as pd
>>> from deepchem_server.core.primitives.del_denoising import calculate_normalized_enrichment_score
>>> row = pd.Series({"count_col": 30})
>>> score = calculate_normalized_enrichment_score(row, total_sum=115.0, row_count=5, column_name="count_col")
>>> round(score, 4)
0.1522

deepchem_server.core.primitives.del_denoising.calculate_hit_threshold(df: pandas.DataFrame, column_name: str, percentile: float) → float[source]

Return the percentile cutoff for a column.

Parameters:
  • df (pd.DataFrame) – Input DataFrame.

  • column_name (str) – Column to compute the percentile on.

  • percentile (float) – Percentile value (0–100).

Returns:

The cutoff value.

Return type:

float

Examples

>>> import pandas as pd
>>> from deepchem_server.core.primitives.del_denoising import calculate_hit_threshold
>>> df = pd.DataFrame({"Poisson_Enrichment": [0.1, 0.5, 1.2, 2.3, 0.8, 3.1, 0.3, 0.9, 1.5, 4.0]})
>>> threshold = calculate_hit_threshold(df, "Poisson_Enrichment", 80.0)
>>> threshold
2.46

deepchem_server.core.primitives.del_denoising.collapse_to_disynthons(df: pandas.DataFrame, smiles_cols: List[str], control_cols: List[str], target_cols: List[str], is_unified: bool, aggregate_operation: str = 'sum', min_count_threshold: int = 0) → Tuple[pandas.DataFrame, int][source]

Collapse three-part rows into pairwise combinations.

Parameters:
  • df (pd.DataFrame) – Cleaned input (no NaN/duplicate rows).

  • smiles_cols (List[str]) – Three SMILES column names.

  • control_cols (List[str]) – Control count column names.

  • target_cols (List[str]) – Target count column names.

  • is_unified (bool) – If True, keep individual count columns. If False, pre-sum into two totals.

  • aggregate_operation (str) – How to combine duplicate counts: ‘sum’ or ‘mean’.

  • min_count_threshold (int) – Drop rows with total count below this value.

Returns:

(collapsed_df, n_failed). collapsed_df has a disynthons column and aggregated counts. n_failed is the number of SMILES that could not be merged.

Return type:

Tuple[pd.DataFrame, int]

Examples

>>> import pandas as pd
>>> from deepchem_server.core.primitives.del_denoising import collapse_to_disynthons
>>> df = pd.DataFrame({
...     "smiles_a": ["CCO", "CCO", "CCN"],
...     "smiles_b": ["CCN", "CCC", "CCC"],
...     "smiles_c": ["CCC", "CCO", "CCO"],
...     "seq_matrix_1": [5, 3, 7], "seq_matrix_2": [6, 4, 8], "seq_matrix_3": [5, 3, 6],
...     "seq_target_1": [20, 10, 15], "seq_target_2": [22, 11, 16], "seq_target_3": [19, 9, 14],
... })
>>> collapsed_df, n_failed = collapse_to_disynthons(
...     df,
...     smiles_cols=["smiles_a", "smiles_b", "smiles_c"],
...     control_cols=["seq_matrix_1", "seq_matrix_2", "seq_matrix_3"],
...     target_cols=["seq_target_1", "seq_target_2", "seq_target_3"],
...     is_unified=True,
... )
>>> "disynthons" in collapsed_df.columns
True
>>> len(collapsed_df)
4
>>> n_failed
0

Workflow Integration

These primitives are designed to work together in typical machine learning workflows:

  1. Data Preparation: Upload raw data to the datastore

  2. Partition (Optional): Use partition() to split large datasets for parallel or staged workflows

  3. Featurization: Use featurize() to transform molecular data into features

  4. Training: Use train() to build models on the featurized data

  5. Inference: Use infer() to make predictions on new data

  6. Evaluation: Use model_evaluator() to assess model performance

  7. Docking: Use generate_pose() to predict protein-ligand binding interactions

Example Workflow

Here’s a typical workflow using five of the primitives:

from deepchem_server.core import feat, train, inference, evaluator, docking
from deepchem_server.core import config
from deepchem_server.core.datastore import DiskDataStore
import tempfile

# Setup datastore
datastore = DiskDataStore('profile', 'project', tempfile.mkdtemp())
config.set_datastore(datastore)

# 1. Featurize raw data
dataset_address = feat.featurize(
    dataset_address="raw_data_address",
    featurizer="ecfp",
    output="featurized_dataset",
    dataset_column="smiles",
    label_column="target"
)

# 2. Train a model
model_address = train.train(
    model_type="random_forest_classifier",
    dataset_address=dataset_address,
    model_name="my_classification_model"
)

# 3. Run inference
predictions_address = inference.infer(
    model_address=model_address,
    data_address="new_data_address",
    output="predictions.csv",
    dataset_column="smiles"
)

# 4. Evaluate the model
evaluator.model_evaluator(
    dataset_addresses=[dataset_address],
    model_address=model_address,
    metrics=["roc_auc_score", "accuracy_score"],
    output_key="evaluation_results"
)

# 5. Perform molecular docking
docking_address = docking.generate_pose(
    protein_address="protein_address",
    ligand_address="ligand_address",
    output="docking_results"
)