Core Primitives
===============

This section documents the core primitives that form the backbone of DeepChem Server's machine learning workflow. These primitives provide the essential functionality for molecular machine learning pipelines: data featurization, model training, inference, and evaluation.

Overview
--------

DeepChem Server provides four main primitives that work together to create end-to-end machine learning workflows:

* **Featurize**: Transform raw molecular data into machine learning features
* **Train**: Build and train machine learning models on featurized datasets
* **Inference**: Run predictions on new data using trained models
* **Evaluation**: Assess model performance using various metrics

These primitives are designed to work together, and each can also be used independently for a specific task.

Featurization
-------------

The featurization primitive transforms raw molecular data (such as SMILES strings or SDF files) into numerical features that can be used for machine learning.

.. autofunction:: deepchem_server.core.feat.featurize
   :no-index:

Supporting Functions
~~~~~~~~~~~~~~~~~~~~

.. autofunction:: deepchem_server.core.feat.split_dataset
   :no-index:

.. autofunction:: deepchem_server.core.feat.featurize_part
   :no-index:

.. autofunction:: deepchem_server.core.feat.featurize_multi_core
   :no-index:

Available Featurizers
~~~~~~~~~~~~~~~~~~~~~

The featurization primitive supports the following DeepChem featurizers:

* **ecfp**: Extended Connectivity Fingerprints (Circular Fingerprints) - Compatible with scikit-learn models
* **graphconv**: Graph Convolution Featurizer - Compatible with scikit-learn models
* **weave**: Weave Featurizer for molecular graphs - Compatible with scikit-learn models
* **molgraphconv**: Molecular Graph Convolution Featurizer - Required for GCN model

.. note::
   While scikit-learn models (linear_regression, random_forest_*) can work with any featurizer, the GCN model specifically requires the ``molgraphconv`` featurizer for proper graph-based processing.

Training
--------

The training primitive builds and trains machine learning models on featurized datasets using models from DeepChem's model library.

.. autofunction:: deepchem_server.core.train.train

Available Models
~~~~~~~~~~~~~~~~

The training primitive supports the following model types:

**Scikit-learn Models (wrapped in DeepChem SklearnModel):**

* **linear_regression**: Linear regression for continuous target variables
* **random_forest_classifier**: Random forest for classification tasks
* **random_forest_regressor**: Random forest for regression tasks

**DeepChem Neural Network Models:**

* **gcn**: Graph Convolutional Network (requires ``molgraphconv`` featurizer)

.. note::
   The GCN model requires PyTorch to be installed and may not be available if torch dependencies are missing.

Each model supports different initialization and training parameters; refer to ``deepchem_server.core.model_mappings`` for detailed parameter options.

Inference
---------

The inference primitive runs predictions on new data using previously trained models, handling both featurized and raw input data.

.. autofunction:: deepchem_server.core.inference.infer

Supporting Functions
~~~~~~~~~~~~~~~~~~~~

.. autofunction:: deepchem_server.core.inference._infer_with_featurize

.. autofunction:: deepchem_server.core.inference._infer_without_featurize

Evaluation
----------

The evaluation primitive assesses model performance using various metrics and generates evaluation reports.

.. autofunction:: deepchem_server.core.evaluator.model_evaluator

Supporting Functions
~~~~~~~~~~~~~~~~~~~~

.. autofunction:: deepchem_server.core.evaluator.prc_auc_curve

Available Metrics
~~~~~~~~~~~~~~~~~

The evaluation primitive supports the following metrics:

* **pearson_r2_score**: Squared Pearson correlation coefficient
* **jaccard_score**: Jaccard similarity score
* **prc_auc_score**: Precision-Recall AUC score
* **roc_auc_score**: ROC AUC score
* **rms_score**: Root mean squared error (RMSE)
* **mae_error**: Mean Absolute Error
* **bedroc_score**: BEDROC score
* **accuracy_score**: Classification accuracy
* **balanced_accuracy_score**: Balanced classification accuracy

Workflow Integration
--------------------

These primitives are designed to work together in typical machine learning workflows:

1. **Data Preparation**: Upload raw data to the datastore
2. **Featurization**: Use ``featurize()`` to transform molecular data into features
3. **Training**: Use ``train()`` to build models on the featurized data
4. **Inference**: Use ``infer()`` to make predictions on new data
5. **Evaluation**: Use ``model_evaluator()`` to assess model performance

Example Workflow
~~~~~~~~~~~~~~~~

Here's a typical workflow using all four primitives:

.. code-block:: python

    import tempfile

    from deepchem_server.core import config
    from deepchem_server.core import feat, train, inference, evaluator
    from deepchem_server.core.datastore import DiskDataStore

    # Setup datastore
    datastore = DiskDataStore('profile', 'project', tempfile.mkdtemp())
    config.set_datastore(datastore)

    # 1. Featurize raw data
    dataset_address = feat.featurize(
        dataset_address="raw_data_address",
        featurizer="ecfp",
        output="featurized_dataset",
        dataset_column="smiles",
        label_column="target"
    )

    # 2. Train a model
    model_address = train.train(
        model_type="random_forest_classifier",
        dataset_address=dataset_address,
        model_name="my_classification_model"
    )

    # 3. Run inference
    predictions_address = inference.infer(
        model_address=model_address,
        data_address="new_data_address",
        output="predictions.csv",
        dataset_column="smiles"
    )

    # 4. Evaluate the model
    evaluator.model_evaluator(
        dataset_addresses=[dataset_address],
        model_address=model_address,
        metrics=["roc_auc_score", "accuracy_score"],
        output_key="evaluation_results"
    )
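
The featurizer and model pairings listed in this section can be checked before any data is featurized or any model is trained. The following is a minimal sketch of such a check in plain Python. The names ``FEATURIZERS``, ``REQUIRED_FEATURIZER``, and ``check_compatibility`` are hypothetical and are not part of ``deepchem_server``; the pairings are copied from the lists above and would need updating if the server's supported featurizers or models change.

.. code-block:: python

    # Hypothetical helper, not part of deepchem_server.
    # Featurizer names as listed under "Available Featurizers".
    FEATURIZERS = {"ecfp", "graphconv", "weave", "molgraphconv"}

    # Model types as listed under "Available Models".
    # None means the documentation places no restriction on the featurizer.
    REQUIRED_FEATURIZER = {
        "linear_regression": None,
        "random_forest_classifier": None,
        "random_forest_regressor": None,
        "gcn": "molgraphconv",
    }


    def check_compatibility(model_type, featurizer):
        """Raise ValueError if the documented pairing rules are not met."""
        if model_type not in REQUIRED_FEATURIZER:
            raise ValueError(f"Unknown model type: {model_type!r}")
        if featurizer not in FEATURIZERS:
            raise ValueError(f"Unknown featurizer: {featurizer!r}")
        required = REQUIRED_FEATURIZER[model_type]
        if required is not None and featurizer != required:
            raise ValueError(
                f"{model_type!r} requires the {required!r} featurizer, "
                f"got {featurizer!r}"
            )


    # Both calls return without raising.
    check_compatibility("random_forest_classifier", "ecfp")
    check_compatibility("gcn", "molgraphconv")

Calling ``check_compatibility("gcn", "ecfp")`` raises ``ValueError``, which surfaces the mismatch before a featurization job is submitted rather than at training time.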