Getting Started
The pyds library is a Python client package for interacting with the DeepChem Server API. It provides a clean, object-oriented interface for managing settings, uploading data, and submitting primitive jobs for molecular machine learning workflows.
What is pyds?
pyds simplifies the process of working with molecular data by providing:
Unified API: A consistent interface for all DeepChem Server operations
Settings Management: Centralized configuration for profiles, projects, and server connections
Data Operations: Easy upload and management of molecular datasets
ML Primitives: Ready-to-use components for featurization, training, evaluation, and inference
Workflow Integration: Seamless chaining of operations for complete ML pipelines
Installation
Install the pyds library from source:
cd pyds
pip install -e .
For development with testing dependencies:
pip install -e ".[dev]"
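To confirm that the editable install is visible to the interpreter, a standard-library check along the following lines can be used. This is a minimal sketch: the helper name `is_installed` is made up for this example, and it only tests that a top-level package can be located, not that it imports cleanly.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be located by the import system."""
    # find_spec returns None when no module of that name is found.
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    # Hypothetical usage: check for pyds after running pip install.
    print("pyds installed:", is_installed("pyds"))
```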
Architecture
The pyds library follows a clean inheritance structure designed for modularity and code reuse:
BaseClient (base functionality)
├── Data (data operations)
└── Primitive (abstract base for computation tasks)
    ├── Featurize (molecular featurization)
    ├── Train (model training)
    ├── Evaluate (model evaluation)
    ├── Infer (inference/predictions)
    └── TVTSplit (train-valid-test splitting)
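The hierarchy above could be expressed in Python roughly as follows. This is an illustrative sketch, not the pyds source: the class names come from the tree, while the constructor argument, the abstract `run` method, and the returned dictionaries are assumptions made for the example.

```python
from abc import ABC, abstractmethod


class BaseClient:
    """Shared functionality; in this sketch it only stores the settings."""

    def __init__(self, settings):
        self.settings = settings


class Data(BaseClient):
    """Data operations such as uploads would live here."""


class Primitive(BaseClient, ABC):
    """Abstract base for computation tasks; cannot be instantiated directly."""

    @abstractmethod
    def run(self, **kwargs):
        """Submit the task; each subclass defines its own parameters."""


class Featurize(Primitive):
    def run(self, **kwargs):
        # Hypothetical body: a real client would send an HTTP request here.
        return {"primitive": "featurize", "params": kwargs}


class Train(Primitive):
    def run(self, **kwargs):
        # Hypothetical body, mirroring Featurize.
        return {"primitive": "train", "params": kwargs}
```

Declaring `run` as abstract on `Primitive` means a subclass that forgets to implement it fails at instantiation rather than at call time.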
Key Design Principles:
BaseClient: Contains all common functionality like HTTP requests, configuration validation, and shared utilities
Inheritance-based: Specific clients inherit from BaseClient, eliminating code duplication
Consistent Interface: All clients provide the same base methods and configuration handling
Settings Management: Centralized configuration through a Settings class with persistent storage in a JSON file
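To illustrate the last principle, here is a minimal sketch of a settings object that persists its values to a JSON file. It is not the pyds `Settings` implementation: the class name `JsonSettings`, the `set`/`get` methods, and the file path are all hypothetical and chosen only to show the write-through pattern.

```python
import json
from pathlib import Path


class JsonSettings:
    """Illustrative stand-in for a settings object persisted to JSON."""

    def __init__(self, path):
        self.path = Path(path)
        # Load previously saved values if the file already exists.
        if self.path.exists():
            self._values = json.loads(self.path.read_text())
        else:
            self._values = {}

    def set(self, key, value):
        self._values[key] = value
        # Write through on every change so a later session sees the update.
        self.path.write_text(json.dumps(self._values, indent=2))

    def get(self, key, default=None):
        return self._values.get(key, default)
```

Because every `set` call rewrites the file, a second `JsonSettings` instance opened on the same path picks up the stored profile or project without any extra save step.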
Quick Start
Basic workflow for using the pyds library:
Configure Settings: Set up profile, project, and server URL
Initialize Clients: Create a Data client and the primitive clients you need (for example, Featurize and Train)
Upload Data: Use Data client to upload datasets
Run Primitives: Use primitive classes for computation tasks
from pyds import Settings, Data, Featurize, Train
# Configure settings (the server URL must also be configured; not shown here)
settings = Settings()
settings.set_profile("my_profile")
settings.set_project("my_project")
# Initialize clients
data_client = Data(settings)
featurize_client = Featurize(settings)
train_client = Train(settings)
# Upload data
response = data_client.upload_data("data.csv", description="My dataset")
dataset_address = response['dataset_address']
# Featurize data
response = featurize_client.run(
    dataset_address=dataset_address,
    featurizer="ECFP",
    output="featurized_data",
    dataset_column="smiles"
)
featurized_address = response['featurized_file_address']
# Train model
response = train_client.run(
    dataset_address=featurized_address,
    model_type="random_forest_classifier",
    model_name="my_model"
)
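The workflow above indexes each response directly, so a missing key would surface as a bare `KeyError`. A small helper like the one below can make such failures easier to diagnose when chaining steps. It is a sketch under the assumption that responses are plain dictionaries, as the indexing in the example suggests; `require_key` is a hypothetical name and not part of pyds.

```python
def require_key(response, key):
    """Return response[key], or raise a descriptive error if it is absent."""
    if not isinstance(response, dict) or key not in response:
        # Include the full response so the cause is visible in the traceback.
        raise KeyError(f"expected {key!r} in server response, got: {response!r}")
    return response[key]
```

In the quick start it would replace the direct lookups, for example `dataset_address = require_key(response, "dataset_address")`.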