ML experiment automation usually means one of two things: a fixed script that runs the same steps every time, or a full orchestration system (Airflow, Prefect, Dagster) that requires significant setup overhead. Neither is great for exploratory work where you want to iterate on the experiment definition itself.
Model Context Protocol (MCP) agents offer a third option: a lightweight, conversational automation layer that can call real tools — Databricks jobs, GitHub PRs, Grafana dashboards — without requiring you to write a new pipeline for every experiment variant.
This post describes the pattern I use for ML experiment automation with MCP, and when it's actually worth reaching for it.
## What MCP gives you
MCP is a protocol for connecting language model agents to external tools via structured function calls. Each tool is defined with a schema (name, description, input/output types), and the agent decides which tools to call and in what order based on a natural-language prompt.
The key insight is that you're not hardcoding a workflow. You're describing the goal, and the agent plans the steps. This is useful when:
- The experiment workflow has conditional branches ("if validation F1 drops below 0.6, check the feature distribution before re-training")
- You want to chain multiple systems (run job → fetch metrics → open PR → update tracking doc)
- You're exploring variations and don't want to template a new script for each one
## The core tools for an ML experiment loop
For a typical supervised ML experiment, the tools you need are:
| Tool | What it does |
|---|---|
| `run_databricks_job` | Trigger a training job by job ID, passing config as parameters |
| `get_job_status` | Poll job run status (RUNNING, SUCCEEDED, FAILED) |
| `fetch_mlflow_run` | Get metrics and parameters for a completed run by run ID |
| `compare_runs` | Diff metrics between two MLflow run IDs |
| `create_github_pr` | Open a PR with experiment results in the description |
| `get_grafana_panel` | Fetch a panel snapshot for a given time range |
Each tool is a thin wrapper around the relevant API. The MCP server registers them and handles auth. The agent calls them in sequence, with branching logic handled by the model.
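For a sense of how thin these wrappers really are, here's a sketch of the body behind `compare_runs` using the MLflow client. The parameter names and the return shape (absolute metric deltas) are my choices, not anything prescribed by MLflow:

```python
import mlflow


def compare_runs(baseline_run_id: str, candidate_run_id: str) -> dict[str, float]:
    """Diff metrics between two MLflow runs (the body behind the compare_runs tool)."""
    baseline = mlflow.get_run(baseline_run_id).data.metrics
    candidate = mlflow.get_run(candidate_run_id).data.metrics
    # Absolute delta for every metric the two runs have in common
    return {
        key: candidate[key] - baseline[key]
        for key in baseline.keys() & candidate.keys()
    }
```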
## Example: running an ablation
Here's what a prompt to the agent looks like for a feature ablation experiment:
```
Run a training job using config base_config.json but with feature_set=v2.
Wait for it to complete. Compare its val_f1 to the best run in experiment "ads-rank-v3".
If val_f1 improved by more than 1%, open a GitHub PR titled "feat: feature-set-v2 ablation"
with the metric comparison in the description. Otherwise, post the diff to Slack channel #ml-experiments.
```
The agent:
- Calls `run_databricks_job` with the modified config
- Polls `get_job_status` until completion (sketched below)
- Calls `fetch_mlflow_run` for the new run and the baseline
- Calls `compare_runs` to compute the diff
- Branches based on the F1 improvement threshold
- Opens the PR or posts to Slack
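The polling step is the only one with a loop in it. A rough sketch of what can sit behind `get_job_status`, assuming the databricks-sdk `WorkspaceClient`; `wait_for_run` is an optional extra, not one of the tools in the table:

```python
import time

from databricks.sdk import WorkspaceClient


def get_job_status(run_id: int) -> str:
    """Return RUNNING/PENDING while active, or the result state (SUCCESS, FAILED, ...) once done."""
    state = WorkspaceClient().jobs.get_run(run_id=run_id).state
    if state.result_state is not None:  # run has finished
        return state.result_state.value
    return state.life_cycle_state.value


def wait_for_run(run_id: int, poll_seconds: int = 30) -> str:
    """Blocking variant: poll until the run finishes, then return its result state."""
    while get_job_status(run_id) in {"QUEUED", "PENDING", "RUNNING", "TERMINATING"}:
        time.sleep(poll_seconds)
    return get_job_status(run_id)
```

Exposing only `get_job_status` keeps each tool call fast and lets the agent decide how often to poll; a blocking `wait_for_run` trades that flexibility for fewer round-trips.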
None of this required a new script. The branching logic lived in natural language and was interpreted by the model.
## Implementing the MCP server

A minimal MCP server in Python using the `mcp` SDK:
```python
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp import types
import mlflow
from databricks.sdk import WorkspaceClient

server = Server("ml-experiment-automation")


@server.list_tools()
async def list_tools() -> list[types.Tool]:
    return [
        types.Tool(
            name="run_databricks_job",
            description="Trigger a Databricks job run with optional parameter overrides",
            inputSchema={
                "type": "object",
                "properties": {
                    "job_id": {"type": "integer"},
                    "parameters": {"type": "object"},
                },
                "required": ["job_id"],
            },
        ),
        types.Tool(
            name="fetch_mlflow_run",
            description="Get metrics and params for a completed MLflow run",
            inputSchema={
                "type": "object",
                "properties": {"run_id": {"type": "string"}},
                "required": ["run_id"],
            },
        ),
        # ... other tools
    ]


@server.call_tool()
async def call_tool(name: str, arguments: dict) -> list[types.TextContent]:
    if name == "run_databricks_job":
        # Auth comes from the usual Databricks environment variables / config profile
        client = WorkspaceClient()
        run = client.jobs.run_now(
            job_id=arguments["job_id"],
            notebook_params=arguments.get("parameters", {}),
        )
        return [types.TextContent(type="text", text=str(run.run_id))]
    if name == "fetch_mlflow_run":
        run = mlflow.get_run(arguments["run_id"])
        return [types.TextContent(
            type="text",
            text=str({"metrics": run.data.metrics, "params": run.data.params}),
        )]
    raise ValueError(f"Unknown tool: {name}")


async def main():
    # stdio transport: the MCP client launches this process and talks over stdin/stdout
    async with stdio_server() as streams:
        await server.run(*streams, server.create_initialization_options())


if __name__ == "__main__":
    import asyncio

    asyncio.run(main())
```
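To use it, register the server with your MCP client. Claude Desktop, for instance, launches stdio servers from its config file, and other clients that follow the same `mcpServers` convention look similar; the path below is a placeholder for wherever you saved the script:

```json
{
  "mcpServers": {
    "ml-experiment-automation": {
      "command": "python",
      "args": ["/path/to/experiment_server.py"]
    }
  }
}
```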
## When not to use this pattern
MCP agents are not a replacement for scheduled pipelines. If you're running the same workflow on a fixed schedule — nightly retraining, weekly evaluation — use a proper orchestrator. The value of MCP automation is in the ad-hoc, exploratory loop where the next step depends on what you just saw.
It's also not appropriate for production-critical paths. An agent that can open PRs and trigger jobs should be restricted to non-prod environments or require explicit human confirmation for high-impact actions. Build that confirmation into your tool schemas: return a `requires_confirmation: true` field and have the agent ask before proceeding.
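A minimal way to wire that in, reusing the low-level handler style from above. `confirmation_gate`, `HIGH_IMPACT_TOOLS`, and the `confirm` argument are my conventions for this sketch, not anything defined by MCP:

```python
from mcp import types

# Tools that shouldn't execute without an explicit go-ahead (names from the table above)
HIGH_IMPACT_TOOLS = {"run_databricks_job", "create_github_pr"}


def confirmation_gate(name: str, arguments: dict) -> list[types.TextContent] | None:
    """Call first inside call_tool(); returns a 'please confirm' response, or None to proceed."""
    if name in HIGH_IMPACT_TOOLS and not arguments.get("confirm", False):
        return [types.TextContent(
            type="text",
            text=str({
                "requires_confirmation": True,
                "action": name,
                "hint": "Re-call this tool with confirm=true once a human has approved.",
            }),
        )]
    return None
```

Inside `call_tool`, check `confirmation_gate(name, arguments)` first and return its result if it isn't None; the agent sees the `requires_confirmation` payload, asks the user, and retries the call with `confirm: true`.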
## The actual leverage
The biggest win isn't the automation itself — it's that the experiment description becomes a record. The prompt you wrote to trigger the ablation is a human-readable spec of exactly what was run and why. That's more useful for audit trails and onboarding than a script buried in a Makefile.
Worth trying if you're doing more than 5 experiments a week and spending time on the plumbing between them.