Regression and Modeling Domain

Overview

The regression and modeling domain fits regression models, evaluates their performance, and diagnoses their residuals. Use it to quantify the relationship between a continuous outcome and one or more predictor variables, or to predict numeric values from a set of features. Logistic regression is also available for binary or multi-class outcomes.

When to use this domain:

Estimating the effect of one or more variables on a continuous outcome
Predicting numeric values from structured features
Selecting the most informative features from a large feature set
Checking whether model assumptions (linearity, homoscedasticity, independence) hold
Comparing in-sample versus out-of-sample performance to detect overfitting

Source: src/localdata_mcp/domains/regression_modeling/

Available Analyses

Method	Class	Description
Ordinary least squares	`LinearRegressionTransformer`	Standard linear regression with full statistical diagnostics
Ridge regression	`RegularizedRegressionTransformer`	L2 regularisation; shrinks coefficients without eliminating them
Lasso regression	`RegularizedRegressionTransformer`	L1 regularisation; performs automatic feature selection
Elastic net	`RegularizedRegressionTransformer`	L1+L2 combination; balances Ridge and Lasso properties
Logistic regression	`LogisticRegressionTransformer`	Binary or multi-class classification
Polynomial regression	`PolynomialRegressionTransformer`	Non-linear relationships via polynomial feature expansion
Model-based feature selection	`FeatureSelectionTransformer`	Select features via Lasso coefficient shrinkage
Recursive feature elimination	`FeatureSelectionTransformer`	Iteratively remove least important features (RFE / RFECV)
Univariate feature selection	`FeatureSelectionTransformer`	F-statistic based selection (SelectKBest)
Residual normality tests	`ResidualAnalysisTransformer`	Shapiro-Wilk, Anderson-Darling, Jarque-Bera
Homoscedasticity tests	`ResidualAnalysisTransformer`	Breusch-Pagan and White tests
Autocorrelation test	`ResidualAnalysisTransformer`	Durbin-Watson statistic
Influence measures	`ResidualAnalysisTransformer`	Leverage, Cook’s distance, studentised residuals
Cross-validation	`RegressionModelingPipeline`	K-fold R² and RMSE

MCP Tool Reference

The domain exposes two primary MCP tools via src/localdata_mcp/datascience_tools.py.

`tool_fit_regression`

Fit a regression model on data retrieved from a SQL query, with optional residual analysis and cross-validation.

Parameters:

Parameter	Type	Default	Description
`engine`	`Engine`	required	SQLAlchemy engine from an active connection
`query`	`str`	required	SQL query returning features and target column
`target_column`	`str`	required	Name of the numeric outcome column
`feature_columns`	`list[str]`	`None`	Feature columns; all non-target columns used if None
`model_type`	`str`	`"linear"`	`"linear"`, `"ridge"`, `"lasso"`, `"elastic_net"`, `"logistic"`, `"polynomial"`
`regularization`	`str`	`None`	Override regularisation method (alternative to `model_type`)
`max_rows`	`int`	`None`	Row cap; the 500,000-row default applies when `None`

The underlying RegressionModelingPipeline also accepts:

Parameter	Type	Default	Description
`cross_validation`	`bool`	`True`	Perform K-fold cross-validation
`residual_analysis`	`bool`	`True`	Run residual diagnostics after fitting
`feature_selection`	`bool`	`False`	Run automatic feature selection before fitting
`preprocessing`	`str`	`"auto"`	Preprocessing level: `"minimal"`, `"auto"`, `"comprehensive"`
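
The cross_validation option reports K-fold R² and RMSE. As a minimal sketch of what that computation involves, the block below runs a 5-fold split by hand with NumPy and an ordinary least-squares fit. The helper name `kfold_r2_rmse`, the shuffling seed, and the synthetic data are assumptions for illustration; the pipeline's own splitting strategy is not documented here and may differ.

```python
import numpy as np

def kfold_r2_rmse(X, y, k=5, seed=0):
    """Per-fold out-of-sample R² and RMSE for an OLS fit (hypothetical helper)."""
    n = len(y)
    order = np.random.default_rng(seed).permutation(n)  # shuffle row indices once
    folds = np.array_split(order, k)                    # k roughly equal folds
    r2_scores, rmse_scores = [], []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Fit OLS with an intercept on the training folds only
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        beta, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Score on the held-out fold
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        err = y[test_idx] - A_test @ beta
        ss_res = float(err @ err)
        ss_tot = float(((y[test_idx] - y[test_idx].mean()) ** 2).sum())
        r2_scores.append(1.0 - ss_res / ss_tot)
        rmse_scores.append(float(np.sqrt(ss_res / len(test_idx))))
    return np.array(r2_scores), np.array(rmse_scores)

# Synthetic data (assumed): three features with a known linear signal
rng = np.random.default_rng(42)
X = rng.normal(size=(150, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=150)

r2_scores, rmse_scores = kfold_r2_rmse(X, y, k=5)
print(f"CV R²:   {r2_scores.mean():.3f} ± {r2_scores.std():.3f}")
print(f"CV RMSE: {rmse_scores.mean():.3f} ± {rmse_scores.std():.3f}")
```

A large spread across folds, or a cross-validated R² well below the in-sample R², is a hint that the model may not generalise.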

Returns: dict with keys:

model_type — model type fitted
regression_analysis — coefficients, standard errors, p-values, R², adjusted R², RMSE, MAE, AIC, BIC
residual_analysis — normality tests, homoscedasticity tests, autocorrelation, outlier indices, Cook’s distances
feature_selection — selected features and importance scores (when enabled)
pipeline_config — configuration settings used

`tool_evaluate_model`

Evaluate a fitted model’s performance on held-out data.

Parameters:

Parameter	Type	Default	Description
`engine`	`Engine`	required	SQLAlchemy engine
`query`	`str`	required	SQL query returning test data
`target_column`	`str`	required	Ground-truth outcome column
`prediction_column`	`str`	required	Column containing model predictions
`model_type`	`str`	`"regression"`	Model type for interpretation context
`max_rows`	`int`	`None`	Row cap

For direct use of evaluate_model_performance from the domain:

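# Assumed import path, mirroring the other regression_modeling imports in this document
from localdata_mcp.domains.regression_modeling import evaluate_model_performance

evaluation = evaluate_model_performance(
    model=fitted_model,
    X_test=X_test,
    y_test=y_test,
    X_train=X_train,   # optional; pass together with y_train to enable the overfitting check
    y_train=y_train,   # optional
)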

Returns: dict with keys:

test_metrics — R², MSE, RMSE, MAE, explained variance
train_metrics — same metrics for training data (when provided)
overfitting_check — R² gap between train and test; likely_overfitting=True when gap > 0.1
test_predictions — model predictions as list
test_residuals — prediction errors as list

Method Details

Linear Regression (OLS)

Fits ordinary least squares with sklearn.linear_model.LinearRegression and computes full statistical diagnostics via statsmodels.

Outputs include:

Coefficients with standard errors, t-statistics, and p-values for each feature
R² and adjusted R²
F-statistic for overall model significance
AIC and BIC for model comparison

Key parameters of LinearRegressionTransformer:

Parameter	Default	Description
`fit_intercept`	`True`	Include intercept term
`include_diagnostics`	`True`	Run full statsmodels diagnostics
`alpha`	`0.05`	Significance level for tests
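
To make the listed outputs concrete, here is a minimal NumPy sketch that computes coefficients, standard errors, t-statistics, R², adjusted R², the F-statistic, AIC and BIC for an OLS fit. It is not the transformer's implementation (which, per the description above, relies on scikit-learn and statsmodels); the function name `ols_summary`, the returned key names, and the synthetic data are assumptions. P-values are omitted because they need a t-distribution CDF.

```python
import numpy as np

def ols_summary(X, y):
    """OLS with an intercept plus basic fit statistics (hypothetical helper)."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])           # design matrix with intercept
    p = k + 1                                      # number of estimated coefficients
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    rss = float(resid @ resid)
    tss = float(((y - y.mean()) ** 2).sum())
    dof = n - p                                    # residual degrees of freedom
    sigma2 = rss / dof                             # unbiased error variance
    std_errors = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    r2 = 1.0 - rss / tss
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / dof
    f_stat = (r2 / k) / ((1.0 - r2) / dof)
    # Gaussian log-likelihood evaluated at the MLE variance rss / n
    llf = -0.5 * n * (np.log(2.0 * np.pi) + np.log(rss / n) + 1.0)
    return {
        "coefficients": beta,
        "std_errors": std_errors,
        "t_stats": beta / std_errors,
        "r2": r2,
        "adj_r2": adj_r2,
        "f_stat": f_stat,
        "aic": 2.0 * p - 2.0 * llf,
        "bic": p * np.log(n) - 2.0 * llf,
    }

# Synthetic data (assumed): y = 1 + 2*x0 - 3*x1 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

summary = ols_summary(X, y)
print("coefficients:", np.round(summary["coefficients"], 3))
print(f"R² = {summary['r2']:.4f}, adjusted R² = {summary['adj_r2']:.4f}")
```

AIC and BIC are only meaningful relative to other models fitted on the same data; the lower value indicates the preferred model.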

Regularised Regression

All three variants use cross-validation to select the optimal regularisation strength when alpha="auto" (default).

Ridge: Penalises the sum of squared coefficients (L2). All features remain in the model; coefficients shrink toward zero. Use when you want to reduce variance without eliminating predictors.

Lasso: Penalises the sum of absolute coefficients (L1). Drives some coefficients exactly to zero, performing automatic feature selection. Use when you suspect many irrelevant features.

Elastic Net: Combines L1 and L2 penalties. The l1_ratio parameter controls the mix (0 = Ridge, 1 = Lasso). Use when features are correlated and Lasso tends to arbitrarily drop one from a correlated group.

Key parameters of RegularizedRegressionTransformer:

Parameter	Default	Description
`method`	`"ridge"`	`"ridge"`, `"lasso"`, `"elastic_net"`
`alpha`	`"auto"`	Regularisation strength; `"auto"` uses CV
`l1_ratio`	`0.5`	ElasticNet L1/L2 mix (only for elastic_net)
`cv`	`5`	Cross-validation folds for hyperparameter search
`max_iter`	`1000`	Solver iteration limit
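
The difference between the L2 and L1 penalties is easiest to see on data where most features are irrelevant. The sketch below uses scikit-learn's `Ridge`, `Lasso` and `LassoCV` directly; whether RegularizedRegressionTransformer wraps these exact estimators is an assumption, and the fixed alpha values and synthetic data are chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV, Ridge

# Synthetic data (assumed): only features 0 and 1 carry signal, the other four are noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 5.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Ridge shrinks every coefficient but keeps all of them in the model
ridge = Ridge(alpha=1.0).fit(X, y)
# Lasso with a sufficiently strong penalty sets the noise coefficients exactly to zero
lasso = Lasso(alpha=0.5).fit(X, y)

ridge_nonzero = int(np.sum(ridge.coef_ != 0))
lasso_nonzero = int(np.sum(lasso.coef_ != 0))
print("Ridge non-zero coefficients:", ridge_nonzero)
print("Lasso non-zero coefficients:", lasso_nonzero)

# Cross-validated choice of the penalty strength, analogous to alpha="auto" with cv=5
lasso_cv = LassoCV(cv=5, max_iter=1000).fit(X, y)
best_alpha = float(lasso_cv.alpha_)
print(f"CV-selected alpha: {best_alpha:.5f}")
```

Note that a cross-validated alpha is tuned for predictive error, so it is often small enough to leave a few weak features in the model; a larger alpha gives a sparser but more biased fit.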

Logistic Regression

LogisticRegressionTransformer fits a regularised logistic regression for binary or multi-class classification. It reports coefficients, odds ratios, and classification metrics (accuracy, precision, recall, F1, AUC-ROC).
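
An odds ratio is the exponential of a logistic coefficient: the multiplicative change in the odds of the positive class for a one-unit increase in the feature. The sketch below shows this with scikit-learn's `LogisticRegression` on synthetic binary data; it illustrates the reported quantities and is not the transformer's code. The data-generating process is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic binary outcome (assumed): feature 0 drives the class, feature 1 is noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_logit = 3.0 * X[:, 0]
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-true_logit))).astype(int)

model = LogisticRegression().fit(X, y)   # L2-regularised by default

odds_ratios = np.exp(model.coef_[0])     # exp(coefficient) per feature
proba = model.predict_proba(X)[:, 1]     # predicted probability of class 1
accuracy = accuracy_score(y, model.predict(X))
auc = roc_auc_score(y, proba)

print("odds ratios:", np.round(odds_ratios, 2))
print(f"accuracy = {accuracy:.3f}, AUC-ROC = {auc:.3f}")
```

An odds ratio above 1 means the feature raises the odds of the positive class; a value near 1 means it has little effect. These metrics are computed in-sample here, so they are optimistic.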

Polynomial Regression

PolynomialRegressionTransformer expands features to polynomial terms up to a specified degree, then fits OLS. Use for capturing non-linear relationships in low-dimensional data. Beware overfitting at high degrees.
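
As a minimal sketch of polynomial feature expansion followed by OLS, the block below fits degree-1 and degree-2 models to a single feature with a quadratic relationship. It uses NumPy only; the helper name `poly_fit_r2` and the data are assumptions, and with several features a full expansion would also include interaction terms, which this single-feature example does not show.

```python
import numpy as np

def poly_fit_r2(x, y, degree):
    """Expand x into 1, x, ..., x**degree, fit OLS, return (coefficients, R²)."""
    A = np.vander(x, degree + 1, increasing=True)   # polynomial design matrix
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
    return beta, r2

# Synthetic data (assumed): y = 1 + 0.5*x + 2*x^2 + noise
rng = np.random.default_rng(1)
x = np.linspace(-3.0, 3.0, 120)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(scale=0.3, size=x.size)

beta_linear, r2_linear = poly_fit_r2(x, y, degree=1)
beta_quadratic, r2_quadratic = poly_fit_r2(x, y, degree=2)

print(f"degree 1: R² = {r2_linear:.3f}")
print(f"degree 2: R² = {r2_quadratic:.3f}")
print("degree-2 coefficients:", np.round(beta_quadratic, 2))
```

In-sample R² never decreases as the degree grows, so compare degrees on held-out data or with cross-validation rather than on the training fit.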

Feature Selection

Four methods are available via FeatureSelectionTransformer:

Model-based (method="model_based"): Uses sklearn.feature_selection.SelectFromModel with a LassoCV estimator. Features with near-zero Lasso coefficients are dropped.

Recursive Feature Elimination (method="rfe"): Iteratively fits the model and removes the least important feature. The number of features to keep is set by k.

RFECV (method="rfecv"): Like RFE but selects k automatically via cross-validation. Reports the optimal number of features and cross-validation scores.

Univariate (method="univariate"): Ranks features by F-statistic from SelectKBest(f_regression). Fast but ignores feature interactions.

Key parameters of FeatureSelectionTransformer:

Parameter	Default	Description
`method`	`"model_based"`	Selection method
`k`	`"all"`	Number of features to select (RFE, univariate)
`cv`	`5`	Cross-validation folds (RFECV)
`scoring`	`"r2"`	Evaluation metric for RFECV

Returns include: selected_features, feature_importance, R² before and after selection, and feature reduction ratio.
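
The selection strategies can be tried directly with scikit-learn. The sketch below runs univariate, RFE and model-based selection on synthetic data in which only two of eight features matter. It illustrates the underlying techniques rather than FeatureSelectionTransformer itself; the estimator passed to `RFE` and the data are assumptions, and scikit-learn names the feature-count argument `n_features_to_select` where this domain uses k.

```python
import numpy as np
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_regression
from sklearn.linear_model import LassoCV, LinearRegression

# Synthetic data (assumed): only features 0 and 3 influence the target
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = 4.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(scale=0.5, size=300)

# Univariate: rank each feature on its own by F-statistic and keep the top 2
univariate = SelectKBest(score_func=f_regression, k=2).fit(X, y)
univariate_idx = sorted(univariate.get_support(indices=True).tolist())

# RFE: refit repeatedly, dropping the weakest coefficient until 2 features remain
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2).fit(X, y)
rfe_idx = sorted(rfe.get_support(indices=True).tolist())

# Model-based: keep features whose Lasso coefficient is not (near) zero
model_based = SelectFromModel(LassoCV(cv=5)).fit(X, y)
model_idx = sorted(model_based.get_support(indices=True).tolist())

print("univariate :", univariate_idx)
print("rfe        :", rfe_idx)
print("model-based:", model_idx)
```

Model-based selection does not take a target count, so it may keep more features than the other two methods when the cross-validated penalty is weak.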

Residual Analysis

ResidualAnalysisTransformer performs full residual diagnostics automatically when residual_analysis=True in the pipeline.

Normality tests:

Shapiro-Wilk (n < 5,000): most powerful for small samples
Anderson-Darling: the test statistic is compared against the critical value at the 5% significance level
Jarque-Bera: tests whether skewness and kurtosis match the normal distribution
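
Of the three tests, Jarque-Bera is simple enough to compute by hand, which makes the idea concrete. The sketch below is a NumPy-only illustration, not the transformer's code; the function name and sample data are assumptions. The statistic is JB = n/6 · (S² + (K − 3)²/4), where S is the sample skewness and K the sample kurtosis, and under normality it is asymptotically chi-squared with 2 degrees of freedom, whose survival function is exp(−JB/2).

```python
import numpy as np

def jarque_bera(resid):
    """Return (JB statistic, asymptotic p-value) for a 1-D array of residuals."""
    resid = np.asarray(resid, dtype=float)
    n = resid.size
    centred = resid - resid.mean()
    m2 = np.mean(centred ** 2)                    # second central moment
    skewness = np.mean(centred ** 3) / m2 ** 1.5
    kurtosis = np.mean(centred ** 4) / m2 ** 2    # equals 3 for a normal distribution
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    p_value = float(np.exp(-jb / 2.0))            # chi-squared(2) survival function
    return float(jb), p_value

# Sample residuals (assumed): one roughly normal set, one strongly right-skewed set
rng = np.random.default_rng(0)
normal_resid = rng.normal(size=500)
skewed_resid = rng.exponential(size=500) - 1.0

jb_normal, p_normal = jarque_bera(normal_resid)
jb_skewed, p_skewed = jarque_bera(skewed_resid)
print(f"normal residuals: JB = {jb_normal:.2f}, p = {p_normal:.3f}")
print(f"skewed residuals: JB = {jb_skewed:.2f}, p = {p_skewed:.3g}")
```

A p-value below the chosen significance level rejects normality. The chi-squared approximation is only reliable for fairly large samples, which is one reason to report it alongside the other tests.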

Homoscedasticity tests:

Breusch-Pagan: regresses squared residuals on features; significant result indicates heteroscedasticity
White: tests for non-linear forms of heteroscedasticity
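
The Breusch-Pagan idea can be sketched in a few lines: regress the squared residuals on the features and test whether that auxiliary regression explains anything. The block below implements the studentised (Koenker) form, LM = n · R² of the auxiliary regression, compared against a chi-squared distribution with one degree of freedom per feature. It is an illustration under assumed names and data, not the transformer's implementation, and it uses `scipy.stats.chi2` for the p-value.

```python
import numpy as np
from scipy import stats

def ols_residuals(X, y):
    """Residuals from an OLS fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ beta

def breusch_pagan(resid, X):
    """Studentised Breusch-Pagan test; returns (LM statistic, p-value)."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    u2 = resid ** 2
    gamma, *_ = np.linalg.lstsq(A, u2, rcond=None)   # auxiliary regression of u² on X
    fitted = A @ gamma
    r2_aux = 1.0 - np.sum((u2 - fitted) ** 2) / np.sum((u2 - u2.mean()) ** 2)
    lm = n * r2_aux
    return float(lm), float(stats.chi2.sf(lm, df=k))

# Synthetic data (assumed): same mean function, constant versus growing noise
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 5.0, size=400)
X = x.reshape(-1, 1)
y_const = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=400)
y_grow = 2.0 + 3.0 * x + rng.normal(size=400) * x    # noise scale grows with x

lm_const, p_const = breusch_pagan(ols_residuals(X, y_const), X)
lm_grow, p_grow = breusch_pagan(ols_residuals(X, y_grow), X)
print(f"constant variance: LM = {lm_const:.2f}, p = {p_const:.3f}")
print(f"growing variance:  LM = {lm_grow:.2f}, p = {p_grow:.3g}")
```

The White test follows the same pattern but adds squares and cross-products of the features to the auxiliary regression, which is how it picks up non-linear forms of heteroscedasticity.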

Autocorrelation:

Durbin-Watson statistic (range 0 to 4): values near 2 indicate no first-order autocorrelation; values below 1.5 suggest positive autocorrelation and values above 2.5 suggest negative autocorrelation
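
The Durbin-Watson statistic is the sum of squared successive differences of the residuals divided by the sum of squared residuals. The sketch below computes it with NumPy and applies the 1.5 and 2.5 cut-offs quoted above; the function names and the simulated residual series are assumptions for illustration.

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic for residuals given in time order."""
    resid = np.asarray(resid, dtype=float)
    return float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))

def interpret_dw(dw):
    """Apply the rule-of-thumb thresholds used in this section."""
    if dw < 1.5:
        return "positive autocorrelation"
    if dw > 2.5:
        return "negative autocorrelation"
    return "no strong autocorrelation"

# Simulated residuals (assumed): white noise versus an AR(1) series with rho = 0.8
rng = np.random.default_rng(0)
white = rng.normal(size=500)
shocks = rng.normal(size=500)
ar1 = np.zeros(500)
for t in range(1, 500):
    ar1[t] = 0.8 * ar1[t - 1] + shocks[t]

dw_white = durbin_watson(white)
dw_ar1 = durbin_watson(ar1)
print(f"white noise: DW = {dw_white:.2f} ({interpret_dw(dw_white)})")
print(f"AR(1):       DW = {dw_ar1:.2f} ({interpret_dw(dw_ar1)})")
```

The statistic is approximately 2 · (1 − r), where r is the lag-1 autocorrelation of the residuals, so it only detects first-order dependence and only makes sense when the rows have a meaningful order.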

Influence measures:

Leverage: diagonal of the hat matrix; high leverage points have unusual feature values
Cook’s distance: measures overall influence on all fitted values; values > 4/n are flagged
Studentised residuals: standardised by leave-one-out standard error; |value| > 2.5 are flagged as outliers
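
All three influence measures follow from the hat matrix of the OLS fit. The sketch below computes leverage, Cook's distance and leave-one-out (externally) studentised residuals with NumPy, then applies the 4/n and 2.5 flags described above. It is an illustration with assumed names and a planted outlier, not the transformer's implementation.

```python
import numpy as np

def influence_measures(X, y):
    """Return (leverage, Cook's distance, studentised residuals) for OLS with intercept."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    p = k + 1
    Q, _ = np.linalg.qr(A)                          # reduced QR decomposition
    leverage = np.sum(Q ** 2, axis=1)               # diagonal of the hat matrix
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    rss = float(resid @ resid)
    s2 = rss / (n - p)
    cooks = resid ** 2 / (p * s2) * leverage / (1.0 - leverage) ** 2
    # Error variance re-estimated with observation i left out
    s2_loo = (rss - resid ** 2 / (1.0 - leverage)) / (n - p - 1)
    student = resid / np.sqrt(s2_loo * (1.0 - leverage))
    return leverage, cooks, student

# Synthetic data (assumed) with one planted outlier at index 10
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=100)
x[10] = 6.0                      # unusual feature value, so high leverage
y[10] = 1.0 + 2.0 * 6.0 + 15.0   # far from the true line, so large residual
X = x.reshape(-1, 1)

leverage, cooks, student = influence_measures(X, y)
n = len(y)
high_influence = np.where(cooks > 4.0 / n)[0].tolist()
outliers = np.where(np.abs(student) > 2.5)[0].tolist()
print("high influence (Cook's D > 4/n):", high_influence)
print("outliers (|studentised| > 2.5): ", outliers)
```

The leverages sum to the number of fitted coefficients, which is a quick sanity check on the computation. A point can have high leverage without being influential if it lies close to the fitted line.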

Model Evaluation Metrics

Metric	Range	Better when
R²	≤ 1 (0 – 1 in-sample for OLS with an intercept; can be negative on held-out data)	Higher
Adjusted R²	≤ R²	Higher (penalises extra features)
RMSE	≥ 0	Lower (same units as target)
MAE	≥ 0	Lower (less sensitive to outliers than RMSE)
AIC	any	Lower (model comparison)
BIC	any	Lower (stronger penalty for complexity)

The overfitting check in evaluate_model_performance sets likely_overfitting=True when train R² exceeds test R² by more than 0.1. A test-to-train MSE ratio above 1.5 is a secondary signal.
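
The two overfitting rules are straightforward to reproduce. The sketch below computes R², MSE, RMSE and MAE from predictions and applies the 0.1 R² gap and 1.5 MSE ratio thresholds. The function names and the `mse_ratio` and `mse_ratio_flag` keys are assumptions; `r2_gap` and `likely_overfitting` follow the key names used elsewhere in this document. It is not the code of evaluate_model_performance.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """R², MSE, RMSE and MAE for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = float(np.mean(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "r2": 1.0 - float(np.sum(err ** 2)) / ss_tot,
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
        "mae": float(np.mean(np.abs(err))),
    }

def overfitting_check(train, test, r2_gap_threshold=0.1, mse_ratio_threshold=1.5):
    """Compare train and test metrics using the thresholds described above."""
    r2_gap = train["r2"] - test["r2"]
    mse_ratio = test["mse"] / train["mse"]
    return {
        "r2_gap": r2_gap,
        "mse_ratio": mse_ratio,                              # assumed key name
        "likely_overfitting": r2_gap > r2_gap_threshold,
        "mse_ratio_flag": mse_ratio > mse_ratio_threshold,   # assumed key name
    }

# Toy predictions (assumed): a near-perfect training fit and a much weaker test fit
y_train = [1.0, 2.0, 3.0, 4.0, 5.0]
pred_train = [1.1, 1.9, 3.0, 4.1, 4.9]
y_test = [1.0, 2.0, 3.0, 4.0, 5.0]
pred_test = [2.0, 1.0, 3.0, 5.0, 4.0]

train_metrics = regression_metrics(y_train, pred_train)
test_metrics = regression_metrics(y_test, pred_test)
check = overfitting_check(train_metrics, test_metrics)
print(f"train R² = {train_metrics['r2']:.3f}, test R² = {test_metrics['r2']:.3f}")
print(f"R² gap = {check['r2_gap']:.3f}, likely overfitting: {check['likely_overfitting']}")
```

The MSE ratio is undefined when the training error is exactly zero, so a production check would need to guard that case.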

Composition

Next step	Purpose
`statistical_analysis`	Validate model assumptions; test residual normality and correlation between residuals and features
`pattern_recognition`	Identify clusters in residuals that may indicate omitted subgroup structure
`time_series`	Use fitted regression as part of a decomposition or as a feature in forecasting
`business_intelligence`	Translate model coefficients into business impact estimates

The regression_analysis result dict from the pipeline can be passed directly to statistical_analysis tools by supplying the residuals array and feature matrix.

Examples

Fit a linear model with diagnostics

result = tool_fit_regression(
    engine=engine,
    query="SELECT price, sqft, bedrooms, age, neighborhood FROM housing",
    target_column="price",
    model_type="linear",
)

reg = result["regression_analysis"]
print(f"R² = {reg['r2']:.3f}, Adjusted R² = {reg['adj_r2']:.3f}")
print(f"RMSE = {reg['rmse']:.1f}")

# Feature coefficients
for feat, coef in reg["coefficients"].items():
    print(f"  {feat}: {coef:.3f} (p={reg['p_values'][feat]:.4f})")

# Residual diagnostics
res = result["residual_analysis"]
print("Residuals normal?", res["normality_test"]["shapiro_wilk"]["is_normal"])
print("Homoscedastic?", res["homoscedasticity_test"]["breusch_pagan"]["is_homoscedastic"])

Regularised regression with automatic alpha selection

from localdata_mcp.domains.regression_modeling import RegularizedRegressionTransformer
import pandas as pd

df = pd.read_sql("SELECT * FROM features", engine)
X = df.drop(columns=["target"]).values
y = df["target"].values
feature_names = df.drop(columns=["target"]).columns.tolist()

transformer = RegularizedRegressionTransformer(method="lasso", alpha="auto", cv=5)
transformer.fit(X, y, feature_names=feature_names)
result = transformer.get_result()

print(f"Best alpha: {result['best_alpha']:.5f}")
print("Non-zero features:", result["non_zero_features"])

Feature selection before model fitting

result = tool_fit_regression(
    engine=engine,
    query="SELECT * FROM wide_feature_table",
    target_column="outcome",
    model_type="linear",
    feature_selection=True,  # pipeline option: run automatic feature selection before fitting
)

sel = result["feature_selection"]
print(f"Selected {sel['n_selected']} of {sel['n_original']} features")
print("Selected:", sel["selected_features"])
print(f"R² retained: {sel['comparison']['r2_selected']:.3f}")

Evaluate overfitting on a hold-out set

# Fit on training data
train_result = tool_fit_regression(
    engine=engine,
    query="SELECT * FROM train_data",
    target_column="sales",
    model_type="ridge",
)

# Evaluate on test data
eval_result = tool_evaluate_model(
    engine=engine,
    query="SELECT * FROM test_data",
    target_column="sales",
    prediction_column="predicted_sales",  # column in test_data holding pre-computed predictions
)

# The overfitting check needs training metrics, so guard against it being absent
check = eval_result.get("overfitting_check")
if check:
    print(f"R² gap: {check['r2_gap']:.3f}")
    print(f"Likely overfitting: {check['likely_overfitting']}")

Full pipeline: feature selection → lasso → residual diagnostics

from localdata_mcp.domains.regression_modeling import RegressionModelingPipeline

pipeline = RegressionModelingPipeline(
    model_type="lasso",
    cross_validation=True,
    residual_analysis=True,
    feature_selection=True,
)
pipeline.fit(X_train, y_train, feature_names=feature_names)
results = pipeline.get_results()

# Report
print("AIC:", results["regression_analysis"]["aic"])
outliers = results["residual_analysis"]["outliers"]
print(f"Potential outliers at indices: {outliers}")
cooks = results["residual_analysis"]["cooks_distance"]
high_influence = [i for i, c in enumerate(cooks) if c is not None and c > 4/len(y_train)]
print(f"High-influence observations: {high_influence}")