Regression and Modeling Domain

Overview

The regression and modeling domain fits regression models, evaluates their performance, and diagnoses their residuals. Use it when you need to quantify the relationship between a continuous outcome and one or more predictor variables, or when you need to predict numeric values from a set of features.

When to use this domain:

  • Estimating the effect of one or more variables on a continuous outcome

  • Predicting numeric values from structured features

  • Selecting the most informative features from a large feature set

  • Checking whether model assumptions (linearity, homoscedasticity, independence) hold

  • Comparing in-sample versus out-of-sample performance to detect overfitting

Source: src/localdata_mcp/domains/regression_modeling/


Available Analyses

| Method | Class | Description |
| --- | --- | --- |
| Ordinary least squares | LinearRegressionTransformer | Standard linear regression with full statistical diagnostics |
| Ridge regression | RegularizedRegressionTransformer | L2 regularisation; shrinks coefficients without eliminating them |
| Lasso regression | RegularizedRegressionTransformer | L1 regularisation; performs automatic feature selection |
| Elastic net | RegularizedRegressionTransformer | L1+L2 combination; balances Ridge and Lasso properties |
| Logistic regression | LogisticRegressionTransformer | Binary or multi-class classification |
| Polynomial regression | PolynomialRegressionTransformer | Non-linear relationships via polynomial feature expansion |
| Model-based feature selection | FeatureSelectionTransformer | Select features via Lasso coefficient shrinkage |
| Recursive feature elimination | FeatureSelectionTransformer | Iteratively remove least important features (RFE / RFECV) |
| Univariate feature selection | FeatureSelectionTransformer | F-statistic based selection (SelectKBest) |
| Residual normality tests | ResidualAnalysisTransformer | Shapiro-Wilk, Anderson-Darling, Jarque-Bera |
| Homoscedasticity tests | ResidualAnalysisTransformer | Breusch-Pagan and White tests |
| Autocorrelation test | ResidualAnalysisTransformer | Durbin-Watson statistic |
| Influence measures | ResidualAnalysisTransformer | Leverage, Cook’s distance, studentised residuals |
| Cross-validation | RegressionModelingPipeline | K-fold R² and RMSE |


MCP Tool Reference

The domain exposes two primary MCP tools via src/localdata_mcp/datascience_tools.py.

tool_fit_regression

Fit a regression model on data retrieved from a SQL query, with optional residual analysis and cross-validation.

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| engine | Engine | required | SQLAlchemy engine from an active connection |
| query | str | required | SQL query returning features and target column |
| target_column | str | required | Name of the numeric outcome column |
| feature_columns | list[str] | None | Feature columns; all non-target columns used if None |
| model_type | str | "linear" | "linear", "ridge", "lasso", "elastic_net", "logistic", "polynomial" |
| regularization | str | None | Override regularisation method (alternative to model_type) |
| max_rows | int | None | Row cap; when None, the default cap of 500,000 applies |

Underlying RegressionModelingPipeline also accepts:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| cross_validation | bool | True | Perform K-fold cross-validation |
| residual_analysis | bool | True | Run residual diagnostics after fitting |
| feature_selection | bool | False | Run automatic feature selection before fitting |
| preprocessing | str | "auto" | Preprocessing level: "minimal", "auto", "comprehensive" |

Returns: dict with keys:

  • model_type — model type fitted

  • regression_analysis — coefficients, standard errors, p-values, R², adjusted R², RMSE, MAE, AIC, BIC

  • residual_analysis — normality tests, homoscedasticity tests, autocorrelation, outlier indices, Cook’s distances

  • feature_selection — selected features and importance scores (when enabled)

  • pipeline_config — configuration settings used


tool_evaluate_model

Evaluate a fitted model’s performance on held-out data.

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| engine | Engine | required | SQLAlchemy engine |
| query | str | required | SQL query returning test data |
| target_column | str | required | Ground-truth outcome column |
| prediction_column | str | required | Column containing model predictions |
| model_type | str | "regression" | Model type for interpretation context |
| max_rows | int | None | Row cap |

For direct use of evaluate_model_performance from the domain:

evaluation = evaluate_model_performance(
    model=fitted_model,
    X_test=X_test,
    y_test=y_test,
    X_train=X_train,   # optional, enables overfitting check
    y_train=y_train,
)

Returns: dict with keys:

  • test_metrics — R², MSE, RMSE, MAE, explained variance

  • train_metrics — same metrics for training data (when provided)

  • overfitting_check — R² gap between train and test; likely_overfitting=True when gap > 0.1

  • test_predictions — model predictions as list

  • test_residuals — prediction errors as list


Method Details

Linear Regression (OLS)

Fits ordinary least squares with sklearn.linear_model.LinearRegression and computes full statistical diagnostics via statsmodels.

Outputs include:

  • Coefficients with standard errors, t-statistics, and p-values for each feature

  • R² and adjusted R²

  • F-statistic for overall model significance

  • AIC and BIC for model comparison
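As a rough illustration of where these outputs come from, the sketch below computes them with NumPy alone. It is not the code of LinearRegressionTransformer: the function name ols_summary and the returned key names are hypothetical, and p-values are omitted because they need a t-distribution CDF (for example from SciPy).

```python
import numpy as np

def ols_summary(X, y):
    """Fit OLS with an intercept and return basic inferential statistics."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])    # prepend intercept column
    k = p + 1                                     # number of estimated coefficients
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    rss = float(resid @ resid)                    # residual sum of squares
    tss = float(((y - y.mean()) ** 2).sum())      # total sum of squares
    sigma2 = rss / (n - k)                        # unbiased residual variance
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(design.T @ design)))
    r2 = 1.0 - rss / tss
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k)
    f_stat = ((tss - rss) / p) / sigma2           # overall model significance
    # Gaussian log-likelihood evaluated at the ML variance estimate rss / n
    llf = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1.0)
    return {
        "coef": beta, "std_err": std_err, "t": beta / std_err,
        "r2": r2, "adj_r2": adj_r2, "f_stat": f_stat,
        "aic": -2 * llf + 2 * k, "bic": -2 * llf + k * np.log(n),
    }

# Synthetic data with known coefficients (intercept 1.0, slopes 2.0 and -0.5)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200)

summary = ols_summary(X, y)
print("coefficients:", np.round(summary["coef"], 2))
print(f"R² = {summary['r2']:.3f}, adjusted R² = {summary['adj_r2']:.3f}")
```

Adjusted R² never exceeds R², and BIC penalises each coefficient by log(n) rather than 2, which is why BIC favours smaller models once n is larger than about 8.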

Key parameters of LinearRegressionTransformer:

| Parameter | Default | Description |
| --- | --- | --- |
| fit_intercept | True | Include intercept term |
| include_diagnostics | True | Run full statsmodels diagnostics |
| alpha | 0.05 | Significance level for tests |


Regularised Regression

All three variants use cross-validation to select the optimal regularisation strength when alpha="auto" (default).

Ridge: Penalises the sum of squared coefficients (L2). All features remain in the model; coefficients shrink toward zero. Use when you want to reduce variance without eliminating predictors.

Lasso: Penalises the sum of absolute coefficients (L1). Drives some coefficients exactly to zero, performing automatic feature selection. Use when you suspect many irrelevant features.

Elastic Net: Combines L1 and L2 penalties. The l1_ratio parameter controls the mix (0 = Ridge, 1 = Lasso). Use when features are correlated and Lasso tends to arbitrarily drop one from a correlated group.
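The difference between shrinking coefficients (L2) and zeroing them (L1) can be seen in a small NumPy sketch. This is an illustration only, not the transformer's implementation: ridge_fit and lasso_fit are hypothetical helpers, the ridge solution uses the closed form, and the lasso uses plain coordinate descent with soft-thresholding on centred data.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution: minimises ||y - Xb||^2 + alpha * ||b||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def lasso_fit(X, y, alpha, n_iter=200):
    """Coordinate descent for (1 / 2n) * ||y - Xb||^2 + alpha * ||b||_1."""
    n_samples, n_features = X.shape
    coef = np.zeros(n_features)
    col_scale = (X ** 2).sum(axis=0) / n_samples
    for _ in range(n_iter):
        for j in range(n_features):
            # Residual with feature j's current contribution removed
            partial = y - X @ coef + X[:, j] * coef[j]
            rho = X[:, j] @ partial / n_samples
            # Soft-thresholding sets weakly correlated coefficients exactly to zero
            coef[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_scale[j]
    return coef

# Only the first of three features influences the target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
X -= X.mean(axis=0)                       # centre so no intercept is needed
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)
y -= y.mean()

ridge_weak = ridge_fit(X, y, alpha=1.0)
ridge_strong = ridge_fit(X, y, alpha=100.0)
lasso_coef = lasso_fit(X, y, alpha=0.5)

print("ridge, alpha=1:  ", np.round(ridge_weak, 3))
print("ridge, alpha=100:", np.round(ridge_strong, 3))
print("lasso, alpha=0.5:", np.round(lasso_coef, 3))
```

Ridge keeps every coefficient non-zero and only reduces their magnitude as alpha grows, while the lasso removes the two irrelevant features entirely.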

Key parameters of RegularizedRegressionTransformer:

| Parameter | Default | Description |
| --- | --- | --- |
| method | "ridge" | "ridge", "lasso", "elastic_net" |
| alpha | "auto" | Regularisation strength; "auto" uses CV |
| l1_ratio | 0.5 | ElasticNet L1/L2 mix (only for elastic_net) |
| cv | 5 | Cross-validation folds for hyperparameter search |
| max_iter | 1000 | Solver iteration limit |


Logistic Regression

LogisticRegressionTransformer fits a regularised logistic regression for binary or multi-class classification. It reports coefficients, odds ratios, and classification metrics (accuracy, precision, recall, F1, AUC-ROC).
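An odds ratio is the exponential of a logistic regression coefficient: it gives the multiplicative change in the odds of the positive class for a one-unit increase in the feature. The snippet below shows that relationship with made-up coefficient values; the feature names and numbers are hypothetical and do not come from the transformer.

```python
import math

# Hypothetical fitted log-odds coefficients, for illustration only
coefficients = {"age": 0.05, "is_member": 0.693, "distance_km": -0.2}

# exp(coefficient) converts a log-odds change into an odds multiplier
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.2f} ({direction} the odds per unit increase)")
```

A ratio above 1 means the feature increases the odds, a ratio below 1 means it decreases them, and a ratio of exactly 1 corresponds to a zero coefficient.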


Polynomial Regression

PolynomialRegressionTransformer expands features to polynomial terms up to a specified degree, then fits OLS. Use it to capture non-linear relationships in low-dimensional data. Beware of overfitting at high degrees: training error keeps falling as the degree grows, while error on new data eventually rises.
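The expand-then-fit idea can be sketched for a single feature with NumPy. This is a simplified stand-in rather than the transformer itself: polynomial_rmse and make_data are hypothetical helpers, and the data are synthetic with a quadratic ground truth.

```python
import numpy as np

def polynomial_rmse(x_train, y_train, x_test, y_test, degree):
    """Expand x into powers 0..degree, fit OLS, return (train RMSE, test RMSE)."""
    train_design = np.vander(x_train, degree + 1, increasing=True)
    test_design = np.vander(x_test, degree + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(train_design, y_train, rcond=None)

    def rmse(design, target):
        return float(np.sqrt(np.mean((design @ coef - target) ** 2)))

    return rmse(train_design, y_train), rmse(test_design, y_test)

rng = np.random.default_rng(1)

def make_data(n):
    """Quadratic signal plus Gaussian noise on the interval [-1, 1]."""
    x = rng.uniform(-1.0, 1.0, size=n)
    return x, 1.0 + 2.0 * x - 3.0 * x ** 2 + rng.normal(scale=0.2, size=n)

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

scores = {d: polynomial_rmse(x_train, y_train, x_test, y_test, d) for d in (1, 2, 10)}
for degree, (train_rmse, test_rmse) in scores.items():
    print(f"degree {degree:2d}: train RMSE {train_rmse:.3f}, test RMSE {test_rmse:.3f}")
```

Training RMSE can only decrease as the degree rises because each model nests the previous one, so the test RMSE is the number to watch when choosing the degree.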


Feature Selection

Four methods are available via FeatureSelectionTransformer:

Model-based (method="model_based"): Uses sklearn.feature_selection.SelectFromModel with a LassoCV estimator. Features with near-zero Lasso coefficients are dropped.

Recursive Feature Elimination (method="rfe"): Iteratively fits the model and removes the least important feature. The number of features to keep is set by k.

RFECV (method="rfecv"): Like RFE but selects k automatically via cross-validation. Reports the optimal number of features and cross-validation scores.

Univariate (method="univariate"): Ranks features by F-statistic from SelectKBest(f_regression). Fast but ignores feature interactions.
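To make the univariate option concrete, the sketch below ranks features by the F-statistic of a one-feature regression, which is what f_regression computes from the squared correlation between each feature and the target. The helper names are hypothetical and the data are synthetic; in practice SelectKBest performs the ranking and selection.

```python
import numpy as np

def univariate_f_scores(X, y):
    """F-statistic of a simple regression of y on each feature separately."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    dof = len(y) - 2                       # residual degrees of freedom
    return corr ** 2 / (1.0 - corr ** 2) * dof

def select_k_best(X, y, k):
    """Return the sorted indices of the k highest-scoring features and all scores."""
    scores = univariate_f_scores(X, y)
    top = np.argsort(scores)[::-1][:k]
    return sorted(top.tolist()), scores

# Six features, of which only columns 0 and 3 drive the target
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 6))
y = 3.0 * X[:, 0] + 2.0 * X[:, 3] + rng.normal(size=300)

selected, scores = select_k_best(X, y, k=2)
print("F-scores:", np.round(scores, 1))
print("selected feature indices:", selected)
```

Because each feature is scored in isolation, a feature that matters only in combination with another would receive a low score here, which is the limitation noted above.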

Key parameters of FeatureSelectionTransformer:

| Parameter | Default | Description |
| --- | --- | --- |
| method | "model_based" | Selection method |
| k | "all" | Number of features to select (RFE, univariate) |
| cv | 5 | Cross-validation folds (RFECV) |
| scoring | "r2" | Evaluation metric for RFECV |

Returns include: selected_features, feature_importance, R² before and after selection, and feature reduction ratio.


Residual Analysis

ResidualAnalysisTransformer performs full residual diagnostics automatically when residual_analysis=True in the pipeline.

Normality tests:

  • Shapiro-Wilk (n < 5,000): most powerful for small samples

  • Anderson-Darling: compared against critical values at 5% significance

  • Jarque-Bera: tests whether skewness and kurtosis match the normal distribution

Homoscedasticity tests:

  • Breusch-Pagan: regresses squared residuals on features; significant result indicates heteroscedasticity

  • White: tests for non-linear forms of heteroscedasticity
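The Breusch-Pagan idea of regressing squared residuals on the features can be sketched as follows. The code computes the n·R² (Koenker) form of the LM statistic with NumPy and compares it with 3.841, the 5% critical value of a chi-squared distribution with one degree of freedom. The helper names are hypothetical, and the transformer may compute the test differently (for example via statsmodels).

```python
import numpy as np

def ols_residuals(X, y):
    """Residuals from an OLS fit with intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef

def breusch_pagan_lm(X, residuals):
    """LM statistic: n times the R² of squared residuals regressed on X."""
    n = len(residuals)
    design = np.column_stack([np.ones(n), X])
    squared = residuals ** 2
    coef, *_ = np.linalg.lstsq(design, squared, rcond=None)
    fitted = design @ coef
    r2 = 1.0 - ((squared - fitted) ** 2).sum() / ((squared - squared.mean()) ** 2).sum()
    return n * r2

CHI2_CRIT_5PCT_DF1 = 3.841   # 5% critical value, one feature => 1 degree of freedom

rng = np.random.default_rng(3)
x = rng.uniform(1.0, 10.0, size=500)
X = x.reshape(-1, 1)
y_hom = 2.0 * x + rng.normal(scale=2.0, size=500)   # constant noise variance
y_het = 2.0 * x + rng.normal(size=500) * 0.5 * x    # noise grows with x

lm_hom = breusch_pagan_lm(X, ols_residuals(X, y_hom))
lm_het = breusch_pagan_lm(X, ols_residuals(X, y_het))
print(f"homoscedastic data:   LM = {lm_hom:.2f}")
print(f"heteroscedastic data: LM = {lm_het:.2f} (reject if > {CHI2_CRIT_5PCT_DF1})")
```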

Autocorrelation:

  • Durbin-Watson statistic: values near 2 indicate no autocorrelation; < 1.5 suggests positive, > 2.5 suggests negative autocorrelation
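The statistic is the sum of squared successive differences of the residuals divided by their sum of squares. A minimal pure-Python sketch, using the 1.5 and 2.5 thresholds quoted above (the function names are hypothetical):

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the residual sum of squares."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

def interpret(dw):
    """Apply the rule-of-thumb thresholds from this section."""
    if dw < 1.5:
        return "positive autocorrelation suspected"
    if dw > 2.5:
        return "negative autocorrelation suspected"
    return "no strong autocorrelation"

trending = [1, 1, 1, 1, -1, -1, -1, -1]      # neighbours tend to share a sign
alternating = [1, -1, 1, -1, 1, -1, 1, -1]   # neighbours tend to flip sign

dw_trending = durbin_watson(trending)        # 4 / 8 = 0.5
dw_alternating = durbin_watson(alternating)  # 28 / 8 = 3.5
print(dw_trending, interpret(dw_trending))
print(dw_alternating, interpret(dw_alternating))
```

The statistic is only meaningful when the rows have a natural order, such as time; shuffling the rows changes its value.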

Influence measures:

  • Leverage: diagonal of the hat matrix; high leverage points have unusual feature values

  • Cook’s distance: measures overall influence on all fitted values; values > 4/n are flagged

  • Studentised residuals: standardised by leave-one-out standard error; |value| > 2.5 are flagged as outliers
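The three influence measures can be derived from a single OLS fit. The NumPy sketch below is an illustration under standard textbook formulas, not the transformer's code: influence_measures is a hypothetical helper, and the data contain one deliberately planted outlier at index 0.

```python
import numpy as np

def influence_measures(X, y):
    """Return leverage, Cook's distance and externally studentised residuals."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    k = design.shape[1]                        # parameters including intercept
    q, _ = np.linalg.qr(design)                # hat matrix is Q @ Q.T
    leverage = (q ** 2).sum(axis=1)            # its diagonal
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    s2 = resid @ resid / (n - k)
    internal = resid / np.sqrt(s2 * (1.0 - leverage))
    cooks = internal ** 2 * leverage / (k * (1.0 - leverage))
    # Leave-one-out (external) studentisation derived from the internal version
    external = internal * np.sqrt((n - k - 1) / (n - k - internal ** 2))
    return leverage, cooks, external

rng = np.random.default_rng(5)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)
x[0], y[0] = 4.0, 18.0                         # planted point: unusual x and far off the line

leverage, cooks, studentised = influence_measures(x.reshape(-1, 1), y)
flagged_cooks = np.flatnonzero(cooks > 4 / len(y)).tolist()
flagged_resid = np.flatnonzero(np.abs(studentised) > 2.5).tolist()
print("Cook's distance > 4/n at:", flagged_cooks)
print("|studentised residual| > 2.5 at:", flagged_resid)
```

The leverages always sum to the number of fitted parameters, which gives a quick sanity check on the hat-matrix diagonal.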


Model Evaluation Metrics

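| Metric | Range | Better when |
| --- | --- | --- |
| R² | ≤ 1 (0 – 1 in-sample with an intercept; can be negative on held-out data) | Higher |
| Adjusted R² | ≤ R² | Higher (penalises extra features) |
| RMSE | ≥ 0 | Lower (same units as target) |
| MAE | ≥ 0 | Lower (less sensitive to outliers than RMSE) |
| AIC | any | Lower (model comparison) |
| BIC | any | Lower (stronger penalty for complexity) |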

The overfitting check in evaluate_model_performance flags a model when its train R² exceeds its test R² by more than 0.1. An MSE ratio (test / train) above 1.5 is a secondary signal.
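The two rules can be sketched in plain Python. This is an assumption-laden illustration rather than the body of evaluate_model_performance: the helper names and the mse_ratio and mse_ratio_flag keys are hypothetical, while r2_gap and likely_overfitting mirror the keys named in this document.

```python
def r2_and_mse(y_true, y_pred):
    """Return (R², MSE) for paired lists of observed and predicted values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    tss = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - rss / tss, rss / n

def overfitting_check(y_train, pred_train, y_test, pred_test,
                      r2_gap_threshold=0.1, mse_ratio_threshold=1.5):
    """Compare train and test fit using the thresholds quoted above."""
    train_r2, train_mse = r2_and_mse(y_train, pred_train)
    test_r2, test_mse = r2_and_mse(y_test, pred_test)
    r2_gap = train_r2 - test_r2
    # Guard against a perfect training fit, where the ratio is undefined
    mse_ratio = test_mse / train_mse if train_mse > 0 else float("inf")
    return {
        "r2_gap": r2_gap,
        "likely_overfitting": r2_gap > r2_gap_threshold,
        "mse_ratio": mse_ratio,                        # hypothetical key
        "mse_ratio_flag": mse_ratio > mse_ratio_threshold,  # hypothetical key
    }

# Toy example: near-perfect on training data, poor on test data
check = overfitting_check(
    y_train=[1, 2, 3, 4], pred_train=[1.1, 1.9, 3.1, 3.9],
    y_test=[1, 2, 3, 4], pred_test=[2, 1, 4, 3],
)
print(f"R² gap: {check['r2_gap']:.3f}, likely overfitting: {check['likely_overfitting']}")
print(f"MSE ratio: {check['mse_ratio']:.1f}, flagged: {check['mse_ratio_flag']}")
```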


Composition

| Next step | Purpose |
| --- | --- |
| statistical_analysis | Validate model assumptions; test residual normality and correlation between residuals and features |
| pattern_recognition | Identify clusters in residuals that may indicate omitted subgroup structure |
| time_series | Use fitted regression as part of a decomposition or as a feature in forecasting |
| business_intelligence | Translate model coefficients into business impact estimates |

Pipeline results feed the statistical_analysis tools: supply the residuals array and the feature matrix from the pipeline output as their inputs.


Examples

Fit a linear model with diagnostics

result = tool_fit_regression(
    engine=engine,
    query="SELECT price, sqft, bedrooms, age, neighborhood FROM housing",
    target_column="price",
    model_type="linear",
)

reg = result["regression_analysis"]
print(f"R² = {reg['r2']:.3f}, Adjusted R² = {reg['adj_r2']:.3f}")
print(f"RMSE = {reg['rmse']:.1f}")

# Feature coefficients
for feat, coef in reg["coefficients"].items():
    print(f"  {feat}: {coef:.3f} (p={reg['p_values'][feat]:.4f})")

# Residual diagnostics
res = result["residual_analysis"]
print("Residuals normal?", res["normality_test"]["shapiro_wilk"]["is_normal"])
print("Homoscedastic?", res["homoscedasticity_test"]["breusch_pagan"]["is_homoscedastic"])

Regularised regression with automatic alpha selection

import pandas as pd

from localdata_mcp.domains.regression_modeling import RegularizedRegressionTransformer

# engine: an existing SQLAlchemy engine for the source database
df = pd.read_sql("SELECT * FROM features", engine)
features = df.drop(columns=["target"])  # every column except the outcome
X = features.values
y = df["target"].values
feature_names = features.columns.tolist()

transformer = RegularizedRegressionTransformer(method="lasso", alpha="auto", cv=5)
transformer.fit(X, y, feature_names=feature_names)
result = transformer.get_result()

print(f"Best alpha: {result['best_alpha']:.5f}")
print("Non-zero features:", result["non_zero_features"])

Feature selection before model fitting

result = tool_fit_regression(
    engine=engine,
    query="SELECT * FROM wide_feature_table",
    target_column="outcome",
    model_type="linear",
    feature_selection=True,  # pipeline option; assumes the tool forwards it to RegressionModelingPipeline
)

sel = result["feature_selection"]
print(f"Selected {sel['n_selected']} of {sel['n_original']} features")
print("Selected:", sel["selected_features"])
print(f"R² retained: {sel['comparison']['r2_selected']:.3f}")

Evaluate overfitting on a hold-out set

# Fit on training data
train_result = tool_fit_regression(
    engine=engine,
    query="SELECT * FROM train_data",
    target_column="sales",
    model_type="ridge",
)

# Evaluate on test data
eval_result = tool_evaluate_model(
    engine=engine,
    query="SELECT * FROM test_data",
    target_column="sales",
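    prediction_column="predicted_sales",  # column of pre-computed predictions in test_data
)

# overfitting_check compares train and test R², so it may be absent
# when no training data is supplied
check = eval_result.get("overfitting_check")
if check:
    print(f"R² gap: {check['r2_gap']:.3f}")
    print(f"Likely overfitting: {check['likely_overfitting']}")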

Full pipeline: feature selection → lasso → residual diagnostics

from localdata_mcp.domains.regression_modeling import RegressionModelingPipeline

pipeline = RegressionModelingPipeline(
    model_type="lasso",
    cross_validation=True,
    residual_analysis=True,
    feature_selection=True,
)
# X_train, y_train: training arrays; feature_names: matching column names, prepared beforehand
pipeline.fit(X_train, y_train, feature_names=feature_names)
results = pipeline.get_results()

# Report
print("AIC:", results["regression_analysis"]["aic"])
outliers = results["residual_analysis"]["outliers"]
print(f"Potential outliers at indices: {outliers}")
cooks = results["residual_analysis"]["cooks_distance"]
high_influence = [i for i, c in enumerate(cooks) if c is not None and c > 4/len(y_train)]
print(f"High-influence observations: {high_influence}")