Regression and Modeling Domain
Overview
The regression and modeling domain fits regression models, evaluates their performance, and diagnoses their residuals. Use it when you need to quantify the relationship between a continuous outcome and one or more predictor variables, or when you need to predict numeric values from a set of features.
When to use this domain:
Estimating the effect of one or more variables on a continuous outcome
Predicting numeric values from structured features
Selecting the most informative features from a large feature set
Checking whether model assumptions (linearity, homoscedasticity, independence) hold
Comparing in-sample versus out-of-sample performance to detect overfitting
Source: src/localdata_mcp/domains/regression_modeling/
Available Analyses
| Method | Class | Description |
|---|---|---|
| Ordinary least squares | | Standard linear regression with full statistical diagnostics |
| Ridge regression | | L2 regularisation; shrinks coefficients without eliminating them |
| Lasso regression | | L1 regularisation; performs automatic feature selection |
| Elastic net | | L1+L2 combination; balances Ridge and Lasso properties |
| Logistic regression | | Binary or multi-class classification |
| Polynomial regression | | Non-linear relationships via polynomial feature expansion |
| Model-based feature selection | | Select features via Lasso coefficient shrinkage |
| Recursive feature elimination | | Iteratively remove least important features (RFE / RFECV) |
| Univariate feature selection | | F-statistic based selection (SelectKBest) |
| Residual normality tests | | Shapiro-Wilk, Anderson-Darling, Jarque-Bera |
| Homoscedasticity tests | | Breusch-Pagan and White tests |
| Autocorrelation test | | Durbin-Watson statistic |
| Influence measures | | Leverage, Cook’s distance, studentised residuals |
| Cross-validation | | K-fold R² and RMSE |
MCP Tool Reference
The domain exposes two primary MCP tools via src/localdata_mcp/datascience_tools.py.
tool_fit_regression
Fit a regression model on data retrieved from a SQL query, with optional residual analysis and cross-validation.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| | | required | SQLAlchemy engine from an active connection |
| | | required | SQL query returning features and target column |
| | | required | Name of the numeric outcome column |
| | | | Feature columns; all non-target columns used if None |
| | | | |
| | | | Override regularisation method (alternative to |
| | | | Row cap (default 500,000) |
The underlying RegressionModelingPipeline also accepts:
| Parameter | Type | Default | Description |
|---|---|---|---|
| | | | Perform K-fold cross-validation |
| | | | Run residual diagnostics after fitting |
| | | | Run automatic feature selection before fitting |
| | | | Preprocessing level: |
Returns: dict with keys:
model_type: model type fitted
regression_analysis: coefficients, standard errors, p-values, R², adjusted R², RMSE, MAE, AIC, BIC
residual_analysis: normality tests, homoscedasticity tests, autocorrelation, outlier indices, Cook’s distances
feature_selection: selected features and importance scores (when enabled)
pipeline_config: configuration settings used
tool_evaluate_model
Evaluate a fitted model’s performance on held-out data.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| | | required | SQLAlchemy engine |
| | | required | SQL query returning test data |
| | | required | Ground-truth outcome column |
| | | required | Column containing model predictions |
| | | | Model type for interpretation context |
| | | | Row cap |
For direct use of evaluate_model_performance from the domain:
evaluation = evaluate_model_performance(
    model=fitted_model,
    X_test=X_test,
    y_test=y_test,
    X_train=X_train,  # optional, enables overfitting check
    y_train=y_train,
)
Returns: dict with keys:
test_metrics: R², MSE, RMSE, MAE, explained variance
train_metrics: same metrics for training data (when provided)
overfitting_check: R² gap between train and test; likely_overfitting=True when gap > 0.1
test_predictions: model predictions as list
test_residuals: prediction errors as list
Method Details
Linear Regression (OLS)
Fits ordinary least squares with sklearn.linear_model.LinearRegression and computes full statistical diagnostics via statsmodels.
Outputs include:
Coefficients with standard errors, t-statistics, and p-values for each feature
R² and adjusted R²
F-statistic for overall model significance
AIC and BIC for model comparison
Key parameters of LinearRegressionTransformer:
| Parameter | Default | Description |
|---|---|---|
| | | Include intercept term |
| | | Run full statsmodels diagnostics |
| | | Significance level for tests |
Regularised Regression
All three variants use cross-validation to select the optimal regularisation strength when alpha="auto" (default).
Ridge: Penalises the sum of squared coefficients (L2). All features remain in the model; coefficients shrink toward zero. Use when you want to reduce variance without eliminating predictors.
Lasso: Penalises the sum of absolute coefficients (L1). Drives some coefficients exactly to zero, performing automatic feature selection. Use when you suspect many irrelevant features.
Elastic Net: Combines L1 and L2 penalties. The l1_ratio parameter controls the mix (0 = Ridge, 1 = Lasso). Use when features are correlated and Lasso would arbitrarily keep one feature from a correlated group and drop the rest.
Key parameters of RegularizedRegressionTransformer:
| Parameter | Default | Description |
|---|---|---|
| | | |
| | | Regularisation strength; |
| | | ElasticNet L1/L2 mix (only for elastic_net) |
| | | Cross-validation folds for hyperparameter search |
| | | Solver iteration limit |
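To illustrate how the three penalties behave, the sketch below uses scikit-learn's cross-validated estimators (RidgeCV, LassoCV, ElasticNetCV) on synthetic data where only three of ten features carry signal. It is an assumption that these estimators resemble what alpha="auto" does internally; the alpha grid and l1_ratio value are arbitrary choices for the example.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 200, 10
# Standardise features so the penalty treats all coefficients on the same scale.
X = StandardScaler().fit_transform(rng.normal(size=(n, p)))
# Only the first three features influence the target.
true_coef = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))
y = X @ true_coef + rng.normal(scale=1.0, size=n)

# Each estimator picks its regularisation strength by cross-validation.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
lasso = LassoCV(cv=5, max_iter=10_000, random_state=0).fit(X, y)
enet = ElasticNetCV(l1_ratio=0.5, cv=5, max_iter=10_000, random_state=0).fit(X, y)

for name, model in [("ridge", ridge), ("lasso", lasso), ("elastic_net", enet)]:
    n_zero = int(np.sum(model.coef_ == 0))
    print(f"{name}: alpha={model.alpha_:.4f}, coefficients exactly zero: {n_zero}")
```

Ridge keeps every coefficient non-zero, while Lasso and Elastic Net may set some of the noise coefficients exactly to zero, depending on the alpha that cross-validation selects.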
Logistic Regression
LogisticRegressionTransformer fits a regularised logistic regression for binary or multiclass classification. Reports coefficients, odds ratios, and classification metrics (accuracy, precision, recall, F1, AUC-ROC).
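As a rough illustration of the reported quantities, the sketch below fits scikit-learn's LogisticRegression on synthetic binary data and derives odds ratios by exponentiating the coefficients. It does not use LogisticRegressionTransformer itself, and the metrics are computed in-sample purely for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))
# Synthetic binary outcome: feature 0 raises the log-odds, feature 1 lowers them.
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# scikit-learn applies an L2 penalty by default (C=1.0).
clf = LogisticRegression().fit(X, y)

# exp(coefficient) is the multiplicative change in odds per unit increase in the feature.
odds_ratios = np.exp(clf.coef_[0])

pred = clf.predict(X)
proba = clf.predict_proba(X)[:, 1]
print("odds ratios:", odds_ratios.round(2))
print(f"accuracy={accuracy_score(y, pred):.3f}, "
      f"F1={f1_score(y, pred):.3f}, AUC={roc_auc_score(y, proba):.3f}")
```

An odds ratio above 1 means the feature increases the odds of the positive class; below 1 means it decreases them.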
Polynomial Regression
PolynomialRegressionTransformer expands features to polynomial terms up to a specified degree, then fits OLS. Use for capturing non-linear relationships in low-dimensional data. Beware overfitting at high degrees.
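The expand-then-fit idea can be sketched with scikit-learn's PolynomialFeatures followed by LinearRegression. This is an illustrative stand-in for PolynomialRegressionTransformer, using synthetic data with a known quadratic relationship.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(150, 1))
# The true relationship is quadratic, so a straight line underfits.
y = 0.5 * x[:, 0] ** 2 - x[:, 0] + rng.normal(scale=0.3, size=150)

linear = LinearRegression().fit(x, y)
# Expand x to [x, x^2], then fit ordinary least squares on the expanded features.
quadratic = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(x, y)

r2_linear = linear.score(x, y)
r2_quadratic = quadratic.score(x, y)
print(f"R² linear: {r2_linear:.3f}, R² degree 2: {r2_quadratic:.3f}")
```

Raising the degree further would keep increasing in-sample R², which is why the degree is best chosen with held-out data or cross-validation.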
Feature Selection
Four methods are available via FeatureSelectionTransformer:
Model-based (method="model_based"): Uses sklearn.feature_selection.SelectFromModel with a LassoCV estimator. Features with near-zero Lasso coefficients are dropped.
Recursive Feature Elimination (method="rfe"): Iteratively fits the model and removes the least important feature. The number of features to keep is set by k.
RFECV (method="rfecv"): Like RFE but selects k automatically via cross-validation. Reports the optimal number of features and cross-validation scores.
Univariate (method="univariate"): Ranks features by F-statistic from SelectKBest(f_regression). Fast but ignores feature interactions.
Key parameters of FeatureSelectionTransformer:
| Parameter | Default | Description |
|---|---|---|
| | | Selection method |
| | | Number of features to select (RFE, univariate) |
| | | Cross-validation folds (RFECV) |
| | | Evaluation metric for RFECV |
Returns include: selected_features, feature_importance, R² before and after selection, and feature reduction ratio.
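The selection methods described in this section map onto scikit-learn selectors, which the sketch below runs side by side on synthetic data. The wrapped estimators (LinearRegression for RFE and RFECV) and the value k=3 are assumptions made for the example, not confirmed defaults of FeatureSelectionTransformer.

```python
import numpy as np
from sklearn.feature_selection import RFE, RFECV, SelectFromModel, SelectKBest, f_regression
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(0)
n, p = 300, 8
X = rng.normal(size=(n, p))
# Only x0, x1 and x2 influence the target; x3..x7 are pure noise.
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + 2.0 * X[:, 2] + rng.normal(scale=1.0, size=n)
names = np.array([f"x{i}" for i in range(p)])

selectors = {
    "model_based": SelectFromModel(LassoCV(cv=5, random_state=0)),
    "rfe": RFE(LinearRegression(), n_features_to_select=3),
    "rfecv": RFECV(LinearRegression(), cv=5, scoring="r2"),
    "univariate": SelectKBest(f_regression, k=3),
}

selected = {}
for label, selector in selectors.items():
    selector.fit(X, y)
    # get_support() returns a boolean mask over the input columns.
    selected[label] = names[selector.get_support()].tolist()
    print(f"{label}: {selected[label]}")
```

RFE and univariate selection return exactly k features, whereas the model-based and RFECV selectors decide the count themselves and may retain a few weak features.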
Residual Analysis
ResidualAnalysisTransformer performs full residual diagnostics automatically when residual_analysis=True in the pipeline.
Normality tests:
Shapiro-Wilk (n < 5,000): most powerful for small samples
Anderson-Darling: compared against critical values at 5% significance
Jarque-Bera: tests whether skewness and kurtosis match the normal distribution
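A minimal sketch of the three normality tests using scipy.stats follows. The helper name normality_report and its return keys are hypothetical; the 5% level matches the text above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_resid = rng.normal(size=500)
# Centred exponential draws: mean zero but heavily right-skewed.
skewed_resid = rng.exponential(size=500) - 1.0


def normality_report(resid, alpha=0.05):
    """Return True per test when normality is NOT rejected at level alpha."""
    _, sw_p = stats.shapiro(resid)
    _, jb_p = stats.jarque_bera(resid)
    ad = stats.anderson(resid, dist="norm")
    # Anderson-Darling is judged against the tabulated 5% critical value.
    crit_5pct = ad.critical_values[np.isclose(ad.significance_level, 5.0)][0]
    return {
        "shapiro_wilk": bool(sw_p > alpha),
        "jarque_bera": bool(jb_p > alpha),
        "anderson_darling": bool(ad.statistic < crit_5pct),
    }


print("normal residuals:", normality_report(normal_resid))
print("skewed residuals:", normality_report(skewed_resid))
```

With large samples these tests reject even mild departures from normality, so the result is best read alongside a Q-Q plot.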
Homoscedasticity tests:
Breusch-Pagan: regresses squared residuals on features; significant result indicates heteroscedasticity
White: tests for non-linear forms of heteroscedasticity
Autocorrelation:
Durbin-Watson statistic: values near 2 indicate no autocorrelation; < 1.5 suggests positive, > 2.5 suggests negative autocorrelation
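The Durbin-Watson statistic is the sum of squared successive differences of the residuals divided by their sum of squares. A small NumPy sketch follows; the classify helper is hypothetical and simply applies the 1.5 and 2.5 cut-offs quoted above.

```python
import numpy as np


def durbin_watson(resid):
    """Sum of squared successive differences over the sum of squared residuals."""
    resid = np.asarray(resid, dtype=float)
    return float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))


def classify(dw):
    """Apply the rule-of-thumb thresholds from the text."""
    if dw < 1.5:
        return "positive"
    if dw > 2.5:
        return "negative"
    return "none"


rng = np.random.default_rng(0)
white = rng.normal(size=2000)  # independent residuals

# AR(1) series with coefficient 0.8: strong positive autocorrelation.
ar1 = np.empty_like(white)
ar1[0] = white[0]
for t in range(1, len(white)):
    ar1[t] = 0.8 * ar1[t - 1] + white[t]

# Sign flips every step: strong negative autocorrelation.
alternating = np.array([1.0, -1.0] * 50)

for name, series in [("white", white), ("ar1", ar1), ("alternating", alternating)]:
    dw = durbin_watson(series)
    print(f"{name}: DW={dw:.2f} -> {classify(dw)} autocorrelation")
```

The statistic only detects first-order autocorrelation, so it is meaningful when rows have a natural ordering such as time.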
Influence measures:
Leverage: diagonal of the hat matrix; high leverage points have unusual feature values
Cook’s distance: measures overall influence on all fitted values; values > 4/n are flagged
Studentised residuals: standardised by leave-one-out standard error; |value| > 2.5 are flagged as outliers
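The three influence measures can be computed from the hat matrix with plain NumPy. The sketch below plants one extreme observation in synthetic data and applies the 4/n and 2.5 thresholds from the text; variable names are illustrative, and forming the full n × n hat matrix is only practical for modest n.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
# Plant one observation with an extreme x value and a y far from the trend.
x[0], y[0] = 6.0, -5.0

X = np.column_stack([np.ones(n), x])  # design matrix with intercept
p = X.shape[1]

# Hat matrix H = X (X'X)^-1 X'; its diagonal holds the leverages.
H = X @ np.linalg.solve(X.T @ X, X.T)
leverage = np.diag(H)
resid = y - H @ y
rss = resid @ resid
mse = rss / (n - p)

# Cook's distance: squared residual scaled by leverage.
cooks = resid ** 2 / (p * mse) * leverage / (1 - leverage) ** 2

# Externally studentised residuals use the leave-one-out error variance.
loo_var = (rss - resid ** 2 / (1 - leverage)) / (n - p - 1)
studentised = resid / np.sqrt(loo_var * (1 - leverage))

flag_cooks = np.flatnonzero(cooks > 4 / n)
flag_resid = np.flatnonzero(np.abs(studentised) > 2.5)
print("Cook's distance flags:", flag_cooks)
print("Studentised residual flags:", flag_resid)
```

Flagged points are candidates for inspection, not automatic removal: a high-leverage point that follows the trend can have a small Cook's distance.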
Model Evaluation Metrics
| Metric | Range | Better when |
|---|---|---|
| R² | ≤ 1 (0 – 1 in-sample for OLS with an intercept; can be negative on held-out data) | Higher |
| Adjusted R² | ≤ R² | Higher (penalises extra features) |
| RMSE | ≥ 0 | Lower (same units as target) |
| MAE | ≥ 0 | Lower (more robust to outliers than RMSE) |
| AIC | any | Lower (model comparison) |
| BIC | any | Lower (stronger penalty for complexity) |
The overfitting check in evaluate_model_performance flags a model when the train R² exceeds the test R² by more than 0.1. A test-to-train MSE ratio above 1.5 is a secondary signal.
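The two rules can be expressed in a few lines of NumPy. This is a sketch of the logic as described here, not the body of evaluate_model_performance; the function names and the mse_ratio and mse_ratio_warning keys are hypothetical, while r2_gap and likely_overfitting follow the keys used elsewhere in this document.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """R², MSE, RMSE and MAE for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    mse = float(np.mean(resid ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "r2": 1.0 - float(np.sum(resid ** 2)) / ss_tot,
        "mse": mse,
        "rmse": mse ** 0.5,
        "mae": float(np.mean(np.abs(resid))),
    }


def overfitting_check(train, test, r2_gap_threshold=0.1, mse_ratio_threshold=1.5):
    """Compare train and test metrics using the thresholds quoted in the text."""
    r2_gap = train["r2"] - test["r2"]
    mse_ratio = test["mse"] / train["mse"]
    return {
        "r2_gap": r2_gap,
        "mse_ratio": mse_ratio,
        "likely_overfitting": bool(r2_gap > r2_gap_threshold),
        "mse_ratio_warning": bool(mse_ratio > mse_ratio_threshold),
    }


# Tiny hand-made example: near-perfect fit on train, much worse on test.
train = regression_metrics([1, 2, 3, 4], [1.1, 1.9, 3.1, 3.9])
test = regression_metrics([1, 2, 3, 4], [2, 2, 2, 4])
check = overfitting_check(train, test)
print(f"train R²={train['r2']:.3f}, test R²={test['r2']:.3f}")
print(check)
```

In this example the train R² is 0.992 and the test R² is 0.6, so both the R² gap and the MSE ratio signal overfitting.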
Composition
| Next step | Purpose |
|---|---|
| | Validate model assumptions; test residual normality and correlation between residuals and features |
| | Identify clusters in residuals that may indicate omitted subgroup structure |
| | Use fitted regression as part of a decomposition or as a feature in forecasting |
| | Translate model coefficients into business impact estimates |
The regression_analysis result dict from the pipeline can be passed directly to statistical_analysis tools by supplying the residuals array and feature matrix.
Examples
Fit a linear model with diagnostics
result = tool_fit_regression(
    engine=engine,
    query="SELECT price, sqft, bedrooms, age, neighborhood FROM housing",
    target_column="price",
    model_type="linear",
)
reg = result["regression_analysis"]
print(f"R² = {reg['r2']:.3f}, Adjusted R² = {reg['adj_r2']:.3f}")
print(f"RMSE = {reg['rmse']:.1f}")
# Feature coefficients
for feat, coef in reg["coefficients"].items():
    print(f"  {feat}: {coef:.3f} (p={reg['p_values'][feat]:.4f})")
# Residual diagnostics
res = result["residual_analysis"]
print("Residuals normal?", res["normality_test"]["shapiro_wilk"]["is_normal"])
print("Homoscedastic?", res["homoscedasticity_test"]["breusch_pagan"]["is_homoscedastic"])
Regularised regression with automatic alpha selection
from localdata_mcp.domains.regression_modeling import RegularizedRegressionTransformer
import pandas as pd
df = pd.read_sql("SELECT * FROM features", engine)
X = df.drop(columns=["target"]).values
y = df["target"].values
feature_names = df.drop(columns=["target"]).columns.tolist()
transformer = RegularizedRegressionTransformer(method="lasso", alpha="auto", cv=5)
transformer.fit(X, y, feature_names=feature_names)
result = transformer.get_result()
print(f"Best alpha: {result['best_alpha']:.5f}")
print("Non-zero features:", result["non_zero_features"])
Feature selection before model fitting
result = tool_fit_regression(
    engine=engine,
    query="SELECT * FROM wide_feature_table",
    target_column="outcome",
    model_type="linear",
    feature_selection=True,  # enable RFECV-based selection
)
sel = result["feature_selection"]
print(f"Selected {sel['n_selected']} of {sel['n_original']} features")
print("Selected:", sel["selected_features"])
print(f"R² retained: {sel['comparison']['r2_selected']:.3f}")
Evaluate overfitting on a hold-out set
# Fit on training data
train_result = tool_fit_regression(
    engine=engine,
    query="SELECT * FROM train_data",
    target_column="sales",
    model_type="ridge",
)
# Evaluate on test data
eval_result = tool_evaluate_model(
    engine=engine,
    query="SELECT * FROM test_data",
    target_column="sales",
    prediction_column="predicted_sales",  # pre-computed or use model directly
)
check = eval_result["overfitting_check"]
print(f"R² gap: {check['r2_gap']:.3f}")
print(f"Likely overfitting: {check['likely_overfitting']}")
Full pipeline: feature selection → lasso → residual diagnostics
from localdata_mcp.domains.regression_modeling import RegressionModelingPipeline
pipeline = RegressionModelingPipeline(
    model_type="lasso",
    cross_validation=True,
    residual_analysis=True,
    feature_selection=True,
)
pipeline.fit(X_train, y_train, feature_names=feature_names)
results = pipeline.get_results()
# Report
print("AIC:", results["regression_analysis"]["aic"])
outliers = results["residual_analysis"]["outliers"]
print(f"Potential outliers at indices: {outliers}")
cooks = results["residual_analysis"]["cooks_distance"]
# Flag observations whose Cook's distance exceeds the 4/n rule of thumb
threshold = 4 / len(y_train)
high_influence = [i for i, c in enumerate(cooks) if c is not None and c > threshold]
print(f"High-influence observations: {high_influence}")