# Regression and Modeling Domain

## Overview

The regression and modeling domain fits regression models, evaluates their performance, and diagnoses their residuals. Use it when you need to quantify the relationship between a continuous outcome and one or more predictor variables, or when you need to predict numeric values from a set of features.

**When to use this domain:**

- Estimating the effect of one or more variables on a continuous outcome
- Predicting numeric values from structured features
- Selecting the most informative features from a large feature set
- Checking whether model assumptions (linearity, homoscedasticity, independence) hold
- Comparing in-sample versus out-of-sample performance to detect overfitting

**Source:** `src/localdata_mcp/domains/regression_modeling/`

---

## Available Analyses

| Method | Class | Description |
|---|---|---|
| Ordinary least squares | `LinearRegressionTransformer` | Standard linear regression with full statistical diagnostics |
| Ridge regression | `RegularizedRegressionTransformer` | L2 regularisation; shrinks coefficients without eliminating them |
| Lasso regression | `RegularizedRegressionTransformer` | L1 regularisation; performs automatic feature selection |
| Elastic net | `RegularizedRegressionTransformer` | L1+L2 combination; balances Ridge and Lasso properties |
| Logistic regression | `LogisticRegressionTransformer` | Binary or multi-class classification |
| Polynomial regression | `PolynomialRegressionTransformer` | Non-linear relationships via polynomial feature expansion |
| Model-based feature selection | `FeatureSelectionTransformer` | Select features via Lasso coefficient shrinkage |
| Recursive feature elimination | `FeatureSelectionTransformer` | Iteratively remove least important features (RFE / RFECV) |
| Univariate feature selection | `FeatureSelectionTransformer` | F-statistic based selection (SelectKBest) |
| Residual normality tests | `ResidualAnalysisTransformer` | Shapiro-Wilk, Anderson-Darling, Jarque-Bera |
| Homoscedasticity tests | `ResidualAnalysisTransformer` | Breusch-Pagan and White tests |
| Autocorrelation test | `ResidualAnalysisTransformer` | Durbin-Watson statistic |
| Influence measures | `ResidualAnalysisTransformer` | Leverage, Cook's distance, studentised residuals |
| Cross-validation | `RegressionModelingPipeline` | K-fold R² and RMSE |

---

## MCP Tool Reference

The domain exposes two primary MCP tools via `src/localdata_mcp/datascience_tools.py`.

### `tool_fit_regression`

Fit a regression model on data retrieved from a SQL query, with optional residual analysis and cross-validation.

**Parameters:**

| Parameter | Type | Default | Description |
|---|---|---|---|
| `engine` | `Engine` | required | SQLAlchemy engine from an active connection |
| `query` | `str` | required | SQL query returning features and target column |
| `target_column` | `str` | required | Name of the numeric outcome column |
| `feature_columns` | `list[str]` | `None` | Feature columns; all non-target columns used if None |
| `model_type` | `str` | `"linear"` | `"linear"`, `"ridge"`, `"lasso"`, `"elastic_net"`, `"logistic"`, `"polynomial"` |
| `regularization` | `str` | `None` | Override regularisation method (alternative to `model_type`) |
| `max_rows` | `int` | `None` | Row cap (default 500,000) |

Underlying `RegressionModelingPipeline` also accepts:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `cross_validation` | `bool` | `True` | Perform K-fold cross-validation |
| `residual_analysis` | `bool` | `True` | Run residual diagnostics after fitting |
| `feature_selection` | `bool` | `False` | Run automatic feature selection before fitting |
| `preprocessing` | `str` | `"auto"` | Preprocessing level: `"minimal"`, `"auto"`, `"comprehensive"` |

**Returns:** `dict` with keys:

- `model_type` — model type fitted
- `regression_analysis` — coefficients, standard errors, p-values, R², adjusted R², RMSE, MAE, AIC, BIC
- `residual_analysis` — normality tests, homoscedasticity tests, autocorrelation, outlier indices, Cook's distances
- `feature_selection` — selected features and importance scores (when enabled)
- `pipeline_config` — configuration settings used

---

### `tool_evaluate_model`

Evaluate a fitted model's performance on held-out data.

**Parameters:**

| Parameter | Type | Default | Description |
|---|---|---|---|
| `engine` | `Engine` | required | SQLAlchemy engine |
| `query` | `str` | required | SQL query returning test data |
| `target_column` | `str` | required | Ground-truth outcome column |
| `prediction_column` | `str` | required | Column containing model predictions |
| `model_type` | `str` | `"regression"` | Model type for interpretation context |
| `max_rows` | `int` | `None` | Row cap |

For direct use of `evaluate_model_performance` from the domain:

```python
evaluation = evaluate_model_performance(
    model=fitted_model,
    X_test=X_test,
    y_test=y_test,
    X_train=X_train,   # optional, enables overfitting check
    y_train=y_train,
)
```

**Returns:** `dict` with keys:

- `test_metrics` — R², MSE, RMSE, MAE, explained variance
- `train_metrics` — same metrics for training data (when provided)
- `overfitting_check` — R² gap between train and test; `likely_overfitting=True` when gap > 0.1
- `test_predictions` — model predictions as list
- `test_residuals` — prediction errors as list

---

## Method Details

### Linear Regression (OLS)

Fits ordinary least squares with `sklearn.linear_model.LinearRegression` and computes full statistical diagnostics via `statsmodels`.

Outputs include:
- Coefficients with standard errors, t-statistics, and p-values for each feature
- R² and adjusted R²
- F-statistic for overall model significance
- AIC and BIC for model comparison

**Key parameters of `LinearRegressionTransformer`:**

| Parameter | Default | Description |
|---|---|---|
| `fit_intercept` | `True` | Include intercept term |
| `include_diagnostics` | `True` | Run full statsmodels diagnostics |
| `alpha` | `0.05` | Significance level for tests |

---

### Regularised Regression

All three variants use cross-validation to select the optimal regularisation strength when `alpha="auto"` (default).

**Ridge**: Penalises the sum of squared coefficients (L2). All features remain in the model; coefficients shrink toward zero. Use when you want to reduce variance without eliminating predictors.

**Lasso**: Penalises the sum of absolute coefficients (L1). Drives some coefficients exactly to zero, performing automatic feature selection. Use when you suspect many irrelevant features.

**Elastic Net**: Combines L1 and L2 penalties. The `l1_ratio` parameter controls the mix (0 = Ridge, 1 = Lasso). Use when features are correlated and Lasso tends to arbitrarily drop one from a correlated group.

**Key parameters of `RegularizedRegressionTransformer`:**

| Parameter | Default | Description |
|---|---|---|
| `method` | `"ridge"` | `"ridge"`, `"lasso"`, `"elastic_net"` |
| `alpha` | `"auto"` | Regularisation strength; `"auto"` uses CV |
| `l1_ratio` | `0.5` | ElasticNet L1/L2 mix (only for elastic_net) |
| `cv` | `5` | Cross-validation folds for hyperparameter search |
| `max_iter` | `1000` | Solver iteration limit |

---

### Logistic Regression

`LogisticRegressionTransformer` fits a regularised logistic regression for binary or multiclass classification. Reports coefficients, odds ratios, and classification metrics (accuracy, precision, recall, F1, AUC-ROC).

---

### Polynomial Regression

`PolynomialRegressionTransformer` expands features to polynomial terms up to a specified degree, then fits OLS. Use for capturing non-linear relationships in low-dimensional data. Beware overfitting at high degrees.

---

### Feature Selection

Three methods are available via `FeatureSelectionTransformer`:

**Model-based** (`method="model_based"`): Uses `sklearn.feature_selection.SelectFromModel` with a LassoCV estimator. Features with near-zero Lasso coefficients are dropped.

**Recursive Feature Elimination** (`method="rfe"`): Iteratively fits the model and removes the least important feature. The number of features to keep is set by `k`.

**RFECV** (`method="rfecv"`): Like RFE but selects k automatically via cross-validation. Reports the optimal number of features and cross-validation scores.

**Univariate** (`method="univariate"`): Ranks features by F-statistic from `SelectKBest(f_regression)`. Fast but ignores feature interactions.

**Key parameters of `FeatureSelectionTransformer`:**

| Parameter | Default | Description |
|---|---|---|
| `method` | `"model_based"` | Selection method |
| `k` | `"all"` | Number of features to select (RFE, univariate) |
| `cv` | `5` | Cross-validation folds (RFECV) |
| `scoring` | `"r2"` | Evaluation metric for RFECV |

**Returns** include: `selected_features`, `feature_importance`, R² before and after selection, and feature reduction ratio.

---

### Residual Analysis

`ResidualAnalysisTransformer` performs full residual diagnostics automatically when `residual_analysis=True` in the pipeline.

**Normality tests:**
- **Shapiro-Wilk** (n < 5,000): most powerful for small samples
- **Anderson-Darling**: compared against critical values at 5% significance
- **Jarque-Bera**: tests whether skewness and kurtosis match the normal distribution

**Homoscedasticity tests:**
- **Breusch-Pagan**: regresses squared residuals on features; significant result indicates heteroscedasticity
- **White**: tests for non-linear forms of heteroscedasticity

**Autocorrelation:**
- **Durbin-Watson** statistic: values near 2 indicate no autocorrelation; < 1.5 suggests positive, > 2.5 suggests negative autocorrelation

**Influence measures:**
- **Leverage**: diagonal of the hat matrix; high leverage points have unusual feature values
- **Cook's distance**: measures overall influence on all fitted values; values > 4/n are flagged
- **Studentised residuals**: standardised by leave-one-out standard error; |value| > 2.5 are flagged as outliers

---

### Model Evaluation Metrics

| Metric | Range | Better when |
|---|---|---|
| R² | 0 – 1 | Higher |
| Adjusted R² | < R² | Higher (penalises extra features) |
| RMSE | ≥ 0 | Lower (same units as target) |
| MAE | ≥ 0 | Lower (robust to outliers) |
| AIC | any | Lower (model comparison) |
| BIC | any | Lower (stronger penalty for complexity) |

The overfitting check in `evaluate_model_performance` flags when the train R² exceeds test R² by more than 0.1. The MSE ratio (test / train) above 1.5 is a secondary signal.

---

## Composition

| Next step | Purpose |
|---|---|
| `statistical_analysis` | Validate model assumptions; test residual normality and correlation between residuals and features |
| `pattern_recognition` | Identify clusters in residuals that may indicate omitted subgroup structure |
| `time_series` | Use fitted regression as part of a decomposition or as a feature in forecasting |
| `business_intelligence` | Translate model coefficients into business impact estimates |

The `regression_analysis` result dict from the pipeline can be passed directly to `statistical_analysis` tools by supplying the residuals array and feature matrix.

---

## Examples

### Fit a linear model with diagnostics

```python
result = tool_fit_regression(
    engine=engine,
    query="SELECT price, sqft, bedrooms, age, neighborhood FROM housing",
    target_column="price",
    model_type="linear",
)

reg = result["regression_analysis"]
print(f"R² = {reg['r2']:.3f}, Adjusted R² = {reg['adj_r2']:.3f}")
print(f"RMSE = {reg['rmse']:.1f}")

# Feature coefficients
for feat, coef in reg["coefficients"].items():
    print(f"  {feat}: {coef:.3f} (p={reg['p_values'][feat]:.4f})")

# Residual diagnostics
res = result["residual_analysis"]
print("Residuals normal?", res["normality_test"]["shapiro_wilk"]["is_normal"])
print("Homoscedastic?", res["homoscedasticity_test"]["breusch_pagan"]["is_homoscedastic"])
```

### Regularised regression with automatic alpha selection

```python
from localdata_mcp.domains.regression_modeling import RegularizedRegressionTransformer
import pandas as pd

df = pd.read_sql("SELECT * FROM features", engine)
X = df.drop(columns=["target"]).values
y = df["target"].values
feature_names = df.drop(columns=["target"]).columns.tolist()

transformer = RegularizedRegressionTransformer(method="lasso", alpha="auto", cv=5)
transformer.fit(X, y, feature_names=feature_names)
result = transformer.get_result()

print(f"Best alpha: {result['best_alpha']:.5f}")
print("Non-zero features:", result["non_zero_features"])
```

### Feature selection before model fitting

```python
result = tool_fit_regression(
    engine=engine,
    query="SELECT * FROM wide_feature_table",
    target_column="outcome",
    model_type="linear",
    feature_selection=True,  # enable RFECV-based selection
)

sel = result["feature_selection"]
print(f"Selected {sel['n_selected']} of {sel['n_original']} features")
print("Selected:", sel["selected_features"])
print(f"R² retained: {sel['comparison']['r2_selected']:.3f}")
```

### Evaluate overfitting on a hold-out set

```python
# Fit on training data
train_result = tool_fit_regression(
    engine=engine,
    query="SELECT * FROM train_data",
    target_column="sales",
    model_type="ridge",
)

# Evaluate on test data
eval_result = tool_evaluate_model(
    engine=engine,
    query="SELECT * FROM test_data",
    target_column="sales",
    prediction_column="predicted_sales",  # pre-computed or use model directly
)

check = eval_result["overfitting_check"]
print(f"R² gap: {check['r2_gap']:.3f}")
print(f"Likely overfitting: {check['likely_overfitting']}")
```

### Full pipeline: feature selection → lasso → residual diagnostics

```python
from localdata_mcp.domains.regression_modeling import RegressionModelingPipeline

pipeline = RegressionModelingPipeline(
    model_type="lasso",
    cross_validation=True,
    residual_analysis=True,
    feature_selection=True,
)
pipeline.fit(X_train, y_train, feature_names=feature_names)
results = pipeline.get_results()

# Report
print("AIC:", results["regression_analysis"]["aic"])
outliers = results["residual_analysis"]["outliers"]
print(f"Potential outliers at indices: {outliers}")
cooks = results["residual_analysis"]["cooks_distance"]
high_influence = [i for i, c in enumerate(cooks) if c is not None and c > 4/len(y_train)]
print(f"High-influence observations: {high_influence}")
```