Pattern Recognition Domain

Overview

The pattern recognition domain provides clustering, dimensionality reduction, and anomaly detection for unlabelled or partially labelled datasets. Use it when you need to discover natural groupings in data, visualise high-dimensional structure in two or three dimensions, or identify observations that deviate significantly from normal behaviour.

When to use this domain:

Segmenting customers, products, or events into natural groups
Reducing many correlated features to a compact representation before modelling
Visualising high-dimensional data for exploration
Flagging unusual observations for manual review or downstream investigation
Validating whether group labels correspond to real data structure

Source: src/localdata_mcp/domains/pattern_recognition/

Available Analyses

Method	Class	Description
K-means clustering	`ClusteringTransformer`	Partition-based clustering with automatic k selection
Hierarchical clustering	`ClusteringTransformer`	Agglomerative clustering with configurable linkage
DBSCAN	`ClusteringTransformer`	Density-based clustering; handles arbitrary shapes and noise
Gaussian mixture models	`ClusteringTransformer`	Soft probabilistic cluster assignments
Spectral clustering	`ClusteringTransformer`	Graph-based clustering for non-convex structures
PCA	`DimensionalityReductionTransformer`	Linear projection maximising variance
t-SNE	`DimensionalityReductionTransformer`	Non-linear neighbourhood-preserving embedding
UMAP	`DimensionalityReductionTransformer`	Fast non-linear embedding; often preserves more global structure than t-SNE
ICA	`DimensionalityReductionTransformer`	Independent component decomposition
LDA	`DimensionalityReductionTransformer`	Supervised linear projection maximising class separability
Isolation Forest	`AnomalyDetectionTransformer`	Anomaly detection via random feature splitting
One-Class SVM	`AnomalyDetectionTransformer`	Boundary-based anomaly detection
Local Outlier Factor (LOF)	`AnomalyDetectionTransformer`	Density-based local anomaly scoring
Statistical anomaly detection	`AnomalyDetectionTransformer`	Z-score and IQR based outlier flagging
Silhouette score	`PatternEvaluationTransformer`	Average inter-cluster separation vs intra-cluster cohesion
Davies-Bouldin index	`PatternEvaluationTransformer`	Average cluster similarity measure (lower is better)
Calinski-Harabasz score	`PatternEvaluationTransformer`	Variance ratio criterion (higher is better)
Adjusted Rand Index	`PatternEvaluationTransformer`	Cluster agreement with ground truth labels
Normalised Mutual Information	`PatternEvaluationTransformer`	Information-theoretic cluster agreement

MCP Tool Reference

The domain exposes three MCP tools via src/localdata_mcp/datascience_tools.py.

`tool_clustering`

Perform clustering on data retrieved from a SQL query.

Parameters:

Parameter	Type	Default	Description
`engine`	`Engine`	required	SQLAlchemy engine from an active connection
`query`	`str`	required	SQL query returning numeric feature columns
`columns`	`list[str]`	`None`	Columns to use as features; all numeric columns used if None
`method`	`str`	`"kmeans"`	Algorithm: `"kmeans"`, `"hierarchical"`, `"dbscan"`, `"gmm"`, `"spectral"`
`n_clusters`	`int`	`None`	Number of clusters; auto-selected if None
`max_rows`	`int`	`None`	Row cap; when None, the default cap of 500,000 rows applies

Underlying ClusteringTransformer also accepts:

Parameter	Type	Default	Description
`auto_k_selection`	`bool`	`True`	Search k_range for optimal k when n_clusters is None
`k_range`	`tuple`	`(2, 10)`	Range to search for k
`standardize`	`bool`	`True`	Standardise features before clustering
`random_state`	`int`	`42`	Reproducibility seed

Returns: dict with keys:

labels — cluster assignment per observation
n_clusters — number of clusters found
cluster_centers — centroid coordinates (K-means and GMM)
cluster_sizes — count per cluster
inertia — within-cluster sum of squares (K-means only)
evaluation — silhouette score, Davies-Bouldin index, Calinski-Harabasz score
k_scores — scores across k values when auto-selection ran

`tool_anomaly_detection`

Detect anomalous observations in data retrieved from a SQL query.

Parameters:

Parameter	Type	Default	Description
`engine`	`Engine`	required	SQLAlchemy engine
`query`	`str`	required	SQL query
`columns`	`list[str]`	`None`	Feature columns; all numeric columns used if None
`method`	`str`	`"isolation_forest"`	Algorithm: `"isolation_forest"`, `"one_class_svm"`, `"lof"`, `"statistical"`
`contamination`	`float`	`0.1`	Expected proportion of anomalies (0.0 – 0.5)
`max_rows`	`int`	`None`	Row cap

Underlying AnomalyDetectionTransformer also accepts:

Parameter	Type	Default	Description
`standardize`	`bool`	`True`	Standardise features before detection
`random_state`	`int`	`42`	Reproducibility seed

Returns: dict with keys:

anomaly_labels — 1 for normal, -1 for anomaly per observation
anomaly_scores — continuous anomaly score (lower = more anomalous for Isolation Forest)
n_anomalies — count of detected anomalies
n_samples — total observation count
anomaly_rate — fraction of observations flagged
anomaly_indices — indices of flagged observations
evaluation — precision, recall, F1 when ground truth provided; otherwise threshold and score distribution
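
The "statistical" method is listed above as z-score and IQR based flagging. As a rough illustration of those two rules, here is a standard-library sketch that does not use AnomalyDetectionTransformer. The function names, the z-score threshold of 3.0, and the IQR multiplier of 1.5 are assumptions chosen for the example (they are common conventions), not confirmed defaults of the transformer.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return indices whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical: nothing can be flagged
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]


def iqr_outliers(values, k=1.5):
    """Return indices outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# Made-up data: twenty ordinary readings followed by one extreme value.
values = [10, 11, 9, 10, 12, 11, 10, 9, 11, 10] * 2 + [50]

print("z-score flags:", zscore_outliers(values))
print("IQR flags:", iqr_outliers(values))
```

Both rules work one column at a time, so for multi-column data they would need to be applied per feature and the flags combined.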

`tool_dimensionality_reduction`

Reduce feature dimensions in data retrieved from a SQL query.

Parameters:

Parameter	Type	Default	Description
`engine`	`Engine`	required	SQLAlchemy engine
`query`	`str`	required	SQL query
`columns`	`list[str]`	`None`	Feature columns; all numeric columns used if None
`method`	`str`	`"pca"`	Algorithm: `"pca"`, `"tsne"`, `"umap"`, `"ica"`, `"lda"`
`n_components`	`int`	`2`	Number of output dimensions
`max_rows`	`int`	`None`	Row cap

Underlying DimensionalityReductionTransformer also accepts:

Parameter	Type	Default	Description
`preserve_variance`	`float`	`0.95`	Minimum variance to preserve for PCA auto-selection
`standardize`	`bool`	`True`	Standardise features before reduction
`random_state`	`int`	`42`	Reproducibility seed

Returns: dict with keys:

transformed_data — reduced-dimension representation
original_dimensions — input feature count
reduced_dimensions — output component count
explained_variance_ratio — per-component variance explained (PCA only)
cumulative_variance_explained — cumulative variance (PCA only)
evaluation — reconstruction error and variance preservation metrics
loadings — feature loadings per component (PCA, ICA)

Method Details

K-means Clustering

Partitions data into k clusters by minimising within-cluster sum of squared distances to centroids. Assumes roughly spherical, equal-sized clusters.

When to use: Large datasets, known approximate number of clusters, roughly convex cluster shapes.

Auto k-selection: When n_clusters=None and auto_k_selection=True, the transformer evaluates k across k_range using the silhouette score and selects the k with the highest value.
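
The selection rule described above amounts to taking the candidate k with the highest silhouette score. A minimal sketch, assuming a k_scores mapping shaped like the one in the tool's return value; the scores below are invented for illustration.

```python
# Hypothetical silhouette scores per candidate k (made-up numbers).
k_scores = {2: 0.41, 3: 0.52, 4: 0.63, 5: 0.58, 6: 0.49}

# Pick the k whose silhouette score is highest.
best_k = max(k_scores, key=k_scores.get)

print(f"Selected k = {best_k} (silhouette = {k_scores[best_k]:.2f})")
```

When several k values score almost equally, inspecting k_scores directly may be more informative than relying on the single best value.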

Key limitations: Sensitive to outliers; assumes equal cluster variances; does not handle non-convex shapes.

Hierarchical Clustering

Builds a dendrogram by iteratively merging the two closest clusters (agglomerative). Does not require specifying k in advance; the dendrogram can be cut at any level.

When to use: Exploratory analysis where the number of clusters is unknown; when a hierarchical structure in the data is expected.

Linkage methods (set via algorithm_params): "ward" (minimises within-cluster variance), "complete", "average", "single".
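
To make the difference between the linkage criteria concrete, the sketch below computes the distance between two small clusters under "single", "complete", and "average" linkage. The helper linkage_distance is a hypothetical name used only here, and the points are made up. "ward" is omitted because it merges on the increase in within-cluster variance rather than on a pairwise distance.

```python
import math
from itertools import product


def linkage_distance(cluster_a, cluster_b, linkage):
    """Distance between two clusters under a pairwise linkage criterion."""
    distances = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]
    if linkage == "single":      # closest pair of points
        return min(distances)
    if linkage == "complete":    # farthest pair of points
        return max(distances)
    if linkage == "average":     # mean over all cross-cluster pairs
        return sum(distances) / len(distances)
    raise ValueError(f"unsupported linkage: {linkage}")


cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

for name in ("single", "complete", "average"):
    print(name, linkage_distance(cluster_a, cluster_b, name))
```

Agglomerative clustering repeatedly merges the pair of clusters for which this distance is smallest, so the choice of linkage changes which clusters are merged first.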

DBSCAN

Groups points that are closely packed together and marks low-density points as noise (label -1). Does not require specifying k.

When to use: Arbitrarily shaped clusters; datasets with noise; unknown number of clusters.

Key parameters (passed via algorithm_params):

Parameter	Default	Description
`eps`	`0.5`	Maximum distance between two samples in the same neighbourhood
`min_samples`	`5`	Minimum samples in a neighbourhood to form a core point

Note: DBSCAN does not produce centroid coordinates. Cluster labels start from 0; -1 denotes noise.
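
The roles of eps and min_samples, and the origin of the -1 noise label, can be seen in a compact pure-Python version of the algorithm. This is an illustrative sketch with quadratic cost, not the implementation behind ClusteringTransformer; the function name dbscan and the sample points are assumptions made for the example. Following the usual convention, a point counts as its own neighbour.

```python
import math


def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    n = len(points)
    # Neighbourhood of each point: every point within eps, itself included.
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1        # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is core too: keep expanding
    return labels


points = [(0, 0), (0, 0.3), (0.3, 0), (5, 5), (5, 5.3), (5.3, 5), (10, 0)]
labels = dbscan(points, eps=0.5, min_samples=3)
print(labels)
```

The isolated point at (10, 0) has no neighbours within eps other than itself, so it never joins a cluster and keeps the -1 label.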

Gaussian Mixture Models (GMM)

Fits a mixture of Gaussian distributions and assigns each observation a soft probability of belonging to each component.

When to use: When clusters overlap or have different covariance structures; when you want probabilistic membership.

Key parameters (via algorithm_params):

Parameter	Default	Description
`covariance_type`	`"full"`	`"full"`, `"tied"`, `"diag"`, `"spherical"`
`max_iter`	`100`	Maximum number of EM iterations
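
Soft assignment means each observation receives a membership probability per component rather than a single label. The sketch below computes those probabilities (often called responsibilities) for a one-dimensional, two-component mixture with fixed parameters. The weights, means, and standard deviations are invented, and the helper names are hypothetical; a fitted model would estimate the parameters with EM instead.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def responsibilities(x, weights, means, stds):
    """Posterior probability that x belongs to each mixture component."""
    weighted = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(weighted)
    return [value / total for value in weighted]


# Made-up mixture: two equally weighted components centred at 0 and 4.
weights, means, stds = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]

for x in (0.0, 2.0, 3.5):
    probs = responsibilities(x, weights, means, stds)
    print(x, [round(p, 3) for p in probs])
```

An observation midway between the two means (x = 2.0) is split evenly between the components, while one close to a mean is assigned almost entirely to that component.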

PCA

Linear projection that finds orthogonal directions of maximum variance. Components are ordered by explained variance.

When to use: Pre-processing before other algorithms; visualisation; removing correlated features.

Auto component selection: When n_components=None, PCA selects the minimum number of components that preserve preserve_variance (default 95%) of total variance.
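
The auto-selection rule can be sketched as a cumulative sum over the per-component variance ratios. The function name components_for_variance and the ratios below are assumptions for illustration; the transformer's own implementation may differ in detail.

```python
from itertools import accumulate


def components_for_variance(explained_variance_ratio, preserve_variance=0.95):
    """Smallest number of leading components whose cumulative ratio meets the target."""
    cumulative = accumulate(explained_variance_ratio)
    for n_components, total in enumerate(cumulative, start=1):
        if total >= preserve_variance:
            return n_components
    # Target not reachable (e.g. ratios were truncated): keep every component.
    return len(explained_variance_ratio)


# Made-up ratios, already sorted from largest to smallest as PCA returns them.
ratios = [0.55, 0.25, 0.10, 0.06, 0.03, 0.01]

print(components_for_variance(ratios))
```

With these ratios the first three components reach about 90% of the variance, so a fourth is needed to pass the 95% target.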

Interpretation: explained_variance_ratio tells you how much information each component captures. loadings show which original features contribute to each component.

t-SNE

Non-linear dimensionality reduction that places similar observations close together in a 2D or 3D embedding. Optimised for visualisation.

When to use: Visualising cluster structure in high-dimensional data.

Key limitations: Does not preserve global distances reliably; stochastic (results vary across runs unless random_state is fixed); not suitable as a feature-reduction step before machine learning, partly because a fitted embedding cannot be applied to new observations (use PCA for that).

Key parameters (via algorithm_params):

Parameter	Default	Description
`perplexity`	`30`	Balances local vs global structure (typical range 5–50)
`n_iter`	`1000`	Optimisation iterations (scikit-learn 1.5 and later name this parameter `max_iter`)
`learning_rate`	`"auto"`	Step size for gradient descent

UMAP

Non-linear manifold learning that is typically faster than t-SNE and tends to retain more of the global structure while still preserving local neighbourhoods.

When to use: Large datasets; when t-SNE is too slow; when global structure matters for interpretation.

Key parameters (via algorithm_params):

Parameter	Default	Description
`n_neighbors`	`15`	Local neighbourhood size
`min_dist`	`0.1`	Minimum distance between embedded points
`metric`	`"euclidean"`	Distance metric

Dependency note: UMAP requires the umap-learn package. The transformer falls back gracefully if it is not installed.

Isolation Forest

Detects anomalies by randomly partitioning the feature space. Anomalies are isolated in fewer splits than normal points and therefore have shorter average path lengths.

When to use: General-purpose anomaly detection; high-dimensional data; no assumptions about the anomaly distribution.

Key parameters (via algorithm_params):

Parameter	Default	Description
`n_estimators`	`100`	Number of isolation trees
`max_samples`	`"auto"`	Samples drawn per tree; `"auto"` uses 256, or all rows if the dataset has fewer
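
The path-length idea can be demonstrated with a deliberately simplified, one-dimensional version of an isolation tree. This sketch is an assumption-laden illustration: it uses no subsampling, no score normalisation, and a single feature, so it is not the algorithm behind AnomalyDetectionTransformer. The function names and data are hypothetical.

```python
import random


def path_length(x, sample, rng, depth=0, max_depth=10):
    """Number of random splits needed to isolate x within sample (1-D)."""
    if len(sample) <= 1 or depth >= max_depth:
        return depth
    low, high = min(sample), max(sample)
    if low == high:
        return depth
    split = rng.uniform(low, high)
    # Keep only the side of the split that contains x.
    side = [v for v in sample if (v < split) == (x < split)]
    return path_length(x, side, rng, depth + 1, max_depth)


def mean_path_length(x, data, n_trees=200, seed=42):
    """Average isolation depth of x over many random trees."""
    rng = random.Random(seed)
    return sum(path_length(x, data, rng) for _ in range(n_trees)) / n_trees


# Fifty tightly packed values plus one far-away value.
data = [10 + 0.1 * i for i in range(50)] + [100.0]

outlier_depth = mean_path_length(100.0, data)
inlier_depth = mean_path_length(data[20], data)
print(f"outlier: {outlier_depth:.2f} splits, inlier: {inlier_depth:.2f} splits")
```

The far-away value is usually separated by the very first split, whereas a value inside the dense group needs many more splits, which is the signal Isolation Forest turns into an anomaly score.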

Local Outlier Factor (LOF)

Measures the local density deviation of each point relative to its k nearest neighbours. Points whose local density is substantially lower than that of their neighbours are flagged.

When to use: Detecting anomalies that are only outliers relative to their local neighbourhood; useful when data has clusters of varying density.

Key parameters (via algorithm_params):

Parameter	Default	Description
`n_neighbors`	`20`	Neighbourhood size
`metric`	`"minkowski"`	Distance metric

Clustering Quality Metrics

Silhouette score (−1 to 1): Measures how similar each observation is to its own cluster compared with the nearest other cluster. Higher is better. As a rule of thumb, values above 0.5 indicate reasonable separation and values above 0.7 indicate strong structure.
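
For readers who want to see the calculation behind the silhouette score, here is a small pure-Python sketch using Euclidean distance. The function name and the toy points are assumptions for illustration, and singleton clusters are scored as 0 by convention. It is meant for reading, not for large datasets.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])  # groups nearby points
bad = silhouette_score(points, [0, 1, 0, 1])   # groups distant points
print(f"good labelling: {good:.3f}, bad labelling: {bad:.3f}")
```

The labelling that groups nearby points scores close to 1, while the labelling that pairs distant points scores below 0.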

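Davies-Bouldin index (≥ 0): For each cluster, takes the worst-case ratio of within-cluster scatter to between-centroid separation against any other cluster, then averages over clusters. Lower is better. Values near zero indicate compact, well-separated clusters.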

Calinski-Harabasz score (≥ 0): Ratio of between-cluster to within-cluster dispersion. Higher is better. No absolute threshold; use for comparing k values.

Adjusted Rand Index (−0.5 to 1): Chance-corrected agreement between predicted labels and ground truth. 1 = perfect agreement; values near 0 = what random labelling would give; negative = worse than random.
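
The Adjusted Rand Index can be computed from pair counts in the contingency table of the two labellings. The sketch below is a standard-library illustration with a hypothetical function name; it assumes at least two observations and is not taken from PatternEvaluationTransformer.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected pair-counting agreement between two labellings."""
    n = len(labels_true)
    # Pairs that fall in the same cell, same true cluster, same predicted cluster.
    same_cell = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = same_true * same_pred / comb(n, 2)
    maximum = (same_true + same_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labellings form a single cluster
    return (same_cell - expected) / (maximum - expected)


truth = [0, 0, 1, 1]
print(adjusted_rand_index(truth, [1, 1, 0, 0]))  # same grouping, renamed labels
print(adjusted_rand_index(truth, [0, 1, 0, 1]))  # grouping that splits every true pair
```

Because the index counts pairs rather than matching label values, renaming the clusters does not change the result.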

Normalised Mutual Information (0 to 1): Information-theoretic agreement with ground truth. 1 = perfect; 0 = no mutual information.

Composition

Next step	Purpose
`statistical_analysis`	Test whether clusters differ significantly on key variables
`regression_modeling`	Use cluster labels as features or stratify model fitting per cluster
`time_series`	Detect time-series anomalies; compare with spatial anomalies
`business_intelligence`	Translate customer clusters into segments for targeting
`reduce_dimensions`	Reduce dimensions first, then cluster in the lower-dimensional space

Typical composition patterns:

Cluster then test: Run tool_clustering, then pass cluster labels to tool_hypothesis_test with cluster as the group variable to determine which features drive cluster separation.
Reduce then cluster: Run tool_dimensionality_reduction (PCA, n_components=10) to reduce noise, then run tool_clustering on the reduced representation.
Detect then investigate: Run tool_anomaly_detection, extract anomaly indices, then query those rows back from the database for detailed review.
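
One way to carry out the detect-then-investigate pattern is to keep a stable key alongside the features and map flagged positions back to it. The sketch below uses the standard-library sqlite3 module as a stand-in for the SQLAlchemy engine, a hypothetical transactions table, and a hand-written result dict. It assumes that anomaly_indices are zero-based positions in the query result, which is not confirmed here and should be checked against the tool's actual output.

```python
import sqlite3

# In-memory database with a hypothetical transactions table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO transactions (id, amount) VALUES (?, ?)",
    [(101, 12.5), (102, 9.9), (103, 4800.0), (104, 15.2)],
)

# Re-run the detector's query with a stable ORDER BY and keep the primary keys.
rows = conn.execute("SELECT id, amount FROM transactions ORDER BY id").fetchall()
ids = [row[0] for row in rows]

# Hand-written stand-in for the detector output (assumed zero-based positions).
result = {"anomaly_indices": [2]}

# Map positions back to primary keys, then fetch the full rows for review.
flagged_ids = [ids[i] for i in result["anomaly_indices"]]
placeholders = ", ".join("?" for _ in flagged_ids)
flagged_rows = conn.execute(
    f"SELECT * FROM transactions WHERE id IN ({placeholders}) ORDER BY id",
    flagged_ids,
).fetchall()

print(flagged_rows)
```

Without an explicit ORDER BY, the database is free to return rows in a different order on the second query, so positions from the first run would no longer line up.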

Examples

Customer segmentation with automatic k selection

result = tool_clustering(
    engine=engine,
    query="SELECT avg_order_value, order_frequency, days_since_last_order FROM customers",
    method="kmeans",
    n_clusters=None,  # auto-select
)

print(f"Optimal k: {result['n_clusters']}")
print(f"Silhouette score: {result['evaluation']['silhouette_score']:.3f}")
print(f"Davies-Bouldin: {result['evaluation']['davies_bouldin_score']:.3f}")

for k, score in result["k_scores"].items():
    print(f"  k={k}: silhouette={score:.3f}")

Fraud detection with Isolation Forest

result = tool_anomaly_detection(
    engine=engine,
    query="SELECT amount, merchant_category, hour_of_day, distance_from_home FROM transactions",
    method="isolation_forest",
    contamination=0.02,   # expect ~2% fraud rate
)

print(f"Anomalies detected: {result['n_anomalies']} / {result['n_samples']}")
print(f"Anomaly rate: {result['anomaly_rate']:.2%}")

# Get row indices to investigate
anomaly_indices = result["anomaly_indices"]

PCA for visualisation and noise reduction

result = tool_dimensionality_reduction(
    engine=engine,
    query="SELECT * FROM high_dimensional_features",
    method="pca",
    n_components=2,
)

print(f"Explained variance: {result['cumulative_variance_explained'][-1]:.1%}")
print("Component loadings:", result["loadings"])

# 2D embedding ready for plotting
import numpy as np
coords = np.array(result["transformed_data"])
# coords[:, 0] = PC1, coords[:, 1] = PC2

Reduce dimensions then cluster

# Step 1: reduce with PCA to remove noise
reduction = tool_dimensionality_reduction(
    engine=engine,
    query="SELECT * FROM sensor_readings",
    method="pca",
    n_components=10,
)

# Step 2: cluster in reduced space using direct transformer
import numpy as np
from localdata_mcp.domains.pattern_recognition import ClusteringTransformer, PatternEvaluationTransformer

X_reduced = np.array(reduction["transformed_data"])

clusterer = ClusteringTransformer(algorithm="dbscan", auto_k_selection=False)
clusterer.fit(X_reduced)
cluster_result = clusterer.get_clustering_result(X_reduced)

evaluator = PatternEvaluationTransformer("clustering")
eval_result = evaluator.evaluate_clustering(X_reduced, cluster_result.labels)

print(f"Clusters found: {cluster_result.n_clusters}")
print(f"Noise points: {(cluster_result.labels == -1).sum()}")
print(f"Silhouette: {eval_result.metrics['silhouette_score']:.3f}")

Compare clustering against known labels

from localdata_mcp.domains.pattern_recognition import perform_clustering
import numpy as np

# Assume X is a feature array of shape (n_samples, n_features) and y_true holds the known labels
result = perform_clustering(X, algorithm="gmm", n_clusters=4, y_true=y_true)

eval_data = result["evaluation"]
print(f"Adjusted Rand Index: {eval_data['adjusted_rand_score']:.3f}")
print(f"Normalised Mutual Info: {eval_data['normalized_mutual_info']:.3f}")
print(eval_data["quality_assessment"])  # e.g. "Good clustering structure"

t-SNE visualisation of cluster structure

result = tool_dimensionality_reduction(
    engine=engine,
    query="SELECT * FROM embedding_features",
    method="tsne",
    n_components=2,
    max_rows=5000,   # t-SNE is expensive; limit rows
)

coords = result["transformed_data"]  # shape (n, 2)
# Use with any plotting library
