Pattern Recognition Domain
Overview
The pattern recognition domain provides clustering, dimensionality reduction, and anomaly detection for unlabelled or partially labelled datasets. Use it when you need to discover natural groupings in data, visualise high-dimensional structure in two or three dimensions, or identify observations that deviate significantly from normal behaviour.
When to use this domain:
Segmenting customers, products, or events into natural groups
Reducing many correlated features to a compact representation before modelling
Visualising high-dimensional data for exploration
Flagging unusual observations for manual review or downstream investigation
Validating whether group labels correspond to real data structure
Source: src/localdata_mcp/domains/pattern_recognition/
Available Analyses
| Method | Class | Description |
|---|---|---|
| K-means clustering | | Partition-based clustering with automatic k selection |
| Hierarchical clustering | | Agglomerative clustering with configurable linkage |
| DBSCAN | | Density-based clustering; handles arbitrary shapes and noise |
| Gaussian mixture models | | Soft probabilistic cluster assignments |
| Spectral clustering | | Graph-based clustering for non-convex structures |
| PCA | | Linear projection maximising variance |
| t-SNE | | Non-linear neighbourhood-preserving embedding |
| UMAP | | Fast non-linear embedding; preserves global structure better than t-SNE |
| ICA | | Independent component decomposition |
| LDA | | Supervised linear projection maximising class separability |
| Isolation Forest | | Anomaly detection via random feature splitting |
| One-Class SVM | | Boundary-based anomaly detection |
| Local Outlier Factor (LOF) | | Density-based local anomaly scoring |
| Statistical anomaly detection | | Z-score and IQR based outlier flagging |
| Silhouette score | | Average inter-cluster separation vs intra-cluster cohesion |
| Davies-Bouldin index | | Average cluster similarity measure (lower is better) |
| Calinski-Harabasz score | | Variance ratio criterion (higher is better) |
| Adjusted Rand Index | | Cluster agreement with ground truth labels |
| Normalised Mutual Information | | Information-theoretic cluster agreement |
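The z-score and IQR rules listed under statistical anomaly detection can be sketched in a few lines of standard-library Python. This is an illustration of the two rules only, not the domain's implementation: the function names, the default thresholds (3.0 and 1.5), and the quartile method are assumptions.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return indices whose absolute z-score exceeds `threshold` (assumed default)."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # constant data: nothing can be flagged
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]

def iqr_outliers(values, k=1.5):
    """Return indices outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]

# Made-up readings with one obvious outlier at index 6.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0, 10.1, 9.7, 10.0]
z_flags = zscore_outliers(readings, threshold=2.5)
iqr_flags = iqr_outliers(readings)
```

With the population standard deviation, a single value among n observations can never reach a z-score above sqrt(n - 1), so a threshold of 3.0 cannot flag anything in a sample of ten; the IQR rule is usually the more robust choice for small samples.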
MCP Tool Reference
The domain exposes three MCP tools via src/localdata_mcp/datascience_tools.py.
tool_clustering
Perform clustering on data retrieved from a SQL query.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| | | required | SQLAlchemy engine from an active connection |
| | | required | SQL query returning numeric feature columns |
| | | | Columns to use as features; all numeric columns used if None |
| | | | Algorithm: |
| | | | Number of clusters; auto-selected if None |
| | | | Row cap (default 500,000) |
Underlying ClusteringTransformer also accepts:
| Parameter | Type | Default | Description |
|---|---|---|---|
| | | | Search k_range for optimal k when n_clusters is None |
| | | | Range to search for k |
| | | | Standardise features before clustering |
| | | | Reproducibility seed |
Returns: dict with keys:
labels: cluster assignment per observation
n_clusters: number of clusters found
cluster_centers: centroid coordinates (K-means and GMM)
cluster_sizes: count per cluster
inertia: within-cluster sum of squares (K-means only)
evaluation: silhouette score, Davies-Bouldin index, Calinski-Harabasz score
k_scores: scores across k values when auto-selection ran
tool_anomaly_detection
Detect anomalous observations in data retrieved from a SQL query.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| | | required | SQLAlchemy engine |
| | | required | SQL query |
| | | | Feature columns; all numeric columns used if None |
| | | | Algorithm: |
| | | | Expected proportion of anomalies (0.0 – 0.5) |
| | | | Row cap |
Underlying AnomalyDetectionTransformer also accepts:
| Parameter | Type | Default | Description |
|---|---|---|---|
| | | | Standardise features before detection |
| | | | Reproducibility seed |
Returns: dict with keys:
anomaly_labels: 1 for normal, -1 for anomaly, per observation
anomaly_scores: continuous anomaly score (lower = more anomalous for Isolation Forest)
n_anomalies: count of detected anomalies
n_samples: total observation count
anomaly_rate: fraction of observations flagged
anomaly_indices: indices of flagged observations
evaluation: precision, recall, F1 when ground truth provided; otherwise threshold and score distribution
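The summary keys in the returned dict follow directly from the 1 / -1 label convention. The sketch below shows that relationship; `summarise_anomalies` is a hypothetical helper written for illustration, not a function exported by the domain.

```python
def summarise_anomalies(anomaly_labels):
    """Derive count, rate, and indices from labels where -1 marks an anomaly."""
    indices = [i for i, label in enumerate(anomaly_labels) if label == -1]
    n_samples = len(anomaly_labels)
    return {
        "n_anomalies": len(indices),
        "n_samples": n_samples,
        "anomaly_rate": len(indices) / n_samples if n_samples else 0.0,
        "anomaly_indices": indices,
    }

# Made-up labels: observations 2 and 4 are flagged.
summary = summarise_anomalies([1, 1, -1, 1, -1, 1, 1, 1])
```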
tool_dimensionality_reduction
Reduce feature dimensions in data retrieved from a SQL query.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| | | required | SQLAlchemy engine |
| | | required | SQL query |
| | | | Feature columns; all numeric columns used if None |
| | | | Algorithm: |
| | | | Number of output dimensions |
| | | | Row cap |
Underlying DimensionalityReductionTransformer also accepts:
| Parameter | Type | Default | Description |
|---|---|---|---|
| | | | Minimum variance to preserve for PCA auto-selection |
| | | | Standardise features before reduction |
| | | | Reproducibility seed |
Returns: dict with keys:
transformed_data: reduced-dimension representation
original_dimensions: input feature count
reduced_dimensions: output component count
explained_variance_ratio: per-component variance explained (PCA only)
cumulative_variance_explained: cumulative variance (PCA only)
evaluation: reconstruction error and variance preservation metrics
loadings: feature loadings per component (PCA, ICA)
Method Details
K-means Clustering
Partitions data into k clusters by minimising within-cluster sum of squared distances to centroids. Assumes roughly spherical, equal-sized clusters.
When to use: Large datasets, known approximate number of clusters, roughly convex cluster shapes.
Auto k-selection: When n_clusters=None and auto_k_selection=True, the transformer evaluates k across k_range using the silhouette score and selects the k with the highest value.
Key limitations: Sensitive to outliers; assumes equal cluster variances; does not handle non-convex shapes.
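The selection rule for automatic k can be sketched as an argmax over per-k silhouette scores, the same shape as the `k_scores` mapping in the tool's return value. `select_k` is a hypothetical helper, and breaking ties in favour of the smaller k is an assumption; the transformer may resolve ties differently.

```python
def select_k(k_scores):
    """Return the k with the highest silhouette score.

    Ties go to the smaller k (an assumption made for this sketch).
    """
    if not k_scores:
        raise ValueError("k_scores is empty")
    # max() keeps the first maximal element, so iterating k in ascending
    # order makes the smaller k win a tie.
    return max(sorted(k_scores), key=lambda k: k_scores[k])

# Made-up scores for illustration only.
k_scores = {2: 0.41, 3: 0.58, 4: 0.52, 5: 0.58}
best_k = select_k(k_scores)
```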
Hierarchical Clustering
Builds a dendrogram by iteratively merging the two closest clusters (agglomerative). Does not require specifying k in advance; the dendrogram can be cut at any level.
When to use: Exploratory analysis where the number of clusters is unknown; when a hierarchical structure in the data is expected.
Linkage methods (set via algorithm_params): "ward" (minimises within-cluster variance), "complete", "average", "single".
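The single, complete, and average linkage criteria differ only in how they summarise the pairwise distances between two clusters. The sketch below illustrates that difference with a hypothetical `linkage_distance` helper using Euclidean distance; Ward linkage is left out because it is defined through the increase in within-cluster variance rather than through pairwise distances.

```python
import math
from itertools import product

def linkage_distance(cluster_a, cluster_b, linkage):
    """Distance between two clusters of points under a given linkage."""
    dists = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]
    if linkage == "single":      # closest pair
        return min(dists)
    if linkage == "complete":    # farthest pair
        return max(dists)
    if linkage == "average":     # mean over all pairs
        return sum(dists) / len(dists)
    raise ValueError(f"unsupported linkage: {linkage}")

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
single = linkage_distance(a, b, "single")
complete = linkage_distance(a, b, "complete")
average = linkage_distance(a, b, "average")
```

At each agglomerative step the pair of clusters with the smallest linkage distance is merged, so single linkage tends to chain clusters together while complete linkage favours compact ones.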
DBSCAN
Groups points that are closely packed together and marks low-density points as noise (label -1). Does not require specifying k.
When to use: Arbitrarily shaped clusters; datasets with noise; unknown number of clusters.
Key parameters (passed via algorithm_params):
| Parameter | Default | Description |
|---|---|---|
| | | Maximum distance between two samples in the same neighbourhood |
| | | Minimum samples in a neighbourhood to form a core point |
Note: DBSCAN does not produce centroid coordinates. Cluster labels start from 0; -1 denotes noise.
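The two parameters above interact through the core-point test, which can be sketched as follows. The names `eps` and `min_samples` and the convention that a point counts as its own neighbour follow scikit-learn's DBSCAN; that the domain uses the same names and convention is an assumption.

```python
import math

def core_points(points, eps, min_samples):
    """Return indices of core points.

    A point is a core point when at least `min_samples` points,
    counting itself, lie within distance `eps` of it.
    """
    core = []
    for i, p in enumerate(points):
        neighbours = sum(1 for q in points if math.dist(p, q) <= eps)
        if neighbours >= min_samples:
            core.append(i)
    return core

# Made-up points: a tight group of three, a pair, and an isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
core = core_points(points, eps=0.2, min_samples=3)
```

Only the group of three yields core points here. The pair and the isolated point are neither core points nor within `eps` of one, so DBSCAN would label them as noise (-1).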
Gaussian Mixture Models (GMM)
Fits a mixture of Gaussian distributions and assigns each observation a soft probability of belonging to each component.
When to use: When clusters overlap or have different covariance structures; when you want probabilistic membership.
Key parameters (via algorithm_params):
| Parameter | Default | Description |
|---|---|---|
| | | |
| | | EM algorithm iterations |
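Soft assignment means each observation receives a posterior probability for every component. The sketch below computes those probabilities (the E-step of the EM algorithm) for a one-dimensional mixture with known parameters; the function names and the restriction to univariate Gaussians are simplifications made for illustration.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def responsibilities(x, weights, means, stds):
    """Posterior probability of each mixture component given observation x."""
    weighted = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(weighted)
    return [v / total for v in weighted]

# Two equally weighted components centred at 0 and 2 (made-up parameters).
midpoint = responsibilities(1.0, weights=[0.5, 0.5], means=[0.0, 2.0], stds=[1.0, 1.0])
near_first = responsibilities(0.0, weights=[0.5, 0.5], means=[0.0, 2.0], stds=[1.0, 1.0])
```

An observation halfway between the two means is split evenly, while one sitting on the first mean is assigned mostly, but not entirely, to the first component. A hard label is simply the component with the largest probability.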
PCA
Linear projection that finds orthogonal directions of maximum variance. Components are ordered by explained variance.
When to use: Pre-processing before other algorithms; visualisation; removing correlated features.
Auto component selection: When n_components=None, PCA selects the smallest number of components whose cumulative explained variance reaches preserve_variance (default 95% of total variance).
Interpretation: explained_variance_ratio tells you how much information each component captures. loadings show which original features contribute to each component.
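Automatic component selection amounts to a cumulative sum over the explained variance ratios. A minimal sketch, assuming the ratios are already sorted in descending order as PCA produces them; `components_for_variance` is a hypothetical helper name.

```python
from itertools import accumulate

def components_for_variance(explained_variance_ratio, preserve_variance=0.95):
    """Smallest number of leading components whose cumulative ratio
    reaches `preserve_variance`."""
    for n, cumulative in enumerate(accumulate(explained_variance_ratio), start=1):
        if cumulative >= preserve_variance:
            return n
    # Threshold never reached (e.g. truncated input): keep everything.
    return len(explained_variance_ratio)

# Made-up ratios for a five-feature dataset.
ratios = [0.55, 0.25, 0.12, 0.05, 0.03]
n_95 = components_for_variance(ratios)
n_90 = components_for_variance(ratios, preserve_variance=0.90)
```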
t-SNE
Non-linear dimensionality reduction that places similar observations close together in a 2D or 3D embedding. Optimised for visualisation.
When to use: Visualising cluster structure in high-dimensional data.
Key limitations: Does not preserve global distances reliably; stochastic (results vary across runs unless random_state is fixed); not suitable for dimensionality reduction before machine learning (use PCA for that).
Key parameters (via algorithm_params):
| Parameter | Default | Description |
|---|---|---|
| | | Balances local vs global structure (typical range 5–50) |
| | | Optimisation iterations |
| | | Step size for gradient descent |
UMAP
Non-linear manifold learning that is typically faster than t-SNE and tends to preserve more of the global structure while still capturing local neighbourhoods.
When to use: Large datasets; when t-SNE is too slow; when global structure matters for interpretation.
Key parameters (via algorithm_params):
| Parameter | Default | Description |
|---|---|---|
| | | Local neighbourhood size |
| | | Minimum distance between embedded points |
| | | Distance metric |
Dependency note: UMAP requires the umap-learn package. The transformer falls back gracefully if it is not installed.
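If you want to check for the optional dependency before requesting UMAP, the standard library can test whether a module is importable without importing it. The sketch below is one way to do that; falling back to PCA is an assumption made for illustration, since this page does not say what the transformer itself falls back to.

```python
import importlib.util

def module_available(module_name):
    """True when a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None

def pick_embedding_method(preferred="umap", fallback="pca"):
    """Use UMAP only when umap-learn (import name `umap`) is installed.

    The fallback method is an assumption for this sketch.
    """
    if preferred == "umap" and not module_available("umap"):
        return fallback
    return preferred

method = pick_embedding_method()
```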
Isolation Forest
Detects anomalies by randomly partitioning the feature space. Anomalies are isolated in fewer splits than normal points and therefore have shorter average path lengths.
When to use: General-purpose anomaly detection; high-dimensional data; no assumptions about the anomaly distribution.
Key parameters (via algorithm_params):
| Parameter | Default | Description |
|---|---|---|
| | | Number of isolation trees |
| | | Samples per tree (256 by default) |
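The link between path length and anomaly score can be sketched with the normalisation from the original Isolation Forest formulation, where the average path length of an unsuccessful binary search tree lookup serves as the reference. Note the direction: in this formulation a score near 1 means anomalous, whereas scikit-learn-style scores negate it so that lower means more anomalous. Which convention the domain's `anomaly_scores` follow beyond the note in the return description is an assumption.

```python
import math

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """Expected path length c(n) of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def isolation_score(mean_path_length, n_samples):
    """Score in (0, 1]: short paths push the score towards 1 (anomalous)."""
    return 2.0 ** (-mean_path_length / average_path_length(n_samples))

c_256 = average_path_length(256)
short_path = isolation_score(3.0, 256)    # isolated quickly
long_path = isolation_score(12.0, 256)    # took many splits to isolate
```

A point whose mean path length equals c(n) scores exactly 0.5, which is why 0.5 is the usual dividing line in this formulation.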
Local Outlier Factor (LOF)
Measures the local density deviation of each point relative to its k nearest neighbours. Points in low-density regions compared to neighbours are flagged.
When to use: Detecting anomalies that are only outliers relative to their local neighbourhood; useful when data has clusters of varying density.
Key parameters (via algorithm_params):
| Parameter | Default | Description |
|---|---|---|
| | | Neighbourhood size |
| | | Distance metric |
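The local density comparison behind LOF can be written out directly for small datasets. The sketch below is a brute-force illustration using Euclidean distance; it takes exactly k neighbours per point (ignoring distance ties) and assumes there are no duplicate points, so it is not a substitute for the transformer.

```python
import math

def local_outlier_factors(points, k):
    """Brute-force LOF: values near 1 are inliers, larger values are outliers."""
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")
    neighbours, k_distance = [], []
    for i in range(n):
        dists = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        neighbours.append([j for _, j in dists[:k]])
        k_distance.append(dists[k - 1][0])  # distance to the k-th neighbour

    def reach_dist(i, j):
        # Reachability distance of i from j: never smaller than j's k-distance.
        return max(k_distance[j], math.dist(points[i], points[j]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [k / sum(reach_dist(i, j) for j in neighbours[i]) for i in range(n)]
    # LOF: mean ratio of the neighbours' density to the point's own density.
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]

# Four evenly spaced points and one far away (made-up data).
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
scores = local_outlier_factors(points, k=2)
```

The four evenly spaced points have the same density as their neighbours and score 1, while the distant point sits in a region five times sparser than its neighbours' and scores 5.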
Clustering Quality Metrics
Silhouette score (−1 to 1): Measures how similar each observation is to its own cluster compared to other clusters. Higher is better. As a rule of thumb, values above 0.5 indicate reasonable separation and values above 0.7 indicate strong structure.
Davies-Bouldin index (≥ 0): Average ratio of within-cluster scatter to between-cluster separation. Lower is better. Zero indicates perfect separation.
Calinski-Harabasz score (≥ 0): Ratio of between-cluster to within-cluster dispersion. Higher is better. No absolute threshold; use for comparing k values.
Adjusted Rand Index (−0.5 to 1): Agreement between predicted labels and ground truth, corrected for chance. 1 = perfect agreement; values near 0 = no better than random labelling; negative = worse than random.
Normalised Mutual Information (0 to 1): Information-theoretic agreement with ground truth. 1 = perfect; 0 = no mutual information.
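To make the silhouette definition concrete, the sketch below computes it from scratch for a small dataset: for each observation, a is the mean distance to the rest of its own cluster and b is the mean distance to the nearest other cluster. It uses Euclidean distance, assigns 0 to singleton clusters, and assumes points within a cluster are not all identical; it is an illustration, not the evaluator the domain uses.

```python
import math

def silhouette_score(points, labels):
    """Mean of (b - a) / max(a, b) over all observations."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton cluster contributes 0
        a = mean_dist(i, own)
        b = min(mean_dist(i, m) for other, m in clusters.items() if other != label)
        total += (b - a) / max(a, b)
    return total / len(points)

# Two well-separated pairs (made-up data).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])   # labels match the structure
bad = silhouette_score(points, [0, 1, 0, 1])    # labels cut across it
```

The same data scores about 0.90 with labels that match its structure and below zero with labels that cut across it, which is the behaviour the thresholds above rely on.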
Composition
| Next step | Purpose |
|---|---|
| | Test whether clusters differ significantly on key variables |
| | Use cluster labels as features or stratify model fitting per cluster |
| | Detect time-series anomalies; compare with spatial anomalies |
| | Translate customer clusters into segments for targeting |
| | Reduce dimensions first, then cluster in the lower-dimensional space |
Typical composition patterns:
Cluster then test: Run tool_clustering, then pass cluster labels to tool_hypothesis_test with cluster as the group variable to determine which features drive cluster separation.
Reduce then cluster: Run tool_dimensionality_reduction (PCA, n_components=10) to reduce noise, then run tool_clustering on the reduced representation.
Detect then investigate: Run tool_anomaly_detection, extract anomaly indices, then query those rows back from the database for detailed review.
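The "detect then investigate" pattern needs a way to map flagged positions back to database rows. One approach, sketched here with an in-memory SQLite table, is to select a key column in a deterministic order alongside the features and translate positions into keys. The table, its columns, and the hard-coded `anomaly_indices` are hypothetical, and the sketch assumes the indices are positions in the query result in the order returned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO transactions (id, amount) VALUES (?, ?)",
    [(101, 12.5), (102, 9.99), (103, 4800.0), (104, 15.0)],
)

# Same ordering as the query that was given to the detector.
rows = conn.execute("SELECT id, amount FROM transactions ORDER BY id").fetchall()

# Stand-in for result["anomaly_indices"]; hypothetical value.
anomaly_indices = [2]

# Translate positions into primary keys, then fetch the full rows.
flagged_ids = [rows[i][0] for i in anomaly_indices]
placeholders = ", ".join("?" for _ in flagged_ids)
flagged = conn.execute(
    f"SELECT id, amount FROM transactions WHERE id IN ({placeholders}) ORDER BY id",
    flagged_ids,
).fetchall()
conn.close()
```

Without an explicit ORDER BY, most databases do not guarantee row order, so positions from one query run may not line up with a later one.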
Examples
Customer segmentation with automatic k selection
result = tool_clustering(
    engine=engine,
    query="SELECT avg_order_value, order_frequency, days_since_last_order FROM customers",
    method="kmeans",
    n_clusters=None,  # auto-select
)
print(f"Optimal k: {result['n_clusters']}")
print(f"Silhouette score: {result['evaluation']['silhouette_score']:.3f}")
print(f"Davies-Bouldin: {result['evaluation']['davies_bouldin_score']:.3f}")
for k, score in result["k_scores"].items():
    print(f"  k={k}: silhouette={score:.3f}")
Fraud detection with Isolation Forest
result = tool_anomaly_detection(
    engine=engine,
    query="SELECT amount, merchant_category, hour_of_day, distance_from_home FROM transactions",
    method="isolation_forest",
    contamination=0.02,  # expect ~2% fraud rate
)
print(f"Anomalies detected: {result['n_anomalies']} / {result['n_samples']}")
print(f"Anomaly rate: {result['anomaly_rate']:.2%}")
# Get row indices to investigate
anomaly_indices = result["anomaly_indices"]
PCA for visualisation and noise reduction
import numpy as np

result = tool_dimensionality_reduction(
    engine=engine,
    query="SELECT * FROM high_dimensional_features",
    method="pca",
    n_components=2,
)
print(f"Explained variance: {result['cumulative_variance_explained'][-1]:.1%}")
print("Component loadings:", result["loadings"])
# 2D embedding ready for plotting
coords = np.array(result["transformed_data"])
# coords[:, 0] = PC1, coords[:, 1] = PC2
Reduce dimensions then cluster
import numpy as np
from localdata_mcp.domains.pattern_recognition import (
    ClusteringTransformer,
    PatternEvaluationTransformer,
)

# Step 1: reduce with PCA to remove noise
reduction = tool_dimensionality_reduction(
    engine=engine,
    query="SELECT * FROM sensor_readings",
    method="pca",
    n_components=10,
)
# Step 2: cluster in reduced space using direct transformer
X_reduced = np.array(reduction["transformed_data"])
clusterer = ClusteringTransformer(algorithm="dbscan", auto_k_selection=False)
clusterer.fit(X_reduced)
cluster_result = clusterer.get_clustering_result(X_reduced)
# Step 3: evaluate the clustering on the reduced data
evaluator = PatternEvaluationTransformer("clustering")
eval_result = evaluator.evaluate_clustering(X_reduced, cluster_result.labels)
print(f"Clusters found: {cluster_result.n_clusters}")
print(f"Noise points: {(cluster_result.labels == -1).sum()}")
print(f"Silhouette: {eval_result.metrics['silhouette_score']:.3f}")
Compare clustering against known labels
from localdata_mcp.domains.pattern_recognition import perform_clustering
import numpy as np
# Assume X is feature array and y_true are known labels
result = perform_clustering(X, algorithm="gmm", n_clusters=4, y_true=y_true)
eval_data = result["evaluation"]
print(f"Adjusted Rand Index: {eval_data['adjusted_rand_score']:.3f}")
print(f"Normalised Mutual Info: {eval_data['normalized_mutual_info']:.3f}")
print(eval_data["quality_assessment"]) # e.g. "Good clustering structure"
t-SNE visualisation of cluster structure
result = tool_dimensionality_reduction(
    engine=engine,
    query="SELECT * FROM embedding_features",
    method="tsne",
    n_components=2,
    max_rows=5000,  # t-SNE is expensive; limit rows
)
coords = result["transformed_data"]  # shape (n, 2)
# Use with any plotting library