Sampling & Estimation Domain

Overview

The sampling and estimation domain provides methods for drawing representative samples from data, quantifying uncertainty around statistics, and performing probabilistic inference. It covers classical sampling theory, bootstrap resampling, Monte Carlo simulation, and Bayesian estimation.

Use this domain when you need to:

Draw a representative subset from a large dataset for faster downstream analysis
Estimate confidence intervals for a statistic when distributional assumptions are uncertain
Simulate outcomes or propagate uncertainty through a model using Monte Carlo methods
Update prior beliefs with observed data and obtain posterior credible intervals

All transformers follow the scikit-learn estimator API (they subclass BaseEstimator and TransformerMixin). High-level functions accept a DataFrame or a file path and return JSON-serializable dictionaries.

Available Analyses

Analysis	Function	Description
Simple random sampling	`generate_sample` with `sampling_method="simple_random"`	Uniform random selection without replacement
Stratified sampling	`generate_sample` with `sampling_method="stratified"`	Proportional allocation across strata
Cluster sampling	`generate_sample` with `sampling_method="cluster"`	Select random clusters, take all members
Systematic sampling	`generate_sample` with `sampling_method="systematic"`	Regular interval selection with random start
Weighted sampling	`generate_sample` with `sampling_method="weighted"`	Probability-proportional-to-size sampling
Percentile bootstrap CI	`bootstrap_statistic` with `method="percentile"`	Distribution-free confidence intervals
BCa bootstrap CI	`bootstrap_statistic` with `method="bca"`	Bias-corrected and accelerated intervals
Basic bootstrap	`bootstrap_statistic` with `method="basic"`	Pivotal confidence intervals
Studentised bootstrap	`bootstrap_statistic` with `method="studentized"`	Bootstrap-t intervals
Monte Carlo integration	`monte_carlo_simulate` with `simulation_type="integration"`	Numerical integration by random sampling
Monte Carlo simulation	`monte_carlo_simulate` with `simulation_type="simulation"`	Forward uncertainty propagation
Importance sampling	`monte_carlo_simulate` with `simulation_type="importance_sampling"`	Variance reduction for rare events
Posterior estimation	`bayesian_estimate` with `estimation_type="posterior"`	Bayesian parameter estimation
Bayesian updating	`bayesian_estimate` with `estimation_type="updating"`	Sequential belief update
Credible intervals	`bayesian_estimate`	Highest density interval (HDI) or equal-tailed CI

MCP Tool Reference

`generate_sample`

Draw a sample from a dataset using a chosen sampling method.

Parameters

Parameter	Type	Default	Description
`data`	DataFrame or str	required	Input DataFrame or path to CSV/JSON file
`sampling_method`	str	`"simple_random"`	Sampling method (see table above)
`sample_size`	int or float	`0.1`	Absolute count (int) or fraction of population (float 0–1)
`random_state`	int	None	Seed for reproducibility
`stratify_column`	str	None	Column to stratify by (required for `stratified`)
`cluster_column`	str	None	Column with cluster labels (optional for `cluster`)
`weights_column`	str	None	Column with sampling weights (required for `weighted`)
`replacement`	bool	`False`	Sample with replacement

Return format

{
  "sample_data": [
    {"col_a": 1.2, "col_b": "foo"},
    ...
  ],
  "sampling_results": {
    "sampling_method": "stratified",
    "sample_size": 500,
    "population_size": 5000,
    "sampling_params": {"stratify_column": "region", "replacement": false},
    "quality_metrics": {
      "representativeness_score": 0.97,
      "mean_absolute_difference": 0.03,
      "std_ratio_mean": 0.99
    },
    "strata_info": {
      "North": {"population_size": 1500, "sample_size": 150, "proportion_in_population": 0.30},
      "South": {"population_size": 3500, "sample_size": 350, "proportion_in_population": 0.70}
    }
  }
}

`bootstrap_statistic`

Estimate confidence intervals for a statistic via bootstrap resampling.

Parameters

Parameter	Type	Default	Description
`data`	DataFrame or str	required	Input DataFrame or file path
`statistic_func`	callable or str	`"mean"`	Statistic to bootstrap; string names: `mean`, `median`, `std`, `var`, `sum`
`n_bootstrap`	int	`1000`	Number of bootstrap resamples
`confidence_level`	float	`0.95`	Confidence level (e.g., 0.95 for 95% CI)
`method`	str	`"percentile"`	Interval method: `percentile`, `bca`, `basic`, `studentized`
`random_state`	int	None	Seed for reproducibility

Return format

{
  "statistic_name": "mean",
  "original_statistic": 42.7,
  "bootstrap_method": "percentile",
  "n_bootstrap": 1000,
  "bias_estimate": 0.03,
  "bias_corrected_estimate": 42.67,
  "variance_estimate": 1.24,
  "standard_error": 1.11,
  "confidence_intervals": {
    "percentile": [40.5, 44.9],
    "bca": [40.3, 44.7]
  },
  "convergence_info": {"bootstrap_se_stability": 0.02}
}

`monte_carlo_simulate`

Run a Monte Carlo simulation or numerical integration.

Parameters

Parameter	Type	Default	Description
`data`	DataFrame or str	required	Input data for simulation parameters
`simulation_type`	str	`"integration"`	Type: `integration`, `simulation`, `importance_sampling`
`n_simulations`	int	`10000`	Number of simulation draws
`random_state`	int	None	Seed for reproducibility

Return format

{
  "simulation_type": "simulation",
  "n_simulations": 10000,
  "estimated_value": 18.4,
  "confidence_interval": [17.1, 19.7],
  "standard_error": 0.66,
  "convergence_diagnostic": {
    "relative_error": 0.036,
    "effective_sample_size": 9800
  },
  "simulation_params": {...}
}

`bayesian_estimate`

Perform Bayesian parameter estimation with credible intervals.

Parameters

Parameter	Type	Default	Description
`data`	DataFrame or str	required	Input DataFrame or file path
`estimation_type`	str	`"posterior"`	Type: `posterior`, `updating`
`prior_distribution`	str	`"normal"`	Prior: `normal`, `beta`, `gamma`, `uniform`
`confidence_level`	float	`0.95`	Credible interval level
`random_state`	int	None	Seed for reproducibility

Return format

{
  "parameter_name": "mu",
  "estimation_method": "posterior",
  "posterior_mean": 5.23,
  "posterior_mode": 5.19,
  "posterior_median": 5.21,
  "credible_intervals": {
    "equal_tailed": [4.81, 5.65],
    "hdi": [4.79, 5.62]
  },
  "prior_info": {"distribution": "normal", "params": {"loc": 0, "scale": 10}},
  "bayes_factor": 12.4,
  "mcmc_diagnostics": {"r_hat": 1.002, "ess": 3200}
}

Method Details

Sampling Methods

Simple Random Sampling

Selects rows uniformly at random. This is the default and simplest method; use it when the population is homogeneous or when no auxiliary information is available to guide allocation.

With replacement=False (default), each row appears at most once. With replacement=True, the same row can appear multiple times (needed for bootstrap-style samples).
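The difference between the two replacement modes, and between an integer and a fractional sample_size, can be sketched with the standard library alone. The helper names below are hypothetical and are not part of this domain's API; rounding a fractional size to the nearest row count is an assumption.

```python
import random

def resolve_sample_size(sample_size, population_size):
    """Turn an absolute count (int) or a fraction (float) into a row count."""
    if isinstance(sample_size, float):
        if not 0.0 < sample_size <= 1.0:
            raise ValueError("fractional sample_size must be in (0, 1]")
        # Assumption: round to the nearest row, but never return zero rows.
        return max(1, round(sample_size * population_size))
    return int(sample_size)

def simple_random_sample(rows, sample_size, replacement=False, seed=None):
    rng = random.Random(seed)
    n = resolve_sample_size(sample_size, len(rows))
    if replacement:
        return rng.choices(rows, k=n)  # the same row may be drawn repeatedly
    return rng.sample(rows, n)         # each row appears at most once

rows = list(range(100))
without = simple_random_sample(rows, 0.1, seed=42)
with_repl = simple_random_sample(rows, 150, replacement=True, seed=42)
```

Sampling with replacement is the only mode that can return more rows than the population holds, which is why bootstrap-style samples need it.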

Stratified Sampling

Divides the population into non-overlapping strata defined by stratify_column, then samples from each stratum in proportion to its share of the population. This guarantees representation of all groups and typically reduces variance compared to simple random sampling.

Output includes strata_info showing the population size, sample size, and proportions for each stratum. The representativeness_score (0–1, higher is better) compares stratum means between the sample and population.

When to use: Surveys with demographic subgroups, A/B test allocation, analysis where rare categories must appear in sufficient numbers.
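Proportional allocation can be illustrated in a few lines. This is a minimal sketch under stated assumptions (rows are dicts, each stratum gets at least one row), not the implementation behind generate_sample; the function name is hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=None):
    """Sample the same fraction from every stratum (proportional allocation)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)

    sample, info = [], {}
    for label, members in strata.items():
        n = max(1, round(fraction * len(members)))  # assumption: at least 1 per stratum
        sample.extend(rng.sample(members, n))
        info[label] = {
            "population_size": len(members),
            "sample_size": n,
            "proportion_in_population": len(members) / len(rows),
        }
    return sample, info

population = (
    [{"region": "North", "id": i} for i in range(1500)]
    + [{"region": "South", "id": i} for i in range(1500, 5000)]
)
sample, info = stratified_sample(population, "region", 0.1, seed=0)
```

With a 10% fraction the 1500/3500 population split yields 150 and 350 sampled rows, mirroring the strata_info layout in the return format.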

Cluster Sampling

Selects random clusters, then includes all (or a sample of) members from those clusters. If no cluster_column is provided, clusters are created automatically using K-means on numeric columns, with the number of clusters set to sqrt(sample_size).

Cheaper than stratified sampling when data-collection or travel costs are concentrated by geography or organisation. Variance is typically higher than under simple random sampling (SRS) for the same total sample size, because members of the same cluster tend to resemble each other.

When to use: Geographic surveys, school studies (sample schools, then survey all students in selected schools), log analysis where records cluster by session.
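The one-stage variant (pick clusters at random, keep every member) might look like the sketch below. It assumes cluster labels already exist in the data; the automatic K-means fallback is not reproduced, and the function name is hypothetical.

```python
import random
from collections import defaultdict

def cluster_sample(rows, cluster_key, n_clusters, seed=None):
    """Pick whole clusters at random and keep all of their members."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for row in rows:
        clusters[row[cluster_key]].append(row)
    # Sort labels so the draw is reproducible for a given seed.
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = [row for label in chosen for row in clusters[label]]
    return chosen, sample

# 10 schools with 30 students each; sample 3 schools and survey everyone in them.
students = [{"school": f"S{s}", "student": i} for s in range(10) for i in range(30)]
chosen, sample = cluster_sample(students, "school", 3, seed=1)
```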

Systematic Sampling

Selects every k-th element after a random starting position, where k = population_size / sample_size. Provides even coverage over an ordered list.

The result includes sampling_interval and starting_point in sampling_params.

When to use: Quality control sampling on ordered production lines, time-series subsampling, sorted database tables where a uniform spread is needed.
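A minimal sketch of the selection rule follows. It assumes an integer interval (floor of population_size / sample_size); how the domain handles a fractional interval is not specified here, and the function name is hypothetical.

```python
import random

def systematic_sample(rows, sample_size, seed=None):
    """Take every k-th row after a random start in [0, k)."""
    rng = random.Random(seed)
    k = len(rows) // sample_size  # assumption: integer sampling interval
    if k < 1:
        raise ValueError("sample_size cannot exceed the population size")
    start = rng.randrange(k)
    picked = rows[start::k][:sample_size]
    return picked, {"sampling_interval": k, "starting_point": start}

picked, params = systematic_sample(list(range(1000)), 50, seed=5)
```

Because the rows here are their own indices, consecutive picks differ by exactly the sampling interval of 20.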

Weighted Sampling

Samples rows with probability proportional to values in weights_column. Weights are normalised to sum to 1 internally. Use with replacement=True for importance sampling applications.

When to use: Oversampling rare events, inverse-probability-of-treatment weighting (IPTW), upweighting recent records.
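The normalisation step and a with-replacement draw can be sketched as follows, using only the standard library. The function name is hypothetical, and sampling without replacement under unequal weights needs a different algorithm that is not shown.

```python
import random

def weighted_sample(rows, weights, n, seed=None):
    """Draw n rows with replacement, with probability proportional to weight."""
    rng = random.Random(seed)
    total = sum(weights)
    probs = [w / total for w in weights]  # normalise so the weights sum to 1
    return rng.choices(rows, weights=probs, k=n), probs

# A row with weight 9 should be drawn about nine times as often as one with weight 1.
sample, probs = weighted_sample(["common", "rare"], [1, 9], 10000, seed=3)
rare_share = sample.count("rare") / len(sample)
```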

Bootstrap Resampling

Bootstrap methods estimate the sampling distribution of a statistic by resampling with replacement from the observed data. No parametric distributional assumptions are required.

n_bootstrap recommendations:

1000 for exploratory work and interval width estimation
5000–10000 for stable BCa intervals or tail probabilities
10000+ for p-values and when the statistic has high variability

CI methods comparison:

Method	When to prefer
`percentile`	Symmetric distributions; large samples; quick results
`bca`	Recommended for general use; corrects for bias and skewness automatically (note that the `method` parameter itself defaults to `percentile`)
`basic`	Alternative when distribution is approximately symmetric
`studentized`	When studentisation (dividing by bootstrap SE) is feasible; more accurate for small samples
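To make the difference between the percentile and basic intervals concrete, here is a small standard-library sketch. It is not the domain's implementation: the helper names and the linear-interpolation quantile rule are assumptions, and the BCa and studentised variants are omitted.

```python
import random
import statistics

def bootstrap_replicates(data, stat, n_bootstrap=1000, seed=None):
    """Recompute stat on n_bootstrap resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return sorted(stat(rng.choices(data, k=n)) for _ in range(n_bootstrap))

def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def percentile_and_basic_ci(data, stat, confidence_level=0.95,
                            n_bootstrap=1000, seed=None):
    theta = stat(data)
    reps = bootstrap_replicates(data, stat, n_bootstrap, seed)
    alpha = 1 - confidence_level
    q_lo = quantile(reps, alpha / 2)
    q_hi = quantile(reps, 1 - alpha / 2)
    return {
        "original_statistic": theta,
        "percentile": [q_lo, q_hi],                    # quantiles used directly
        "basic": [2 * theta - q_hi, 2 * theta - q_lo],  # quantiles reflected around theta
    }

rng = random.Random(0)
data = [rng.gauss(50, 10) for _ in range(200)]
ci = percentile_and_basic_ci(data, statistics.fmean, seed=1)
```

The basic interval is the percentile interval mirrored around the original statistic, so the two coincide only when the bootstrap distribution is symmetric about it.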

Bias correction: When bias_estimate is non-negligible relative to standard_error, use bias_corrected_estimate as the point estimate instead of original_statistic.
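The relationship between these fields can be sketched as follows: the bias estimate is the mean of the bootstrap replicates minus the original statistic, and the corrected estimate subtracts that bias. This is an illustrative sketch with a hypothetical function name; the population variance is used because it is a known downward-biased statistic.

```python
import random
import statistics

def bootstrap_bias(data, stat, n_bootstrap=2000, seed=None):
    rng = random.Random(seed)
    n = len(data)
    original = stat(data)
    reps = [stat(rng.choices(data, k=n)) for _ in range(n_bootstrap)]
    bias = statistics.fmean(reps) - original  # bootstrap estimate of the bias
    return {
        "original_statistic": original,
        "bias_estimate": bias,
        "bias_corrected_estimate": original - bias,
        "standard_error": statistics.stdev(reps),
    }

rng = random.Random(0)
data = [rng.gauss(0, 1) for _ in range(20)]
# pvariance divides by n rather than n - 1, so it underestimates the variance.
out = bootstrap_bias(data, statistics.pvariance, seed=1)
```

Here the bias estimate comes out negative, so the corrected estimate is larger than the original statistic.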

Statistic functions: Pass a string name (mean, median, std, var, sum) or a Python callable f(x) -> float that operates on a 1D NumPy array.

Monte Carlo Simulation

Monte Carlo methods approximate quantities by averaging over random draws. The key result fields:

estimated_value — the Monte Carlo estimate of the target quantity
standard_error — uncertainty of the estimate (decreases as 1/sqrt(n_simulations))
confidence_interval — normal approximation CI around the estimate
convergence_diagnostic.relative_error — SE / estimated_value; below 0.01 indicates good convergence
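How these fields relate to one another is easiest to see on a toy integral. The sketch below estimates the integral of x squared over [0, 1] (exact value 1/3) by uniform sampling; the function name is hypothetical and the 1.96 multiplier assumes a 95% normal-approximation interval.

```python
import math
import random
import statistics

def mc_integrate(f, low, high, n_simulations=10000, seed=None):
    """Estimate the integral of f over [low, high] by uniform random sampling."""
    rng = random.Random(seed)
    width = high - low
    values = [width * f(rng.uniform(low, high)) for _ in range(n_simulations)]
    estimate = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(n_simulations)
    return {
        "estimated_value": estimate,
        "standard_error": se,
        # Normal-approximation interval; 1.96 assumes a 95% level.
        "confidence_interval": [estimate - 1.96 * se, estimate + 1.96 * se],
        "convergence_diagnostic": {"relative_error": se / abs(estimate)},
    }

result = mc_integrate(lambda x: x * x, 0.0, 1.0, n_simulations=20000, seed=0)
```

Quadrupling n_simulations roughly halves standard_error, which is the 1/sqrt(n_simulations) behaviour noted above.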

Simulation types:

Type	Description
`integration`	Estimate the integral of a function over a domain by uniform random sampling
`simulation`	Forward propagation: draw uncertain inputs, compute output distribution
`importance_sampling`	Reduce variance for rare-event probabilities by sampling from a proposal distribution
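As an illustration of why importance sampling helps with rare events, the sketch below estimates P(X > 4) for a standard normal X. Plain sampling would almost never land in the tail, so draws come from a proposal centred on the threshold and are reweighted by the density ratio. The function name and the choice of proposal are assumptions for this example only.

```python
import math
import random
import statistics

def rare_event_probability(threshold, n_simulations=10000, seed=None):
    """Estimate P(X > threshold) for X ~ N(0, 1) via importance sampling."""
    rng = random.Random(seed)
    terms = []
    for _ in range(n_simulations):
        x = rng.gauss(threshold, 1.0)  # proposal: N(threshold, 1)
        # Density ratio N(0,1) / N(threshold,1) evaluated at x.
        weight = math.exp(-threshold * x + threshold ** 2 / 2)
        terms.append(weight if x > threshold else 0.0)
    estimate = statistics.fmean(terms)
    se = statistics.stdev(terms) / math.sqrt(n_simulations)
    return estimate, se

estimate, se = rare_event_probability(4.0, n_simulations=20000, seed=0)
exact = 0.5 * math.erfc(4.0 / math.sqrt(2))  # closed-form tail probability
```

About half of the proposal draws fall beyond the threshold, so every batch contributes information about the tail instead of a handful of hits in millions of draws.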

n_simulations guidance: Start with 1000 to verify setup, then increase to 10,000–100,000 for stable estimates. Check convergence_diagnostic.relative_error < 0.01 for 1% accuracy.

Bayesian Estimation

Bayesian estimation combines a prior belief about a parameter with observed data to produce a posterior distribution.

Prior distributions:

`prior_distribution`	Parameters	Typical use
`normal`	`loc`, `scale`	Continuous unbounded parameters (mean, regression coefficients)
`beta`	`alpha`, `beta`	Probabilities and proportions (0–1 range)
`gamma`	`alpha`, `beta`	Positive-valued parameters (rates, variances)
`uniform`	`low`, `high`	Flat prior (equal density everywhere) over a bounded range
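For a beta prior on a proportion with binomial data, the posterior has a closed form, which makes both one-shot estimation and sequential updating easy to sketch. The helper names below are hypothetical, and the counts are made up for illustration.

```python
def beta_update(alpha, beta, successes, failures):
    """Conjugate update: Beta(alpha, beta) prior plus binomial counts."""
    return alpha + successes, beta + failures

def beta_summary(alpha, beta):
    mean = alpha / (alpha + beta)
    # The mode is only defined in the interior when both parameters exceed 1.
    mode = (alpha - 1) / (alpha + beta - 2) if alpha > 1 and beta > 1 else None
    return {"posterior_mean": mean, "posterior_mode": mode}

prior = (2, 20)  # weak prior centred near a 9% rate

# One-shot posterior: 30 conversions out of 200 trials.
posterior = beta_update(*prior, successes=30, failures=170)

# Sequential updating: the same data arriving in two batches.
after_batch_1 = beta_update(*prior, successes=10, failures=90)
after_batch_2 = beta_update(*after_batch_1, successes=20, failures=80)

summary = beta_summary(*posterior)
```

Updating batch by batch ends at the same posterior as processing all the data at once, which is the property that sequential belief updating relies on.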

Credible intervals vs. confidence intervals:

A 95% credible interval [a, b] means there is a 95% posterior probability that the true parameter lies in [a, b]. This is the intuitive interpretation often (incorrectly) attributed to frequentist confidence intervals.

Two credible interval types are reported:

equal_tailed — 2.5th to 97.5th percentile of the posterior
hdi — Highest Density Interval; the narrowest interval containing the specified probability mass; preferred for skewed posteriors
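Both interval types can be computed from posterior draws. The sketch below uses draws from a right-skewed Beta(2, 20) distribution to show how they differ; the function names and the simple index-based quantile rule are assumptions, not the domain's implementation.

```python
import math
import random

def equal_tailed_interval(samples, level=0.95):
    """Cut the same probability mass from each tail."""
    s = sorted(samples)
    tail = (1 - level) / 2
    return [s[int(tail * (len(s) - 1))], s[int((1 - tail) * (len(s) - 1))]]

def hdi(samples, level=0.95):
    """Narrowest window of sorted draws containing `level` of the mass."""
    s = sorted(samples)
    n_inside = math.ceil(level * len(s))
    widths = [(s[i + n_inside - 1] - s[i], i) for i in range(len(s) - n_inside + 1)]
    _, start = min(widths)
    return [s[start], s[start + n_inside - 1]]

rng = random.Random(0)
posterior_draws = [rng.betavariate(2, 20) for _ in range(20000)]
et = equal_tailed_interval(posterior_draws)
hd = hdi(posterior_draws)
```

For this right-skewed posterior the HDI is shifted towards zero and is no wider than the equal-tailed interval; for a symmetric posterior the two would nearly coincide.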

Bayes factor: When available, summarises the evidence ratio between hypotheses. BF > 10 is considered strong evidence; BF > 100 is decisive.

MCMC diagnostics:

r_hat — Gelman-Rubin convergence statistic; values < 1.01 indicate convergence
ess — effective sample size; below 400 suggests the chain needs more iterations
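For intuition about r_hat, the classic Gelman-Rubin statistic compares the variance between chains with the variance within them. The sketch below implements that textbook form; many libraries use refined split or rank-normalised variants, so values reported by this domain may differ slightly. The function name is hypothetical.

```python
import math
import random
import statistics

def gelman_rubin(chains):
    """Classic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])  # W
    between = n * statistics.variance(chain_means)                       # B
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)

rng = random.Random(0)
# Four chains sampling the same distribution: should look converged.
mixed = [[rng.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
# Two chains stuck around 0 and two around 3: clearly not converged.
stuck = [[rng.gauss(mu, 1) for _ in range(1000)] for mu in (0, 0, 3, 3)]

r_hat_mixed = gelman_rubin(mixed)
r_hat_stuck = gelman_rubin(stuck)
```

Chains that agree give a value very close to 1, while chains exploring different regions push the statistic well above the 1.01 threshold.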

Composition

After sampling/estimation	Chain to	Purpose
`generate_sample` result	Any domain	All downstream analyses on the sample instead of full data
`bootstrap_statistic` CIs	Business Intelligence	Uncertainty-aware reporting of KPIs
`bootstrap_statistic` CIs	Statistical Analysis	Non-parametric comparison of two statistics
`monte_carlo_simulate`	Regression/Modeling	Uncertainty propagation through a fitted model
`bayesian_estimate` posterior	Statistical Analysis	Posterior predictive checks
Stratified sample	Regression/Modeling	Balanced training sets for model fitting

Examples

Draw a stratified sample for a survey

import pandas as pd

result = generate_sample(
    data=customer_df,
    sampling_method="stratified",
    sample_size=1000,
    stratify_column="region",
    random_state=42,
)
# sample_data is a list of row dicts, so it converts directly to a DataFrame
sample = pd.DataFrame(result["sample_data"])
print(result["sampling_results"]["strata_info"])

Bootstrap a median with BCa intervals

result = bootstrap_statistic(
    data=revenue_df,
    statistic_func="median",
    n_bootstrap=5000,
    confidence_level=0.95,
    method="bca",
    random_state=0,
)
print(f"Median: {result['original_statistic']:.2f}")
print(f"95% BCa CI: {result['confidence_intervals']['bca']}")

Custom statistic: interquartile range

import numpy as np

result = bootstrap_statistic(
    data=df,
    statistic_func=lambda x: np.percentile(x, 75) - np.percentile(x, 25),
    n_bootstrap=2000,
    confidence_level=0.90,
)

Monte Carlo uncertainty propagation

result = monte_carlo_simulate(
    data=model_params_df,
    simulation_type="simulation",
    n_simulations=50000,
    random_state=1,
)
print(f"Expected output: {result['estimated_value']:.3f} ± {result['standard_error']:.3f}")
print(f"CI: {result['confidence_interval']}")

Bayesian estimation of a conversion rate

# Prior: Beta(2, 20) — weak prior of ~9% conversion
result = bayesian_estimate(
    data=experiment_df,
    estimation_type="posterior",
    prior_distribution="beta",
    confidence_level=0.95,
)
print(f"Posterior mean: {result['posterior_mean']:.3f}")
print(f"95% HDI: {result['credible_intervals']['hdi']}")

Full workflow: sample then analyse

import pandas as pd

# 1. Draw a stratified 20% sample
sample_result = generate_sample(
    data=large_df,
    sampling_method="stratified",
    sample_size=0.2,
    stratify_column="product_category",
    random_state=7,
)
sample_df = pd.DataFrame(sample_result["sample_data"])

# 2. Bootstrap the mean order value on the sample
ci_result = bootstrap_statistic(
    data=sample_df[["order_value"]],
    statistic_func="mean",
    n_bootstrap=2000,
    method="bca",
)
print(f"Mean order value: {ci_result['original_statistic']:.2f}")
print(f"95% CI: {ci_result['confidence_intervals']['bca']}")
