Sampling & Estimation Domain
Overview
The sampling and estimation domain provides methods for drawing representative samples from data, quantifying uncertainty around statistics, and performing probabilistic inference. It covers classical sampling theory, bootstrap resampling, Monte Carlo simulation, and Bayesian estimation.
Use this domain when you need to:
Draw a representative subset from a large dataset for faster downstream analysis
Estimate confidence intervals for a statistic when distributional assumptions are uncertain
Simulate outcomes or propagate uncertainty through a model using Monte Carlo methods
Update prior beliefs with observed data and obtain posterior credible intervals
All transformers are sklearn-compatible (BaseEstimator, TransformerMixin). High-level
functions accept a DataFrame or a file path and return JSON-serializable dictionaries.
Available Analyses
| Analysis | Function | Description |
|---|---|---|
| Simple random sampling | | Uniform random selection without replacement |
| Stratified sampling | | Proportional allocation across strata |
| Cluster sampling | | Select random clusters, take all members |
| Systematic sampling | | Regular interval selection with random start |
| Weighted sampling | | Probability-proportional-to-size sampling |
| Percentile bootstrap CI | | Distribution-free confidence intervals |
| BCa bootstrap CI | | Bias-corrected and accelerated intervals |
| Basic bootstrap | | Pivotal confidence intervals |
| Studentised bootstrap | | Bootstrap-t intervals |
| Monte Carlo integration | | Numerical integration by random sampling |
| Monte Carlo simulation | | Forward uncertainty propagation |
| Importance sampling | | Variance reduction for rare events |
| Posterior estimation | | Bayesian parameter estimation |
| Bayesian updating | | Sequential belief update |
| Credible intervals | | Highest density interval (HDI) or equal-tailed credible interval |
MCP Tool Reference
generate_sample
Draw a sample from a dataset using a chosen sampling method.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| data | DataFrame or str | required | Input DataFrame or path to CSV/JSON file |
| sampling_method | str | | Sampling method (see table above) |
| sample_size | int or float | | Absolute count (int) or fraction of population (float 0–1) |
| random_state | int | None | Seed for reproducibility |
| stratify_column | str | None | Column to stratify by (required for stratified sampling) |
| cluster_column | str | None | Column with cluster labels (optional for cluster sampling) |
| weights_column | str | None | Column with sampling weights (required for weighted sampling) |
| replacement | bool | False | Sample with replacement |
Return format
{
  "sample_data": [
    {"col_a": 1.2, "col_b": "foo"},
    ...
  ],
  "sampling_results": {
    "sampling_method": "stratified",
    "sample_size": 500,
    "population_size": 5000,
    "sampling_params": {"stratify_column": "region", "replacement": false},
    "quality_metrics": {
      "representativeness_score": 0.97,
      "mean_absolute_difference": 0.03,
      "std_ratio_mean": 0.99
    },
    "strata_info": {
      "North": {"population_size": 1500, "sample_size": 150, "proportion_in_population": 0.30},
      "South": {"population_size": 3500, "sample_size": 350, "proportion_in_population": 0.70}
    }
  }
}
bootstrap_statistic
Estimate confidence intervals for a statistic via bootstrap resampling.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| data | DataFrame or str | required | Input DataFrame or file path |
| statistic_func | callable or str | | Statistic to bootstrap; string names: mean, median, std, var, sum |
| n_bootstrap | int | | Number of bootstrap resamples |
| confidence_level | float | | Confidence level (e.g., 0.95 for 95% CI) |
| method | str | | Interval method (e.g., percentile, bca) |
| random_state | int | None | Seed for reproducibility |
Return format
{
  "statistic_name": "mean",
  "original_statistic": 42.7,
  "bootstrap_method": "percentile",
  "n_bootstrap": 1000,
  "bias_estimate": 0.03,
  "bias_corrected_estimate": 42.67,
  "variance_estimate": 1.24,
  "standard_error": 1.11,
  "confidence_intervals": {
    "percentile": [40.5, 44.9],
    "bca": [40.3, 44.7]
  },
  "convergence_info": {"bootstrap_se_stability": 0.02}
}
monte_carlo_simulate
Run a Monte Carlo simulation or numerical integration.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| data | DataFrame or str | required | Input data for simulation parameters |
| simulation_type | str | | Simulation type (e.g., simulation) |
| n_simulations | int | | Number of simulation draws |
| random_state | int | None | Seed for reproducibility |
Return format
{
  "simulation_type": "simulation",
  "n_simulations": 10000,
  "estimated_value": 18.4,
  "confidence_interval": [17.1, 19.7],
  "standard_error": 0.66,
  "convergence_diagnostic": {
    "relative_error": 0.004,
    "effective_sample_size": 9800
  },
  "simulation_params": {...}
}
bayesian_estimate
Perform Bayesian parameter estimation with credible intervals.
Parameters
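| Parameter | Type | Default | Description |
|---|---|---|---|
| data | DataFrame or str | required | Input DataFrame or file path |
| estimation_type | str | | Estimation type (e.g., posterior) |
| prior_distribution | str | | Prior distribution (e.g., normal, beta) |
| confidence_level | float | | Credible interval level |
| random_state | int | None | Seed for reproducibility |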
Return format
{
  "parameter_name": "mu",
  "estimation_method": "posterior",
  "posterior_mean": 5.23,
  "posterior_mode": 5.19,
  "posterior_median": 5.21,
  "credible_intervals": {
    "equal_tailed": [4.81, 5.65],
    "hdi": [4.79, 5.62]
  },
  "prior_info": {"distribution": "normal", "params": {"loc": 0, "scale": 10}},
  "bayes_factor": 12.4,
  "mcmc_diagnostics": {"r_hat": 1.002, "ess": 3200}
}
Method Details
Sampling Methods
Simple Random Sampling
Selects rows uniformly at random. This is the default and simplest method; use it when the population is homogeneous or when no auxiliary information is available to guide allocation.
With replacement=False (default), each row appears at most once. With replacement=True,
the same row can appear multiple times (needed for bootstrap-style samples).
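The difference between the two replacement modes can be illustrated with a minimal standard-library sketch. The helper name simple_random_sample and the list-based population are assumptions made for illustration; this is not the domain's own implementation.

```python
import random

def simple_random_sample(rows, sample_size, replacement=False, seed=None):
    """Return `sample_size` rows chosen uniformly at random (illustrative helper)."""
    rng = random.Random(seed)
    if replacement:
        # With replacement: the same row may be drawn more than once.
        return rng.choices(rows, k=sample_size)
    # Without replacement: each row appears at most once.
    return rng.sample(rows, k=sample_size)

population = list(range(100))
without = simple_random_sample(population, 10, seed=42)
# Drawing 150 rows from 100 is only possible with replacement.
with_repl = simple_random_sample(population, 150, replacement=True, seed=42)
```

Requesting more rows than the population holds only works with replacement, which is why bootstrap-style samples need it.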
Stratified Sampling
Divides the population into non-overlapping strata defined by stratify_column, then samples
from each stratum in proportion to its share of the population. This guarantees representation
of all groups and typically reduces variance compared to simple random sampling.
Output includes strata_info showing the population size, sample size, and proportions for each
stratum. The representativeness_score (0–1, higher is better) compares stratum means between
the sample and population.
When to use: Surveys with demographic subgroups, A/B test allocation, analysis where rare categories must appear in sufficient numbers.
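Proportional allocation can be sketched in a few lines of plain Python. The function stratified_sample and the synthetic North/South population below are hypothetical; they mirror the 1500/3500 split shown in the return-format example and are not the library's code.

```python
import random
from collections import defaultdict

def stratified_sample(rows, stratify_key, sample_size, seed=None):
    """Sample each stratum in proportion to its population share (illustrative)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[stratify_key]].append(row)
    sample = []
    for members in strata.values():
        # Proportional allocation: stratum share of the population times n.
        n_stratum = round(sample_size * len(members) / len(rows))
        sample.extend(rng.sample(members, n_stratum))
    return sample

population = (
    [{"region": "North", "id": i} for i in range(1500)]
    + [{"region": "South", "id": i} for i in range(3500)]
)
sample = stratified_sample(population, "region", 500, seed=0)
counts = {r: sum(1 for row in sample if row["region"] == r) for r in ("North", "South")}
```

Because each stratum size is rounded independently, the total can differ from sample_size by a row or two when shares do not divide evenly; a production implementation would need a rule for distributing the remainder.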
Cluster Sampling
Selects random clusters, then includes all (or a sample of) members from those clusters. If no
cluster_column is provided, clusters are created automatically using K-means on numeric columns,
with the number of clusters set to sqrt(sample_size).
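More efficient than stratified sampling when travel or data collection costs are grouped geographically or organisationally. Variance is typically higher than with simple random sampling (SRS) for the same total sample size, because members of the same cluster tend to resemble each other.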
When to use: Geographic surveys, school studies (sample schools, then survey all students in selected schools), log analysis where records cluster by session.
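The one-stage variant (pick clusters, keep every member) might look like the sketch below. The helper cluster_sample and the school data are invented for illustration, and the automatic K-means fallback described above is not reproduced here.

```python
import random
from collections import defaultdict

def cluster_sample(rows, cluster_key, n_clusters, seed=None):
    """Pick whole clusters at random and keep all of their members (illustrative)."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for row in rows:
        clusters[row[cluster_key]].append(row)
    # Sample cluster labels, not individual rows.
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [row for label in chosen for row in clusters[label]]

# Hypothetical population: 10 schools with 30 students each.
students = [{"school": f"S{s}", "student": i} for s in range(10) for i in range(30)]
sample = cluster_sample(students, "school", n_clusters=3, seed=1)
selected_schools = {row["school"] for row in sample}
```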
Systematic Sampling
Selects every k-th element after a random starting position, where k = population_size / sample_size. Provides even coverage over an ordered list.
The result includes sampling_interval and starting_point in sampling_params.
When to use: Quality control sampling on ordered production lines, time-series subsampling, sorted database tables where a uniform spread is needed.
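A minimal sketch of the interval-and-start logic follows. It assumes the interval is rounded down to an integer, which the text above does not specify, and the helper name systematic_sample is hypothetical.

```python
import random

def systematic_sample(rows, sample_size, seed=None):
    """Take every k-th row after a random start (illustrative helper)."""
    rng = random.Random(seed)
    interval = len(rows) // sample_size   # k, rounded down (an assumption)
    start = rng.randrange(interval)       # random start in [0, k)
    indices = [start + i * interval for i in range(sample_size)]
    return [rows[i] for i in indices], interval, start

rows = list(range(1000))
sample, sampling_interval, starting_point = systematic_sample(rows, 50, seed=3)
```

If the ordering of the rows has a period close to the interval, every selected row lands at the same phase of the cycle, so it is worth checking the sort order before relying on this method.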
Weighted Sampling
Samples rows with probability proportional to values in weights_column. Weights are normalised
to sum to 1 internally. Use with replacement=True for importance sampling applications.
When to use: Oversampling rare events, inverse-probability-of-treatment weighting (IPTW), upweighting recent records.
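The normalisation step and a with-replacement weighted draw can be sketched with the standard library. The helper weighted_sample and the toy weights are assumptions for illustration only.

```python
import random

def weighted_sample(rows, weights, sample_size, seed=None):
    """Draw rows with probability proportional to their weight (illustrative)."""
    rng = random.Random(seed)
    total = sum(weights)
    probabilities = [w / total for w in weights]   # normalise to sum to 1
    # random.choices draws with replacement according to the given weights.
    drawn = rng.choices(rows, weights=probabilities, k=sample_size)
    return drawn, probabilities

rows = ["a", "b", "c", "d"]
weights = [5.0, 3.0, 2.0, 0.0]   # "d" has zero weight and can never be drawn
sample, probabilities = weighted_sample(rows, weights, 1000, seed=11)
```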
Bootstrap Resampling
Bootstrap methods estimate the sampling distribution of a statistic by resampling with replacement from the observed data. No parametric distributional assumptions are required.
n_bootstrap recommendations:
1000 for exploratory work and interval width estimation
5000–10000 for stable BCa intervals or tail probabilities
10000+ for p-values and when the statistic has high variability
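To make the role of n_bootstrap concrete, here is a minimal percentile-bootstrap sketch using only the standard library. The function percentile_bootstrap_ci and the synthetic data are illustrative assumptions, not the code behind bootstrap_statistic.

```python
import random
import statistics

def percentile_bootstrap_ci(data, statistic, n_bootstrap=1000,
                            confidence_level=0.95, seed=None):
    """Percentile interval from the sorted bootstrap replicates (illustrative)."""
    rng = random.Random(seed)
    n = len(data)
    # Each replicate: resample n points with replacement, recompute the statistic.
    replicates = sorted(
        statistic(rng.choices(data, k=n)) for _ in range(n_bootstrap)
    )
    alpha = 1.0 - confidence_level
    lower = replicates[int((alpha / 2) * n_bootstrap)]
    upper = replicates[int((1 - alpha / 2) * n_bootstrap) - 1]
    return lower, upper

rng = random.Random(0)
data = [rng.gauss(50, 10) for _ in range(200)]   # synthetic sample
low, high = percentile_bootstrap_ci(data, statistics.fmean, n_bootstrap=1000, seed=1)
```

With 1000 resamples, each tail of a 95% interval is determined by only about 25 replicates, which is one way to see why tail-sensitive methods benefit from larger n_bootstrap.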
CI methods comparison:
| Method | When to prefer |
|---|---|
| Percentile | Symmetric distributions; large samples; quick results |
| BCa | Default recommendation; corrects for bias and skewness automatically |
| Basic | Alternative when distribution is approximately symmetric |
| Studentised | When studentisation (dividing by bootstrap SE) is feasible; more accurate for small samples |
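The percentile and basic (pivotal) intervals are built from the same replicate quantiles; the basic interval reflects them around the original estimate. The sketch below shows that relationship under the assumption of simple index-based quantiles. The helper names are hypothetical.

```python
import random
import statistics

def bootstrap_replicates(data, statistic, n_bootstrap, seed=None):
    rng = random.Random(seed)
    return sorted(statistic(rng.choices(data, k=len(data))) for _ in range(n_bootstrap))

def percentile_and_basic(data, statistic, n_bootstrap=2000,
                         confidence_level=0.95, seed=None):
    theta_hat = statistic(data)
    reps = bootstrap_replicates(data, statistic, n_bootstrap, seed)
    tail = (1.0 - confidence_level) / 2
    q_lo = reps[int(tail * n_bootstrap)]
    q_hi = reps[int((1.0 - tail) * n_bootstrap) - 1]
    percentile = (q_lo, q_hi)
    # Basic interval: reflect the percentile endpoints around theta_hat.
    basic = (2 * theta_hat - q_hi, 2 * theta_hat - q_lo)
    return theta_hat, percentile, basic

rng = random.Random(5)
skewed = [rng.expovariate(1.0) for _ in range(100)]   # synthetic skewed data
theta_hat, percentile, basic = percentile_and_basic(skewed, statistics.median, seed=6)
```

Both intervals have the same width. They coincide only when the replicate distribution is symmetric around the original estimate, which is why both rows of the table mention symmetry.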
Bias correction: When bias_estimate is non-negligible relative to standard_error, use
bias_corrected_estimate as the point estimate instead of original_statistic.
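One common way to obtain these fields, sketched below, is to take the bias as the mean of the replicates minus the original statistic and subtract it from the original. This matches the example output above (42.7 - 0.03 = 42.67), but whether the domain computes it exactly this way is an assumption. The helper bootstrap_bias is hypothetical.

```python
import random
import statistics

def bootstrap_bias(data, statistic, n_bootstrap=2000, seed=None):
    """Bootstrap bias estimate and bias-corrected point estimate (illustrative)."""
    rng = random.Random(seed)
    original = statistic(data)
    reps = [statistic(rng.choices(data, k=len(data))) for _ in range(n_bootstrap)]
    bias = statistics.fmean(reps) - original
    return {
        "original_statistic": original,
        "bias_estimate": bias,
        "bias_corrected_estimate": original - bias,
        "standard_error": statistics.stdev(reps),
    }

rng = random.Random(2)
data = [rng.gauss(0, 1) for _ in range(30)]
# The divide-by-n variance underestimates on average, so its bias should come out negative.
result = bootstrap_bias(data, statistics.pvariance, seed=3)
```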
Statistic functions: Pass a string name (mean, median, std, var, sum) or a Python
callable f(x) -> float that operates on a 1D NumPy array.
Monte Carlo Simulation
Monte Carlo methods approximate quantities by averaging over random draws. The key result fields:
estimated_value: the Monte Carlo estimate of the target quantity
standard_error: uncertainty of the estimate (decreases as 1/sqrt(n_simulations))
confidence_interval: normal-approximation CI around the estimate
convergence_diagnostic.relative_error: SE / estimated_value; below 0.01 indicates good convergence
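How these fields relate to each other can be seen in a small integration sketch. The function monte_carlo_integrate is hypothetical, the 1.96 multiplier assumes a 95% normal-approximation interval, and the integrand is a toy example with known value 1/3.

```python
import math
import random
import statistics

def monte_carlo_integrate(func, lower, upper, n_simulations, seed=None):
    """Estimate the integral of func over [lower, upper] by uniform sampling."""
    rng = random.Random(seed)
    width = upper - lower
    values = [width * func(rng.uniform(lower, upper)) for _ in range(n_simulations)]
    estimate = statistics.fmean(values)
    # Standard error shrinks as 1/sqrt(n_simulations).
    standard_error = statistics.stdev(values) / math.sqrt(n_simulations)
    return {
        "estimated_value": estimate,
        "standard_error": standard_error,
        "confidence_interval": [estimate - 1.96 * standard_error,
                                estimate + 1.96 * standard_error],
        "relative_error": standard_error / abs(estimate),
    }

# Integral of x^2 over [0, 1]; the exact value is 1/3.
result = monte_carlo_integrate(lambda x: x * x, 0.0, 1.0, n_simulations=20000, seed=0)
```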
Simulation types:
| Type | Description |
|---|---|
| Monte Carlo integration | Estimate the integral of a function over a domain by uniform random sampling |
| Monte Carlo simulation | Forward propagation: draw uncertain inputs, compute output distribution |
| Importance sampling | Reduce variance for rare-event probabilities by sampling from a proposal distribution |
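As a sketch of why importance sampling helps with rare events, the example below estimates P(Z > 4) for a standard normal Z. Naive sampling would see roughly three hits per 100,000 draws; sampling from a proposal centred on the threshold and reweighting by the density ratio gives a far more stable estimate. The function name and the choice of a N(threshold, 1) proposal are illustrative assumptions.

```python
import math
import random
import statistics

def rare_event_probability(threshold, n_simulations, seed=None):
    """Estimate P(Z > threshold), Z ~ N(0, 1), by importance sampling."""
    rng = random.Random(seed)
    weighted = []
    for _ in range(n_simulations):
        x = rng.gauss(threshold, 1.0)   # draw from the proposal N(threshold, 1)
        if x > threshold:
            # Density ratio N(0,1) / N(threshold,1) evaluated at x.
            weighted.append(math.exp(-threshold * x + 0.5 * threshold ** 2))
        else:
            weighted.append(0.0)
    estimate = statistics.fmean(weighted)
    se = statistics.stdev(weighted) / math.sqrt(n_simulations)
    return estimate, se

estimate, se = rare_event_probability(4.0, n_simulations=50000, seed=0)
exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))   # closed-form tail probability
```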
n_simulations guidance: Start with 1000 to verify setup, then increase to 10,000–100,000
for stable estimates. Check convergence_diagnostic.relative_error < 0.01 for 1% accuracy.
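The "start small, then scale up" guidance can be sketched as a loop that grows the number of draws until the relative error drops below the target. The helper simulate_until_converged and the toy model (a product of two uncertain inputs) are hypothetical and stand in for whatever model is being propagated.

```python
import math
import random
import statistics

def simulate_until_converged(draw, target_relative_error=0.01, start=1000,
                             max_simulations=1_000_000, seed=None):
    """Grow the simulation tenfold until SE / |estimate| is below the target."""
    rng = random.Random(seed)
    draws = []
    n = start
    while True:
        draws.extend(draw(rng) for _ in range(n - len(draws)))
        estimate = statistics.fmean(draws)
        se = statistics.stdev(draws) / math.sqrt(len(draws))
        relative_error = se / abs(estimate)
        if relative_error < target_relative_error or len(draws) >= max_simulations:
            return estimate, relative_error, len(draws)
        n = min(len(draws) * 10, max_simulations)

# Hypothetical model: output = a * b with uncertain inputs a and b.
def draw_output(rng):
    return rng.gauss(10.0, 2.0) * rng.gauss(2.0, 0.5)

estimate, relative_error, n_used = simulate_until_converged(draw_output, seed=4)
```

Because the standard error falls as 1/sqrt(n), each tenfold increase in draws cuts the relative error by a factor of about 3.2.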
Bayesian Estimation
Bayesian estimation combines a prior belief about a parameter with observed data to produce a posterior distribution.
Prior distributions:
| Distribution | Parameters | Typical use |
|---|---|---|
| Normal | loc, scale | Continuous unbounded parameters (mean, regression coefficients) |
| Beta | | Probabilities and proportions (0–1 range) |
| | | Positive-valued parameters (rates, variances) |
| | | Completely uninformative over a bounded range |
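For a proportion with a Beta prior, the posterior has a closed form, which makes the prior-plus-data idea easy to see. The sketch below uses a Beta(2, 20) prior and made-up conversion counts. The helper beta_binomial_update is hypothetical and says nothing about how bayesian_estimate is implemented.

```python
def beta_binomial_update(prior_alpha, prior_beta, successes, trials):
    """Conjugate update: Beta prior + binomial data gives a Beta posterior."""
    post_alpha = prior_alpha + successes
    post_beta = prior_beta + (trials - successes)
    posterior_mean = post_alpha / (post_alpha + post_beta)
    return post_alpha, post_beta, posterior_mean

# Beta(2, 20) prior (mean 2/22, about 9%), then 30 conversions in 200 trials.
alpha, beta, mean = beta_binomial_update(2, 20, successes=30, trials=200)

# Sequential updating: feeding the same data in two batches gives the same posterior.
a1, b1, _ = beta_binomial_update(2, 20, successes=10, trials=80)
a2, b2, _ = beta_binomial_update(a1, b1, successes=20, trials=120)
```

The posterior mean (32/222, about 14.4%) sits between the prior mean (about 9%) and the observed rate (15%), pulled toward the data because 200 trials outweigh the prior's 22 pseudo-observations.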
Credible intervals vs. confidence intervals:
A 95% credible interval [a, b] means there is a 95% posterior probability that the true
parameter lies in [a, b]. This is the intuitive interpretation often (incorrectly) attributed
to frequentist confidence intervals.
Two credible interval types are reported:
equal_tailed: 2.5th to 97.5th percentile of the posterior (for a 95% level)
hdi: Highest Density Interval; the narrowest interval containing the specified probability mass; preferred for skewed posteriors
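Both interval types can be computed from posterior draws, as in the sketch below. It assumes the posterior is available as a list of samples and uses a simple sorted-window search for the HDI. The helper credible_intervals and the Beta(2, 20) draws are illustrative.

```python
import random

def credible_intervals(samples, level=0.95):
    """Equal-tailed interval and HDI from posterior draws (illustrative)."""
    ordered = sorted(samples)
    n = len(ordered)
    m = round(level * n)   # number of draws each interval must contain
    # Equal-tailed: cut the same number of draws from each tail.
    lo = (n - m) // 2
    equal_tailed = (ordered[lo], ordered[lo + m - 1])
    # HDI: the narrowest window of m consecutive ordered draws.
    best = min(range(n - m + 1), key=lambda i: ordered[i + m - 1] - ordered[i])
    hdi = (ordered[best], ordered[best + m - 1])
    return equal_tailed, hdi

rng = random.Random(8)
posterior_draws = [rng.betavariate(2, 20) for _ in range(10000)]   # right-skewed
equal_tailed, hdi = credible_intervals(posterior_draws)
```

For a right-skewed posterior like this one, the HDI shifts toward the mode and is never wider than the equal-tailed interval.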
Bayes factor: When available, summarises the evidence ratio between hypotheses. BF > 10 is considered strong evidence; BF > 100 is decisive.
MCMC diagnostics:
r_hat: Gelman-Rubin convergence statistic; values < 1.01 indicate convergence
ess: effective sample size; below 400 suggests the chain needs more iterations
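To show what r_hat measures, the sketch below implements the classic (non-split) Gelman-Rubin statistic, which compares between-chain and within-chain variance. Modern samplers usually report a rank-normalised split version, so values from a real run may differ slightly. The function gelman_rubin and the synthetic chains are assumptions for illustration.

```python
import math
import random
import statistics

def gelman_rubin(chains):
    """Classic (non-split) R-hat for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])   # W
    between = n * statistics.variance(chain_means)                        # B
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)

rng = random.Random(9)
# Four chains sampling the same distribution: R-hat should sit close to 1.
mixed = [[rng.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
# Two chains stuck around 0 and two around 3: R-hat should be well above 1.
stuck = [[rng.gauss(mu, 1) for _ in range(1000)] for mu in (0, 0, 3, 3)]
r_hat_mixed = gelman_rubin(mixed)
r_hat_stuck = gelman_rubin(stuck)
```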
Composition
| After sampling/estimation | Chain to | Purpose |
|---|---|---|
| | Any domain | All downstream analyses on the sample instead of full data |
| | Business Intelligence | Uncertainty-aware reporting of KPIs |
| | Statistical Analysis | Non-parametric comparison of two statistics |
| | Regression/Modeling | Uncertainty propagation through a fitted model |
| | Statistical Analysis | Posterior predictive checks |
| Stratified sample | Regression/Modeling | Balanced training sets for model fitting |
Examples
Draw a stratified sample for a survey
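import pandas as pd

# Draw 1000 rows, allocated proportionally across regions.
result = generate_sample(
    data=customer_df,
    sampling_method="stratified",
    sample_size=1000,
    stratify_column="region",
    random_state=42,
)
# sample_data is a list of row dictionaries; rebuild a DataFrame from it.
sample = pd.DataFrame(result["sample_data"])
print(result["sampling_results"]["strata_info"])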
Bootstrap a median with BCa intervals
result = bootstrap_statistic(
    data=revenue_df,
    statistic_func="median",
    n_bootstrap=5000,
    confidence_level=0.95,
    method="bca",
    random_state=0,
)
print(f"Median: {result['original_statistic']:.2f}")
print(f"95% BCa CI: {result['confidence_intervals']['bca']}")
Custom statistic: interquartile range
import numpy as np

# Any callable that maps a 1D NumPy array to a float can be bootstrapped.
result = bootstrap_statistic(
    data=df,
    statistic_func=lambda x: np.percentile(x, 75) - np.percentile(x, 25),
    n_bootstrap=2000,
    confidence_level=0.90,
)
Monte Carlo uncertainty propagation
result = monte_carlo_simulate(
    data=model_params_df,
    simulation_type="simulation",
    n_simulations=50000,
    random_state=1,
)
print(f"Expected output: {result['estimated_value']:.3f} ± {result['standard_error']:.3f}")
# The call sets no confidence level, so the interval is printed without one.
print(f"Confidence interval: {result['confidence_interval']}")
Bayesian estimation of a conversion rate
# Prior: Beta(2, 20), a weak prior of ~9% conversion
result = bayesian_estimate(
    data=experiment_df,
    estimation_type="posterior",
    prior_distribution="beta",
    confidence_level=0.95,
)
print(f"Posterior mean: {result['posterior_mean']:.3f}")
print(f"95% HDI: {result['credible_intervals']['hdi']}")
Full workflow: sample then analyse
import pandas as pd

# 1. Draw a stratified 20% sample
sample_result = generate_sample(
    data=large_df,
    sampling_method="stratified",
    sample_size=0.2,
    stratify_column="product_category",
    random_state=7,
)
sample_df = pd.DataFrame(sample_result["sample_data"])

# 2. Bootstrap the mean order value on the sample
ci_result = bootstrap_statistic(
    data=sample_df[["order_value"]],
    statistic_func="mean",
    n_bootstrap=2000,
    confidence_level=0.95,  # set explicitly so the "95% CI" label below is accurate
    method="bca",
)
print(f"Mean order value: {ci_result['original_statistic']:.2f}")
print(f"95% CI: {ci_result['confidence_intervals']['bca']}")