Continuous outcomes are measured on a numeric scale (e.g. test scores, blood pressure, weight). Binary outcomes have only two possible results (e.g. responded/did not respond, relapsed/did not relapse). The choice affects which statistical test is used and how effect sizes are defined.
Within-subject designs
Each participant receives both the treatment and control conditions at different times, serving as their own control. This eliminates stable between-person differences from the analysis, making these designs substantially more efficient than parallel designs for the same sample size. Crossover trials alternate conditions within a single study period. Stepped wedge and delayed-start designs stagger when groups cross over from control to treatment.
Each participant receives both treatment and control in two separate phases (AB/BA sequences). The order is counterbalanced so that period effects do not bias the treatment comparison. Because each person is their own control, between-person variance cancels, making this the most efficient design when treatment effects are reversible. Best for: short-acting interventions with no carryover (e.g. analgesics, cognitive tasks, acute drug effects).
Groups of participants (cohorts) start in the control condition and sequentially cross over to treatment at staggered time points. Every cohort eventually receives treatment, which can be ethically advantageous. Classic stepped wedge uses the Hussey & Hughes framework; delayed-start gives each cohort equal treatment duration. Best for: interventions that cannot be withdrawn, policy rollouts, or when withholding treatment long-term is impractical.
Parallel designs
Participants are assigned to one condition only (treatment or control) and stay in that group for the entire study. Between-person variability remains in the analysis, so these designs typically require larger sample sizes than within-subject designs. However, they are simpler to implement and avoid concerns about treatment carryover or ordering effects.
The standard two-arm randomized controlled trial. Participants are randomly assigned to treatment or control and measured one or more times. The simplest and most widely understood design. Best for: disease-modifying or irreversible treatments, when carryover is a concern, or when the primary goal is regulatory approval with a straightforward analysis.
Power analysis for trials where the primary endpoint is time until an event occurs (death, relapse, progression). Uses the log-rank test / Cox proportional hazards framework. Power depends on the number of events observed, not just sample size. Requires specifying hazard ratio, accrual period, and follow-up duration. Best for: oncology, cardiology, any trial measuring duration until a clinical event.
Randomization occurs at the group level (e.g. clinics, schools, communities) rather than the individual level. All participants within a cluster receive the same condition. The design effect (DEFF) inflates the required sample size because within-cluster similarity reduces effective information. Best for: interventions delivered at the group level (e.g. training programs, policy changes, community health interventions).
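The inflation can be sketched with the standard design-effect formula for equal cluster sizes (an illustration with our own function names, not NeuPower's internal code, which handles unequal sizes and richer correlation structures):

```python
def design_effect(m, icc):
    """Standard design effect for equal cluster sizes: DEFF = 1 + (m - 1) * ICC."""
    return 1 + (m - 1) * icc

def inflated_n(n_individual, m, icc):
    """Sample size after accounting for clustering."""
    return n_individual * design_effect(m, icc)

# 30 patients per clinic with ICC = 0.05: DEFF = 1 + 29 * 0.05 = 2.45,
# so 200 individually-randomised participants become ~490 under clustering.
deff = design_effect(30, 0.05)
n = inflated_n(200, 30, 0.05)
```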
Tests whether a new treatment is not meaningfully worse than an existing one, within a pre-specified margin. Uses one-sided testing. The goal is to show the new treatment is "good enough" while potentially offering other advantages (fewer side effects, lower cost, easier administration). Best for: generic drugs, biosimilars, or new treatments expected to match (not beat) the standard of care.
Tests whether two treatments are equivalent within a pre-specified margin (±Δ) using Two One-Sided Tests (TOST). Both tests must reject to conclude equivalence. Unlike NI testing (which only checks one direction), TOST confirms the difference is bounded in BOTH directions. Best for: bioequivalence studies, generic drug approval, validating measurement instruments, or any context where you need to prove two treatments produce essentially the same result.
Tests two interventions simultaneously by crossing them (A vs no-A) × (B vs no-B), creating 4 cells. Main effects use all participants (efficient), but interaction tests have less power. Best for: testing two interventions that can be combined (e.g. drug + therapy, exercise + diet), especially when interaction is not the primary question.
Outcome▼
What are you trying to detect, and how noisy is the data?
Treatment effect (δ)
Standard deviation (σ)
Control rate (p₀)
Treatment rate (p₁)
Treatment mean
Control mean
Treatment SD
Control SD
Treatment events
Control events
Treatment N
Control N
Or enter an effect size
NI margin (Δ)
Hazard ratio (HR)
The hazard ratio compares how quickly events happen in the treatment group versus the control group. An HR of 0.70 means treated participants experience the event (death, relapse, etc.) at 70% of the control rate, i.e. a 30% reduction in risk. An HR of 0.50 means a 50% reduction (the treatment cuts the event rate in half). Smaller HR = bigger treatment benefit = fewer total events needed to prove it works.
Median survival, control (months)
Median time to event in the control arm, in months: the time by which 50% of control participants have experienced the event. Used to calculate the baseline hazard rate under an exponential survival model.
Accrual period (months)
Duration of participant enrollment in months. Participants enter the study uniformly over this period. Longer accrual means more variable follow-up times across participants.
Minimum follow-up (months)
Minimum follow-up time after the last participant is enrolled, in months. The total study duration is accrual + follow-up. Longer follow-up means more events are observed, improving power.
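As a rough cross-check of how the hazard ratio drives the events requirement, here is a sketch of the standard Schoenfeld approximation for the log-rank test (an assumption on our part; NeuPower's survival engine may treat accrual and follow-up more exactly):

```python
import math
from statistics import NormalDist

def schoenfeld_events(hr, alpha=0.05, power=0.80, alloc=0.5):
    """Required number of events under the Schoenfeld approximation for a
    two-sided log-rank test. alloc = proportion allocated to treatment."""
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / 2)
    z_b = z.inv_cdf(power)
    return (z_a + z_b) ** 2 / (alloc * (1 - alloc) * math.log(hr) ** 2)

# HR = 0.70, 1:1 allocation, 80% power: roughly 247 events are needed.
events = schoenfeld_events(hr=0.70)
```

Note that this counts events, not participants: accrual and follow-up duration determine how large N must be for that many events to occur.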
Design▼
What study are you running, and how big does it need to be?
Size
Number of cohorts (K)
Number of cohorts (K)
Participants per cohort
Clusters per step
What this does: Controls how many cohorts cross over from control to treatment at the same time. By default (1), each cohort crosses over on its own - so K cohorts means K crossover points and K+1 measurement periods.
When you increase this: Cohorts are grouped into batches that cross over together. For example, with 12 cohorts and 4 per step, there are only 3 crossover points and 4 measurement periods - instead of 12 crossover points and 13 periods.
Why use this: In practice, it's often infeasible to stagger every single site individually. Hospitals, clinics, or schools are rolled out in batches (e.g., 4 sites switch to the new protocol each quarter). This slider models that real-world constraint. More clusters per step = fewer measurement periods = simpler logistics, but potentially less statistical information per cluster.
Key relationship: Steps = K ÷ clusters/step. Periods = steps + 1. Total clusters (K) stays the same - they're just grouped differently.
Reference: Hooper et al. (2016) use this framework: their example has 12 clusters with 4 per step = 3 steps.
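The key relationship above can be sketched as (a hypothetical helper, not NeuPower's code):

```python
def wedge_layout(k_clusters, clusters_per_step):
    """Steps = K / clusters-per-step; periods = steps + 1.
    K must divide evenly into batches."""
    if k_clusters % clusters_per_step:
        raise ValueError("clusters per step must divide K evenly")
    steps = k_clusters // clusters_per_step
    return steps, steps + 1

# Hooper et al. (2016) example: 12 clusters, 4 per step -> 3 steps, 4 periods.
steps, periods = wedge_layout(12, 4)
```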
Control assessments (nDE)
Treatment assessments (nTX)
Follow-up assessments (nFU)
Correlation inputs
Intraclass Correlation (ICC)
Between-cluster ICC
Cluster size variability (CV)
Structure
Randomization unit
Sites: each row of the design matrix is a cluster (clinic, hospital, school). Treatment is rolled out to whole sites at a time. K is the number of sites, and every site contributes m patients per period.
People: each row is a cohort (delay group) of individually-randomised participants who cross from control to treatment together. The primary design knob is Number of cohorts; People per cohort is the secondary knob, and total participants (K) is derived as cohorts × people-per-cohort. m is forced to 1. Use this when you randomise individual patients to staggered start dates rather than randomising whole clinics.
The power math (H&H GLS) is identical in both modes — only the labels and the default small-K warning threshold change.
Calendar-anchored: All cohorts are assessed at shared calendar time points (e.g. Jan, Feb, Mar). The model includes a time covariate to adjust for secular trends (maturation, seasonal effects, practice effects). This is more conservative but protects against time-varying confounds. The treatment effect is interpreted as the improvement beyond what would have happened due to time alone.
Rolling (unchecked): Each cohort's assessments are relative to their own crossover point. No shared time axis. The model omits the time covariate, which yields higher power but assumes no systematic time trends. The treatment effect is interpreted as the within-person change from baseline. Choose rolling when enrollment is staggered and calendar time is not meaningful.
Baseline step increment
Follow-up step increment
Carryover proportion (λ)
Carryover (λ): if the treatment effect doesn't fully wash out between phases, some lingers into the control phase, reducing the detectable contrast by λ/2.
0Perfect washout (≥5 half-lives)
0.05-0.15Adequate washout, short-acting drugs
0.15-0.30Suboptimal washout, slow-clearance drugs
0.30-0.50Substantial. Consider parallel design
>0.50Crossover likely inappropriate
Without data, use 0.10-0.20 as a conservative sensitivity analysis.
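A minimal sketch of the λ/2 contrast reduction described above, assuming the effective contrast is δ·(1 − λ/2) (illustrative function name; NeuPower's crossover math may apply this differently):

```python
def effective_contrast(delta, lam):
    """Carryover shrinks the AB/BA treatment-control contrast by a
    fraction lam/2, per the lambda/2 reduction described above."""
    return delta * (1 - lam / 2)

# lam = 0.20 (upper end of the suggested sensitivity range)
# loses 10% of the detectable contrast: 0.50 -> 0.45.
shrunk = effective_contrast(0.5, 0.20)
```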
Power for
Main effect: tests whether factor A (or B) has an overall effect, averaging over the other factor.
Allocation
Treatment : Control ratio
1 : 1 (equal)
Inference▼
How will significance be tested?
Alpha (two-sided α)
Target power (1 − β)
Expected dropout rate (%)
A study may test more than one hypothesis: two primary endpoints, both main effects in a factorial, or multiple subgroup comparisons. Each test at α=0.05 increases the chance of a false positive. With 3 independent tests at α=0.05, there is a ~14% chance of at least one false positive instead of 5%.
A multiple comparison correction lowers the per-test significance threshold so the overall false positive rate stays at the intended α. This reduces power, meaning you need more participants to detect the same effect.
Tests
The number of confirmatory hypotheses in your study.
2: Two co-primary endpoints, or both main effects in a factorial. 3: Both main effects + interaction in a factorial, or three co-primary endpoints. 4+: Multiple dose levels, subgroups, or endpoints. Type a number up to 20.
Only count pre-specified confirmatory analyses. Exploratory or secondary analyses don't need correction here.
Method
Divides your significance level by the number of tests: α/k.
With α=0.05 and 2 tests, each test uses α=0.025. With 3 tests, each uses α=0.0167.
This is the most widely used correction. It is conservative: it guarantees the overall false positive rate stays below α no matter how the tests are correlated, so it may slightly overestimate the sample size needed.
Uses the formula 1−(1−α)^(1/k) to compute the per-test threshold.
With α=0.05 and 2 tests, each test uses α≈0.0253 (vs 0.025 for Bonferroni). With 3 tests, α≈0.0170 (vs 0.0167).
Slightly less conservative than Bonferroni, giving marginally more power. The difference is small in practice. Technically exact when the tests are independent, and still valid when tests are positively correlated (which is the usual case in clinical trials).
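The two thresholds, and the uncorrected family-wise error rate, can be sketched in a few lines (illustrative helpers, matching the worked numbers above):

```python
def bonferroni_alpha(alpha, k):
    """Per-test threshold: alpha / k."""
    return alpha / k

def sidak_alpha(alpha, k):
    """Per-test threshold: 1 - (1 - alpha)^(1/k)."""
    return 1 - (1 - alpha) ** (1 / k)

def fwer(alpha_per_test, k):
    """Family-wise error rate for k independent tests."""
    return 1 - (1 - alpha_per_test) ** k

# Uncorrected, 3 independent tests at 0.05 -> ~14% chance of >=1 false positive.
# Bonferroni at k=2 gives 0.025; Sidak gives the slightly looser ~0.0253.
fw = fwer(0.05, 3)
bonf = bonferroni_alpha(0.05, 2)
sidak = sidak_alpha(0.05, 2)
```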
Advanced▼
Correlation structure
Block exchangeable (default): cluster-level correlation between any two periods is the same constant, τ²·CAC. This is the standard Hussey & Hughes (2007) assumption extended by Hooper et al. (2016). Appropriate when the study spans a short window and cluster effects are expected to remain roughly stable.
Discrete time decay: cluster-level correlation decays with the distance between periods as τ²·CAC^|t−t′|. This models settings where a cluster's performance in period 1 is highly predictive of period 2, but much less predictive of period 5. More conservative than block exchangeable when CAC < 1, and better suited for studies spanning many periods or longer time intervals.
Both structures reduce to the standard compound symmetry model when CAC = 1.
Reference: Hooper et al. (2016), Statist. Med., 35(26), 4718-4728, §3.
τ²·CAC - same correlation at all lags
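The two structures can be sketched as period-by-period covariance matrices (a hedged illustration with our own function names, following the Hooper et al. parameterization described above):

```python
import numpy as np

def cluster_covariance(tau2, cac, periods, decay=False):
    """Cluster-level covariance across periods.
    Block exchangeable: tau2 * CAC off the diagonal, tau2 on it.
    Discrete time decay: tau2 * CAC**|t - t'|, falling with lag."""
    lags = np.abs(np.subtract.outer(np.arange(periods), np.arange(periods)))
    if decay:
        return tau2 * cac ** lags
    cov = np.full((periods, periods), tau2 * cac)
    np.fill_diagonal(cov, tau2)
    return cov

# With CAC = 1 both reduce to the same compound-symmetry matrix.
same = np.allclose(cluster_covariance(0.1, 1.0, 4),
                   cluster_covariance(0.1, 1.0, 4, decay=True))
```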
Correlation input
ICC / CAC (default): Specify the within-person intraclass correlation (ICC) and cluster autocorrelation (CAC) directly. ICC captures how similar repeated measures are within the same person; CAC captures how stable the cluster effect is over time.
ρ₀ / ρ₁ (Kasza): Alternative parameterization using within-period ICC (ρ₀) and between-period ICC (ρ₁). Popular in the stepped wedge literature. In this mode, the ICC slider above becomes ρ₀ and a separate ρ₁ slider appears. Relationships: CAC = ρ₁/ρ₀, ICC = ρ₀. Must satisfy ρ₁ ≤ ρ₀.
Reference: Kasza et al. (2019), Statist. Med., 38(22), 4292-4309.
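The conversion between the two parameterizations is a one-liner (sketch only, using the relationships stated above):

```python
def rho_to_icc_cac(rho0, rho1):
    """Kasza (rho0, rho1) -> Hooper (ICC, CAC): ICC = rho0, CAC = rho1 / rho0.
    Requires rho1 <= rho0."""
    if rho1 > rho0:
        raise ValueError("must satisfy rho1 <= rho0")
    return rho0, rho1 / rho0

# rho0 = 0.05, rho1 = 0.04  ->  ICC = 0.05, CAC = 0.8
icc, cac = rho_to_icc_cac(0.05, 0.04)
```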
ρ₁ (between-period ICC)
Between-period ICC (ρ₁): Correlation between two individuals in the same cluster measured in different periods. Always ≤ ρ₀ because temporal separation adds noise. The ratio ρ₁/ρ₀ equals the cluster autocorrelation (CAC).
Typical values: Usually close to ρ₀ for short studies (stable clusters), but can be much smaller for long studies with staff turnover or changing patient populations.
Cluster Autocorrelation (CAC)
Cluster Autocorrelation (CAC) answers: if a clinic performs above average in January, is it still above average in June?
CAC = 1 means cluster effects are perfectly stable (a good clinic stays good). CAC < 1 means cluster performance drifts over time, which reduces the benefit of repeated measurements within the same cluster.
Quick guide: • 1.0: Perfectly stable clusters. Standard assumption for short studies. • 0.8: Conservative default for site-based studies spanning several months. • 0.6-0.7: Substantial drift. Multi-year studies or sites with high staff turnover.
How it differs from ICC and IAC: ICC is about similarity between people in a group. CAC is about whether the group itself changes over time. IAC is about whether each person changes over time.
Under block exchangeable: CAC sets a constant correlation floor - all period-pairs share the same cluster covariance τ²·CAC. Under discrete time decay: CAC is the per-step decay rate - covariance between periods t and t′ is τ²·CAC^|t−t′|. Adjacent periods are highly correlated; distant ones less so.
Reference: Hooper et al. (2016), Statist. Med., 35(26), 4718-4728.
Individual Autocorrelation (IAC)
Individual Autocorrelation (IAC) answers: of the within-person noise (the part NOT explained by ICC), how much persists from one time point to the next?
Think of it this way: ICC captures stable person effects (Alice always scores higher than Bob). IAC captures whether the leftover fluctuations are also sticky (if Alice has a bad day, is tomorrow also bad?).
If your cohorts are randomly assigned groups (not clinics or schools): IAC is especially important for you. Since your cohorts have no inherent cluster structure, ICC will typically be high (most variance is between-person) and CAC can stay at 1.0. IAC then captures how much of the remaining within-person noise carries over between assessments. Set it based on your outcome measure's short-term stability.
Quick guide: • 0.0: Within-person noise is completely random each time. Most conservative. Also use for repeated cross-section designs (different people each period). • 0.3-0.5: Some persistence. Mood, pain, or behavioral outcomes. • 0.5-0.7: Good persistence. Validated scales (e.g. depression, self-esteem). • 0.8-0.9: Very persistent noise. Chronic biomarkers, cognitive ability.
Relationship to ICC: True test-retest reliability = ICC + (1−ICC) × IAC. Example: ICC=0.50, IAC=0.70 gives test-retest = 0.85. Higher IAC boosts power in closed cohort designs.
CAC clamped to 0.001. CAC = 0 with IAC > 0 is a degenerate regime — clusters reset each period while individuals remain autocorrelated within them. A small cluster-level anchor is needed for the information matrix to stay positive-definite.
Effect shape
Effect shape controls how the treatment effect develops in the periods after a cluster crosses over. Default is Step (instant full effect, the classic Hussey-Hughes assumption). Choose another shape if you expect the effect to ramp up, fade, or plateau below full strength.
Step — default
Full effect from the moment a cluster crosses over, forever after. Matches the classic power formula. Hussey & Hughes 2007.
Ramp-up
Effect grows linearly from 0 at crossover to full strength after R periods. Typical for therapies that take time to reach steady state. Set R below.
Fade
Effect peaks at crossover and declines linearly to 0 over R periods. Typical for interventions that wear off. Set R below.
Saturation
Effect plateaus at a fraction of the full target (never reaches 100%). Set the plateau level below.
Shape length (R periods)
R is the number of calendar periods (time steps in your design) over which the shape plays out after a cluster crosses over.
For Ramp-up: the effect grows from 0 at crossover to full strength after R periods. If R = 3 and a cluster crosses at period 2, the effect is 1/3 at period 2, 2/3 at period 3, and 1 at period 4 onward.
For Fade: the effect peaks at crossover and drops to 0 after R periods. If R = 3, weights are 1, 2/3, 1/3, 0 at periods 0–3 after crossover.
For Saturation: this field becomes the plateau level (0–1) instead of a period count — the effect is clamped to that fraction of full strength at every post-crossover period.
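The weights from the worked examples above can be sketched as (an illustration following those examples, not NeuPower's internal weighting code):

```python
def shape_weight(shape, exposure, r):
    """Treatment-effect weight at `exposure` periods after crossover.
    r is the shape length R, except for saturation where it is the plateau."""
    if shape == "step":
        return 1.0
    if shape == "ramp":
        return min(1.0, (exposure + 1) / r)   # 1/3, 2/3, 1, 1, ... for R = 3
    if shape == "fade":
        return max(0.0, 1 - exposure / r)     # 1, 2/3, 1/3, 0 for R = 3
    if shape == "saturation":
        return r                              # clamped to the plateau level
    raise ValueError(shape)

ramp3 = [shape_weight("ramp", e, 3) for e in range(4)]
```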
Degrees of freedom
Degrees of freedom picks the critical value the t-test uses. Fewer df → more conservative power. All four methods converge once K reaches roughly 30.
K − 2 — default
Fixed formula: df = K − 2 (K = number of clusters). Simple, widely reported. Has a small jump at K=2 (falls back to z) and K=3 where t(1) has a big critical value. Hooper et al. 2016, Stat Med.
Satterthwaite
Smoothly approximates effective df from the variance-component uncertainty. No K=2→K=3 jump. Slightly more conservative than K−2 at very small K. Uses the diagonal of the variance-component Fisher info.
Kenward-Roger
Full small-sample correction: adjusts both the variance estimator and df. Matches pbkrtest and SAS DDFM=KENWARDROGER2. Typically 1–3 pp below Satterthwaite at small K. Kenward & Roger 1997, Biometrics; Li & Redden 2015 for SWD.
Normal (z) — opt-in back-compat
Fixed z-critical regardless of K. Reproduces Hemming-Taljaard 2016 Table 3 numbers and other z-based sample-size calculators. Higher power than K−2 at small K — use only to match those older tools. 3–8 pp gap above K−2 at K<30, per the t-vs-z table in docs/VALIDATION.md.
Total N
-
Power
-
Effect (Cohen's d)
-
Min N for 80% power
-
⚠ Cohort-anchored requires at least 2 control-phase assessments (nDE ≥ 2). With only one baseline observation, the model cannot separate natural change over time from the treatment effect. If you never observe how participants change during the control phase, you can't attribute post-crossover improvement to treatment rather than time. A second pre-crossover assessment anchors the time slope and is essential for this design. Power at nDE=1 collapses to near the alpha level regardless of sample size. Either add a second control-phase assessment, or switch to Calendar-anchored: No (rolling), which drops the time covariate and works fine at nDE=1.
⚠ Extreme event rates detected. The linearized normal approximation used for binary power is most accurate for event rates between 0.10 and 0.90. At extreme rates, the approximation may over- or underestimate power. For validation, use the exported R simulation script which uses exact logistic mixed models.
Design matrix
CSV design matrix — one row per cluster, one column per period. Cell values (case-insensitive):
• Control (pre-treatment): DE / 0 / C / control
• Treatment: TX / 1 / T / treatment
• Excluded (cell not observed): X / - / . / blank
• Follow-up: FU / followup / follow-up
• Follow-up coded as control: FU-ctrl / FUC
Example row: DE,DE,TX,TX. Accepted separators: commas, tabs, semicolons, pipes. Lines starting with # are treated as comments and skipped.
Design info
Design comparison
Power visualized
Power distributions
Variance breakdown
Sensitivity analysis
Swipe → variance breakdown → sensitivity
Sensitivity analysis min N across ICC × effect
Summary paragraph
Statistical model
Export report
Parameters, justification, design matrix, and sensitivity table.
Share configuration
Recomputing…
NeuPower — Quick Guide
NeuPower computes sample-size and power for clinical trial designs using generalized least-squares (GLS) mixed-model frameworks.
1. Choose a design
Use the mode tabs at the top to select your trial design: SW (stepped wedge), RCT (parallel/clustered), Crossover, NI (non-inferiority), Equivalence, Factorial, or Survival. Each mode shows the controls relevant to that design.
2. Set parameters
Drag sliders or type values into the input boxes. Effect size (δ/σ) and ICC are the key determinants of required N. Use the pilot-data importer (↗ icon) to derive δ and σ from raw summary statistics.
3. Read the output cards
Power — achieved power at current N.
Min-N — smallest sample size reaching the target power (default 80%).
Effect — Cohen’s d (or risk difference for binary outcomes) and the MDE at current N.
4. Verify with simulation
Click Simulate in the Power card to run a Monte Carlo check (10,000 trials by default). The simulated power appears as a confidence interval below the analytical result. Enable Auto-verify to run a simulation automatically when parameters change.
5. Advanced options
Expand the Advanced section to control correlation structure (block-exchangeable vs. discrete decay), degrees-of-freedom method (K−2, Satterthwaite, Kenward-Roger, or Normal z), time model (period FE vs. linear trend), Effect shape for time-varying treatment effects (Step / Ramp-up / Fade / Saturation), and more.
What's Neu?
v0.3.72 · plain-English recap
New in v0.3.72
Logit-link GLMM variance framework for binary outcomes. Third variance framework in the engine, alongside the existing Kasza marginal path and the continuous Hussey-Hughes GLS. Matches the modern biostatistical standard used by Hughes et al. 2024 and Xia et al. 2021, and narrows NeuPower's gap to that paper's published numbers by 80% compared with v0.3.71.
Earlier this release cycle
Published-example cross-check for the time-varying-effect model (v0.3.71). The 2024 Hughes et al. ADDRESS-BP stepped-wedge trial is now pinned in the validation suite; brings the count to 13 fixtures across 6 papers.
Free-coefficient time-varying effect, binary outcomes (v0.3.70). Binary sibling of v0.3.69's continuous ETI math: fit a separate post-crossover effect at each exposure time, then test any weighted average.
Free-coefficient time-varying effect, continuous outcomes (v0.3.69). First piece of a new model option: fit a separate treatment effect for each exposure-time period, then test any weighted average.
Validation summary tracked in one place (v0.3.68). NeuPower now lists exactly which published papers and which external tools it has been cross-verified against at the top of the validation doc. Count as of this release: 6 papers, 12 fixtures, 4 independent tools.
Cross-tool methodology audit, closed (v0.3.67). Read the other tool's source code to pin down why its numbers differ from NeuPower's. The gap is entirely its older cluster-size-input convention plus its normal-approximation critical value; once normalized, NeuPower in legacy mode agrees within a percentage point.
Cross-tool methodology audit (v0.3.66). Compared NeuPower's attained-power numbers against another analytic power tool head-to-head; confirmed the two disagree by design because NeuPower uses more modern small-sample and variance conventions.
Attained-power view locked in with a validation test (v0.3.65). New published-example test pins the histogram's mean and percentile summary against a classical stepped-wedge reference from Ouyang 2020, so future refactors can't silently drift the numbers.
Attained power when cluster sizes vary (v0.3.64). New Attained power… button above the design matrix (stepped wedge) opens a histogram of power across realistic size mixes — a design pegged at 80% mean can still have a 25th percentile well below target.
Rename your period columns (v0.3.63). Click any period header in the design matrix (Period 1, Follow-up, Baseline, etc.) and type your own name — Week 4, 3-month visit, whatever fits your trial. Enter commits, Escape reverts. Labels save with your scenario and show up in CSV exports.
Matrix drag hardened (v0.3.62). Added a regression test that locks in v0.3.58's drag-during-re-render fix so it can't silently break again in a future refactor.
Sigmoid reset now clears Effect shape (v0.3.61). Clicking the logo restores default settings; previously the ETI shape (ramp-up / fade / saturation) stuck around. Now it snaps back to Step.
Shape-aware binary df (v0.3.60). When you use a non-step Effect shape (ramp-up, fade, saturation) with a binary outcome, the Satterthwaite / Kenward-Roger degrees of freedom now reflect the actual shape-weighted contrast. Step shape is unaffected.
LCARS theme (v0.3.59). Ninth theme option — authentic Okudagram palette (orange / peach / lavender / rose) with stat-card colored top-bands and bold Antonio pill buttons.
Design-matrix drag fixed (v0.3.58). Dragging a cell to a new position sometimes silently no-op'd when the grid re-rendered mid-gesture. Drags now cancel cleanly on re-render, and drop resolution falls back to the live grid if the captured one went stale — so the offset change lands whether or not the background recompute races.
Footer & help copy refreshed (v0.3.58). Footer line now names every supported design family (parallel, clustered, stepped wedge, crossover, non-inferiority, equivalence, factorial, survival) plus the KR / Satterthwaite small-sample story. Help modal's Advanced summary updated to list all four DoF methods and the new Effect shape control.
Time-varying treatment effects (v0.3.57). Effect shape control in Advanced (stepped wedge). Pick Step (default, instant full effect), Ramp-up, Fade, or Saturation and the power calculation reflects how the treatment develops over the periods following each cluster's crossover. One parameter per shape: ramp/fade length or plateau level.
Simpler Degrees of freedom helper. Replaced the wall of math in the DoF i tooltip with a one-line intro plus a click-to-expand explainer for each method (K−2, Satterthwaite, Kenward-Roger, Normal z). Tucked inside Advanced.
Dedicated CSV toolbar. Import CSV, Copy CSV, and Save CSV ↓ now share one right-justified row above the design matrix with a small i toggle that reveals the format reference inline — no more hunting them across the panel.
Resources tab. New Resources button in the header opens a full bibliography organised by design (stepped wedge, cluster, crossover, non-inferiority, survival, …) so you can see exactly which peer-reviewed paper each formula comes from.
Plainer About modal. Rewritten in normal English — what NeuPower does, how it works, who built it, what it is not. Citations moved to the new Resources tab.
Bigger NeuPower logo on phones. The wordmark was too small next to the sigmoid on mobile; both now read as a proper lockup.
Cleaner "Show the math" panel. Dropped the legacy verification walkthrough (built for parallel-only designs, not stepped wedge or clustered). Formulas now fill the whole row. For reproducibility code, use Export & share → R / Python code.
Differential clustering (IRGT). Clustered RCT checkbox: the treatment arm is delivered in groups (therapy sessions, classrooms), so treatment clusters while controls receive usual care one-on-one. Uses the Roberts & Roberts 2005 formula with Satterthwaite df.
R / Python reproducibility code. Button under Export & share opens a panel with a short, copy-paste R script (swCRTdesign / clusterPower) and a Python script (numpy+scipy or statsmodels) that reproduces the current power calculation outside NeuPower.
Parallel + baseline button. Preset above the matrix builds the 2-period, half-control / half-crossover design in one click. Uses the current cluster count. Hemming-Taljaard 2016.
Custom staircase builder. Custom staircase… button opens a dialog: pick sequences, clusters per sequence, and pre/post-switch period counts. Defaults to the Grantham 2024 NICU layout.
Recent highlights
Download and copy your design. Copy CSV and Save CSV ↓ buttons above the matrix. The exported file re-imports exactly, so you can iterate, share, or archive a design in one click.
CSV format help. The small i button next to the CSV buttons reveals the accepted cell values, an example row, and the round-trip guarantee inline.
Upload your own design. Import CSV button above the matrix. Rows = clusters, columns = periods, values = DE / TX / X (or 0 / 1 / –).
Parallel-with-baseline and staircase designs. The math engine scores these layouts — six reference-paper tests in the suite confirm we match Hemming-Taljaard 2016 Table 3 and Grantham 2024 at the 4-decimal level.
Design Comparison matches Min N. The Stepped Wedge bar in the comparison panel was doing a different search than the Min N card. Both now line up.
"Match older tools" option.Normal (z) setting in Advanced → Degrees of freedom. Flip it on when you need numbers that line up with legacy sample-size tables.
Earlier wins worth knowing about
Three degrees-of-freedom methods for stepped-wedge designs (under Advanced): the classic K−2 default, Satterthwaite, and the full Kenward-Roger small-sample correction. Pick the one your analyst will use.
CSV, Word, and PDF exports under Export & share, plus saved scenarios that stick around between browser sessions.
Editable design matrix. Right-click any cell to change its phase, drag to reorder, or use the staircase / dog-ear preset buttons for incomplete designs.
Design info that auto-adapts. The right-column panel names your design (Classic SW, Staircase, Dog-ear, Parallel, etc.), explains when to use it, and links to real-world examples.
Guardrails. An audit banner warns you about broken designs before they show a misleading power number.
Live updates. Sliders move the power card, min N, matrix, and narrative instantly — no more lag.
Full technical change log is in the repo under docs/.
Coming Soon
A short preview of what's planned. Order may shift based on feedback.
Nicer to use
Guided tour for first-time users.
Behind the scenes
Validation paper. A peer-reviewed write-up of how NeuPower works and how it compares to existing tools.
Effect heterogeneity and cost-effectiveness modules for more specialised study types.
Performance tuning for the bigger, heavier calculations.
Want something we haven't listed? Open an issue on GitHub.
Staircase SC(S,K,R0,R1)
Parametric build of a staircase CRT. Each sequence observes R0 control periods then R1 treatment periods, shifted one calendar period per sequence. Grantham et al. 2024.
Total clusters: 4 · calendar periods: 5
Presets: SC(S,1,1,1) — classic one-step staircase (Kasza & Forbes 2019). SC(4,1,1,1) — Grantham 2024 NICU. SC(S,K,2,2) — dog-ear with K clusters per sequence.
Attained-power distribution
Under unequal cluster sizes, SW-CRT attained power is a distribution, not a point. Sizes are drawn from lognormal(n̄, CV); per-realization analytical power follows Hussey-Hughes GLS with per-cluster n (Ouyang 2020, BMC Med Res Methodol 20:166).
Seed pins the cluster-size realization so results are reproducible. Higher CV widens the distribution — a design with "80% mean attained power" can still have a 25th percentile well below the 80% target.
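The sampling step can be sketched as follows (an assumption on our part: a lognormal matched to mean n̄ and coefficient of variation CV by moments; NeuPower's draw may differ in details such as rounding):

```python
import math
import random

def draw_cluster_sizes(n_bar, cv, k, seed=1):
    """Draw k cluster sizes from a lognormal whose mean is n_bar and whose
    coefficient of variation is cv, rounded to integers of at least 1."""
    sigma2 = math.log(1 + cv ** 2)          # moment-matching the CV
    mu = math.log(n_bar) - sigma2 / 2       # moment-matching the mean
    rng = random.Random(seed)               # seed pins the realization
    return [max(1, round(rng.lognormvariate(mu, math.sqrt(sigma2))))
            for _ in range(k)]

# Higher CV spreads the sizes further from n_bar, widening the power distribution.
sizes = draw_cluster_sizes(n_bar=25, cv=0.4, k=12, seed=42)
```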
Reproducibility code
A short script that reproduces the current power calculation in R or Python. Intended as a cross-check, not a substitute — small differences can come from t-vs-z critical values or the df method.
(generated on demand)
(generated on demand)
NeuPower
Clinical Trial Power Calculator · v0.3.72
NeuPower is a sample-size and power calculator for clinical trials. It runs entirely in your browser — nothing is installed and no design or input data leaves your device.
What it does
Computes statistical power and minimum sample size for the designs that matter most in clinical research — parallel RCTs, stepped wedge cluster trials, crossover designs, clustered RCTs (including differential clustering for group-therapy trials), non-inferiority, equivalence, factorial 2×2, and survival. It runs the same math a trained biostatistician would run by hand or in R, including the Kenward-Roger and Satterthwaite small-sample corrections that older textbook tools often skip.
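As a flavor of the underlying math, here is the textbook normal-approximation power calculation for a two-arm parallel RCT with a continuous outcome. NeuPower additionally applies t-based small-sample corrections, so its numbers can differ slightly from this sketch.

```python
from statistics import NormalDist

def two_arm_power(delta, sd, n_per_arm, alpha=0.05):
    """Power of a two-sided z-test comparing two means with n_per_arm
    per group (normal approximation; a sketch, not NeuPower's code)."""
    z = NormalDist()
    se = sd * (2.0 / n_per_arm) ** 0.5     # SE of the mean difference
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    return z.cdf(abs(delta) / se - z_crit)

print(round(two_arm_power(delta=0.5, sd=1.0, n_per_arm=64), 3))
# 0.807 -- the classic ~80% power benchmark for a standardized effect of 0.5
```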
How it works
Every number you see is computed in your browser from published, peer-reviewed formulas. You can edit the design matrix by hand, export the exact R or Python code that reproduces the result, and cross-check against landmark published examples — each design family is anchored to numerical fixtures from a peer-reviewed paper.
Who built it
Developed by Nicholas Neuwald, PhD at the University at Buffalo. Maintained as a research and educational tool.
What it is not
Not a substitute for biostatistical consultation on a real trial.
The peer-reviewed references and methods behind each design in NeuPower. Each design family is anchored to at least one numerical fixture from a landmark paper — see docs/VALIDATION.md in the repository for the fixture-by-fixture match.
Stepped Wedge Cluster Randomised Trial (SW-CRT)
Hussey MA, Hughes JP (2007). Design and analysis of stepped wedge cluster randomized trials. Contemporary Clinical Trials, 28, 182–191. Foundational power formula.
Hooper R, Teerenstra S, de Hoop E, Eldridge S (2016). Sample size calculation for stepped wedge and other longitudinal cluster randomised trials. Statistics in Medicine, 35, 4718–4728. Block-exchangeable correlation; K−2 df convention.
Kasza J, Forbes AB (2019). Inference for the treatment effect in multiple-period cluster randomised trials when random effect correlation structure is misspecified. Statistical Methods in Medical Research, 28(10–11), 3112–3122.
Hemming K, Taljaard M (2016). Sample size calculations for stepped wedge and cluster randomised trials: a unified approach. Statistics in Medicine, 35(26), 4695–4709. Parallel + baseline CRT-BA formulation.
Hemming K, Taljaard M, McKenzie JE, et al. (2018). Reporting of stepped wedge cluster randomised trials: extension of the CONSORT 2010 statement with explanation and elaboration. BMJ, 363, k1614.
Hughes JP, Heagerty PJ, Xia F, et al. (2024). ADDRESS-BP: accounting for time-varying treatment effects in stepped wedge CRTs. Background for the Effect-shape control (Advanced → Effect shape).
Hemming K, Taljaard M, Forbes A (2016). Statistics in Medicine, Table 3 anchor. Parallel + baseline CRT-BA.
Roberts C, Roberts SA (2005). Design and analysis of clinical trials with clustering effects due to treatment. Clinical Trials, 2(2), 152–162. Differential clustering / IRGT: treatment clusters, controls i.i.d.
Eldridge SM, Ashby D, Kerry S (2006). Sample size for cluster randomized trials: effect of coefficient of variation of cluster size and analysis method. International Journal of Epidemiology, 35, 1292–1300.
Small-sample Degrees-of-freedom Corrections
Kenward MG, Roger JH (1997). Small sample inference for fixed effects from restricted maximum likelihood. Biometrics, 53, 983–997.
Satterthwaite FE (1946). An approximate distribution of estimates of variance components. Biometrics Bulletin, 2(6), 110–114.
Li P, Redden DT (2015). Small sample performance of bias-corrected sandwich estimators for cluster-randomized trials with binary outcomes. Statistics in Medicine, 34(2), 281–296. KR adaptation for SWD.
Crossover RCT
Senn SJ (2002). Cross-over Trials in Clinical Research (2nd ed.). Chichester: Wiley. 2×2 AB/BA crossover.
Non-inferiority & Equivalence (TOST)
Schuirmann DJ (1987). A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics, 15(6), 657–680. TOST.
Julious SA, Owen RJ (2011). A comparison of methods for sample size estimation for non-inferiority studies with binary outcomes. Statistical Methods in Medical Research, 20(6), 595–612.
Survival / Time-to-event
Schoenfeld DA (1983). Sample-size formula for the proportional-hazards regression model. Biometrics, 39(2), 499–503.
Freedman LS (1982). Tables of the number of patients required in clinical trials using the logrank test. Statistics in Medicine, 1(2), 121–129.
Factorial Designs
Montgomery DC (2013). Design and Analysis of Experiments (8th ed.). Wiley. 2×2 factorial main-effects framework.
Green SB, Byar DP (1984). The choice of treatment allocation ratio in clinical trials. Controlled Clinical Trials, 5(1), 55–64.
Internal documentation
For the complete fixture-by-fixture validation table and the t-vs-z percentage-point cost table, see docs/VALIDATION.md in the source repository. Architecture and phase plans are under docs/.
Citations are included for methodological provenance. This list is not exhaustive — it covers the papers whose formulas NeuPower directly implements. Contact the author for questions about specific derivations.