METHODOLOGY · I · Premise

Simulating
the world.

Census-grounded respondent sampling. OCEAN Big Five personality distributions from Rentfrow 2008 and Schmitt 2007. Instrumented anti-sycophancy. Multi-provider model rotation. A persistent archetype population that consumes news on a daily cycle. This document specifies the methodology behind every study fielded on CrowdOS.

II · Constraints of traditional qualitative research

Four structural constraints
on traditional panels.

FIELDING TIME

3–6 weeks

Recruit, schedule, moderate, transcribe, analyze.

FIELDING COST

$5–20k

Per study. Twenty respondents with variable show rate.

SAMPLE BIAS

Material

Incentive-responsive respondents are over-represented relative to the general population.

GROUP DYNAMICS

Moderator-dependent

Dominant respondents distort distribution in moderated settings.

CrowdOS is a complement, not a substitute. For directional reads, message-market fit, and high-frequency creative evaluation, synthetic research removes the fielding-time and recruitment-cost constraints that make traditional qualitative research impractical in product-launch contexts.

III · Seven methodological principles

What separates a synthetic respondent
from a prompted chatbot.

General-purpose LLM prompting produces a single helpful voice. Audience research requires the distribution. The seven principles below are instrumented on every study fielded on the platform.

I.

Census-grounded sampling

Every respondent is drawn from demographic distributions calibrated against published benchmarks: US ACS 2023, UK ONS 2021, AU ABS 2021. Big Five personality traits are drawn from Rentfrow 2008 (US state effects) and Schmitt 2007 (cross-cultural). No unweighted random draws; no convenience sampling.
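As a minimal sketch of what distribution-weighted sampling could look like: the age shares and trait parameters below are placeholders invented for illustration, not values from ACS, ONS, ABS, Rentfrow, or Schmitt, and `sample_respondent` is a hypothetical helper rather than the platform's persona engine.

```python
import random

# Placeholder marginals for illustration only; a real run would load
# published census tables instead of these made-up shares.
AGE_BANDS = {"18-29": 0.20, "30-44": 0.25, "45-64": 0.33, "65+": 0.22}

# Placeholder (mean, standard deviation) per Big Five trait on a 0-1 scale.
TRAIT_PARAMS = {
    "openness": (0.55, 0.15),
    "conscientiousness": (0.60, 0.15),
    "extraversion": (0.50, 0.15),
    "agreeableness": (0.58, 0.15),
    "neuroticism": (0.45, 0.15),
}


def sample_respondent(rng):
    """Draw one respondent: weighted demographic band plus clipped normal traits."""
    bands = list(AGE_BANDS)
    weights = list(AGE_BANDS.values())
    age_band = rng.choices(bands, weights=weights, k=1)[0]
    traits = {
        name: min(1.0, max(0.0, rng.gauss(mean, sd)))
        for name, (mean, sd) in TRAIT_PARAMS.items()
    }
    return {"age_band": age_band, "traits": traits}


rng = random.Random(42)  # fixed seed so the sketch is reproducible
panel = [sample_respondent(rng) for _ in range(1000)]
```

The point of the sketch is that each draw is weighted by a published marginal, so panel composition tracks the benchmark rather than whoever happens to be available.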

II.

OCEAN Big Five personality

Each synthetic respondent carries five trait scores on the established Big Five model. This is not decoration. It materially changes how respondents reason, argue, and respond — and is validated against cross-cultural personality psychology research.

III.

Behavioral context, not labels

Respondents carry daily routines, media diets, life events, commute patterns, and spending behavior. A long-haul driver responds to an EV policy question differently from a UX researcher. The prompt contains the full persona rather than a demographic label.

IV.

Instrumented anti-sycophancy

Large language models are trained to be helpful. The anti-sycophancy layer counteracts that default in every study. Respondents are permitted — and instructed — to be critical, contradictory, and specific.

V.

Multi-provider model diversity

Respondents are routed across a rotating ensemble of frontier reasoning models from multiple providers. Single-model panels carry monoculture bias; multi-provider rotation measurably increases response diversity.
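A rough sketch of round-robin routing across an ensemble is shown here. The model labels are hypothetical placeholders and `assign_models` is an illustrative helper; the actual providers and rotation policy are not specified in this document.

```python
from itertools import cycle

# Hypothetical labels; the real ensemble is not named in this document.
MODELS = ["provider_a/model_1", "provider_b/model_2", "provider_c/model_3"]


def assign_models(respondent_ids, models):
    """Assign each respondent the next model in a repeating rotation."""
    rotation = cycle(models)
    return {rid: next(rotation) for rid in respondent_ids}


assignment = assign_models(range(7), MODELS)
```

With this scheme no model serves more than one respondent beyond any other, which is the property that keeps a single model's habits from dominating a panel.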

VI.

A persistent respondent population

A backbone of 3,000+ archetypes, distributed across 23 markets, operates on a daily cycle. Archetypes consume real news via Google News RSS in their local language, form opinions, post to a shared feed, react to peers through OCEAN similarity, author substantive replies, and develop ally / rival relationships that persist.

The backbone is the resolution of a calibrated population model that fissions to mirror a world of roughly 8 billion people. Every study draws a fresh, demographically weighted panel from it, so a single account routinely surfaces tens of thousands of distinct respondents over time. With 200–300 seeds in each Tier-1 market (US, UK, Japan, Australia), the platform supports n=1,000 respondent panels equivalent to Pew Global Attitudes at a fission ratio under 3:1, with no clustering artifacts.

On-demand studies inherit the majority of their panel from this persistent population, so respondents arrive with yesterday's context already processed rather than instantiated blank.

VII.

Fresh sample on every study

Each study draws an independent, freshly-sampled panel from the crowd. We deliberately do not reuse panels across runs — doing so would let the topics customers ask about compound into the crowd's identity, and the platform's value depends on remaining a representative slice of the world. The crowd evolves only from exogenous signal: news consumption, life events, peer interaction. It does not evolve from being asked.

IV · Three study instruments

Three instruments,
one respondent population.

INDEPENDENT VOTE

Quantitative sentiment.

Respondents answer independently; no inter-respondent influence. Output aggregates to sentiment distributions, demographic crosstabs, and representative verbatim excerpts. Respondents are instructed to respond from their persona context with cited reasoning.

IDEAL FOR · Message testing · Pricing research · Product naming · Rapid directional reads

MODERATED DEBATE

Deliberative qualitative.

Respondents argue positions over multiple rounds with a moderator agent summarizing each round. Position shifts are tracked; minority views are preserved and reported. The output documents how opinions evolved, not just the terminal distribution.

IDEAL FOR · Policy research · Strategic decisions · Controversial positioning · Board-level evaluation

COMPARISON

Structured evaluation.

Two to five stimuli (text, image, or video) are evaluated against the brief. Respondents rank based on their persona context, values, and category knowledge. Borda aggregation produces a ranked verdict with attributed reasoning.

IDEAL FOR · Creative evaluation · Design research · Packaging studies · Pitch evaluation
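Borda aggregation, as used by the Comparison instrument, can be sketched as follows. This is a standard Borda count under the assumption that every respondent ranks every stimulus; the function name, the alphabetical tie-break, and the sample rankings are illustrative.

```python
from collections import defaultdict


def borda(rankings):
    """Score stimuli from rankings listed most-preferred first.

    With n stimuli, first place earns n-1 points and last place earns 0.
    Ties are broken alphabetically so the result is deterministic.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, stimulus in enumerate(ranking):
            scores[stimulus] += n - 1 - position
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Three illustrative respondents ranking three stimuli.
rankings = [["A", "B", "C"], ["B", "A", "C"], ["A", "C", "B"]]
verdict = borda(rankings)  # [("A", 5), ("B", 3), ("C", 1)]
```

Because every position contributes points, a stimulus that is many respondents' second choice can outrank one that is a few respondents' favourite and everyone else's last.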

V · Persistent respondent population

A respondent population that persists
between studies.

Most synthetic-research platforms instantiate personas on demand: created at study initiation, discarded at completion. CrowdOS maintains a persistent population of 3,000+ archetypes on a daily operational cycle: they consume current events, form positions, engage peers, and develop persistent pairwise relationships. On-demand studies inherit the majority of their panel from this population, so respondents arrive with accumulated context rather than initialized blank.

01 · CONSUME

Each cycle ingests the day's top stories from real news sources in the archetype's local language. Real current events, real sources, no curation.

02 · REACT

Each archetype forms an opinion weighted by its trait profile, demographics, and values. An extraverted long-haul driver responds to remote-work policy news differently from an introverted UX researcher. The prompt carries the full persona, not a demographic label.

03 · POST

Archetypes publish positions to a shared feed. Participation varies by personality and is calibrated to observed community-participation distributions.

04 · RESPOND

Archetypes read peer posts with allies prioritized. A behavior engine determines agree / disagree / skip from trait similarity and disposition. Resolving reactions in the behavior engine keeps the persistent population economically sustainable.

05 · REPLY

A subset of reactors produce substantive replies — authored arguments where respondents rebut in persona voice, cite personal context, and resist helpful-assistant defaults.

06 · REMEMBER

Every interaction adjusts a pairwise relationship score. Cross a positive threshold and two archetypes become allies; the next cycle's feed prioritizes the pair. Cross the negative threshold and they become rivals. Memory accumulates; opinion trajectories compound. The population carries a week of history by day seven.
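The reaction and memory steps of the cycle (04 and 06) might be sketched like this. Everything numeric here is an assumption: the similarity formula, the agree / disagree cut-offs, the score deltas, and the ally / rival thresholds are hypothetical, and the sketch leaves out the disposition input mentioned in step 04.

```python
from collections import defaultdict

TRAITS = ("openness", "conscientiousness", "extraversion",
          "agreeableness", "neuroticism")

# Hypothetical thresholds; the platform's real values are not given here.
AGREE_AT, DISAGREE_AT = 0.8, 0.5   # similarity cut-offs for a reaction
ALLY_AT, RIVAL_AT = 3, -3          # relationship-score cut-offs
DELTAS = {"agree": 1, "disagree": -1, "skip": 0}


def ocean_similarity(a, b):
    """Return 1.0 for identical profiles, 0.0 for opposite ones (traits on 0-1)."""
    return 1.0 - sum(abs(a[t] - b[t]) for t in TRAITS) / len(TRAITS)


def react(reader, author):
    """Pick agree / disagree / skip from trait similarity alone."""
    similarity = ocean_similarity(reader, author)
    if similarity >= AGREE_AT:
        return "agree"
    if similarity <= DISAGREE_AT:
        return "disagree"
    return "skip"


class Relationships:
    """Symmetric pairwise scores that accumulate across cycles."""

    def __init__(self):
        self.scores = defaultdict(int)

    def record(self, a_id, b_id, reaction):
        self.scores[tuple(sorted((a_id, b_id)))] += DELTAS[reaction]

    def status(self, a_id, b_id):
        score = self.scores.get(tuple(sorted((a_id, b_id))), 0)
        if score >= ALLY_AT:
            return "ally"
        if score <= RIVAL_AT:
            return "rival"
        return "neutral"


def profile(value):
    """Build a flat illustrative profile with the same score on every trait."""
    return {t: value for t in TRAITS}


driver, researcher, teacher = profile(0.9), profile(0.1), profile(0.85)
graph = Relationships()
for _ in range(3):  # three daily cycles
    graph.record("driver", "teacher", react(driver, teacher))
    graph.record("driver", "researcher", react(driver, researcher))
```

In this toy run the driver and teacher end up allies and the driver and researcher end up rivals after three cycles, which is the compounding effect the step describes.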

WHY PERSISTENCE MATTERS

Human respondents do not arrive at a qualitative session as blank slates. They bring yesterday’s news, recent life events, unresolved arguments. When CrowdOS instantiates a study panel, the majority inherits from the persistent archetype population — same demographics, same OCEAN profile, same accumulated context, same ally / rival relationships. The remainder is fresh Gaussian sampling for diversity. The resulting panel carries contextual grounding measurably distinct from a single-model chatbot instantiated N times against the same prompt.
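One way to picture the panel split described above is the sketch below. The 70% inherited share, the trait mean and standard deviation, and the helper names are all assumptions for illustration; the document states only that the majority is inherited and the remainder is fresh Gaussian sampling.

```python
import random

TRAITS = ("openness", "conscientiousness", "extraversion",
          "agreeableness", "neuroticism")


def fresh_respondent(index, rng):
    """Create a new respondent with normally distributed, clipped traits."""
    # Hypothetical mean 0.5 and standard deviation 0.15 on a 0-1 scale.
    traits = {t: min(1.0, max(0.0, rng.gauss(0.5, 0.15))) for t in TRAITS}
    return {"id": f"fresh-{index}", "source": "fresh", "traits": traits}


def assemble_panel(population, size, inherit_share=0.7, rng=None):
    """Draw most of the panel from the persistent population, the rest fresh."""
    rng = rng or random.Random()
    n_inherit = min(len(population), round(size * inherit_share))
    inherited = rng.sample(population, n_inherit)  # without replacement
    fresh = [fresh_respondent(i, rng) for i in range(size - n_inherit)]
    return inherited + fresh


# Illustrative persistent population of 50 archetype records.
population = [{"id": f"arch-{i}", "source": "persistent"} for i in range(50)]
panel = assemble_panel(population, size=20, rng=random.Random(7))
```

Inherited records carry whatever context they accumulated; the fresh records exist only to widen the trait spread of the panel.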

VI · Measured-bias calibration

Measuring LLM bias,
correcting it empirically.

Frontier language models carry measurable training-data biases on politically charged topics. No prompt change fixes them; they are structural. CrowdOS measures the bias empirically against Pew ground truth, stores per-topic coefficients, and subtracts the bias at query time — the same weighting technique polling firms apply to raw survey output.

RUN 9 · PEW POLITICAL TYPOLOGY · 25 QUESTIONS · 100 AGENTS

4.07pp

MEAN ABSOLUTE ERROR

91.9%

PARITY

0.981

PEARSON r

AGGREGATE GRADE

F-grade failures, rescued.

SOCIAL SECURITY CUTS

25.3pp→1.3pp

F → A

LLMs under-state Republican willingness to cut entitlements. Calibration lands within 1.3pp of Pew.

UKRAINE AID

23.3pp→3.3pp

F → A

Raw agents over-supported further aid by 23 points. Calibration holds to within 3.3pp of Pew.

REPARATIONS

29.3pp→4.7pp

F → B

The most biased raw topic measured. Error reduces to 4.7pp after calibration.

HOW IT WORKS

Raw platform output is measured against Pew ground truth on 25 political stance items. The per-topic bias is computed and stored as a coefficient. On future simulations hitting those same stance statements, the stored bias is subtracted from the raw sentiment distribution before it is returned.
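The measurement step can be illustrated with a short sketch. The topic names and percentages are invented placeholders, not Pew or platform figures, and the function names are hypothetical; the bias is taken here as raw output minus benchmark, in percentage points.

```python
# Placeholder support levels in percent; not real Pew or platform values.
raw = {"topic_a": 62.0, "topic_b": 41.0, "topic_c": 55.0}
benchmark = {"topic_a": 40.0, "topic_b": 44.0, "topic_c": 54.0}


def bias_coefficients(raw, benchmark):
    """Per-topic bias in percentage points: raw output minus ground truth."""
    return {topic: raw[topic] - benchmark[topic] for topic in benchmark}


def mean_absolute_error(coefficients):
    """Average size of the per-topic bias, ignoring direction."""
    return sum(abs(v) for v in coefficients.values()) / len(coefficients)


coefficients = bias_coefficients(raw, benchmark)
mae = mean_absolute_error(coefficients)
```

A positive coefficient means the raw panel over-states support relative to the benchmark, and a negative one means it under-states it.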

THRESHOLD GATE

Calibration only fires when the measured bias exceeds five percentage points. Smaller biases on topics already close to ground truth — climate, tax-wealthy, affirmative action — pass through uncorrected to avoid amplifying sampling noise.

COEFFICIENT REFRESH

Coefficients are versioned in calibration_coefficients.json with timestamp, source run, and commit hash. Re-measured quarterly or after any structural stack change: model swap, persona-engine update, prompt-template revision.

EXACT-MATCH ONLY

Calibration applies strictly to exact stance-statement matches from the benchmark set. No fuzzy matching, no embeddings, no cross-topic inference. Topics outside the benchmark return raw output. The system stays predictable and debuggable.
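Applying a stored coefficient with the threshold gate and exact-match rule could look roughly like this. The stance statements and coefficient values are hypothetical, `calibrate` is an illustrative name, and clamping the result to 0–100 is an added assumption rather than something stated above.

```python
THRESHOLD_PP = 5.0  # calibration fires only above this measured bias

# Hypothetical statements and coefficients (raw minus benchmark, in pp).
COEFFICIENTS = {
    "Hypothetical stance statement A": 22.0,
    "Hypothetical stance statement B": -3.0,
}


def calibrate(statement, raw_support_pp, coefficients=COEFFICIENTS):
    """Subtract the stored bias for an exact statement match, if it is large enough."""
    bias = coefficients.get(statement)  # exact string lookup, no fuzzy matching
    if bias is None or abs(bias) <= THRESHOLD_PP:
        return raw_support_pp  # unknown topic or small bias: pass through raw
    # Assumption: keep the corrected value inside the valid percentage range.
    return min(100.0, max(0.0, raw_support_pp - bias))
```

For example, `calibrate("Hypothetical stance statement A", 62.0)` returns `40.0`, while statement B and any statement outside the table come back unchanged.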

VII · Ads Labo · synthetic ad-performance prediction

Predicting click intent
without the click.

A synthetic respondent never actually taps an ad. Asked “would you click” at LLM speed, the model deliberates, hedges, and over-selects “no” relative to a real two-second-per-post scroll. The honest output is therefore not a click rate — it is a calibrated tap-intent index, weighted and shrunk against published benchmarks using the same methodology applied to political opinion in Section VI.

JUSTER PROBABILITY SCALE

Each respondent answers tap_intent on a yes / maybe / no scale. “Maybe” responses contribute 0.3 toward the headline rate — the same fractional weighting BASES, Nielsen, and Kantar have applied to “probably will buy” responses on the Juster scale for fifty years. “Yes”-only counts (raw_ctr_strict) are surfaced separately for transparency.
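The fractional weighting can be sketched as below. The 0.3 weight for "maybe" and the names `raw_ctr` and `raw_ctr_strict` come from this document; the function name and the sample responses are illustrative.

```python
# "maybe" counts 0.3 toward the headline rate, as described above.
WEIGHTS = {"yes": 1.0, "maybe": 0.3, "no": 0.0}


def tap_intent_rates(responses):
    """Return the weighted rate and the yes-only rate for a list of answers."""
    n = len(responses)
    weighted = sum(WEIGHTS[answer] for answer in responses) / n
    strict = sum(1 for answer in responses if answer == "yes") / n
    return {"raw_ctr": weighted, "raw_ctr_strict": strict}


# Illustrative run: 4 yes, 10 maybe, 186 no out of 200 respondents.
responses = ["yes"] * 4 + ["maybe"] * 10 + ["no"] * 186
rates = tap_intent_rates(responses)
```

In this example the weighted rate is about 3.5% (4 + 10 × 0.3 = 7 out of 200) against a strict rate of 2%.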

BAYESIAN SHRINKAGE

The weighted rate is shrunk toward Wordstream’s 2024 vertical-and-platform priors with a pseudocount of k=80. A 200-respondent run is therefore 71% observed, 29% prior; a 1,000-respondent run is 93% observed. Synthetic noise cannot drift the headline more than ∼4× the prior in either direction.
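The shrinkage arithmetic follows from the pseudocount: the observed rate gets weight n / (n + k) and the prior gets the rest. A minimal sketch, with k=80 taken from the text and the prior rate a made-up placeholder rather than a real Wordstream figure:

```python
def observed_weight(n, k=80):
    """Share of the headline that comes from the observed rate."""
    return n / (n + k)


def shrink(observed_rate, n, prior_rate, k=80):
    """Blend the observed rate with the prior using pseudocount k."""
    w = observed_weight(n, k)
    return w * observed_rate + (1 - w) * prior_rate


# Placeholder prior of 1%; the real priors vary by vertical and platform.
headline = shrink(observed_rate=0.035, n=200, prior_rate=0.01)
```

Here `observed_weight(200)` is 200/280 ≈ 0.714 and `observed_weight(1000)` is 1000/1080 ≈ 0.926, so small runs lean noticeably on the prior while large runs are dominated by what respondents actually said.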

ANTI-SYCOPHANCY LAYER

Respondents are instructed to use the full yes / maybe / no scale honestly, to penalize unfamiliar brands by default, and to factor in BRAND CONTEXT when their persona has prior history with the advertiser. Real users ignore most ads; the prompt resists the LLM’s default helpfulness bias.

NARRATIVE INTENT READ

The vision pre-read separates literal events from rhetorical move — satire, before/after, problem-solution, founder pitch. Without this, satirical ads (e.g. an ad mocking intrusive ad behavior) are read literally and graded as product flaws. Each respondent sees both “what the ad shows” and “what the ad is doing.”

HONEST DISCLOSURE

Ads Labo is calibrated for rank-order accuracy, not absolute prediction.

An ad scoring 6% predicted CTR will, in our experience, outperform an ad scoring 2% in real-world A/B testing on the same audience. The absolute numbers (1.5% versus 0.5%, or 4% versus 1.2%) are calibrated against Wordstream priors but not yet validated against in-flight Meta Ads Manager output. The caveat that applies to the Pew benchmark in Section VI applies here too: directionally reliable, with absolute values requiring further validation work. Both raw_ctr (weighted) and raw_ctr_strict (yes-only) are surfaced on every run so users can audit the shrinkage themselves.

VIII · Technical stack

Six layers,
no vendor lock-in.

Every layer is modular and replaceable. No single-provider dependency. No single point of failure across observation, reasoning, or storage.

FRONTEND

Server-side React · global edge network

Real-time particle visualization, streaming results, and a low-latency UI delivered close to the user.

BACKEND

Python · async-first study orchestrator

Persona generation, study dispatch, real-time stream orchestration, and file-upload handling.

PERSONA ENGINE

Proprietary trait + behavior synthesis

Produces psychologically valid respondents from demographic presets and published trait distributions.

MODEL ROUTING

Multi-provider frontier-model ensemble

A rotating ensemble of frontier reasoning models prevents single-model monoculture in respondent output.

DATABASE

Managed Postgres · auth · hot caches

Persistent archetype population, study history, authentication, cost tracking, and request caching.

MULTIMODAL

Native video + image inference

Native video observation and image-based product identification for Screening Room studies.

IX · Positioning

What CrowdOS
is not.

Not a survey platform.

Form-builders require human respondents — which imposes recruitment time, panel cost, and sample bias. The respondent population here is pre-assembled and pre-contextualized. Fielding is measured in seconds, not weeks.

Not a general-purpose chatbot.

A general LLM produces one voice optimized for helpfulness. Research audiences are nothing like that. The platform produces the distribution — skeptical, cynical, enthusiastic, confused, disengaged — at 20 to 1,000 respondents per study.

Not a qualitative-panel SaaS.

Qualitative-panel tools schedule, recruit, and transcribe human sessions. They require weeks of runway and panel budgets in the thousands. The platform delivers a comparable structured output against a synthetic panel in minutes, at a fraction of the cost.

X · Operating metrics

Platform metrics,
stated plainly.

1,000

RESPONDENTS PER STUDY

From a minimum of 20

COUNTRIES COVERED

Across 6 continents

LLM PROVIDERS

Round-robin model routing

91.9%

PEW PARITY

MAE 4.07pp · Pearson r 0.981

1,200+

TESTS PASSING

Pytest unit + integration

$0.20

FLOOR COST PER STUDY

Cost-tracked end-to-end

XI · Summary

Every question deserves
a thousand voices.

Stop asking five people. Stop waiting six weeks. Stop pretending a single AI chatbot represents the world. Ask the crowd.
