How Assay scores market microstructure integrity

Market microstructure integrity scoring. Ten observable metrics across four dimensions, computed daily from public market data. No exchange cooperation required, no self-reported figures used as inputs, no proprietary feeds.

Summary

Assay v1 measures observable market microstructure characteristics across spot markets. Scores reflect the consistency of observed trading behaviour with patterns associated with genuine market activity. They do not constitute a determination of fraud, manipulation, or intent. Scores are regime-sensitive: a high-volatility day and a low-volatility day will produce different microstructure readings for the same venue.

Each day, public trade data and order book snapshots from ten venues are ingested, ten metrics across four dimensions are computed, and a composite score and band assignment per exchange are published. Scores refresh every 24 hours.

The design is motivated by a single constraint: participants choosing an exchange venue commit six-to-seven-figure sums on the strength of data the venue reports about itself. Existing alternatives are either self-reported (CoinMarketCap, CoinGecko), opaque (CER.live), or gated behind enterprise pricing (Kaiko). Assay fills the gap with an audit-grade, methodology-transparent layer priced for the people making those decisions.

This document describes what is measured (in full) and how scores are computed (in full). Specific threshold calibration parameters are held proprietary for the reasons set out in § What stays proprietary.

At a glance

Exchanges covered: 10 global spot venues
Trading pairs: BTC/USDT, ETH/USDT (venue-native equivalents where applicable)
Data source: Public REST and WebSocket APIs only
Update frequency: Daily, approximately 23:30 UTC
Window: Trailing 24 hours (00:00-24:00 UTC)
Minimum history: 60 days of peer baseline calibration before publication

The four dimensions

Each dimension answers one question a listing lead needs to answer about a venue. Together, the four dimensions cover the failure modes documented in the academic and industry literature on exchange quality.

D1 Volume authenticity
Does the reported trading volume follow patterns consistent with organic market activity?
Metrics: M01, M02, M03 (3 metrics, weight 30%)

D2 Order book quality
Is the displayed liquidity reflected in observable trading at size?
Metrics: M04, M05, M06 (3 metrics, weight 25%)

D3 Price formation integrity
Do observed trades produce price dynamics consistent with informed market participation?
Metrics: M07, M08 (2 metrics, weight 25%)

D4 Cross-venue consistency
Do this venue's aggregate characteristics align with its observed position in the peer basket?
Metrics: M10, M11 (2 metrics, weight 20%)

Weights reflect each dimension's relative importance to the buyer use case; they are intent weights and are not applied in the composite formula (see Composite scoring). Volume authenticity carries the highest weight because wash trading is the single signal most likely to mislead a listing decision, and the metrics in this dimension are the hardest for an exchange to game.

The ten metrics

Each metric is specified below with its question, rationale, inputs, score mapping, and cost-to-replicate analysis. Metrics are computed independently for each (exchange, pair, day) combination. Where a metric is not applicable to a venue, it is marked as not applicable and excluded from dimension aggregation.

M01 Trade size distribution (Benford-adjusted)

Question

Does the distribution of trade sizes match what organic trading produces?

Rationale

Organic trading produces trade sizes that follow a power-law-like distribution with a characteristic tail. Automated or bot-driven volume tends to produce concentrations at round numbers (0.1 BTC, 1.0 BTC, 10.0 BTC) or unusually uniform size distributions. A leading-digit test against Benford's Law captures both failure modes without assuming a "normal" venue size profile.

What this catches and what it does not. M01 is sensitive to unsophisticated trade-size manipulation: round-number bias, uniform-size bot output, and arithmetically-generated streams that ignore the natural leading-digit distribution. Conformance with Benford's Law on its own does not rule out sophisticated wash trading. A counterparty-aware system that constructs synthetic trades with properly power-distributed sizes can pass this metric while still being artificial. M01 is one of three components of the volume-authenticity dimension and is read alongside M02 and M03; agreement across all three is what establishes a high dimension score.

M01 is treated as a heuristic signal, not a deterministic test. It is sensitive to venue-specific trading patterns and is weighted accordingly.

Inputs

Data source: Public trades, 24h rolling window
Field: Trade size in base asset
Outlier treatment: Winsorised at 99.9th percentile
Minimum sample: 1,000 trades; below this the metric is not scored for that day
Statistic: Pearson chi-squared against Benford expected digit frequencies, normalised by sample size: χ²/N. M01 applies multi-digit distributional testing: both the leading digit (Benford's first-digit law, digits 1-9) and the second leading digit (Benford's second-digit law, digits 0-9) are evaluated independently and combined into the headline score. Normalisation makes the statistic sample-size invariant; raw chi-squared scales linearly with N and would otherwise rank exchanges by trading volume rather than distribution shape.

Score mapping

Piecewise linear in χ²/N: lower per-trade divergence maps to a higher score. Anchors:

χ²/N → Score
0.00 → 100
0.05 → 80
0.15 → 50
0.35 → 20
0.65 → 0
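The first-digit half of the statistic and the published anchor curve can be sketched as follows. This is a minimal illustration, not the production implementation: it covers only the leading-digit test (how the second-digit test is combined into the headline score is not specified here), the helper names are hypothetical, and clamping outside the anchor range is an assumption.

```python
import math
from collections import Counter

# Benford's first-digit law: P(d) = log10(1 + 1/d) for d = 1..9.
BENFORD_FIRST = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Published M01 anchors as (chi2/N, score) pairs.
M01_ANCHORS = [(0.00, 100.0), (0.05, 80.0), (0.15, 50.0), (0.35, 20.0), (0.65, 0.0)]


def leading_digit(x: float) -> int:
    """First significant digit of a positive number, read from scientific notation."""
    return int(f"{x:.15e}"[0])


def chi2_per_trade(sizes) -> float:
    """Pearson chi-squared against Benford first-digit frequencies, divided by N."""
    counts = Counter(leading_digit(s) for s in sizes if s > 0)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no positive trade sizes")
    chi2 = sum((counts.get(d, 0) - n * p) ** 2 / (n * p)
               for d, p in BENFORD_FIRST.items())
    return chi2 / n


def piecewise_linear(x: float, anchors) -> float:
    """Linear interpolation between anchors; clamped outside the range (assumption)."""
    if x <= anchors[0][0]:
        return anchors[0][1]
    for (x0, y0), (x1, y1) in zip(anchors, anchors[1:]):
        if x <= x1:
            return y0 + (x - x0) * (y1 - y0) / (x1 - x0)
    return anchors[-1][1]


round_sizes = [1.0] * 2000                             # every trade is exactly 1.0 BTC
log_uniform = [10 ** (i / 2000) for i in range(2000)]  # sizes spread evenly in log space

print(piecewise_linear(chi2_per_trade(round_sizes), M01_ANCHORS))
print(piecewise_linear(chi2_per_trade(log_uniform), M01_ANCHORS))
```

A stream of identical round-number sizes lands far beyond the last anchor and scores 0, while sizes spread evenly in log space track the Benford frequencies closely and score near 100.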

Cost to replicate

Matching this distribution requires generating synthetic trades with properly power-distributed sizes, a non-trivial engineering requirement. Simple automation with uniform or round sizes produces detectable leading-digit patterns.

M02 Volume-to-order-book-depth ratio

Question

Does the reported 24-hour volume plausibly pass through the observable order book?

Rationale

An exchange reporting $1B daily BTC/USDT volume with a $50k order book has an implied book turnover rate that is physically unusual. Reference venues show a ratio of 24h volume to mean bid+ask depth within ±2% of mid that sits in a consistent empirical range. Values outside this range fall into two distinct regimes with different operational meanings. High ratios suggest reported volume may exceed what the visible order book could plausibly absorb, a potential integrity signal. Low ratios indicate deep liquidity relative to reported flow, a structural pattern common in market-making-dominant venues, not a quality failure in itself. Current v1 scoring penalises both regimes; a forthcoming methodology revision will reflect the asymmetry directly in the score.

Inputs

Volume: 24h USD notional, computed from own trade data; exchange-reported ticker is not used
Depth: Time-averaged ±2% book depth across 1,440 one-minute snapshots
Coverage threshold: The metric is not scored for that day below 50% snapshot coverage

Score mapping

Non-monotonic in the raw ratio R: both unusually low and unusually high R produce lower scores in v1, with the typical zone in the middle of the empirical peer-basket range. The two penalised regimes have distinct interpretations (see Rationale): high-R is the volume-integrity concern, low-R is a structural deep-book pattern. The score itself is direction-symmetric in v1; commissioned reports interpret direction narratively. M10 has the same non-monotonic shape and the same caveat.
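As a rough sketch of the raw ratio R and its coverage gate, assuming depth snapshots arrive as a list of USD values (the function and constant names are hypothetical, and the score curve itself is not reproduced because its anchors are not published):

```python
from statistics import fmean

EXPECTED_SNAPSHOTS = 1440  # one-minute snapshots over 24 hours
MIN_COVERAGE = 0.5         # below this share of snapshots the metric is not scored


def volume_to_depth_ratio(volume_usd, depth_snapshots_usd):
    """Return R = 24h volume / time-averaged +-2% depth, or None when coverage is too low."""
    if len(depth_snapshots_usd) < MIN_COVERAGE * EXPECTED_SNAPSHOTS:
        return None
    return volume_usd / fmean(depth_snapshots_usd)


# The example from the Rationale: $1B of daily volume against a $50k book.
print(volume_to_depth_ratio(1e9, [5e4] * 1440))
```

For the $1B volume and $50k book in the Rationale, R is 20,000: the visible book would have to turn over twenty thousand times in a day.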

Cost to replicate

The two failure modes have asymmetric replication costs. Reaching the typical mid-range from the high-R side requires reducing reported volume to what the book can actually absorb, the more difficult adjustment for venues whose volume figures are inflated, since real volume is hard to manufacture and reported volume cannot be cut without admitting prior overstatement. Reaching it from the low-R side is operationally simpler (reduce displayed depth, or attract more flow) but commercially undesirable for venues whose model is market-making-dominant deep-book provision. Deep books themselves carry a direct cost: market-maker incentives, inventory risk, and capital that could otherwise earn yield.

M03 Trade interval entropy

Question

Are trades arriving at times consistent with Poisson-like arrival processes?

Rationale

Organic trading produces trade arrival times that approximately follow a Poisson process or, when activity clusters, a Hawkes process: bursty, self-exciting, with heavy-tailed inter-arrival times. Bot-driven flow often produces either suspiciously regular intervals (exact 1-second spacings, periodic patterns) or unnaturally uniform distributions. M03 evaluates two complementary timing signals: the Shannon entropy of the inter-arrival distribution (spread across time scales) and the Pearson autocorrelation at lag-1 of the inter-arrival sequence (whether consecutive intervals are similar to each other). Entropy is also evaluated separately at multiple time-scale resolutions (intra-burst, tick-level, and order-level) so that timing structure visible only within a narrow scale (a 250ms cadence buried inside an otherwise diverse stream, for example) is not averaged away by the full-scale view. Together these signals catch both single-scale clustering and consecutive-interval regularity.

Inputs

Data source: Trade timestamps for the target pair, 24h window
Bucketing: Logarithmic: 0-100ms, 100ms-1s, 1s-10s, 10s-100s, 100s+ (full-scale entropy signal); per-scale sub-bucketing within 0-100ms, 0-1s, and 0-10s ranges (multi-resolution entropy)
Minimum sample: 5,000 trades; below this the metric is not scored for that day

Score mapping

Monotonic in Shannon entropy of the inter-arrival distribution: higher entropy maps to a higher score. Maximum possible entropy is log2(5) ≈ 2.32 bits (trades distributed perfectly uniformly across the five buckets); minimum is 0 (every trade in a single bucket, characteristic of algorithmic regularity or single-frequency wash patterns). Anchor points: entropy of 1.0 maps to score 50, entropy of 1.5 maps to 80, entropy near the maximum maps to 100. The same anchor table is applied to the per-scale sub-entropies and to the full-scale entropy; the four entropy scores are blended into a single entropy signal. The lag-1 autocorrelation of the inter-arrival sequence is mapped on a separate scale: autocorrelation near zero or negative scores high (organic irregular flow), high positive autocorrelation scores low (regular bot spacing). The blended entropy signal and the autocorrelation sub-score are combined into the headline score.
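The two raw timing signals can be illustrated with a short sketch. It covers only the full-scale entropy over the five logarithmic buckets and the lag-1 autocorrelation; the per-scale sub-entropies, the blending, and the score curves are left out because their parameters are not published. Function names and the treatment of values that fall exactly on a bucket edge are assumptions.

```python
import math
from bisect import bisect_right
from collections import Counter

# Upper edges in seconds of the first four buckets; the fifth bucket is 100s+.
BUCKET_EDGES = [0.1, 1.0, 10.0, 100.0]


def interarrival_entropy(gaps):
    """Shannon entropy in bits of inter-arrival gaps over the five log buckets."""
    counts = Counter(bisect_right(BUCKET_EDGES, g) for g in gaps)
    n = len(gaps)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def lag1_autocorrelation(gaps):
    """Pearson correlation between consecutive gaps; None when it is undefined."""
    x, y = gaps[:-1], gaps[1:]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return None  # constant spacing: correlation is not defined
    return cov / math.sqrt(vx * vy)


metronome = [1.0] * 99                      # exact 1-second spacing
spread_out = [0.05, 0.5, 5.0, 50.0, 500.0]  # one gap in each bucket

print(interarrival_entropy(metronome))
print(interarrival_entropy(spread_out))
```

Exact 1-second spacing puts every gap in one bucket and yields zero entropy; one gap per bucket reaches the log2(5) maximum.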

Cost to replicate

Matching a Poisson-like arrival distribution requires sophisticated bot design that injects timing variability deliberately. Simpler automation tends to produce detectable regularity at the millisecond or second scale.

M04 Effective spread

Question

What does it actually cost to trade?

Rationale

Quoted spread (best ask − best bid) can be very narrow while the venue is quiet. Effective spread, the realised cost of market orders relative to mid-price, measures actual execution cost against actual trades.

Inputs

Mid-price reference: 1-minute order book snapshots
Trade direction: Buy/sell from feed where provided; Lee-Ready inferred otherwise
Aggregation: Volume-weighted average over 24h, in basis points
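A minimal sketch of the aggregation, assuming the conventional effective-spread definition 2 x side x (price - mid) / mid and weighting by base-asset size. Neither the factor of two nor the weighting basis is specified in this document, so both are assumptions, as are the function name and tuple layout.

```python
def effective_spread_bps(trades):
    """Volume-weighted effective spread in basis points.

    trades: iterable of (price, size, side, mid), where side is +1 for a
    buyer-initiated trade and -1 for a seller-initiated trade, and mid is the
    reference mid-price from the nearest order book snapshot.
    """
    weighted = 0.0
    total_size = 0.0
    for price, size, side, mid in trades:
        spread_bps = 2 * side * (price - mid) / mid * 1e4
        weighted += spread_bps * size
        total_size += size
    return weighted / total_size if total_size else None


trades = [
    (100.05, 1.0, +1, 100.0),  # buy 5 bps above mid: 10 bps effective spread
    (99.90, 3.0, -1, 100.0),   # sell 10 bps below mid: 20 bps effective spread
]
print(effective_spread_bps(trades))
```

The larger sell carries three times the weight of the buy, so the volume-weighted result sits closer to 20 bps than to 10 bps.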

v2 roadmap: a 7-day rolling fallback for days with sparse order-book snapshot coverage. v1 uses the 24h window unconditionally and gates to insufficient_data when fewer than 50% of expected 1-minute snapshots are available; trade count itself is reported but does not gate the metric, so quiet trade days with adequate snapshot coverage still produce a score.

Score mapping

Monotonic in effective spread: lower bps maps to a higher score. BTC/USDT benchmarks: under 5 bps is excellent, 5-15 bps typical, above 50 bps atypical. Exact thresholds are anchored to the live peer distribution and recalibrated periodically.

Cost to replicate

Effective spread reflects actual execution cost. A low effective spread without lowering fees or subsidising market makers is not easily replicated; both substitutes carry direct costs.

M05 Order book slope (Kyle's λ proxy)

Question

How much does the book absorb? What is the price impact of size?

Rationale

A deep book has a gradual slope: a $1M sell order moves price incrementally. A book with thin layers beyond the inside quote shows a steep slope at size. The relationship between executed size and price impact is a proxy for Kyle's lambda, computable from public order book data alone.

Inputs

Snapshots: 1-minute frequency, full depth within ±5% of mid
Standard sizes: $10k, $100k, $1M notional on each side
Outlier treatment: Sizes winsorised at 99.9th percentile

Score mapping

Monotonic in λ expressed as slippage-bps for a standard $100k trade: lower λ maps to a higher score. Typical range 10-50 bps; values above 200 bps are atypical.
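One way to read "slippage-bps for a standard trade" is to walk a book snapshot until the notional is filled and compare the resulting VWAP with mid. The sketch below does that for a single snapshot; the function name, the level format, and returning None when visible depth runs out are assumptions.

```python
def slippage_bps(levels, mid, notional_usd, side="buy"):
    """VWAP slippage versus mid, in bps, for a market order of notional_usd.

    levels: [(price, size_in_base), ...] ordered best-first (asks for a buy,
    bids for a sell). Returns None if the visible book cannot absorb the order.
    """
    remaining = notional_usd
    filled_base = 0.0
    for price, size in levels:
        take_usd = min(remaining, price * size)
        filled_base += take_usd / price
        remaining -= take_usd
        if remaining <= 0:
            break
    if remaining > 0:
        return None
    vwap = notional_usd / filled_base
    sign = 1 if side == "buy" else -1
    return sign * (vwap - mid) / mid * 1e4


asks = [(100.5, 400.0), (101.5, 1000.0)]
print(slippage_bps(asks, mid=100.0, notional_usd=80_800))
```

In this toy book the order consumes the whole first level and part of the second, so the slippage reflects the thinner pricing beyond the inside quote.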

Cost to replicate

Matching this requires maintaining real depth across multiple price levels. The capital cost scales with the square of the depth advertised.

M06 Quote stability and update rate

Question

Are quotes being maintained, or is the book updating at rates disproportionate to actual trading?

Rationale

Quote update rates disproportionate to trade rates can indicate rapid cancel-and-replace cycles that make displayed depth difficult to hit in practice. High update-to-trade ratios are associated with low effective fill rates in several academic studies of exchange microstructure.

Inputs

Order book stream: WebSocket add/cancel/modify events from the continuous WS consumer
Quote-update ratio: book_events / trades over the 24h window
Coverage threshold: If WebSocket sequence gaps cover more than 50% of the day, the metric is not scored for that day

Score mapping

Non-monotonic in the quote-update ratio: both very low and very high values map to a lower score. A near-zero ratio indicates a sleepy book with insufficient quoting activity; a very high ratio (thousands of book events per trade) indicates rapid cancel-and-replace cycles characteristic of quote-stuffing patterns. Healthy market-making sits in a moderate plateau (roughly 50-1000 events per trade) which maps to the highest scores.

Cost to replicate

Sustained moderate quote activity requires real market-maker capital and inventory tolerance. Neither extreme (too quiet, or churn-without-substance) is easily produced as a substitute.

M06 is currently pending multi-venue WebSocket book event collection. Until at least three venues stream the input, M06 reports as not scored for the day rather than being scored against an under-sampled peer distribution. It will populate automatically once sufficient peer data exists.

M07 Cross-venue price deviation

Question

Does this venue's price track the global market?

Rationale

A venue's mid-price for BTC/USDT should track the arithmetic mean of the peer basket's mid-prices within a tight tolerance, given arbitrage incentives. Persistent deviations may reflect lower arbitrage activity, stale feeds, or isolated price formation on that venue.

Inputs

Sampling: Mid-price every 60 seconds, target exchange and reference basket
Reference basket: Binance, Coinbase, Kraken, OKX (arithmetic mean), excluding target if a basket member
Cross-rate: USD/USDT translated via Kraken USD VWAP / Binance USDT VWAP at each timestamp

Score mapping

Monotonic in mean absolute deviation from the reference basket: lower MAD maps to a higher score. Typical: MAD under 5 bps and tail under 30 bps. Atypical: MAD above 50 bps, or persistent autocorrelation above 0.5.
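The mean absolute deviation against the basket can be sketched as below, assuming all mid-prices have already been translated to a common quote currency and sampled at the same timestamps. The function name and data layout are hypothetical; tail and autocorrelation checks are not shown.

```python
def mad_bps(target_mids, basket_mids, target_name=None):
    """Mean absolute deviation of the target mid from the basket mean, in bps.

    basket_mids: {venue: [mid, ...]} aligned with target_mids. The target is
    dropped from the basket when it is itself a basket member.
    """
    peers = [mids for venue, mids in basket_mids.items() if venue != target_name]
    deviations = []
    for i, target_mid in enumerate(target_mids):
        reference = sum(p[i] for p in peers) / len(peers)  # arithmetic mean
        deviations.append(abs(target_mid - reference) / reference * 1e4)
    return sum(deviations) / len(deviations)


basket = {
    "Binance": [100.0, 100.0],
    "Coinbase": [100.0, 100.0],
    "Kraken": [100.0, 100.0],
    "OKX": [100.0, 100.0],
}
print(mad_bps([100.1, 99.9], basket))
```

A venue quoting 10 bps above the basket in one minute and 10 bps below in the next has a signed mean deviation of zero but a MAD of about 10 bps, which is why the absolute value is used.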

Cost to replicate

Keeping prices aligned with the global market requires arbitrage linkage. Without underlying liquidity to support two-way arbitrage, alignment is difficult to maintain.

v2 roadmap: replace the arithmetic mean with a volume-weighted reference once aggregated multi-venue volume data is sourced consistently across the peer basket.

M08 Mid-price reversion dynamics

Question

After large trades, does price revert in a way consistent with real market impact?

Rationale

In real markets, informed trades produce permanent impact and uninformed trades produce temporary impact that reverts. If trades at a venue show near-complete reversion within seconds regardless of size, this is inconsistent with the mix of informed and uninformed flow observed at reference venues.

Inputs

Trade filter: Prints above approximately $50k notional
Mid-price series: 1-minute resolution, around each large-trade event
Reversion measure: Fraction of large trades whose 60-second reversion ≥ 0.8 (substantially reverted to pre-trade mid)
Minimum sample: 50 large trades per day; else 7-day rolling fallback

Score mapping

Monotonic-decreasing in the substantially-reverted fraction: lower fraction maps to a higher score. Anchors: 20% reversion maps to 90, 30% maps to 75, 50% maps to 30, 70% maps to 10, and 100% reversion (every large trade fully retraced) maps to 0; at that level the venue's flow carries no real impact. Reference venues typically sit in the 20-30% band (scoring in the 75-90 range); above 70% is atypical and inconsistent with the mix of informed flow observed at reference venues.
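The published anchors can be applied as in the sketch below. The reversion formula (share of the immediate post-trade move that has been retraced 60 seconds later) is an assumption about how "60-second reversion" is defined, as is clamping the score at 90 below the 20% anchor; function names are hypothetical.

```python
# Published M08 anchors as (substantially-reverted fraction, score) pairs.
M08_ANCHORS = [(0.20, 90.0), (0.30, 75.0), (0.50, 30.0), (0.70, 10.0), (1.00, 0.0)]


def reversion(pre_mid, impact_mid, later_mid):
    """Share of the immediate move retraced 60 seconds later (assumed definition)."""
    move = impact_mid - pre_mid
    if move == 0:
        return None  # no measurable impact to revert
    return (impact_mid - later_mid) / move


def substantially_reverted_fraction(events, threshold=0.8):
    """Fraction of large trades whose reversion is at or above the threshold."""
    values = [r for r in (reversion(*e) for e in events) if r is not None]
    return sum(r >= threshold for r in values) / len(values)


def m08_score(fraction):
    """Piecewise-linear score from the anchors; clamped outside them (assumption)."""
    if fraction <= M08_ANCHORS[0][0]:
        return M08_ANCHORS[0][1]
    for (x0, y0), (x1, y1) in zip(M08_ANCHORS, M08_ANCHORS[1:]):
        if fraction <= x1:
            return y0 + (fraction - x0) * (y1 - y0) / (x1 - x0)
    return M08_ANCHORS[-1][1]


events = [  # (pre-trade mid, mid just after the trade, mid 60 seconds later)
    (100.0, 101.0, 100.1),  # buy, 90% retraced
    (100.0, 101.0, 100.9),  # buy, 10% retraced
    (100.0, 99.0, 99.95),   # sell, 95% retraced
    (100.0, 99.0, 99.2),    # sell, 20% retraced
]
print(m08_score(substantially_reverted_fraction(events)))
```

Two of the four example trades substantially revert, giving a fraction of 50% and a score at the 30 anchor.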

Cost to replicate

Matching this pattern requires trades that carry genuine price impact, which means actually moving the book and taking position risk.

M10 Volume share vs liquidity share

Question

Does this exchange's share of global volume match its share of global liquidity?

Rationale

A venue's share of global volume and its share of global depth should be of similar order. If a venue shows 15% of global BTC/USDT volume but 2% of global order book depth at ±2%, the ratio is unusual relative to peers.

Inputs

Volume: 24h notional, computed from own trade data; exchange ticker is not used
Reference volume: Aggregate 24h across the peer basket, excluding target if a basket member
Depth: Time-averaged ±2% depth for target and peers

Score mapping

Non-monotonic in R = vol_share / depth_share: both unusually low and unusually high R are atypical. Typical: R in [0.7, 1.5]. Atypical: outside [0.3, 5] in either direction. M10 is one of three non-monotonic metrics in the spec, alongside M02 and M06.
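The stated zones can be expressed as a small classifier. This is only a sketch of the qualitative ranges given above, not of the production score curve; the function name and the "intermediate" label for values between the typical and atypical zones are assumptions.

```python
def classify_share_ratio(vol_share, depth_share):
    """Return (R, zone) for R = vol_share / depth_share using the stated ranges."""
    r = vol_share / depth_share
    if 0.7 <= r <= 1.5:
        zone = "typical"
    elif r < 0.3 or r > 5:
        zone = "atypical"
    else:
        zone = "intermediate"  # hypothetical label for the unnamed middle ground
    return r, zone


# The example from the Rationale: 15% of volume against 2% of depth.
print(classify_share_ratio(0.15, 0.02))
```

The Rationale's example gives R of about 7.5, beyond the upper bound of 5, so it falls in the atypical zone on the high side.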

Cost to replicate

This is the measurement hardest to match without the underlying fundamentals. Matching both ends requires real volume and real depth simultaneously.

M11 Price leadership and contribution to price discovery

Question

Does this exchange lead global price discovery or lag it?

Rationale

Venues with informed flow tend to lead price moves: discovery happens there first and peers follow shortly after. The v1 implementation correlates the venue's per-minute returns against the peer basket's per-minute returns at seven forward lead times (1, 2, 5, 10, 15, 30, 60 minutes), i.e. the target's return at minute t versus the peer return at minute t + k. The metric records the maximum correlation across those seven leads, plus the lead at which the maximum occurred. A high maximum correlation means the venue's moves predict subsequent peer moves; a low maximum means the venue moves independently of, or lags, the basket.

Inputs

Mid-price series: 1-minute resolution, target + peer basket (excluding target if a member), 24h window
Lead lags tested: 1, 2, 5, 10, 15, 30, 60 minutes (forward only)
Aggregate: Maximum Pearson correlation across the seven leads; the winning lag is surfaced as the metric's diagnostic raw value

Score mapping

Monotonic in the maximum forward-lead correlation: higher correlation maps to a higher score. Anchor points (provisional, calibrated against the empirical distribution observed at v1 launch): correlation of 0.02 maps to score 50, 0.04 to 80, 0.08 or higher to 100. Anchors are subject to revision after 30 days as the cohort distribution sharpens.
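The maximum forward-lead correlation can be sketched as follows, assuming per-minute return series of equal length for the target and the peer basket. Function names are hypothetical, the synthetic data is for illustration only, and the score mapping is omitted because the anchors are provisional.

```python
import math
import random

LEADS = (1, 2, 5, 10, 15, 30, 60)  # forward leads in minutes


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; None if undefined."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return None
    return cov / math.sqrt(vx * vy)


def max_lead_correlation(target_returns, peer_returns, leads=LEADS):
    """Return (max correlation, winning lead) of target[t] against peer[t + k]."""
    best = None
    for k in leads:
        r = pearson(target_returns[:-k], peer_returns[k:])
        if r is not None and (best is None or r > best[0]):
            best = (r, k)
    return best


# Synthetic day: the peer basket repeats the target's returns five minutes later.
rng = random.Random(0)
target = [rng.gauss(0, 1) for _ in range(1440)]
peer = [0.0] * 5 + target[:-5]

corr, lead = max_lead_correlation(target, peer)
print(round(corr, 6), lead)
```

Because the synthetic peer series is an exact five-minute echo of the target, the search recovers a lead of 5 with correlation of essentially 1; real venues sit at far lower correlations, as the provisional anchors indicate.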

Cost to replicate

Matching this requires attracting informed traders, which compounds over time and is difficult to short-circuit.

v2 roadmap: replace the forward-lead correlation proxy with a Hasbrouck information-share computation. The v1 max-lead-correlation is a directionally-aligned proxy for the same construct (which venue contributes most to the cointegrating price level), but the Hasbrouck decomposition gives a quantitative attribution across the basket rather than a single-number per-venue readout.

M11 in v1 uses a lead-lag correlation proxy. This is an approximation of price leadership, not a full information share decomposition (e.g. Hasbrouck IS). A more rigorous implementation is planned for v2.

Composite scoring

Code | Dimension | Metrics | Intent weight
D1 | Volume authenticity | M01, M02, M03 | 30%
D2 | Order book quality | M04, M05, M06 | 25%
D3 | Price formation integrity | M07, M08 | 25%
D4 | Cross-venue consistency | M10, M11 | 20%

Intent weights describe the relative importance of each dimension to the buyer use case. They are not applied in the composite formula; the composite uses an unweighted geometric mean across all four dimensions (see Step 3). Equal weight in the formula is a deliberate v1 choice so that no dimension can be neglected.

Step 1: Metric normalisation

Each raw metric value is converted to a 0-100 score using piecewise linear mappings defined in a versioned mapping file. Higher score means higher quality. Every score row is stamped with the mapping_version active at computation time, so historical scores remain interpretable after methodology updates.

Step 2: Dimension scores

The dimension score is the arithmetic mean across that dimension's metrics that returned status='ok'.

Step 3: Composite score

The composite is the unweighted geometric mean across the four dimensions: (D1·D2·D3·D4)^(1/4). A composite is only published when every dimension has at least one scoring metric. The geometric mean is used rather than a weighted arithmetic average so that a single weak dimension cannot be compensated for by strong scores in the others. A zero in any dimension collapses the composite to zero.

Step 4: Band assignment

Band 1: ≥ 85
Band 2: 75 - 85
Band 3: 60 - 75
Band 4: 45 - 60
Band 5: < 45
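Steps 2 to 4 can be sketched end to end as below. Function names are hypothetical, and treating each band's lower bound as inclusive is an inference from "≥ 85" and "< 45" rather than a stated rule.

```python
import math


def dimension_score(metrics):
    """Arithmetic mean of metric scores with status 'ok'; None if there are none.

    metrics: list of (status, score) pairs for one dimension.
    """
    ok = [score for status, score in metrics if status == "ok"]
    return sum(ok) / len(ok) if ok else None


def composite_score(dimension_scores):
    """Unweighted geometric mean of the four dimension scores.

    Returns None when any dimension has no scoring metric.
    """
    if len(dimension_scores) != 4 or any(d is None for d in dimension_scores):
        return None
    return math.prod(dimension_scores) ** 0.25


def band(score):
    """Band 1 (best) to 5, with lower bounds treated as inclusive (assumption)."""
    for number, floor in ((1, 85), (2, 75), (3, 60), (4, 45)):
        if score >= floor:
            return number
    return 5


d1 = dimension_score([("ok", 80.0), ("insufficient_data", 0.0), ("ok", 60.0)])
balanced = composite_score([d1, 70.0, 70.0, 70.0])
lopsided = composite_score([100.0, 100.0, 100.0, 1.0])
print(balanced, band(balanced))
print(lopsided, band(lopsided))
```

The lopsided venue shows the non-compensation property: three perfect dimensions and one near-zero dimension produce a composite of about 32, where an arithmetic mean would give about 75.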

What stays proprietary

Tiered disclosure: M01 and M08 anchors are published; anchors for other metrics are disclosed in commissioned reports. Most calibration parameters are not published in full on the public methodology page. The published exceptions are listed first.

  • M01 score-mapping anchors are public. The five-anchor piecewise-linear curve from χ²/N to score is set out in the M01 entry above. Published as a worked example of the calibration shape and as a transparency commitment for the metric most often cited in commission discussions.
  • M08 score-mapping anchors are public. The five-anchor curve from substantially-reverted fraction to score is set out in the M08 entry above (20% → 90, 30% → 75, 50% → 30, 70% → 10, 100% → 0). Published alongside M01 as the second worked-example calibration.

The remainder of the calibration set is disclosed in commissioned reports rather than on this page:

  • Score-mapping anchors for M02-M07, M10, and M11. The score ranges shown qualitatively in each metric entry are illustrative; production thresholds are calibrated from observed distributions across the peer basket.
  • Peer baseline calibration parameters for M03, M06, M08, and M11.
  • Per-metric outlier rules beyond the published 99.9% winsorisation.

Threshold calibration uses observed distributions across a reference peer basket and is recalibrated periodically. Full calibration parameters are available to data licence customers; the baseline_version and mapping_version stamped on each score row allow any customer to reproduce any historical score exactly.

Data sources

All 10 covered exchanges are ingested via their public REST and WebSocket APIs. No authentication, no paid feeds, no exchange cooperation is required at any stage.

USD-quoted venues (Coinbase, Kraken) are handled in venue-native pairs for single-venue metrics. For cross-venue comparison, USD prices are translated via the Kraken USD VWAP / Binance USDT VWAP basis computed at each timestamp.

Limitations

Assay scores are narrow by design. They describe market data integrity, not broader aspects of an exchange's operation. The following are not measured and should not be inferred:

  • Custody security or solvency
  • Regulatory standing or licensing status
  • User experience, customer support quality, or dispute handling
  • Fiat on-ramp / off-ramp quality
  • Derivatives market quality (v1 is spot-only)
  • Token-level listing quality for pairs other than BTC and ETH

A high Assay score does not imply safety. A low Assay score reflects market data characteristics outside the typical range; it does not imply intent.

Several metrics (M04 effective spread, M08 large-trade reversion, and M11 price leadership) do not currently scale score uncertainty by sample size. A thinly-traded venue's headline score for these metrics carries wider intrinsic variance than a high-volume venue's score at the same number, even though both are reported on the same 0-100 scale. This is acknowledged as a known v1 limitation; sample-size-aware uncertainty bands are tracked on the v2 backlog alongside the next round of mapping recalibration.

Version history

Version | Notes
v1.0.0 | Launch baseline. Composite scoring switched from a weighted linear average to an unweighted geometric mean across the four dimensions. Public copy reframed as market microstructure integrity scoring with the regime-sensitivity caveat.
v0.3.0 | M09 (funding rate / basis sanity) removed entirely. v1 is spot-only and M09 required a derivatives data pipeline that is out of scope; reintroduction tracked as v2 work.
v0.1.0 | Initial release. 10 exchanges, 11 metrics, 2 pairs.