How Assay scores market microstructure integrity
Market microstructure integrity scoring. Ten observable metrics across four dimensions, computed daily from public market data. No exchange cooperation required, no self-reported figures used as inputs, no proprietary feeds.
Summary
Assay v1 measures observable market microstructure characteristics across spot markets. Scores reflect the consistency of observed trading behaviour with patterns associated with genuine market activity. They do not constitute a determination of fraud, manipulation, or intent. Scores are regime-sensitive: a high-volatility day and a low-volatility day will produce different microstructure readings for the same venue.
Each day, public trade data and order book snapshots from ten venues are ingested, ten metrics across four dimensions are computed, and a composite score and band assignment per exchange are published. Scores refresh every 24 hours.
The design is motivated by a single observation: participants choosing an exchange venue commit six- to seven-figure sums on the basis of data the venue reports about itself. Existing alternatives either rely on self-reported data (CoinMarketCap, CoinGecko), are opaque (CER.live), or are gated behind enterprise pricing (Kaiko). Assay fills the gap with an audit-grade, methodology-transparent layer priced for the people making those decisions.
This document describes what is measured (in full) and how scores are computed (in full). Specific threshold calibration parameters are held proprietary for the reasons set out in § What stays proprietary.
At a glance
| Exchanges covered | 10 global spot venues |
|---|---|
| Trading pairs | BTC/USDT, ETH/USDT (venue-native equivalents where applicable) |
| Data source | Public REST and WebSocket APIs only |
| Update frequency | Daily, approximately 23:30 UTC |
| Window | Trailing 24 hours (00:00-24:00 UTC) |
| Minimum history | 60 days of peer baseline calibration before publication |
The four dimensions
Each dimension answers one question a listing lead needs to answer about a venue. Together, the four dimensions cover the failure modes documented in the academic and industry literature on exchange quality.
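Intent weights (listed per dimension under Composite scoring) reflect relative importance to the buyer use case: volume authenticity carries the highest weight because wash trading is the single signal most likely to mislead a listing decision, and the metrics in this dimension are the hardest for an exchange to game. The v1 composite itself is an unweighted geometric mean across the four dimensions, so these weights express design intent and do not enter the composite calculation.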
The ten metrics
Each metric is specified below with its question, rationale, inputs, score mapping, and cost-to-replicate analysis. Metrics are computed independently for each (exchange, pair, day) combination. Where a metric is not applicable to a venue, it is marked as not applicable and excluded from dimension aggregation.
M01 Trade size distribution (Benford-adjusted)
Question
Does the distribution of trade sizes match what organic trading produces?
Rationale
Organic trading produces trade sizes that follow a power-law-like distribution with a characteristic tail. Automated or bot-driven volume tends to produce concentrations at round numbers (0.1 BTC, 1.0 BTC, 10.0 BTC) or unusually uniform size distributions. A leading-digit test against Benford's Law captures both failure modes without assuming a "normal" venue size profile.
What this catches and what it does not. M01 is sensitive to unsophisticated trade-size manipulation: round-number bias, uniform-size bot output, and arithmetically-generated streams that ignore the natural leading-digit distribution. Conformance with Benford's Law on its own does not rule out sophisticated wash trading. A counterparty-aware system that constructs synthetic trades with properly power-distributed sizes can pass this metric while still being artificial. M01 is one of three components of the volume-authenticity dimension and is read alongside M02 and M03; agreement across all three is what establishes a high dimension score.
M01 is treated as a heuristic signal, not a deterministic test. It is sensitive to venue-specific trading patterns and is weighted accordingly.
Inputs
| Data source | Public trades, 24h rolling window |
|---|---|
| Field | Trade size in base asset |
| Outlier treatment | Winsorised at 99.9th percentile |
| Minimum sample | 1,000 trades; below this the metric is not scored for that day |
| Statistic | Pearson chi-squared against Benford expected digit frequencies, normalised by sample size: χ²/N. M01 applies multi-digit distributional testing: both the leading digit (Benford's first-digit law, digits 1-9) and the second leading digit (Benford's second-digit law, digits 0-9) are evaluated independently and combined into the headline score. Normalisation makes the statistic sample-size invariant; raw chi-squared scales linearly with N and would otherwise rank exchanges by trading volume rather than distribution shape. |
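The statistic can be illustrated with a minimal Python sketch. It assumes trade sizes arrive as a plain list of positive floats; the function names are hypothetical, winsorisation is omitted, and the way the first-digit and second-digit results are combined into the headline score is not shown because it is not specified here.

```python
import math
from collections import Counter

def benford_first_digit_probs():
    # Benford's first-digit law: P(d) = log10(1 + 1/d), d = 1..9
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def benford_second_digit_probs():
    # Second-digit law: P(d2) = sum over d1 = 1..9 of log10(1 + 1/(10*d1 + d2))
    return {
        d2: sum(math.log10(1 + 1 / (10 * d1 + d2)) for d1 in range(1, 10))
        for d2 in range(0, 10)
    }

def leading_digits(size):
    # First and second significant digits, read from scientific notation,
    # e.g. 12.5 -> '1.2500000000e+01' -> (1, 2)
    mantissa = f"{size:.10e}"
    return int(mantissa[0]), int(mantissa[2])

def chi2_per_trade(observed, expected_probs, n):
    # Pearson chi-squared against expected counts n*p, divided by n
    chi2 = sum(
        (observed.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in expected_probs.items()
    )
    return chi2 / n

def benford_divergence(sizes, min_sample=1000):
    # Returns (first-digit chi2/N, second-digit chi2/N), or None when the
    # sample is below the published 1,000-trade minimum.
    sizes = [s for s in sizes if s > 0]
    n = len(sizes)
    if n < min_sample:
        return None
    digits = [leading_digits(s) for s in sizes]
    first = Counter(d[0] for d in digits)
    second = Counter(d[1] for d in digits)
    return (
        chi2_per_trade(first, benford_first_digit_probs(), n),
        chi2_per_trade(second, benford_second_digit_probs(), n),
    )
```

A stream in which every trade is exactly 1.0 BTC gives a first-digit χ²/N of about 2.32, far beyond the lowest published anchor, while a log-uniform stream spanning one decade gives a value close to zero.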
Score mapping
Piecewise linear in χ²/N: lower per-trade divergence maps to a higher score. Anchors:
| χ²/N | Score |
|---|---|
| 0.00 | 100 |
| 0.05 | 80 |
| 0.15 | 50 |
| 0.35 | 20 |
| 0.65 | 0 |
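The anchor table can be read as a piecewise-linear interpolation. The sketch below is illustrative only: the function name is hypothetical, and clamping outside the anchor range (scores held at 100 below 0.00 and at 0 above 0.65) is an assumption.

```python
# Published M01 anchors: (chi2/N, score)
M01_ANCHORS = [(0.00, 100.0), (0.05, 80.0), (0.15, 50.0), (0.35, 20.0), (0.65, 0.0)]

def piecewise_linear(x, anchors):
    # Assumption: values outside the anchor range are clamped to the end scores.
    if x <= anchors[0][0]:
        return anchors[0][1]
    if x >= anchors[-1][0]:
        return anchors[-1][1]
    # Linear interpolation between the two anchors that bracket x.
    for (x0, y0), (x1, y1) in zip(anchors, anchors[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
```

Under these assumptions a χ²/N of 0.10 falls halfway between the 0.05 and 0.15 anchors and maps to a score of 65.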
Cost to replicate
Matching this distribution requires generating synthetic trades with properly power-distributed sizes, a non-trivial engineering requirement. Simple automation with uniform or round sizes produces detectable leading-digit patterns.
M02 Volume-to-order-book-depth ratio
Question
Does the reported 24-hour volume plausibly pass through the observable order book?
Rationale
An exchange reporting $1B daily BTC/USDT volume with a $50k order book has an implied book turnover rate that is physically unusual. Reference venues show a ratio of 24h volume to mean bid+ask depth within ±2% of mid that sits in a consistent empirical range. Values outside this range fall into two distinct regimes with different operational meanings. High ratios suggest reported volume may exceed what the visible order book could plausibly absorb, a potential integrity signal. Low ratios indicate deep liquidity relative to reported flow, a structural pattern common in market-making-dominant venues, not a quality failure in itself. Current v1 scoring penalises both regimes; a forthcoming methodology revision will reflect the asymmetry directly in the score.
Inputs
| Volume | 24h USD notional, computed from own trade data; exchange-reported ticker is not used |
|---|---|
| Depth | Time-averaged ±2% book depth across 1,440 one-minute snapshots |
| Coverage threshold | The metric is not scored for that day below 50% snapshot coverage |
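A minimal Python sketch of the raw ratio R under these inputs. The function name and data shapes are hypothetical, and time-averaging is approximated by a simple mean over equally spaced snapshots, which is an assumption.

```python
EXPECTED_SNAPSHOTS = 1440  # one-minute snapshots in a 24h window

def volume_to_depth_ratio(trades, depth_snapshots,
                          expected=EXPECTED_SNAPSHOTS, min_coverage=0.5):
    # trades: iterable of (price_usd, size_base) from the venue's public trade feed
    # depth_snapshots: USD depth (bid + ask) within ±2% of mid, one value per snapshot
    # Returns R = 24h notional / mean depth, or None when the day is not scored.
    if len(depth_snapshots) < min_coverage * expected:
        return None  # below 50% snapshot coverage
    volume = sum(price * size for price, size in trades)
    mean_depth = sum(depth_snapshots) / len(depth_snapshots)
    if mean_depth <= 0:
        return None
    return volume / mean_depth
```

For the example in the rationale, $1B of daily volume against a $50k book gives R = 20,000.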
Score mapping
Non-monotonic in the raw ratio R: both unusually low and unusually high R produce lower scores in v1, with the typical zone in the middle of the empirical peer-basket range. The two penalised regimes have distinct interpretations (see Rationale): high-R is the volume-integrity concern, low-R is a structural deep-book pattern. The score itself is direction-symmetric in v1; commissioned reports interpret direction narratively. M10 has the same non-monotonic shape and the same caveat.
Cost to replicate
The two failure modes have asymmetric replication costs. Reaching the typical mid-range from the high-R side requires reducing reported volume to what the book can actually absorb, the more difficult adjustment for venues whose volume figures are inflated, since real volume is hard to manufacture and reported volume cannot be cut without admitting prior overstatement. Reaching it from the low-R side is operationally simpler (reduce displayed depth, or attract more flow) but commercially undesirable for venues whose model is market-making-dominant deep-book provision. Deep books themselves carry a direct cost: market-maker incentives, inventory risk, and capital that could otherwise earn yield.
M03 Trade interval entropy
Question
Are trades arriving at times consistent with Poisson-like arrival processes?
Rationale
Organic trading produces irregular trade arrival times: approximately Poisson at baseline and, in active periods, closer to a self-exciting Hawkes process (bursty, with heavy-tailed inter-arrival times). Bot-driven flow often produces either suspiciously regular intervals (exact 1-second spacings, periodic patterns) or unnaturally uniform distributions. M03 evaluates two complementary timing signals: the Shannon entropy of the inter-arrival distribution (spread across time scales) and the Pearson autocorrelation at lag-1 of the inter-arrival sequence (whether consecutive intervals are similar to each other). Entropy is also evaluated separately at multiple time-scale resolutions (intra-burst, tick-level, and order-level) so that timing structure visible only within a narrow scale (a 250ms cadence buried inside an otherwise diverse stream, for example) is not averaged away by the full-scale view. Together these signals catch both single-scale clustering and consecutive-interval regularity.
Inputs
| Data source | Trade timestamps for the target pair, 24h window |
|---|---|
| Bucketing | Logarithmic: 0-100ms, 100ms-1s, 1s-10s, 10s-100s, 100s+ (full-scale entropy signal); per-scale sub-bucketing within 0-100ms, 0-1s, and 0-10s ranges (multi-resolution entropy) |
| Minimum sample | 5,000 trades; below this the metric is not scored for that day |
Score mapping
Monotonic in Shannon entropy of the inter-arrival distribution: higher entropy maps to a higher score. Maximum possible entropy is log2(5) ≈ 2.32 bits (trades distributed perfectly uniformly across the five buckets); minimum is 0 (every trade in a single bucket, characteristic of algorithmic regularity or single-frequency wash patterns). Anchor points: entropy of 1.0 maps to score 50, entropy of 1.5 maps to 80, entropy near the maximum maps to 100. The same anchor table is applied to the per-scale sub-entropies and to the full-scale entropy; the four entropy scores are blended into a single entropy signal. The lag-1 autocorrelation of the inter-arrival sequence is mapped on a separate scale: autocorrelation near zero or negative scores high (organic irregular flow), high positive autocorrelation scores low (regular bot spacing). The blended entropy signal and the autocorrelation sub-score are combined into the headline score.
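The two raw timing signals can be sketched in Python as follows. This covers only the full-scale entropy over the five logarithmic buckets and the lag-1 autocorrelation; the per-scale sub-entropies, the 5,000-trade minimum, and the blending step are omitted, and the function names are hypothetical.

```python
import math

# Upper edges (ms) of the first four buckets; the fifth bucket is 100s and above.
BUCKET_EDGES_MS = [100, 1_000, 10_000, 100_000]

def bucket_index(gap_ms):
    for i, edge in enumerate(BUCKET_EDGES_MS):
        if gap_ms < edge:
            return i
    return len(BUCKET_EDGES_MS)

def interarrival_entropy_bits(timestamps_ms):
    # Shannon entropy (bits) of inter-arrival gaps over the five buckets.
    gaps = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    counts = [0] * (len(BUCKET_EDGES_MS) + 1)
    for gap in gaps:
        counts[bucket_index(gap)] += 1
    n = len(gaps)
    return sum((c / n) * math.log2(n / c) for c in counts if c)

def lag1_autocorrelation(gaps):
    # Pearson correlation between each gap and the next one.
    # Returns None when either series is constant (correlation undefined),
    # which is itself what perfectly regular spacing looks like.
    x, y = gaps[:-1], gaps[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)
```

Trades spaced exactly one second apart land in a single bucket and give an entropy of 0 bits; gaps spread evenly across all five buckets give log2(5) ≈ 2.32 bits.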
Cost to replicate
Matching a Poisson-like arrival distribution requires sophisticated bot design that injects timing variability deliberately. Simpler automation tends to produce detectable regularity at the millisecond or second scale.
M04 Effective spread
Question
What does it actually cost to trade?
Rationale
Quoted spread (best ask − best bid) can be very narrow while the venue is quiet. Effective spread, the realised cost of market orders relative to mid-price, measures actual execution cost against actual trades.
Inputs
| Mid-price reference | 1-minute order book snapshots |
|---|---|
| Trade direction | Buy/sell from feed where provided; Lee-Ready inferred otherwise |
| Aggregation | Volume-weighted average over 24h, in basis points |
v2 roadmap: a 7-day rolling fallback for days with sparse order-book snapshot coverage. v1 uses the 24h window unconditionally and gates to insufficient_data when fewer than 50% of expected 1-minute snapshots are available; trade count itself is reported but does not gate the metric, so quiet trade days with adequate snapshot coverage still produce a score.
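A minimal Python sketch of the aggregation. Several choices here are assumptions rather than the production definition: effective spread is taken as twice the signed distance of the trade price from mid (the conventional round-trip form), weights are base-asset trade sizes, and each trade is assumed to be already paired with the mid from its reference snapshot. The function name is hypothetical.

```python
def effective_spread_bps(trades):
    # trades: iterable of (price, size_base, side, mid)
    #   side: +1 for a buyer-initiated trade, -1 for seller-initiated
    #   mid:  mid-price from the reference order book snapshot
    # Returns the size-weighted effective spread in basis points,
    # or None when there are no trades.
    weighted_sum = 0.0
    total_size = 0.0
    for price, size, side, mid in trades:
        # Assumption: effective spread = 2 * side * (price - mid) / mid
        spread_bps = 2.0 * side * (price - mid) / mid * 1e4
        weighted_sum += size * spread_bps
        total_size += size
    if total_size == 0:
        return None
    return weighted_sum / total_size
```

Under these assumptions, a buy at 100.05 and a sell at 99.95 against a mid of 100.00 each carry an effective spread of about 10 bps.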
Score mapping
Monotonic in effective spread: lower bps maps to a higher score. BTC/USDT benchmarks: under 5 bps is excellent, 5-15 bps typical, above 50 bps atypical. Exact thresholds are anchored to the live peer distribution and recalibrated periodically.
Cost to replicate
Effective spread reflects actual execution cost. A low effective spread is not easily produced without lowering fees or subsidising market makers, and both carry direct costs.
M05 Order book slope (Kyle's λ proxy)
Question
How much does the book absorb? What is the price impact of size?
Rationale
A deep book has a gradual slope: a $1M sell order moves price incrementally. A book with thin layers beyond the inside quote shows a steep slope at size. The relationship between executed size and price impact is a proxy for Kyle's lambda, computable from public order book data alone.
Inputs
| Snapshots | 1-minute frequency, full depth within ±5% of mid |
|---|---|
| Standard sizes | $10k, $100k, $1M notional on each side |
| Outlier treatment | Sizes winsorised at 99.9th percentile |
Score mapping
Monotonic in λ expressed as slippage-bps for a standard $100k trade: lower λ maps to a higher score. Typical range 10-50 bps; values above 200 bps are atypical.
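Slippage for a standard size can be estimated by walking one side of a book snapshot. The Python sketch below is an illustration under assumptions: levels are (price, base size) pairs sorted best-first, slippage is the distance of the average fill price from mid, and the function name is hypothetical.

```python
def slippage_bps(levels, mid, notional_usd):
    # levels: one side of the book as [(price, size_base), ...], best price first
    # Returns slippage of the average fill price versus mid, in basis points,
    # or None when the visible book cannot absorb the requested notional.
    remaining = notional_usd
    filled_base = 0.0
    for price, size in levels:
        level_notional = price * size
        take = min(remaining, level_notional)
        filled_base += take / price
        remaining -= take
        if remaining <= 0:
            break
    if remaining > 0:
        return None  # book too thin within the snapshot
    avg_price = notional_usd / filled_base
    return abs(avg_price - mid) / mid * 1e4
```

For example, a $100k buy against asks of 500 units at 100.10 and 1,000 units at 100.30, with mid at 100.00, fills at an average of roughly 100.20, or about 20 bps of slippage.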
Cost to replicate
Matching this requires maintaining real depth across multiple price levels. The capital cost scales with the square of the depth advertised.
M06 Quote stability and update rate
Question
Are quotes being maintained, or is the book updating at rates disproportionate to actual trading?
Rationale
Quote update rates disproportionate to trade rates can indicate rapid cancel-and-replace cycles that make displayed depth difficult to hit in practice. High update-to-trade ratios are associated with low effective fill rates in several academic studies of exchange microstructure.
Inputs
| Order book stream | WebSocket add/cancel/modify events from the continuous WS consumer |
|---|---|
| Quote-update ratio | book_events / trades over the 24h window |
| Coverage threshold | If WebSocket sequence gaps cover more than 50% of the day, the metric is not scored for that day |
Score mapping
Non-monotonic in the quote-update ratio: both very low and very high values map to a lower score. A near-zero ratio indicates a sleepy book with insufficient quoting activity; a very high ratio (thousands of book events per trade) indicates rapid cancel-and-replace cycles characteristic of quote-stuffing patterns. Healthy market-making sits in a moderate plateau (roughly 50-1000 events per trade) which maps to the highest scores.
Cost to replicate
Sustained moderate quote activity requires real market-maker capital and inventory tolerance. Neither extreme (too quiet, or churn-without-substance) is easily produced as a substitute.
M06 is currently pending multi-venue WebSocket book-event collection. Until at least three venues stream the input, M06 is reported as not scored for the day rather than scored against an under-sampled peer distribution. It will populate automatically once sufficient peer data exists.
M07 Cross-venue price deviation
Question
Does this venue's price track the global market?
Rationale
A venue's mid-price for BTC/USDT should track the arithmetic mean of the peer basket's mid-prices within a tight tolerance, given arbitrage incentives. Persistent deviations may reflect lower arbitrage activity, stale feeds, or isolated price formation on that venue.
Inputs
| Sampling | Mid-price every 60 seconds, target exchange and reference basket |
|---|---|
| Reference basket | Binance, Coinbase, Kraken, OKX (arithmetic mean), excluding target if a basket member |
| Cross-rate | USD/USDT translated via Kraken USD VWAP / Binance USDT VWAP at each timestamp |
Score mapping
Monotonic in mean absolute deviation from the reference basket: lower MAD maps to a higher score. Typical: MAD under 5 bps and tail under 30 bps. Atypical: MAD above 50 bps, or persistent autocorrelation above 0.5.
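A minimal Python sketch of the deviation measure. It assumes the 60-second mid-price series are already aligned by timestamp and translated into a common quote currency; the function name and the dictionary layout are hypothetical.

```python
def mean_abs_deviation_bps(target_mids, peer_mids_by_venue, target_name=None):
    # target_mids: target venue mid-prices, one per 60-second sample
    # peer_mids_by_venue: {venue_name: [mid, ...]} aligned with target_mids
    # target_name: dropped from the basket if it is a basket member
    peers = {
        venue: series
        for venue, series in peer_mids_by_venue.items()
        if venue != target_name
    }
    deviations = []
    for i, target_mid in enumerate(target_mids):
        # Reference price: arithmetic mean of the peer mids at this timestamp
        reference = sum(series[i] for series in peers.values()) / len(peers)
        deviations.append(abs(target_mid - reference) / reference * 1e4)
    return sum(deviations) / len(deviations)
```

A target quoting 100.10 and then 99.90 against a basket reference of 100.00 at both samples has a mean absolute deviation of about 10 bps.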
Cost to replicate
Keeping prices aligned with the global market requires arbitrage linkage. Without underlying liquidity to support two-way arbitrage, alignment is difficult to maintain.
v2 roadmap: replace the arithmetic mean with a volume-weighted reference once aggregated multi-venue volume data is sourced consistently across the peer basket.
M08 Mid-price reversion dynamics
Question
After large trades, does price revert in a way consistent with real market impact?
Rationale
In real markets, informed trades produce permanent impact and uninformed trades produce temporary impact that reverts. If trades at a venue show near-complete reversion within seconds regardless of size, this is inconsistent with the mix of informed and uninformed flow observed at reference venues.
Inputs
| Trade filter | Prints above approximately $50k notional |
|---|---|
| Mid-price series | 1-minute resolution, around each large-trade event |
| Reversion measure | Fraction of large trades whose 60-second reversion ≥ 0.8 (substantially reverted to pre-trade mid) |
| Minimum sample | 50 large trades per day; else 7-day rolling fallback |
Score mapping
Monotonic-decreasing in the substantially-reverted fraction: lower fraction maps to a higher score. Anchors: 20% reversion maps to 90, 30% maps to 75, 50% maps to 30, 70% maps to 10, and 100% reversion (every large trade fully retraced) maps to 0; at that level the venue's flow carries no real impact. Reference venues typically sit in the 20-30% band (scoring in the 75-90 range); above 70% is atypical and inconsistent with the mix of informed flow observed at reference venues.
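The published anchors can be sketched as a piecewise-linear curve in Python. The function names are hypothetical, the per-trade reversion values are assumed to be computed upstream, and the treatment of fractions below 20% (held at 90 here) is an assumption because no anchor is published for that range.

```python
# Published M08 anchors: (substantially-reverted fraction, score)
M08_ANCHORS = [(0.20, 90.0), (0.30, 75.0), (0.50, 30.0), (0.70, 10.0), (1.00, 0.0)]

def reverted_fraction(reversions, threshold=0.8):
    # Share of large trades whose 60-second reversion is at least the threshold.
    if not reversions:
        return None
    return sum(1 for r in reversions if r >= threshold) / len(reversions)

def m08_score(fraction, anchors=M08_ANCHORS):
    # Assumption: fractions below the first anchor are held at its score (90).
    if fraction <= anchors[0][0]:
        return anchors[0][1]
    if fraction >= anchors[-1][0]:
        return anchors[-1][1]
    # Linear interpolation between the two anchors that bracket the fraction.
    for (x0, y0), (x1, y1) in zip(anchors, anchors[1:]):
        if x0 <= fraction <= x1:
            return y0 + (y1 - y0) * (fraction - x0) / (x1 - x0)
```

Under these assumptions a venue where 25% of large trades substantially revert scores 82.5, inside the 75-90 range quoted for reference venues.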
Cost to replicate
Matching this pattern requires trades that carry genuine price impact, which means actually moving the book and taking position risk.
M10 Volume share vs liquidity share
Question
Does this exchange's share of global volume match its share of global liquidity?
Rationale
A venue's share of global volume and its share of global depth should be of similar order. If a venue shows 15% of global BTC/USDT volume but 2% of global order book depth at ±2%, the ratio is unusual relative to peers.
Inputs
| Volume | 24h notional, computed from own trade data; exchange ticker is not used |
|---|---|
| Reference volume | Aggregate 24h across the peer basket, excluding target if a basket member |
| Depth | Time-averaged ±2% depth for target and peers |
Score mapping
Non-monotonic in R = vol_share / depth_share: both unusually low and unusually high R are atypical. Typical: R in [0.7, 1.5]. Atypical: outside [0.3, 5] in either direction. M10 shares this non-monotonic shape, and the direction-symmetric caveat, with M02; M06 is also non-monotonic, in the quote-update ratio.
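A small Python sketch of the ratio and the qualitative zones. Shares are taken relative to the peer aggregate excluding the target, which is one reading of the inputs table and is an assumption; the "intermediate" label for values between the typical and atypical ranges, and the function names, are hypothetical.

```python
def share_ratio(target_volume, peer_volume_total, target_depth, peer_depth_total):
    # R = vol_share / depth_share, with both shares measured against the
    # peer-basket aggregate (assumed to exclude the target venue).
    vol_share = target_volume / peer_volume_total
    depth_share = target_depth / peer_depth_total
    return vol_share / depth_share

def classify_ratio(r):
    # Zones quoted in the score mapping; "intermediate" is a hypothetical label.
    if 0.7 <= r <= 1.5:
        return "typical"
    if r < 0.3 or r > 5:
        return "atypical"
    return "intermediate"
```

The example in the rationale, 15% of volume against 2% of depth, gives R = 7.5, which falls in the atypical zone on the high side.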
Cost to replicate
This is the measurement hardest to match without the underlying fundamentals. Matching both ends requires real volume and real depth simultaneously.
M11 Price leadership and contribution to price discovery
Question
Does this exchange lead global price discovery or lag it?
Rationale
Venues with informed flow tend to lead price moves: discovery happens there first and peers follow shortly after. The v1 implementation correlates the venue's per-minute returns against the peer basket's per-minute returns at seven forward lead times (1, 2, 5, 10, 15, 30, 60 minutes), i.e. the target's return at minute t versus the peer return at minute t + k. The metric records the maximum correlation across those seven leads, plus the lag at which the maximum occurred. A high maximum correlation means the venue's moves predict subsequent peer moves; a low maximum means the venue moves independently of, or lags, the basket.
Inputs
| Mid-price series | 1-minute resolution, target + peer basket (excluding target if a member), 24h window |
|---|---|
| Lead lags tested | 1, 2, 5, 10, 15, 30, 60 minutes (forward only) |
| Aggregate | Maximum Pearson correlation across the seven leads; the winning lag is surfaced as the metric's diagnostic raw value |
Score mapping
Monotonic in the maximum forward-lead correlation: higher correlation maps to a higher score. Anchor points (provisional, calibrated against the empirical distribution observed at v1 launch): correlation of 0.02 maps to score 50, 0.04 to 80, 0.08 or higher to 100. Anchors are subject to revision after 30 days as the cohort distribution sharpens.
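The raw statistic can be sketched in Python as follows. The sketch assumes per-minute return series for the target and the peer basket are already computed and aligned; the function names are hypothetical, and the mapping from correlation to score is not shown.

```python
import math

LEADS_MINUTES = (1, 2, 5, 10, 15, 30, 60)

def pearson(x, y):
    # Plain Pearson correlation; None when either series is constant.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

def max_lead_correlation(target_returns, peer_returns, leads=LEADS_MINUTES):
    # Correlates target return at minute t with peer return at minute t + k
    # for each lead k, and returns (max correlation, winning lead),
    # or None when no lead has enough overlapping data.
    best = None
    for k in leads:
        x = target_returns[:-k]  # target at minute t
        y = peer_returns[k:]     # peers at minute t + k
        if len(x) < 3:
            continue
        r = pearson(x, y)
        if r is None:
            continue
        if best is None or r > best[0]:
            best = (r, k)
    return best
```

If the peer series is simply the target series delayed by five minutes, the maximum correlation is 1.0 and the winning lead is 5.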
Cost to replicate
Matching this requires attracting informed traders, which compounds over time and is difficult to short-circuit.
v2 roadmap: replace the forward-lead correlation proxy with a Hasbrouck information-share computation. The v1 max-lead-correlation is a directionally-aligned proxy for the same construct (which venue contributes most to the cointegrating price level), but the Hasbrouck decomposition gives a quantitative attribution across the basket rather than a single-number per-venue readout.
M11 in v1 uses a lead-lag correlation proxy. This is an approximation of price leadership, not a full information share decomposition (e.g. Hasbrouck IS). A more rigorous implementation is planned for v2.
Composite scoring
| Code | Dimension | Metrics | Intent weight |
|---|---|---|---|
| D1 | Volume authenticity | M01, M02, M03 | 30% |
| D2 | Order book quality | M04, M05, M06 | 25% |
| D3 | Price formation integrity | M07, M08 | 25% |
| D4 | Cross-venue consistency | M10, M11 | 20% |
Step 1: Metric normalisation
Each raw metric value is converted to a 0-100 score using piecewise linear mappings defined in a versioned mapping file. Higher score means higher quality. Every score row is stamped with the mapping_version active at computation time, so historical scores remain interpretable after methodology updates.
Step 2: Dimension scores
The dimension score is the arithmetic mean across that dimension's metrics that returned status='ok'.
Step 3: Composite score
The composite is the unweighted geometric mean across the four dimensions: (D1·D2·D3·D4)^(1/4). A composite is only published when every dimension has at least one scoring metric. The geometric mean is used rather than a weighted arithmetic average so that a single weak dimension cannot be compensated for by strong scores in the others. A zero in any dimension collapses the composite to zero.
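Steps 2 and 3 can be sketched in Python as follows. The row layout (a dictionary per metric with status and score keys) and the function names are hypothetical; the dimension membership follows the table above.

```python
import math

DIMENSIONS = {
    "D1": ["M01", "M02", "M03"],
    "D2": ["M04", "M05", "M06"],
    "D3": ["M07", "M08"],
    "D4": ["M10", "M11"],
}

def dimension_score(metric_rows, metric_ids):
    # Arithmetic mean over the dimension's metrics with status 'ok';
    # None when no metric in the dimension scored.
    ok_scores = [
        metric_rows[m]["score"]
        for m in metric_ids
        if m in metric_rows and metric_rows[m]["status"] == "ok"
    ]
    return sum(ok_scores) / len(ok_scores) if ok_scores else None

def composite_score(metric_rows):
    # Unweighted geometric mean of the four dimension scores;
    # None (not published) when any dimension has no scoring metric.
    dims = [dimension_score(metric_rows, ids) for ids in DIMENSIONS.values()]
    if any(d is None for d in dims):
        return None
    return math.prod(dims) ** (1 / len(dims))
```

With dimension scores of 90, 90, 90 and 30, the geometric mean is about 68.4, whereas a plain arithmetic average would give 75; this is the non-compensation property described in Step 3.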
Step 4: Band assignment
| Band | Composite score |
|---|---|
| 1 | ≥ 85 |
| 2 | 75 to < 85 |
| 3 | 60 to < 75 |
| 4 | 45 to < 60 |
| 5 | < 45 |
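Band assignment reduces to a threshold lookup. In the Python sketch below, the function name is hypothetical, and treating each boundary value as belonging to the better band is an assumption inferred from the "≥ 85" and "< 45" end rows.

```python
def assign_band(composite):
    # Assumption: lower bounds are inclusive, so 85 is band 1 and 45 is band 4.
    if composite >= 85:
        return 1
    if composite >= 75:
        return 2
    if composite >= 60:
        return 3
    if composite >= 45:
        return 4
    return 5
```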
What stays proprietary
Tiered disclosure: M01 and M08 anchors are published; anchors for other metrics are disclosed in commissioned reports. Most calibration parameters are not published in full on the public methodology page. The published exceptions are listed first.
- M01 score-mapping anchors are public. The five-anchor piecewise-linear curve from χ²/N to score is set out in the M01 entry above. Published as a worked example of the calibration shape and as a transparency commitment for the metric most often cited in commission discussions.
- M08 score-mapping anchors are public. The five-anchor curve from substantially-reverted fraction to score is set out in the M08 entry above (20% → 90, 30% → 75, 50% → 30, 70% → 10, 100% → 0). Published alongside M01 as the second worked-example calibration.
The remainder of the calibration set is disclosed in commissioned reports rather than on this page:
- Score-mapping anchors for M02-M07, M10, and M11. The score ranges shown qualitatively in each metric entry are illustrative; production thresholds are calibrated from observed distributions across the peer basket.
- Peer baseline calibration parameters for M03, M06, M08, and M11.
- Per-metric outlier rules beyond the published 99.9% winsorisation.
Threshold calibration uses observed distributions across a reference peer basket and is recalibrated periodically. Full calibration parameters are available to data licence customers; the baseline_version and mapping_version stamped on each score row allow any customer to reproduce any historical score exactly.
Data sources
All 10 covered exchanges are ingested via their public REST and WebSocket APIs. No authentication, no paid feeds, no exchange cooperation is required at any stage.
USD-quoted venues (Coinbase, Kraken) are handled in venue-native pairs for single-venue metrics. For cross-venue comparison, USD prices are translated via the Kraken USD VWAP / Binance USDT VWAP basis computed at each timestamp.
Limitations
Assay scores are narrow by design. They describe market data integrity, not broader aspects of an exchange's operation. The following are not measured and should not be inferred:
- Custody security or solvency
- Regulatory standing or licensing status
- User experience, customer support quality, or dispute handling
- Fiat on-ramp / off-ramp quality
- Derivatives market quality (v1 is spot-only)
- Token-level listing quality for pairs other than BTC and ETH
A high Assay score does not imply safety. A low Assay score reflects market data characteristics outside the typical range; it does not imply intent.
Several metrics (M04 effective spread, M08 large-trade reversion, and M11 price leadership) do not currently scale score uncertainty by sample size. A thinly-traded venue's headline score for these metrics carries wider intrinsic variance than a high-volume venue's score at the same number, even though both are reported on the same 0-100 scale. This is acknowledged as a known v1 limitation; sample-size-aware uncertainty bands are tracked on the v2 backlog alongside the next round of mapping recalibration.
Version history
| Version | Notes |
|---|---|
| v1.0.0 | Launch baseline. Composite scoring switched from a weighted linear average to an unweighted geometric mean across the four dimensions. Public copy reframed as market microstructure integrity scoring with the regime-sensitivity caveat. |
| v0.3.0 | M09 (funding rate / basis sanity) removed entirely. v1 is spot-only and M09 required a derivatives data pipeline that is out of scope; reintroduction tracked as v2 work. |
| v0.1.0 | Initial release. 10 exchanges, 11 metrics, 2 pairs. |