No Out-of-Sample Edge: An Honest Backtest of a Technical BTC Strategy


This is a negative result, and a cautionary tale about trusting negative results. We set out to answer a narrow, falsifiable question: does a purely technical trading strategy (indicators, no fundamentals, no order flow) have any out-of-sample edge on BTC/USDT after realistic costs? The answer was no. But the path to that answer ran straight through a silent data bug that nearly produced the same conclusion for an entirely different, wrong reason. The bug is the real lesson.

The question and the bet

A backtest that optimizes parameters and then reports the optimized return is measuring its own ability to fit noise, not edge. The only honest test is walk-forward: optimize on a training window, then measure performance on a later window the optimizer never saw, and roll that forward. Our hypothesis, stated as a bet we were willing to lose: a technical-only strategy, evaluated walk-forward with real fees, would land at a profit factor (gross profit divided by gross loss) at or below 1.0, meaning no durable edge.
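The rolling split can be sketched in a few lines of C#, the language of the loader fix in this post, written as top-level statements. This is a minimal sketch under stated assumptions. The post doesn't give its window lengths or say whether training windows roll or expand, so the 24-month/12-month lengths and the `WalkForwardWindows` helper are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: roll a fixed-length training window, followed immediately by an
// out-of-sample test window, across [start, end). The optimizer only ever sees the training span.
static List<(DateTime TrainStart, DateTime TrainEnd, DateTime TestStart, DateTime TestEnd)>
    WalkForwardWindows(DateTime start, DateTime end, int trainMonths, int testMonths)
{
    if (trainMonths <= 0 || testMonths <= 0)
        throw new ArgumentOutOfRangeException(trainMonths <= 0 ? nameof(trainMonths) : nameof(testMonths));

    var windows = new List<(DateTime TrainStart, DateTime TrainEnd, DateTime TestStart, DateTime TestEnd)>();
    for (var trainStart = start; ; trainStart = trainStart.AddMonths(testMonths))
    {
        var trainEnd = trainStart.AddMonths(trainMonths);
        var testEnd = trainEnd.AddMonths(testMonths);
        if (testEnd > end) break;                               // no room left for a full test window
        windows.Add((trainStart, trainEnd, trainEnd, testEnd)); // test begins exactly where training ends
    }
    return windows;
}

foreach (var w in WalkForwardWindows(new DateTime(2021, 1, 1), new DateTime(2026, 5, 1), 24, 12))
    Console.WriteLine($"train {w.TrainStart:yyyy-MM} to {w.TrainEnd:yyyy-MM}, test to {w.TestEnd:yyyy-MM}");
```

Each window is scored only on its test span, and the per-window results are what a walk-forward fitness aggregates. An anchored variant would keep `trainStart` fixed and let the training span grow instead.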

To make the test unforgiving we baked in the things that flatter a strategy when omitted: no look-ahead (signals computed only from closed candles), real 0.25% round-trip Binance costs, and long-only exposure. The parameter search was a genetic algorithm; the scoring was walk-forward profit factor on the held-out test year, not in-sample return.
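One way to make "no look-ahead" and "real costs" structural rather than a promise: the signal for bar i receives only the closes up to bar i, entries and exits fill at the next bar's open, and the 0.25% round-trip cost comes off every completed trade. This is a sketch. The fill-at-next-open convention, the `RunLongOnly` name, and the toy momentum rule are assumptions for illustration. The post doesn't say how its backtester fills orders.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical long-only simulator. The signal for bar i sees only closes[0..i]
// (bar i has already closed); any entry or exit fills at the open of bar i + 1.
static List<double> RunLongOnly(double[] opens, double[] closes,
                                Func<ArraySegment<double>, bool> wantLong,
                                double roundTripCost)
{
    var tradeReturns = new List<double>();
    double? entry = null;
    for (int i = 0; i + 1 < closes.Length; i++)
    {
        // ArraySegment's indexer rejects indices >= Count, so an accidental h[i + 1] throws.
        bool signal = wantLong(new ArraySegment<double>(closes, 0, i + 1));
        double fill = opens[i + 1];
        if (signal && entry is null)
            entry = fill;
        else if (!signal && entry is double e)
        {
            tradeReturns.Add(fill / e - 1.0 - roundTripCost); // net return after fees
            entry = null;
        }
    }
    return tradeReturns; // a position still open at the end is not counted in this sketch
}

// Toy rule: be long while the last closed bar closed higher than the one before it.
Func<ArraySegment<double>, bool> momentum = h => h.Count >= 2 && h[h.Count - 1] > h[h.Count - 2];
var trades = RunLongOnly(new[] { 10.0, 10, 11, 12, 11 }, new[] { 10.0, 11, 12, 11, 10 }, momentum, 0.0025);
Console.WriteLine(string.Join(", ", trades)); // one round trip in and out at 11: net -0.0025
```

The example trade enters and exits at the same price and still loses exactly the round-trip cost, which is why a strategy needs a gross move larger than 0.25% per trade just to break even.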

First result — and why we didn’t trust it

The first run agreed with the hypothesis. Across every generation, walk-forward profit factor came in at 0.65–0.90 — consistently losing money after costs — over 58–107 out-of-sample trades, with zero parameter sets ever promoted. Clean confirmation.

Too clean. When a result lands exactly where you predicted, that is precisely when you should audit the pipeline, not celebrate. So we traced what data the backtester had actually loaded — and found it had quietly evaluated the strategy on 2024 only. Everything from January 2025 onward was missing.
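The audit that exposed the problem comes down to asking which stretch of time the loaded candles actually cover. Here is a minimal sketch that assumes open times are already in milliseconds. `CandlesPerYear` and `MissingYears` are hypothetical names, not part of the post's tooling.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical audit: how many candles landed in each calendar year (UTC)?
static SortedDictionary<int, int> CandlesPerYear(IEnumerable<long> openTimesMs)
{
    var counts = new SortedDictionary<int, int>();
    foreach (var ms in openTimesMs)
    {
        int year = DateTimeOffset.FromUnixTimeMilliseconds(ms).UtcDateTime.Year;
        counts[year] = counts.TryGetValue(year, out var n) ? n + 1 : 1;
    }
    return counts;
}

// Years the run was supposed to cover that contributed zero candles.
static List<int> MissingYears(SortedDictionary<int, int> counts, int firstYear, int lastYear) =>
    Enumerable.Range(firstYear, lastYear - firstYear + 1)
              .Where(y => !counts.ContainsKey(y))
              .ToList();

long jan2024 = new DateTimeOffset(2024, 1, 1, 0, 0, 0, TimeSpan.Zero).ToUnixTimeMilliseconds();
var perYear = CandlesPerYear(new[] { jan2024, jan2024 + 3_600_000 });
Console.WriteLine(string.Join(", ", MissingYears(perYear, 2024, 2026))); // 2025, 2026
```

A check like this costs almost nothing and can run before every backtest instead of only when a result looks suspicious.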

The silent data bug

Binance’s public market-data dumps changed format at the start of 2025: candle timestamps went from 13-digit milliseconds to 16-digit microseconds. The loader parsed every timestamp as milliseconds. On the new files, that produced nonsensical far-future dates, so the row threw, and the parse sat inside a bare catch that swallowed the exception and moved on. No error, no warning, no count. Every 2025–2026 candle was silently dropped, and the backtest happily ran on the surviving 2024 data.
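The failure mode is easy to reproduce in isolation. This is a hypothetical reconstruction that assumes the loader converted timestamps with `DateTimeOffset.FromUnixTimeMilliseconds`; the post doesn't name the call it used. Read as milliseconds, a 16-digit microsecond timestamp lands roughly 55,000 years after 1970, past `DateTimeOffset.MaxValue`. The conversion therefore throws `ArgumentOutOfRangeException`, and the bare `catch` quietly turns that into a shorter list.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical reconstruction of the bug: parse as milliseconds, swallow any exception.
static List<DateTimeOffset> ParseSilently(IEnumerable<long> rawOpenTimes)
{
    var parsed = new List<DateTimeOffset>();
    foreach (var raw in rawOpenTimes)
    {
        try { parsed.Add(DateTimeOffset.FromUnixTimeMilliseconds(raw)); }
        catch { /* swallowed: no error, no warning, no count */ }
    }
    return parsed;
}

long ms2024 = 1_704_067_200_000;      // 2024-01-01T00:00:00Z as 13-digit milliseconds
long us2025 = 1_735_689_600_000_000;  // 2025-01-01T00:00:00Z as 16-digit microseconds
var rows = ParseSilently(new[] { ms2024, us2025 });
Console.WriteLine(rows.Count); // 1: the 2025 row vanished without a trace
```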

The fix is small, but the kind of fix matters:

// Normalize ms vs µs: 16-digit microsecond timestamps -> milliseconds
static long ToUnixMs(long raw) => raw >= 1_000_000_000_000_000L ? raw / 1000L : raw;

…applied to the open-time and close-time fields. And critically, the silent catch became a counted, logged skip — if rows are dropped now, the run tells you how many and why. A swallowed exception is not error handling; it is a landmine that converts data loss into a confident wrong answer.
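Here is a sketch of what a counted, logged skip might look like around the `ToUnixMs` helper above. It assumes the usual Binance kline CSV column order (open time, open, high, low, close, volume, close time, ...) and keeps only three fields for brevity. `LoadKlines` and its skip reasons are hypothetical. Because parsing uses `TryParse`, malformed rows never reach a `catch` at all, and every rejection is tallied by reason and written to stderr.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Normalize ms vs µs: 16-digit microsecond timestamps -> milliseconds (the fix from the text)
static long ToUnixMs(long raw) => raw >= 1_000_000_000_000_000L ? raw / 1000L : raw;

// Hypothetical loader: every rejected row is counted by reason and reported, never swallowed.
// Assumed columns: open_time, open, high, low, close, volume, close_time, ...
static (List<(long OpenMs, double Close, long CloseMs)> Candles, Dictionary<string, int> Skipped)
    LoadKlines(IEnumerable<string> lines)
{
    var candles = new List<(long OpenMs, double Close, long CloseMs)>();
    var skipped = new Dictionary<string, int>();
    void Skip(string reason) => skipped[reason] = skipped.TryGetValue(reason, out var n) ? n + 1 : 1;

    foreach (var line in lines)
    {
        var f = line.Split(',');
        if (f.Length < 7) { Skip("too few columns"); continue; }
        if (!long.TryParse(f[0], out var openRaw) || !long.TryParse(f[6], out var closeRaw))
        { Skip("non-numeric timestamp"); continue; }   // also catches a header row
        if (!double.TryParse(f[4], NumberStyles.Float, CultureInfo.InvariantCulture, out var close))
        { Skip("non-numeric close"); continue; }
        candles.Add((ToUnixMs(openRaw), close, ToUnixMs(closeRaw)));
    }

    foreach (var (reason, count) in skipped)
        Console.Error.WriteLine($"LoadKlines: skipped {count} row(s): {reason}");
    return (candles, skipped);
}
```

Returning the skip counts alongside the candles lets the caller go further than logging, for example by aborting the run when the skipped fraction exceeds some tolerance.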

Second result — the honest one

With the loader fixed, the backtest re-ran on 189,722 candles spanning January 2021 to May 2026, with the walk-forward test year moved to the most recent data: May 2025 → May 2026. The verdict held, but now for real:

  • Every generation’s walk-forward profit factor: 0.43–0.85 — all failing.
  • The search early-stopped at generation 7, unable to beat its own seed.
  • The final “best” individual defaulted back to the seed parameters: profit factor 0.42, Sharpe −2.99, win rate 24% out-of-sample.
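To pin down the three numbers, here are their conventional definitions applied to a list of per-trade net returns. This sketch assumes equal position sizes, so summing returns stands in for summing profit and loss. The post doesn't say how its Sharpe ratio was computed or annualized, so the `periodsPerYear` scaling is an assumption, as are the function names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Profit factor: gross profit divided by gross loss; below 1.0 the losers outweigh the winners.
static double ProfitFactor(IReadOnlyList<double> r)
{
    double gains = r.Where(x => x > 0).Sum();
    double losses = -r.Where(x => x < 0).Sum();
    if (losses == 0) return gains > 0 ? double.PositiveInfinity : double.NaN;
    return gains / losses;
}

// Win rate: fraction of trades with a positive net return.
static double WinRate(IReadOnlyList<double> r) =>
    r.Count == 0 ? double.NaN : (double)r.Count(x => x > 0) / r.Count;

// Sharpe ratio: mean over sample standard deviation, scaled by sqrt(periodsPerYear).
static double Sharpe(IReadOnlyList<double> r, double periodsPerYear)
{
    if (r.Count < 2) return double.NaN;
    double mean = r.Average();
    double sd = Math.Sqrt(r.Sum(x => (x - mean) * (x - mean)) / (r.Count - 1));
    return sd == 0 ? double.NaN : mean / sd * Math.Sqrt(periodsPerYear);
}

var sample = new[] { 0.02, -0.01, -0.01, 0.01 };
Console.WriteLine($"PF {ProfitFactor(sample):F2}, win {WinRate(sample):P0}, Sharpe {Sharpe(sample, 1):F3}");
```

A low win rate alone is not fatal, because a few large winners can carry many small losers. Profit factor is the number that shows whether they actually do.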

The regime change from 2024 to 2025–26 did not rescue the strategy. A technical-only approach has no out-of-sample edge on fresh data either. The in-sample fitness was fitting noise the whole time; walk-forward simply refused to pay for it.
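The behavior in the list above (an early stop, a "best" individual that falls back to the seed, nothing promoted) follows naturally if the seed is the incumbent and a challenger must beat it on walk-forward fitness. This is a minimal sketch, not the post's genetic algorithm. Mutation and crossover are hidden behind a hypothetical `breed` delegate, and `patience` and the promotion threshold are assumptions. A threshold of 1.0 would mirror the hypothesis that a profit factor at or below 1.0 means no edge. The sketch needs .NET 6 or later for `MaxBy`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical search loop: the seed is the incumbent, and only a candidate that beats it on
// walk-forward fitness replaces it. Stops after `patience` generations with no improvement.
static (T Best, double Fitness, int Generations, bool Promoted) Search<T>(
    T seed, Func<T, double> walkForwardFitness, Func<int, T, IEnumerable<T>> breed,
    int maxGenerations, int patience, double promoteAbove)
{
    T best = seed;
    double bestFit = walkForwardFitness(seed);
    int gen = 0, stale = 0;
    while (gen < maxGenerations && stale < patience)
    {
        gen++;
        // breed must yield at least one candidate; MaxBy throws on an empty sequence of tuples.
        var (cand, fit) = breed(gen, best)
            .Select(c => (Candidate: c, Fitness: walkForwardFitness(c)))
            .MaxBy(t => t.Fitness);
        if (fit > bestFit) { best = cand; bestFit = fit; stale = 0; }
        else stale++;
    }
    // Promotion is a separate gate: "best so far" is not the same as "has an edge".
    return (best, bestFit, gen, bestFit > promoteAbove);
}

// Toy run where no child ever beats the seed: the search stops early and keeps the seed.
var run = Search(seed: 0, walkForwardFitness: p => p == 0 ? 0.8 : 0.5,
                 breed: (g, parent) => new[] { parent + g, parent - g },
                 maxGenerations: 50, patience: 5, promoteAbove: 1.0);
Console.WriteLine(run); // Best = 0 (the seed), stopped after 5 generations, Promoted = false
```

Keeping "best so far" separate from "promoted" is the point of the design. An optimizer always returns something, and the gate decides whether that something deserves to trade.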

Why this matters more than the trade

It would have been easy to publish the first result. It matched the hypothesis, the number was plausible, and “technical-only BTC strategies don’t survive costs” is a defensible claim. We would have been right by accident, on one year of data, with a broken loader — and we would never have known.

Two takeaways we now treat as rules:

  1. A negative result is only as trustworthy as the pipeline that produced it. Confirmation is the most dangerous moment to stop checking. The audit that feels redundant is the one that catches the silent bug.
  2. Never let a catch swallow data silently. If a row, a file, or a record can be dropped, the drop must be counted and logged. Silent failure in a data loader doesn’t crash — it lies.

The strategy is shelved; live trading stays gated off. But the backtester is now honest, the loader fails loudly, and the next hypothesis — whether a non-technical signal layer adds anything — gets measured on real data from the start. That is the only kind of “no” worth publishing.