Track A/B experiments comparing evaluation conditions: context formats, reasoning modes, and eval types.
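
As a rough illustration of what tracking such a comparison could look like, here is a minimal Python sketch. It is not the project's actual API: the names `Arm`, `ABExperiment`, `varied_keys`, and `delta`, the condition keys, and the scores are all hypothetical, chosen only to show two evaluation conditions that differ in one dimension (here, context format) being compared on recorded scores.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Arm:
    """One side of an A/B comparison: a named condition plus its scores."""
    name: str
    condition: dict  # hypothetical keys, e.g. {"context_format": "plain"}
    scores: list = field(default_factory=list)

    def mean_score(self):
        # Returns None until at least one score has been recorded.
        return mean(self.scores) if self.scores else None


@dataclass
class ABExperiment:
    """Pairs two arms that should differ only in the condition under test."""
    name: str
    a: Arm
    b: Arm

    def record(self, arm_name, score):
        # Route a score to the matching arm; reject unknown arm names.
        for arm in (self.a, self.b):
            if arm.name == arm_name:
                arm.scores.append(score)
                return
        raise KeyError(f"unknown arm: {arm_name}")

    def varied_keys(self):
        # Condition keys whose values differ between the two arms.
        keys = set(self.a.condition) | set(self.b.condition)
        return sorted(
            k for k in keys
            if self.a.condition.get(k) != self.b.condition.get(k)
        )

    def delta(self):
        # Mean score of B minus mean score of A; None if either arm is empty.
        ma, mb = self.a.mean_score(), self.b.mean_score()
        return None if ma is None or mb is None else mb - ma


# Illustrative usage with made-up scores (not real results).
exp = ABExperiment(
    name="context-format-test",
    a=Arm("A", {"context_format": "plain", "reasoning_mode": "off"}),
    b=Arm("B", {"context_format": "structured", "reasoning_mode": "off"}),
)
exp.record("A", 0.5)
exp.record("B", 0.75)
print(exp.varied_keys())
print(exp.delta())
```

Keeping each arm's condition as a plain dictionary makes it easy to check that an experiment varies only the dimension it claims to test: `varied_keys()` should return a single key for a clean A/B comparison.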