1) New robustness dimensions in model evaluation
What changed: Public benchmarks added stability checks for chain-of-reasoning outputs.
Why it matters: Scoring stability alongside accuracy reduces over-reliance on single-pass accuracy metrics, which reflect only one run per question.
Source: Replace with final source links before production publication.
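Illustration: to show how a stability check differs from single-pass accuracy, here is a minimal sketch. It assumes each question is run several times and only the final answers are compared. The function name `stability_report`, the metric names, and the sample data are hypothetical and are not taken from any specific benchmark.

```python
from collections import Counter


def stability_report(samples, gold):
    """Compare single-pass accuracy with two simple stability measures.

    samples: dict mapping question id -> list of final answers from repeated runs
    gold:    dict mapping question id -> reference answer
    """
    n = len(samples)

    # Single-pass accuracy: score only the first run of each question.
    single_pass = sum(runs[0] == gold[q] for q, runs in samples.items()) / n

    # Mean agreement: share of runs that match the most common answer,
    # averaged over questions (ignores whether that answer is correct).
    agreement = sum(
        Counter(runs).most_common(1)[0][1] / len(runs)
        for runs in samples.values()
    ) / n

    # All-runs accuracy: a question counts only if every run is correct.
    all_runs = sum(
        all(answer == gold[q] for answer in runs)
        for q, runs in samples.items()
    ) / n

    return {
        "single_pass_accuracy": single_pass,
        "mean_agreement": agreement,
        "all_runs_accuracy": all_runs,
    }


# Hypothetical data: three questions, three runs each.
samples = {
    "q1": ["4", "4", "4"],
    "q2": ["7", "9", "7"],
    "q3": ["12", "15", "12"],
}
gold = {"q1": "4", "q2": "7", "q3": "15"}

report = stability_report(samples, gold)
print(report)
```

On this toy data the first run is correct on two of three questions, but only one question is answered correctly on every run. That gap is the kind of information a single-pass score would not reveal.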