Small benchmark exploring reasoning invariance across prompt variants

I built a small experimental benchmark to explore a question about LLM reasoning stability: do reasoning trajectories remain stable when logically equivalent problems are phrased differently?

The benchmark currently includes a small suite of probes testing things like:

• instruction hierarchy effects
• constraint ordering
• missing-information traps
• logical paradox stability
• goal hijacking

Repository:

https://github.com/Ziechoes/reasoning-invariance-benchmark

The goal is to detect cases where reasoning paths or final answers diverge even though the logical structure of the problem remains identical.
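As a rough illustration of that goal, here is a minimal sketch of an invariance check. The `toy_model` function, the paraphrase pair, and the normalization rules are all hypothetical placeholders, not the benchmark's actual API or probes:

```python
def normalize(answer: str) -> str:
    """Canonicalize an answer so trivial formatting differences
    (case, whitespace, trailing periods) don't count as divergence."""
    return " ".join(answer.lower().split()).rstrip(".")

def invariance_check(model_fn, variants):
    """Run logically equivalent phrasings through model_fn and
    report whether the normalized answers agree."""
    answers = {v: normalize(model_fn(v)) for v in variants}
    return {"invariant": len(set(answers.values())) == 1, "answers": answers}

# Toy stand-in model that is sensitive to surface phrasing,
# purely to demonstrate what a detected divergence looks like.
def toy_model(prompt: str) -> str:
    return "Yes" if "all" in prompt else "Cannot determine"

result = invariance_check(toy_model, [
    "If all A are B and x is A, is x B?",
    "Given that every A is a B, and x is an A, is x a B?",
])
# result["invariant"] is False: the two logically equivalent
# phrasings produced different answers.
```

A real harness would also need to compare intermediate reasoning steps, not just final answers, and account for sampling noise (e.g. by fixing temperature or aggregating over repeated runs).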

Curious whether anyone has run similar invariance tests or has suggestions for strengthening the benchmark.