Research

My papers, and the open-source evals that come with them. Mostly AI safety.

Not All Instructions Are Forgotten Equal

JAIIO 2026 (ASAID symposium). Accepted, camera-ready in preparation.

I wanted to know why LLMs follow some instructions for a whole session and quietly drop others a few turns later. So I ran a small study and fit a Bayesian hierarchical ordered-logit model, mostly to keep the uncertainty honest. The tentative read: how well an instruction holds up depends a lot on which instruction it is, and at least one kind seems to get worse the more the model is reinforced on it. The intervals are wide and it's a single study, so I'd treat it as a direction to chase, not a settled result.

The benchmark behind it is open source and built on Inspect AI (UK AISI's eval framework). It combines deterministic checkers with a panel of LLM judges from different model families. Code's almost ready to release.
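For a rough idea of how a checker-plus-panel setup can fit together, here's a minimal sketch in plain Python. It is not the benchmark's code and it doesn't use Inspect AI's API. The names (`word_limit_checker`, `panel_verdict`, `grade`) are hypothetical, the judges are stubs standing in for real model calls, and both the majority-vote rule and the "checker first, panel as fallback" order are assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# Hypothetical types for illustration only; not taken from the benchmark or Inspect AI.
Checker = Callable[[str], Optional[bool]]  # True/False when decidable, None when not
Judge = Callable[[str], bool]              # stand-in for a call to an LLM judge


def word_limit_checker(limit: int) -> Checker:
    """Deterministic check: does the reply stay within `limit` words?"""
    def check(reply: str) -> Optional[bool]:
        return len(reply.split()) <= limit
    return check


def panel_verdict(reply: str, judges: Sequence[Judge]) -> bool:
    """Majority vote across the panel; a tie counts as 'not followed' (assumed rule)."""
    votes = Counter(judge(reply) for judge in judges)
    return votes[True] > votes[False]


def grade(reply: str, checker: Optional[Checker], judges: Sequence[Judge]) -> bool:
    """Use the deterministic checker when it can decide, otherwise ask the panel."""
    if checker is not None:
        result = checker(reply)
        if result is not None:
            return result
    return panel_verdict(reply, judges)


# Stub judges; in a real setup each would wrap a model from a different family.
def lenient_judge(reply: str) -> bool:
    return True


def strict_judge(reply: str) -> bool:
    return reply.endswith(".")


def keyword_judge(reply: str) -> bool:
    return "please" in reply.lower()


if __name__ == "__main__":
    panel = [lenient_judge, strict_judge, keyword_judge]
    # Decided by the deterministic checker alone.
    print(grade("Short answer.", word_limit_checker(5), panel))
    # No checker applies, so the panel votes.
    print(grade("Could you please wait", None, panel))
```

The reason to try the checker first is that anything mechanically verifiable (length, format, required keywords) doesn't need a judge's opinion, which keeps judge disagreement confined to the instructions that actually need interpretation.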

Paper (PDF) · ORCID