Current work:
UK AISI Inspect Evals: integrating the CyberSecEval_2 benchmark with sandboxed vulnerability detection and exploitation
These are not the droids you're testing for: Benchmark evasion in mainstream evals
Earlier research:
Tacit cyber risk detection circuits in Mistral 7B
LLM watermarks via text steganography
Opportunities & upcoming projects (contact me):
Backdoor hunting in open-source models in the wild
LoRA-based ablation techniques for hardening system prompts