Current work:

  • UK AISI Inspect Evals: integrating the CyberSecEval_2 benchmark with sandboxed vulnerability detection and exploitation

  • These are not the droids you're testing for: Benchmark evasion in mainstream evals

Earlier research:

  • Tacit cyber risk detection circuits in Mistral 7B

  • LLM watermarks via text steganography

Opportunities & upcoming projects (contact me):

  • Backdoor hunting in open-source models in the wild (ITW)

  • LoRA-based ablation techniques for hardening system prompts