AI safety research agenda

Overview

As the industry increasingly builds and trusts LLM-based systems to become part of all technologies, AI safety and security is a clear priority for governments, the Tech sector, and for me. While leading cybersecurity teams since 2019, I’ve also focused on ML, from detection algorithms to safety prototypes for LLMs. The urgency of this intersection of AI and security led me to contribute to fantastic teams & projects, and to learn and grow in the process.

Primarily as a member of Arcadia Impact’s AI Safety Engineering Taskforce, I submitted several contributions
Released: Cyber capability elicitation (TPM)
- Guided the development of an agent with support for Reflection and Learning from Mistakes
- Evaluated the increase of an LLM’s performance on the Intercode_CTF (compared to Google DeepMind’s evaluation)
Released: CyberSecEval_2 (Research engineer)
- Meta’s original benchmark brought evaluations of LLM susceptibility to prompt injection, interpreter abuse, and vulnerability identification/exploitation.
- I integrated the system and dataset into the UK AISI inspect_evals framework to make it nearly effortless to assess new models/reproduce results.
- As part of this integration, I added interpreter sandboxing and JSON improvements to the harness for safety and robustness.
- Identified challenges CyberSecurity evals face with under-elicitation of models and underrated responses due to data formatting
Under review: Self-jailbreak methods (submission author)
- LLM-aided LLM jailbreaking evaluation (in the line of IRIS/AutoDAN)
- My eval proposal and prototype include three distinct “tasks” in which a model attacks LLM safety measures, applied to itself.
- In each task, the Attack Success Rate shows the model’s ability to bypass its own default refusals on “safety challenges” if asked for jailbreak techniques (python string transformations, decompositions of the prompt, and reflection applied to multi-turn conversation).
Under review: Sandbagging/Benchmark evasion in mainstream evals (submission author)
- My eval proposal (also submitted to OpenAI) compares the attack success rates of existing safety evals under traditional evaluation scaffolding vs production/human-like request patterns (i.e., preventing the model from detecting that it is being evaluated)
- Rooted in evasion techniques for bot detection (from my work at Akamai)
- I plan to follow this research with contributions to the inspec_ai framework to provide mitigations against sandbagging
- Easy input guardrails with semantic embeddings on Cloudflare: llm-proxy.com (Oct 2023)
- Tacit cyber risk detection in LLMs (mechanistic interpretability research project as part of BlueDot’s AI safety fundamentals course, also on LessWrong - June 2024)
- Making LLM outputs detectable using watermarks based on text steganography (Aug 2024)
- Scoping LLM applications / Hardening instruction hierarchy: state of the art, gaps, and alternatives (Apr 9, 2025)
- OWASP GenAI: Key contributor to the guide “Securing Agentic Applications” (RSAC 2025)
  - Including Memory-safe agentic data flows (separation of LLM control flow from data plane, April 2025)
- Hybrid Organizations & Trust: Risk Analyses of Agentic Human-AI Work Hierarchies in the Global Economic Transition (in progress since April 2025)
- Boston AI safety group: I host local meetups and build educational content
  - Next event: 6/26 @ Boston Public Library
- Getting Started in AI Safety: Key readings (Feb 2025)
- Trust Pressure Ratio: a nutrition label focused on commensurate safety testing for AI-powered applications
- Defend Your Castle: prompt engineering for the blue team, learning limitations and mitigations
- Hype or reality: benefits of LLMs amid broader AI risk-benefit tradeoffs.
- Guiding the safety of AI products at Panorama Education, including Solara and ClassCompanion

AI safety research agenda

Overview

UK AISI benchmarks

Research / prototypes

Governance & Awareness