AI safety research agenda
Overview
As the industry increasingly builds and trusts LLM-based systems to become part of all technologies, AI safety and security is a clear priority for governments, the Tech sector, and for me. While leading cybersecurity teams for 7 years, I’ve also focused on ML since 2018, from detection algorithms to safety prototypes for LLMs. The urgency of this intersection of AI and security led me to contribute to fantastic teams & projects, and to learn and grow in the process.
-
Primarily as a member of Arcadia Impact’s AI Safety Engineering Taskforce, I submitted several contributions
Released: Cyber capability elicitation (TPM)
Guided the development of an agent with support for Reflection and Learning from Mistakes
Evaluated the increase of an LLM’s performance on the Intercode_CTF (compared to Google DeepMind’s evaluation)
Released: CyberSecEval_2 (Research engineer)
Meta’s original benchmark brought evaluations of LLM susceptibility to prompt injection, interpreter abuse, and vulnerability identification/exploitation.
I integrated the system and dataset into the UK AISI inspect_evals framework to make it nearly effortless to assess new models/reproduce results.
As part of this integration, I added interpreter sandboxing and JSON improvements to the harness for safety and robustness.
Identified challenges CyberSecurity evals face with under-elicitation of models and underrated responses due to data formatting
Under review: Self-jailbreak methods (submission author)
LLM-aided LLM jailbreaking evaluation (in the line of IRIS/AutoDAN)
My eval proposal and prototype include three distinct “tasks” in which a model attacks LLM safety measures, applied to itself.
In each task, the Attack Success Rate shows the model’s ability to bypass its own default refusals on “safety challenges” if asked for jailbreak techniques (python string transformations, decompositions of the prompt, and reflection applied to multi-turn conversation).
Under review: Sandbagging/Benchmark evasion in mainstream evals (submission author)
My eval proposal (also submitted to OpenAI) compares the attack success rates of existing safety evals under traditional evaluation scaffolding vs production/human-like request patterns (i.e., preventing the model from detecting that it is being evaluated)
Rooted in evasion techniques for bot detection (from my work at Akamai)
I plan to follow this research with contributions to the inspec_ai framework to provide mitigations against sandbagging
-
Easy input guardrails with semantic embeddings on Cloudflare: llm-proxy.com (Oct 2023)
Tacit cyber risk detection in LLMs (mechanistic interpretability research project as part of BlueDot’s AI safety fundamentals course, also on LessWrong - June 2024)
Making LLM outputs detectable using watermarks based on text steganography (Aug 2024)
Scoping LLM applications / Hardening instruction hierarchy: state of the art, gaps, and alternatives (Apr 9, 2025)
OWASP: Securing Agentic Applications working group (2025)
Hybrid Organizations & Trust: Risk Analyses of Agentic Human-AI Work Hierarchies in the Global Economic Transition (in progress, April 2025)
-
Boston AI safety group: I host local meetups and build educational content
Next event: 4/27 @ Boston Public Library with the AI Safety Awareness Foundation)
Getting Started in AI Safety: Key readings (Feb 2025)
Trust Pressure Ratio: labeling commensurate safety testing for AI-powered applications
Defend Your Castle: prompt engineering for the blue team, learning limitations and mitigations
Guiding safety work for my current organization, Panorama Education