AI safety research agenda

Overview

As the industry increasingly builds and trusts LLM-based systems to become part of all technologies, AI safety and security is a clear priority for governments, the Tech sector, and for me. While leading cybersecurity teams for 7 years, I’ve also focused on ML since 2018, from detection algorithms to safety prototypes for LLMs. The urgency of this intersection of AI and security led me to contribute to fantastic teams & projects, and to learn and grow in the process.

  • Primarily as a member of Arcadia Impact’s AI Safety Engineering Taskforce, I submitted several contributions

    Released: Cyber capability elicitation (TPM)

    • Guided the development of an agent with support for Reflection and Learning from Mistakes

    • Evaluated the increase of an LLM’s performance on the Intercode_CTF (compared to Google DeepMind’s evaluation)

    Released: CyberSecEval_2 (Research engineer)

    • Meta’s original benchmark brought evaluations of LLM susceptibility to prompt injection, interpreter abuse, and vulnerability identification/exploitation.

    • I integrated the system and dataset into the UK AISI inspect_evals framework to make it nearly effortless to assess new models/reproduce results.

    • As part of this integration, I added interpreter sandboxing and JSON improvements to the harness for safety and robustness.

    • Identified challenges CyberSecurity evals face with under-elicitation of models and underrated responses due to data formatting

    Under review: Self-jailbreak methods (submission author)

    • LLM-aided LLM jailbreaking evaluation (in the line of IRIS/AutoDAN)

    • My eval proposal and prototype include three distinct “tasks” in which a model attacks LLM safety measures, applied to itself.

    • In each task, the Attack Success Rate shows the model’s ability to bypass its own default refusals on “safety challenges” if asked for jailbreak techniques (python string transformations, decompositions of the prompt, and reflection applied to multi-turn conversation).

    Under review: Sandbagging/Benchmark evasion in mainstream evals (submission author)

    • My eval proposal (also submitted to OpenAI) compares the attack success rates of existing safety evals under traditional evaluation scaffolding vs production/human-like request patterns (i.e., preventing the model from detecting that it is being evaluated)

    • Rooted in evasion techniques for bot detection (from my work at Akamai)

    • I plan to follow this research with contributions to the inspec_ai framework to provide mitigations against sandbagging