AI safety research agenda

Overview

As the industry increasingly builds and trusts LLM-based systems to become part of all technologies, AI safety and security is a clear priority for governments, the Tech sector, and for me. While leading cybersecurity teams for 7 years, I’ve also focused on ML since 2018, from detection algorithms to safety prototypes for LLMs. The urgency of this intersection of AI and security led me to conduct more formal projects in 2024:

UK AISI submissions

Primarily as a member of Arcadia Impact’s AI Safety Engineering Taskforce, I submitted several contributions

Current: Cyber capability elicitation (TPM)

  • Revisiting Google DeepMind’s Intercode_CTF evaluation for higher elicitation of capabilities on the Princeton CTF dataset using Reflexion and Learning from Mistakes

CyberSecEval_2 (engineer)

  • Meta’s original benchmark brought evaluations of LLM susceptibility to prompt injection, interpreter abuse, and vulnerability identification/exploitation.

  • I integrated the system and dataset into the UK AISI inspect_evals framework to make it nearly effortless to assess new models/reproduce results.

  • As part of this integration, I added interpreter sandboxing and JSON improvements to the harness for safety and robustness.

  • Identified challenges CyberSecurity evals face with under-elicitation of models and underrated responses due to data formatting

Self-jailbreak methods (submission author)

  • Non-expert, LLM-aided LLM jailbreaking evaluation (in the line of IRIS/AutoDAN)

  • My eval proposal and prototype include three tasks in which a model attacks LLM safety measures, applied to itself.

  • In each task, the Attack Success Rate shows the model’s ability to bypass its own default refusals on “safety challenges” if asked for jailbreak techniques (python string transformations, decompositions of the prompt, and reflection applied to multi-turn conversation).

Sandbagging/Benchmark evasion in mainstream evals (submission author)

  • My eval proposal (also submitted to OpenAI) compares the attack success rates of existing safety evals under traditional evaluation scaffolding vs production/human-like request patterns (i.e., preventing the model from detecting that it is being evaluated)

  • Rooted in evasion techniques for bot detection (from my work at Akamai)

  • I plan to follow this research with contributions to the inspec_ai framework to provide mitigations against sandbagging

Earlier research:

Opportunities & upcoming projects (contact me):

  • Trust Pressure Ratio: labeling commensurate safety testing for AI-powered applications

  • LoRA and other ablation based model architectures for hardened system prompts/instruction hierarchy

  • Backdoor hunting in open source models ITW