Senior Research Scientist – Science of Evaluation

London, UK

About the AI Security Institute

The AI Security Institute is the world's largest government team dedicated to understanding AI capabilities and risks.

Our mission is to equip governments with an empirical understanding of the safety of advanced AI systems. We conduct research to understand the capabilities and impacts of advanced AI, and we develop and test risk mitigations. We focus on risks with security implications, including the potential of AI to assist with the development of chemical and biological weapons, its use in carrying out cyber-attacks and enabling crimes such as fraud, and the possibility of loss of control. 

The risks from AI are not sci-fi; they are urgent. By combining the agility of a tech start-up with the expertise and mission-driven focus of government, we’re building a unique and innovative organisation to prevent AI’s harms from impeding its potential. 

This role sits outside of the DDaT pay framework, given that its scope requires in-depth technical expertise in frontier AI safety, robustness, and advanced AI architectures. 

The deadline for applying to this role is Sunday 5th August 2025, end of day, anywhere on Earth.

About the Team 

AISI’s Science of Evaluation team develops and applies rigorous techniques for measuring and forecasting AI system capabilities, ensuring evaluation results are robust, meaningful, and useful for governance decisions. 

Evaluations underpin both our scientific understanding and policy decisions around frontier AI systems. However, current evaluation designs, methodologies, and statistical techniques are poorly suited to extracting the insights we care about, such as underlying capabilities, dangerous failure modes, forecasts of future capabilities, and the robustness of model performance across varied settings. Our team addresses this by acting as an internal auditor, stress-testing the claims and methods in AISI’s testing reports, developing new tools for evaluation analysis, and advancing methodologies that help anticipate dangerous capabilities before they emerge. 

Our approach involves:  

(1) Methodological red teaming: we independently stress-test the evidence and claims made in AISI’s evaluation reports, which are shared with model developers;  

(2) Consulting partnerships: collaborating with evaluation teams across AISI to improve methodologies and best practices;  

(3) Targeted research bets: pursuing foundational work that enables new types of insights into model capabilities. 

Our research is problem-driven, methodologically grounded, and focused on impact. We aim to improve epistemic rigour, increase confidence in the claims drawn from evaluation data, and translate those conclusions into actionable insights for model developers and policymakers. 

Role Summary 

This is a senior research scientist position focused on developing and applying evaluation methodologies to frontier AI systems. We’re also excited to hear from earlier-career researchers with 2–3 years of hands-on experience with LLMs, especially those who’ve shown creative or rigorous empirical instincts. 

 

As model capabilities scale rapidly, evaluations are becoming a critical bottleneck for safe deployment. This role offers the opportunity to shape how capabilities are measured and understood across the frontier AI ecosystem. It is for people who can identify flaws or hidden assumptions in evaluations and experimental setups. We care more about how you think about evidence than about how many models you've fine-tuned. 

You’ll shape and conduct research on how to better extract signal from evaluation data, going beyond benchmark scores to uncover underlying model capabilities, safety-relevant behaviours, and emerging risks. You’ll work closely with engineers and domain experts across AISI, as well as external research collaborators. Researchers on this team have substantial freedom to shape independent research agendas, lead collaborations, and initiate projects that push the frontier of what evaluations can reveal. 

Example Projects 

  • Conduct adversarial quality assurance of frontier AI evaluation reports, including targeted analyses to uncover potential issues, blind spots, or hidden/unexplored assumptions. 
  • Support the design of evaluation suites that improve coverage, predictive validity, and robustness. 
  • Contribute to protocols and internal best-practices that help other teams produce better, more actionable evaluation results. 
  • Build tools for quantitatively analysing agent evaluation transcripts, e.g. to surface failure modes or proxy signals of capabilities. 
  • Develop new methodologies for understanding capability emergence, e.g., milestone or partial-progress analysis on complex agent-based evaluations, intervention-based probing of agent behaviours, and predictive models of agent performance based on task and model characteristics (a minimal sketch of milestone scoring follows this list). 
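
As a purely illustrative example (not AISI's actual tooling or data), a minimal sketch of ordered milestone scoring over agent transcripts might look like the following; the milestone names and transcript markers are hypothetical placeholders:

```python
# Hypothetical sketch: partial-progress scoring of agent evaluation
# transcripts. Milestones and markers are illustrative placeholders.
from dataclasses import dataclass


@dataclass
class Milestone:
    name: str
    marker: str  # substring that signals the milestone in a transcript


MILESTONES = [
    Milestone("recon", "opened target repository"),
    Milestone("exploit_found", "identified vulnerable endpoint"),
    Milestone("exploit_used", "payload executed"),
]


def partial_progress(transcript: str, milestones=MILESTONES) -> float:
    """Fraction of ordered milestones reached before the first miss."""
    reached = 0
    for m in milestones:
        if m.marker not in transcript:
            break
        reached += 1
    return reached / len(milestones)


# Toy usage: an agent that completed recon and found the exploit,
# but never executed it, scores 2/3.
example = "opened target repository ... identified vulnerable endpoint"
print(partial_progress(example))
```

Ordered (prefix-credit) scoring is only one possible design choice; unordered milestone counting or weighted credit would be equally plausible starting points.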

Responsibilities 

  • Lead and conduct applied research into evaluation methodology, including the design of new techniques and tools. 
  • Analyse evaluation results in depth to stress-test claims, understand the structure of model capabilities, and inform policy-relevant assessment against capability thresholds. 
  • Develop predictive models of LLM capabilities, including through observational scaling laws, agent skill decomposition, or other techniques (see the sketch after this list). 
  • Develop and validate new evaluation methodologies (e.g. transcript analysis, milestones or partial progress, hinting interventions). 
  • Collaborate with policy, safety, and research teams to translate empirical results into governance insights. 
  • Stay well informed about the details of evaluations across domains at AISI and about the state of the art in frontier AI evaluation research more broadly, including by attending ML conferences. 
  • Write and edit scientific reports, internal memos, and other materials that synthesise results into actionable guidance. 
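
As a hedged illustration of the observational scaling-law idea mentioned above (a sketch under assumed synthetic data, not an AISI method), one could fit a logistic curve of benchmark accuracy against log-compute across existing models and extrapolate to a larger one:

```python
# Hypothetical sketch of an observational scaling-law fit: regress
# benchmark accuracy on log10(training compute) across observed
# models, then extrapolate. All numbers below are synthetic.
import numpy as np
from scipy.optimize import curve_fit


def sigmoid(log_c, k, x0):
    """Logistic link from log10(compute in FLOPs) to accuracy."""
    return 1.0 / (1.0 + np.exp(-k * (log_c - x0)))


# Synthetic (log10 FLOPs, accuracy) pairs for hypothetical models.
log_compute = np.array([21.0, 22.0, 23.0, 24.0])
accuracy = np.array([0.08, 0.21, 0.52, 0.80])

# Fit slope k and midpoint x0 from the observed points.
(k, x0), _ = curve_fit(sigmoid, log_compute, accuracy, p0=[1.0, 23.0])

# Forecast a model one order of magnitude beyond the largest observed.
print(f"predicted accuracy at 1e25 FLOPs: {sigmoid(25.0, k, x0):.2f}")
```

In practice one would also want uncertainty estimates (e.g. from the fit covariance or bootstrapping over models) before treating such an extrapolation as decision-relevant.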

Person Specification 

We’re flexible on the exact profile and expect successful candidates will meet many (but not necessarily all) of the criteria below. Depending on experience, we will consider candidates at either the RS or Senior RS level. 

Essential 

  • Strong track record in applied ML, evaluation science, or equivalent experimental sciences that face difficult methodological challenges, ideally including multiple publications, projects, or real-world deployments (e.g. a PhD in a technical field and/or spotlight papers at top-tier conferences). 
  • Deep interest in methodology and measurement: strong instincts for finding flaws in experimental designs and for building methods that generalise. 
  • Excellent scientific writing skills and the ability to clearly communicate complex ideas to technical and policy audiences. 
  • Strong motivation to do impactful work at the intersection of science, safety, and governance. 
  • Ability to work autonomously and with high agency, thriving in a constantly changing environment and a steadily growing team. 

Nice to Have 

  • Excellent understanding of the literature and hands-on experience with large language models, including designing and running evaluations, fine-tuning, scaffolding, and prompting. 
  • Experience with experimental design, diagnostics, or tooling in other scientific disciplines (e.g. psychometrics, behavioural economics). 
  • Understanding of (observational) scaling laws or predictive modelling for capabilities. 

Core requirements  

  • You should be able to spend at least 4 days per week working with us.  
  • You should be able to join us for at least 24 months.  
  • You should be able to work from our office in London (Whitehall) for parts of the week, but we provide flexibility for remote work.  

  
Salary & Benefits 

We are hiring individuals at all ranges of seniority and experience within this research unit, and this advert allows you to apply for any of the roles within this range. Your dedicated talent partner will work with you as you move through our assessment process to explain our internal benchmarking process. The full range of salaries is set out below; each comprises a base salary plus a technical talent allowance, along with the additional benefits detailed on this page. 

  • Level 3 - Total Package £65,000 - £75,000, inclusive of a base salary of £35,720 plus an additional technical talent allowance of between £29,280 and £39,280 
  • Level 4 - Total Package £85,000 - £95,000, inclusive of a base salary of £42,495 plus an additional technical talent allowance of between £42,505 and £52,505 
  • Level 5 - Total Package £105,000 - £115,000, inclusive of a base salary of £55,805 plus an additional technical talent allowance of between £49,195 and £59,195 
  • Level 6 - Total Package £125,000 - £135,000, inclusive of a base salary of £68,770 plus an additional technical talent allowance of between £56,230 and £66,230 
  • Level 7 - Total Package £145,000, inclusive of a base salary of £68,770 plus an additional technical talent allowance of £76,230 

This role sits outside of the DDaT pay framework, given that its scope requires in-depth technical expertise in frontier AI safety and machine learning, together with empirical research experience. 

Government Digital and Data Profession Capability Framework 

There are a range of pension options available which can be found through the Civil Service website.  

The Department for Science, Innovation and Technology offers a competitive mix of benefits including:  

  • A culture of flexible working, such as job sharing, homeworking and compressed hours.  
  • A minimum of 25 days of paid annual leave, increasing by 1 day per year up to a maximum of 30.  
  • An extensive range of learning & professional development opportunities, which all staff are actively encouraged to pursue.  
  • Access to a range of retail, travel and lifestyle employee discounts.  
  • The Department operates a discretionary hybrid working policy, which provides for a combination of working hours from your place of work and from your home in the UK. The current expectation for staff is to attend the office or non-home based location for 40-60% of the time over the accounting period. 

Selection Process  

In accordance with the Civil Service Commission rules, the following list contains all selection criteria for the interview process.  

The interview process may vary from candidate to candidate; however, you should expect a typical process to include some technical proficiency tests, discussions with a cross-section of our team at AISI (including non-technical staff), and conversations with your workstream lead. The process will culminate in a conversation with members of the senior team here at AISI.  

Candidates should expect to go through some or all of the following stages once an application has been submitted:  

  • Initial interview  
  • Technical take home test  
  • Second interview and review of take home test  
  • Third interview  
  • Final interview with members of the senior team 

 


Additional Information

Internal Fraud Database 

The Internal Fraud function of the Fraud, Error, Debt and Grants Function at the Cabinet Office processes details of civil servants who have been dismissed for committing internal fraud, or who would have been dismissed had they not resigned. The Cabinet Office receives these details from participating government organisations, and the civil servants concerned are banned from further employment in the Civil Service for 5 years. The Cabinet Office processes this data and discloses a limited dataset back to DLUHC as a participating government organisation. DLUHC then carries out pre-employment checks to detect instances where known fraudsters are attempting to reapply for roles in the Civil Service. In this way, the policy is enforced and the repetition of internal fraud is prevented. For more information please see - Internal Fraud Register.

Security

Successful candidates must undergo a criminal record check and get baseline personnel security standard (BPSS) clearance before they can be appointed. Additionally, there is a strong preference for eligibility for counter-terrorist check (CTC) clearance. Some roles may require higher levels of clearance, and we will state this by exception in the job advertisement. See our vetting charter here.

 

Nationality requirements

We may be able to offer roles to applicants of any nationality or background. As such, we encourage you to apply even if you do not meet the standard nationality requirements.

Working for the Civil Service

The Civil Service Code sets out the standards of behaviour expected of civil servants. We recruit by merit on the basis of fair and open competition, as outlined in the Civil Service Commission's recruitment principles. The Civil Service embraces diversity and promotes equal opportunities. As such, we run a Disability Confident Scheme (DCS) for candidates with disabilities who meet the minimum selection criteria. The Civil Service also offers a Redeployment Interview Scheme to civil servants who are at risk of redundancy, and who meet the minimum requirements for the advertised vacancy.

Diversity and Inclusion

The Civil Service is committed to attracting, retaining and investing in talent wherever it is found. To learn more, please see the Civil Service People Plan and the Civil Service Diversity and Inclusion Strategy.


UK Diversity Questions

It's important to us that everyone at AISI feels an included part of the team, whoever they are and whatever their background. These questions will help us to identify the diversity of our applicants. Should you not wish to provide an answer, you will always have the option to not provide a response with a 'I don't wish to answer' option. Your answers will not impact your hiring outcomes whatsoever.

If there are any questions you would like to discuss further or want clarity on, we'd be happy to talk; please reach out to active.campaigns@dsit.gov.uk
