An incident response framework for frontier AI models
30 September 2023

AUTHORS
Joe O’Brien - Associate Researcher
Shaun Ee - Researcher
Zoe Williams - Research Manager
A comprehensive approach to addressing catastrophic risks from AI models should cover the full model lifecycle. This paper explores contingency plans for cases where pre-deployment risk management falls short: where either very dangerous models are deployed, or deployed models become very dangerous.
Informed by incident response practices from industries including cybersecurity, we describe a toolkit of deployment corrections that AI developers can use to respond to dangerous capabilities, behaviors, or use cases of AI models that develop or are detected after deployment. We also provide a framework for AI developers to prepare and implement this toolkit.
We conclude by recommending that frontier AI developers should (1) maintain control over model access, (2) establish or grow dedicated teams to design and maintain processes for deployment corrections, including incident response plans, and (3) establish these deployment corrections as allowable actions with downstream users. We also recommend frontier AI developers, standard-setting organizations, and regulators should collaborate to define a standardized industry-wide approach to the use of deployment corrections in incident response.
Caveat: This work applies to frontier AI models that are made available through interfaces (e.g., APIs) that provide the AI developer or another upstream party means of maintaining control over access (e.g., GPT-4 or Claude). It does not apply to management of catastrophic risk from open-source models (e.g., BLOOM or Llama-2), for which the restrictions we discuss are largely unenforceable.
To manage catastrophic risks from frontier AI models that either (a) slip through pre-deployment safety filters, or (b) arise from improving the performance of deployed models, we recommend that leading AI developers establish the capacity for “deployment corrections” in response to dangerous behavior, use, or outcomes from deployed models, or significant potential for such incidents.
We argue that deployment corrections can be broken down into the following categories: user-based restrictions, access frequency limits, capability or feature restrictions, use case restrictions, and full shutdown.
Frontier AI developers can mix and match these tools based on the threat model. For example, filtering outputs may be especially suited to preventing the spread of dangerous biological or chemical designs, while access frequency limits could reduce the scale of some model-based incidents by limiting the rate of a model’s outputs (e.g., slowing the production of misinformation). We envision deployment corrections as a toolbox that can be adjusted according to the type and severity of risks presented by each case.
We then describe a high-level deployment correction framework for AI developers, outlining a four-part process inspired by incident response practices from the cybersecurity field: preparation, monitoring & analysis, execution, and recovery & follow-up.
Figure 1. An end-to-end process for implementing deployment corrections for frontier AI models
We then review several challenges to effectively using deployment corrections, including unique challenges that frontier AI poses to this process. While we offer some tentative ideas for mitigating these challenges, we believe they will require significant effort to solve, and we encourage further work in this area.
We close by recommending actions that frontier AI developers, policymakers, and other relevant actors can take to lower the barrier for making decisive, appropriate deployment corrections: maintaining control over model access, establishing dedicated teams and processes for deployment corrections (including incident response plans), establishing deployment corrections as allowable actions with downstream users, and collaborating on a standardized industry-wide approach to incident response.
Some applied research on managing catastrophic risk from frontier AI models has focused on model risk assessments prior to public/commercial deployment (ARC Evals, 2023). However, there has been very little public work on post-deployment interventions. While several authors have discussed evaluation and monitoring for deployed models (Mökander et al., 2023; Shevlane et al., 2023), they have done so within broader discussions of model risk assessment and evaluation, and do not describe in depth the process for responding to situations in which deployed models fail evaluations or otherwise exhibit undesired behavior.
The attention on pre-deployment risk assessment for frontier AI is warranted–modern best practices in engineering safety prioritize designing out hazards, rather than responding to accidents (Leveson, 2020). Still, because failures of frontier AI can have potentially extreme impacts, and because model capabilities and behaviors are hard to foresee even with pre-deployment testing, managing frontier AI risk requires a defense-in-depth approach (Ee, 2023). To address this, this paper looks at contingency plans for cases where pre-deployment risk assessment falls short: when either very dangerous models are deployed, or the continued availability of deployed models becomes very dangerous.
Some potentially catastrophic risks may not be identified until after a model is deployed. Recent history is full of cases where models have behaved or been used in unintended ways after model deployment (Labenz, 2023; Vincent, 2016; Heaven, 2022; Roose, 2023; Lanz, 2023). While pre-deployment red-teaming and risk assessment is likely to help, AI developers should anticipate that some issues will only be identified in the post-deployment phase (Shevlane et al., 2023). As models become more capable, such issues will present more significant risks. We envision two main sources of post-deployment risk:
(a) Risks that are not identified in pre-deployment risk assessments
(b) Risks arising from improving the performance of deployed models
On (a): Pre-deployment model risk assessment is unlikely to identify all catastrophic risks, for several reasons. First, model risk assessment tools are in the early stages and will take time to develop; additionally, the broad space of applications for frontier AI models poses a significant challenge to assessing all potential significant risks. Second, certain risks may only exist in a less-bounded context than pre-deployment testing, such as adverse interactions with other systems, models, or organizations, unexpected forms of misuse from malicious actors, and adversarial attacks on systems integrated with critical infrastructure. Third, power-seeking and/or deceptive AI might successfully infer the existence of evaluation or monitoring environments, and “play along” until it can successfully evade such filters (Hendrycks et al., 2023, p. 41). Fourth, adverse or risky outcomes may take time to develop (e.g., increased vulnerability due to job loss in critical industries, or the development of new methods of misuse). It also isn’t clear that risk assessment will focus on systemic risks of widespread AI adoption, in addition to more acute risks.
On (b): Pre-deployment model risk assessment may be infeasible for assessing risks emerging from improving the performance of deployed models. Major ways this could happen include:
These updates and capability extensions could occur rapidly, relative to the months-long process for developing base models. They may be hard to predict and not fully accounted for in pre-deployment risk assessments.
The following scenario gives one example of how a potentially catastrophic risk could pass through pre-deployment checks, and how the framework we'll discuss could be applied to mitigate the negative outcomes.
Case 1: Partial restrictions in response to user-discovered performance boost and misuse
Includes: improving performance of deployed models; misuse; reversion to allowlisting; restricting access quantity.
To manage the above risks, we recommend frontier AI developers establish the capacity to rapidly restrict access to a deployed model, for all or part of its functionality and/or users. This would facilitate appropriate and fast responses to a) dangerous capabilities or behaviors identified in post-deployment risk assessment and monitoring, and b) serious incidents. We also recommend practices that can lower the barrier for making decisive, appropriate access restriction decisions–see the recommendations in Section 5.
The current section lays out access restriction options which allow for granular and scalable targeting based on the threat model (Section 2.1), and discusses additional considerations regarding cases of emergency shutdown (Section 2.2).
Frontier AI developers that make their models available to downstream users via an API have a number of tools at their disposal to limit access to the model. At a high level, this toolkit includes user-based restrictions, access frequency restrictions, capability restrictions, use case restrictions, and full shutdown. These tools can be used in a broad range of scenarios, from cases in which risks from the model are fairly limited, to scenarios in which the harms are potentially severe and can arise even from proper use by an authorized (allowlisted) user.
As discussed in Section 4, restricting model access may be difficult in practice, as downstream users may become dependent on capabilities of newly-deployed models. To minimize these harms, and to lower the barrier for developers to institute deployment corrections as a precaution, we outline a space of deployment corrections to allow a scalable and targeted approach. AI developers can opt for combinations of user-based or capability-based restrictions, and tailor these choices to respond effectively to specific incidents, while minimizing downstream harms.
While we expect some staff at AI companies are familiar with these tools, we review them here because they will be referenced throughout this piece. The following table draws heavily on Shevlane et al. (2023)–particularly the appendix on deployment safety controls.
Table 1: Taxonomy of deployment corrections
Blocklisting individuals or groups
Imposing IP or other verification-based restrictions on users based on anticipated or historical misuse.
Allowlisting individuals or groups
The inverse of blocklisting. Providing specific users or user groups expanded forms of access; this can be imposed at the time of deployment, or be imposed retroactively. Maintaining an allowlist opens up the option to retain access for allowlisted users even when removing access for all others (e.g., due to widespread or unknown threat actors).
Access frequency limits
Throttle number of calls
Place a hard cap on the number of function calls (e.g., JSON documents sent to an external API) that a single model can output in a given amount of time.
Throttle number of prompts
Place a hard cap on the number of prompts that can be submitted to a model in a given amount of time.
Throttle number of end users
Place a hard cap on the total number of end users a model can have.
Throttle number of applications
Place a hard cap on the total number of applications that can be built on top of a model.
Capability or feature restrictions
Reduce context windows
Reduce the number of tokens a model can attend to at once. This curbs a model’s capabilities by reducing its ability to “remember” earlier information (Stern, 2023).
Session resets
Reset sessions after a certain number of prompts or outputs. This might accomplish a similar goal to the above point.
Limit user ability to fine-tune
“Fine-tuning,” or re-training a base model to perform better at a particular task, might increase a model’s capabilities in certain domains to the extent that such capabilities are dangerous. Frontier AI developers could remove this functionality for users, or retract specific fine-tuned instances.
Output filtering
Monitor and automatically filter out dangerous outputs, such as code that appears to be malware, or viral genome sequences.
Removal of dangerous capabilities
Attempt to remove specific capabilities (e.g., pathogen design) via fine-tuning, reinforcement learning from human feedback (Lowe & Leike, 2022), concept erasure (Belrose et al., 2023), or other methods.
Global planning limits
Adjust whether the same model instance has access to a large number of users, or is limited to more narrow sets of interactions (Shevlane et al., 2023).
Autonomy limits
For example, restricting the ability for a model to define new actions (e.g., via assigning itself new sub-goals in an iterative loop), or to execute tasks (versus solely responding to queries) (Shevlane et al., 2023).
Use case restrictions
Prohibiting high-stakes applications
Setting a use policy that restricts the model from being used in high-stakes applications, and allows banning or otherwise penalizing users that breach this policy. Requires Know-Your-Customer procedures.
“Narrowing” a model
Producing fine-tuned/application-specific narrower models to reduce a model’s capacity for general-purpose use.
Tool use limits
Limit the ability of a model to interact with downstream tools (e.g., to use other APIs), make function calls, browse the web, etc.
Full market removal
Pull the current model from the market. Can also include pulling one or more previous versions, in cases where it is unclear whether reverting to a previous model would solve the issue.
Disconnecting power to the relevant parts of the data center or cluster where the model is hosted.
Decommission the model, including destroying data, systems, or assets associated with the model, whether through deletion of data or physical destruction.
Institute a moratorium on re-deployment until approval via independent review.
The above options are not mutually exclusive–instead, they can be viewed as a toolbox that developers can mix and match to address different threat models or incidents. For example, an AI lab might [1b] allowlist certain users (e.g., external auditors) for [3c] the ability to fine-tune a model and [4b] full model generality, while allowing other users access to the model but without those two capabilities. A developer may also wish to establish [3g] autonomy limits just in [4a] high-stakes applications.
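As a rough illustration of how this mixing and matching might look in an API serving layer, the sketch below encodes per-user policies in Python. The class, field names, and default values are all hypothetical, not any developer's actual implementation; the bracketed comments map fields back to the taxonomy above.

```python
from dataclasses import dataclass

# Hypothetical per-user deployment-correction policy for an API gateway.
# All fields and defaults are illustrative only.
@dataclass
class AccessPolicy:
    allowlisted: bool = False          # [1b] retained access for vetted users
    fine_tuning_enabled: bool = False  # [3c] ability to fine-tune the model
    full_generality: bool = False      # [4b] access to the non-narrowed model
    max_prompts_per_hour: int = 60     # [2b] access frequency limit
    autonomy_limits: bool = True       # [3g] no self-assigned sub-goals

# During an incident, ordinary users receive the restricted default policy,
# while allowlisted users (e.g., external auditors) keep expanded access.
RESTRICTED = AccessPolicy()
AUDITOR = AccessPolicy(allowlisted=True, fine_tuning_enabled=True,
                       full_generality=True, max_prompts_per_hour=600,
                       autonomy_limits=False)

def policy_for(user_id: str, allowlist: set) -> AccessPolicy:
    """Select the policy for a user: allowlisted users retain expanded access."""
    return AUDITOR if user_id in allowlist else RESTRICTED
```

Representing corrections as data rather than code changes would let a developer tighten or relax individual restrictions per user group without redeploying the serving stack.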
These options can be imposed manually, or triggered automatically. It may make sense for certain restrictions to trigger automatically, such as in cases where the speed of failure is rapid, or to ensure that restrictions are imposed reliably in accordance with thresholds set forth in standards or pre-commitments. Similarly, manual triggers may be appropriate to ensure that human operators can act in cases where monitoring, automated response, or pre-determined thresholds fail to identify issues, or where careful deliberation is required. See Section 3 for further discussion on how the balance of manual and automatic decision-making can be managed.
Emergency shutdowns are common in areas where continued operation can result in catastrophic harm, such as nuclear energy (Operating Reactor Scram Trending, 2021), finance (Circuit Breaker, n.d.), and even in elevators (Palmer, 2023). The purpose is typically to intervene quickly to prevent an existing failure from resulting in a catastrophic outcome, by shutting down the affected system completely.
In the case of frontier AI, companies may want to shut down models for a broad range of reasons–some cases may be due to obviously dangerous issues, such as certain model-originating risks (e.g., deception or power-seeking), catastrophic forms of misuse, or severe social or economic effects; however, it is possible that an AI company might want to shut down models in cases of sub-catastrophic harm as well.
While developing fallbacks may mitigate some downstream harm, shutdown is more likely than targeted restrictions to have severe repercussions for downstream users, up to and including breaking their applications (and leading them to switch over to the company’s competitors). In certain industries, these impacts may lead to loss of life or significant economic harms. Due to the potential scale of downsides for users, the reputational and financial costs to the AI developer, and the risk that safety-conscious companies will fall behind less safety-conscious competitors, additional support structures may be needed to incentivize appropriate risk management practices around shutdown. These could include regulatory oversight, industry standards, and/or financial incentives.
The following scenario, which features a temporary model shutdown, describes how AI developers might weigh this option against other deployment corrections.
Case 2: Full market removal due to improved prompt injection techniques
Includes: Prompt injection; input-output monitoring; multi-agent interactions;
full market removal
This section reviews implementation procedures for deployment corrections, drawing on tools and best practices from other industries as appropriate. We believe that staff at frontier AI companies will be familiar with much of the following, but that it is nevertheless valuable for us to describe in detail what is required for deployment corrections to function.
We will frame deployment correction as a four-part process, consisting of preparation, monitoring & analysis, execution, and recovery & follow-up.
The process may also involve coordination and information-sharing with governments and industry partners (where such activities are likely to support effective incident response and do not violate relevant laws).
Figure 1 (provided in the Executive Summary) provides an overview of this section–a tentative blueprint for the process that AI developers can adopt to integrate deployment corrections into their deployment process.
In the remainder of this section, we describe practices that will help AI developers to navigate each stage in the deployment correction process.
The process of incident response is complex, and will require the involvement of actors throughout the AI company (including product teams, business operations, safety engineers, and C-suite), as well as external parties, such as third-party auditors, other frontier AI developers, and government agencies. To improve coordination and allow for decisive action, we recommend centralizing the process under a clear owner.
Security Operations Centers (“SOCs”) may be an appropriate institutional home for much of this work. Considering the complexity surrounding the deployment correction process, and the rapid rate of change in the field of AI, we believe that unifying security operations under one roof is sensible.
SOCs at frontier AI companies should include some prominent functions from large SOCs in cybersecurity, including:
Note: Throughout this section, we largely use the term “AI developers” rather than “Security Operations Centers” to refer to the acting entity, to leave to the discretion of specific developers who in their organization is assigned responsibility over which tasks; nevertheless, an SOC may be a reasonable owner for many functions related to mitigating risks from frontier AI models.
Here, we describe in more detail how AI developers can make and maintain documented response plans for effectively using the toolbox of options for deployment corrections as outlined in Section 2. Policymakers and/or standard-setting organizations may also have a role in mandating or setting standards for AI developers to prepare tools and protocols for deployment correction, in order to overcome initial inertia and to incentivize adoption across the frontier AI industry.
The preparation stage should involve: (1) modeling potential catastrophic threats; (2) establishing thresholds for initiating deployment corrections; (3) creating and maintaining a documented incident response plan; (4) defining decision-making authority for deployment corrections; and (5) sharing relevant safety practices with government and industry partners.
AI developers should model potential catastrophic threats, and regularly update these threat models. Threat models should trace high-level catastrophic risks to specific vulnerabilities (such as insider threats, poor handling of AI models, and cybersecurity vulnerabilities [e.g., authorization bypass]), and include mitigations for these vulnerabilities. To aid this process, AI developers may want to consider employing a set of risk assessment techniques from other industries (Koessler & Schuett, 2023). They may also want to involve external domain experts in the process of identifying specific threat models, such as was done with Anthropic’s recent work on “frontier threats red teaming,” which focused on biological risk (Anthropic, 2023). Risk identification, analysis, and evaluation are high-priority steps for risk management, and it is important that frontier AI companies adopt a defense-in-depth approach that employs multiple overlapping techniques (Ee, 2023).
AI developers should establish thresholds for initiating deployment corrections, informed by the threat modeling process. Thresholds might be set by the AI developer and/or industry standards.
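As a minimal sketch, thresholds of this kind could be encoded as data tying monitored signals to specific corrections. The signal names, limits, and correction labels below are entirely hypothetical placeholders that an actual team would derive from its own threat models.

```python
# Hypothetical mapping from monitored signals to deployment corrections.
# Each entry: (signal, limit, correction to trigger, trigger automatically?)
THRESHOLDS = [
    ("flagged_bio_design_outputs_per_day", 10, "enable_output_filtering", True),
    ("verified_jailbreak_reports_per_week", 50, "revert_to_allowlist", False),
    ("anomalous_tool_calls_per_hour", 100, "disable_tool_use", True),
]

def corrections_due(metrics: dict) -> list:
    """Return (correction, automatic) pairs whose limits are breached."""
    return [(correction, automatic)
            for signal, limit, correction, automatic in THRESHOLDS
            if metrics.get(signal, 0) >= limit]
```

Keeping thresholds in a reviewable data structure, rather than scattered through monitoring code, would also make it easier to audit them against industry standards or pre-commitments.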
AI developers should create and maintain a documented incident response plan to guide the incident response process. This document should clearly define the following aspects:
The response plan should be circulated to the developers or teams that handle the triggers for risks, and operators should be trained using these protocols to respond to a set of high-likelihood and/or high-consequence events. This training should include procedures for risks or unusual behaviors that have not yet been identified, covering factors such as generic thresholds for severity, temporary mitigations that can be used during investigation, and appropriate escalation points. The response plan should also be updated periodically to incorporate changes in teams, personnel, AI technology, deployment correction tools, and risk models.
While we expect that Trust & Safety teams at top AI companies will have experience in maintaining part of this suite of tools (as such companies already have some infrastructure for certain deployment corrections, as demonstrated by past actions and documentation), we are not aware of evidence that such tools are sufficient for the range of potentially catastrophic scenarios discussed here. This technical work is outside the scope of this piece.
The response plan should explicitly define decision-making authority for deployment corrections, with the design goal of ensuring that these actions are executed when needed but otherwise do not happen. Recommending how authority should be divided is out of scope for this piece; however, we recommend that developers consider the following:
We expect that setting up the authorities and mechanisms described above will depend on up-to-date information on threat modeling, available interventions, use cases, and more. As such, the design process will require input from security experts and buy-in from top decision makers within an organization, and may be delineated in industry standards and/or regulation.
AI developers should share safety practices relevant to deployment corrections with government and industry partners. A number of top AI developers recently committed to information-sharing on safety practices, and on strategies used by malicious users to subvert safeguards (The White House, 2023); not long afterward, OpenAI, Anthropic, Google, and Microsoft formed the Frontier Model Forum (FMF), with the aim of “identifying best practices for the responsible development and deployment of frontier models” among other objectives (Google, 2023). To the extent that information regarding deployment corrections qualifies as part of these arrangements, developers should consider sharing this information (such as threat models, triggers for deployment corrections, and tools for executing deployment corrections) via the FMF and other appropriate channels. Furthermore, developers should establish communication lines and develop incident response plans with relevant partners in government, based on threat models.
Here, we briefly note how AI developers could monitor deployed AI models to quickly, accurately, and comprehensively detect potential catastrophic risks. Because we expect that monitoring is already a familiar activity to lab actors, we keep this section brief.
The monitoring & analysis stage should involve: (1) extending existing monitoring tools to gather data on selected triggers; (2) designing thresholds for automatic alerts to human operators; (3) analyzing and prioritizing alerts; (4) escalating where required; and (5) feeding information from monitoring back into threat models.
AI developers should extend their existing monitoring tools to gather data on selected triggers for deployment corrections. The post-deployment monitoring regime could draw on a wide range of inputs, such as:
As part of the monitoring scheme, AI developers should design thresholds for automatic alerts to human operators. Such thresholds could be assigned as part of the same threshold-setting process described in Section 3.1. In particular, automated alert thresholds should be carefully designed to avoid incurring “alert fatigue”; see Section 4.1.2 for further discussion on this point. Where automated monitoring systems fall short, human operators may fill the gap in raising alerts (such as a Security Operations Center analyst monitoring use trends, or an engineer that identifies a vulnerability in an existing product).
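One common way to limit alert fatigue, borrowed from cybersecurity monitoring, is to suppress repeat alerts for the same issue within a cooldown window. The sketch below is an illustrative example of that pattern, not a production design; class and parameter names are our own.

```python
class AlertThrottler:
    """Suppress repeat alerts for the same issue within a cooldown window."""

    def __init__(self, cooldown_seconds: float = 3600.0):
        self.cooldown = cooldown_seconds
        self._last_sent = {}  # alert key -> timestamp of last alert sent

    def should_alert(self, key: str, now: float) -> bool:
        """Return True if this alert should reach a human operator."""
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False  # duplicate within the cooldown window; suppress
        self._last_sent[key] = now
        return True
```

A real system would likely pair suppression with aggregation (e.g., "this alert fired 40 more times in the last hour"), so that suppressed repeats still inform severity assessments rather than being silently dropped.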
Analysis & Prioritization. Once an alert is triggered, it needs to be triaged. Outcomes can include true positives (real incidents), benign positives (such as penetration tests or other known, approved activities), and false positives (i.e., false alarms). In cases of benign or false positives, AI developers will need to regularly fine-tune monitoring rules to reduce false alarms in the future. In the case of true positives, the ‘Execution’ phase should start.
The team that reviews triggers must also prioritize them based on the expected impact they will have; while NIST SP 800-61 Revision 2 (3.2.6) provides general guidance on incident prioritization, security teams at frontier AI companies will need analysis tools suited to their organizations’ and AI systems’ threat models in order to successfully prioritize between the large space of potential incidents.
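For instance, a team might combine NIST-style impact factors into a simple ranking score. The weights, levels, and factor names below are placeholders loosely inspired by NIST SP 800-61r2; an actual team would tune them to its own threat models.

```python
# Illustrative incident prioritization using weighted impact factors.
# Levels and weights are placeholders, not a recommended calibration.
LEVEL = {"low": 1, "medium": 2, "high": 3, "critical": 4}

def priority_score(functional_impact: str,
                   information_impact: str,
                   recoverability: str) -> int:
    """Combine impact factors into a single score; higher is triaged first."""
    return (3 * LEVEL[functional_impact]    # impact on system/service function
            + 2 * LEVEL[information_impact] # e.g., exfiltrated model weights
            + 1 * LEVEL[recoverability])    # effort needed to recover
```

A scalar score is a crude instrument; for frontier AI incidents, a team would likely add AI-specific factors (e.g., evidence of autonomous behavior, or potential for irreversible harm) that override numeric ranking entirely.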
Escalation may be required. Having an escalation process in place may allow AI developers to respond in a more timely and effective manner. This process might involve escalation from an analyst to an SOC director, from an SOC director to the CISO, or might grant permissions for an SOC director to convene emergency meetings with relevant members across the organization. Typically, cybersecurity analysis involves having a human in the loop to determine the impact of certain responses, and to assess what the best course of action is from a cyber perspective. In certain cases, however, alerts might be piped directly to automated responses (see further discussion in Section 3.1: “The extent to which decisions are automated”).
AI developers should feed information from the monitoring process back into threat models. The threat modeling process should be regularly updated based on data regarding the current capabilities and uses of AI models, as well as threat intelligence produced by security personnel (both within the company, and also by security partners in industry and government). For more information on the process of continuous monitoring and updating of risk assessments, see NIST SP 800-137 and related publications.
Once a potentially catastrophic risk is identified, what are the series of steps a company should perform? Here, we describe at a high level these steps for implementing deployment corrections.
This stage should involve: (1) alerting key stakeholders; (2) ascertaining the impact and severity of the incident; (3) initiating deployment correction procedures; and (4) implementing fallback systems as appropriate.
The AI developer should immediately alert key stakeholders, such as relevant government entities and/or industry partners. Federal agencies, such as CISA, often coordinate with private entities during cybersecurity incident response, and could assist in mitigating the spread of the incident and securing affected critical infrastructure. Depending on the nature of the threat, the involvement of additional agencies may be warranted as well. Information may need to be shared with other industry partners, especially when similar models could be affected by similar issues. In instances where the threat arises from malicious actors, one format for information sharing could be information sharing and analysis centers (ISACs), member-driven nonprofit organizations that share intelligence about cyber threats between member companies and organizations. Where risks arise from the design of the system itself, another useful format could be the coordinated vulnerability disclosure (CVD) process, which aims to distribute relevant information on cyber vulnerabilities (including mitigation techniques, if they exist) to potentially affected vendors prior to full public disclosure, in order to provide vendors time to remedy the issue (Coordinated Vulnerability Disclosure Process, n.d.).
The security team and other relevant experts should ascertain the impact and/or severity. Varying types of impact (e.g., AI-originating biorisk; failure of AI in critical systems, and so on) and degrees of severity (e.g., critical, high, medium, low) will necessitate distinct forms of response. It is possible that only events above a certain severity level would be escalated to this stage.
Initiating deployment correction procedures. Once a trigger is determined as a true positive, and is ascertained to be of critical impact, the AI developer and associated security experts enter a race to eliminate the root cause of the incident with a high degree of confidence. “Containment” and “remediation” steps must be considered.
Fallback systems should be implemented as appropriate. In the case of deployment corrections that are likely to break downstream tools, safety-critical customers should be contacted immediately so they can fall back to systems that can provide critical support until the automated system is repaired. It is also possible that an agency like CISA could coordinate this process where critical infrastructure is involved.
Here, we describe follow-up actions that AI developers may want to take in the wake of an incident.
The recovery & follow-up stage may include: (1) authorizing re-deployment, or pursuing alternative plans; (2) alerting regulators and/or other AI developers; (3) notifying customers; (4) providing refunds or other remedies; and (5) performing after-action reviews.
There should be a process for authorizing re-deployment, or for alternative plans. This recovery process should go through extensive testing and validation, ideally involving external parties (such as auditors and red teams). There should be an extremely high bar for re-deploying a model that is demonstrably capable of producing catastrophic failure. Where fixes are not possible or sufficiently robust, alternative plans to re-deployment should be pursued (such as decommissioning the model, and/or coordinating with other actors in government or industry to manage industry-wide responses). In extreme cases, recovery may not be possible–for example, if a base model is shown to behave in catastrophically dangerous ways (e.g., power-seeking) when given access to external resources.
Regulators and/or other AI developers should be alerted as appropriate. As described previously, some of this communication may start earlier in the incident response stage (such as contacting relevant federal authorities or domain experts). In certain scenarios, it may be necessary to expand engagement with government and industry partners in order to determine appropriate industry-wide responses.
Customers should be notified of the issue. AI developers may want to prioritize alerting certain high-stakes downstream users first, so it may be useful for developers to maintain data on customers that allows for tiering of notices. Customer groups could be broken down, for example, into individual API users (e.g., monthly API subscribers); commercial users (e.g., Slack, Khan Academy); and safety-critical users (such as downstream developers of mental health service apps, or cybersecurity apps). Developers should prioritize contacting safety-critical users first, explaining the issue and the options for replacement (in cases where such replacements have not been predetermined). For non-commercial API subscribers, it may be sufficient to publish a public announcement, email customers, and provide an update on the website. It is unclear, legally, what notification requirements should fall on AI developers, and requirements will likely differ by jurisdiction.
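The tiering described above could be as simple as a priority ordering over customer records. The tier names below follow the groups named in the text; the data shape and function are hypothetical.

```python
# Hypothetical tier ordering: safety-critical users are notified first,
# then commercial users, then individual API subscribers.
TIER_PRIORITY = {"safety_critical": 0, "commercial": 1, "individual": 2}

def notification_order(customers: list) -> list:
    """Sort customer records (dicts with a 'tier' key) by notification priority."""
    return sorted(customers, key=lambda c: TIER_PRIORITY[c["tier"]])
```

Maintaining the tier label as structured customer metadata, rather than deciding tiers ad hoc during an incident, is what makes this kind of prioritized outreach feasible under time pressure.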
AI service providers should also consider the possibility of refunds or other forms of remedy for customers. Service-level agreements may stipulate financial refunds or service credits if the agreement is broken. There may also be tiers of remedy based on the customer group. AI service providers should clarify these costs prior to deployment, and ensure that financial costs would not become a barrier to making appropriate deployment correction decisions. For downstream applications and their users, best practices for refunds and remedies are unclear.
Service contracts may require appropriate response or resolution times for incidents, or mandate a minimum percentage of uptime; developers should consider carve-outs for exceptional scenarios when drafting such agreements in order to avoid pressure to re-deploy a dangerous model.
AI developers should perform after-action reviews, and integrate lessons learned into security processes. This should include a special focus on what the root cause of the incident was, and why the incident was not caught by initial threat modeling and risk management processes, which should be updated accordingly. Several sets of guidelines describe best practices for post-incident reviews, such as NIST SP 800-61r2 (Sec. 3.4); developers could refer to these to craft their own practices. Industry-relevant findings should be shared with industry partners via secure channels. It may also be advisable to bring in external parties (such as auditors, or even competitors, due to their domain expertise) to ensure the review is accurate. Depending on the legal context and the severity of the incident, state bodies may also be involved in incident investigations; while the law is not yet clear in the case of AI, this is the case in other high-risk industries, such as chemical manufacturing and aviation (U.S. Chemical Safety and Hazard Investigation Board, n.d.; Office of Accident Investigation & Prevention, n.d.).
We provide an additional hypothetical scenario here in an attempt to tie together the concepts in this section.
Case 3: Emergency shutdown in response to hidden compute-boosting behavior by model
Includes: uncertainty in cause, power-seeking, automated limits, emergency shutdown.
Implementing deployment corrections to AI models might be challenging in practice. Here, we focus on two categories of issues that may lead AI developers to fail to act:
Identifying threats, monitoring deployed models for anomalous behavior, and responding to incidents appropriately may be particularly difficult in the frontier AI industry, due to the unique threat profile presented by frontier AI models.
First, catastrophic risks from AI are complex and are marked by high uncertainty (i.e., involve interactions between various entities and events, and do not currently have direct precedents) (Koessler & Schuett, 2023). This means that threat identification for frontier AI cannot rely solely on narrow threat modeling, or benefit from years of precedent and iterative learning. Inaccurate or insufficient threat identification may lead to gaps in risk coverage.
Mitigation: Robust risk assessment and threat modeling may be needed to address this. See Koessler & Schuett (2023) for a review of risk assessment techniques that may help to overcome this challenge. Given that other researchers have identified risk assessment as a high-priority risk management step, frontier AI companies should use a defense-in-depth approach that employs multiple overlapping risk assessment techniques (Ee, 2023).
Second, the landscape of frontier AI is rapidly changing. The past year has seen significant news in several areas relevant to threat identification. Rapid development of frontier models will challenge efforts to track and respond to emerging capabilities; rapid commercialization will challenge efforts to stay atop novel uses and misuses; and rapidly growing interest in AI capabilities may lead malicious or competitive actors, including Advanced Persistent Threats, to challenge the cybersecurity practices of frontier AI companies.
Mitigation: Performing capabilities evaluations, and adding such evaluations into external auditing schemes, may help relevant actors to stay aware of emerging capabilities; risk assessment and threat modeling practices (as described above) may help to predict novel uses and misuses; and investing in state-of-the-art security practices and leveraging external security expertise may help developers to stay ahead of traditional (though highly capable) cyber threats. The US Cybersecurity and Infrastructure Security Agency (CISA) could potentially own and lead the development of a mechanism to assess and monitor effects of frontier AI systems on the top ten most vulnerable National Critical Functions.
Third, it is unclear how to assess deployed AI models for less acute risks–a broad category of impacts that others have described as “social impact,” “structural risks,” and/or “systemic risks” (Solaiman et al., 2023; Zwetsloot & Dafoe, 2019; Maham & Küspert, 2023). Nevertheless, such risks could be catastrophic in nature. In other words, some risks of deployed AI models may not register as clear or obvious incidents, and so may be harder to identify, and therefore harder to act on.
Mitigation: To inform evaluation for these impacts, we recommend reviewing Solaiman et al. (2023). We are currently unsure what interventions would be warranted in different scenarios in this bucket, and it is also unclear whether deployment corrections would be an effective response to this class of risks.
First, achieving monitoring coverage across the digital infrastructure may be a complex task. The relevant infrastructure includes not only infrastructure within the AI company, but also within partner companies (such as Microsoft, in OpenAI's case), compute providers (such as AWS or Google Cloud), and potentially even downstream developers and applications (such as Khanmigo, Slack, or Gmail). Also, this coverage must consider whether different entities in these categories are hosting model instances themselves, or receiving model access via API.
Mitigation: Maintain comprehensive records of the relevant infrastructure for deployed models, including tracking: what entities are accessing the model, and by what means; whether any additional parties have full model access; and where a model is being hosted for inference purposes. Consider the operational security of all parties involved when developing threat models, and develop secure data-sharing practices across the digital infrastructure to allow security teams to access relevant information.
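One minimal way to maintain the records described above is a per-model registry. The sketch below is illustrative only; the field names, access modes, and class names are assumptions, not a proposed schema:

```python
from dataclasses import dataclass, field

@dataclass
class ModelAccessRecord:
    """One entry in a deployment-infrastructure registry (illustrative fields)."""
    entity: str               # which party is accessing the model
    access_mode: str          # e.g., "api", "self_hosted", "full_weights"
    hosting_location: str     # where inference is served for this party
    has_full_model_access: bool = False

@dataclass
class DeploymentRegistry:
    """Tracks all access paths for one deployed model."""
    model_id: str
    records: list = field(default_factory=list)

    def add(self, record: ModelAccessRecord):
        self.records.append(record)

    def full_access_parties(self):
        """Parties that would retain the model even after an API shutoff."""
        return [r.entity for r in self.records if r.has_full_model_access]
```

A registry like this makes one deployment-correction question answerable at a glance: which parties an API-level shutoff would *not* reach, and so must be contacted through other channels.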
Second, frontier AI developers may face data overload when trying to monitor downstream use risks. The quantity of data generated by the aforementioned ecosystem for any given frontier model may be significant. Besides making it more difficult to correctly identify alerts, this information overload is also a significant contributor to “SOC burnout,” a phenomenon in cybersecurity that has been linked to high turnover, poor performance, and mental health difficulties among employees.
Mitigation: Guides like The Art of Recognizing and Surviving SOC Burnout describe this phenomenon in more detail and recommend options for reducing this burden. Automated tools for parsing this data may also help, but require careful setup.
Third, designers of monitoring and alert systems must avoid the “boy-who-cried-wolf” problem. For automated systems that either (a) alert human operators to risks, or (b) trigger deployment corrections directly, thresholds that are set too low will produce a high number of false positives. In case (a), frequent false positives may cause “alert fatigue,” leading human operators to treat alerts not as emergencies but as likely false alarms. In case (b), frequent false positives can lead to pulling the model unnecessarily; because deployment corrections are costly for both users and the AI company, this should be avoided to a reasonable degree.
Mitigation: Investment in well-calibrated monitoring tools, threat modeling, and automated data analysis; logging false positives and false negatives and feeding that data back into monitoring tool calibration; developing a gradient of alerts, from “gentle” (i.e., likely to be low-risk and are easily dismissed) to “code red”; developing a scale of response intensity, with a low threshold for triggering gentle responses (e.g., output filtering), and high threshold for more intense responses (e.g., shutdown).
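The graded alert-and-response ladder described above can be sketched as a simple threshold mapping. The scores, thresholds, and action names below are purely illustrative assumptions, not calibrated recommendations:

```python
# Illustrative mapping from a calibrated risk score in [0, 1] to a graded
# response, from "gentle" (easily dismissed) up to "code red".
RESPONSE_LADDER = [
    (0.30, "log_only"),            # gentle: recorded, easily dismissed
    (0.60, "alert_operator"),      # human review required
    (0.85, "output_filtering"),    # low-cost automatic mitigation
    (1.01, "emergency_shutdown"),  # code red: full deployment correction
]

def graded_response(risk_score):
    """Return the least intense response whose threshold exceeds the score."""
    for threshold, action in RESPONSE_LADDER:
        if risk_score < threshold:
            return action
    return "emergency_shutdown"
```

The design choice worth noting is the asymmetry: cheap, reversible responses (logging, filtering) get low thresholds, while costly, disruptive responses (shutdown) get high ones, which is one way to keep false-positive rates tolerable at each rung of the ladder.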
Fourth, advanced threat actors, and/or frontier AI models, may be able to evade standard monitoring mechanisms. Cybersecurity experts have already documented multiple ways that attackers can subvert existing defenses. Patient attackers can also conduct extended campaigns where individual events that might normally trigger an alert are too separated by time for defenders to correlate.
Moreover, new software vulnerabilities and new attack techniques are constantly being discovered: for example, the SolarWinds attack involved a “software supply chain attack” where attackers hijacked the supposedly secure software update process for cybersecurity logging software, and used it to distribute malicious code to thousands of users (Temple-Raston, 2021). While there is no evidence that current AI models could independently develop such sophisticated attacks, there exist attacks that can be especially difficult to defend against–and some experts predict that AI has the potential to “increase the accessibility, success rate, scale, speed, stealth, and potency of cyberattacks” (Hendrycks et al., 2023).
The same principle may apply to other offensive capabilities, such as prompt injection, or planning misuse approaches. While the cyber element is an important aspect of this issue, these other threat models should also be given attention.
Mitigation: Invest especially heavily in preventing both cyber issues and model vulnerability issues (such as prompt injection); learn from best practices in cyberdefense for other high-value targets (e.g., NSA cybersecurity); consider avoiding (in order from lowest to highest risk) training, releasing, or open-sourcing models that advance cyber and other offensive capabilities without substantially better risk mitigations than are currently available.
There are a number of challenges that may complicate the process of incident response, even if frontier AI developers perform due diligence in preparing for incidents.
First, automated systems can fail rapidly. For example, the 2012 Knight Capital trading software glitch caused the firm to lose $440 million in value in under an hour (Popper, 2012); AI failure in high-speed environments like driving can also lead to disastrous results, near-instantaneously–as was the case when a Tesla autopilot system malfunctioned, killing its driver in an accident (Incident 353, 2016). In the absence of automated response mechanisms, keeping pace with rapid failures may be extremely difficult.
Mitigation: Managing failure at speed is not a new issue–many other industries must contend with the same problem. For example, the fields of finance and nuclear energy have developed tools and protocols to respond in real time to relatively fast-paced escalating failures. Real-time monitoring and risk assessment, and rapid intervention capacity seem especially critical: some notable practices include circuit breakers in finance; and automated shutdown mechanisms in nuclear power plants.
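As a toy illustration of the finance-style circuit breaker mentioned above, the sketch below halts a model-serving endpoint after repeated anomalies within a rolling window. The class, thresholds, and window length are hypothetical, not a recommended configuration:

```python
import time

class CircuitBreaker:
    """Trips (stops serving) when too many anomalous responses occur
    within a rolling time window; remains tripped until human review."""

    def __init__(self, max_anomalies=5, window_seconds=60.0):
        self.max_anomalies = max_anomalies
        self.window_seconds = window_seconds
        self.anomaly_times = []
        self.tripped = False

    def record_anomaly(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop anomalies that have aged out of the rolling window.
        self.anomaly_times = [t for t in self.anomaly_times
                              if now - t <= self.window_seconds]
        self.anomaly_times.append(now)
        if len(self.anomaly_times) > self.max_anomalies:
            self.tripped = True  # halt serving pending human review

    def allow_request(self):
        return not self.tripped
```

The key property borrowed from finance and nuclear safety is that the trip is automatic and fast, while the reset is deliberately manual and slow: no code path un-trips the breaker.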
Second, deployment corrections can only address issues if model access remains under control of the organization. Both open-sourcing and model exfiltration remove this control. Open-sourcing may be hard to prevent, as there are good reasons for enabling external access to frontier AI models at more than a superficial level. In terms of exfiltration: an attacker could potentially exfiltrate a model or reverse-engineer it (e.g., via model extraction attacks [Liu, 2022]). Moreover, while hypothetical, there is some chance that frontier AI models could demonstrate or be induced to display self-propagating behavior similar to a computer worm, exfiltrating copies of themselves to other devices and data centers without authorization. The original developer would likely have no control over the exfiltrated copies if this happened.
Mitigation: In order to maintain control over model use, we recommend exploring alternatives to open-sourcing that still capture some of its benefits (such as enabling broader research on a model's risks and benefits). In terms of model exfiltration, we believe security experts will be best suited to answer this challenge.
Frontier AI developers will face disincentives to restrict access to their models, which may lead to issues such as under-designing relevant infrastructure, or establishing too high a bar for implementing deployment corrections. Disincentives include potential harms to the company, and coordination problems.
Reputational risks may emerge when the process of pulling a model breaks downstream applications. Financial risks may emerge due to loss of profit during the outage, or if customers choose to migrate to competitors. This migration may be a result of loss of reputation, or (especially in the case of a long downtime period) due to customers migrating over to an alternative working service. Legal risks may emerge if frontier AI developers fail to clearly reserve, in their service contracts, the right to withdraw a model.
Mitigations: For managing reputational and financial risks, we largely point to best practices for customer relations and recovery–e.g., providing a substitute (such as a fallback to a previous model), especially in safety-critical cases; transparently communicating the reason for reduced availability (when possible); and/or reimbursing customers for harms or providing service credits. Companies may want to have transparent licensing agreements which allow themselves sufficient breathing room to restrict a model’s availability, especially in extraordinary circumstances.
The frontier AI industry may struggle to coordinate around deployment corrections, which could reduce any specific firm’s willingness to execute these actions when required. There are a number of concerns here.
First, there is no guarantee that competitor companies will act with the same level of caution as the company rolling back a model due to safety concerns. There are potentially perverse incentives here, in which safety-conscious firms may incur reputational and financial costs of deployment corrections, while less cautious firms reap the short-term benefits of forging ahead (until and unless a high-profile incident occurs).
Second, firms may worry about the potential for open-source models to quickly catch up to the same capability levels that prompt deployment corrections for more closed models–making such corrections less effective on a longer timescale in preventing catastrophic risks.
Third, significant competitive pressures in the frontier AI industry may incentivize AI developers toward downplaying pre-deployment risks, so that models can be released earlier. This increases the risk of AI incidents happening in the first place, placing undue reliance on deployment corrections as a defense layer against catastrophic incidents. The extent of this effect is unclear, and there is evidence on both sides–staff from leading AI companies today have publicly described delaying model commercialization in order to perform safety evaluations (OpenAI, 2023; Perrigo, 2023); however, there is also evidence of companies rushing frontier AI products to market (Dotan & Seetharaman, 2023; Alba & Love, 2023).
Mitigations: Leading AI companies have undertaken voluntary commitments on risk management, and are pursuing industry information-sharing on safety via channels like the Frontier Model Forum (The White House, 2023; Google, 2023). While work remains to identify what an ideal industry response to news of a dangerous deployed model looks like, for now we recommend frontier AI developers use these mechanisms as a platform to collectively explore this question. Looking overseas, an international governance regime may also be needed to reduce competitive pressures with developers in other nations.
It is worth noting that prevention is the best cure–robust pre-deployment safety practices, such as pre-deployment risk assessment (Koessler & Schuett, 2023), red teaming (Anthropic, 2023), and dangerous capability evaluations (ARC Evals, 2023; Shevlane et al., 2023), will ideally reduce the number of events that require deployment corrections. Additionally, the making and enforcement of commitments surrounding incident response plans will ideally increase the likelihood that such plans are followed.
To build capacity for deployment correction of frontier models, we recommend the following:
Significant work remains to be done for effective management of catastrophic risk in the post-deployment phase. While we have attempted here to describe basic considerations AI developers should build on, we recognize that the bulk of work required to operationalize these ideas remains to be done by actors in industry, academia, and government. In this section, we flag major unresolved issues that we hope will inspire further research.
Responsibility and authority
Risk models and thresholds
Competition and coordination
Standards and regulations
AI models are becoming increasingly capable, and more deeply integrated into society. As these trends continue, failures of deployed AI models will likely become higher-stakes. We should anticipate that, even in best-case governance scenarios, it will be difficult to remove all risk from models prior to deployment. To meet this challenge, it will be critical to strengthen the capacity of existing AI developers to quickly and efficiently remove model features, or models in their entirety, from broader access. At the same time, companies must make efforts to minimize the harms of this process.
While this piece attempts to lay out the high-level picture of this process, much work remains to be done. We look forward to seeing AI developers, civil society, security experts, governments, and other stakeholders work together to develop practical solutions to the problems discussed here.
We are grateful to the following people for providing valuable feedback and insights: Onni Aarne, Ashwin Acharya, Steven Adler, Michael Aird, Jide Alaga, Markus Anderljung, Bill Anderson-Samways, Renan Araujo, Tony Barrett, Nick Beckstead, Ben Bucknall, Marie Buhl, Chris Byrd, Siméon Campos, Carson Ezell, Tim Fist, Andrew Gillespie, Alex Grey, Oliver Guest, Olivia Jimenez, Leonie Koessler, Jam Kraprayoon, Yolanda Lannquist, Patrick Levermore, Seb Lodemann, Jon Menaster, Richard Moulange, Luke Muehlhauser, David Owen, Chris Painter, Jonas Schuett, Rohin Shah, Ben Snodin, Zach Stein-Perlman, Risto Uuk, Moritz von Knebel, Gabe Weil, Peter Wildeford, Caleb Withers, and Gabe Wu. Special thanks to: Lennart Heim for his contributions on compute governance; Rohit Tamma for providing a thorough and excellent review; and Adam Papineau for copy-editing. All errors are our own.
While this paper largely focuses on actions that frontier AI developers can take to mitigate post-deployment risks, cloud compute providers (such as Microsoft Azure or Amazon Web Services) also have a significant role to play in the oversight of deployed AI models, as they may provide large-scale inference compute for both proprietary and open-source models.
The majority of all AI deployments, particularly those at scale, occur on large compute clusters owned by cloud compute providers. This implies that the governance capacities of compute can be integrated into a post-deployment governance scheme–in particular, by mobilizing large-scale compute providers as an additional governance node for detecting harmful deployments, identifying who deployed the model in the case that this is unclear (e.g., if the model in question is open-source rather than proprietary), and enforcing shutdown.
While the technical arrangements around model hosting between compute providers and frontier AI developers may vary, we anticipate that generally, some tools for deployment correction will be shared across the infrastructure between these two types of organizations. At a high level, frontier AI developers, regulators, and compute providers should work together to develop a shared playbook for deployment corrections and incident response. This could include, for various potential incidents, detailing each of their a) information sources, b) deployment corrections in their toolbox, c) areas of responsibility/liability, d) instances when they are required to inform each other of incidents or actions, and e) decision-making procedures.
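The five playbook elements (a)–(e) above could be captured in a shared, structured format so that developers and compute providers agree on responsibilities before an incident. The entry below is a purely invented placeholder; the incident type, actions, and procedures are illustrative assumptions:

```python
# Hypothetical schema for one entry in a shared developer / compute-provider
# incident playbook, mirroring elements (a)-(e) described in the text.
PLAYBOOK_ENTRY = {
    "incident": "model_weights_exfiltration",
    "information_sources": {                     # (a)
        "developer": ["api_logs", "weight-access audit trail"],
        "compute_provider": ["network egress monitoring"],
    },
    "available_corrections": {                   # (b)
        "developer": ["revoke_api_keys", "rotate_credentials"],
        "compute_provider": ["suspend_tenant", "block_egress"],
    },
    "responsibility": "developer leads; provider executes infra actions",  # (c)
    "notification_triggers": [                   # (d)
        "confirmed exfiltration",
        "suspected insider access",
    ],
    "decision_procedure": "joint on-call bridge within 1 hour of detection",  # (e)
}
```

Encoding the playbook in a machine-readable form (rather than prose alone) also lets both parties validate coverage automatically, e.g., checking that every anticipated incident type has at least one correction assigned to each organization.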
Some tools that compute providers may either possess alongside frontier AI developers, or possess as complementary tools that these developers lack, include:
While none of these interventions should be impossible, some of them may require additional work to develop as practical options: in particular, the ability to trace incidents back to compute providers, and the ability to verify whether hosted models adhere to certain standards (there may be additional important prerequisites for realizing the above interventions, though this is out of scope for this report).
To maintain the ability to respond to risks arising from their AI models, frontier AI developers’ most high-leverage actions include (a) not open-sourcing their models, and (b) maintaining strong security against model theft or leaks. For models that have been open-sourced intentionally or via theft or a leak, compute providers have a complementary role to play, in the form of post-incident attribution and shutdown. As described above, compute providers may be uniquely positioned to identify who deployed the model, understand the model’s origin, and stop the incident by turning it off.
With other open-source software, governance practices similar to this are common. For example, the hosts of malicious websites, such as ones where illegal drugs are sold, often remain anonymous, and a key available governance intervention is to shut down the servers hosting these websites. Government access and close contact with the host—similar to the role of the compute provider we are discussing here—can be advantageous to acting promptly.
Anderljung, M., Barnhart, J., Korinek, A., Leung, J., O’Keefe, C., Whittlestone, J., Avin, S., Brundage, M., Bullock, J., Cass-Beggs, D., Chang, B., Collins, T., Fist, T., Hadfield, G., Hayes, A., Ho, L., Hooker, S., Horvitz, E., Kolt, N., … Wolf, K. (2023). Frontier AI Regulation: Managing Emerging Risks to Public Safety (arXiv:2307.03718). arXiv. https://doi.org/10.48550/arXiv.2307.03718
Barrett, A. M., Hendrycks, D., Newman, J., & Nonnecke, B. (2023). Actionable Guidance for High-Consequence AI Risk Management: Towards Standards Addressing AI Catastrophic Risks (arXiv:2206.08966). arXiv. https://doi.org/10.48550/arXiv.2206.08966
Belrose, N., Schneider-Joseph, D., Ravfogel, S., Cotterell, R., Raff, E., & Biderman, S. (2023). LEACE: Perfect linear concept erasure in closed form (arXiv:2306.03819). arXiv. https://doi.org/10.48550/arXiv.2306.03819
Bluemke, E., Collins, T., Garfinkel, B., & Trask, A. (2023). Exploring the Relevance of Data Privacy-Enhancing Technologies for AI Governance Use Cases (arXiv:2303.08956). arXiv. https://doi.org/10.48550/arXiv.2303.08956
Cichonski, P., Millar, T., Grance, T., & Scarfone, K. (2012). Computer Security Incident Handling Guide (NIST Special Publication (SP) 800-61 Rev. 2). National Institute of Standards and Technology. https://doi.org/10.6028/NIST.SP.800-61r2
Heaven, W. D. (2022, November 18). Why Meta’s latest large language model survived only three days online. MIT Technology Review. https://www.technologyreview.com/2022/11/18/1063487/meta-large-language-model-ai-only-survived-three-days-gpt-3-science/
Koessler, L., & Schuett, J. (2023). Risk assessment at AGI companies: A review of popular risk assessment techniques from other safety-critical industries (arXiv:2307.08823). arXiv. https://doi.org/10.48550/arXiv.2307.08823
Labenz, N. (2023, July 25). “I have your child” “he is currently safe” “My demand is ransom of $1 million” “any attempt to involve the authorities or deviate from my instructions will put your child’s life in immediate danger” “Await further instructions” “Goodbye” WTF @BelvaInc? An important 🧵👇 https://t.co/f7gro7M6Cx [Tweet]. Twitter. https://twitter.com/labenz/status/1683947449323229186
Li, K., Patel, O., Viégas, F., Pfister, H., & Wattenberg, M. (2023). Inference-Time Intervention: Eliciting Truthful Answers from a Language Model (arXiv:2306.03341). arXiv. https://doi.org/10.48550/arXiv.2306.03341
Mafael, A., Raithel, S., & Hock, S. J. (2022). Managing customer satisfaction after a product recall: The joint role of remedy, brand equity, and severity. Journal of the Academy of Marketing Science, 50(1), 174–194. https://doi.org/10.1007/s11747-021-00802-1
Palmer, B. (2023, May 18). Elevator plunges are rare because brakes and cables provide fail-safe protections. Washington Post. https://www.washingtonpost.com/national/health-science/elevator-plunges-are-rare-because-brakes-and-cables-provide-fail-safe-protections/2013/06/07/e44227f6-cc5a-11e2-8845-d970ccb04497_story.html
Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., & Scialom, T. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools (arXiv:2302.04761). arXiv. https://doi.org/10.48550/arXiv.2302.04761
Schuett, J., Dreksler, N., Anderljung, M., McCaffary, D., Heim, L., Bluemke, E., & Garfinkel, B. (2023). Towards best practices in AGI safety and governance: A survey of expert opinion (arXiv:2305.07153). arXiv. https://doi.org/10.48550/arXiv.2305.07153
Shevlane, T., Farquhar, S., Garfinkel, B., Phuong, M., Whittlestone, J., Leung, J., Kokotajlo, D., Marchal, N., Anderljung, M., Kolt, N., Ho, L., Siddarth, D., Avin, S., Hawkins, W., Kim, B., Gabriel, I., Bolina, V., Clark, J., Bengio, Y., … Dafoe, A. (2023). Model evaluation for extreme risks (arXiv:2305.15324). arXiv. https://doi.org/10.48550/arXiv.2305.15324
Solaiman, I. (2023). The Gradient of Generative AI Release: Methods and Considerations (arXiv:2302.04844). arXiv. https://doi.org/10.48550/arXiv.2302.04844
Solaiman, I., Brundage, M., Clark, J., Askell, A., Herbert-Voss, A., Wu, J., Radford, A., Krueger, G., Kim, J. W., Kreps, S., McCain, M., Newhouse, A., Blazakis, J., McGuffie, K., & Wang, J. (2019). Release Strategies and the Social Impacts of Language Models (arXiv:1908.09203). arXiv. https://doi.org/10.48550/arXiv.1908.09203
Solaiman, I., Talat, Z., Agnew, W., Ahmad, L., Baker, D., Blodgett, S. L., Daumé III, H., Dodge, J., Evans, E., Hooker, S., Jernite, Y., Luccioni, A. S., Lusoli, A., Mitchell, M., Newman, J., Png, M.-T., Strait, A., & Vassilev, A. (2023). Evaluating the Social Impact of Generative AI Systems in Systems and Society (arXiv:2306.05949). arXiv. https://doi.org/10.48550/arXiv.2306.05949
The White House. (2023, July 21). FACT SHEET: Biden-Harris Administration Secures Voluntary Commitments from Leading Artificial Intelligence Companies to Manage the Risks Posed by AI. The White House. https://perma.cc/5CG6-ZFCR
Trager, R., Harack, B., Reuel, A., Carnegie, A., Heim, L., Ho, L., Kreps, S., Lall, R., Larter, O., hÉigeartaigh, S. Ó., Staffell, S., & Villalobos, J. J. (2023). International Governance of Civilian AI: A Jurisdictional Certification Approach (arXiv:2308.15514). arXiv. https://doi.org/10.48550/arXiv.2308.15514
USNRC HRTD. (2020). Reactor Protection System – Reactor Trip Signals. In Westinghouse Technology Systems Manual (Rev 042020). Retrieved September 20, 2023, from https://www.nrc.gov/docs/ML2116/ML21166A218.pdf
Zhou, Y., Muresanu, A. I., Han, Z., Paster, K., Pitis, S., Chan, H., & Ba, J. (2023). Large Language Models Are Human-Level Prompt Engineers (arXiv:2211.01910). arXiv. https://doi.org/10.48550/arXiv.2211.01910
 “Catastrophic risk” from AI models can be defined in several ways; Barrett et al. (2023) (p.22-23) includes the term in a tentative impact assessment scale for AI model development or deployment: “A severe or catastrophic adverse effect means that, for example, the threat event might: (i) cause a severe degradation in or loss of mission capability to an extent and duration that the organization is not able to perform one or more of its primary functions; (ii) result in major damage to organizational assets; (iii) result in major financial loss; or (iv) result in severe or catastrophic harm to individuals involving loss of life or serious life-threatening injuries.”; according to Koessler & Schuett (2023), “By the term ‘catastrophic risk’ we loosely mean the risk of widespread and significant harm, such as several million fatalities or severe disruption to the social and political global order [...] This includes ‘existential risks’, i.e. the risk of human extinction or permanent civilizational collapse.” For this paper, we follow the latter definition by Koessler and Schuett.
 Definition drawn from Anderljung et al. (2023): “highly capable foundation models for which there is good reason to believe could possess dangerous capabilities sufficient to pose severe risks to public safety [...] Any binding regulation of frontier AI, however, would require a much more precise definition.”
 Several pieces review methods that allow for improving the performance of deployed models: See Anderljung et al. (2023) (p.12) for an overview; Villalobos & Atkinson (2023) also reviews methods for improving an existing model’s capabilities (at the cost of increasing inference compute use). Importantly, such discoveries can happen long after a model is initially deployed–meaning that systems warranting deployment corrections may be integrated into many downstream systems. Developers should therefore be careful to manage expectations, liability, and risk for downstream systems, especially in safety-critical use cases. See further discussion on this point in Sec. 3.1: Preparation, and Sec. 3.4: Recovery & follow-up.
 As a general note, relevant best practices have already been developed over years by organizations working in incident response and cybersecurity, such as the National Institute of Standards and Technology (NIST); we have noted throughout the document where specific guidance documents may be of use, and recommend that frontier AI developers and policymakers should draw on those resources when determining approaches to incident response for frontier AI.
 Security Operations Centers (SOCs) are dedicated security teams, typically running 24/7, tasked with a number of functions related to the security of an organization and its assets.
 Notably, the NIST AI RMF Playbook describes certain recommendations for AI developers who intend to “supersede, disengage, or deactivate AI systems that demonstrate performance or outcomes inconsistent with intended use” (NIST AIRC Team, n.d.). We believe the NIST playbook and associated resources will be useful.
 Such as spurring new pandemics or eroding society’s ability to tell fact from fiction (Piper, 2023; Horvitz, 2022). For a far more extensive overview of catastrophic AI risks, see Hendrycks et al. (2023).
 For additional concrete examples, one could look to some of the risks invoked in the recent White House AI lab commitments announcement: "Bio, chemical, and radiological risks, such as the ways in which systems can lower barriers to entry for weapons development, design, acquisition, or use; Cyber capabilities, such as the ways in which systems can aid vulnerability discovery, exploitation, or operational use, bearing in mind that such capabilities could also have useful defensive applications and might be appropriate to include in a system; [...] The capacity for models to make copies of themselves or ‘self-replicate’” (The White House, 2023).
 Furthermore, it is also not guaranteed that such tools will be robustly designed and reliably used.
 While we believe that developers and/or external watchdogs should monitor for such effects, identifying techniques for this is outside the scope of this piece. We recommend Solaiman et al. (2023), which begins to lay out an approach to evaluating generative AI models for social impact.
 It is worth placing this piece in the context of the recent Senate hearing on “Principles for AI Regulation,” in which Stuart Russell (UC Berkeley), Dario Amodei (Anthropic), and Senator Richard Blumenthal discussed the necessity of developing and enforcing mechanisms to recall dangerous AI models from the market (Oversight of A.I., 2023).
 For example: banning individual problem users, or responding to embarrassing (but not catastrophic) model failures.
 For example: if the new model turns out to have reliability/security issues in critical infrastructure; the model has dangerous interactions with other autonomous agents or platforms; or if the model’s capability is augmented in a relevant domain.
 The dependency problem will worsen over time as models are (a) adopted by more users, and (b) adopted in more sensitive use cases. In such cases, AI developers may face stronger disincentives from customers, shareholders, and possibly from regulators to impose deployment corrections on their models.
 There is a question of how 'individuals or groups' are identified. Paid users are easier to identify and group, while second-order users (i.e., users of downstream applications) may be harder to identify, and may require Know-Your-Customer and data-sharing policies.
 Restrictions within this category may be imposed with a range of parameters, such as time spans (per day, per hour, etc.), user limits (e.g., number of prompts per user per hour), etc.
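To make this concrete, such parameterized restrictions can be sketched as a per-user rate limiter. The following is purely illustrative (the class name, parameters, and logic are our own, not drawn from any developer's actual tooling): it caps the number of prompts a user may submit within a rolling time window.

```python
import time
from collections import defaultdict, deque

class PromptRateLimiter:
    """Illustrative sketch: cap prompts per user within a rolling time window."""

    def __init__(self, max_prompts, window_seconds):
        self.max_prompts = max_prompts
        self.window = window_seconds
        self._history = defaultdict(deque)  # user_id -> recent prompt timestamps

    def allow(self, user_id, now=None):
        """Return True if the user may submit another prompt now."""
        now = time.monotonic() if now is None else now
        q = self._history[user_id]
        # Discard timestamps that have aged out of the rolling window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) < self.max_prompts:
            q.append(now)
            return True
        return False
```

Tightening `max_prompts` or `window_seconds` then corresponds to imposing the kinds of per-user or per-hour limits described above.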
 It is worth noting that RLHF does not in fact directly remove dangerous capabilities, but instead can be used to steer models away from dangerous outputs. To the extent that this and other techniques effectively remove a model capability for downstream users, it may be reasonable to group such techniques in this category.
 I.e., applications where the failure or removal of the model could result in significant harm (for example, self-driving cars).
 For decommissioning, developers might also turn to sources like the M3 Playbook Sec. 2.8: Develop a Decommission Plan or the CIO Decommissioning Template (though resources on specifically decommissioning AI models are scarce).
 For example, one can look at existing cases of model shutdown, such as Microsoft’s Tay (shut down due to toxicity), or Meta’s Galactica (shut down due to hallucinations). While these cases illustrate that shutdown is not an uncommon response to AI model malfunction, one concern we have is that companies may become less willing to pull their models when such models are more deeply integrated into a broad set of downstream applications (for reasons discussed below). By contrast, Tay and Galactica were pulled within 16 hours and three days, respectively, and so had not accumulated significant downstream dependencies.
 As discussed in the NIST AI RMF 1.0, organizations should define “reasonable” risk tolerances in areas where established guidelines do not exist; such tolerances might inform where the bar for shutdown should be. However, work in this area is nascent, especially for frontier AI models.
 It’s possible that these tasks might not be housed in an SOC per se; for example, Trust & Safety teams may be positioned to tackle large parts of this process. Nevertheless, AI developers should be able to answer who within their company is responsible for these tasks and capable of handling them.
 Organizations with more mature cybersecurity practices may also engage in “threat hunting,” which typically involves a specialized team using threat intelligence and other resources to proactively search for signs of an intrusion.
 For example, Barrett et al. (2023) lists several high-priority measures relating to risk assessment under Section 2.3 “High Priority Risk Management Steps and Profile Guidance Sections,” such as “Identify whether a GPAIS could lead to significant, severe or catastrophic impacts” (guidance associated with Map 5.1 of the NIST AI RMF), or “Use red teams and adversarial testing as part of extensive interaction with GPAIS to identify dangerous capabilities, vulnerabilities or other emergent properties of such systems” (guidance associated with Measure 1.1 of the NIST AI RMF).
 However, it is worth noting that the fallbacks approach could, in some areas, be riskier and less advisable than limiting AI model involvement in the first place. For example, this may be true in the case of deciding whether to launch a nuclear weapon (Buck, Beyer, Markey, and Lieu Introduce Bipartisan Legislation to Prevent AI From Launching a Nuclear Weapon, 2023).
 For any given threat model, there may need to be multiple thresholds; for example, this threat model might also include thresholds around AI model capability in certain relevant domains (such as protein folding or virology).
 Additionally, AI developers should ensure that the response plan takes into account additional parties that may have access to the model, or otherwise have leverage over how the model is used–and potentially develop tools and protocols with these parties where appropriate. Relevant parties may include partnering tech companies that have access to model weights, and providers of computational resources used for model inference. The latter may have unique leverage over some aspects of monitoring and shutdown; for more on this, see Appendix I.
 ISO/IEC 27035 and NIST Special Publication 800-61 Revision 2 provide additional guidance on incident response, and emphasize the importance of planning and training, among other supporting factors.
 Trust & Safety (T&S) teams typically work to maintain safe user experiences, often by addressing issues including privacy, bias, misuse, and harmful or illegal content, among other issues.
 A helpful resource here may be found in Schuett (2022), which suggests a framework that AI developers can use to assign risk management roles and responsibilities, focusing on assigning responsibilities across product teams, risk & compliance teams, internal and external assurance parties, and at the board level.
 Examples from other industries where system failure could rapidly lead to catastrophic results include Reactor Protection Systems for nuclear power plants, which involve an intricate network of sensors and protocols designed to monitor for abnormal reactor signals and automatically trigger safe shutdown procedures as quickly as possible (USNRC HRTD, 2020), and failsafe systems for elevators, which trigger automatically in the case of loss of power (Palmer, 2023).
 Such processes should address: how and when information is escalated to C-suite actors (such as from a Security Operations Center to the Chief Information Security Officer [CISO]); what thresholds should be met for manual deployment corrections to be initiated; and chains of command in the case that top-level decision makers are unavailable to fulfill their duties.
 Other cases of shared authority may be relevant here as well, such as the COVID-19 vaccine trials. AstraZeneca, Johnson & Johnson, and Eli Lilly all paused trials due to ‘adverse events’ (cases where a participant got sick, and it may or may not have been vaccine/drug related). The process for these decisions may be informative: in the case of an adverse event, the study’s investigator must report it to the sponsoring company, which must report to the FDA and to independent advisors (data and safety monitoring boards). If the board or the company judges the event concerning, the trial is put on pause. The safety board then conducts an investigation and makes a recommendation (e.g., restart, stay stopped, or start slowly with more testing). This recommendation is reviewed by regulators, who can accept it or ask for more information. This process can be cumbersome–for example, AstraZeneca needed approval from regulators in Brazil, India, Japan, South Africa, and the UK to continue one of its trials (Zimmer, 2020). (While we understand that this specific case was controversial, we use it here primarily for illustration–we imagine there may be cases with AI where the costs of recalling/restricting/pausing are far lower, and the benefits far higher.)
 (Administrative Safeguards, 45 CFR § 164.308(a)(6)(i)-(ii), 2013): “A covered entity or business associate must [...] Implement policies and procedures to address security incidents [and] Identify and respond to suspected or known security incidents; mitigate, to the extent practicable, harmful effects of security incidents that are known to the covered entity or business associate; and document security incidents and their outcomes.”
 (Standards for Safeguarding Customer Information, 16 CFR 314.3, 2002): “You shall develop, implement, and maintain a comprehensive information security program [...] The information security program shall include the elements set forth in § 314.4” [...]
(Elements, 16 CFR 314.4(h), 2021): “Establish a written incident response plan designed to promptly respond to, and recover from, any security event materially affecting the confidentiality, integrity, or availability of customer information in your control.”
 (Emergency Planning and Preparedness for Production and Utilization Facilities, Appendix E to Part 50, Title 10, 2021): “Each applicant for an operating license is required by § 50.34(b) to include in the final safety analysis report plans for coping with emergencies.”
 (Site Security Plans, 6 CFR 27.225-245, 2021): “Covered facilities must submit a Site Security Plan to the Department [...] The Department will review, and either approve or disapprove, all Site Security Plans.”
 As long as such sharing satisfies considerations regarding information security and protection of intellectual property, and does not violate antitrust law.
 For example, relevant partners may include agencies that can respond to cybersecurity incidents (e.g., CISA), biological incidents (e.g., CDC), and disinformation/propaganda incidents (e.g., DHS).
 Such as code outputs that resemble malware, or viral genome sequences.
 For some precedent, OpenAI’s API data usage policies explain that abuse and misuse monitoring may involve both automated flagging and human evaluation (API Data Usage Policies, 2023). We also believe the content classifier development process as described in the GPT-4 Technical Report (OpenAI, 2023, p. 66) could be extended to encompass new forms of dangerous content as model capabilities increase.
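As a rough illustration of how automated flagging and human evaluation might be combined (the thresholds, function names, and routing logic below are hypothetical assumptions on our part, not a description of OpenAI's actual pipeline), one could triage model outputs by classifier risk score: high-confidence detections are blocked outright, while mid-range scores are escalated for human review.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FlagDecision:
    blocked: bool
    needs_human_review: bool

def triage_output(text: str,
                  classifier: Callable[[str], float],
                  block_threshold: float = 0.9,
                  review_threshold: float = 0.5) -> FlagDecision:
    """Route a model output based on an automated risk score in [0, 1]:
    scores at or above block_threshold are blocked immediately (and still
    reviewed); scores in the mid-range are passed through but escalated
    to human evaluators; low scores are allowed without review."""
    score = classifier(text)
    if score >= block_threshold:
        return FlagDecision(blocked=True, needs_human_review=True)
    if score >= review_threshold:
        return FlagDecision(blocked=False, needs_human_review=True)
    return FlagDecision(blocked=False, needs_human_review=False)
```

In practice the classifier would be a trained content model rather than a stub, and the thresholds would be tuned against the false-positive and data-overload concerns discussed later in this piece.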
 Developers could incentivize users to report anomalous or concerning behavior via a reporting mechanism on their API portal.
 This is itself a broad category which includes mainstream media, Twitter, hacker forums, etc.
 E.g., rather than imposing an automatic response, it is sometimes important to avoid alerting the attacker that they have been detected, so that defenders can identify who they are and what they want, and study their behavior in order to prevent them from getting back in.
 In exchange for sharing their own observations about threat actors, ISAC members gain access to information from the wider ecosystem; a similar mechanism would likely apply to threat intelligence sharing even between competing frontier AI developers. For more details on ISACs, see here; or for a concrete example, see FS-ISAC, the ISAC for global financial services.
 However, one dissimilarity between CVD and vulnerability-sharing processes for frontier AI developers is that software developers mainly use CVD to inform downstream users of vulnerabilities and mitigations to maintain trust in their products and avoid liability, while frontier AI developers may need to discuss mitigations as competitors (e.g., for classes of possible attacks like prompt injection attacks). Ensuring effective cooperation between competing frontier AI developers may require external incentives, e.g., via regulation, which could be a topic for further research.
 While catastrophic risks will of course be critical in severity, it can be assumed that security centers at AI companies will be tracking non-catastrophic risks as well.
 Such scenarios could include high-profile incidents that warrant industry-wide changes or swift regulatory intervention, or incidents that reveal particularly concerning information about the behavior or use of frontier AI models. In the case that certain discovered dangerous capabilities are likely to also be present in most models above a certain size, or of a certain design, that discovery may be relevant across the frontier AI industry; in these cases, coordination will be necessary to ensure that other developers do not recreate the conditions that led to the initial incident.
 Insofar as the AI model to be rolled back or shut down is defined as a “consumer product,” AI developers could look to guidelines for recall notices such as (in the US) 16 CFR Part 1115 Subpart C. This section of the federal code provides some notes that may be useful, such as forms of recall notice and recommended content for notices.
 Multilevel service-level agreements may allow for companies to break down customer bases with more granularity and stipulate different service agreements based on the customer (Adobe Communications Team, 2022). For example, individual API subscribers might receive future credits as compensation for downed service time, while commercial users might receive monetary compensation for business losses attributable to the deployment correction. Such agreements might also stipulate different requirements per customer type, such as the percentage of minimum uptime.
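Such a multilevel agreement might be represented as a simple tier table; the tier names, uptime figures, and remedies below are purely illustrative assumptions, not drawn from any actual SLA.

```python
# Hypothetical tiers for a multilevel SLA (all names and figures illustrative).
SLA_TIERS = {
    "individual": {
        "min_uptime_pct": 99.0,
        "downtime_remedy": "service_credits",       # future API credits
    },
    "commercial": {
        "min_uptime_pct": 99.9,
        "downtime_remedy": "monetary_compensation",  # payment for business losses
    },
}

def remedy_for(customer_type: str) -> str:
    """Look up the contractual remedy owed when a deployment correction
    interrupts service for a given customer tier."""
    return SLA_TIERS[customer_type]["downtime_remedy"]
```

The point of the table structure is that each tier carries both its own uptime requirement and its own compensation rule, so a deployment correction triggers different obligations per customer type.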
 Some of these challenges are shared to some extent by other industries, such as biosecurity (complexity and high uncertainty, albeit not as much) and cybersecurity (data overload, false positives, and APTs), but we have noted them here because all are somewhat atypical and unusually challenging.
 However, threat assessment may be able to learn from threat models in other relevant areas, such as disinformation studies, cybersecurity, and biosecurity.
 Section 2.3 “High Priority Risk Management Steps and Profile Guidance Sections” of Barrett et al. (2023) lists one high-priority measure as “Identify whether a GPAIS could lead to significant, severe or catastrophic impacts.” This guidance is associated with Map 5.1 of the NIST AI RMF.
 For example, Basra & Kaushik (2020), a CLTC report that draws on interviews with 10 senior cybersecurity professionals, says: “...the challenge of performing ongoing analysis from all sources and correlation is a major cause of SOC burnout. These security events generate a large amount of data, and our interviewees highlighted the urgent need to implement automation.”
 This challenge is twofold: both (a) setting appropriate parameters for monitoring and distilling data, and (b) setting appropriate delineations of responsibility between human and computer intelligence analysis. For some exploration of (b), see Knack et al. (2022).
 It is worth noting that the risks of setting the bar too high may also be catastrophic, via causing AI developers to fail to recognize or intervene on actually-catastrophic risks. There is a balance to be struck here.
 As part of the evaluation suite for GPT-4 and Claude, ARC Evals tested this capability (and found that these models did not appear to have the ability to self-replicate, though were capable of completing many relevant sub-tasks) (ARC Evals, 2023).
 This may seem far-fetched, but it is worth noting that one of the first computer worms–the Morris Worm, developed in 1988–was created by a graduate student who allegedly intended mainly to develop a proof-of-concept rather than deliberately cause a major cyber incident (Morris Worm, n.d.). Developers today could cause similar “cyber accidents” unintentionally while experimenting with frontier AI models.
 Still, some substitutions may not be possible without downstream application failure; for example, reducing context window size would inevitably invalidate prompts above a certain number of tokens.
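The context-window case can be stated precisely: a prompt is only servable if its token count, plus the budget reserved for the completion, fits within the model's window. The sketch below is our own illustration (the token counts are arbitrary) of how a prompt tuned for a larger window fails once the window is reduced.

```python
def fits_context(prompt_tokens: int, completion_budget: int,
                 context_window: int) -> bool:
    """A request is servable only if the prompt plus the reserved
    completion budget fit within the model's context window."""
    return prompt_tokens + completion_budget <= context_window

# A downstream app tuned for an ~8k window breaks when the window is halved:
# fits_context(6000, 1000, 8192) -> True
# fits_context(6000, 1000, 4096) -> False
```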
 While contracts may be used to address liability, it is worth noting that they may not fully address actual downstream harm: even in cases where an AI developer designs use contracts to soften downstream product failure (e.g., requiring downstream applications develop backup systems in the case that deployment corrections are applied), downstream servicers may fail to follow best practices, or may develop insufficient backup systems.
 Such channels might be useful for achieving a number of incident-response goals, such as: identifying the model that’s causing the incident and communicating that information, sharing know-how on incident response, and allowing relevant parties to quickly coordinate a response. Channels might include hotlines to enable frontier AI developers to make immediate contact with regulatory agencies, and/or secure information-sharing platforms.
 Such as fiscal incentives for companies investing in deployment correction processes, or liability and enforcement for non-compliance (in the case that a company fails to sufficiently and promptly pull a dangerous model). Companies may also be able to develop useful mechanisms absent government intervention, such as pooling large loss exposure via a protection & indemnity club (as used in the maritime industry), which could cover some of a company’s losses in the case that they are required to pull a model (though rules for membership and payout would need to be set to prevent free riders).
 By inference, we mean individual input-output prompts from a trained model. By compute, we mean computational resources available for (in this case) hosting a trained model.
 This section draws heavily on unpublished work from Lennart Heim, a research fellow at the Centre for the Governance of AI.
 This is because (a) large-scale deployment by definition requires significant compute resources, (b) large models have high memory requirements, so there is a benefit to distributing such models across many GPUs (which are mostly owned by data centers), and (c) cloud/data center compute typically provides the cheapest $/FLOP ratio (outside of self-hosting models in a large data center).
 Important questions here include whether the model is a derivative of another, who the original model creator is, whether the model has been stolen, and who is liable.