
Annex 22 Draft: Regulatory Guidance on AI Use in GMP


The integration of Artificial Intelligence (AI) into pharmaceutical manufacturing has prompted regulatory authorities to take a proactive stance in defining its acceptable use under GMP. As AI models become increasingly capable of supporting decision-making processes, the need for regulatory clarity has grown urgent, particularly in areas impacting patient safety, product quality, and data integrity.

To address this, the European Commission has introduced Annex 22 to Volume 4 of the EU GMP Guidelines, a new and standalone annex dedicated to the governance of AI models within GMP-regulated environments. 

Which AI Models Are Allowed, and Which Are Not, Under the Draft Annex 22

Annex 22 represents a targeted response to a disruptive technology. Its focus is narrow yet critical: it applies to AI systems embedded in computerised systems used in manufacturing, but only when these systems are involved in critical operations that directly affect compliance or risk.

While Annex 11 already outlines expectations for computerised systems, Annex 22 expands upon it by introducing AI-specific requirements, particularly for machine learning (ML) models that are trained, rather than explicitly programmed, to classify or predict data. 

This article provides a regulatory interpretation of Annex 22, intended for GMP professionals who must evaluate, validate, and govern AI technologies under compliance frameworks. It dissects the draft’s scope, expectations, and risk management approach, and offers practical guidance for regulated companies seeking to navigate this evolving space.

Scope and Regulatory Boundaries of Annex 22

Annex 22 applies to computerised systems used in the manufacturing of medicinal products and active substances when these systems incorporate Artificial Intelligence models in critical applications. 

These are applications that have a direct impact on patient safety, product quality, or data integrity. The annex provides specific guidance on AI functionality within the broader framework of Annex 11.

The scope is deliberately limited to static AI models. These are models that do not continue to learn or adapt after deployment. Their performance remains fixed, and they generate deterministic outputs, meaning they always produce the same result when provided with the same input. This predictability is essential in a GMP environment where reproducibility and control are fundamental regulatory requirements.
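To make the determinism expectation concrete, a simple check can be built into acceptance testing: run the same inputs through the frozen model several times and fail if the outputs ever differ. The following Python sketch is illustrative only; the function name and inputs are not taken from the annex.

```python
import numpy as np

def assert_deterministic(predict_fn, inputs: np.ndarray, n_runs: int = 3) -> None:
    """Fail if a frozen model ever returns different outputs for identical inputs."""
    reference = predict_fn(inputs)
    for _ in range(n_runs - 1):
        if not np.array_equal(reference, predict_fn(inputs)):
            raise AssertionError(
                "Non-deterministic output detected: model is not suitable "
                "for a GMP-critical application under draft Annex 22"
            )
```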

Annex 22 explicitly excludes certain types of AI models from critical GMP applications:

  • Dynamic or adaptive models that modify their behavior during operation based on new data are not permitted. These models lack the fixed behavior necessary to ensure traceability and consistent validation.
  • Probabilistic models that may produce different outputs for the same input are also excluded. This introduces variability that cannot be reconciled with GMP expectations for reliability and control.
  • Generative AI and Large Language Models (LLMs) are not acceptable in critical applications. These tools, by design, produce novel outputs and operate with probabilistic reasoning, making them unsuitable for environments that demand validated and repeatable performance.

However, these excluded AI types may be used in non-critical GMP applications, provided their output does not influence decisions affecting safety, quality, or data integrity. In such cases, there must be a qualified human responsible for evaluating the output, following the “human-in-the-loop” principle. 

Even in non-critical scenarios, companies are encouraged to apply elements of the Annex 22 guidance to manage risk appropriately.

From a regulatory perspective, this clear boundary-setting reflects a risk-based philosophy. AI can only be used in critical processes when it is fully understood, validated, and controlled in a manner consistent with existing GMP principles. Annex 22 makes it clear that AI must not introduce uncertainty or reduce accountability in pharmaceutical manufacturing.

| Aspect | Annex 11 (Traditional Systems) | Annex 22 (AI Models) |
|---|---|---|
| Validation | Based on fixed logic | Based on trained data patterns |
| Output | Deterministic or logic-based | Must be deterministic |
| Change Control | Code and documents tracked | Model behavior and training data traced |
| Explainability | Not explicitly required | Mandatory for critical applications |
| Confidence Scores | Not applicable | Required where relevant |

Foundational Principles for AI in GMP Systems

The foundational principles in Annex 22 reflect core GMP expectations: clarity of roles, documented oversight, and risk-based control. When Artificial Intelligence is used in regulated environments, these principles must guide its implementation across all stages of the model lifecycle.

Key regulatory expectations include:

  • Multidisciplinary involvement: SMEs, QA, IT, and data science personnel must collaborate on model design, training, testing, and deployment. Each must have defined roles and appropriate qualifications.
  • Documentation and traceability: All activities related to the AI model, including training, validation, and testing, must be documented, regardless of whether performed internally or outsourced. Records should be reviewed by the regulated user and maintained in line with GMP documentation practices.
  • Access control and responsibility assignment: Role-based access must be defined and enforced. Access levels should align with the individual’s responsibilities and must support segregation of duties.
  • Application of Quality Risk Management (QRM): The level of oversight and control should correspond to the potential impact on patient safety, product quality, and data integrity. Decisions must be supported by documented risk assessments consistent with ICH Q9(R1).

These principles emphasize that AI systems are not exempt from GMP controls. On the contrary, their complexity requires enhanced diligence in governance and validation to ensure compliance throughout their lifecycle.

Intended Use of Annex 22 – Defining and Documenting

A central requirement of Annex 22 is the formal definition and documentation of an AI model’s intended use. This ensures the model is applied within a well-understood, justified scope and is not introduced into GMP-critical operations without regulatory control.

Describing the Intended Use

The model’s intended function must be clearly stated and supported by documented process knowledge. This includes:

  • The specific task to be automated or supported by the AI model
  • The process context in which it will operate
  • A detailed description of the input data, including all common and rare variations, potential limitations, and bias risks

This documentation must be reviewed and approved by the process subject matter expert before acceptance testing begins.

Subgroup Identification and Justification

When applicable, the input data space should be divided into relevant subgroups. This supports more accurate performance assessment and validation. Subgroups may be based on:

  • Output decisions (e.g. accept or reject)
  • Site or equipment-specific process variations
  • Product or material characteristics
  • Task-specific classifications, such as defect types or severity levels

Human-in-the-Loop (HITL) Configurations

If the AI model serves as a decision-support tool and a human operator is responsible for the final decision, this interaction must be documented. The operator’s responsibilities must be:

  • Clearly defined in procedural documents
  • Supported by adequate training
  • Monitored for consistency and performance, like any manual GMP operation

This structure ensures the AI model does not bypass human accountability and remains under full regulatory oversight.

| Responsibility | GMP Requirement |
|---|---|
| Role definition | Documented in SOPs |
| Training | Verified and aligned with model function |
| Oversight | Output must be reviewed when confidence is low |
| Records | Human decisions must be traceable and retained |
| Accountability | Final responsibility must remain with the human operator |

Establishing Acceptance Criteria

The Draft Annex 22 places significant emphasis on the performance evaluation of AI models through predefined, documented acceptance criteria. These criteria must be defined before any testing begins and must reflect both the intended use and the risk to product quality, data integrity, and patient safety.

Defining Performance Metrics

The first step in establishing acceptance criteria is selecting appropriate test metrics. These must be tailored to the specific task the model is intended to perform. For classification models, common metrics may include:

  • Sensitivity and specificity
  • Accuracy
  • Precision and recall
  • F1 score
  • Confusion matrix parameters

The chosen metrics must be relevant, measurable, and capable of demonstrating that the model performs reliably across all defined input variations and subgroups.
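As an illustration, the metrics listed above can be computed with standard tooling such as scikit-learn. The labels below are invented accept(0)/reject(1) outcomes, not data from the annex.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical reference labels and model predictions (0 = accept, 1 = reject)
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                 # true reject rate
specificity = tn / (tn + fp)                 # true accept rate
precision   = precision_score(y_true, y_pred)
recall      = recall_score(y_true, y_pred)   # identical to sensitivity here
f1          = f1_score(y_true, y_pred)
accuracy    = (tp + tn) / (tp + tn + fp + fn)
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, F1={f1:.2f}")
```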

Setting Acceptance Thresholds

Once metrics are selected, acceptance thresholds must be established. These thresholds define what level of performance is considered acceptable for the model to be used in its intended GMP application. The process subject matter expert is responsible for defining and approving these thresholds prior to testing.

Where input subgroups have been identified, separate acceptance criteria may be applied to ensure the model performs adequately across all relevant process conditions.
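One way to operationalise this, sketched below with invented numbers, is to store the SME-approved thresholds alongside the metrics computed for each subgroup and verify all of them before the model is released for use.

```python
# Hypothetical SME-approved acceptance thresholds, fixed before testing begins
THRESHOLDS = {"sensitivity": 0.98, "specificity": 0.95}

# Metrics computed per input subgroup (invented values for illustration)
subgroup_metrics = {
    "site_A": {"sensitivity": 0.99, "specificity": 0.97},
    "site_B": {"sensitivity": 0.97, "specificity": 0.96},
}

failures = {
    group: {m: v for m, v in metrics.items() if v < THRESHOLDS[m]}
    for group, metrics in subgroup_metrics.items()
}
failures = {g: f for g, f in failures.items() if f}
if failures:
    # e.g. site_B sensitivity 0.97 falls below the 0.98 threshold
    print(f"Acceptance criteria NOT met: {failures}")
```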

No Decline in Performance

Critically, the performance of the AI model must be equal to or better than the process it replaces. This means the regulated company must first understand and quantify the existing process performance before introducing the model. 

This expectation aligns with Annex 11, which requires companies to demonstrate that new systems do not introduce additional risk or reduce control.

| Validation Element | GMP Expectation |
|---|---|
| Intended use definition | Detailed, SME-approved |
| Performance metrics | Relevant and risk-based |
| Acceptance thresholds | Justified and documented |
| Test dataset | Representative, stratified, labeled |
| Independence | Separated from training, with access control |
| Explainability | Input features influencing outputs must be reviewable |
| Confidence scoring | Logged and thresholded for use |

Test Data Requirements

Annex 22 sets clear expectations regarding the selection, composition, and governance of test data used to evaluate AI models in GMP-regulated environments. These requirements are designed to ensure that test results are meaningful, reproducible, and representative of real-world process conditions.


Representative and Stratified Data

Test data must reflect the full range of input conditions defined in the model’s intended use. This includes not only typical inputs but also edge cases and process variability. The dataset should be stratified to include subgroups where relevant, such as:

  • Differences between manufacturing sites or equipment
  • Variations in raw materials or product characteristics
  • Categories of classification outcomes (e.g. accept/reject, defect type)
  • Task-specific operational ranges

This structure allows for more granular performance assessment and supports decisions regarding model suitability and reliability.
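A quick way to verify that a candidate test set actually covers the defined subgroups is to tabulate it, as in the pandas sketch below; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical test-set metadata: each row describes one test sample
test_set = pd.DataFrame({
    "site":    ["A", "A", "B", "B", "A", "B"],
    "outcome": ["accept", "reject", "accept", "reject", "reject", "accept"],
})

# Tabulate coverage per subgroup; empty or tiny cells signal representativeness gaps
coverage = test_set.groupby(["site", "outcome"]).size().unstack(fill_value=0)
print(coverage)
```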

Adequate Volume and Statistical Confidence

The dataset must be large enough to support statistically valid performance evaluations. This applies to the full set as well as to each identified subgroup. Without sufficient data, it is not possible to draw reliable conclusions about the model’s behavior under different operating conditions.
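For example, the uncertainty around an observed sensitivity can be quantified with a Wilson confidence interval; a wide interval signals that a subgroup is too small to support a reliable conclusion. The counts below are invented for illustration.

```python
from statsmodels.stats.proportion import proportion_confint

# Hypothetical: 194 of 200 reject samples correctly identified in one subgroup
correct, total = 194, 200
low, high = proportion_confint(correct, total, alpha=0.05, method="wilson")
print(f"observed sensitivity {correct/total:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```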

Accurate and Verified Labelling

All test data must be accurately labelled, and the method of verification must ensure a high level of confidence in the reference results. Acceptable approaches include:

  • Independent verification by multiple trained experts
  • Use of validated analytical methods or equipment
  • Cross-checking against controlled historical datasets

The integrity of test results depends directly on the reliability of the data labels used during evaluation.
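Where multiple trained experts label the same samples independently, their agreement can be quantified before the labels are accepted. Cohen's kappa is one common statistic for this; the sketch below uses invented labels.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical independent labels from two trained experts (0 = accept, 1 = reject)
expert_a = [1, 0, 1, 1, 0, 0, 1, 0]
expert_b = [1, 0, 1, 0, 0, 0, 1, 0]

kappa = cohen_kappa_score(expert_a, expert_b)
print(f"inter-rater agreement (Cohen's kappa): {kappa:.2f}")  # low values warrant label review
```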

Controlled Pre-processing and Data Exclusion

Any data transformation prior to model input, such as normalization, scaling, or encoding, must be predefined and justified. Pre-processing should reflect the conditions under which the model will be used. 

Similarly, if any data are excluded from the test set, the reason for exclusion must be documented. Arbitrary removal of data points is not acceptable and may be viewed as compromising the objectivity of the evaluation.
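In practice, pre-processing steps can be predefined and frozen together with the model, for example by packaging them in a single pipeline whose parameters are fitted once on training data and never refitted afterwards. The sketch below uses scikit-learn and synthetic data purely for illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 3)), rng.integers(0, 2, size=100)

# Normalisation is part of the frozen artefact: fitted once, versioned with the model
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
]).fit(X_train, y_train)

# At test and run time, the same predefined transformation is applied automatically
predictions = pipeline.predict(rng.normal(size=(5, 3)))
```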

Restrictions on Generated or Synthetic Data

Annex 22 advises against the use of artificially generated data, such as those produced by generative AI, for testing purposes in GMP-critical applications. If such data are used, their inclusion must be justified, and the rationale clearly documented. In most cases, real-world data from controlled environments will provide a more reliable and auditable basis for validation.

Test Data Independence

Annex 22 introduces rigorous expectations for maintaining the independence of test data used in model evaluation. This requirement is fundamental to ensuring that performance results are unbiased and that the AI model is not inadvertently validated on data it has already been exposed to during training or development.


Data Separation and Access Control

To preserve test integrity, test data must be entirely independent of the data used during model training or validation. This can be achieved through two primary approaches:

  • Capturing test data only after model training and validation are complete
  • Splitting a dataset into separate training, validation, and test portions before any development activity begins

Regardless of the approach, access to test data must be tightly controlled. Personnel involved in model development must not have access to the test set. Where full segregation is not feasible, a four-eyes principle must be applied, ensuring that any work involving test data is performed jointly with a team member who has had no prior exposure to it.
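The second approach, splitting the dataset up front, can be implemented as in the sketch below: the split is performed once, with a recorded seed, before any development starts, and the test portion is then locked away under access control. The proportions, seed, and data are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 8)), rng.integers(0, 2, size=1000)

# One-time split before any development activity; seed recorded for traceability
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, stratify=y_dev, random_state=42)

# X_test / y_test are now stored under restricted access,
# out of reach of the model development team
```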

Auditability and Traceability

Test data must be traceable and auditable. This includes:

  • Documenting which data have been used for testing
  • Recording when and how the data were accessed
  • Maintaining audit trails for any changes made to test datasets

If physical samples are used as part of the test data, companies must ensure that these items were not previously used in model training or validation, unless it can be demonstrated that the features influencing the model are independent of prior use.
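A simple supporting control, sketched below, is to fingerprint the test dataset when it is locked and to re-verify the checksum whenever the data are used, so that any modification becomes detectable. The file path is a placeholder.

```python
import hashlib
from pathlib import Path

def dataset_fingerprint(path: str) -> str:
    """SHA-256 checksum of a test dataset file, recorded when the set is locked."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# Recorded once at lock time, re-checked before every test run (path is illustrative):
# locked = dataset_fingerprint("test_set_v1.csv")
# assert dataset_fingerprint("test_set_v1.csv") == locked, "test set has been modified"
```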

Controlling Staff Involvement

To minimize bias, Annex 22 emphasizes the need to separate personnel responsibilities. Individuals with access to test data should not participate in model development unless procedural safeguards are in place. If complete separation cannot be achieved, collaboration must be structured in a way that preserves independence and prevents result manipulation.

Explainability Requirements

One of the core regulatory expectations introduced in Annex 22 is that AI models used in GMP-critical applications must be explainable. This means that the basis for any prediction or decision must be transparent and scientifically interpretable, not just to data scientists, but also to quality assurance, auditors, and regulatory authorities.

This principle addresses a key challenge in AI deployment: the use of so-called “black box” models, where the logic behind outcomes is not easily accessible. Annex 22 moves to restrict this by requiring that traceable justifications support critical decisions made or influenced by AI.

Identification of Influential Features

AI systems must be capable of capturing and recording the specific input features that influenced a given outcome. Whether the model accepts, rejects, classifies, or flags a product, the rationale must be visible. This is particularly important for models used in areas like visual inspection, defect classification, or process control, where regulatory accountability is high.

Techniques to support feature identification may include:

  • Feature attribution tools, such as SHAP (Shapley Additive Explanations) or LIME (Local Interpretable Model-Agnostic Explanations)
  • Visual methods, such as heat maps or overlays, in cases where image-based models are used

The use of such tools must be appropriate to the application and integrated into the model’s evaluation and approval process.
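As an illustration of feature attribution, the third-party shap library can wrap a trained model and report how much each input feature contributed to an individual prediction. The model and data below are synthetic, chosen so that one feature visibly drives the outcome.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)          # feature 0 drives the synthetic outcome

model = RandomForestClassifier(random_state=0).fit(X, y)

# Model-agnostic explainer: per-prediction contribution of every input feature
explainer = shap.Explainer(model.predict, X)
attributions = explainer(X[:5])
print(attributions.values)             # feature 0 should dominate
```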

Review and Justification of Feature Relevance

The features identified as influencing model outcomes must be reviewed to ensure they are scientifically and process-relevant. This review process should be performed by qualified subject matter experts and documented as part of model approval.

The aim is to confirm that the model is basing its decisions on valid process or product characteristics, and not on irrelevant or misleading patterns that could introduce regulatory risk. For example, a model classifying tablets based on surface quality should not rely on background lighting artifacts or unrelated visual cues.

Confidence Scores and Thresholds

In addition to accuracy and explainability, Annex 22 requires that AI models used in GMP-critical applications provide insight into the level of confidence associated with each prediction or classification. This requirement introduces a quantitative dimension to decision-making transparency and supports risk-based evaluation of model outputs.

Recording Confidence Scores

When an AI model classifies or predicts data, the system must, where applicable, log the confidence level for each outcome. This score reflects how certain the model is about a specific prediction and is essential for determining whether that output is suitable for use in a GMP context.

Confidence scores must be:

  • Captured as part of the model’s output
  • Retained in system logs alongside the decision
  • Available for review during validation, routine use, and audits

This enables quality assurance teams and decision-makers to trace not only what the model decided but also how reliable that decision was, based on the model’s internal assessment.

Setting Confidence Thresholds

A critical expectation in Annex 22 is the implementation of thresholds that define when a prediction is considered acceptable. If a model produces a confidence score below a defined threshold, the system should be configured to:

  • Flag the outcome as uncertain or undecided
  • Escalate the result for manual review by a qualified operator
  • Prevent the use of that result for critical decision-making

This mechanism prevents potentially unreliable outputs from being used in GMP-relevant operations and ensures that low-confidence results do not bypass human oversight.

The threshold level must be justified during validation and documented as part of the model’s acceptance criteria. It should be aligned with the model’s intended use and the criticality of the decisions it supports.
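A minimal sketch of this mechanism: capture the model's confidence with each prediction, log it alongside the decision, and route anything below the validated threshold to manual review. The threshold value and function below are illustrative, not prescribed by the annex.

```python
import logging

logging.basicConfig(level=logging.INFO)
CONFIDENCE_THRESHOLD = 0.90   # hypothetical value, justified during validation

def classify_with_escalation(model, sample):
    """Return the model's decision, or None when confidence requires human review."""
    probabilities = model.predict_proba([sample])[0]
    decision, confidence = probabilities.argmax(), probabilities.max()
    logging.info("decision=%s confidence=%.3f", decision, confidence)  # retained in audit logs
    if confidence < CONFIDENCE_THRESHOLD:
        logging.warning("Low confidence: escalating to qualified operator review")
        return None           # result must not be used for critical decision-making
    return decision
```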

Operational Control and Lifecycle Management

Annex 22 emphasizes that the responsibilities associated with AI models do not end once validation is completed. Ongoing control, monitoring, and lifecycle management are required to maintain a validated state, ensure traceability, and preserve regulatory compliance throughout the system’s use.


Change Control and Configuration Management

Before deployment, the model and the system it is implemented within must be placed under change control. This applies not only to the model architecture but also to:

  • The software platform and infrastructure
  • The broader process that the model supports or automates
  • Any hardware or physical elements used as inputs (e.g. imaging systems or sensors)

Changes to any of these components must trigger an evaluation of potential impact on the model’s performance. Where relevant, partial or full revalidation may be required. Annex 22 expects documented justification for any decision not to re-test a model following changes that could affect its output.

In addition, the model must be protected under configuration control, with systems in place to detect unauthorized changes. This includes version control, access restriction, and audit trail functionality.

Performance Monitoring and Data Drift Detection

Ongoing performance monitoring is essential. The model’s output must be periodically reviewed using the same metrics defined during initial validation. This helps detect gradual degradation in performance or shifts caused by environmental factors such as lighting, process variations, or changes in upstream systems.

The input data itself must also be monitored to ensure it remains within the model’s validated sample space. If incoming data begins to diverge from the characteristics seen during validation, corrective action must be taken. This may involve retraining, revalidation, or restricting model use until alignment is re-established.

Metrics should be defined to monitor for:

  • Performance deviation over time
  • Shifts in input data patterns (data drift)
  • Unexpected classification distributions
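Input drift can be checked with standard statistical tests. The sketch below compares a live feature distribution against the validation-time reference using a two-sample Kolmogorov-Smirnov test; the data and significance level are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=500)   # feature values seen at validation
live      = rng.normal(loc=0.4, scale=1.0, size=500)   # recent production values (shifted)

statistic, p_value = ks_2samp(reference, live)
if p_value < 0.01:                                      # illustrative significance level
    print(f"Data drift detected (KS statistic {statistic:.3f}): "
          "trigger review, retraining, or restricted use")
```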

Human Oversight and Record Keeping in HITL Applications

Where the model supports human-in-the-loop decision-making, the human operator must retain documented responsibility for the final decision. In such cases, Annex 22 requires that records be kept of:

  • The AI-generated recommendation or output
  • The operator’s review and final decision
  • Any deviation from the model’s suggestion

Depending on process criticality and the extent of initial testing, it may be necessary to review every model output, especially during early deployment.
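One way to structure such records, sketched below, is a fixed schema that captures the model's output, the operator's final decision, and the rationale for any deviation between the two. The field names are illustrative, not prescribed by the annex.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class HitlDecisionRecord:
    """Illustrative audit record for one human-in-the-loop decision."""
    model_output: str                  # the AI-generated recommendation
    model_confidence: float
    operator_id: str
    operator_decision: str             # the final, human-made decision
    deviation_rationale: str | None    # required whenever the operator overrides the model
    timestamp: datetime

record = HitlDecisionRecord("reject", 0.87, "op-142", "accept",
                            "visual re-inspection found artefact, not defect",
                            datetime.now())
```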

Final Thoughts

Annex 22 represents a significant and deliberate step forward in aligning Artificial Intelligence technologies with the core principles of GMP. It does not promote or accelerate the adoption of AI in pharmaceutical manufacturing. Instead, it defines the regulatory boundaries and responsibilities that must be in place if AI models are to be used in GMP-critical applications.

The annex is clear in its intent: AI must not compromise patient safety, product quality, or data integrity. It may support or enhance certain decisions or classifications, but it cannot do so without being fully validated, transparent, and subject to continuous oversight. In practice, this means:

  • Models must be static, deterministic, and fully characterized
  • Validation must be rigorous, with documented acceptance criteria and data governance
  • Performance must be monitored continuously after deployment
  • The role of the human, whether as approver, reviewer, or decision-maker, remains central

For GMP professionals, the release of Annex 22 signals the beginning of a new compliance area. While many pharmaceutical companies may still be exploring the feasibility of AI, those who do pursue implementation will now have a clear regulatory framework guiding the development, validation, and lifecycle management of these systems.

Artificial Intelligence has the potential to improve precision, consistency, and throughput in pharmaceutical operations. However, Annex 22 ensures that such advancements are pursued responsibly, under controlled conditions, with the same discipline and documentation expected of any other system impacting product quality or patient safety.
