
Annex 22 Draft: Regulatory Guidance on AI Use in GMP


The integration of Artificial Intelligence (AI) into pharmaceutical manufacturing has prompted regulatory authorities to take a proactive stance in defining its acceptable use under GMP. As AI models become increasingly capable of supporting decision-making processes, the need for regulatory clarity has grown urgent, particularly in areas impacting patient safety, product quality, and data integrity.

To address this, the European Commission has introduced Annex 22 to Volume 4 of the EU GMP Guidelines, a new and standalone annex dedicated to the governance of AI models within GMP-regulated environments. 

Which AI Models Are Allowed, and Which Are Not, Under the Draft Annex 22

Annex 22 represents a targeted response to a disruptive technology. Its focus is narrow yet critical: it applies to AI systems embedded in computerised systems used in manufacturing, but only when these systems are involved in critical operations that directly affect compliance or risk.

While Annex 11 already outlines expectations for computerised systems, Annex 22 expands upon it by introducing AI-specific requirements, particularly for machine learning (ML) models that are trained, rather than explicitly programmed, to classify or predict data. 

This article provides a regulatory interpretation of Annex 22, intended for GMP professionals who must evaluate, validate, and govern AI technologies under compliance frameworks. It dissects the draft’s scope, expectations, and risk management approach, and offers practical guidance for regulated companies seeking to navigate this evolving space.

Scope and Regulatory Boundaries of Annex 22

Annex 22 applies to computerised systems used in the manufacturing of medicinal products and active substances when these systems incorporate Artificial Intelligence models in critical applications. 

These are applications that have a direct impact on patient safety, product quality, or data integrity. The annex provides specific guidance on AI functionality within the broader framework of Annex 11.

The scope is deliberately limited to static AI models. These are models that do not continue to learn or adapt after deployment. Their performance remains fixed, and they generate deterministic outputs, meaning they always produce the same result when provided with the same input. This predictability is essential in a GMP environment where reproducibility and control are fundamental regulatory requirements.
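To make the determinism expectation concrete, a simple check can be built into acceptance testing: run the same inputs through the frozen model several times and fail if the outputs ever differ. The following Python sketch is illustrative only; the function name and inputs are not taken from the annex.

```python
import numpy as np

def assert_deterministic(predict_fn, inputs: np.ndarray, n_runs: int = 3) -> None:
    """Fail if a frozen model ever returns different outputs for identical inputs."""
    reference = predict_fn(inputs)
    for _ in range(n_runs - 1):
        if not np.array_equal(reference, predict_fn(inputs)):
            raise AssertionError(
                "Non-deterministic output detected: model is not suitable "
                "for a GMP-critical application under draft Annex 22"
            )
```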

Annex 22 explicitly excludes certain types of AI models from critical GMP applications:

  • Dynamic or adaptive models that modify their behavior during operation based on new data are not permitted. These models lack the fixed behavior necessary to ensure traceability and consistent validation.
  • Probabilistic models that may produce different outputs for the same input are also excluded. This introduces variability that cannot be reconciled with GMP expectations for reliability and control.
  • Generative AI and Large Language Models (LLMs) are not acceptable in critical applications. These tools, by design, produce novel outputs and operate with probabilistic reasoning, making them unsuitable for environments that demand validated and repeatable performance.

However, these excluded AI types may be used in non-critical GMP applications, provided their output does not influence decisions affecting safety, quality, or data integrity. In such cases, there must be a qualified human responsible for evaluating the output, following the “human-in-the-loop” principle. 

Even in non-critical scenarios, companies are encouraged to apply elements of the Annex 22 guidance to manage risk appropriately.

From a regulatory perspective, this clear boundary-setting reflects a risk-based philosophy. AI can only be used in critical processes when it is fully understood, validated, and controlled in a manner consistent with existing GMP principles. Annex 22 makes it clear that AI must not introduce uncertainty or reduce accountability in pharmaceutical manufacturing.

| Aspect | Annex 11 (Traditional Systems) | Annex 22 (AI Models) |
|---|---|---|
| Validation | Based on fixed logic | Based on trained data patterns |
| Output | Deterministic or logic-based | Must be deterministic |
| Change Control | Code and documents tracked | Model behavior and training data traced |
| Explainability | Not explicitly required | Mandatory for critical applications |
| Confidence Scores | Not applicable | Required where relevant |

Foundational Principles for AI in GMP Systems

The foundational principles in Annex 22 reflect core GMP expectations: clarity of roles, documented oversight, and risk-based control. When Artificial Intelligence is used in regulated environments, these principles must guide its implementation across all stages of the model lifecycle.

Key regulatory expectations include:

  • Multidisciplinary involvement: SMEs, QA, IT, and data science personnel must collaborate on model design, training, testing, and deployment. Each must have defined roles and appropriate qualifications.
  • Documentation and traceability: All activities related to the AI model, including training, validation, and testing, must be documented, regardless of whether performed internally or outsourced. Records should be reviewed by the regulated user and maintained in line with GMP documentation practices.
  • Access control and responsibility assignment: Role-based access must be defined and enforced. Access levels should align with the individual’s responsibilities and must support segregation of duties.
  • Application of Quality Risk Management (QRM): The level of oversight and control should correspond to the potential impact on patient safety, product quality, and data integrity. Decisions must be supported by documented risk assessments consistent with ICH Q9(R1).

These principles emphasize that AI systems are not exempt from GMP controls. On the contrary, their complexity requires enhanced diligence in governance and validation to ensure compliance throughout their lifecycle.

Intended Use of Annex 22 – Defining and Documenting

A central requirement of Annex 22 is the formal definition and documentation of an AI model’s intended use. This ensures the model is applied within a well-understood, justified scope and is not introduced into GMP-critical operations without regulatory control.

Describing the Intended Use

The model’s intended function must be clearly stated and supported by documented process knowledge. This includes:

  • The specific task to be automated or supported by the AI model
  • The process context in which it will operate
  • A detailed description of the input data, including all common and rare variations, potential limitations, and bias risks

This documentation must be reviewed and approved by the process subject matter expert before acceptance testing begins.

Subgroup Identification and Justification

When applicable, the input data space should be divided into relevant subgroups. This supports more accurate performance assessment and validation. Subgroups may be based on:

  • Output decisions (e.g. accept or reject)
  • Site or equipment-specific process variations
  • Product or material characteristics
  • Task-specific classifications, such as defect types or severity levels

Human-in-the-Loop (HITL) Configurations

If the AI model serves as a decision-support tool and a human operator is responsible for the final decision, this interaction must be documented. The operator’s responsibilities must be:

  • Clearly defined in procedural documents
  • Supported by adequate training
  • Monitored for consistency and performance, like any manual GMP operation

This structure ensures the AI model does not bypass human accountability and remains under full regulatory oversight.

| Responsibility | GMP Requirement |
|---|---|
| Role definition | Documented in SOPs |
| Training | Verified and aligned with model function |
| Oversight | Output must be reviewed when confidence is low |
| Records | Human decisions must be traceable and retained |
| Accountability | Final responsibility must remain with the human operator |

Establishing Acceptance Criteria

The Draft Annex 22 places significant emphasis on the performance evaluation of AI models through predefined, documented acceptance criteria. These criteria must be defined before any testing begins and must reflect both the intended use and the risk to product quality, data integrity, and patient safety.

Defining Performance Metrics

The first step in establishing acceptance criteria is selecting appropriate test metrics. These must be tailored to the specific task the model is intended to perform. For classification models, common metrics may include:

  • Sensitivity and specificity
  • Accuracy
  • Precision and recall
  • F1 score
  • Confusion matrix parameters

The chosen metrics must be relevant, measurable, and capable of demonstrating that the model performs reliably across all defined input variations and subgroups.
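As an illustration, the metrics listed above can be computed with standard tooling such as scikit-learn. The labels below are invented accept(0)/reject(1) outcomes, not data from the annex.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical reference labels and model predictions (0 = accept, 1 = reject)
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                 # true reject rate
specificity = tn / (tn + fp)                 # true accept rate
precision   = precision_score(y_true, y_pred)
recall      = recall_score(y_true, y_pred)   # identical to sensitivity here
f1          = f1_score(y_true, y_pred)
accuracy    = (tp + tn) / (tp + tn + fp + fn)
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, F1={f1:.2f}")
```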

Setting Acceptance Thresholds

Once metrics are selected, acceptance thresholds must be established. These thresholds define what level of performance is considered acceptable for the model to be used in its intended GMP application. The process subject matter expert is responsible for defining and approving these thresholds prior to testing.

Where input subgroups have been identified, separate acceptance criteria may be applied to ensure the model performs adequately across all relevant process conditions.
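One way to operationalise this, sketched below with invented numbers, is to store the SME-approved thresholds alongside the metrics computed for each subgroup and verify all of them before the model is released for use.

```python
# Hypothetical SME-approved acceptance thresholds, fixed before testing begins
THRESHOLDS = {"sensitivity": 0.98, "specificity": 0.95}

# Metrics computed per input subgroup (invented values for illustration)
subgroup_metrics = {
    "site_A": {"sensitivity": 0.99, "specificity": 0.97},
    "site_B": {"sensitivity": 0.97, "specificity": 0.96},
}

failures = {
    group: {m: v for m, v in metrics.items() if v < THRESHOLDS[m]}
    for group, metrics in subgroup_metrics.items()
}
failures = {g: f for g, f in failures.items() if f}
if failures:
    # e.g. site_B sensitivity 0.97 falls below the 0.98 threshold
    print(f"Acceptance criteria NOT met: {failures}")
```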

No Decline in Performance

Critically, the performance of the AI model must be equal to or better than the process it replaces. This means the regulated company must first understand and quantify the existing process performance before introducing the model. 

This expectation aligns with Annex 11, which requires companies to demonstrate that new systems do not introduce additional risk or reduce control.

| Validation Element | GMP Expectation |
|---|---|
| Intended use definition | Detailed, SME-approved |
| Performance metrics | Relevant and risk-based |
| Acceptance thresholds | Justified and documented |
| Test dataset | Representative, stratified, labeled |
| Independence | Separated from training, with access control |
| Explainability | Input features influencing outputs must be reviewable |
| Confidence scoring | Logged and thresholded for use |

Test Data Requirements

Annex 22 sets clear expectations regarding the selection, composition, and governance of test data used to evaluate AI models in GMP-regulated environments. These requirements are designed to ensure that test results are meaningful, reproducible, and representative of real-world process conditions.


Representative and Stratified Data

Test data must reflect the full range of input conditions defined in the model’s intended use. This includes not only typical inputs but also edge cases and process variability. The dataset should be stratified to include subgroups where relevant, such as:

  • Differences between manufacturing sites or equipment
  • Variations in raw materials or product characteristics
  • Categories of classification outcomes (e.g. accept/reject, defect type)
  • Task-specific operational ranges

This structure allows for more granular performance assessment and supports decisions regarding model suitability and reliability.
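A quick way to verify that a candidate test set actually covers the defined subgroups is to tabulate it, as in the pandas sketch below; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical test-set metadata: each row describes one test sample
test_set = pd.DataFrame({
    "site":    ["A", "A", "B", "B", "A", "B"],
    "outcome": ["accept", "reject", "accept", "reject", "reject", "accept"],
})

# Tabulate coverage per subgroup; empty or tiny cells signal representativeness gaps
coverage = test_set.groupby(["site", "outcome"]).size().unstack(fill_value=0)
print(coverage)
```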

Adequate Volume and Statistical Confidence

The dataset must be large enough to support statistically valid performance evaluations. This applies to the full set as well as to each identified subgroup. Without sufficient data, it is not possible to draw reliable conclusions about the model’s behavior under different operating conditions.
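For example, the uncertainty around an observed sensitivity can be quantified with a Wilson confidence interval; a wide interval signals that a subgroup is too small to support a reliable conclusion. The counts below are invented for illustration.

```python
from statsmodels.stats.proportion import proportion_confint

# Hypothetical: 194 of 200 reject samples correctly identified in one subgroup
correct, total = 194, 200
low, high = proportion_confint(correct, total, alpha=0.05, method="wilson")
print(f"observed sensitivity {correct/total:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```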

Accurate and Verified Labelling

All test data must be accurately labelled, and the method of verification must ensure a high level of confidence in the reference results. Acceptable approaches include:

  • Independent verification by multiple trained experts
  • Use of validated analytical methods or equipment
  • Cross-checking against controlled historical datasets

The integrity of test results depends directly on the reliability of the data labels used during evaluation.
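Where multiple trained experts label the same samples independently, their agreement can be quantified before the labels are accepted. Cohen's kappa is one common statistic for this; the sketch below uses invented labels.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical independent labels from two trained experts (0 = accept, 1 = reject)
expert_a = [1, 0, 1, 1, 0, 0, 1, 0]
expert_b = [1, 0, 1, 0, 0, 0, 1, 0]

kappa = cohen_kappa_score(expert_a, expert_b)
print(f"inter-rater agreement (Cohen's kappa): {kappa:.2f}")  # low values warrant label review
```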

Controlled Pre-processing and Data Exclusion

Any data transformation prior to model input, such as normalization, scaling, or encoding, must be predefined and justified. Pre-processing should reflect the conditions under which the model will be used. 

Similarly, if any data are excluded from the test set, the reason for exclusion must be documented. Arbitrary removal of data points is not acceptable and may be viewed as compromising the objectivity of the evaluation.
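In practice, pre-processing steps can be predefined and frozen together with the model, for example by packaging them in a single pipeline whose parameters are fitted once on training data and never refitted afterwards. The sketch below uses scikit-learn and synthetic data purely for illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 3)), rng.integers(0, 2, size=100)

# Normalisation is part of the frozen artefact: fitted once, versioned with the model
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
]).fit(X_train, y_train)

# At test and run time, the same predefined transformation is applied automatically
predictions = pipeline.predict(rng.normal(size=(5, 3)))
```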

Restrictions on Generated or Synthetic Data

Annex 22 advises against the use of artificially generated data, such as those produced by generative AI, for testing purposes in GMP-critical applications. If such data are used, their inclusion must be justified, and the rationale clearly documented. In most cases, real-world data from controlled environments will provide a more reliable and auditable basis for validation.

Test Data Independence

Annex 22 introduces rigorous expectations for maintaining the independence of test data used in model evaluation. This requirement is fundamental to ensuring that performance results are unbiased and that the AI model is not inadvertently validated on data it has already been exposed to during training or development.


Data Separation and Access Control

To preserve test integrity, test data must be entirely independent of the data used during model training or validation. This can be achieved through two primary approaches:

  • Capturing test data only after model training and validation are complete
  • Splitting a dataset into separate training, validation, and test portions before any development activity begins

Regardless of the approach, access to test data must be tightly controlled. Personnel involved in model development must not have access to the test set. Where full segregation is not feasible, a four-eyes principle must be applied, ensuring that any work involving test data is performed jointly with a team member who has had no prior exposure to it.
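The second approach, splitting the dataset up front, can be implemented as in the sketch below: the split is performed once, with a recorded seed, before any development starts, and the test portion is then locked away under access control. The proportions, seed, and data are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 8)), rng.integers(0, 2, size=1000)

# One-time split before any development activity; seed recorded for traceability
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, stratify=y_dev, random_state=42)

# X_test / y_test are now stored under restricted access,
# out of reach of the model development team
```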

Auditability and Traceability

Test data must be traceable and auditable. This includes:

  • Documenting which data have been used for testing
  • Recording when and how the data were accessed
  • Maintaining audit trails for any changes made to test datasets

If physical samples are used as part of the test data, companies must ensure that these items were not previously used in model training or validation, unless it can be demonstrated that the features influencing the model are independent of prior use.
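A simple supporting control, sketched below, is to fingerprint the test dataset when it is locked and to re-verify the checksum whenever the data are used, so that any modification becomes detectable. The file path is a placeholder.

```python
import hashlib
from pathlib import Path

def dataset_fingerprint(path: str) -> str:
    """SHA-256 checksum of a test dataset file, recorded when the set is locked."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# Recorded once at lock time, re-checked before every test run (path is illustrative):
# locked = dataset_fingerprint("test_set_v1.csv")
# assert dataset_fingerprint("test_set_v1.csv") == locked, "test set has been modified"
```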

Controlling Staff Involvement

To minimize bias, Annex 22 emphasizes the need to separate personnel responsibilities. Individuals with access to test data should not participate in model development unless procedural safeguards are in place. If complete separation cannot be achieved, collaboration must be structured in a way that preserves independence and prevents result manipulation.

Explainability Requirements

One of the core regulatory expectations introduced in Annex 22 is that AI models used in GMP-critical applications must be explainable. This means that the basis for any prediction or decision must be transparent and scientifically interpretable, not just to data scientists, but also to quality assurance, auditors, and regulatory authorities.

This principle addresses a key challenge in AI deployment: the use of so-called “black box” models, where the logic behind outcomes is not easily accessible. Annex 22 moves to restrict this by requiring that traceable justifications support critical decisions made or influenced by AI.

Identification of Influential Features

AI systems must be capable of capturing and recording the specific input features that influenced a given outcome. Whether the model accepts, rejects, classifies, or flags a product, the rationale must be visible. This is particularly important for models used in areas like visual inspection, defect classification, or process control, where regulatory accountability is high.

Techniques to support feature identification may include:

  • Feature attribution tools, such as SHAP (Shapley Additive Explanations) or LIME (Local Interpretable Model-Agnostic Explanations)
  • Visual methods, such as heat maps or overlays, in cases where image-based models are used

The use of such tools must be appropriate to the application and integrated into the model’s evaluation and approval process.
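As an illustration of feature attribution, the third-party shap library can wrap a trained model and report how much each input feature contributed to an individual prediction. The model and data below are synthetic, chosen so that one feature visibly drives the outcome.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)          # feature 0 drives the synthetic outcome

model = RandomForestClassifier(random_state=0).fit(X, y)

# Model-agnostic explainer: per-prediction contribution of every input feature
explainer = shap.Explainer(model.predict, X)
attributions = explainer(X[:5])
print(attributions.values)             # feature 0 should dominate
```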

Review and Justification of Feature Relevance

The features identified as influencing model outcomes must be reviewed to ensure they are scientifically and process-relevant. This review process should be performed by qualified subject matter experts and documented as part of model approval.

The aim is to confirm that the model is basing its decisions on valid process or product characteristics, and not on irrelevant or misleading patterns that could introduce regulatory risk. For example, a model classifying tablets based on surface quality should not rely on background lighting artifacts or unrelated visual cues.

Confidence Scores and Thresholds

In addition to accuracy and explainability, Annex 22 requires that AI models used in GMP-critical applications provide insight into the level of confidence associated with each prediction or classification. This requirement introduces a quantitative dimension to decision-making transparency and supports risk-based evaluation of model outputs.

Recording Confidence Scores

When an AI model classifies or predicts data, the system must, where applicable, log the confidence level for each outcome. This score reflects how certain the model is about a specific prediction and is essential for determining whether that output is suitable for use in a GMP context.

Confidence scores must be:

  • Captured as part of the model’s output
  • Retained in system logs alongside the decision
  • Available for review during validation, routine use, and audits

This enables quality assurance teams and decision-makers to trace not only what the model decided but also how reliable that decision was, based on the model’s internal assessment.

Setting Confidence Thresholds

A critical expectation in Annex 22 is the implementation of thresholds that define when a prediction is considered acceptable. If a model produces a confidence score below a defined threshold, the system should be configured to:

  • Flag the outcome as uncertain or undecided
  • Escalate the result for manual review by a qualified operator
  • Prevent the use of that result for critical decision-making

This mechanism prevents potentially unreliable outputs from being used in GMP-relevant operations and ensures that low-confidence results do not bypass human oversight.

The threshold level must be justified during validation and documented as part of the model’s acceptance criteria. It should be aligned with the model’s intended use and the criticality of the decisions it supports.
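A minimal sketch of this mechanism: capture the model's confidence with each prediction, log it alongside the decision, and route anything below the validated threshold to manual review. The threshold value and function below are illustrative, not prescribed by the annex.

```python
import logging

logging.basicConfig(level=logging.INFO)
CONFIDENCE_THRESHOLD = 0.90   # hypothetical value, justified during validation

def classify_with_escalation(model, sample):
    """Return the model's decision, or None when confidence requires human review."""
    probabilities = model.predict_proba([sample])[0]
    decision, confidence = probabilities.argmax(), probabilities.max()
    logging.info("decision=%s confidence=%.3f", decision, confidence)  # retained in audit logs
    if confidence < CONFIDENCE_THRESHOLD:
        logging.warning("Low confidence: escalating to qualified operator review")
        return None           # result must not be used for critical decision-making
    return decision
```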

Operational Control and Lifecycle Management

Annex 22 emphasizes that the responsibilities associated with AI models do not end once validation is completed. Ongoing control, monitoring, and lifecycle management are required to maintain a validated state, ensure traceability, and preserve regulatory compliance throughout the system’s use.


Change Control and Configuration Management

Before deployment, the model and the system it is implemented within must be placed under change control. This applies not only to the model architecture but also to:

  • The software platform and infrastructure
  • The broader process that the model supports or automates
  • Any hardware or physical elements used as inputs (e.g. imaging systems or sensors)

Changes to any of these components must trigger an evaluation of potential impact on the model’s performance. Where relevant, partial or full revalidation may be required. Annex 22 expects documented justification for any decision not to re-test a model following changes that could affect its output.

In addition, the model must be protected under configuration control, with systems in place to detect unauthorized changes. This includes version control, access restriction, and audit trail functionality.

Performance Monitoring and Data Drift Detection

Ongoing performance monitoring is essential. The model’s output must be periodically reviewed using the same metrics defined during initial validation. This helps detect gradual degradation in performance or shifts caused by environmental factors such as lighting, process variations, or changes in upstream systems.

The input data itself must also be monitored to ensure it remains within the model’s validated sample space. If incoming data begins to diverge from the characteristics seen during validation, corrective action must be taken. This may involve retraining, revalidation, or restricting model use until alignment is re-established.

Metrics should be defined to monitor for:

  • Performance deviation over time
  • Shifts in input data patterns (data drift)
  • Unexpected classification distributions
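Input drift can be checked with standard statistical tests. The sketch below compares a live feature distribution against the validation-time reference using a two-sample Kolmogorov-Smirnov test; the data and significance level are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=500)   # feature values seen at validation
live      = rng.normal(loc=0.4, scale=1.0, size=500)   # recent production values (shifted)

statistic, p_value = ks_2samp(reference, live)
if p_value < 0.01:                                      # illustrative significance level
    print(f"Data drift detected (KS statistic {statistic:.3f}): "
          "trigger review, retraining, or restricted use")
```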

Human Oversight and Record Keeping in HITL Applications

Where the model supports human-in-the-loop decision-making, the human operator must retain documented responsibility for the final decision. In such cases, Annex 22 requires that records be kept of:

  • The AI-generated recommendation or output
  • The operator’s review and final decision
  • Any deviation from the model’s suggestion

Depending on process criticality and the extent of initial testing, it may be necessary to review every model output, especially during early deployment.
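One way to structure such records, sketched below, is a fixed schema that captures the model's output, the operator's final decision, and the rationale for any deviation between the two. The field names are illustrative, not prescribed by the annex.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class HitlDecisionRecord:
    """Illustrative audit record for one human-in-the-loop decision."""
    model_output: str                  # the AI-generated recommendation
    model_confidence: float
    operator_id: str
    operator_decision: str             # the final, human-made decision
    deviation_rationale: str | None    # required whenever the operator overrides the model
    timestamp: datetime

record = HitlDecisionRecord("reject", 0.87, "op-142", "accept",
                            "visual re-inspection found artefact, not defect",
                            datetime.now())
```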

Final Thoughts

Annex 22 represents a significant and deliberate step forward in aligning Artificial Intelligence technologies with the core principles of GMP. It does not promote or accelerate the adoption of AI in pharmaceutical manufacturing. Instead, it defines the regulatory boundaries and responsibilities that must be in place if AI models are to be used in GMP-critical applications.

The annex is clear in its intent: AI must not compromise patient safety, product quality, or data integrity. It may support or enhance certain decisions or classifications, but it cannot do so without being fully validated, transparent, and subject to continuous oversight. In practice, this means:

  • Models must be static, deterministic, and fully characterized
  • Validation must be rigorous, with documented acceptance criteria and data governance
  • Performance must be monitored continuously after deployment
  • The role of the human, whether as approver, reviewer, or decision-maker, remains central

For GMP professionals, the release of Annex 22 signals the beginning of a new compliance area. While many pharmaceutical companies may still be exploring the feasibility of AI, those who do pursue implementation will now have a clear regulatory framework guiding the development, validation, and lifecycle management of these systems.

Artificial Intelligence has the potential to improve precision, consistency, and throughput in pharmaceutical operations. However, Annex 22 ensures that such advancements are pursued responsibly, under controlled conditions, with the same discipline and documentation expected of any other system impacting product quality or patient safety.
