April 23, 2026

The Way Toward Responsible AI Innovation in Health Care

1. Components of a Postmarket Surveillance System

For approved devices already on the market, the FDA mandates that undesirable experiences (e.g., permanent injury, hospitalization, death) be reported by device manufacturers, device user facilities, and device importers. The agency further “encourages health care professionals, patients, caregivers and consumers to submit voluntary reports about serious adverse events that may be associated with a medical device, and use errors, product quality issues, and therapeutic failures.” A new postmarket surveillance framework enhancing the effectiveness of the FDA’s existing medical device reporting efforts should comprise the following four components, two of which (outcome monitoring and adverse event reporting) are already part of hospital safety surveillance protocols (Figure 3):

  1. Documenting adverse events (e.g., patient was prescribed the wrong medicine by an AI device)
  2. Monitoring outcomes (e.g., the rate of re-hospitalizations increased after AI technology was introduced)
  3. Identifying AI implementation issues (e.g., erroneous AI outputs occurring after an update to the device or the IT systems in which it operates)
  4. Detecting troublesome performance issues (e.g., model discrimination degradation) through periodic revalidations and/or performance monitoring of AI devices at risk for unpredictability

A new framework should not needlessly increase industry costs by subjecting all AI devices to the same level of postmarket surveillance. To improve its prospects for adoption, it should concentrate new surveillance interventions on the clinical use cases that pose the highest risk to patients and health care delivery organizations, and on the AI devices whose structural design and/or training data characteristics present a reasonable prospect for output unpredictability, given that unpredictability may not be observed during premarket review. A risk-based process for resource allocation is critical here: it prevents AI technologies that present low (or no) risk of patient harm from competing for the same limited resources as devices whose failures have far worse repercussions. Although, in theory, the postmarket surveillance discussed in this paper could be applied to AI devices beyond its proposed scope, such use would operate outside the original priorities shaping the process.

Table 2 differentiates three distinct forms of AI postmarket surveillance: existing safety practices, periodic revalidation, and performance monitoring. The recommended form (or forms) for a given device is determined by six AI medical device attributes.

  1. AI model/algorithm type. This field in the matrix has three possibilities: deterministic, probabilistic adaptive, or probabilistic nonadaptive. An algorithm is a procedure by which a function (e.g., a categorization or prediction) is accomplished. An AI model is a set of algorithms after training data has parameterized them (e.g., determined their weights and biases). A deterministic model always delivers the same output for a specific input. Its underlying computations may be rules-based or employ another algorithm type (such as linear regression) where there are fixed relations between inputs and outputs. A probabilistic adaptive model, in contrast, may produce different outputs for the same input because, as an adaptive system, its model changes over time. Such alterations, and their associated effect on outputs, are absent for a probabilistic nonadaptive model, where the probabilities generated are not subject to any input randomization or structural stochasticity (e.g., randomized data sampling). (A minimal code sketch of these three model types follows this list.)
  2. Training dataset characterization. This field has two possibilities: closed and open. A closed dataset indicates that the training data has a finite number of elements and has parameterized the AI system prior to its deployment in the market. An open dataset, in contrast, is a training data collection that not only expands after deployment but can also change AI system performance after its original training.
  3. Synthetic data in training data. This field has two possibilities: yes or no. Synthetic data is derived from AI generation as opposed to real-world data collection. When used in training data, synthetic data increases certain risks, including the possibility of model collapse (i.e., progressive degradation of the model’s desired functionality).
  4. Training data representative of patient cohorts. This field has three possibilities: yes, no, or not applicable. Training data, when derived from information produced by human beings (e.g., medical images, test results), can be representative or unrepresentative. “Yes” indicates the training data proportionally resembles the principal demographic characteristics of the patient populations served by the AI system.
  5. LLM input complexity or semantic ambiguity. This field has two possibilities: yes or no. “Yes” indicates that the AI system is or contains an LLM that can receive a verbal or textual prompt that is either complicated or semantically vague. A textual or verbal prompt to an AI system that is not an LLM, and where the prompt must match a predetermined value in order to initiate a function, would not have the potential for complexity or semantic ambiguity.
  6. Structural output unpredictability. This field has two possibilities: yes or no. “Yes” indicates that the AI system has a programming architecture (e.g., generative AI or LLM) whose structure may produce inconsistent outputs for the same inputs.
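
To make the distinction in attribute 1 concrete, here is a minimal Python sketch, with hypothetical feature names and thresholds, contrasting a deterministic rule, a probabilistic nonadaptive model with frozen parameters, and a probabilistic adaptive model that keeps updating after deployment:

```python
import math

def deterministic_model(heart_rate: float) -> str:
    """Rules-based: the same input always yields the same output."""
    return "tachycardia" if heart_rate > 100 else "normal"

class ProbabilisticNonadaptiveModel:
    """Outputs a probability, but its parameters are frozen after
    training, so identical inputs always yield identical outputs."""
    def __init__(self, weight: float = 0.02, bias: float = -1.5):
        self.weight, self.bias = weight, bias  # fixed at deployment

    def predict_proba(self, heart_rate: float) -> float:
        return 1 / (1 + math.exp(-(self.weight * heart_rate + self.bias)))

class ProbabilisticAdaptiveModel(ProbabilisticNonadaptiveModel):
    """Keeps updating its parameters on post-deployment data, so the
    same input may yield different outputs at different points in time."""
    def update(self, heart_rate: float, label: int, lr: float = 0.01) -> None:
        error = label - self.predict_proba(heart_rate)
        self.weight += lr * error * heart_rate
        self.bias += lr * error
```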

AI devices whose attributes correspond to one or more cells colored orange or red within Table 2 are the ones where the justification for postmarket surveillance is most compelling.
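
Because Table 2’s cell assignments are not reproduced here, the following sketch only illustrates, under assumptions drawn from the surrounding text (adaptive models with open datasets warrant periodic revalidation; structurally unpredictable, LLM-based, or synthetically trained devices warrant performance monitoring), how such an attribute-driven triage might be encoded:

```python
from dataclasses import dataclass

@dataclass
class DeviceProfile:
    model_type: str            # "deterministic" | "probabilistic_nonadaptive" | "probabilistic_adaptive"
    dataset: str               # "closed" | "open"
    synthetic_training_data: bool
    representative_data: str   # "yes" | "no" | "n/a"
    llm_ambiguous_input: bool
    structural_unpredictability: bool

def recommend_surveillance(p: DeviceProfile) -> set[str]:
    """Hypothetical triage; Table 2's actual cell colorings are not
    reproduced here, only the attribute-driven logic is illustrated."""
    modes = {"existing safety practices"}  # baseline for all devices
    if p.model_type == "probabilistic_adaptive" and p.dataset == "open":
        modes.add("periodic revalidation")  # adaptive AI with an open dataset
    if (p.llm_ambiguous_input or p.structural_unpredictability
            or p.synthetic_training_data):
        modes.add("performance monitoring")  # at risk for unpredictability
    return modes
```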

2. Adapting Existing Safety Practices

The way in which AI technologies are deployed, as well as the local population health context, can have a significant bearing on AI performance. AI technologies, when deployed by a health system, are governed by protocols guiding their use. These protocols may extend beyond direct technology interaction with a patient to include staff training as well as oversight and audits. Likewise, the protocols may operate alongside multiple competing protocols pertaining to the physician and other medical devices and thus be integrated within a larger workflow. Access to these protocols, and the ability to modify them, is a prerequisite for realizing opportunities for AI-facilitated health care spending reductions. This also recognizes that negative outcomes related to AI technologies may stem from the AI technology itself or from the way the AI was implemented.

Health care organizations have general mechanisms for safety reporting and patient outcome monitoring. Given the anticipated ubiquity of AI technologies (they might soon be a part of most technology systems deployed by health care providers), the most sensible and efficient approach is to adapt these existing systems to capture safety events, adverse patient outcomes, and other malfunctions related to the deployment of AI into the workflow. Such adaptation, by relying on existing systems, would treat AI technologies that are not expected to pose higher risk on par with other potential causes of adverse patient and health system experiences.
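
As one illustration of such adaptation, an existing safety report schema might be extended with AI-specific fields; all field names below are hypothetical, not drawn from any standard:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class SafetyEventReport:
    """Fields typical of an existing hospital safety report."""
    event_id: str
    occurred_at: datetime
    description: str
    severity: str  # e.g., "no harm", "temporary harm", "permanent harm", "death"

@dataclass
class AISafetyEventReport(SafetyEventReport):
    """Hypothetical AI-specific extension, so AI-related events flow
    through the same channel as other adverse experiences."""
    device_identifier: str = ""
    model_version: str = ""
    implementation_issue: bool = False  # e.g., error after an IT-system update
    erroneous_output: str = ""          # the AI output judged incorrect
```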

3. Periodic Revalidation

Periodic revalidation (contemplated in the FDA draft guidance as periodic re-evaluation) is the simpler of the two modes of proposed postmarket surveillance and is envisioned for adaptive AI with an open dataset. Being probabilistic, adaptive AI models make determinations based on likelihoods, and in the case of open training datasets, these likelihoods change (ideally improving) over time through use of real-world data. These changes manifest in the market without formal FDA review, in contrast to traditional software, where a programming update that modifies a medical device’s effectiveness typically obligates the manufacturer to file a new 510(k) submission with the FDA. According to the agency:

If a manufacturer modifies their device with the intent to significantly affect the safety or effectiveness of the device (for example, to significantly improve clinical outcomes, to mitigate a known risk, in response to adverse events, etc.), submission of a new 510(k) is likely required. A change intended to significantly affect the safety or effectiveness of the device is considered to be a change that “could significantly affect the safety or effectiveness of the device” and thus requires submission of a new 510(k) regardless of the considerations outlined below.

After adaptive AI has been deployed, periodic revalidation would repeat the testing submitted to the FDA in connection with the premarket review process. Because the device manufacturer would have already supplied the test data and the acceptance criteria for outputs corresponding to the test data, this low-effort revalidation requires neither additional data-collection expense nor the consulting labor and informatics expertise to determine what the proper outputs should be for new test data. If, however, the health system supplements (or replaces) the test data with its own, then this would no longer hold.
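
A minimal sketch of what this re-execution could look like, assuming the manufacturer-supplied tests arrive as input/accepted-output pairs (the names and the numeric comparison are illustrative):

```python
def revalidate(model, test_cases, tolerance: float = 0.0):
    """Re-execute the premarket test data against the manufacturer-supplied
    acceptance criteria and report any divergence.

    test_cases: iterable of (input, accepted_output) pairs supplied with
    the original FDA submission (structure is an assumption for this sketch).
    """
    failures = []
    for x, accepted in test_cases:
        got = model.predict(x)
        if abs(got - accepted) > tolerance:
            failures.append((x, accepted, got))
    passed = not failures
    return passed, failures
```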

Periodic revalidation would be performed at scheduled intervals by the manufacturer working in collaboration with the provider (a health system, academic medical center, etc.). If possible (given workplace constraints as well as the nature of the AI device), the first test could be conducted a month after deployment, followed by progressively longer intervals: the third month, the sixth month, the twelfth month, and annually thereafter. This schedule would identify data drift problems early in the case of very unstable adaptive models while safely moving toward lower-frequency surveillance for models that demonstrate ongoing accuracy with respect to the testing. As such, this surveillance activity can inexpensively reduce the incidence of adverse outcomes due to data drift and the liabilities that attend such events.
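
The schedule above can be expressed as a small generator; the offsets simply restate the intervals proposed in this paragraph:

```python
from itertools import count, islice

def revalidation_offsets():
    """Yield months-after-deployment for each scheduled revalidation:
    month 1, 3, 6, 12, and every twelfth month thereafter."""
    yield from (1, 3, 6, 12)
    yield from (12 * k for k in count(2))  # 24, 36, 48, ...

# First six scheduled revalidations, in months after deployment:
print(list(islice(revalidation_offsets(), 6)))  # -> [1, 3, 6, 12, 24, 36]
```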

The manufacturer, with the health system’s approval, would re-execute the testing on the deployed AI, with the health system having full access to the results of the testing. The capture of data related to these revalidations is described in a later section, where it can be related to both modes of postmarket surveillance outlined in Table 2. If, for some reason, periodic revalidation requires additional health system data, it could employ software privacy measures (e.g., access permissions assigned at the user level) so that health systems do not see the AI code and manufacturers do not see the patient data. This would preserve the confidentiality of patient data as well as the manufacturer’s intellectual property and acceptance criteria.
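
One possible shape for the user-level permission partition described above; the roles and action names are purely illustrative:

```python
# Hypothetical role-based access partition for a shared revalidation
# environment: health systems cannot read model code, manufacturers
# cannot read patient-level data.
PERMISSIONS = {
    "health_system": {"read_test_results", "read_patient_data"},
    "manufacturer":  {"read_test_results", "read_model_code",
                      "read_acceptance_criteria"},
}

def authorize(role: str, action: str) -> bool:
    return action in PERMISSIONS.get(role, set())

assert authorize("health_system", "read_test_results")
assert not authorize("health_system", "read_model_code")
assert not authorize("manufacturer", "read_patient_data")
```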

Given the various conditions that can spawn irregular outputs, there is a need for a second mode of postmarket surveillance tailored for unpredictable AI systems that would not be adequately safeguarded by periodic revalidations. This second mode, performance monitoring, distinguishes itself from periodic revalidation by continuously monitoring outputs generated from real-world inputs (as opposed to test data). Unlike a CLIA-like certification process based on the results obtained at a single point in the AI device’s history, performance monitoring focuses on unpredictability throughout an AI system’s product life cycle (as encouraged by the FDA). This life-cycle focus, along with the use of real-world data from at-risk AI systems, makes performance monitoring more practical to implement. Specifically, this approach avoids the need for:

  • new test regimes for every type of AI device in health care,
  • monitoring systems whose algorithms and datasets would not produce unpredictability, and
  • external informatics specialists to consult on test data as well as results analysis.

Performance monitoring also mitigates the risk of AI deployment delays that would emerge, due to a shortage of certification specialists, if all health care AI devices were subject to certification.

At a very basic level, performance monitoring would extend AI surveillance beyond the Sentinel Initiative and the Safe Medical Devices Act’s existing requirement that manufacturers and device user facilities report adverse events. Because not every valuable data point can be conceived of in advance (let alone preemptively stipulated, given AI’s numerous clinical settings), performance monitoring would, at a minimum, track trends for two subsets of anonymized clinical outcomes: false positives and false negatives. A false positive (for most AI systems) would be an incorrect positive diagnosis, prediction, or classification of a condition or disease. Given the issues surrounding AI output unpredictability and the diversity of AI applications, the definition of false positive should be expanded to also include errors in prediction, decision-making, and recommendations. For an LLM, however, a false positive would coincide with the previously discussed categories of hallucinations:

  • Unintelligible language outputs
  • Plausible, but factually inaccurate, claims
  • Answers that are accurate but are misaligned with the intent of the end user’s questions
  • Citations of resources that do not exist

As the list above suggests, there is no direct correlate of the false negative for LLMs, while for many other types of AI the phrase would retain its canonical definition: an incorrect determination that a condition or disease is absent. In the case of an LLM, however, the definition of false negative would still include inaccurate prescriptions or diagnoses.
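
A minimal sketch of how such false-positive and false-negative trends might be tracked over real-world outputs; the window size and alert thresholds are illustrative, not prescribed by the FDA or the proposed framework:

```python
from collections import deque

class OutcomeTrendMonitor:
    """Track rolling false-positive and false-negative rates over
    anonymized real-world outcomes and flag drift past a threshold."""
    def __init__(self, window: int = 500,
                 fp_limit: float = 0.05, fn_limit: float = 0.05):
        self.events = deque(maxlen=window)  # entries: "tp" | "tn" | "fp" | "fn"
        self.fp_limit, self.fn_limit = fp_limit, fn_limit

    def record(self, outcome: str) -> None:
        self.events.append(outcome)

    def rates(self) -> tuple[float, float]:
        n = len(self.events) or 1
        return (sum(e == "fp" for e in self.events) / n,
                sum(e == "fn" for e in self.events) / n)

    def alert(self) -> bool:
        fp, fn = self.rates()
        return fp > self.fp_limit or fn > self.fn_limit
```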

4. Aggregated Outcome Data Registry

The full value of the proposed process will not be realized unless the outcomes collected at the local level (i.e., the hospital system deploying the AI) can be aggregated and fed back to health system users and device manufacturers. Moreover, creating a standardized data architecture that is common to all (or many) AI users, while desirable, would be labor-intensive. Instead, we propose to utilize AI agents that would sit on top of outcome data collected by local users, extract the relevant information in aggregated data form, and feed it into an aggregated outcome data registry (see Figure 4). As a fundamental first step, the agents would start with extracting and aggregating data from the existing safety reports. Then, they would be trained to extract and aggregate data from periodic revalidations and performance monitoring. Although relevant data could be manually transferred from a health system user to the registry, a more automated process (whether by Application Programming Interface or AI agent) is a preferable alternative.
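
As a sketch of the automated alternative, a local agent might push only aggregated, de-identified counts to a registry endpoint; the URL and payload fields below are hypothetical:

```python
import json
from urllib import request

REGISTRY_URL = "https://registry.example.org/api/outcomes"  # hypothetical endpoint

def submit_aggregate(site_id: str, device_id: str, period: str, counts: dict):
    """Push an aggregated (de-identified) outcome summary to the registry.
    Only counts leave the local system; no patient-level data is sent."""
    payload = json.dumps({
        "site": site_id,
        "device": device_id,
        "period": period,          # e.g., "2026-04"
        "false_positives": counts.get("fp", 0),
        "false_negatives": counts.get("fn", 0),
        "adverse_events": counts.get("adverse", 0),
    }).encode()
    req = request.Request(REGISTRY_URL, data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:  # would require a real endpoint
        return resp.status
```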
