Using RHEL confidential virtual machines to protect AI workloads on Microsoft Azure


Artificial intelligence (AI) workloads are revolutionizing industries, impacting healthcare, financial services, national security and autonomous systems. As part of this revolution, organizations are increasingly moving their AI workloads to the cloud, taking advantage of its scalability, flexibility and cost-effectiveness. Of course, this transition to the cloud brings new challenges around data privacy, intellectual property and regulatory compliance.

Existing virtual machines (VMs) provide isolation between workloads, but they cannot protect workloads from privileged users and software components, including hypervisors, system administrators or people with access to the physical hardware. This is a problem for any workload that runs on or uses sensitive data, including AI workloads, which are often used to process financial transactions, patient records or any other secret sauce an organization has.

Confidential virtual machine (CVM) technology helps address this security gap. It uses hardware-based memory encryption and isolation to protect workloads and data from any privileged entity, even the cloud’s infrastructure owner. CVMs are a powerful tool to protect the integrity and confidentiality of AI workloads running in cloud virtual environments.

In this article, we will explore the CVM solution for protecting AI workloads in public clouds and provide an example of a fraud detection AI workload running in a CVM that uses sensitive transaction data.

For additional information on Red Hat confidential VMs we recommend reading: Learn about Red Hat Confidential Virtual Machines.

An overview of confidential virtual machines

CVMs are VMs that leverage confidential computing technologies. Traditional VMs rely on software-based isolation and must trust the underlying infrastructure, such as the cloud provider's stack. CVMs, on the other hand, leverage hardware-based technologies, offering stronger guarantees for confidentiality and integrity.

Confidential computing focuses on protecting data in use.

In general, data can exist in three different states:

  • Data at rest: This is data stored in persistent storage. Currently, in order to protect data at rest, various technologies offer disk encryption mechanisms such as LUKS, which prevent anyone without the decryption key from reading the disk's contents
  • Data in transit: This is data traveling through the network. Currently, in order to protect data in transit, various technologies offer secure network connections, such as TLS (implemented, for example, by OpenSSL). Such mechanisms mean that anyone without a proper key cannot read others' network communication
  • Data in use: This is data loaded in memory to be used by the CPU. This is the target of confidential computing, as there currently is no other solution or standard to protect this data. As an example, such a mechanism means that if malicious users perform a memory dump of a running application (for example, an AI workload), they will not be able to access any secret or sensitive information being used

Note that data in use is strictly connected with the other two kinds of data, as whatever gets loaded from disk or downloaded by the network eventually ends up in memory in order to be utilized by the CPU. Therefore having encrypted storage and/or network does not offer comprehensive protection if the attacker is able to dump the memory when data is being used.
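To make the data-at-rest case concrete, here is a minimal sketch of encrypting a dataset before it is uploaded anywhere, assuming the third-party Python "cryptography" package; a LUKS-encrypted disk achieves the same goal at the block-device level.

```python
# A minimal sketch of data-at-rest protection, assuming the Python
# "cryptography" package is available.
from cryptography.fernet import Fernet

# Generate a symmetric key in a trusted environment. In the CVM scenario
# described later, this key would be registered with the Trustee server.
key = Fernet.generate_key()
cipher = Fernet(key)

plaintext = b"transaction_id,amount,merchant\n1,99.90,acme\n"  # stand-in dataset
ciphertext = cipher.encrypt(plaintext)

# The ciphertext can now be uploaded to untrusted storage: anyone who
# mounts that storage without the key sees only opaque bytes.
with open("transactions.csv.enc", "wb") as f:
    f.write(ciphertext)
```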

CVMs leverage Trusted Execution Environments (TEEs). A TEE is at the heart of a confidential computing solution. TEEs are isolated environments with enhanced security (e.g., runtime memory encryption, integrity protection), provided by confidential computing-capable hardware. CVMs provide memory encryption, where all the data stored in RAM is encrypted with keys managed by the CPU and inaccessible from outside the VM. CVMs also provide execution isolation, where the hypervisor and host operating system can't inspect or interfere with the code running in the VM. Lastly, CVMs leverage attestation tools, which provide cryptographic proof that the VM is running genuine, untampered software.

Hardware implementations of confidential computing include AMD SEV-SNP and Intel TDX.

The following diagram helps clarify the difference between VM and CVM when it comes to isolation and confidentiality:

We have three different use cases:

  1. Isolation between workloads: Provided today by VMs; protects a workload from other workloads running on the same machine or cloud infrastructure
  2. Isolation of the host from the workload: Provided today by VMs; protects the machine or cloud infrastructure from a workload that is malicious by design or due to some error
  3. Isolation of the workload from the host: Provided by CVMs; protects the workload itself from the machine or cloud infrastructure, which could potentially try to access it

By leveraging CVMs, organizations can run their AI workloads in shared cloud environments, protecting data confidentiality and model integrity while simplifying compliance with emerging strict regulations such as the Digital Operational Resilience Act (DORA).

Running an AI workload on public cloud

The public cloud has become the key platform for AI workload development and deployment. Public clouds offer immense computing power, GPU acceleration, distributed storage and more, which makes them attractive for training and serving AI workloads.

Running AI workloads on public clouds offers several key benefits compared to building on-premises solutions for organizations.

One advantage is the ability to rapidly prototype and test AI models using all the necessary hardware, both for performance and for security and confidentiality reasons, without the need to continuously upgrade your organization's infrastructure.

Another benefit is cost reduction: cloud providers offer resources on demand, allowing auto-scaling and right-sizing to help reduce costs compared with on-premises deployments. Monitoring, alerting and budgeting tools are also built into cloud offerings, helping to further reduce costs.

Cloud scaling makes it easy to shift from smaller to larger resources as organizations grow, and workloads can also be moved between regions to enable automated disaster recovery and backups.

There is one trade-off to consider, of course: you don't own the hardware. This severely limits your control over data in use. Malicious insiders, compromised administrators or competitors could potentially access data in memory or in the CPU through various privilege escalation attacks.

One solution is to run your sensitive AI workloads using confidential computing capabilities in the public cloud. This enables protected computing in untrusted environments by preventing cloud providers or third parties from accessing sensitive workloads running on hyperscalers. With confidential computing, you are also able to meet various regulatory compliance requirements, since data remains encrypted even while it is being processed.

The following diagram visualizes the difference between a VM and a CVM which takes advantage of the TEE:

Red Hat Confidential Virtual Machines

Starting with Red Hat Enterprise Linux (RHEL) 9.6, Microsoft Azure offers RHEL confidential virtual machines (CVMs) as generally available.

With this release, running a RHEL CVM has never been easier—with just a few clicks, you can deploy a confidential instance on the cloud.

RHEL CVMs now support the following key features on Azure:

  • Full support for RHEL Unified Kernel Image (UKI) including FIPS and kdump support
  • Support for both AMD SEV-SNP and Intel TDX
  • Trustee attestation client

For additional information on UKI and AMD/TDX support we recommend reading our previous articles in Learn about Red Hat Confidential Virtual Machines.

As described in the previous section, CVMs provide memory encryption and execution isolation, and they leverage attestation tools.

The solution described in this blog, which RHEL CVMs use on Azure cloud, relies on the Trustee attestation solution, part of the Confidential Containers CNCF project.

A crash course on Trustee attestation

The Trustee project provides attestation capabilities. It is responsible for performing the attestation operations and delivering secrets after successful attestation. For additional information on Trustee, we recommend reading our previous article, Introducing Confidential Containers Trustee: Attestation Services Solution Overview and Use Cases.

Trustee contains, among other things, the following key components:

  • Trustee agents: These components run inside the CVM. This includes the Attestation Agent (AA), which is responsible for sending the evidence (claims) from the TEE to prove the environment’s trustworthiness. The trustee-attestor is an example of an Attestation Agent used in the CVM.
  • Key Broker Service (KBS): This service is the entry point for remote attestation. It forwards the evidence (claims) from AA to the Attestation Service (AS) for verification and, upon successful verification, enables the delivery of secrets to the TEE.
  • Attestation Service (AS): This service validates the TEE evidence.

The following diagram shows how Trustee components interact with a RHEL CVM performing attestation and providing a secret:

The Trustee server is required to run in a trusted environment which the organization controls. For additional information about deployment aspects for Trustee we recommend reading Deployment considerations for Red Hat OpenShift Confidential Containers solution since both solutions are based on the same Trustee attestation solution.
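As an illustration of how these components fit together, the sketch below shows roughly what happens from inside the CVM. The endpoint paths and payload shapes are assumptions based on the CNCF Confidential Containers KBS protocol, not a verbatim API reference; on RHEL CVMs the trustee-attestor client performs these steps for you.

```python
# Illustrative sketch of the attestation flow from inside a CVM.
import requests

KBS_URL = "https://trustee.example.com"  # hypothetical Trustee server

# 1. The Attestation Agent collects TEE evidence (hardware-signed
#    measurements of firmware, kernel, etc.), represented here as an
#    opaque placeholder.
evidence = {"tee": "snp", "evidence": "<hardware-signed measurements>"}

# 2. The evidence goes to the KBS, which forwards it to the Attestation
#    Service for verification.
resp = requests.post(f"{KBS_URL}/kbs/v0/attest", json=evidence, timeout=30)
resp.raise_for_status()
token = resp.json()["token"]  # attestation token issued on success

# 3. With a valid token, the CVM can fetch a secret (for example, a
#    dataset decryption key) from the KBS resource store.
secret = requests.get(
    f"{KBS_URL}/kbs/v0/resource/default/dataset-key/1",
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
dataset_key = secret.content
```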

AI CVM use cases using Trustee attestation

Let’s look into a few use cases for AI leveraging CVMs and Trustee.

Protecting an AI model

In this use case, access to the AI model is tightly controlled based on the trustworthiness of the VM environment. You can use Trustee to enforce policies that verify the integrity and security posture of the CVM before granting access to the model.

For example, using Trustee, you can verify the measurements of the guest kernel, firmware and other components against pre-configured reference values as part of the remote attestation process. On success, the Trustee server issues a token or decryption key, allowing the model to be downloaded or decrypted securely inside the CVM. This approach prevents unauthorized or untrusted environments from accessing proprietary or sensitive AI models.
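As a rough sketch of that last step, assuming the attestation exchange shown earlier has already returned a model decryption key; the file name and helper function here are hypothetical:

```python
# A minimal sketch, assuming `model_key` was released by Trustee after
# successful attestation: the proprietary model artifact is only ever
# decrypted in memory inside the attested CVM.
from cryptography.fernet import Fernet

def load_protected_model(path: str, model_key: bytes) -> bytes:
    """Decrypt an encrypted model artifact in memory inside the CVM."""
    with open(path, "rb") as f:
        return Fernet(model_key).decrypt(f.read())

# model_bytes = load_protected_model("model.bin.enc", model_key)
```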

Protecting training and inference data

In this scenario, remote attestation via Trustee is used to make sure that a confidential VM is authorized to access and decrypt sensitive training or inference data.

The trustee-attestor performs remote attestation with the Trustee server to retrieve the data decryption key. Upon successful remote attestation, the Trustee server releases the decryption key or token.

A variation of the above scenario is when you want to interactively work with the AI workload and send plaintext data for inference. You must establish a secure channel with the CVM before sending sensitive data to it. Using Trustee remote attestation, you can confirm the trustworthiness of the CVM, verify that it conforms to expected software and hardware configurations, get the keys to establish the secure channel and then send any data to the VM.

A typical example is running the trustee-attestor as a oneshot systemd service unit at startup. Upon successful attestation, the VM retrieves credentials such as an SSH public key or a TLS certificate. You can then establish a secure, authenticated connection with the AI workload running in the VM and use that connection to send inference data.
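A hypothetical unit file for such a service might look like the following; the trustee-attestor flags, resource path and output file are illustrative assumptions, not the configuration shipped with RHEL.

```ini
[Unit]
Description=Attest the CVM and fetch connection credentials
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
# Retrieve an SSH public key only after successful remote attestation
ExecStart=/usr/bin/trustee-attestor --server https://trustee.example.com \
    --resource default/ssh-key/1 --outfile /root/.ssh/authorized_keys

[Install]
WantedBy=multi-user.target
```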

The diagram below depicts this variation.

Demo: Protecting inferencing data for a fraud detection model

In this section, we look at the Trustee use case of protecting inferencing data, focusing on a concrete AI scenario: fraud detection.

A fraud detection model performs offline evaluation of credit card transactions. The goal of this model is to analyze transactions by observing:

  • Geographical distance from the last transaction
  • Price compared with the median transaction prices
  • Whether the user completed the transaction using the credit card's hardware chip, whether the PIN was entered and whether it was an online transaction

In this demo, we will run a fraud detection AI model in a RHEL CVM on Azure cloud, focusing on inference with this model (we assume the model was already trained). We will leverage CVMs and Trustee attestation to obtain encrypted credit card transactions (stored data), which will then be consumed by the fraud detection model to decide whether each transaction is valid or fraudulent.
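To make the model's inputs concrete, a single transaction record covering the three feature groups above might look like the sketch below; the field names are illustrative, not the actual dataset schema.

```python
# A hypothetical transaction record with the features the model observes.
import pandas as pd

transaction = pd.DataFrame([{
    "distance_from_last_transaction_km": 412.7,  # geographical distance
    "ratio_to_median_price": 3.8,                # price vs. median spend
    "used_chip": 0,                              # hardware chip used?
    "used_pin": 0,                               # PIN entered?
    "online_order": 1,                           # online transaction?
}])
# A trained model would likely flag this row: a distant, unusually
# expensive, card-not-present purchase without chip or PIN.
```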

High level workflow

To simulate a concrete scenario, and given the model assumptions (it’s publicly available), we will focus on securely storing, transferring and processing the credit card transaction datasets.

A few important points for the fraud detection use case:

  • The datasets are encrypted in a trusted environment before being uploaded to any remote or local storage, achieving data-at-rest security. Anyone (cloud provider included) trying to mount the storage will not be able to read what's inside, since the data is encrypted.
  • The dataset is sent to the CVM over an encrypted, authenticated connection, achieving data-in-transit security. Anyone trying to intercept the data in transit between the remote storage and the CVM will not be able to make use of it, since it's encrypted.
  • The dataset is processed by the model running inside a CVM, taking advantage of memory encryption and addressing the data-in-use security aspect.
  • To prove that the CVM is actually confidential, we leverage attestation. The goal of attestation is to ensure the downloaded dataset is decrypted in the CVM only if the VM is truly confidential.

The following diagram shows the flow:

Let’s go over the steps described above:

  1. Credit card transactions are first encrypted in a user-defined secure environment
  2. The key used to encrypt the datasets is stored in the Trustee remote attester. In this example, to simplify things a bit, we used a single key for all datasets
  3. Trustee also contains the expected measurements that a CVM should produce to prove that it is actually confidential
  4. Credit card transactions are uploaded to cloud remote storage. In this demo we used two datasets, one stored in private Azure blob storage and the other in a private AWS S3 bucket
  5. The CVM on Azure public cloud, configured with root disk encryption, runs a Jupyter Notebook that contains the fraud detection model and has the credentials to download the datasets from the two clouds. The datasets are downloaded, but the key is still missing, so at this point they are just blobs of encrypted data
  6. The Jupyter Notebook asks Trustee for the key. Here is where attestation takes place: the CVM provides measurements of its components to the Trustee remote attester, which analyzes them and compares them with the expected measurements
  7. If the measurements match, the environment is considered secure and Trustee sends the key to decrypt the data back to the CVM
  8. The Jupyter Notebook also has the logic to decrypt the data with the received key
  9. The data is now decrypted and loaded into memory to be fed to the fraud detection model (a condensed sketch of this notebook logic follows below)
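The following condensed sketch ties steps 5 through 9 together as notebook code. The storage names, Trustee endpoint and key path are all hypothetical, and the attestation exchange is reduced to the same illustrative KBS-style calls used earlier; in practice the trustee-attestor client performs that handshake.

```python
import io

import pandas as pd
import requests
from azure.storage.blob import BlobClient
from cryptography.fernet import Fernet

# Step 5: download the encrypted dataset from private cloud storage.
blob = BlobClient.from_connection_string(
    "<azure-connection-string>",             # hypothetical credentials
    container_name="datasets",
    blob_name="transactions.csv.enc",
)
ciphertext = blob.download_blob().readall()  # still opaque: no key yet

# Steps 6-7: attest to Trustee; the key is released only if the CVM's
# measurements match the expected reference values.
resp = requests.post(
    "https://trustee.example.com/kbs/v0/attest",
    json={"tee": "snp", "evidence": "<hardware-signed measurements>"},
    timeout=30,
)
resp.raise_for_status()
key = requests.get(
    "https://trustee.example.com/kbs/v0/resource/default/dataset-key/1",
    headers={"Authorization": f"Bearer {resp.json()['token']}"},
    timeout=30,
).content

# Steps 8-9: decrypt in (hardware-encrypted) CVM memory and load the data.
plaintext = Fernet(key).decrypt(ciphertext)
df = pd.read_csv(io.BytesIO(plaintext))
# predictions = fraud_model.predict(df)  # feed the pre-trained model (loading omitted)
```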

Demo

Final thoughts

In this article, we discussed why the public cloud is essential for running AI workloads and outlined the benefits of using cloud vendor infrastructure. We also covered several important trade-offs to consider when storing and processing sensitive data on hardware you do not own.

With the latest hardware supporting confidential computing capabilities via TEEs, it's now possible to protect data in use. Combined with encryption for data at rest and in transit, this enables you to achieve full confidentiality and integrity for AI workloads. By incorporating a remote attestation solution like Trustee, you can verify that the CVM you're using in the public cloud has not been tampered with or compromised in any way.

To illustrate the concept, we presented a straightforward AI fraud detection use case, which you can use as inspiration for securely building your own sensitive data workloads.
