Cloud Cartography & Side Channel Attacks

August 31, 2009

By Craig Balding

Last week saw the release of a research paper from the University of California, San Diego and MIT called "Hey, You, Get Off of My Cloud: Exploring Information Leakage in Third-Party Compute Clouds" [PDF].

The abstract reads:

Third-party cloud computing represents the promise of outsourcing as applied to computation. Services, such as Microsoft’s Azure and Amazon’s EC2, allow users to instantiate virtual machines (VMs) on demand and thus purchase precisely the capacity they require when they require it. In turn, the use of virtualization allows third-party cloud providers to maximize the utilization of their sunk capital costs by multiplexing many customer VMs across a shared physical infrastructure. However, in this paper, we show that this approach can also introduce new vulnerabilities. Using the Amazon EC2 service as a case study, we show that it is possible to map the internal cloud infrastructure, identify where a particular target VM is likely to reside, and then instantiate new VMs until one is placed co-resident with the target. We explore how such placement can then be used to mount cross-VM side-channel attacks to extract information from a target VM on the same machine.

After introducing the main concepts, the threat model of the research is defined:

In our threat model, adversaries are non-provider-affiliated malicious parties. Victims are users running confidentiality-requiring services in the cloud. A traditional threat in such a setting is direct compromise, where an attacker attempts remote exploitation of vulnerabilities in the software running on the system. Of course, this threat exists for cloud applications as well. These kinds of attacks (while important) are a known threat, and the risks they present are understood.

We instead focus on where third-party cloud computing gives attackers novel abilities; implicitly expanding the attack surface of the victim. We assume that, like any customer, a malicious party can run and control many instances in the cloud, simply by contracting for them. Further, since the economies offered by third-party compute clouds derive from multiplexing physical infrastructure, we assume (and later validate) that an attacker’s instances might even run on the same physical hardware as potential victims. From this vantage, an attacker might manipulate shared physical resources (e.g., CPU caches, branch target buffers, network queues, etc.) to learn otherwise confidential information.

In this setting, we consider two kinds of attackers: those who cast a wide net and are interested in being able to attack some known hosted service, and those focused on attacking a particular victim service. The latter’s task is more expensive and time-consuming than the former’s, but both rely on the same fundamental attack.

Section (5) “Network Probing” lays the foundation for (6) “Cloud Cartography” by describing:

…an empirical measurement study focused on understanding VM placement in the EC2 system and achieving co-resident placement for an adversary. To do this, we make use of network probing both to identify public services hosted on EC2 and to provide evidence of co-residence (that two instances share the same physical server).

They then go on to describe the tools and techniques in:

“Enumerating public EC2-based web servers using external probes” (hping2, nmap, wget).
“Translating responsive public IPs to internal IPs (via DNS queries within the cloud)” (DNS).
“Launching a number of EC2 instances of varying types and surveying the resulting IP address assigned” (I assume this was scripted).

Using data from booting 4,499 (!) EC2 instances across different EC2 accounts under their control (remember that by default, an EC2 account has a 20-instance soft limit), they develop the following heuristics for mapping EC2 instances:

All IPs from a /16 are from the same EC2 availability zone (e.g., US).
A /24 inherits any included sampled instance type (e.g., small, large, x-large, etc.). If there are multiple instances with distinct types, then we label the /24 with each distinct type (i.e., it is ambiguous).
A /24 containing a Dom0 IP address only contains Dom0 IP addresses. We associate this /24 with the type of the Dom0’s associated instance (note: Dom0 is the first domain started by the hypervisor after boot).
All /24s between two consecutive Dom0 /24s inherit the former’s associated type.
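To make the first two heuristics concrete, here is a minimal sketch of how such a map could be built from your own survey data. It is my illustration, not the authors' tooling: the function names, the `zone_by_slash16` lookup table and the sample addresses are all hypothetical, and the two Dom0 heuristics are left out.

```python
import ipaddress

def label_slash24s(samples):
    """Group surveyed instances by /24 and collect the instance types seen.

    samples: iterable of (internal_ip, instance_type) pairs gathered from
    instances you launched yourself (hypothetical input format).
    A /24 with more than one type in its set is ambiguous.
    """
    labels = {}
    for ip, instance_type in samples:
        net = ipaddress.ip_network(f"{ip}/24", strict=False)
        labels.setdefault(net, set()).add(instance_type)
    return labels

def zone_of(ip, zone_by_slash16):
    """Look up the availability zone recorded for the /16 containing ip.

    zone_by_slash16: dict mapping /16 networks to zone labels, built from
    earlier observations (hypothetical). Returns None for an unmapped /16.
    """
    net = ipaddress.ip_network(f"{ip}/16", strict=False)
    return zone_by_slash16.get(net)
```

For example, feeding in two instances of different types from the same /24 yields a set with both types for that network, which is the "ambiguous" case the heuristic describes.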

They conclude this section with “cartography countermeasures.” These boil down to making local IP address assignment (from the address pool) random across instance types and availability zones and/or restricting the customer’s view.

Section (6) describes how to determine co-residence:

Given a set of targets, the EC2 map from the previous section educates choice of instance launch parameters for attempting to achieve placement on the same physical machine. Recall that we refer to instances that are running on the same physical machine as being co-resident.

In this section, we describe several easy-to-implement co-residence checks.

Looking ahead, our eventual check of choice will be to compare instances’ Dom0 IP addresses.

We confirm the accuracy of this (and other) co-residence checks by exploiting a hard-disk-based covert channel between EC2 instances.

They identify three network-based co-residence checks:

Matching Dom0 IP addresses.
Small packet round-trip times.
Numerically close internal IP addresses (e.g., within 7).
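The three checks are simple enough to express in a few lines. The sketch below is my own, assuming you have already collected the Dom0 addresses, internal IPs and round-trip times by some other means. The function names are hypothetical, `threshold_ms` has no value from the paper (you would have to calibrate it yourself), and treating "within 7" as inclusive is my reading.

```python
import ipaddress
import statistics

def same_dom0(dom0_a, dom0_b):
    """Check 1: both instances report the same Dom0 IP address."""
    return ipaddress.ip_address(dom0_a) == ipaddress.ip_address(dom0_b)

def small_rtt(rtts_ms, threshold_ms):
    """Check 2: the median round-trip time falls below a calibrated threshold.

    threshold_ms is an assumption; the median is used to dampen outliers.
    """
    return statistics.median(rtts_ms) < threshold_ms

def numerically_close(ip_a, ip_b, max_distance=7):
    """Check 3: internal IPs differ by at most max_distance as integers."""
    a = int(ipaddress.ip_address(ip_a))
    b = int(ipaddress.ip_address(ip_b))
    return abs(a - b) <= max_distance
```

Note that comparing addresses as integers means two IPs either side of a /24 boundary (say, .250 and the next block's .1) still count as close.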

Now able to determine that a given EC2 instance under their control is on the same physical hardware as the target, section (7) analyzes EC2’s VM placement strategy and explores techniques the attacker can use to achieve reliable co-residence with a target VM. In the context of a brute force placement attack, they note that “even a very naive attack strategy can successfully achieve co-residence against a not-so-small fraction of targets.”

For individual or small sets of targets, a more efficient strategy is “instance flooding” (spinning up numerous VMs) immediately after the target has booted to “take advantage of the parallel placement locality exhibited by the EC2 placement algorithms.” This is where the dynamic nature of the cloud comes into play:

But why would we expect that an attacker can launch instances soon after a particular target victim is launched? Here, the dynamic nature of cloud computing plays well into the hands of creative adversaries. Recall that one of the main features of cloud computing is to only run servers when needed. This suggests that servers are often run on instances, terminated when not needed, and later run again. So, for example, an attacker can monitor a server’s state (e.g., via network probing), wait until the instance disappears, and then if it reappears as a new instance, engage in instance flooding. Even more interestingly, an attacker might be able to actively trigger new victim instances due to the use of auto-scaling systems. These automatically grow the number of instances used by a service to meet increases in demand. (Examples include scalr [30] and RightGrid [28]. See also [6].) We believe clever adversaries can find many other practical realizations of this attack scenario.
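The "wait until the instance disappears, then reappears" step is really just edge detection over a series of probe results. Here is a rough sketch of that logic only, assuming the probe results have already been recorded as a list of booleans. The function name and input format are mine, and nothing here probes a network or launches instances.

```python
def restart_points(observations):
    """Return the indices at which a target reappears after going away.

    observations: list of booleans in time order, True when a probe got a
    response (hypothetical format). A reappearance only counts if the
    target was seen up at least once before it went down.
    """
    points = []
    seen_up = False
    was_down = False
    for i, up in enumerate(observations):
        if up:
            if seen_up and was_down:
                points.append(i)  # down -> up transition: likely a fresh launch
            seen_up = True
            was_down = False
        elif seen_up:
            was_down = True
    return points
```

Each returned index marks the moment an attacker following the strategy above would start flooding.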

With co-residence achieved, section (8) assesses the practicality of side-channel attacks in a VM environment. As you’d expect, a number of possibilities exist; however, the holy grail of cryptographic key extraction does not appear plausible (at this time). One notable quote:

The side-channel attacks we report on in the rest of this section are more coarse-grained than those required to extract cryptographic keys. While this means the attacks extract fewer bits of information, it also means they are more robust and potentially simpler to implement in noisy environments such as EC2.

Side-channel attacks discussed include:

Denial of Service.
Measuring cache usage (to gauge CPU utilization on the physical machine; or “how busy are their servers?”).
Load-based co-residence detection (detecting co-residence without relying on sending any network probes).
Estimating traffic rates (used to deduce target activity patterns, peak trading times for maximal DoS effect, etc.).
Keystroke timing attack (remote keystroke monitoring).
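To give a feel for the keystroke timing item, here is a sketch of the post-processing step only: turning a series of load measurements into gaps between activity spikes. It is my simplification, not the paper's method. The sample format, the `threshold` parameter and the function name are hypothetical, and collecting the load samples in the first place (the hard part) is not shown.

```python
def event_intervals(samples, threshold):
    """Return the time gaps between consecutive load spikes.

    samples: list of (timestamp_ms, load) pairs in time order
    (hypothetical format). A spike starts when load rises above
    threshold after being at or below it; only spike onsets are counted.
    """
    onsets = []
    above = False
    for timestamp_ms, load in samples:
        if load > threshold and not above:
            onsets.append(timestamp_ms)
        above = load > threshold
    return [later - earlier for earlier, later in zip(onsets, onsets[1:])]
```

The resulting intervals are the kind of coarse-grained signal the authors describe: far fewer bits than a cryptographic key, but potentially enough to say something about what is being typed.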

As with each of the other sections, the authors suggest potential countermeasures.

Overall, the paper makes a very interesting read. There’s no EC2 “0-day,” but that’s not the intent of the paper. Rather, we are reminded that cloud platforms and technologies do bring some novel attacks that have not really figured in much of the security conversation to date. We need more of this type of research to better understand what we are getting ourselves into.