Written on August 31, 2009 by Craig Balding

Cloud Cartography & Side Channel Attacks

Last week saw the release of a research paper from the University of California, San Diego and MIT called Hey, You, Get Off of My Cloud: Exploring Information Leakage in Third-Party Compute Clouds [PDF].

The abstract reads:

Third-party cloud computing represents the promise of outsourcing as applied to computation. Services, such as Microsoft’s Azure and Amazon’s EC2, allow users to instantiate virtual machines (VMs) on demand and thus purchase precisely the capacity they require when they require it.  In turn, the use of virtualization allows third-party cloud providers to maximize the utilization of their sunk capital costs by multiplexing many customer VMs across a shared physical infrastructure. However, in this paper, we show that this approach can also introduce new vulnerabilities.
Using the Amazon EC2 service as a case study, we show that it is possible to map the internal cloud infrastructure, identify where a particular target VM is likely to reside, and then instantiate new VMs until one is placed co-resident with the target. We explore how such placement can then be used to mount cross-VM side-channel attacks to extract information from a target VM on the same machine.

After introducing the main concepts, the paper defines its threat model:

In our threat model, adversaries are non-provider-affiliated malicious parties. Victims are users running confidentiality-requiring services in the cloud. A traditional threat in such a setting is direct compromise, where an attacker attempts remote exploitation of vulnerabilities in the software running on the system. Of course, this threat exists for cloud applications as well. These kinds of attacks (while important) are a known threat and the risks they present are understood.

We instead focus on where third-party cloud computing gives attackers novel abilities; implicitly expanding the attack surface of the victim. We assume that, like any customer, a malicious party can run and control many instances in the cloud, simply by contracting for them. Further, since the economies offered by third-party compute clouds derive from multiplexing physical infrastructure, we assume (and later validate) that an attacker’s instances might even run on the same physical hardware as potential victims. From this vantage, an attacker might manipulate shared physical resources (e.g., CPU caches, branch target buffers, network queues, etc.) to learn otherwise confidential information. In this setting, we consider two kinds of attackers: those who cast a wide net and are interested in being able to attack some known hosted service and those focused on attacking a particular victim service. The latter’s task is more expensive and time-consuming than the former’s, but both rely on the same fundamental attack.

Section (5) “Network Probing” lays the foundation for (6) “Cloud Cartography” through describing…

…an empirical measurement study focused on understanding VM placement in the EC2 system and achieving co-resident placement for an adversary. To do this, we make use of network probing both to identify public services hosted on EC2 and to provide evidence of co-residence (that two instances share the same physical server)

They then go on to describe the tools and techniques in:

  • “enumerating public EC2-based web servers using external probes” (hping2, nmap, wget)
  • “translating responsive public IPs to internal IPs (via DNS queries within the cloud)” (DNS)
  • “launching a number of EC2 instances of varying types and surveying the resulting IP address assigned” (I assume this was scripted)

Using data from booting 4,499 (!) EC2 instances across different EC2 accounts under their control (remember that by default, an EC2 account has a 20-instance soft-limit), they develop the following heuristics for mapping EC2 instances:

  • All IPs from a /16 are from the same EC2 availability zone (e.g. US).
  • A /24 inherits any included sampled instance type (e.g. small, large, x-large etc). If there are multiple instances with distinct types, then we label the /24 with each distinct type (i.e., it is ambiguous).
  • A /24 containing a Dom0 IP address only contains Dom0 IP addresses. We associate to this /24 the type of the Dom0’s associated instance (note: Dom0 is the first domain started by the hypervisor after boot)
  • All /24’s between two consecutive Dom0 /24’s inherit the former’s associated type.
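The heuristics above translate fairly directly into code. Here is a minimal sketch of how I imagine the /24 labelling working; the function names, the data layout and the sample addresses in the usage note are my own assumptions, not anything taken from the paper.

```python
import bisect
import ipaddress
from collections import defaultdict


def slash24(ip):
    """Return the /24 network that contains `ip`."""
    return ipaddress.ip_network(f"{ip}/24", strict=False)


def label_slash24s(samples):
    """A /24 inherits every instance type sampled inside it.

    `samples` is an iterable of (internal_ip, instance_type) pairs gathered
    from instances launched under accounts you control (hypothetical layout).
    A /24 that ends up with more than one type is ambiguous.
    """
    labels = defaultdict(set)
    for ip, instance_type in samples:
        labels[slash24(ip)].add(instance_type)
    return dict(labels)


def infer_type_from_dom0(ip, dom0_slash24s):
    """Inherit the type of the preceding Dom0 /24.

    `dom0_slash24s` is a list of (network, instance_type) pairs for /24s
    known to hold Dom0 addresses. Returns None when `ip` sits neither in a
    Dom0 /24 nor between two consecutive Dom0 /24s.
    """
    ordered = sorted(dom0_slash24s, key=lambda pair: int(pair[0].network_address))
    keys = [int(net.network_address) for net, _ in ordered]
    target = int(slash24(ip).network_address)
    i = bisect.bisect_right(keys, target) - 1
    if i < 0:
        return None  # below the first known Dom0 /24
    if keys[i] != target and i == len(keys) - 1:
        return None  # above the last known Dom0 /24, so not "between" two
    return ordered[i][1]
```

Feeding `label_slash24s` pairs such as `("10.251.2.9", "small")` and `("10.251.2.200", "large")` (made-up addresses) would label `10.251.2.0/24` with both types, which is the ambiguous case the authors describe.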

They conclude this section with “cartography countermeasures”. This boils down to making local IP address assignment (from the address pool) random across instance types and availability zones and/or restricting the customer’s view.

Section (6) describes how to determine co-residence:

Given a set of targets, the EC2 map from the previous section educates choice of instance launch parameters for attempting to achieve placement on the same physical machine. Recall that we refer to instances that are running on the same physical machine as being co-resident.

In this section we describe several easy-to-implement co-residence checks.

Looking ahead, our eventual check of choice will be to compare instances’ Dom0 IP addresses.

We confirm the accuracy of this (and other) co-residence checks by exploiting a hard-disk-based covert channel between EC2 instances.

They identify 3 network-based co-residence checks:

  • matching Dom0 IP address
  • small packet round-trip times
  • numerically close internal IP addresses (e.g. within 7).
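To make the three checks concrete, here is a rough sketch of how they might be expressed. It assumes the Dom0 addresses and round-trip times have already been collected by some other means; the function names and the RTT threshold are my own placeholders, since the post does not give a specific cut-off.

```python
import ipaddress
from statistics import median


def same_dom0(dom0_a, dom0_b):
    """Check 1: both instances report the same Dom0 IP address."""
    return dom0_a is not None and dom0_a == dom0_b


def rtt_suggests_coresidence(rtts_ms, threshold_ms):
    """Check 2: small-packet round-trip times are unusually low.

    `threshold_ms` is an assumption the caller must calibrate, for example
    against pairs of instances already known to share a machine.
    """
    return bool(rtts_ms) and median(rtts_ms) < threshold_ms


def ips_numerically_close(ip_a, ip_b, max_gap=7):
    """Check 3: internal IP addresses are within `max_gap` of each other."""
    gap = abs(int(ipaddress.ip_address(ip_a)) - int(ipaddress.ip_address(ip_b)))
    return gap <= max_gap


def coresidence_evidence(dom0_a, dom0_b, rtts_ms, threshold_ms, ip_a, ip_b):
    """Collect the outcome of all three checks for one pair of instances."""
    return {
        "same_dom0": same_dom0(dom0_a, dom0_b),
        "low_rtt": rtt_suggests_coresidence(rtts_ms, threshold_ms),
        "close_ips": ips_numerically_close(ip_a, ip_b),
    }
```

Keeping the checks separate rather than folding them into a single yes/no mirrors the paper's approach of treating the Dom0 comparison as the check of choice and the others as supporting evidence.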

Now able to determine that a given EC2 instance under their control is on the same physical hardware as the target, section (7) analyses EC2’s VM placement strategy and explores techniques the attacker can use to achieve reliable co-residence with a target VM.   In the context of a brute force placement attack they note that “even a very naive attack strategy can successfully achieve co-residence against a not-so-small fraction of targets”.  Thus if your target set is wide, brute force turns out to be a viable strategy.  For individual or small sets of targets, a more efficient strategy is “instance flooding” (spinning up numerous VMs) immediately after the target has booted to “take advantage of the parallel placement locality exhibited by the EC2 placement algorithms”.  This is where the dynamic nature of cloud comes into play:

But why would we expect that an attacker can launch instances soon after a particular target victim is launched? Here the dynamic nature of cloud computing plays well into the hands of creative adversaries. Recall that one of the main features of cloud computing is to only run servers when needed. This suggests that servers are often run on instances, terminated when not needed, and later run again. So for example, an attacker can monitor a server’s state (e.g., via network probing), wait until the instance disappears, and then if it reappears as a new instance, engage in instance flooding. Even more interestingly, an attacker might be able to actively trigger new victim instances due to the use of auto scaling systems. These automatically grow the number of instances used by a service to meet increases in demand. (Examples include scalr [30] and RightGrid [28]. See also [6].) We believe clever adversaries can find many other practical realizations of this attack scenario.

With co-residence achieved, section (8) assesses the practicality of side channel attacks in a VM environment. As you’d expect, a number of possibilities exist; however, the holy grail of cryptographic key extraction does not appear plausible (at this time). One notable quote:

The side channel attacks we report on in the rest of this section are more coarse-grained than those required to extract cryptographic keys. While this means the attacks extract less bits of information, it also means they are more robust and potentially simpler to implement in noisy environments such as EC2.

Side channel attacks discussed include:

  • Denial of Service
  • Measure cache usage (measure CPU utilisation on the physical machine; or “how busy are their servers?”)
  • Load-based co-residence detection (aka detecting co-residence without relying on sending any network probes)
  • Estimating traffic rates (sounds harmless but can be used to deduce a target’s activity patterns, peak trading times for maximal DoS effect etc)
  • Keystroke timing attack (remote keystroke monitoring)

As with each of the other sections, the authors suggest potential countermeasures.

Overall the paper makes a very interesting read. There’s no EC2 “0-day”, but that’s not the intent of the paper. Rather, we are reminded that cloud platforms and technologies do bring some novel attacks that have not really figured in much of the security conversation to date. We need more of this type of research to better understand what we are getting ourselves into.

Written on August 01, 2009 by Craig Balding

Introducing the Cloud Security Podcast…

That’s right…you thought you couldn’t get enough Cloud Security…well, now you can hear myself and co-host Chris Hoff of Rational Survivability discuss recent cloudsec happenings.

Cloud Security Podcast Episode 1


Brief show notes:

  • Introductions
  • Quick recap of what we mean by ‘Cloud Computing’
  • Recent news & events (with a focus on FUD)
  • Groups developing cloud security guidance: Cloud Security Alliance, Enisa, Jericho
  • Wrap-up

Full show notes

As this was our first foray into making our own podcast, we’re seeking your feedback (we know about the audio drop outs).

Tell us what you think…

P.S. Submitting to iTunes shortly

Written on July 31, 2009 by Craig Balding

Tease: Interesting Tweets from Black Hat

Despite what the cynics say, Twitter is extremely valuable for tracking and participating in conversations about cloud computing/security, as well as information security in general.

For those of us that didn’t make it to Black Hat USA/Defcon, the infosec twitter community gave us the next best thing - a running commentary of the presentations - many of which feature cutting edge security research.

I was particularly interested in following the Sensepost presentation called ‘Clobbering the Cloud’. From the write-up:

Cloud Computing dominates the headlines these days but like most paradigm changes this introduces new risks and new opportunities for us to consider. Some deep technical research has gone into the underlying technologies (like Virtualization) but to some extent this serves only to muddy the waters when considering the overall threat landscape. During this talk, SensePost will attempt to separate fact from fiction while walking through several real-world attacks on “the cloud.” The talk will focus both on attacks against the cloud and on using these platforms as attack tools for general Internet mayhem. For purposes of demonstration we will focus most of our demos and attacks against the big players…

In reverse order, check out the tweets from @GphreakX who was at BH and kindly tweeting proceedings:

[Image: screenshot of tweets from @GphreakX]

Some interesting tweets there for sure! Hopefully this has whetted your appetite for the upcoming cloudsecurity.org interview with Haroon and his Sensepost team…stay tuned.

Written on July 08, 2009 by Craig Balding

Google Native Client, Google Chrome OS & Coming Out of Beta

Google just made three big announcements that reveal more about their cloud strategy, security & positioning with enterprises.

Google Chrome Operating System

Perhaps the biggest news is their plan to create a new operating system, based on the Linux kernel, running on X86 and ARM chipsets and targeted at the Netbook/Laptop/Desktop user:

“Google Chrome OS is an open source, lightweight operating system that will initially be targeted at netbooks. Later this year we will open-source its code, and netbooks running Google Chrome OS will be available for consumers in the second half of 2010.”

Talking of their goals:

“Speed, simplicity and security are the key aspects of Google Chrome OS. We’re designing the OS to be fast and lightweight, to start up and get you onto the web in a few seconds. The user interface is minimal to stay out of your way, and most of the user experience takes place on the web.”

And starting from a clean slate (and an obligatory swipe at Microsoft):

“And as we did for the Google Chrome browser, we are going back to the basics and completely redesigning the underlying security architecture of the OS so that users don’t have to deal with viruses, malware and security updates. It should just work.

<snip>

The software architecture is simple — Google Chrome running within a new windowing system on top of a Linux kernel. For application developers, the web is the platform. All web-based applications will automatically work and new applications can be written using your favorite web technologies. And of course, these apps will run not only on Google Chrome OS, but on any standards-based browser on Windows, Mac and Linux thereby giving developers the largest user base of any platform.

Google Chrome OS is a new project, separate from Android. Android was designed from the beginning to work across a variety of devices from phones to set-top boxes to netbooks. Google Chrome OS is being created for people who spend most of their time on the web, and is being designed to power computers ranging from small netbooks to full-size desktop systems. While there are areas where Google Chrome OS and Android overlap, we believe choice will drive innovation for the benefit of everyone, including Google.”

Wow, pretty big announcement with lots of potential market implications.  One way to look at this is they just described a system with “embedded OS” properties running as a mainstream desktop OS with services delivered via the web instead of relying on locally hosted applications.  I suppose in some ways this should come as no real surprise as it is entirely in-line with their cloud based strategy.

Whilst the target market would appear to be consumers, I can see enterprises jumping at a thin OS “that just works”.  Ultimately, this is moving us closer to  an age of disposable computing - low cost devices with low entry software footprints.

Organisations are keen to embrace smaller footprint client computers to cut costs and if the underlying hardware offers enterprise demanded features like full hard-disk encryption (to protect that cached Cloud content), I could see enterprises taking a serious interest.

Do we *really* want to run the dozen endpoint agents we have today for configuration management, NAC, AV, HIPS (pah!) and bear all the costs they bring? With a static client, you won’t need many of these features.  From a security point of view, this could be a very good thing - no AV headaches, significantly less attack surface (enterprise apps often demonstrate “brittle” security) and less PII to lose.

To deliver on a low-update OS, they will need to ship a subset of the Linux kernel that is considered “mature”, otherwise their users will be back on the  “patch treadmill” - which is something they explicitly state they are trying to avoid.

I find it interesting they are designing a new windowing system when there are so many options available today (some with decent security too).  I suspect this is to take advantage of advances in graphical chipsets.  Perhaps they see this as a chance to boost Chrome browser page rendering speed even further.

Perhaps the more fundamental question is whether we want Google owning the last bastion - our desktops.

This brings us to the Chrome browser itself and associated technologies.

Google Native Client

Back in February, Google kicked off a security contest for a “research project” called Google Native Client (NaCl).  First a quick recap on Native Client:

“Native Client is an open-source research technology for running x86 native code in web applications, with the goal of maintaining the browser neutrality, OS portability, and safety that people expect from web apps. We’ve released this project at an early, research stage to get feedback from the security and broader open-source communities. We believe that Native Client technology will someday help web developers to create richer and more dynamic browser-based applications. ”

This is Google’s ambitious attempt to provide a high-speed, browser hosted application alternative to Java or Flash. To do this securely, they designed a new security architecture and NaCl is the implementation.

Announcing the security contest, Henry Bridge from Google wrote:

“Exploits, bugs, vulnerabilities, security holes — for most programmers these terms are synonymous with fire drills and coding all-nighters. However, for the next 10 weeks, the Native Client team is inviting you to bring them on! We’re challenging you to find security exploits in Native Client.”

The judges, led by respected academic Ed Felten (Princeton), assessed the vulnerabilities reported by each of the 600 participants based on “a) Quality (Severity, Scope, Reliability and Style) and b) Quantity”. Participants were limited to reporting on 10 bugs (Google claimed this was to avoid wasting the judges’ time).

Mark Dowd and Ben Hawkes won the contest, finding the bulk of the best bugs. Mark Dowd is well known in the security community - most often described as a humble genius (or a robot sent back in time :).  I followed along at home and it was great fun reading the bug descriptions as the competition progressed.  As this was a new security design, there were some unique vulnerabilities discovered along with novel exploit avenues.  Despite all the implementation snafus, Google is taking comfort that no underlying architectural weaknesses were found.

“This contest helped us discover implementation errors in Native Client and some areas of our codebase we need to spend more time reviewing. More importantly, that no major architectural flaws were found provides evidence that Native Client can be made safe enough for widespread use. Toward that end, we’re implementing additional security measures, such as an outer sandbox…”

In other posts, Google has indicated the plan to bundle NaCl with the browser, rather than offer it as an end-user download. There is some way to go before this happens, and the security contest is just one step on the journey before NaCl goes live. The NaCl team also submitted a detailed technical design paper to the IEEE 2009 Symposium on Security and Privacy. If anyone knows anything on the outcome of the peer review, please leave a comment.

Overall, it has to be said that the NaCl team at Google is doing a solid job trying to flush out security issues before “Primetime”.

Having said that, not all observers agree the architecture is a step in the right direction.  Noted reverse engineer Halvar Flake responded to a post by Dave Aitel on the Daily Dave mailing list remarking that:

“The real beauty in NaCl is that it is certain to defeat DEP [Ed: Data Execution Prevention is a hardware and/or software enabled technology designed to throw an exception when an attempt is made to pass off data pages as code pages]. Not that DEP is much of an obstacle in browsers these days, but still. It’ll also almost certainly allow ASLR bypass.

Everyone who has ever been to one of my classes has been tortured with the analogy that ‘writing an exploit is like trying to build a chair out of a number of random parts from the IKEA warehouse: Nothing ever fits, but the more pieces you have, the better your odds of success are.’

The power to first execute Javascript to perform [Ed: memory] allocations/deallocations, coupled with the ability to load arbitrary code into the address space that is only verified under alignment assumptions violated as soon as you can perform a control hijack, does look like a jar of superglue to me. And when you have a sufficiently large jar of superglue, you can essentially build a chair out of wood shavings.”

The point that Halvar is making is that the exploit coder has certain advantages when it comes to exploiting browser based weaknesses. Couple this with the very feature that NaCl introduces - loading Internet hosted native code - and any single implementation weakness opens the way to reliable exploitation that bypasses CPU anti-exploitation features. This kind of dialogue is very constructive and I look forward to seeing how the thinking around NaCl develops.

Google Apps: Beta Out, Enterprise Features In

Back to the Google announcements, and the day finally came when Google dropped ‘Beta’ from Google Apps, Gmail, Google Calendar, Google Docs and Google Talk. This is clearly to please enterprise folks who take the traditional interpretation of “beta == buggy”. It’s hard for a CIO to get buy-in with their own org to adopt a hosted service that has those 4 letters staring back at them (even if they agree with Google’s definition of “beta”. “Premium beta” anyone? ;-).

Google also added email delegation, retention, DR features to Google Apps, along with “special handling of business users’ data in our data center operations.” If anyone has any details on that last point, do share.  Google is in catch-up mode and ticking the right boxes.

All in all, this was a big day for Google and their evolving Cloud strategy; enterprise security people should take note…


Written on June 28, 2009 by Craig Balding

Vulnerability Scanning and Clouds: An Attempt to Move the Dialog On…

Much has been said about public IaaS providers that expressly forbid customers from running network scans against their cloud hosted infrastructure.  Failure to comply with the Terms of Service can result in account suspension or termination (ouch!).  This post is my attempt to suggest a way forward.  I welcome your feedback…

As has been noted before, a blanket ban on legitimate scanning activity by customers of their own infrastructure (whether outsourced or not) undermines security assurance processes and can make regulatory compliance impossible; e.g. PCI DSS mandates network vulnerability scanning as a control.

Vulnerability scanning is a stalwart practice of the Information Security community. Enterprises invest considerable time and money developing vulnerability management programs to help assess IT security risk across applications and infrastructure. Specifically, vulnerability scanners help identify potential security weaknesses at scale; e.g. missing patches, default passwords, coding or configuration weaknesses.

Vulnerability scanning is front of mind for Internet exposed or partner connected infrastructure. However, when said infrastructure is owned and/or operated by a service provider, some of the existing challenges associated with vulnerability scanning are magnified:

  • Scans can cause outages. This can happen if the scanning policy includes Denial of Service checks or the scanning engine is configured with “aggressive” settings; e.g. connection entries in firewall state tables get exhausted. It’s also possible for scans to tickle obscure bugs in the target - or devices en route to the target. Even without a full-on outage, poorly configured scans can still negatively impact performance or availability for other customers of shared infrastructure.
  • Identifying unauthorised scans. Without a trusted, robust process for “blessing” or approving source IP addresses of customer scan engines, service providers cannot distinguish legitimate scans from scans with the evil bit set.  Sure, they can use whois to determine source network ownership but even if the scan originates from a customer owned network, this does not necessarily mean it is authorised!  Given this, many providers take the stance that all scans are treated as hostile unless pre-agreed.
  • Scanning may trigger automated or manual actions by the provider. A common automated response from a provider is to apply traffic shaping to slow down the scan, or simply block the client IP address via an ACL update. This can lead to false negatives; i.e. vulnerabilities present are not discovered as the scanner IP was automagically identified as a noisy vulnerability scanner and auto-throttled/blocked. Even half-smart attackers can quickly deduce the presence of auto-response mechanisms (“huh, no response now”) and so either switch to slow probes from multiple sources or go for gold with a one-shot exploit.

Enterprise customers on dedicated infrastructure at Tier 1 web hosting providers will either contract the hosting company (or their security partner) to perform vulnerability scans or do it themselves.  Either way, for scanning to happen, agreement will need to be reached on scan scope, types of scans to be run (scanning tools & policies), time windows and source IP addresses used.  Beyond that are the process issues of how results will be communicated, integration with ticketing systems etc.

The provider will limit the scan scope to the dedicated infrastructure allocated to the customer - the scanning of shared infrastructure by the customer is generally a ‘no no’. Shared infrastructure, along with management networks, will be scanned by the provider to meet customer compliance mandates or security policies.

With Cloud “Infrastructure as a Service” providers, things get a little more complicated.

  • A cloud is multi-tenant; i.e. the cloud platform is shared among multiple customers through software abstraction. The provider will naturally be concerned with the impact of any scanning activity, particularly if it causes any SLA violations.
  • Further, cloud customers can spin up infrastructure on demand. New virtual servers can be brought to life automagically to handle increased load. This increased infrastructure footprint is still subject to the same compliance mandates though; i.e. it must be scanned within some time period of its appearance. Even if you are spinning up copies of a “known good/secure” virtual machine (VM), you still need to scan them. New vulnerabilities are published all the time, along with corresponding vulnerability checks - hence the need for both regular scans and representative scans. Further, vulnerability scanning isn’t just testing the VM, it’s also helping you verify the security controls outside the VM that are designed to protect it; e.g. a provider’s software firewall. Picking and choosing which pieces of your hosted infrastructure to scan is a slippery slope to selective exposure if not handled with care.
  • Finally, we shouldn’t discount the “Clouding around” factor. Credit card payments for “instant on” infrastructure change the dynamic between cloud consumer and cloud provider. Similar to low end, consumer oriented shared hosting before it, you may never speak with, let alone meet, an employee of your provider before you use their services. There simply isn’t a conversation about scanning (the “conversation” today is a monologue found in the Terms of Service). Plus, if the provider fails to meet your needs, you can drop them at a moment’s notice and switch to another (Cloud baggage permitting…). In other words, it’s either not possible, or not convenient, to call up your provider to agree the principle and logistics of scanning the services they host on your behalf. Enterprise customers - or at least their security teams - will be wanting that conversation and can likely strike a deal with a modified ToS to allow scanning of some sort but this seems unnecessarily exclusionist to me.
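On the second point above, the “scan within some time period of its appearance” requirement is easy to track once you have launch times and scan times to hand. A minimal sketch, assuming you can pull both from your own inventory; the function name, the dictionary layout and the time limits are all hypothetical.

```python
from datetime import datetime, timedelta


def overdue_for_scan(instances, last_scanned, now, first_scan_within, rescan_every):
    """Return the IDs of instances that currently breach the scanning policy.

    instances:    {instance_id: launch time}
    last_scanned: {instance_id: time of most recent completed scan}
    An instance is overdue if it has never been scanned and has been up
    longer than `first_scan_within`, or if its latest scan is older than
    `rescan_every`.
    """
    overdue = []
    for instance_id, launched_at in instances.items():
        scanned_at = last_scanned.get(instance_id)
        if scanned_at is None:
            if now - launched_at > first_scan_within:
                overdue.append(instance_id)
        elif now - scanned_at > rescan_every:
            overdue.append(instance_id)
    return sorted(overdue)
```

Running something like this on a schedule gives you a list to feed to the scanner, which matters most when auto scaling is creating instances nobody launched by hand.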

We can address these issues through a mix of provider open-mindedness, policy, process, technology and contract.

For cloud providers to attract certain customers, they may need to soften their policy on vulnerability scanning. Taking a hardline “no” stance precludes some workloads from ever entering the cloudosphere (with bigger consequences for enterprises seeking a strategic cloud partner). A preferred scenario has the cloud provider showing some understanding of enterprise prospects’ assurance needs and defining scanning parameters acceptable to their own operations risk tolerance.

Scanning is not an “unknown” risk; rather, it’s a very well understood activity with quantifiable elements (packet rate, state table usage etc). Normal rate limiting could be temporarily or permanently loosened for customer approved IP addresses to enable scans against a customer’s cloud IP addresses (not API endpoints or the cloud provider’s websites!) to complete in a reasonable time window. Besides, Internet systems are scanned, probed and attacked constantly by script kiddies, Internet surveyors and an assortment of bots and other lifeforms. So the bad guys get to scan because they don’t care and yet the customer, who wants to do the “right thing”, is not allowed to. Is that rational?

Assuming a cloud provider with a more measured approach towards vulnerability scanning of customer cloud infrastructure, we now need a simple, mutually trusted mechanism to agree scan sources, rate limits etc. Something like a “ScanAuth” (Scan Authorize) API call offered by cloud providers that a customer can call with parameters for conveying source IP address(es) that will perform the scanning, and optionally a subset of their Cloud hosted IP addresses, scan start time and/or duration. This request would be signed by the customer’s API secret/private key as per other privileged API calls. The provider receiving the request can rely on the digital signature as proof that a scan is authorised with the associated parameters. After the provider has processed the scan authorisation request, the provider could return a status code approving or denying the request (with a possible reason code to allow resubmission with more acceptable parameters). This response can optionally include rate limits which the customer can use to tune the intensity of their scanner.
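To make the idea less abstract, here is a rough sketch of what a signed “ScanAuth” exchange could look like. No provider offers this call, so everything in it is hypothetical: the field names, the reason codes, the 240 minute cap and the choice of HMAC-SHA256 over a shared API secret (a private key signature would serve equally well).

```python
import hashlib
import hmac
import json


def sign_scan_request(params, api_secret):
    """Customer side: serialise the scan parameters and sign them."""
    payload = json.dumps(params, sort_keys=True, separators=(",", ":"))
    signature = hmac.new(api_secret, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}


def handle_scan_auth(request, api_secret, max_minutes=240, max_pps=500):
    """Provider side: verify the signature, then approve or deny.

    `max_minutes` and `max_pps` stand in for whatever limits the provider's
    operations team is comfortable with.
    """
    expected = hmac.new(
        api_secret, request["payload"].encode(), hashlib.sha256
    ).hexdigest()
    if not hmac.compare_digest(expected, request["signature"]):
        return {"status": "denied", "reason": "bad_signature"}
    params = json.loads(request["payload"])
    if params["duration_minutes"] > max_minutes:
        # A reason code lets the customer resubmit with acceptable parameters.
        return {"status": "denied", "reason": "window_too_long"}
    return {"status": "approved", "rate_limit_pps": max_pps}
```

A customer would call `sign_scan_request` with something like `{"scanner_ips": ["198.51.100.7"], "target_ips": ["203.0.113.10"], "start": "2009-07-01T02:00:00Z", "duration_minutes": 120}` and tune their scanner to the `rate_limit_pps` value that comes back.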

The provider can now whitelist the customer provided scanner IP(s) for the duration of the requested scanning window such that active countermeasures like anti-DoS controls are not triggered, resulting in a ‘cleaner’ scan (and hence a more accurate report).

Should the scanning activity exceed any specified limits, or communicate with IP addresses not associated with customer virtual machines, the provider could instantly blacklist the scanning IP or apply traffic shaping.
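The provider-side decision described in the last two paragraphs could be as simple as the sketch below. It assumes an approved scan is stored as a record of scanner IPs, target IPs, a time window and a packet rate limit; the `ScanGrant` type, the action names, and the choice of blocking out-of-scope traffic while merely throttling over-limit traffic are my own illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ScanGrant:
    """One approved scan (hypothetical record layout)."""
    scanner_ips: frozenset
    target_ips: frozenset
    start: datetime
    end: datetime
    max_pps: int


def classify_traffic(grant, src, dst, when, observed_pps):
    """Decide how to treat a flow in light of an approved scan.

    "normal"   - not covered by the grant, usual countermeasures apply
    "allow"    - whitelisted scan traffic, countermeasures suppressed
    "throttle" - in scope but over the agreed rate, apply traffic shaping
    "block"    - approved scanner touching IPs outside the agreed targets
    """
    if src not in grant.scanner_ips:
        return "normal"
    if not (grant.start <= when < grant.end):
        return "normal"
    if dst not in grant.target_ips:
        return "block"
    if observed_pps > grant.max_pps:
        return "throttle"
    return "allow"
```

Outside the agreed window the scanner IP simply falls back to normal treatment, so a forgotten grant cannot become a standing hole in the provider's defences.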

The bottom line: when everyone is clear on the need, approval process, scan parameters and abuse policy, this can be done with very little fuss.

A “ScanAuth” API call empowers the customer (or their nominated 3rd party) to scan their hosted Cloud infrastructure confident in the knowledge they won’t fall foul of the provider’s Terms of Service. This avoids a situation where a customer’s Cloud services are interrupted by an angry provider (availability fail!) or, in the worst case, the customer gets kicked off the Cloud entirely. Clearly, a lose/lose scenario.

What do you think?
