cloud computing
“Is colocation cheaper than using a cloud computing service to run the same workload?”
This quote comes from an analysis of the costs of cloud-based computing vs. traditional colocation as a function of the work load and duty cycle. This type of analysis is increasingly germane for companies that are looking to make a transition to cloud-based service providers in the hope that it will allow them to lower their overall IT costs. The results, while not surprising, do raise some interesting points. First, here’s the link:
https://vijaygill.wordpress.com/2010/08/09/cloud-economics/
The crux of the argument here is the concept of the duty cycle. The duty cycle for a deployed application is the percentage of available hardware resources that are being consumed at any given moment in time. Intuitively, a higher duty cycle corresponds to more efficient and cost-effective use of the underlying hardware. One of the fundamental promises of cloud computing is that it will allow you to run applications in a way that will produce a much higher duty cycle through elasticity, i.e. idle resources can be released without impacting the ability to scale back up in the future.
The analysis includes a spreadsheet with some hard data for one specific style of application, and the result is that an equivalent workload for this style of application would cost $118,248 at Amazon and $70,079 in a colocation facility, with the implication that a higher duty cycle can be achieved via colocation. However, this result is not as clear cut as it might seem, owing to the fact that the “application style” is an extremely subjective and important attribute for any given application. In my experience, it is rare to find an application that can be characterized in this way; instead, most applications that I have run across are inherently custom in some important way. I would think that this analysis would need to be performed on a case-by-case basis for any application that is considering a move to the cloud, and the specific result of that analysis would apply only to that application.
A subtle conclusion of this type of analysis is that duty cycle optimization for a given application should be a key criterion for cost reduction. And, somewhat conversely, if an application already has a high duty cycle then the opportunities for cost reduction through cloud-based resources will be limited at best. Or, more simply: if you already run highly-utilized servers then you might do better with collocation.
Hope this helps.
“How Big is Amazon’s Cloud Computing Business?”
Everyone seems to think that Amazon’s web services business (a.k.a. EC2, S3, and the rest of AWS) is very big and getting bigger, but Amazon stubbornly refuses to break out the AWS contribution to Amazon’s earnings. A recent blog post on GigaOm is the first that I have seen that includes some real data — both for Amazon and for the total accessible cloud services market — to estimate the size of the AWS business today and the size of the market going forward. First, here’s the link:
http://gigaom.com/2010/08/02/amazon-web-services-revenues/
The data comes from a UBS Investment Research paper, and it estimates that AWS is a $500M business in 2010. It further estimates that AWS will grow to $750M in 2011 — 50% year over year growth — reaching $2.5B, with a B, by 2014. Amazon as a whole does roughly $25B in revenue, growing in the 30%-40% range over the past year, so the numbers for AWS, while still small, are a high-growth component of Amazon’s overall business. Add in the fact the the UBS paper reports the gross margin of AWS at around 50%, vs. around 23% for Amazon as a whole, and one might draw the conclusion that the profit contribution of AWS will be a growing and very significant piece of Amazon’s pie in the years to come.
The question I have is this: Why aren’t more internet companies doing the same thing? Amazon’s results are a clear and undeniable validation of their AWS business strategy. That strategy, in a generic sense, was to build a very efficient cloud infrastructure for their own retail applications, and sell the excess capacity to the general public. They have proved that there is demand for the excess capacity and the service APIs that they provide for that capacity. And their list of customers has slowly transformed from penny-pinching startups and early adopters to a who’s who list of the largest, richest Fortune 500 companies in every business domain. And don’t forget that AWS has data centers around the world, and there is every reason to think that the demand for AWS from foreign companies will mirror growth in the US.
I can think of a bunch of older, larger internet companies that should definitely be trying to duplicate Amazon’s success. Some of them have already tried, albeit with a slightly different level of offering (e.g. Google’s App Engine, Microsoft’s Azure). But the barrier to entry, such as it is, requires only a large number of machines and the will to build the necessary cloud infrastructure systems. I’m sure someone will call BS on this statement and tell me that some serious skill is also required, and I would agree with that. But we live in a time when skill moves around a lot, and no company has a monopoly on talent. So why isn’t everybody trying to copy AWS?
Amazon EC2 I/O Performance: Local Ephemeral Disks vs. RAID 0 Striped EBS Volumes
I recently ran into an issue with I/O bandwidth in EC2 that produced some unexpected results. As part of my day job, I built a system to run clustered deployments of enterprise software within AWS. The enterprise software I chose for the prototype is, as it turns out, very sensitive to I/O bandwidth, and my first attempt at a clustered deployment using m1.large instances with only ephemeral disks did not provide enough I/O performance for the application to operate. After a few internet searches, I was convinced that I needed to use EBS volumes, probably in a RAID 0 striped volume configuration, to effectively increase the I/O performance through parallelism of the network volumes. We also got some guidance from Amazon to this effect, so it appeared to be, by far, our best bet to solve the problem.
So I converted my prototype to use RAID 0 4X-striped EBS volumes. And the I/O performance was different, but still not enough for the application. Uh-oh.
To better understand the problem, we decided that the next step would be to quantify the I/O performance of the different volume options, in the hope that we could ascertain how to tweak the configuration to get enough performance. A colleague of mine ran a variety of tests with iometer and produced the following data that captures read and write performance for six different volume configurations: a single EBS volume, a 4X EBS striped volume, a 6X EBS striped volume, an m1.large local ephemeral drive, an m1.xlarge local ephemeral drive, and “equivalent hardware” which, roughly speaking, corresponds to a server box that should be similar to a large EC2 instance in terms of performance and which is not heavily optimized for I/O. Some of these choices deserve some explanation. We looked at both m1.large and m1.xlarge instances because we were told that the m1.xlarge instances had a larger slice of I/O bandwidth from the underlying hardware, and we wanted to see how much larger that slice actually is, given the cost difference. The EBS-based volumes ran on m2.large instances, as we were told that these instances had a larger slice of real I/O bandwidth as well as a larger network bandwidth slice that would better support the EBS volumes.
So here is the hard data:
A few things jumped out at us when we saw these numbers:
- The performance of the 6X EBS striped volume was significantly worse than the 4X EBS striped volume. Going from 4X to 6X must saturate something that effectively causes contention and reduced performance. Some information online suggests that EBS network performance can be quite variable, but we got consistent results over several runs.
- The EBS-based volumes all had better read performance than the local ephemeral disk-based volumes in both IOPS and Mbps. The local ephemeral disk-based volumes had much better write performance in IOPS, and similar write performance in Mbps.
- Neither the EBS-based volumes nor the local ephemeral disk-based volumes could stand up to a real piece of dedicated hardware. Which is reasonable, given what EC2 is. But which also implies that it is not a good idea to run I/O intensive operations in EC2. And hardware optimized for I/O would probably blow away the performance of our unoptimized “equivalent hardware” box.
Hopefully this data can point you in the right direction if you are dealing with I/O issues in EC2. We’re hoping for a little magic from Amazon on this issue later this year (and I can’t say anything more about that). In the interim, I/O issues will continue to require some trial and error to find a configuration that best matches your particular I/O profile.
Hope this helps.
“We create 5 exabytes every two days.”
This quote comes to us courtesy of Google’s Eric Schmidt, who was offering his top 10 reasons why mobile is #1 in the following InformationWeek article:
http://www.informationweek.com/news/global-cio/interviews/showArticle.jhtml?articleID=224400178
To paraphrase the context of this quote, all of the information created between the beginning of time and 2003 was about 5 EB, give or take (exabyte – 1018 bytes, or 1 Billion GB ). But today, according to him, “we” create 5 EB every 2 days, and it is not clear if this is the majestic plural “we”, as in Google, or “we” as in the entire world. But 5 EB is an awful lot of data (those drunken cat videos on YouTube are really starting to add up…).
Which brings me to my point: the signal-to-noise ratio of that 5 EB is probably low and getting lower every day. And I think we’re starting to get some hard data to prove it. For example, at Chirp, Twitter founder Biz Stone mentioned that Twitter has 105M+ registered users, with 300k new users each day, and 3B+ requests to their API each day. That is a lot of users and a lot of traffic. But how many of those users are active? 20%? 80%? According to the following article, 5% of those Twitter users are responsible for 75% of all traffic, with a rapid drop off in activity beyond 5%:
http://www.readwriteweb.com/archives/twitters_most_active_users_bots_dogs_and_tila_tequila.php
If the premise of this article is true, then only about 20% of Twitter users are active in a measurable way. And the most active 5% are clogging the pipes with an enormous volume of Tweets — from which we might infer that the signal-to-noise ratio of those Tweets is not very high. Ask yourself this question: what percentage of Tweets that you see are relevant to you in any way? These numbers will only get worse, information-density-wise, as more people sign up for Twitter but otherwise don’t use it in a measurable (or relevant) way.
But back to the original quote. If Google is processing 5 EB of new data every two days, then they definitely need to keep building new data centers in the Columbia River gorge and elsewhere where power is inexpensive. But perhaps what they really need is better filtering technology to drop the 80% of that 5 EB that they don’t really need to keep. Maybe Google needs to keep 5 EB every 2 days to ensure that the 1 EB of good data buried within is not lost, but if they could separate the wheat from the chaff, then they could (1) build fewer data centers, which their investors would really like, and (2) deliver information to us with a much higher signal-to-noise ratio, which we would really, really like.
Google made a name for itself by making search results more relevant than any company that came before them; we can only hope that they (and Twitter, and Facebook, et al.) will continue to pursue information relevance with similar zeal as they process those 5 EB every two days. Otherwise, we should all plan for a whole lot more drunken cat videos.
“You are not Google. (or: you don’t really need NoSQL…)”
It’s great to see a thoughtful articulation of the other side of the “everybody needs to dump their SQL database” argument in this blog post:
http://teddziuba.com/2010/03/i-cant-wait-for-nosql-to-die.html
To paraphrase the author’s argument, the vast majority of applications out there will simply never see the load that would require a move to a NoSQL solution. Thinking about scalability is a very good thing to do, but the choice to make the move still needs to flow from a rational decision making process.
I see this all the time in discussions I have about scalability and cloud computing. Everyone wants to claim that they are scalable (because scalable == smart in the current lexicon), but they aren’t backing this up with a rational basis for why their (relatively small) application needs massive, search-engine-class scalability.
“The largest cloud providers are botnets.”
I ran across an interesting article that compares the largest of the botnets — the Conficker botnet — with the largest web application providers and the size of their infrastructure. The article uses the term “cloud” in a way that I don’t necessarily agree with, since they use it to describe the overall size of a company’s infrastructure (e.g. according to the article, Google has 500,000 servers) rather than the size of that company’s infrastructure that is available for use as part of a genuine cloud-based service. But, overall, the article does illuminate the fact that botnets have an architecture that is very similar to that of cloud-based service providers, leveraging a virtual datacenter comprised of compromised hardware around the globe. And the sheer size of the Conficker botnet, which, according the article, is over ten times the size of the largest web application provider, would make it, by far, the largest cloud provider in the world by an order of magnitude.
The article can be found here: http://www.networkworld.com/community/node/58829
This article is also noteworthy, I think, because it includes some sizing data for the largest players in the industry. If true, this data represents one of the first genuine, accurate comparisons of the infrastructure behind the biggest names on the web.
Observed Performance of Amazon EC2 Instances
A thread has been emerging surrounding the observed performance of EC2 instances and the possibility that Amazon is experiencing capacity issues as their business continues to grow. Three excellent articles on this topic are linked below:
http://www.datacenterknowledge.com/archives/2010/01/14/amazon-we-dont-have-cloud-capacity-issues/
http://alan.blog-city.com/has_amazon_ec2_become_over_subscribed.htm
https://www.cloudkick.com/blog/2010/jan/12/visual-ec2-latency/
This is a question that I receive often in my day job, so I have a few comments to add to this thread. First, if you have used EC2 then you know that Amazon explicitly refuses to quantify the performance that you are entitled to receive in real-world units. Instead, they have created qualitative terms — “compute units” for CPU performance, “moderate” bandwidth, etc. — that provide a measure of comparison against the other levels of services that Amazon provides. In and of itself, these qualitative designators are not a problem, except when trying to determine the potential variance between two elements that should have the same qualitative performance. For example, any two “large” EC2 instances running the same operating system should, in theory, provide the same measurable performance against a reasonable benchmark. In practice, however, there is a degree of variance between elements that should be the same; testing across a sample of “large” EC2 instances resulted in variations in the results of the SPECjvm2008 benchmark of ~30%. This would seem to indicate that some “large” instances are better than others.
Second, if you have ever asked Amazon any question about their existing capacity, their existing utilization, or their rate of capacity growth, then you probably received a polite “we don’t break out results or data for our AWS unit” response. But there are some hard data points then can be detected externally. For example, if you have attempted to start a large number of EC2 instances simultaneously then you might have seen an error message stating that not enough instances of the requested type were available. In my experience, this response has been exceptionally rare, and it is a testament to Amazon’s capacity planning that this response is, in fact, very rare. It does provide a hard data point, however, to detect when EC2 becomes oversubscribed, and one would expect the frequency of this response to increase if EC2 were oversubscribed. In the thread thus far, I have not read that anyone has detected a measurable increase in the frequency of this response, and thus one might conclude that EC2 utilization continues to be below the saturation level, or at least it remains similar to historical levels.
And lastly, I do have one comment about network performance. I have observed network latency issues of the type described in the cloudkick blog, but only for smaller EC2 instance types that could be assumed to be sharing hardware and network interface ports with other EC2 instances. I have not observed latency issues with larger EC2 instance types that could be assumed to consume an entire piece of hardware. I am clearly making some unsubstantiated assumptions here, but my guess is that the observed network latency issues could be a sharing or starvation issue in the virtualization infrastructure, rather than a true network capacity problem. But I would stress that this is just an educated guess, so your mileage may vary.
Hope this helps.
Cloud Computing and Mobile Devices
The explosive proliferation of mobile devices — smartphones, netbooks, and tablets — presents new challenges for software development. These devices have limited screen size, limited CPU and memory resources, and most importantly, limited power; these constraints will complicate the direct migration of existing thick client desktop software products to these devices. Computationally expensive applications will be very sensitive to these constraints, given that most devices employ CPU throttling to conserve power and to increase longevity for other functions, and CPU use on these devices will need to be minimized wherever possible.
Future advances in technology may alleviate some of these concerns, but battery technology has traditionally failed to keep pace with Moore’s Law, especially in cases of miniaturization, and thus the power concerns, and by extension the CPU concerns, may persist.
Cloud computing provides a potential solution for these concerns. In terms of power consumption, cloud computing provides a source of remote CPU cycles that do not consume device power, and these remote CPU cycles can be used to enable computationally expensive applications to operate on devices with a significantly lower net device power cost. Network power consumption will be marginally increased when using cloud-based resources, but the research hypothesis is that overall device power consumption will be significantly reduced.
I submit that a new application paradigm for these devices will need to evolve from the seeds of cloud computing, web applications, and client/server software in order to minimize device power consumption while otherwise providing a recognizable application user interface. The tools and infrastructure for these applications must be designed to maintain a bidirectional stream of data, visuals, and interface actions between a device and a cloud-based application provider, and to do so in a manner that will be both cost effective and beneficial for the power consumption of the device.
Time and Clock Issues in Windows-Based EC2 Instances
I’ve recently observed some anomalies in Windows-based EC2 instances that I think are worth sharing. The primary issue appears to affect the clock setting on some of the instances, but my guess is that there is an underlying hardware-dependent bug in the virtualization layer that is the cause of this issue and some other related side-effects (more on those later).
I built and I continue to operate a public facing enterprise software-as-a-service (SaaS) product for my company. Behind the scenes, I run both Linux-based and Windows-based EC2 instances to facilitate the functions of the service, and at times we might run large numbers of EC2 instances concurrently. Lately, I’ve started to see some Windows-based instances that intermittently boot with an incorrect clock setting. The time zone appears to be incorrect, and the time on the box is UTC instead of PST. The time, in hours/minutes/seconds, appears to be otherwise correct, but it is off by eight hours due to the time zone. Ordinarily, I wouldn’t be too worried about this, but in my case I use the S3 API on the instance, and the S3 API calculates a security signature for all requests that incorporates the current time of the machine in the algorithm. If the time setting of the S3 caller differs by more than 15 minutes from the time setting of the S3 servers, then the request will be bounced by the server with the following error message:
The difference between the request time and the current time is too large
The failure of this S3 request is a major problem for my application, so I dug into this to see what was going on. As far as I can tell, Windows is failing to sync up with time.windows.com via NTP and the result of this error seemed to be the incorrect time zone setting (although I can’t tell you why exactly that error would result in that outcome). I switched to an alternate NTP server at nist.gov — time-nw.nist.gov — and this appears to resolve the issue.
Some further searching has led me to believe that there may be an underlying hardware-related bug in the virtualization layer that affects the ability of Windows to access the time.windows.com service, as well as affecting some other applications (e.g. Cygwin, and Bash and Perl running via Cygwin), and this bug is observed only on AMD hardware. This is just a guess, however, and I can not say conclusively that this is a hardware-related issue since there is no transparency into EC2. But if you are running an application that is sensitive to the clock setting on an EC2 instance, then you should pay attention to the time zone setting of your instances as they boot.
Hope this helps.
My experimental local and real-time search engine is now available
I built another experimental search engine to play with some new ways to collect local and real-time information. The site is called screamradius. It has been doing well so far, and my experiments and enhancements continue.
More new features to come. Stay tuned!
Recent Posts
- “Is colocation cheaper than using a cloud computing service to run the same workload?”
- “How Big is Amazon’s Cloud Computing Business?”
- Amazon EC2 I/O Performance: Local Ephemeral Disks vs. RAID 0 Striped EBS Volumes
- “We create 5 exabytes every two days.”
- “Go Screw Yourself, Apple.”
- “You are not Google. (or: you don’t really need NoSQL…)”
- “The largest cloud providers are botnets.”
- Observed Performance of Amazon EC2 Instances
- Cloud Computing and Mobile Devices
- Time and Clock Issues in Windows-Based EC2 Instances
- My experimental local and real-time search engine is now available
- Entropy in Cloud Computing Applications
- How to Jailbreak iPhone 3.01
- How to Detect the Front (Home) Page of a Wordpress Blog
- How to Create an Amazon EC2 AMI That is Larger Than 10GB

