Thursday, December 10, 2009

Read-After-Write Consistency in Amazon S3

S3 has an "eventual consistency" model, which presents certain limitations on how S3 can be used. Today, Amazon released an improvement called "read-after-write-consistency" in the EU and US-west regions (it's there, hidden at the bottom of the blog post). Here's an explanation of what this is, and why it's cool.


What is Eventual Consistency?

Consistency is a key concept in data storage: it describes when changes committed to a system are visible to all participants. Classic transactional databases employ various levels of consistency, but the gold standard is that after a transaction commits the changes are guaranteed to be visible to all participants. A change committed at millisecond 1 is guaranteed to be available to all views of the system - all queries - immediately thereafter.

Eventual consistency relaxes the rules a bit, allowing a time lag between the point the data is committed to storage and the point where it is visible to all others. A change committed at millisecond 1 might be visible to all immediately. It might not be visible to all until millisecond 500. It might not even be visible to all until millisecond 1000. But, eventually it will be visible to all clients. Eventual consistency is a key engineering tradeoff employed in building distributed systems.

One issue with eventual consistency is that there's no theoretical limit to how long you need to wait until all clients see the committed data. A delay must be employed (either explicitly or implicitly) to ensure the changes will be visible to all clients.

Practically speaking, I've observed that changes committed to S3 become visible to all within less than 2 seconds. If your distributed system reads data shortly after it was written to eventually consistent storage (such as S3) you'll experience higher latency as a result of the compensating delays.
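
The compensating delay described above is often implemented as a retry loop. Here's a minimal sketch of that idea. It is not S3-specific code: read_when_visible, read_fn and the retry parameters are hypothetical names chosen for illustration, and the in-memory LaggyStore only simulates a store whose writes become visible after a few read attempts.

```python
import time


class NotYetVisible(Exception):
    """Raised by a reader when the requested key is not visible yet."""


def read_when_visible(read_fn, key, attempts=5, delay_seconds=0.5, sleep=time.sleep):
    """Call read_fn(key) until it succeeds or the attempts run out.

    The sleep between attempts is the explicit "compensating delay".
    """
    for attempt in range(attempts):
        try:
            return read_fn(key)
        except NotYetVisible:
            if attempt == attempts - 1:
                raise  # give up: the data never became visible in time
            sleep(delay_seconds)


class LaggyStore:
    """Toy stand-in for eventually consistent storage (an assumption, not S3)."""

    def __init__(self, lag_reads):
        self._lag_reads = lag_reads  # number of reads that fail after a write
        self._data = {}
        self._reads_seen = {}

    def write(self, key, value):
        self._data[key] = value
        self._reads_seen[key] = 0

    def read(self, key):
        if key not in self._data:
            raise NotYetVisible(key)
        self._reads_seen[key] += 1
        if self._reads_seen[key] <= self._lag_reads:
            raise NotYetVisible(key)
        return self._data[key]
```

Every failed attempt adds delay_seconds to the latency of the read, which is exactly the cost that a stronger consistency guarantee removes.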


What is Read-After-Write Consistency?

Read-after-write consistency tightens things up a bit, guaranteeing immediate visibility of new data to all clients. With read-after-write consistency, a newly created object or file or table row will immediately be visible, without any delays.

Note that read-after-write is not complete consistency: there's also read-after-update and read-after-delete. Read-after-update consistency would allow edits to an existing file or changes to an already-existing object or updates of an existing table row to be immediately visible to all clients. That's not the same thing as read-after-write, which is only for new data. Read-after-delete would guarantee that reading a deleted object or file or table row will fail for all clients, immediately. That, too, is different from read-after-write, which only relates to the creation of data.


Why is Read-After-Write Consistency Useful?

Read-after-write consistency allows you to build distributed systems with less latency. As touched on above, without read-after-write consistency you'll need to incorporate some kind of delay to ensure that the data you just wrote will be visible to the other parts of your system.

But no longer. If you use S3 in the US-west or EU regions, your systems need not wait for the data to become available.


Why Only in the AWS US-west and EU Regions?

Read-after-write consistency for AWS S3 is only available in the US-west and EU regions, not the US-Standard region. I asked Jeff Barr of AWS blogging fame why, and his answer makes a lot of sense:
This is a feature for EU and US-West. US Standard is bi-coastal and doesn't have read-after-write consistency.
Aha! I had forgotten about the way Amazon defines its S3 regions. US-Standard has servers on both the east and west coasts (remember, this is S3 not EC2) in the same logical "region". The engineering challenges in providing read-after-write consistency in a smaller geographical area are greatly magnified when that area is expanded. The fundamental physical limitation is the speed of light: a signal takes at least 16 milliseconds to cross the US coast-to-coast (that's in a vacuum - it takes at least four times as long over the internet due to the latency introduced by routers and switches along the way).

If you use S3 and want to take advantage of the read-after-write consistency, make sure you understand the cost implications: the US-west and EU regions have higher storage and bandwidth costs than the US-Standard region.


Next Up: SQS Improvements?

Some vague theorizing:

It's been suggested that AWS Simple Queue Service leverages S3 under the hood. The improved S3 consistency model can be used to provide better consistency for SQS as well. Is this in the works? Jeff Barr, any comment? :-)

Friday, December 4, 2009

The Open Cloud Computing Interface at IGT2009

Today I participated in the Cloud Standards & Interoperability panel at the IGT2009 conference, together with Shahar Evron of Zend Technologies, and moderated by Reuven Cohen. Reuven gave an overview of his involvement with various governments on the efforts to define and standardize "cloud", and Shahar presented an overview of the Zend Simple Cloud API (for PHP). I presented an overview of the Open Grid Forum's Open Cloud Computing Interface (OCCI).

The slides include a 20,000-foot view of the specification, a 5,000-foot view of the specification, and an eye-level view in which I illustrated the metadata travelling over the wire using the HTTP Header rendering.

Here's my presentation.

Tuesday, November 17, 2009

How to Work with Contractors on AWS EC2 Projects

Recently I answered a question on the EC2 forums about how to give third parties access to EC2 instances. I noticed there's not a lot of info out there about how to work with contractors, consultants, or even internal groups to whom you want to grant access to your AWS account. Here's how.


First, a Caveat


Please be very selective when you choose a contractor. You want to make sure you choose a candidate who can actually do the work you need - and unfortunately, not everyone who advertises as such can really deliver the goods. Reuven Cohen's post about choosing a contractor/consultant for cloud projects examines six key factors to consider:
  1. Experience: experience solving real world problems is probably more important than anything else.
  2. Code: someone who can produce running code is often more useful than someone who just makes recommendations for others to follow.
  3. Community Engagement: discussion boards are a great way to gauge experience, and provide insight into the capabilities of the candidate.
  4. Blogs & Whitepapers: another good way to determine a candidate's insight and capabilities.
  5. Interview: ask the candidate questions to gauge their qualifications.
  6. References: do your homework and make sure the candidate really did what s/he claims to have done.
Reuven's post goes into more detail. It's highly recommended for anyone considering using a third-party for cloud projects.


What's Your Skill Level?


The best way to allow a contractor access to your resources depends on your level of familiarity with the EC2 environment and with systems administration in general.

If you know your way around the EC2 toolset and you're comfortable managing SSH keypairs, then you probably already know how to allow third-party access safely. This article is not meant for you. (Sorry!)

If you don't know your way around the EC2 toolset, specifically the command-line API tools, and the AWS Management Console or the ElasticFox Firefox Extension, then you will be better off allowing the contractor to launch and configure the EC2 resources for you. The next section is for you.


Giving EC2 Access to a Third Party


[An aside: It sounds strange, doesn't it? "Third party". Did I miss two parties already? Was there beer? Really, though, it makes sense. A third party is someone who is not you (you're the first party) and not Amazon (they're the counterparty, or the second party). An outside contractor is a third party.]

Let's say you want a contractor to launch some EC2 instances for you and to set them up with specific software running on them. You also want them to set up automated EBS snapshots and other processes that will use the EC2 API.


What you should give the contractor


Give the contractor your Access Key ID and your Secret Access Key, which you should get from the Security Credentials page.



The Access Key ID is not a secret - but the Secret Access Key is, so make sure you transfer it securely. Don't send it over email! Use a private DropBox or other secure method.

Don't give out the email address and password that allows you to log into the AWS Management Console. You don't want anyone but you to be able to change the billing information or to sign you up for new services. Or to order merchandise from Amazon.com using your account (!).


What the contractor will do


Using ElasticFox and your Access Key ID and Secret Access Key, the contractor will be able to launch EC2 instances and make all the necessary configuration changes on your account. Plus they'll be able to put these credentials in place for automated scripts to make EC2 API calls on your behalf, such as taking an EBS snapshot. [There are some rare exceptions which will require your X.509 Certificates and the use of the command-line API tools.]

For example, here's what the contractor will do to set up a Linux instance:
  1. Install ElasticFox and put in your access credentials, allowing him access to your account.
  2. Set up a security group allowing him to access the instance.
  3. Create a keypair, saving the private key to his machine (and to give to you later).
  4. Choose an appropriate AMI from among the many available. (I recommend the Alestic Ubuntu AMIs).
  5. Launch an instance of the chosen AMI, in the security group, using the keypair.
  6. Once the instance is launched he'll SSH into the instance and set it up. He'll use the instance's public IP address and the private key half of the keypair (from step 3), and the user name (most likely "root") to do this.
The contractor can also set up some code to take EBS snapshots - and the code will require your credentials.
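
To give a feel for what such snapshot code might look like, here's a minimal sketch. It assumes the present-day boto3 library rather than the tools available when this was written, and snapshot_volume is a hypothetical helper name. The function takes the EC2 client as an argument so that no credentials appear in the code itself; boto3 can read them from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables, which keeps them easy to change later.

```python
def snapshot_volume(ec2_client, volume_id, description):
    """Request an EBS snapshot and return the new snapshot's ID.

    ec2_client is expected to behave like a boto3 EC2 client.
    """
    response = ec2_client.create_snapshot(VolumeId=volume_id, Description=description)
    return response["SnapshotId"]


# Example wiring (needs the boto3 package, network access and valid credentials;
# the region and volume ID below are placeholders, not real values):
#
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-west-1")
#   print(snapshot_volume(ec2, "vol-0123456789abcdef0", "nightly backup"))
```

Because the credentials live outside the script, replacing them when the engagement ends does not require editing the code.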


What deliverables to expect from the contractor


When he's done, the contractor will give you a few things. These should include:
  • the instance ids of the instances, their IP addresses, and a description of their roles.
  • the names of any load balancers, auto scaling groups, etc. created.
  • the private key he created in step 3 and the login name (usually "root"). Make sure you get this via a secure communications method - it allows privileged access to the instances.
Make sure you also get a thorough explanation of how to change the credentials used by any code requiring them. In fact, you should insist that this must be easy for you to do.

Plus, ask your contractor to set up the Security Groups so you will have the authorization you need to access your EC2 deployment from your location.

And, of course, before you release the contractor you should verify that everything works as expected.


What to do when the contractor's engagement is over


When your contractor no longer needs access to your EC2 account you should create new access key credentials (see the "Create a new Access Key" link on the Security Credentials page mentioned above).

But don't disable the old credentials just yet. First, update any code the contractor installed to use the new credentials and test it.

Once you're sure the new credentials are working, disable the credentials given to the contractor (the "Make Inactive" link).


The above guidelines also apply to working with internal groups within your organization. You might not need to revoke their credentials, depending on their role - but you should follow the suggestions above so you can if you need to.

Tuesday, October 27, 2009

What Language Does the Cloud Speak, Now and In the Future?

You're a developer writing applications that use the cloud. Your code manipulates cloud resources, creating and destroying VMs, defining storage and networking, and gluing these resources together to create the infrastructure upon which your application runs. You use an API to perform these cloud operations - and this API is specific to the programming language and to the cloud provider you're using: for example, for Java EC2 applications you'd use typica, for Python EC2 applications you'd use boto, etc. But what's happening under the hood, when you call these APIs? How do these libraries communicate with the cloud? What language does the cloud speak?

I'll explore this question for today's cloud, and touch upon what the future holds for cloud APIs.


Java? Python? Perl? PHP? Ruby? .NET?

It's tempting to say that the cloud speaks the same programming language whose API you're using. Don't be fooled: it doesn't.

"Wait," you say. "All these languages have Remote Procedure Call (RPC) mechanisms. Doesn't the cloud use them?"

No. The reason why RPCs are not provided for every language is simple: would you want to support a product that needed to understand the RPC mechanism of many languages? Would you want to add support for another RPC mechanism as a new language becomes popular?

No? Neither do cloud providers.

So they use HTTP.


HTTP: It's a Protocol

The cloud speaks HTTP. HTTP is a protocol: it prescribes a specific on-the-wire representation for the traffic. Commands are sent to the cloud and results returned using the internet's most ubiquitous protocol, spoken by every browser and web server, routable by all routers, bridgeable by all bridges, and securable by any number of different methods (HTTP + SSL/TLS being the most popular, a.k.a. HTTPS). RPC mechanisms cannot provide all these benefits.

Cloud APIs all use HTTP under the hood. EC2 actually has two different ways of using HTTP: the SOAP API and the Query API. SOAP uses XML wrappers in the body of the HTTP request and response. The Query API places all the parameters into the URL itself and returns the raw XML in the response.
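
To make the Query API style concrete, here's a small sketch that assembles such a URL. build_query_url is a hypothetical helper, and the sketch deliberately leaves out the authentication and signature parameters that a real EC2 request needs, so the result illustrates only the shape of the call.

```python
from urllib.parse import urlencode


def build_query_url(endpoint, action, params=None):
    """Put the action and its parameters into the URL's query string."""
    query = {"Action": action}
    query.update(params or {})
    return endpoint + "/?" + urlencode(query)


# The action name travels in the URL itself, not in the request body.
url = build_query_url("http://ec2.amazonaws.com", "DescribeRegions")
```

A library such as boto or typica does essentially this on your behalf, adds the signature, sends the request over HTTP, and parses the XML that comes back.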

So, the lingua franca of the cloud is HTTP.

But EC2's use of HTTP to transport the SOAP API and the Query API is not the only way to use HTTP.


HTTP: It's an API

HTTP itself can be used as a rudimentary API. HTTP has methods (GET, PUT, POST, DELETE) and return codes and conventions for passing arguments to the invoked method. While SOAP wraps method calls in XML, and Query APIs wrap method calls in the URL (e.g. http://ec2.amazonaws.com/?Action=DescribeRegions), HTTP itself can be used to encode those same operations. For example:
GET /regions HTTP/1.1
Host: cloud.example.com
Accept: */*
That's a (theoretical) way to use raw HTTP to request the regions available from a cloud located at cloud.example.com. It's about as simple as you can get for an on-the-wire representation of the API call.

Using raw HTTP methods we can model a simple API as follows:
  • HTTP GET is used as a "getter" method.
  • HTTP PUT and POST are used as "setter" or "constructor" methods.
  • HTTP DELETE is used to delete resources.
All CRUD operations can be modeled in this manner. This technique of using HTTP to model a higher-level API is called Representational State Transfer, or REST. RESTful APIs are mapped to the HTTP verbs and are very lightweight. They can be used directly by any language (OK, any language that supports HTTP - which is every useful language) and also by browsers directly.
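
The mapping above can be sketched in a few lines using only Python's standard library. This is an illustration under assumptions: cloud.example.com is the same theoretical cloud as in the earlier example, build_request is a hypothetical helper, and assigning "create" to POST and "update" to PUT is one common convention rather than the only valid one.

```python
import urllib.request

# One common CRUD-to-HTTP mapping (a convention, assumed here for illustration).
CRUD_TO_HTTP = {
    "create": "POST",
    "read": "GET",
    "update": "PUT",
    "delete": "DELETE",
}


def build_request(operation, base_url, resource_path, body=None):
    """Build (but do not send) the HTTP request for a CRUD operation."""
    return urllib.request.Request(
        base_url + resource_path,
        data=body,
        headers={"Accept": "*/*"},
        method=CRUD_TO_HTTP[operation],
    )


# Equivalent to the raw "GET /regions" request shown earlier.
request = build_request("read", "http://cloud.example.com", "/regions")
# Sending it would be: urllib.request.urlopen(request)
```

Nothing here is specific to any one cloud: the resource is named by the URL and the operation by the HTTP method.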

RESTful APIs are "close to the metal" - they do not require a higher-level object model in order to be usable by servers or clients, because bare HTTP constructs are used.

Unfortunately, EC2's APIs are not RESTful. Amazon was the undisputed leader in bringing cloud to the masses, and its cloud API was built before RESTful principles were popular and well understood.


Why Should the Cloud Speak RESTful HTTP?

Many benefits can be gained by having the cloud speak RESTful HTTP. For example:
  • The cloud can be operated directly from the command-line, using curl, without any language libraries needed.
  • Operations require less parsing and higher-level modeling because they are represented close to the "native" HTTP layer.
  • Cache control, hashing and conditional retrieval, alternate representations of the same resource, etc., can be easily provided via the usual HTTP headers. No special coding is required.
  • Anything that can run a web server can be a cloud. Your embedded device can easily advertise itself as a cloud and make its processing power available for use via a lightweight HTTP server.
All these benefits are important enough to be provided by any cloud API standard.


Where are Cloud API Standards Headed?

There are many cloud API standardization efforts. Some groups are creating open standards, involving all industry stakeholders and allowing you (the developer) to use them or implement them without fear of infringing on any IP. Some of them are not open, where those guarantees cannot be made. Some are language-specific APIs, and others are HTTP-based APIs (RESTful or not).

The following are some popular cloud APIs:

jClouds
libcloud
Cloud::Infrastructure
Zend Simple Cloud API
Dasein Cloud API
Open Cloud Computing Interface (OCCI)
Microsoft Azure
Amazon EC2
VMware vCloud
deltacloud

Here's how the above products (APIs) compare, based on these criteria:

Open: The specification is available for anyone to implement without licensing IP, and the API was designed in a process open to the public.
Proprietary: The specification is either IP encumbered or the specification was developed without the free involvement of all ecosystem participants (providers, ISVs, SIs, developers, end-users).
API: The standard defines an API requiring a programming language to operate.
Protocol: The standard defines a protocol - HTTP.



This chart shows the following:
  • There are many language-specific APIs, most open-source.
  • Proprietary standards are the dominant players in the marketplace today.
  • OCCI is the only completely open standard defining a protocol.
  • Deltacloud was begun by Red Hat and is currently open, but its initial development was closed and did not involve players from across the ecosystem (hence its location on the border between Open and Proprietary).

What Does This Mean for the Cloud Developer?

The future of the cloud will have a single protocol that can be used to operate multiple providers. Libraries will still exist for every language, and they will be able to control any standards-compliant cloud. In this world, a RESTful API based on HTTP is a highly attractive option.

I highly recommend taking a look at the work being done in OCCI, an open standard that reflects the needs of the entire ecosystem. It'll be in your future.

Update 27 October 2009:
Further Reading
No mention of cloud APIs would be complete without reference to William Vambenepe's articles on the subject:

Saturday, October 17, 2009

Avoiding EC2 InsufficientInstanceCapacity: Insufficient Capacity Errors

Here's a quick tip from this thread on the AWS EC2 Developer Forums.

If you experience the InsufficientInstanceCapacity: Insufficient Capacity error, you'll be glad to know there are some strategies for working around it. Justin@AWS offers this advice:
There can be short periods of time when we are unable to accommodate instance requests that are targeted to a specific Availability Zone. When a particular instance type experiences unexpected demand in an Availability Zone, our system must react by shifting capacity from one instance type to another. This can result in short periods of insufficient capacity. We incorporate this data into our capacity planning and try to manage all zones to have adequate capacity at all times. The following steps will ensure that you will have the best experience launching Amazon EC2 instances when an initial insufficient capacity message is received:

1. Don't specify an Availability Zone in your request unless necessary. By targeting a specific Availability Zone you eliminate our ability to satisfy that request by using our other available Availability Zones. Please note that a single RunInstances call will allocate all instances within a single Availability Zone.

2. If you require a large number of instances for a particular job, please request them in batches. The best practice to follow here is to request 25% of your total cluster size at a time. For example, if you want to launch 200 instances, launching 50 instances at a time will result in a better experience.

3. Try using a different instance type. As capacity varies across instance types, attempting to launch different instance types provides spill over capacity should your primary instance type be temporarily unavailable.

Unfortunately, these techniques require that you be willing to accept higher bandwidth costs for cross-availability-zone traffic.
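
Tips 2 and 3 above can be sketched in code. This is only an outline under assumptions: launch_fn is a hypothetical callable standing in for whatever EC2 library you use, and InsufficientCapacity is a made-up exception representing the InsufficientInstanceCapacity error.

```python
import math


class InsufficientCapacity(Exception):
    """Stand-in for the InsufficientInstanceCapacity error (hypothetical)."""


def batch_sizes(total, fraction=0.25):
    """Split a request for `total` instances into batches of `fraction` of the total."""
    if total <= 0:
        return []
    batch = max(1, math.ceil(total * fraction))
    sizes = []
    remaining = total
    while remaining > 0:
        size = min(batch, remaining)
        sizes.append(size)
        remaining -= size
    return sizes


def launch_with_fallback(launch_fn, count, instance_types):
    """Try each instance type in order until one launch succeeds."""
    for instance_type in instance_types:
        try:
            return launch_fn(instance_type, count)
        except InsufficientCapacity:
            continue  # spill over to the next instance type
    raise InsufficientCapacity("no capacity for any of: %s" % ", ".join(instance_types))
```

For the 200-instance example in the quote, batch_sizes(200) yields four batches of 50, and each batch could be passed to launch_with_fallback with a list of acceptable instance types.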

And, none of these tips help if you're using Auto Scaling. A single Auto Scaling Group must be in a specific availability zone, so #1 won't help. You can try using smaller numbers of instances when a trigger is reached by choosing a smaller LowerBreachScaleIncrement or UpperBreachScaleIncrement (which control by how many instances or by what percent to scale in each direction), as per #2, but this is only helpful if you've planned in advance. And #3 is only possible if you've already noticed an auto scaling activity failure and changed the Launch Configuration - which defeats the purpose of Auto Scaling.

Auto Scaling's error reporting and recovery is very limited currently. Are you listening, AWS?

Update 18 October 2009: AWS is listening. The following post by John@AWS appears in this thread:
AutoScaling currently reports [...] InsufficientInstanceCapacity [...] as a generic Internal Error. This is unintentional, and will be remedied in our next release.
Cool!

Update 19 October 2009: Auto Scaling Groups can now be configured to support more than one Availability Zone. Here is the salient quote from the updated documentation:
Instance Distribution and Balance across Multiple Zones

Amazon Auto Scaling attempts to distribute instances evenly between the Availability Zones that are enabled for your AutoScalingGroup. Auto Scaling uses the Availability Zone with the least number of instances when launching new instances. However, if an Availability Zone has insufficient capacity or if Amazon EC2 is unable to launch new instances in it, then Auto Scaling launches instances in another Availability Zone to satisfy the required capacity for your group.

Certain operations and conditions can cause your AutoScalingGroup to become unbalanced. Auto Scaling compensates by creating a rebalancing activity under any of the following conditions:

  1. You issue a request to change the Availability Zones for your group.

  2. You call TerminateInstanceInAutoScalingGroup, which causes the group to become unbalanced.

  3. An Availability Zone that previously had insufficient capacity recovers and has additional capacity available.

Auto Scaling always launches new instances before attempting to terminate old ones, so a rebalancing activity will not compromise the performance or availability of your application.

Multi-Zone Instance Counts when Approaching Capacity

Because Auto Scaling always attempts to launch new instances before terminating old ones, being at or near the specified maximum capacity could impede or completely halt rebalancing activities. To avoid this problem, the system can temporarily exceed the specified maximum capacity of a group by a 10% margin during a rebalancing activity. The margin is only extended if the group is at or near maximum capacity and needs rebalancing (either as a result of user-requested rezoning or to compensate for zone availability issues). The extension only lasts as long as needed to rebalance the group (typically a few minutes).
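
The zone-selection rule quoted above (use the zone with the fewest instances, and fall back to another zone when that one has no capacity) can be illustrated with a short sketch. This is my reading of the documentation, not Auto Scaling's actual implementation; choose_zone is a hypothetical function, and breaking ties by zone name is an assumption made only to keep the result deterministic.

```python
def choose_zone(instance_counts, zones_without_capacity=()):
    """Pick the enabled zone with the fewest instances that still has capacity.

    instance_counts maps zone name -> number of running instances.
    """
    candidates = {
        zone: count
        for zone, count in instance_counts.items()
        if zone not in zones_without_capacity
    }
    if not candidates:
        raise ValueError("no Availability Zone with capacity")
    # Fewest instances first; ties broken by zone name (an assumption).
    return min(candidates, key=lambda zone: (candidates[zone], zone))
```

With counts of {"us-east-1a": 3, "us-east-1b": 1} the sketch picks us-east-1b, and it picks us-east-1a instead if us-east-1b is listed as lacking capacity.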

Sunday, September 27, 2009

Alternatives to Elastic IPs for EC2 Name Resolution

How can you handle DNS lookups in EC2 without going crazy each time a resource's IP address changes? One solution is to use an Elastic IP, a stable IP address that can be remapped to different instances, but Elastic IPs are not appropriate for all situations. This article explores the various methods of managing name resolution with EC2 instances.

Features of Different Name Resolution Methods

Before diving into the methods themselves let's take a look at the factors to consider when evaluating methods of managing name resolution. Here are the factors:
  • Updatable in code. You will want to write code to make changes to the name resolution settings automatically, in response to infrastructure events (e.g. launching a new server).
  • Propagation delay. It can take some time for changes to name resolution settings to propagate (especially with DNS). A solution should offer some degree of assurance that changes will propagate within a known and reasonable period of time. [Note that some clients (e.g. the IE browser or the Java runtime) by default ignore the DNS TTL, artificially increasing the propagation delay for DNS-based methods.]
  • Compatible with DNS. If your service will be accessed by a web browser or other client that you do not control, your name resolution method will need to be compatible with DNS. Otherwise clients will not be able to resolve your hostnames properly.
  • Ease of implementation. Some solutions, while technically sufficient, are difficult to implement.
  • Public / Private IP addresses. Whether the solution can serve public and/or private IP addresses. If your clients are inside the same EC2 region then you want their lookups to resolve to the private IP address. Clients outside the same EC2 region should be served the public IP address.
  • Supply. Is there any practical limitation on the number of name resolution entries?
  • Cost. How much it costs to implement, including costs for idle resources and updating settings.

Methods of Name Resolution

As mentioned above, there are a number of different methods to manage name resolution. These are:
  • Traditional DNS.
  • Dynamically update the /etc/hosts file on the various application hosts. The /etc/hosts file on Linux (like the C:\Windows\System32\Drivers\etc\hosts file on Windows) contains host-name-to-IP-address mappings that are checked before DNS is consulted, allowing it to override DNS. The file can be updated via pull (initiated by the host) or push (initiated by an external agent).
  • Store the mappings in S3 or SimpleDB. Clients must use the S3 or SimpleDB APIs for name resolution.
  • Use a dynamic DNS provider.
  • Run your own traditional DNS servers for your domain. Clients must be able to see these DNS servers.
  • Run your own dynamic DNS servers for your domain. Clients must be able to see these DNS servers.
  • Elastic IPs. The AWS pricing model discourages (though not strongly enough, I believe) Elastic IPs from being left unused, so you should use them for instances hosting services that are always on, such as your web server or your Facebook application. You should set up a DNS entry pointing the host names to the Elastic IPs, and then any remapping of the Elastic IP to a different instance happens via the EC2 API without requiring any change to DNS.
Here is a table showing how these name resolution methods stack up against each other:



Notes:
  • Dynamically updating /etc/hosts can be used to store either the public IP or the private IP but not both for the same client. You can use one /etc/hosts file for your clients inside the same EC2 region which contains the private IPs, and a different but corresponding /etc/hosts file for your clients outside the EC2 region (or outside EC2 completely) which contains the public IPs. The propagation delay is governed by the frequency with which you update the /etc/hosts file on each client. You can minimize this delay by increasing the frequency of updates. This technique is described in detail in an article by Tim Dysinger.
  • Similarly, the two "run your own DNS" methods (Your Own DNS for your Domain, Your Own Dynamic DNS for your Domain) can be used to resolve to either the public IP address or the private IP address, but not both for the same client. You should set up your clients inside EC2 to utilize the DNS service inside EC2, and the domain should be configured to point to the DNS service running outside EC2 so that clients outside EC2 will see the public IPs. Note that clients running inside EC2 whose DNS resolution you do not control (for example, another EC2 user's client) will be referred to the public IPs. Jeff Roberts offers some great practical suggestions for running your own DNS inside EC2.
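
As an illustration of the /etc/hosts approach in the first note, here's a minimal sketch of the rewrite step. It is not Tim Dysinger's technique; render_hosts and the marker comments are hypothetical. The function works on text only, so fetching the current name-to-IP mappings (via push or pull) and writing the result back to /etc/hosts are left to the caller.

```python
BEGIN_MARKER = "# BEGIN managed EC2 hosts"
END_MARKER = "# END managed EC2 hosts"


def render_hosts(existing_text, mappings):
    """Return hosts-file text with the managed block replaced by `mappings`.

    mappings maps host name -> IP address. Lines outside the managed
    block (e.g. the localhost entry) are preserved untouched.
    """
    kept = []
    inside_block = False
    for line in existing_text.splitlines():
        stripped = line.strip()
        if stripped == BEGIN_MARKER:
            inside_block = True
        elif stripped == END_MARKER:
            inside_block = False
        elif not inside_block:
            kept.append(line)
    block = [BEGIN_MARKER]
    block += ["%s\t%s" % (ip, name) for name, ip in sorted(mappings.items())]
    block.append(END_MARKER)
    return "\n".join(kept + block) + "\n"
```

Running this from cron more often shortens the propagation delay, and feeding it private IPs for clients inside the EC2 region and public IPs for clients outside gives the two corresponding files described above.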
This table demonstrates the following:
  • Elastic IPs are the best choice when you need only a limited number of resolvable names and you will use them constantly. If you use their corresponding DNS name then they intelligently resolve to the public IP when looked up from the internet and to the private IP when looked up from within EC2.
  • If you need an unlimited number of resolvable names within EC2 then you should run your own dynamic DNS within EC2.
  • Methods that are incompatible with DNS should only be used with clients you control.
As we can see, Dynamic DNS (especially running your own) has one distinct advantage over using Elastic IPs: unlimited supply at no cost when unused.

When Running Your Own Dynamic DNS is Better than Elastic IPs

One application for running your own Dynamic DNS is a testing environment that includes large clusters of EC2 instances, for example a database cluster or application nodes, connected to web layer instance(s). These cluster instances will only be visible to the front-end web tier, so they do not need a publicly resolvable IP address. And your testing environment is not likely to be running all the time. Elastic IPs would work here (presuming you needed only 5 or you could convince AWS to increase your Elastic IP limit to meet your needs), but would cost money when unused. A more economical solution might be to use your own Dynamic DNS within EC2 for these instances. If you have spare capacity on an existing instance then you can put the Dynamic DNS service there - otherwise you will need another instance, making the cost less attractive. In any case you'll need the instance hosting the Dynamic DNS to have an Elastic IP to allow failover without affecting the clients. And you'll need a script to dynamically configure the /etc/resolv.conf on your EC2 clients to point to the private IP address of the Dynamic DNS instance by looking up its Elastic IP's DNS name.
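
The resolv.conf script mentioned above might look something like the following sketch. The function names are hypothetical. It relies on the behavior described earlier, that an Elastic IP's DNS name resolves to the private IP when looked up from within EC2, and it only produces the file contents; writing them to /etc/resolv.conf (as root) is left out.

```python
import socket


def lookup_dns_server_ip(elastic_ip_dns_name, resolver=socket.gethostbyname):
    """Resolve the Dynamic DNS instance's Elastic IP DNS name to an address.

    Run from inside EC2, this is expected to return the private IP.
    """
    return resolver(elastic_ip_dns_name)


def render_resolv_conf(nameserver_ip, search_domain=None):
    """Return resolv.conf contents pointing at the given name server."""
    lines = []
    if search_domain:
        lines.append("search %s" % search_domain)
    lines.append("nameserver %s" % nameserver_ip)
    return "\n".join(lines) + "\n"
```

Re-running the script after a failover picks up the new instance automatically, because the Elastic IP's DNS name stays the same while the private IP behind it changes.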

Let's compare the monthly costs of using Elastic IPs with the costs of running your own dynamic DNS for a testing environment such as the above. The cost reflects the following ingredients and assumptions:
  • The number of hours over the month that allocated addresses (DNS entries) are not associated with a live instance, in total for all allocated addresses. If you have 10 DNS entries / addresses and leave them unmapped for 10 hours each then you have 100 unmapped hours.
  • The number of changes to the DNS mappings made that month.
  • The fractional cost of running an instance just to serve the dynamic DNS. If you have spare capacity on an existing instance then this is the instance cost multiplied by the fraction of the capacity that the dynamic DNS service uses. If you need to spin up a dedicated instance for the dynamic DNS service then this is the entire cost of that instance.
  • Pricing for Elastic IPs: free when in use. 1 cent per hour unused. First 100 remaps per month free, 10 cents per remap afterward.
It should be obvious that using dynamic DNS for this testing environment will be economical when

FractionalDNSInstanceCost < NumUnmappedHours * 0.01 + MAX(NumMappingChanges - 100, 0) * 0.1

For simplicity's sake this can be rewritten in clearer terms:

FractionalDNSInstanceCost < NumInstances * ( NumHoursClusterUnused * 0.01 + MAX(NumTimesClusterIsLaunched - 100, 0) * 0.1)
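To make the first comparison concrete, here is a small Python sketch that evaluates it using the Elastic IP prices listed above. The function and constant names are my own, and amounts are kept in whole cents to avoid floating-point rounding.

```python
# Elastic IP prices from the list above, in cents.
UNUSED_CENTS_PER_HOUR = 1
FREE_REMAPS_PER_MONTH = 100
CENTS_PER_EXTRA_REMAP = 10


def elastic_ip_cost_cents(num_unmapped_hours, num_mapping_changes):
    """Monthly Elastic IP cost: unused hours plus remaps beyond the free 100."""
    extra_remaps = max(num_mapping_changes - FREE_REMAPS_PER_MONTH, 0)
    return (num_unmapped_hours * UNUSED_CENTS_PER_HOUR
            + extra_remaps * CENTS_PER_EXTRA_REMAP)


def dynamic_dns_is_cheaper(fractional_dns_instance_cost_cents,
                           num_unmapped_hours, num_mapping_changes):
    """True when running your own dynamic DNS costs less than Elastic IPs."""
    return fractional_dns_instance_cost_cents < elastic_ip_cost_cents(
        num_unmapped_hours, num_mapping_changes)


if __name__ == "__main__":
    # 10 addresses unmapped for 10 hours each, 150 remaps in the month.
    print(elastic_ip_cost_cents(10 * 10, 150))  # 100 + 50 * 10 = 600 cents
```

For example, with 100 unmapped hours and 150 remaps the Elastic IPs cost $6.00 for the month, so a dynamic DNS service whose share of an instance costs less than that would be the cheaper option.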

Right about now I'm wishing Excel had better 3-D graphing capabilities. Here's something helpful to visualize this:



The chart shows the monthly cost of running clusters of different sizes according to how many times the cluster is launched. The color "bands" show the areas in which the monthly cost lies, depending on how many hours the cluster remains unused. For a given number of times launched (i.e. for a given vertical line), the "bottom" point of each band is the cost when the cluster is unused zero hours (i.e. always on), and the "top" point is the cost when the cluster is unused for 500 hours (about 20 days).

The dominant factors are, first, the number of instances in the cluster and, second, the number of times the cluster will be launched. A cluster of 100 instances costs $10 each time it is launched beyond the first 100 (plus $1 for each hour unused). For large cluster sizes, the more times you launch, the higher the cost of using Elastic IPs will be and the more attractive the run-your-own dynamic DNS option becomes.

Thursday, September 24, 2009

Cool Things You Can Do with Shared EBS Snapshots

I've been awaiting this feature for a long time: Shared EBS Snapshots. Here's a brief intro to using the feature, and some cool things you can do with shared snapshots. I also offer predictions about things that will appear as this feature gains adoption among developers.

How to Share an EBS Snapshot

Really, it's easy. The first thing you'll need to know is the Account Number of the user with whom you want to share the snapshot. If you want to make the snapshot public then you don't need this. The account number can be found in the Your Account > Account Activity page. It's in small numbers in the top-right of the page (so small you may need to click on the image below to see it in full size):


The person with whom you want to share the snapshot (you are the sharer, they are the "sharee"?) should tell you this 12-digit number. Don't worry, sharee, it's not a secret.

Once you have the sharee's account number you, the sharer, go into the AWS Management Console and choose the Snapshots item. Find the snapshot you want to share and right-click on it, choosing "Snapshot Permissions". You'll get the following dialog:



Enter the sharee's account number, without the separating dashes, in the dialog, and hit "Save". It should only take a few seconds and... presto! The snapshot should be visible in the sharee's AWS Management Console Snapshots page.
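The same permission can also be granted programmatically. The sketch below is not part of the console workflow described above: it assumes the boto3 library's EC2 client and its modify_snapshot_attribute call, and the helper names and IDs are hypothetical. The client is passed in as a parameter, so the dash-stripping logic can be exercised without AWS credentials.

```python
def normalize_account_number(account_number):
    """Strip the separating dashes from a 12-digit AWS account number."""
    digits = account_number.replace("-", "").replace(" ", "")
    if len(digits) != 12 or not digits.isdigit():
        raise ValueError("expected a 12-digit AWS account number")
    return digits


def share_snapshot(ec2_client, snapshot_id, account_number):
    """Allow another account to create volumes from the given snapshot.

    ec2_client is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    ec2_client.modify_snapshot_attribute(
        SnapshotId=snapshot_id,
        Attribute="createVolumePermission",
        OperationType="add",
        UserIds=[normalize_account_number(account_number)],
    )


# Usage, with hypothetical IDs and boto3 installed and configured:
#   import boto3
#   share_snapshot(boto3.client("ec2"), "snap-0123456789abcdef0", "1234-5678-9012")
```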

Cool Things You Can Do with Shared Snapshots

Update 27 September 2009: Before you share snapshots publicly, read Eric Hammond's warning about the dangers of doing so.

Easily move data between development, testing, and production

You've been keeping separate AWS accounts for your production environment, your testing environment, and your development environment, right? Right? Well, in case you haven't, you no longer have any excuse not to do so. You can now share your database, your HDFS volumes (if you use Cloudera's Hadoop distribution with EBS support), and anything else of significant size between these separate accounts. No more "tar, gzip, split into < 5GB chunks, upload to S3" and "download from S3, concatenate, untar-gzip". Your data is ready to go with the newly-created volume.

Share entire setups for troubleshooting and support


If you support a product that is deployed in EC2 you no longer need to jump through hoops to get access to your customer's files when there's a problem. Simply have them put the relevant files into an EBS volume, snapshot it, and share the snapshot with you.

Deliver your application in a more granular manner

Until today you delivered your application as an AMI - perhaps even a DevPay AMI - and you may not have given your customers root access. But, if your application used less than 100% of an instance's CPU, the customer was stuck paying for an entire CPU. Now, you can distribute your applications as a shared snapshot instead, and your customers will be free to use the rest of the instance's CPU. You'll just need to build a way to manage access, only allowing authorized customers to see the snapshot.

Deliver your customer's results in a more usable format

If you run a service that provides large amounts of data, you no longer need to use S3 to share the results. Until today you had to store the results in S3, and your customer needed to retrieve the results from S3 in order to use them. No longer: now you can provide a shared snapshot of the results, and the customer can access them via their filesystem more simply. "The shared snapshot is the new bucket."

Mount a volume created from a shared snapshot at startup

In a previous article I explained how to automatically mount an EBS volume created from a snapshot during the instance's startup sequence. I provided a script that gets the snapshot ID via the user-data and does all the rest automatically. Now you can also use snapshots that have been shared.
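For readers who have not seen that script, the core step looks roughly like the Python sketch below. This is not the script from the earlier article. It is a hedged illustration that assumes a boto3 EC2 client, and the function name, IDs, and device name are hypothetical. A snapshot shared with your account is used exactly like one of your own.

```python
def volume_from_snapshot(ec2_client, snapshot_id, availability_zone,
                         instance_id, device="/dev/sdh"):
    """Create a volume from a snapshot (own or shared) and attach it.

    ec2_client is assumed to be a boto3 EC2 client. The volume must be
    created in the same availability zone as the instance.
    """
    volume = ec2_client.create_volume(
        SnapshotId=snapshot_id,
        AvailabilityZone=availability_zone,
    )
    volume_id = volume["VolumeId"]

    # Wait until the new volume is available before attaching it.
    ec2_client.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    ec2_client.attach_volume(
        VolumeId=volume_id,
        InstanceId=instance_id,
        Device=device,
    )
    return volume_id
```

Once the volume is attached, the instance still has to mount the device on the filesystem, which is left out of this sketch.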

Update 25 September 2009: Share entire machines

Reader Robert Staveley (Tom) comments below about his use for shared snapshots: sharing entire machines - boot code and everything - between development, testing, and production accounts. He points out that, using the technique to boot an instance from an EBS volume, the entire bootable hard drive and all applications (even beyond 10GB) can be shared between these accounts.

Things to Expect in the Future

Shared snapshots are still a very new feature, but here are some things I expect to happen now that this is possible.
  • The AWS Management Console is the only UI that allows you to share a snapshot. ElasticFox will be adding this capability Real Soon Now, and I am sure others will as well.
  • Alternatives to AMIs. AMIs have many limitations, such as the 10GB maximum size, that can be circumvented using a technique I described to boot from an EBS volume. I expect to see OS distributions packaged as a shared EBS snapshot. These distributions could all share a common AMI containing just enough code to create a volume from the shared distribution snapshot, mount it, and boot from it. No more headaches bundling an AMI - just share a new bootable EBS snapshot.
  • Payment gateway services for managing access to shared snapshots. Now that you're distributing software as a shared snapshot you'll need to manage access to the snapshot, limiting it to authorized customers. You might build that system yourself today, but soon we'll see third-party services that do this for you.
Do you have other cool uses or predictions for shared snapshots? Please comment!