There is a common perception that cloud storage is nothing to worry about because it is cheap and available at any time. But is that really true? I often hear AWS consumers say that AWS storage means S3 (Simple Storage Service) – this is true, but it is not the whole truth. There are actually four different AWS cloud storage models. We’ll get back to those, but first let’s focus on the importance of understanding your AWS S3 footprint.
Having complete visibility into your S3 buckets enables you to implement AWS storage policies and tier your storage needs with confidence. If you use Amazon storage, you need this transparency in order to make better decisions and take action on:
- Usage configuration and policy analysis
- Usage analysis to tier storage
- Reduced Redundancy Storage (RRS) and Glacier Opportunities
- Cost and Asset Reports
Now, let’s jump back to additional AWS cloud storage opportunities. As I stated before, AWS storage opportunities extend beyond S3. Below, I have outlined some of the different offerings to meet your cloud storage needs.
1 – EBS (Elastic Block Store) – Storage designed specifically for EC2 instances. EBS volumes can be mounted on an instance and essentially serve as external virtual hard drives. EBS volume sizes range from 1 GB to 1 TB. EBS accounts can have 5,000 volumes or an aggregate size of 20 TiB (whichever is smaller) unless a higher limit is requested from Amazon ($0.10 per GB-month of provisioned storage).
2 – S3 – A complete public storage service. You can store data and make it publicly available without any other infrastructure. It is available over HTTP and stores objects under unique keys. S3 accounts can have a maximum of 100 buckets, each with unlimited storage and an unlimited number of files. The maximum size of a single file is 5 GB ($0.10 per GB-month of stored data).
3 – Snapshots – Simple point-in-time backups of an EBS volume that are stored on S3. After the initial snapshot, only the changes between snapshots are stored in S3. AWS accounts are limited to 10,000 EBS snapshots ($0.14 per GB-month of data stored, without taking into consideration data transfer to and from S3).
4 – AMI (Amazon Machine Image) – The AWS image of an instance. Amazon calls it “the basic unit to deploy services delivered using EC2”. The AMI file system is compressed, encrypted, signed, split into a series of 10 MB chunks, and uploaded into Amazon S3 for storage. An AMI can also be backed up on an EBS volume. Check out this list of ready-made AMIs.
See prices on the Amazon AWS portal.
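A quick way to get a handle on what you are actually holding across these four storage types is to pull an inventory through the AWS API. The sketch below only tallies describe-style responses; the sample records are hypothetical, and in practice the dictionaries would come from boto3 calls such as `describe_volumes`, `list_buckets`, `describe_snapshots`, and `describe_images` (shown in the comments, assuming boto3 is installed and credentials are configured):

```python
# Hypothetical sketch: tally an AWS storage footprint from API-style responses.
# In practice these lists would come from boto3, e.g.:
#   import boto3
#   ec2 = boto3.client("ec2"); s3 = boto3.client("s3")
#   volumes   = ec2.describe_volumes()["Volumes"]
#   buckets   = s3.list_buckets()["Buckets"]
#   snapshots = ec2.describe_snapshots(OwnerIds=["self"])["Snapshots"]
#   images    = ec2.describe_images(Owners=["self"])["Images"]

def summarize_storage(volumes, buckets, snapshots, images):
    """Return a simple footprint summary across the four storage types."""
    return {
        "ebs_volumes": len(volumes),
        "ebs_gb": sum(v["Size"] for v in volumes),  # EBS reports sizes in GiB
        "s3_buckets": len(buckets),
        "snapshots": len(snapshots),
        "amis": len(images),
    }

# Hypothetical sample responses, for illustration only:
volumes = [{"VolumeId": "vol-1", "Size": 300}, {"VolumeId": "vol-2", "Size": 300}]
buckets = [{"Name": "my-logs"}]
snapshots = [{"SnapshotId": "snap-1"}]
images = [{"ImageId": "ami-1"}]

print(summarize_storage(volumes, buckets, snapshots, images))
```

Running a summary like this periodically is the first step toward the visibility discussed above.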
Now, let’s talk about growth – or rather, uncontrolled cloud-storage sprawl. The appealing cost, ease of use, and availability of on-demand storage resources lead cloud users to consume more and more. But what about deleting redundant or orphaned storage resources?
Consider the following example: a web application running on Amazon AWS. In this case, the overall storage cost is composed of:
1 – 4 EBS volumes with provisioned storage of 300 GB each ($0.10 per GB), totaling $120/month
2 – EBS snapshots with a 10% daily increase ($0.14 per GB), totaling $108.15/month
3 – S3 usage of 30 GB a month, totaling $4.20/month
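The arithmetic behind these three line items can be checked with a short script (the snapshot figure is taken directly from the 10% daily-growth example above):

```python
# Sketch of the monthly storage bill for the example above.
EBS_RATE = 0.10   # $ per GB-month (provisioned EBS storage)
SNAP_RATE = 0.14  # $ per GB-month (snapshot/S3 storage in this example)

ebs_cost = 4 * 300 * EBS_RATE   # 4 volumes x 300 GB -> $120.00
snapshot_cost = 108.15          # from the 10% daily-growth figure above
s3_cost = 30 * SNAP_RATE        # 30 GB a month -> $4.20

total = ebs_cost + snapshot_cost + s3_cost
print(f"${total:.2f} per month")
```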
The total cost is about $232 per month for this static use. But this isn’t a cloud utilization case – it is a simple and affordable hosting solution that does not take into consideration scalability, I/O operations, and bandwidth, which are also important cost factors. The conversation about cost must also include the growth patterns of resources and environment complexity. Consider the following use case, demonstrating growth of 100 GB a day!
“Yelp uses Amazon S3 to store daily logs and photos, generating around 100GB of logs per day. The company also uses Amazon Elastic MapReduce to power approximately 20 separate batch scripts, most of those processing the logs.” Check out this case study
Now let’s do a quick calculation of the cost change: 100 GB a day × $0.14 = $14 a day, i.e., an additional cost of $420 per month. That’s not all that much money if we consider only S3 and ignore all the other storage types. Due to the low cost, most public cloud consumers push the issue of storage management to the end of their “cloud adoption” task list. We have also found that IT teams are somewhat helpless when it comes to accurately tracking, measuring, and controlling their cloud storage. Even for new cloud consumers with very low storage expenses, the inevitable growth over time will make storage costs and complexity a significant part of the overall cloud environment and operations.
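Note that the $420 figure is only the bill for the first month’s worth of data; because logs accumulate, the same arithmetic repeats on an ever-larger base. A minimal sketch of that growth, assuming flat pricing and that nothing is ever deleted:

```python
# Sketch: stored data accumulates, so the monthly bill keeps climbing.
# Assumptions: 100 GB of new data per day, $0.14 per GB-month, 30-day
# months, and no deletion or tiering.
DAILY_GB = 100
RATE = 0.14  # $ per GB-month

def monthly_bill(month):
    """Approximate storage bill for the given month (1-based)."""
    stored_gb = DAILY_GB * 30 * month  # cumulative data at month's end
    return stored_gb * RATE

print(monthly_bill(1))   # the $420 figure above
print(monthly_bill(12))  # a year later, twelve times that
```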
To avoid S3 storage sprawl, it’s best to have clear storage policies from the get-go. The IT organization (such as an ISV or an enterprise IT department) must think ahead and take action even if the current state is not particularly problematic. For example, make sure to map storage types and prices to related business goals. The following tips can help you take the next step in taking charge of your AWS storage:
1 – Plan: Architect and manage a smart deployment. This is not trivial – IT organizations consume cloud storage as an extension of their on-premises storage solutions. Unfortunately, they often don’t step back to think and plan. You should make sure that rules are in place and that the chosen storage type fits the data and the needed SLA.
“Organizations should use caution not to divert from on-premises, best-practices solutions when placing data in the cloud. For example, if a business best practice requires a dataset to have daily snapshots, the same application in the cloud should not be configured to perform hourly snapshots just because cloud storage has a lower cost point. This not only will cost more money for the additional space, but also will add to the management cost of maintaining the additional snapshots. The real expense resides in the time it takes to manage the storage.” – “Public Cloud-Storage Management: The Epitome of Storage Sprawl”, a report by Gartner
2 – Start Small: The common cloud approach is valid here as well. In the case of EBS volumes, avoid “traditional thinking” about capacity and be sure to start small. Your storage capacity can always grow in keeping with demand.
3 – Documentation and Measurement: To avoid duplication and superfluous data transfers, every change should be documented (even starting with an Excel spreadsheet will do). The immaturity of cloud tools, including storage measurement, makes it hard to get a real-time qualitative view. I invite you to check out our post on the IaaS management market.
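Even a simple scripted inventory export beats no documentation at all. The sketch below renders a resource list as CSV using only the standard library; the records and their fields are hypothetical and would in practice be filled from API responses or your change log:

```python
import csv
import io

def inventory_to_csv(resources):
    """Render a list of resource records as CSV text for documentation."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["id", "type", "size_gb", "owner", "purpose"])
    writer.writeheader()
    writer.writerows(resources)
    return out.getvalue()

# Hypothetical records, for illustration only:
records = [
    {"id": "vol-1", "type": "ebs", "size_gb": 300,
     "owner": "web-team", "purpose": "app data"},
    {"id": "snap-1", "type": "snapshot", "size_gb": 30,
     "owner": "web-team", "purpose": "daily backup"},
]
print(inventory_to_csv(records))
```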
4 – Clean Up: Define a storage clean-up routine for your IT operations (or should we say “DevOps”) team.
“S3 includes a property for CreatedDate. Today the way to do this is to LIST objects in your bucket, pass over the returned list to grep out objects with CreatedDate older than 30 days, then send delete requests for those objects. Delete requests as you probably know are free.” Learn more about Old data deletion on the AWS forums
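The routine described in that quote can be sketched with boto3. Note that the S3 object listing actually exposes a `LastModified` timestamp rather than `CreatedDate`. The filtering logic below is a pure function; the commented lines show where the real boto3 calls and a hypothetical bucket name would go (assuming credentials are configured):

```python
from datetime import datetime, timedelta, timezone

def keys_older_than(objects, days, now=None):
    """Return keys of listed S3 objects whose LastModified is older than `days`."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    return [o["Key"] for o in objects if o["LastModified"] < cutoff]

# With boto3 (hypothetical bucket name, assumes credentials are configured):
#   import boto3
#   s3 = boto3.client("s3")
#   for page in s3.get_paginator("list_objects_v2").paginate(Bucket="my-logs"):
#       stale = keys_older_than(page.get("Contents", []), days=30)
#       if stale:  # delete_objects accepts up to 1000 keys per call
#           s3.delete_objects(Bucket="my-logs",
#                             Delete={"Objects": [{"Key": k} for k in stale]})

now = datetime(2012, 6, 1, tzinfo=timezone.utc)
objects = [
    {"Key": "logs/old.log", "LastModified": datetime(2012, 4, 1, tzinfo=timezone.utc)},
    {"Key": "logs/new.log", "LastModified": datetime(2012, 5, 30, tzinfo=timezone.utc)},
]
print(keys_older_than(objects, days=30, now=now))  # ['logs/old.log']
```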
Storage redundancy can occur when not all linked resources are deleted. For example, shutting down instances or EBS-backed AMIs will not automatically delete the associated EBS storage. Even an API call to terminate an instance will not necessarily shut down the attached EBS volumes. Temporary files, EBS volumes, and AMIs should follow standard naming and should be deleted as part of (at least) a weekly clean-up task. Learn more about cleaning up EC2 images, AMIs, and instances on the AWS portal.
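A weekly clean-up check for orphaned EBS volumes can be sketched as follows: volumes in the `available` state are not attached to any instance. The filter itself is a pure function; the commented lines show the boto3 calls (an assumption, requiring configured credentials), and deletion is deliberately left commented out for review:

```python
def unattached_volumes(volumes):
    """Return IDs of EBS volumes not attached to any instance."""
    return [v["VolumeId"] for v in volumes if v.get("State") == "available"]

# With boto3 (assumes credentials are configured):
#   import boto3
#   ec2 = boto3.client("ec2")
#   vols = ec2.describe_volumes(
#       Filters=[{"Name": "status", "Values": ["available"]}])["Volumes"]
#   for v in vols:
#       print("orphaned:", v["VolumeId"])  # review before deleting!
#       # ec2.delete_volume(VolumeId=v["VolumeId"])

# Hypothetical sample response, for illustration only:
volumes = [
    {"VolumeId": "vol-a", "State": "in-use"},
    {"VolumeId": "vol-b", "State": "available"},
]
print(unattached_volumes(volumes))  # ['vol-b']
```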
5 – Compression: Just as on-premises, you should have a compression mechanism for files before uploading them to an S3 bucket or an EBS volume.
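A minimal compression sketch using the standard library’s gzip module; the bucket and key names in the comment are hypothetical, and the upload itself is shown only as a commented boto3 call:

```python
import gzip

# Repetitive data (like access logs) compresses very well:
data = b"GET /index.html 200\n" * 1000
compressed = gzip.compress(data)
print(len(data), "->", len(compressed))
assert gzip.decompress(compressed) == data  # lossless round-trip

# With boto3 (hypothetical bucket/key, assumes credentials are configured):
#   import boto3
#   boto3.client("s3").put_object(Bucket="my-logs",
#                                 Key="logs/access.log.gz", Body=compressed)
```

Compressing before upload reduces both the GB-month storage charge and the data-transfer cost mentioned earlier.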
Amazon AWS writes on its site:
“Snapshot storage is based on the amount of space your data consumes in Amazon S3. Because data is compressed before being saved to Amazon S3, and Amazon EBS does not save empty blocks, it is likely that the size of a snapshot will be considerably less than the size of your volume.” Read More
We tried to find out what “considerably less” means, but we didn’t get any satisfying quantitative answer. Amazon AWS doesn’t present a clear view of the generated snapshot size. How can you tell the size of an AWS snapshot? Check out this Quora discussion.
Public cloud storage is an extremely dynamic environment containing an enormous number of components. We define it as a “typical cloud problem”. The cloud storage monitoring solutions currently on the market are still immature, as is the awareness of the cloud consumer. The organization’s cloud environment grows steadily – and so does the fog around it.