Validating CDMI Features – Metadata Search

Here we go again: another cloud offering has been announced that validates an existing standardized feature of CDMI. The new Amazon CloudSearch offering lets you store structured metadata in the cloud and run queries against it. Amazon missed an opportunity, however, to integrate this with its existing cloud object storage offering. After all, if you already have object storage, why keep the metadata in a separate cloud instead of with the data object itself?

CDMI lets you put the user metadata directly into the storage object, where it is protected, backed up, archived and retained along with the actual data. CDMI’s rich query functions are then able to find the storage object based on the values of the metadata without talking to a separate cloud offering with a new, proprietary API.

CDMI standardizes a Query Queue that allows the client to create a scope specification (equivalent to a WHERE clause) to find specific objects that match the criteria, and a results specification (equivalent to a SELECT clause) that determines the elements of the object that are returned for each match. Results are placed in a CDMI queue object and can be processed one at a time, or in bulk. This powerful feature allows any storage cloud that has a search feature to expose it in a standard manner for interoperability between clouds.

An example of the metadata associated with a query queue is as follows:

{
     "metadata" : {
          "cdmi_queue_type" : "cdmi_query_queue",
          "cdmi_scope_specification" : [
               {
                    "domainURI" : "== /cdmi_domains/MyDomain/",
                    "parentURI" : "starts /MyMusic",
                    "metadata" : {
                         "artist" : "*Bono*"
                    }
               }
          ],
          "cdmi_results_specification": {
               "objectID" : "",
               "metadata" : {
                    "title" : ""
               }
          }
     }
}


When results are stored in a query queue, each enqueued value consists of a JSON object of MIME-type “application/json”. This JSON object contains the specified values requested in the cdmi_results_specification of the query queue metadata.

An example of a query result JSON object is as follows:

{
     "objectID" : "00007E7F0010EB9092B29F6CD6AD6824",
     "metadata" : {
          "title" : "Vertigo"
     }
}

Thus, if you are using your storage cloud to store music files, for example, all of the metadata for each MP3 object can be stored right along with the object, and CDMI's powerful query mechanisms can be used to find the files you are interested in without invoking a separate search cloud with disassociated metadata.
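To make the scope/results behavior concrete, here is a minimal sketch in plain Python (not a CDMI client; the matcher and the sample objects are hypothetical) of how a WHERE-style scope specification and a SELECT-style results specification combine over a set of object metadata records:

```python
import fnmatch

def matches_scope(obj, scope):
    """Return True if the object satisfies every clause of one scope entry."""
    for field, pattern in scope.items():
        if field == "metadata":  # user metadata values matched with wildcards
            for key, glob in pattern.items():
                if not fnmatch.fnmatch(obj.get("metadata", {}).get(key, ""), glob):
                    return False
        elif pattern.startswith("== "):      # exact-match operator
            if obj.get(field) != pattern[3:]:
                return False
        elif pattern.startswith("starts "):  # prefix-match operator
            if not obj.get(field, "").startswith(pattern[7:]):
                return False
    return True

def project(obj, results_spec):
    """Build the result record named by the results specification."""
    out = {}
    for field, sub in results_spec.items():
        if isinstance(sub, dict):
            out[field] = {k: obj.get(field, {}).get(k, "") for k in sub}
        else:
            out[field] = obj.get(field, "")
    return out

objects = [
    {"objectID": "00007E7F0010EB9092B29F6CD6AD6824",
     "domainURI": "/cdmi_domains/MyDomain/",
     "parentURI": "/MyMusic/U2/",
     "metadata": {"artist": "U2 feat. Bono", "title": "Vertigo"}},
    {"objectID": "00007E7F0010AAAA92B29F6CD6AD0000",
     "domainURI": "/cdmi_domains/MyDomain/",
     "parentURI": "/MyPhotos/",
     "metadata": {"artist": "", "title": "Sunset"}},
]

scope = {"domainURI": "== /cdmi_domains/MyDomain/",
         "parentURI": "starts /MyMusic",
         "metadata": {"artist": "*Bono*"}}
results = {"objectID": "", "metadata": {"title": ""}}

hits = [project(o, results) for o in objects if matches_scope(o, scope)]
print(hits)
```

Only the first object passes all three scope clauses, and only the requested elements (objectID and the title metadata item) come back, mirroring the query-result JSON shown above.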

Data Reduction Research Notes

With enterprise data continuing to grow at rates that in some areas may exceed 100% year over year, according to IDC, many technical approaches to reducing overall storage needs are being investigated. The following is a short review of the areas in which interesting technical solutions have been implemented. One primary technique that has been receiving a lot of attention is deduplication, which can be divided into many areas. Papers covering deduplication overviews are currently available on the DPCO presentation & tutorial page, at http://www.snia.org/forums/dpco/knowledge/pres_tutorials. A new presentation by Gene Nagle (the current chairman of the DPCO) and Thomas Rivera will be posted there soon, and will be presented at the upcoming Spring 2012 SNW conference.

Other areas that have been investigated involve storage management rather than data reduction as such: implementing storage tiers, and creating new technologies, such as Virtual Tape Libraries and Solid State Devices, that ease the implementation of those tiers. Here are the areas that have seen the most activity.

Data reduction areas

• Compression
• Thin Provisioning
• Deduplication, which includes:
     o File deduplication
     o Block deduplication
     o Delta block optimization
     o Application-aware deduplication
     o Inline vs. post-processing deduplication
     o Virtual Tape Library (VTL) deduplication
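The first technique on the list, compression, is easy to demonstrate. The short sketch below (illustrative only, not tied to any product) uses Python's zlib to show how much the reduction ratio depends on the redundancy of the input:

```python
import os
import zlib

def reduction_ratio(data: bytes) -> float:
    """Fraction of bytes saved by DEFLATE compression (1.0 = everything saved)."""
    compressed = zlib.compress(data, level=6)
    return 1.0 - len(compressed) / len(data)

# Highly redundant input compresses extremely well...
redundant = b"the same log line repeats over and over\n" * 1000

# ...while already-random input is essentially incompressible.
random_like = os.urandom(40000)

print(f"redundant data: {reduction_ratio(redundant):.0%} saved")
print(f"random data:    {reduction_ratio(random_like):.0%} saved")
```

This is why compression ratios quoted by vendors must always be read against the workload they were measured on.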

Storage Tiering

Tiered storage arranges storage components in a structured organization so that data can be migrated automatically between components with significantly different performance and cost. These components vary widely in performance and throughput, location relative to the servers, overall cost, media type, and other characteristics. The policies built on these parameters to define each tier have significant effects, since they determine the movement of data among the tiers and the resulting accessibility of that data. An overview of storage tiering, called “What’s Old Is New Again”, written by Larry Freeman, is available on this DPCO blog, and he will also give a related presentation at the Spring 2012 SNW.
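A tiering policy of the kind described above can be sketched in a few lines. This is a deliberately simplified, hypothetical two-tier policy (the tier names and the 30-day threshold are illustrative, not from any product) that demotes cold data by last-access age:

```python
from dataclasses import dataclass

@dataclass
class Item:
    name: str
    days_since_access: int
    tier: str = "ssd"  # new data lands on the fast tier

def apply_tiering_policy(items, demote_after_days=30):
    """Demote cold data to the capacity tier; keep recently touched data fast."""
    for item in items:
        if item.days_since_access > demote_after_days:
            item.tier = "nearline"   # cold: move to cheap, slower storage
        else:
            item.tier = "ssd"        # hot: keep (or promote back) on SSD
    return items

catalog = [Item("db.log", 2), Item("q3-archive.tar", 90), Item("home.html", 10)]
apply_tiering_policy(catalog)
print({i.name: i.tier for i in catalog})
```

Real tiering engines weigh many more parameters (throughput, cost per GB, media type), but the structure is the same: a policy function mapping observed access behavior to a placement decision.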

SSD and Cache Management

Solid state memory has become quite popular, since its retrieval performance is so high; it can be used both as a much larger cache than was previously practical and as the top level of tiered storage. A good discussion of this is at http://www.informationweek.com/blog/231901631.
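The cache use of solid state memory can be illustrated with a toy read cache. The sketch below (a hypothetical model, not any vendor's design) stands in for an SSD tier with LRU eviction in front of slower backing storage:

```python
from collections import OrderedDict

class SSDReadCache:
    """Toy read cache with LRU eviction, modeling an SSD tier in front of disk."""
    def __init__(self, capacity, backing_store):
        self.capacity = capacity
        self.backing = backing_store        # e.g. a dict modeling slow disk
        self.cache = OrderedDict()
        self.hits = self.misses = 0

    def read(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)     # mark as most recently used
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        value = self.backing[key]           # slow path: go to disk
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used entry
        return value

disk = {n: f"block-{n}" for n in range(10)}
cache = SSDReadCache(capacity=3, backing_store=disk)
for n in [0, 1, 2, 0, 1, 2, 9, 0]:
    cache.read(n)
print(cache.hits, cache.misses)
```

A small, fast tier pays off exactly when the access stream has locality: the repeated reads of blocks 0-2 hit, while the scattered reads miss and fall through to disk.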

VTL

Storage presented as a virtual tape library allows integration with current backup software over various direct-attach or network connections, such as SAS, Fibre Channel, or iSCSI. A nice overview is at http://searchdatabackup.techtarget.com/feature/Virtual-tape-library-VTL-data-deduplication-FAQ.

Thin Provisioning

Thin provisioning is a storage reduction technology that uses storage virtualization to reduce overall usage; for a brief review, see http://www.symantec.com/content/en/us/enterprise/white_papers/b-idc_exec_brief_thin_provisioning_WP.en-us.pdf.
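The core idea of thin provisioning, presenting a large logical size while allocating physical blocks only on first write, fits in a short sketch (a toy model with made-up numbers, not any array's actual allocator):

```python
class ThinVolume:
    """Toy thin-provisioned volume: blocks are backed only when first written."""
    def __init__(self, logical_blocks, block_size=4096):
        self.logical_blocks = logical_blocks    # size advertised to the host
        self.block_size = block_size
        self.allocated = {}                     # block number -> data

    def write(self, block_no, data):
        assert 0 <= block_no < self.logical_blocks
        self.allocated[block_no] = data         # allocate on first write

    def read(self, block_no):
        # unwritten blocks read back as zeros without consuming capacity
        return self.allocated.get(block_no, b"\x00" * self.block_size)

    @property
    def physical_bytes(self):
        return len(self.allocated) * self.block_size

vol = ThinVolume(logical_blocks=1_000_000)      # ~4 GB advertised
vol.write(0, b"A" * 4096)
vol.write(42, b"B" * 4096)
print(vol.physical_bytes)                       # only 2 blocks are backed
```

The host sees the full logical capacity, but physical consumption tracks actual writes, which is what makes overcommitting a shared pool possible.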

Deduplication Characteristics & Performance Issues

Looking at the overall coverage of deduplication techniques, file-level deduplication can cover a high percentage of overall storage, which may offer a simpler and quicker route to data reduction. Block-level deduplication may introduce larger performance and support issues, adds a layer of indirection, and de-linearizes data placement, but it is needed for some files, such as VM and filesystem images. When deduplicating backup storage, however, these drawbacks may not be severe, since backup data is written sequentially and read back only rarely.
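The indirection layer that block-level deduplication adds can be seen in a toy store like the one below (an illustrative sketch using fixed-size chunks and SHA-256; real products vary chunking and hashing): each file becomes a recipe of chunk digests, and identical chunks are kept once.

```python
import hashlib

CHUNK = 4096

class DedupStore:
    """Toy block-level deduplication store: identical chunks are kept once."""
    def __init__(self):
        self.chunks = {}   # sha256 digest -> chunk bytes
        self.files = {}    # file name -> ordered digest list (the indirection)

    def put(self, name, data):
        recipe = []
        for i in range(0, len(data), CHUNK):
            chunk = data[i:i + CHUNK]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)   # store only unseen chunks
            recipe.append(digest)
        self.files[name] = recipe

    def get(self, name):
        # reads follow the indirection layer, so placement is de-linearized
        return b"".join(self.chunks[d] for d in self.files[name])

store = DedupStore()
base = b"".join(bytes([i]) * CHUNK for i in range(8))   # 8 distinct blocks
store.put("vm-image-1", base)
store.put("vm-image-2", base + b"\xff" * CHUNK)         # mostly a duplicate
print(len(store.chunks))   # 9 unique chunks held for 17 logical blocks
```

The two nearly identical "VM images" share all but one chunk, which is exactly the pattern that makes block-level deduplication worthwhile for image files despite the extra lookup on every read.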

One deduplication technique, sparse file support, maps chunks of zeros by recording their existence in metadata rather than storing them; it is available in the NTFS, XFS, and ext4 file systems, among others. In addition, the Single Instance Storage (SIS) technique, which replaces duplicate files with copy-on-write links, is useful and performs well.
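The sparse-file idea is simple to sketch: all-zero chunks are recorded only as metadata ("a hole lives here") rather than stored. The chunk size and encoding below are illustrative, not any filesystem's actual on-disk format:

```python
CHUNK = 4096

def sparse_encode(data):
    """Record all-zero chunks as holes in metadata instead of storing them."""
    stored, holes = {}, []
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        if chunk.count(0) == len(chunk):    # chunk is entirely zeros
            holes.append(i // CHUNK)        # metadata only, no bytes kept
        else:
            stored[i // CHUNK] = chunk
    return stored, holes

data = (b"\x00" * CHUNK * 2                  # leading hole: two zero chunks
        + b"live data".ljust(CHUNK, b"\x00") # one real chunk
        + b"\x00" * CHUNK)                   # trailing hole
stored, holes = sparse_encode(data)
print(holes, len(stored))   # holes at chunks 0, 1, 3; one chunk stored
```

Four logical chunks shrink to one stored chunk plus three metadata entries, which is why sparse support is so cheap for files dominated by zero runs.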

Source-side deduplication is complex, while storage-side deduplication is much simpler, so implementing deduplication at the storage site, rather than at the server site, may be preferable. Likewise, global deduplication in clustered or SAN/NAS environments can be quite complex and may lead to fragmentation, so local deduplication, operating within each storage node, is a simpler solution. One such approach uses a hybrid duplicate-detection model, aiming for file-level deduplication and reverting to segment-level deduplication only when necessary. This reduces the global problem to a simple routing issue: incoming files are routed to the node with the highest likelihood of possessing a duplicate copy of the file, or of parts of the file.
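The routing step can be sketched as a stateless hash of a file sample mapped to a node (the node names, 64 KiB sample size, and bucket scheme here are hypothetical, chosen only to illustrate the idea, not the scheme from the FAST paper):

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]

def route(file_bytes):
    """Route a file to a node by hashing a fixed-size sample of its content,
    so likely-duplicate files land on the same node and local deduplication
    within that node can catch them."""
    sample = file_bytes[:65536]              # first 64 KiB as the fingerprint
    digest = hashlib.sha256(sample).digest()
    return NODES[digest[0] % len(NODES)]     # map hash bucket to a node

copy1 = b"nightly backup payload " * 5000
copy2 = b"nightly backup payload " * 5000    # the same file arriving again
assert route(copy1) == route(copy2)          # duplicates meet on one node
print(route(copy1))
```

Because the routing decision depends only on content, no global chunk index is needed; the tradeoff, as the paper referenced below discusses, is between routing granularity and the deduplication that is lost when similar files straddle nodes.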

See “A Study of Practical Deduplication”, which won the best paper award at USENIX FAST 2011: http://www.usenix.org/events/fast11/tech/full_papers/Meyer.pdf. It references other papers that discuss experiments and measurements with deduplication and other data reduction techniques. Also look at the metrics discussed in “Tradeoffs in Scalable Data Routing for Deduplication Clusters” at http://www.usenix.org/events/fast11/tech/full_papers/Dong.pdf.


The Future of Flash & SSDs Is Not-So-Bleak

You may have seen articles about the study by a UCSD researcher claiming that the future of Flash (and NAND Flash-based SSDs) is bleak: http://www.networkworld.com/news/2012/021612-ssds-have-a-bleak-future-256255.html?source=NWWNLE_nlt_daily_am_2012-02-20

Well, SSSI member Allyn Malventano of PC Perspective begs to differ: http://www.pcper.com/reviews/Editorial/NAND-Flash-Memory-Future-Not-So-Bleak-After-All


SNIA ESF Sponsored Webinar on Advances in NFSv4

Good news.

The SNIA Ethernet Storage Forum (ESF) will be presenting a live webinar on the topic of NFS version 4, including version 4.1 (RFC 5661) as well as a glimpse of what is being considered for version 4.2. The expert on the topic will be Alex McDonald, SNIA NFS SIG co-chair. Gary Gumanow, ESF Board Member, will moderate the webinar.

The webinar will begin at 8am PT / 11am ET. You can register for this BrightTALK-hosted event here: http://www.brighttalk.com/webcast/663/41389.

The webinar will be interactive, so feel free to ask questions of the guest speaker. Questions will be addressed live during the webinar; answers to any questions not covered live will be posted afterward, along with the rest, on the SNIA ESF blog.

So, get registered. We’ll see you on the 29th.

Recommended Reading List on SSDs and Performance

SSSI has developed an extensive library of educational materials about SSD performance and how to use the SSS Performance Test Specifications to measure it.  If you’re new to SSDs or simply want to become more knowledgeable on the subject, we can help.

Below is a list of white papers, presentations, webcasts, and even a video that discuss SSDs, SSD performance and how it should be measured.  The list is in the recommended order of reading / viewing, and ranges from basic overviews to technical details.  Hope you find this useful.

  1. What more logical place to start than Solid State Storage 101?  This white paper talks about SSDs, how they work and how they fit into system architectures.
  2. Another white paper, NAND Flash Solid State Storage for the Enterprise, looks at Flash memory in more detail and how SSD controllers work.
  3. Facing an SSS Decision? Here is How SNIA is Helping Users Evaluate SSS Performance is a presentation that starts to delve into SSD performance and the basic principles of the SSS Performance Test Specification.
  4. The presentation Validating SSS Performance also introduces the SSS PTS, but in additional detail.
  5. The Solid State Storage Performance Test Specification (SSS PTS) White Paper provides an easily understandable introduction to the SSS PTS.
  6. Here’s a video of our own Eden Kim Describing the SSS PTS at Storage Visions 2012.
  7. SNIA Solid State Storage Test Specification is a more technical description of the contents of the SSS PTS.
  8. Now that you’ve read all about them, the actual SSS PTS documents can be downloaded here.
  9. And finally, SSSI has put together a webpage on Understanding SSD Performance, which explains the test results generated from the SSS PTS and what they mean to users.

You can find a lot of other informative material related to SSDs on the SSSI Education page.

If you have any questions, comments or requests, please comment on this post or send a message to asksssi@snia.org.

Solid State Storage Contributors Honored at SNIA Symposium

Passionate and dedicated volunteers are vital to the success of SNIA and its programs.  Congratulations to the following SNIA Solid State Storage honorees, selected by the entire SNIA member community, who were recognized for their 2011 contributions!

Volunteer of the Year recognizes an individual who, above all others in 2011, consistently stepped up and helped SNIA achieve something new and groundbreaking or who significantly advanced an existing program. Congratulations to Paul Wassenberg of Marvell, SNIA Solid State Storage Initiative Chair, winner of the 2011 SNIA Volunteer of the Year for his leadership in Solid State Storage education and outreach of SSSI activities, including the Enterprise and Client Performance Test Specifications and the Understanding SSD Performance Project.

The Industry Impact Honoree recognizes an individual who has significantly advanced a cause for SNIA leading to an impact on the industry or the Association. Congratulations to Eden Kim of Calypso Systems, SNIA Solid State Storage Technical Work Group, winner of the 2011 Industry Impact Award for his leadership on development of the Solid State Storage Enterprise and Client Performance Test Specifications.

The Most Significant Impact by a Technical Work Group recognizes the SNIA TWG, which above all others in 2011, had members and efforts which consistently stepped up and helped SNIA achieve something new and groundbreaking or which significantly advanced an existing program. Congratulations to the Solid State Storage Technical Work Group, honored in 2011 for their development of the Solid State Storage Enterprise and Client Performance Test Specifications.

Eden Kim, SNIA SSS TWG Chair, receiving "SNIA Industry Impact Award" from Wayne Adams, SNIA Chairman of the Board, and Leo Leger, SNIA Executive Director


Understanding SSD Performance Project

At last week’s Storage Visions conference, SSSI announced the Understanding SSD Performance project, which is intended to educate users about how to use the SSS PTS (Performance Test Specification) to make intelligent decisions about SSD performance.  You can find the press release here.

The project outcomes so far include a new webpage at www.snia.org/forums/sssi/pts, a white paper (www.snia.org/forums/sssi/knowledge/education), and a webcast.

Join us for the webcast on January 19 at 11AM Pacific Time by going to www.brighttalk.com/webcast/663/40549.