Storage Basics Q&A and No One’s Pride was Hurt

In the first of our “Everything You Wanted To Know About Storage But Were Too Proud To Ask – Part Chartreuse,” we covered the storage basics to break down the entire storage picture and identify the places where most of the confusion falls. It was a very well attended event and I’m happy to report, everyone’s pride stayed intact! We got some great questions from the audience, so as promised, here are our answers to all of them:

Q. What is parity? What is XOR?

A. In RAID, there are generally two kinds of data that are stored: the actual data and the parity data. The actual data is obvious; parity data is information about the actual data that you can use to reconstruct it if something goes wrong.

It’s important to note that parity is not simply a copy of the data (call two pieces of data A and B), but rather the result of a logical operation applied to that data. For RAID levels other than simple mirroring, the operation commonly used is an exclusive or, or XOR for short. The XOR function outputs true only when its inputs differ (one is true, the other is false).

There’s a neat feature about XOR, and the reason it’s used by RAID. Calculate the value A XOR B (let’s call it AxB). Here’s an example on a pair of bytes.

A                                  10011100

B                                  01101100

A XOR B is AxB              11110000

Store all three values on separate disks. Now, if we lose A or B, we can use the fact that AxB XOR B is equal to A, and AxB XOR A is equal to B. For example, for A;

B                                  01101100

AxB                              11110000

B XOR AxB is A              10011100

We’ve regenerated the A we lost. (If we lose the parity bits, they can just be reconstructed from A and B.)
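The same round trip can be sketched in a few lines of Python. This is only an illustration of the arithmetic above, using the same example bytes; the helper name `xor_bytes` is ours and not part of any RAID implementation.

```python
def xor_bytes(x: bytes, y: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    if len(x) != len(y):
        raise ValueError("blocks must have the same length")
    return bytes(a ^ b for a, b in zip(x, y))


# The example bytes from the text above.
a = bytes([0b10011100])
b = bytes([0b01101100])

# Parity is A XOR B (called AxB in the text).
parity = xor_bytes(a, b)

# Lose A: rebuild it from B and the parity. Lose B: rebuild it from A and the parity.
recovered_a = xor_bytes(b, parity)
recovered_b = xor_bytes(a, parity)

print(format(parity[0], "08b"))
print(format(recovered_a[0], "08b"))
print(format(recovered_b[0], "08b"))
```

The same function works on whole blocks rather than single bytes, which is closer to how parity is applied in practice.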

Q. What is common notation for RAID? I have seen RAID 4+1, and RAID (4,1). In the past, I thought this meant a total of 5 disks, but in your explanation it is only 4 disks.

A. RAID is notated by levels, which are determined by the way in which data is laid out on disk drives (there are always at least two). When attempting to achieve fault tolerance, there is always a trade-off between performance and capacity. Such is life.

There are 4 common RAID levels in use today (there are others, but these are the most common): RAID 0, RAID 1, RAID 5, and RAID 6. As a quick reminder from the webinar (you can see pictures of these in action there):

  • RAID 0: Data is striped across the disks without any parity. Very fast, but very unsafe (if you lose one, you lose all)
  • RAID 1: Data is mirrored between disks without any parity. Slowest, but you have an exact copy of the data so there is no need to recalculate anything to reconstruct the data.
  • RAID 5: Data is striped across multiple disks, and the parity is striped across multiple disks. Often seen as the best compromise: Fast writes and good safety net. Can withstand one disk loss without losing data.
  • RAID 6: Data is striped across multiple disks, and two independent sets of parity are distributed across the disks. Same advantages as RAID 5, except now you can lose 2 drives without losing data.
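To make the capacity side of the trade-off concrete, here is a minimal Python sketch of usable capacity and survivable disk failures for the four levels listed above. It assumes equal-sized disks, treats RAID 1 as a single mirror set, and uses the usual minimum disk counts (3 for RAID 5, 4 for RAID 6); the function name `raid_summary` is ours, purely for illustration.

```python
from typing import Tuple

# Minimum number of disks per level (assumed, conventional values).
MIN_DISKS = {0: 2, 1: 2, 5: 3, 6: 4}


def raid_summary(level: int, disks: int, disk_tb: float) -> Tuple[float, int]:
    """Return (usable capacity in TB, number of disk failures survived)."""
    if level not in MIN_DISKS:
        raise ValueError("only RAID 0, 1, 5 and 6 are covered here")
    if disks < MIN_DISKS[level]:
        raise ValueError(f"RAID {level} needs at least {MIN_DISKS[level]} disks")
    if level == 0:
        return disks * disk_tb, 0          # striping only, no redundancy
    if level == 1:
        return disk_tb, disks - 1          # every disk holds the same copy
    if level == 5:
        return (disks - 1) * disk_tb, 1    # one disk's worth of parity
    return (disks - 2) * disk_tb, 2        # RAID 6: two disks' worth of parity


for level in (0, 1, 5, 6):
    n = max(4, MIN_DISKS[level]) if level != 1 else 2
    print(f"RAID {level} on {n} x 2 TB disks:", raid_summary(level, n, 2.0))
```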

Now, if you have enough disks, it is possible to combine RAID levels. You can, for instance, have four drives that combine mirroring and striping. In this case, you can have two sets of drives that are mirrored to each other, and the data is striped to each of those sets. That would be RAID 1+0, or often called RAID 10. Likewise, you can have two sets of RAID 5 drives, and you could stripe or mirror to each of those sets, and it would be RAID 50 or RAID 51, respectively.

Erasure Coding has a different notation, however. It does not use levels like RAID; instead, EC identifies the number of data blocks and the number of parity blocks.

So, with EC, you take a file or object and split it into ‘k’ blocks of equal size. Then, you take those k blocks and generate n blocks of the same size, such that any k out of n blocks suffice to reconstruct the original file. This results in a (n,k) notation for EC.
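The "any k out of n" property can be demonstrated with a small Python sketch. It is a toy, not a production erasure code: we treat the k data symbols as the coefficients of a polynomial, evaluate it at n points, and recover the data from any k of those points by Lagrange interpolation. As a simplifying assumption we work modulo the prime 257 rather than in the GF(2^8) field that byte-oriented codes typically use, so an encoded symbol can take the value 256 and would not fit in a byte. The names `encode` and `decode` are ours.

```python
from itertools import combinations
from typing import List, Tuple

P = 257  # smallest prime above 255, so every byte value is a field element


def encode(data: List[int], n: int) -> List[Tuple[int, int]]:
    """Evaluate the polynomial whose coefficients are `data` at x = 1..n."""
    blocks = []
    for x in range(1, n + 1):
        y = 0
        for coeff in reversed(data):       # Horner's rule, highest degree first
            y = (y * x + coeff) % P
        blocks.append((x, y))
    return blocks


def decode(blocks: List[Tuple[int, int]], k: int) -> List[int]:
    """Recover the k coefficients from any k (x, y) pairs by interpolation."""
    pts = blocks[:k]
    coeffs = [0] * k
    for i, (xi, yi) in enumerate(pts):
        basis = [1]                        # Lagrange basis polynomial, low degree first
        denom = 1
        for j, (xj, _) in enumerate(pts):
            if j == i:
                continue
            nxt = [0] * (len(basis) + 1)   # multiply basis by (x - xj)
            for d, c in enumerate(basis):
                nxt[d] = (nxt[d] - xj * c) % P
                nxt[d + 1] = (nxt[d + 1] + c) % P
            basis = nxt
            denom = denom * (xi - xj) % P
        scale = yi * pow(denom, -1, P) % P  # modular inverse (Python 3.8+)
        for d, c in enumerate(basis):
            coeffs[d] = (coeffs[d] + scale * c) % P
    return coeffs


data = [72, 105, 33, 0]                    # k = 4 data symbols
blocks = encode(data, n=6)                 # n = 6 blocks in total
ok = all(decode(list(subset), 4) == data for subset in combinations(blocks, 4))
print("every 4-of-6 subset reconstructs the data:", ok)
```

With n = 6 and k = 4, any two of the six blocks can be lost and the original four symbols still come back.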

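Since RAID is a subset of EC, RAID 6 is the equivalent of an erasure code with n data disks and 2 parity disks, sometimes written RAID(n,2). Likewise, RAID(4,1) is RAID 5 with 4 data disks and 1 parity disk, for a total of 5 disks. Note that this (data, parity) form counts the parity separately, whereas in the (n,k) form described above n is the total number of blocks.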

Q. Which RAIDs are classified/referred to as EC? I have often heard people refer to RAID 5/6 as EC. Is this only limited to 5/6?

A. All RAID levels are types of EC. The math is slightly different; traditional RAID uses XOR, and EC uses Galois Fields or polynomial arithmetic.

Q. What’s the advantage of RAID5 over RAID1?

A. As noted above, there is a tradeoff between the amount of capacity that you need in order to stay fault tolerant, and the performance you wish to have in any system.

RAID 1 is a mirrored system, where you have a single block of data being written twice – one to each disk. This is done in parallel, so it doesn’t take any extra time to do the write, but there’s no speed-up either. One advantage, however, is that if a disk fails there is no need to perform any logical calculations to reconstruct data – you already have a copy of the intact data.

RAID 5 is more distributed. That is, blocks of data are written to multiple disks simultaneously, along with a parity block. That is, you are breaking up the writing obligations across multiple disks, as well as sending parity data across multiple disks. This significantly speeds up the write process, but more importantly it also distributes the recovery capabilities as well so that any disk can fail without losing data.
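A short Python sketch shows what "sending parity data across multiple disks" looks like. The rotation rule used here (parity starts on the last disk and moves one disk to the left on each stripe) is just one possible layout that we assume for illustration; real controllers use various rotation schemes. The function name `raid5_layout` is ours.

```python
from typing import List


def raid5_layout(num_disks: int, num_stripes: int) -> List[List[str]]:
    """Return one row per stripe; each entry is a data block label or 'P'."""
    if num_disks < 3:
        raise ValueError("RAID 5 needs at least 3 disks")
    layout = []
    block = 0
    for stripe in range(num_stripes):
        # Assumed rotation: parity moves one disk to the left every stripe.
        parity_disk = (num_disks - 1 - stripe) % num_disks
        row = []
        for disk in range(num_disks):
            if disk == parity_disk:
                row.append("P")
            else:
                row.append(f"D{block}")
                block += 1
        layout.append(row)
    return layout


for row in raid5_layout(num_disks=4, num_stripes=4):
    print("  ".join(f"{cell:>3}" for cell in row))
```

Because every stripe has exactly one parity block and it lands on a different disk each time, no single disk becomes the parity bottleneck, and losing any one disk leaves enough information in each stripe to rebuild it.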

Q. So RAID improves WRITES? I guess because it breaks the data into smaller pieces that can be written in parallel. If this is true, then why will READ not benefit from RAID? Wouldn’t reading those pieces from parallel sources and recombining them into a larger piece be faster?

A. RAID and the “striping” of IO can improve writes by reducing serialization, because we can write anywhere. But a specific block can only be read from the disk it was written to, and if we’re already reading from or writing to that disk and it’s busy, we must wait.
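The point about reads can be illustrated with a minimal Python sketch of a simple striped layout. We assume the plainest possible mapping (logical block b lives on disk b mod N), which is an illustration rather than how any particular array places data; `locate` and `queue_per_disk` are hypothetical helper names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def locate(logical_block: int, num_disks: int) -> Tuple[int, int]:
    """Map a logical block to (disk index, block offset on that disk)."""
    return logical_block % num_disks, logical_block // num_disks


def queue_per_disk(blocks: Iterable[int], num_disks: int) -> Dict[int, List[int]]:
    """Group read requests by the only disk that can serve each of them."""
    queues = defaultdict(list)
    for block in blocks:
        disk, _ = locate(block, num_disks)
        queues[disk].append(block)
    return dict(queues)


# Four consecutive blocks spread over four disks and can be read in parallel.
sequential = queue_per_disk(range(4), num_disks=4)

# Four blocks that all happen to live on disk 0 must wait in line behind each other.
unlucky = queue_per_disk([0, 4, 8, 12], num_disks=4)

print(sequential)
print(unlucky)
```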

Q. Why is EC better for object stores than RAID?

A. Because there’s more redundancy, EC can be made to operate across unreliable and less responsive links, and at potentially geographic scales.

Q. Can you explain about the “RAID Penalty?” I’ve heard it called “Write Penalty” or “Read before Write penalty.”

A. When updating data that’s already been written to disk, there’s a requirement to recalculate the parity data used by RAID. For example, if we update a single byte in a block, we need to read all the blocks, recalculate the parity, and write back the updated data block and the parity block (twice in the case of dual parity RAID6).
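Here is a small Python sketch of the parity recalculation for single-parity (XOR) RAID. It shows the full-stripe recalculation described above and, for comparison, a widely used shortcut that reads only the old data block and the old parity block and XORs in the change. Either way, something has to be read before the write can complete. The block contents and the helper name `xor_blocks` are made up for the example.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR any number of equal-length blocks together."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        if len(blk) != len(out):
            raise ValueError("blocks must have the same length")
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)


# A stripe of three data blocks plus its parity (illustrative contents).
stripe = [b"\x10\x20", b"\x0f\xf0", b"\xaa\x55"]
old_parity = xor_blocks(*stripe)

# Update one byte in the first block.
new_block = b"\x11\x20"

# Option 1: read every data block in the stripe and recompute parity from scratch.
full_recompute = xor_blocks(new_block, *stripe[1:])

# Option 2: read only the old data block and the old parity, then apply the change:
# new parity = old parity XOR old data XOR new data.
delta_update = xor_blocks(old_parity, stripe[0], new_block)

print(full_recompute == delta_update)
```

Option 2 costs two reads and two writes for a single-block update no matter how wide the stripe is, which is where the "read before write" name comes from.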

There are some techniques that can be used to reduce the performance impact. For example, some systems don’t update blocks in place, but use pointer-based systems and only write new blocks. This technique is used by flash-based SSDs, as the write size is often 256KB or larger. This can be done in the drive itself, or by the RAID or storage system software. Avoiding in-place updates is especially important when using Erasure Coding, as there are so many data blocks and parity blocks to recalculate and rewrite that an update would become prohibitive.

Q. What is the significance of RAIN? We have not heard much about it.

A. A Redundant Array of Independent Nodes works under the same principles as RAID – that is, each node is treated as a failure domain that must not become a Single Point of Failure (SPOF). Whereas RAID maintains an understanding of data placement on individual drives within a node, RAIN maintains an understanding of data placement on nodes (that contain drives) within a storage environment.

Q. Is host same as node?

A. At its core, a “node” is an endpoint. So, a host can be a node, but so can a storage device at the other end of the wire.

Q. Does it really matter what Erasure Coding (EC) technologies are named or is EC just EC?

A. Erasure Coding notation refers to the level of resilience involved. This notation underscores not only the write patterns for storage of data, but also the mechanisms necessary for recovery. What ‘matters’ really will depend upon the level of involvement for those particular tasks.

Q. Is the Volume Manager concept related to Logical Unit Numbering (LUNs)?

A. It can be. A volume manager is an abstraction layer that allows a host operating system to create a Volume out of one or more media locations. These locations can be either logical or physical. A LUN is an aggregation of media on the target/storage side. You can use a Volume Manager to create a single, logical volume out of multiple LUNs, for instance.

For additional information on this, you may want to watch our SNIA-ESF webcast, “Life of a Storage Packet (Walk).”
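To illustrate the volume manager answer above, here is a minimal Python sketch of a single logical volume built by concatenating several LUNs. It only shows the address translation; the class name `ConcatVolume`, the LUN names and the sizes are all hypothetical, and real volume managers also support striping, mirroring and much more.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Tuple


class ConcatVolume:
    """A logical volume made by laying LUNs end to end (sizes in blocks)."""

    def __init__(self, luns: List[Tuple[str, int]]):
        if not luns:
            raise ValueError("a volume needs at least one LUN")
        self.names = [name for name, _ in luns]
        # Running totals: ends[i] is the first volume block past LUN i.
        self.ends = list(accumulate(size for _, size in luns))

    def resolve(self, block: int) -> Tuple[str, int]:
        """Translate a volume block number into (LUN name, block within LUN)."""
        if not 0 <= block < self.ends[-1]:
            raise IndexError("block is outside the volume")
        idx = bisect_right(self.ends, block)
        start = self.ends[idx - 1] if idx else 0
        return self.names[idx], block - start


volume = ConcatVolume([("lun0", 100), ("lun1", 50)])
print(volume.resolve(99))    # last block of the first LUN
print(volume.resolve(100))   # first block of the second LUN
```

The host sees one volume of 150 blocks and never needs to know which LUN a given block actually sits on.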

Q. What’s the relationship between disk controller and volume manager?

A. Following on the last question, a disk controller does exactly what it sounds like – it controls disks. A RAID controller, likewise, controls disks and the read/write mechanisms. Some RAID controllers have additional software abstraction capabilities that can act as a volume manager as well.

We hope these answers clear things up a bit more. As you know, “Everything You Wanted To Know About Storage, But Were Too Proud To Ask” is a series. Since this Chartreuse event, we’ve done “Part Mauve – The Architecture Pod,” where we explained channel vs. bus, control plane vs. data plane and fabric vs. network. Check it out on-demand and follow us on Twitter @SNIAESF for announcements on upcoming webcasts.

 

 

Storage Performance Benchmarking Webcast Series Continues

Attendees cannot get enough of the SNIA Ethernet Storage Forum’s Storage Performance Benchmarking Webcast series. On March 8, 2016 our experts, Mark Rogov and Ken Cantrell, will return for the third installment of our series with “Storage Performance Benchmarking: Block Components.” This session aims to continue educating anyone untrained in the storage performance arts to ascend to a common base with the experts. In this Webcast, you will gain an understanding of the block components of modern storage arrays and learn storage block terminology, including:

  • How storage media affects block storage performance
  • Integrity and performance trade-offs for data protection: RAID, Erasure Coding, etc.
  • Terminology updates: seek time, rebuild time, garbage collection, queue depth and service time

As always, the event will be live and Mark and Ken will be on hand to answer your questions. I encourage you to register today. We hope to see you on March 8th!

New Webcast: Hierarchical Erasure Coding: Making Erasure Coding Usable

On May 14th the SNIA-CSI (Cloud Storage Initiative) will be hosting a live Webcast “Hierarchical Erasure Coding: Making erasure coding usable.” This technical talk, presented by Vishnu Vardhan, Sr. Manager, Object Storage, at NetApp and myself, will cover two different approaches to erasure coding – a flat erasure code across JBOD, and a hierarchical code with an inner code and an outer code. This Webcast, part of the SNIA-CSI developer’s series, will compare the two approaches on different parameters that impact the IT business and provide guidance on evaluating object storage solutions. You’ll learn:

  • Industry dynamics
  • Erasure coding vs. RAID – Which is better?
  • When is erasure coding a good fit?
  • Hierarchical Erasure Coding: The next generation
  • How hierarchical codes make growth easier
  • Key areas where hierarchical coding is better than flat erasure codes

Register now and bring your questions. Vishnu and I will look forward to answering them.

Ethernet Connected Drives Webcast Q&A

At our recent SNIA ESF Webcast “Visions for Ethernet Connected Drives” Chris DePuy of the Dell’Oro Group discussed potential benefits, use cases, and challenges of Ethernet connected drives. It’s not surprising that we had a lot of questions given that this market is in its infancy. As promised during our live event, here are answers to questions from the audience. If you think of additional questions, please feel free to comment on this blog.

Q. Will this also mandate new protocols to be used for storage like RDMA?

A. We did not receive any feedback from the technology companies we surveyed about RDMA specifically, but new protocols very well may be required to make effective and cost-effective use of eDrives. Storage systems offer many capabilities beyond just standard Ethernet networking and new protocols may be required to deliver those as well as new services in this new storage system architecture.

Q. Is White Box bought primarily by cloud customers?

A. Yes, in our research, substantially all White Box storage devices are purchased by cloud service providers.

Q. I may have missed it but aren’t we really talking about the HGST Open Ethernet Drive Architecture and the Seagate Kinetic Open Storage Platform? Both use Ethernet interfaces but HGST puts Debian on each HDD and Seagate has a key-value API for applications to directly write to the HDD. The actual deployment of these Ethernet HDDs would be in Ethernet Layer 2 switched backplanes in a 4U chassis being built by Supermicro, Xyratex (Seagate) and several others.

A. Given this was a presentation made to a neutral industry association, we chose not to discuss specific vendors. To answer your question, yes, we are talking about Ethernet Connected Drives from HGST and Seagate, but we also integrated feedback from other suppliers of related technology, including Toshiba. To your other point, yes, we have seen enclosures from various other vendors with embedded Ethernet switch technology connecting to the Ethernet drives. In our research for this webinar, we have also seen Ethernet switch technology embedded into enclosures that don’t use Ethernet connected drives. These enclosures convert traditional HDD interfaces internally, so the network still sees Ethernet as the outward-facing interface.

Q. Doesn’t that take space on the drive when you put CPU and more memory?

A. We asked this question, too, but learned that there is sufficient space to maintain the HDD and all the parts in the same form factors we historically have known.

Q. What can one implement in these internal processors used in Ethernet drives? For instance can we run erasure codes such as Jerasure or XOR based codes yet do the basic tasks needed for the Ethernet drives?

A. We did not receive specific feedback during the surveys for this webinar about where one would run erasure coding. Generally, though, the decision will lead to design considerations for which CPU and memory choices would be made for each drive, which in turn would change economics as to whether the overall system is affordable/feasible. Note that doing erasure coding on the drives increases the amount of intelligence required on the drive: for the arithmetic, for the requisite peer-to-peer networking, and for maintaining state information about the other relevant drives required for completing the erasure codes. New software to manage all this would be required as well.

Q. Can I run Ceph OSD plus Erasure code based on open source Jerasure in the Ethernet connected drive internal ARM processor?

A. We did not receive specific feedback during the surveys for this webinar about where one would run erasure coding. Generally, though, the decision will lead to design considerations for which CPU and memory choices would be made for each drive, which in turn would change economics as to whether the overall system is affordable/feasible.

Q. Erasure coding is more complex compared to RAID, how do I implement erasure coding with Ethernet drives?

A. We did not receive specific feedback during the surveys for this webinar about where or how one would run erasure coding.

Q. Do the economics include the cost of the Ethernet ports? If so, are you assuming unmanaged or managed Ethernet ports?

A. In the slides, we portrayed a simplistic capital spending model that considered just servers and hard drives. In reality, there are many other factors that play into both CAPEX and OPEX comparisons between conventional and Ethernet Connected Drive architectures. Examples include the cost differential between using Ethernet switching versus traditional HDD interfaces and how much memory and CPU is needed to support a particular use case.

Q. How does the increased number of network ports needed influence this price equation?

A. In the slides, we portrayed a simplistic capital spending model that considered just servers and hard drives. In reality, there are many other factors that play into both CAPEX and OPEX comparisons between conventional and Ethernet Connected Drive architectures. Examples include the cost differential between using Ethernet switching versus traditional HDD interfaces, and how much memory and CPU is needed to support a particular use case.

Q. I’m confused how Power and Cooling could be saved. If you need X number of drives to store data then you would need the same number of drives in the connected drive model wouldn’t you? Perhaps more if the e-drives lack efficiency features?

A. The general point is that proponents of Ethernet Connected Drives argue there won’t be a need for storage-oriented servers, and so the savings would result from there being fewer of them consuming power.

Q. I guess the protocol would change commanding the drives?

A. There is no single approach that has been agreed upon. During the presentation, we said there are multiple technical approaches, one of which includes using Key Value APIs, and the other is to install an Operating System onto each drive that could run whatever you want on it.

Q. Are Ethernet connected drives JBODs on Ethernet?

A. Yes, that is the way we view it, too. Sometimes they are even called, “eBODS” where the traditional JBOD controller is replaced with an Ethernet switch.

Q. How is data protected, i.e., RAID or some other mechanism?

A. In our surveys, we learned that the most common method would be to leverage erasure coding that is commonly associated with object oriented storage systems.

Q. How will photonics impact this concept?

A. Photonics is involved in data center Ethernet for higher speed communications. In our surveys, we did not encounter a single instance of a vendor discussing photonics at the Ethernet Connected Drive. For HDDs, 1GbE provides more than enough bandwidth for the drive.

Q. Are the servers today connecting the storage just dumb boxes that expose storage? Don’t they do processing as well? With Ethernet drives we’re removing that computational node it seems.

A. This is a very good point. Today’s conventional storage systems have significant computing capabilities – we think these could be used to do computing as well as performing storage-oriented tasks as they do primarily today. We expect that in the future, the servers that are packaged in external storage systems will be organized in a way that allows customers to run storage functions as well as more traditional purposes that would allow us to just call them ‘servers.’ In fact, there are several startups that are popularizing this idea.

Q. When it comes to HDD manufacturers there are only three left…WD (HGST), Seagate (Samsung) and Toshiba. When it comes to SSD or flash drives there are more manufacturers. Seagate is using a dual Serial Gigabit Media Independent Interface (SGMII) on its Kinetic HDDs. What other ways are there to do Ethernet on an HDD?

A. We did not receive any feedback from the technology companies we surveyed about this topic. Note, that SNIA recently started an “Object Drive Technical Work Group” to help drive standards for Ethernet-connected drives. If this topic is of interest, we encourage you to join that TWG.

Q. Have you seen any indication of a ratio between CPU power and Memory vs. the size of the storage? What is the typical White Box? EG Intel (version?) Memory (in GB?) Storage (in TB?)

A. The use cases we presented are based on vendor-supplied viewpoints that implicitly incorporate the answers to your question, but don’t specifically address it. What we learned is that in these use cases, there is an assumed positive TCO savings, but not every vendor agrees with these calculations – again without providing specifics like the ones you are asking about.

Q. How can you eliminate the object servers? You still need that functionality somewhere if you ever hope to find the data again, or protect it… You may move away from dedicated Object servers but that code has to run somewhere thus saying they are eliminated is wrong…

A. This is a very good point. The use cases offered to us suggest that this code would either reside in the Ethernet Connected Drive, or on the server running an application itself, or both. This is why we made the point that the applications would have to be re-written to take advantage of the proposed new architecture.

Q. Is the cost of Ethernet HDDs expected to be the same as current HDDs and why?
Ethernet HDDs have more processing capabilities so shouldn’t they cost more (is that 10% more?)

A. Correct. If more components were added to an otherwise identical HDD, then the cost would be greater. This is central to one of the main dissenting views we learned about during the survey process. It does raise the question of whether it makes sense to deliver underlying HDDs that are NOT identical to traditional HDDs to offset costs somehow – maybe with lower speeds – or whether these Ethernet Connected Drives would be sold at lower margins by the HDD vendors.

Q. Do Server power TCO numbers take account of lower power consumption of next generation servers as indicated by Intel?

A. We do not know what version of servers was used in these vendor-supplied TCO calculations.

Q. If you are planning to offload processing to the processor on the HDD then you are assuming that the HDD vendors will expose those drives for user access – is there any evidence of this?

A. There is no single approach that has been agreed upon, and therefore no single answer to this question. During the presentation, we said there are multiple technical approaches, one of which includes using Key Value APIs, and the other is to install an Operating System onto each drive that could run whatever you want on it.

Q. How is redundancy handled on eHDD based appliance… aka a drive fails?

A. The custom-built software would presumably be developed to handle this. And obviously, the eHDD has to add enough CPU and memory to manage all this — which of course adds cost.

Q. It seems that with the CPUs on each drive, the archive, object or whatever the application would need to be rewritten to support this specific method of parallel processing. Is anyone doing this now?

A. During the survey process, we learned that many applications were being ported to this environment, some of which apparently do take advantage of parallel computing. Given we were planning to immediately divulge information to the public, we were not presented with details.

Q. What is nearline storage?

A. This term came from some of the technology companies we surveyed. As they described it, nearline storage is a more traditional storage system you might see in an enterprise, where many drives are stopped (not rotating) and are spun up when a request comes in.

Q. Why are analytics specifically optimized for Ethernet attached storage devices – the presenter seems to anticipate that processing can be pushed onto the drive, and if this is the case why can’t other drive interfaces do this – PCIe attached storage should be even more amenable for this.

A. The presenter was sharing views compiled by the responses of various technology companies during a series of interviews conducted before this webcast. Analytics is a large, growing industry today and exists without Ethernet Connected Drives. Some of the companies surveyed offered the view that putting processing capabilities into each HDD may enhance the overall system’s performance.

Q. Can the presenter comment on the value of scale-out for E-Drives, versus legacy SAN scale out?

A. Some of the technology companies interviewed by the presenter suggested that systems based on Ethernet Connected Drives may scale to larger capacities than traditional architectures on the basis that the storage-oriented servers no longer present an impediment to scaling.

Q. Just as object storage addresses RAID smart drives could provide the meta data needed by the swift controllers to do deduplication, or the controller may do deduplication as a pre-process or post process like we have seen on NetApp or Data Domain evolve over years.
If we use optic connections the port density issue is resolved and this ends up looking like something from 2001 (the movie), correct?

A. Photonics is involved in data center Ethernet for higher speed communications. In our surveys, we did not encounter a single instance of a vendor discussing photonics at the Ethernet Connected Drive. As noted above, 1GbE is more than sufficient for eHDDs.

Q. FYI…48TB Capacity Kinetic Storage Appliance $5000.00 street price
White Box 2U Dual Xeon storage server with 48TB RAW…$8000 street price

A. Thank you for sharing! You may have noticed we did not mention specific vendors during the presentation – perhaps others viewing your question will take note of your viewpoint.
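For readers who want to compare the two street prices quoted in the question, a quick bit of Python arithmetic gives the cost per raw terabyte. It uses only the figures supplied by the attendee and ignores everything else that goes into a real comparison (usable versus raw capacity, switching, software, power); the function name `cost_per_tb` is ours.

```python
def cost_per_tb(street_price_usd: float, raw_tb: float) -> float:
    """Street price divided by raw capacity."""
    return street_price_usd / raw_tb


# Figures as quoted in the question above.
kinetic = cost_per_tb(5000.00, 48)
white_box = cost_per_tb(8000.00, 48)

print(f"Kinetic appliance: ${kinetic:.2f} per raw TB")
print(f"White Box server:  ${white_box:.2f} per raw TB")
print(f"Ratio: {kinetic / white_box:.3f}")
```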

Q. To the extent that hyperscale cloud environments have servers with open sockets or slots for direct attach storage of drives, how are there financial savings to connect through Ethernet instead of direct attach? Will servers of the future remove these slots and sockets? Are there other cluster wide benefits with regards to performance for data accessed directly through the network instead of through the server with the local storage, when the data is accessed by a large number of servers?

A. Hyperscalers are buying storage-related hardware at a fraction of the price that systems OEMs are selling it for, mainly because they do not demand the software that enterprises value so much – they leverage open source and make their own for their very specific needs. If you look at the slide about the ‘White Box Effect’ in the presentation, you get a sense for just how much less they, or anyone else who buys a White Box, pay. But make no mistake about it, these devices don’t do much unless you integrate them into a working system intended to store and safely retain data. To answer your question, we observe that these hyperscalers are such large customers of components and systems that they could choose to request custom hardware designs with customized specifications – more of this kind of interface, fewer of that kind, etc. As an analogy, in the networking industry, the handful of hyperscalers are among the largest buyers of underlying network technology like processors, Ethernet interfaces and optics – and in fact these customers are larger than most vendors.

Q. Why would each drive not know about other drives storage? How does this differ from existing storage servers?

A. In the traditional storage architecture, a central system is involved. The dissenting viewpoint we received from some of the technology companies we interviewed was a counterpoint that may exist only under certain design scenarios. Our view is that if a system is designed with the goal in mind to make each drive aware of each other’s contents, then that is technically possible of course. But at a cost, as you add CPU, memory, and software to do this.

Q. I can see flash and Wi-Fi Ethernet connected drives providing Internet of Things storage for values that can be harvested independent of when the value was stored. Thus getting a low power system that could live off of USB type power or power over Ethernet being why corporations would look at this.

A. I think the point you are making is that flash consumes very little power, right? This revolutionary technology (let’s just say non-volatile memory, to keep it general) is causing all kinds of disruptive changes in the storage industry, and as costs come down for NVM, all kinds of different scenarios become possible.

Q. Cost model might need to include a simpler lower cost local server with the Ethernet drive clusters by adding a cost item to the left side of their equation, comments?

A. Agreed – the equation we provided was simplistic and could be expanded to include many other terms and other simultaneous equations as well. We just thought that providing it would frame the discussion on the slide instead of just saying it verbally.

Q. Obviously, it will be higher, but how do you envision this changing Ethernet bandwidth requirements? Will Ethernet connected drives only become a reality once 40, 25, 100 Gb becomes the mainstream for Ethernet networks?

A. Network bandwidth needs will be a function of how the servers interact with the drives. I can see scenarios where traffic might be kept more local, or where asking each drive for ‘the answer’ instead of ‘all of its data’ to be processed in a server might actually run counter to your premise that traffic increases. The point I’m getting to is that it depends on what applications these Ethernet Connected Drives are used for. Nevertheless, the old adage that all available installed bandwidth will be consumed has not yet been repealed publicly that I’m aware of.

Q. With Ethernet connected drives, are we still stuck with the fundamental issue that HDDs are transactionally inefficient, and thus, while this is a novel concept, the basic drive remains the bottleneck unless its transactional efficiency is improved?

A. We think HDDs will co-exist with Flash/NVM for a very long time. Some very smart engineers are working to make this co-existence increasingly efficient, taking into account the strengths and weaknesses of both storage media.