David Fair – SNIA on Storage

Q&A on Exactly How iSCSI has Evolved

June 3, 2016 David Fair

Our recent SNIA ESF Webcast, “The Evolution of iSCSI,” drew a big and diverse group of attendees. From beginners looking for iSCSI basics to experts with a lot of iSCSI deployment experience, there were plenty of good questions. Our presenters, Andy Banta and Fred Knight, did a great job answering as many as they could during the live event, but we didn’t have time to get to them all. So here are answers to every question we received. And by the way, if you missed the Webcast, it’s now available on-demand.

Q. What are the top 3 reasons to choose iSCSI over FC SAN?

A. 1. Use of commodity equipment and protocols. It means that you don’t have to set up a completely separate network. It means you don’t have to buy separate HBAs. 2. Inherent networking capability. Built on top of TCP/IP, it benefits from any networking technology to come along. These include routing, tunneling, authentication, encryption, etc. 3. Ease of automation and configuration. In its simplest form, an iSCSI host only needs to know the IP address of the target system. In more complex systems, hosts and storage provide APIs to allow automation through scripting or management tools.

Q. Please comment on why SCSI went from being a widely used protocol for all sorts of devices to being focused as only essentially a storage protocol?

A. SCSI was originally designed as both a protocol and a bus (original Parallel SCSI). Because there were no other busses, the SCSI bus did it all: disks, tapes, scanners, printers, optical drives (CDs), media changers, etc. As other busses came onto the market (think USB), many of those devices moved to the new bus (CDs, printers, scanners, etc.). Commodity devices used commodity busses (IDE, SATA, USB), and enterprise devices used enterprise busses (FC, SAS); and so disks, tapes, and media changers mostly stayed on SCSI.

The name SCSI can be confusing for some, as the term originally was used for both the SCSI protocol and the SCSI bus. The term for the SCSI protocol is all that remains today; the SCSI bus (the old SCSI parallel bus) is no longer in wide use. Today, the FC bus, or the SAS bus, or the SoP bus, or the SRP bus are used to carry the SCSI protocol. The SCSI Architecture Model (SAM) describes a very distinct separation between the device layer (the SCSI protocol) and the transport layer (the bus).

And the SCSI command set has become the basis for many subsequent command sets. The JEDEC group used the SCSI command set as a model (JEDEC devices are in your cell phone), the ATAPI devices used SCSI commands, and many SCSI commands and SATA commands have a common heritage. The Mt. Fuji group (a standards group in Japan) also uses SCSI as the basis for new DVD and Blu-ray devices. So, while not widely known, the SCSI command family has grown well beyond what is managed by the ANSI/INCITS T10 committee that originally defined SCSI, into a broad set of capabilities that are used across the industry by a broad group of organizations. But, that all said, scanners and printers are still on USB, and SCSI is almost all about storage in one form or another.

Q. How does iSCSI support software-defined storage?

A. Answered during the talk. SDS provides more automation and knobs on the storage capabilities. But SDS still needs a way to transport the storage and iSCSI works perfectly fine for that. They are complementary technologies, not competing.

Q. With 40Gb and faster coming soon to a server near you, what kind of impact will that have on CPU utilization? Will smaller servers be able to push that much traffic?

A. More throughput simply requires more CPU. With good multithreaded drivers available, this can mean simply adding cores to keep the pipe as full as possible. As we mentioned near the end, using iSCSI with RDMA lightens the load on the CPU even more, so you’ll probably be seeing more of that.

Q. Is IPSec commonly supported on iSCSI targets?

A. Yes, IPsec is required to be implemented on an iSCSI target to be a compliant device. However, it is not commonly enabled by customers. And because a compliant device MUST provide IPsec, there are a lot of non-compliant initiators and targets on the market.

Q. I’m told direct connect with iSCSI is discouraged, that there should be a switch in place to handle the buffering, latency, acknowledgement etc….. Is this true or a best practice to make sure switches are part of the design?

A. If you have no need to connect to multiple targets or multiple initiators, there’s no harm in direct connections.

Q. Ethernet was not designed to support storage traffic. The TCP/IP protocol suite was not designed to support storage traffic. SCSI was not designed to be encapsulated. So TCP/IP FTW? I think not. The reason iSCSI exists is [perceived] cost savings. I get fed up with people constantly looking for ways to squeeze another penny out of something. To me it illustrates that they’re not very creative. Fibre Channel is a stupid name, but it is a purpose-built protocol that works as designed.

A. Ethernet is a general purpose network. It is capable of handling lots of different traffic (including storage). By putting iSCSI onto an existing Ethernet infrastructure, it can (as you point out) create a substantial cost savings over installing a FC network (although that infrastructure savings comes with other costs – such as the impact of a shared wire). However, installing a dedicated Ethernet network provides many of the advantages of a dedicated FC network, but at an added cost over that of a shared Ethernet infrastructure. While most consider FC a purpose-built storage network, it is worth pointing out that some also consider it a general purpose network (for example FC-Avionics is built into Fighter Jets, and it’s not for storage). And while not designed to be encapsulated, (it was designed for a parallel bus), SCSI today is encapsulated on every transport that carries it (yes, that includes FCP and SAS).
There are many kinds of storage at different price points: USB storage, SATA devices, rotating media (at different RPMs), SSD devices, SAS devices, FC devices, single spindles, arrays, cloud, drop boxes, etc., all with the corresponding transport wires. iSCSI is one of those wires. Each protocol and wire offers specific advantages and disadvantages. There can be a lot of confusion about which to use, but just as everyone does not drive the same type of car (a FORD FUSION for example), everyone does not need the same type of storage (FC devices/arrays). Yes, I drive a FORD FUSION, and I like FC storage, but I use a USB stick on my laptop, and I pray my bank never puts my financial records out in the cloud. Selecting the right storage (and wire) for the job at hand can be one of a system administrator’s most interesting problems to solve.

As for the name – that is often what happens in committees…

Q. As a best practice for Windows servers, disable hardware acceleration features in NICs (TOE etc.)? Are any NIC features valuable given modern multicore CPUs?

A. Yes. Typically the only reason to disable TOE is that multiple or virtual TCP/IP stacks are going to be using the same NIC. TSO, LRO and jumbo frames will benefit any OS that can take advantage of them.

Q. What is the advantage of iSCSI when compared with NVMe?

A. NVMe and iSCSI are very different protocols. NVMe started life as a direct attach protocol to communicate to native PCIe devices (not even outside the box). iSCSI was a network protocol from day one. iSCSI has to deal with the potential for long network induced delays, and complex out of order error recovery issues. NVMe operates over an interlocked bus, and as such, does not have those issues.

But, NVMe is now being extended over fabrics. NVMe over a RoCE V1 transport will be a data center network (since there is no IP routing). NVMe over a RoCE V2 transport or an iWARP transport will have the same routing capabilities that iSCSI has. When it comes to the raw command set, they are very similar (but there are some differences). SCSI is a more full featured command set than NVMe – it has been developed over a span of over 25 years, and has developed solutions for all the problems that have been discovered during that time span. NVMe has a more limited (or more focused) command set (for example, there are no tape commands in the NVMe command set). iSCSI is available today, as is direct attach NVMe, but NVMe over Fabrics is still in the development phases (the specification is expected to be available the first week of June, 2016). NVMe products will take some time to mature and to develop solutions for the problems they have not discovered yet. Another example of this is the ability to support shared storage – it existed on day one in iSCSI, but did not exist in the first NVMe specification. To support shared storage in NVMe over Fabrics, that capability has since been added, and it was done using a SCSI compatible method (to make it easier for host S/W that already performs this function).

There is a large community working to develop NVMe over Fabrics. As memory-based storage devices get cheaper, and the solution space matures, NVMe will become more attractive.

Q. How often do iSCSI installations provide encryption of data in flight? How: IPsec, IKEv2-SCSI + ESP-SCSI, etc.?

A. Rarely. More often than not, if in-flight data security is needed, it will be run on an isolated network. Well under 100% of installations are 100% compliant. VMware never qualified IPsec with iSCSI and didn’t have any obvious switch to turn it on. Side note: We standards guys can be overly picky about words. Since the question is “provide” the answer is – 100% of compliant installations PROVIDE encryption (IPsec V2 – see above), however, in practice, installations that require that type of security typically run on isolated networks, rather than turn on encryption.

Q. How do multiple independent applications inside the same initiator map to iSCSI sessions to the same target? E.g., iSCSI session one-to-one with application?

A. There is no relationship between applications and sessions. When an iSCSI initiator discovers a target, the initiator logs in and establishes a session. If iSCSI MCS (multi connection session) is being used, multiple TCP connections may be established and used in parallel to process operations for that session.

Applications send reads and writes to the operating system. Those IO requests make their way through the file system and caching layers into the device driver. The device driver issues the IO request to the device (over the iSCSI session) and retains information about that IO. When a completion is received from a device (the WRITE command or READ command completed), it is matched up with the request. That completion status (success or error) is passed back through the operating system (file system, etc.) to the application. So it is the responsibility of the device driver to mux/demux the requests from all the applications out over the iSCSI session and track the responses as the operations are completed.
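The mux/demux role described above can be sketched in a few lines of Python. This is a toy model only: the class name, the integer tags, and the application names are invented for illustration and are not taken from any real iSCSI driver.

```python
import itertools


class SessionMux:
    """Toy model of a driver multiplexing I/O from many applications onto one session."""

    def __init__(self):
        self._next_tag = itertools.count(1)   # hypothetical per-session request tags
        self._outstanding = {}                # tag -> (application, operation)

    def submit(self, app, op):
        """Remember the request and return the tag that travels with it."""
        tag = next(self._next_tag)
        self._outstanding[tag] = (app, op)
        return tag

    def complete(self, tag, status):
        """Match a completion to its request and report who should receive it."""
        app, op = self._outstanding.pop(tag)
        return app, op, status


mux = SessionMux()
t1 = mux.submit("database", "WRITE")
t2 = mux.submit("backup", "READ")
# Completions may come back in any order; the tag identifies the request.
print(mux.complete(t2, "success"))
print(mux.complete(t1, "success"))
```

The session never needs to know which application issued a request; only the driver's table of outstanding requests does.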

When an operating system is using MPIO (multi-pathing), then the device driver may create multiple sessions between the initiator and the target. This is where operating system MPIO policies such as round-robin, shortest queue, LRU, etc. come into play. In this case, the MPIO driver will send an IO operation to the device using what it considers to be the most appropriate path (based on the selected policy). But again, there is no relationship between the application and the path used for IO (any application can have its IO sent via any path).

Today, MPIO is used more commonly than MCS.
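To make the MPIO policies concrete, here is a minimal Python sketch of two of the policies named above, round-robin and shortest queue. The path names and queue depths are hypothetical, and real MPIO drivers track much more state (path health, weights, failover), so treat this as an illustration of the selection step only.

```python
import itertools


def round_robin(paths):
    """Return a selector that cycles through the paths in order."""
    cycle = itertools.cycle(paths)
    return lambda queue_depths: next(cycle)


def shortest_queue(paths):
    """Return a selector that picks the path with the fewest outstanding I/Os."""
    return lambda queue_depths: min(paths, key=lambda p: queue_depths[p])


paths = ["session-A", "session-B"]              # hypothetical sessions to one target
depths = {"session-A": 4, "session-B": 1}       # hypothetical outstanding I/O counts

rr = round_robin(paths)
sq = shortest_queue(paths)
print([rr(depths) for _ in range(4)])   # alternates between the two sessions
print(sq(depths))                       # picks the less busy session
```

Note that neither selector takes the issuing application as an input, which is the point made above: the path is chosen per I/O, not per application.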

Q. Will Microsoft iSCSI implement iSER?

A. This is a question for Microsoft or an iSER-capable NIC vendor that provides Microsoft drivers.

Q. Zadara has some iSER deployments using Linux and VMware clients going to the Zadara cloud storage.

A. There’s an answer, all by itself.

Q. In the case of iWARP, the TCP layer takes care of out-of-order IP packet receptions. What layer does the out-of-order management of packets in RoCE?

A. RoCE headers contain a 24 bit “Packet Sequence Number” that is used to validate the required ordering and detect lost packets. As such, ordering still occurs, just in a different way.
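As a rough illustration of how a 24-bit sequence number can be used to detect lost packets, here is a Python sketch. Only the 24-bit width comes from the answer above; the half-window rule used to tell a gap from a stale packet is a common serial-number convention assumed here for illustration, and the function names are hypothetical.

```python
PSN_BITS = 24
PSN_MODULUS = 1 << PSN_BITS   # sequence numbers wrap after 2**24 packets


def next_psn(psn):
    """Advance a packet sequence number, wrapping at 24 bits."""
    return (psn + 1) % PSN_MODULUS


def classify(expected, received):
    """Compare a received PSN with the expected one, allowing for wraparound."""
    delta = (received - expected) % PSN_MODULUS
    if delta == 0:
        return "in order"
    if delta < PSN_MODULUS // 2:          # assumed half-window rule
        return "gap: {} packet(s) missing".format(delta)
    return "duplicate or stale"


print(classify(5, 5))
print(classify(5, 7))
print(classify(PSN_MODULUS - 1, 1))   # a gap that spans the wraparound point
```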

Q. Correction: RoCE is over Ethernet packets and is not routable. RoCEv2 is the one over UDP/IP and *is* routable.

A. You are correct. RoCE is not routable by IP. RoCE transmits raw Ethernet frames with just Ethernet MAC headers and no IP headers, and as such, it is not routable by IP. RoCE V2 puts the information into UDP packets (with appropriate IP headers), and therefore it is routable by IP.
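The difference can be sketched as a simple classifier over already-parsed header fields. The numeric constants (the RoCE EtherType and the RoCEv2 UDP destination port) are not stated in the answer above; they are the commonly documented values, and the function itself is a hypothetical illustration rather than code from any network stack.

```python
ROCE_V1_ETHERTYPE = 0x8915   # EtherType used by RoCE (v1) frames
IPV4_ETHERTYPE = 0x0800
IPV6_ETHERTYPE = 0x86DD
UDP_PROTOCOL = 17
ROCE_V2_UDP_PORT = 4791      # UDP destination port used by RoCEv2


def classify_frame(ethertype, ip_protocol=None, udp_dst_port=None):
    """Classify a frame from header fields that a parser has already extracted."""
    if ethertype == ROCE_V1_ETHERTYPE:
        return "RoCE v1: no IP header, not routable by IP"
    if (ethertype in (IPV4_ETHERTYPE, IPV6_ETHERTYPE)
            and ip_protocol == UDP_PROTOCOL
            and udp_dst_port == ROCE_V2_UDP_PORT):
        return "RoCE v2: UDP/IP encapsulated, routable by IP"
    return "not RoCE"


print(classify_frame(0x8915))
print(classify_frame(0x0800, ip_protocol=17, udp_dst_port=4791))
```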

Q. How prevalent is iSER today in deployment? And what are some of the typical applications that leverage iSER?

A. Not terribly prevalent today, but higher speed Ethernet might drive more adoption, due to the CPU savings demonstrated.

Find out How iSCSI is Evolving

May 4, 2016 David Fair

The next Ethernet Storage Forum Webcast, “Evolution of iSCSI including iSER, iSCSI over RDMA Ethernet,” will focus on developments with iSCSI – the Internet Protocol standard for transferring SCSI commands across an Ethernet network, enabling hosts to link to storage devices wherever they may be. At this Webcast on May 24th, I will be joined by Fred Knight, Standards Technologist at NetApp, and Andy Banta, Storage Janitor at SolidFire/NetApp, who will discuss the evolution of iSCSI up to iSER, which takes advantage of Ethernet RDMA fabric technologies to enhance performance. Register now to hear:

A brief history of iSCSI
How iSCSI works
IETF refinements to the specification
Enhancing iSCSI performance with iSER

The Webcast will be live, so please bring your questions for Andy and Fred. We hope to see you there!

A Q&A on Storage Performance Benchmarking: Block Components

April 21, 2016 David Fair

For the third time, our storage performance benchmarking experts, Ken Cantrell and Mark Rogov, have generated an abundance of interest (in the form of questions) on block storage performance. If you missed the Webcast, “Storage Performance Benchmarking: Block Components,” it’s available on demand. It was no small effort to answer all the great questions that we received. And for those of you who have been waiting, we apologize, but we think the detailed and thoughtful answers Mark and Ken have put together are well worth the wait.

Q1: Are these numbers applicable to the 90th percentile for any given storage array, please?

Mark: These numbers represent HDD/SSD performance numbers. They aren’t meant to represent any particular storage array vendor’s performance. See the end of our presentation (bottleneck analysis) as to why it is really really hard to answer your question.

Q2: How about NVDIMM-F or NVDIMM-P or NVDIMM-X claiming 3-4M IOPS type of Enterprise storage devices?

Ken: Yup. They’re fast.

There’s a great presentation by Jim Handy titled “Understanding the Intel/Micron 3D XPoint Memory” presented at SDC2015 that I’d recommend you take a look at to understand more about this kind of memory and its possible positioning.

Mark: Great question. I think the conclusion of our presentation answers it. Flash (and we use flash as a collective term, defining everything that is not spinning storage to be “flash”) is drastically faster than spinning drives. But even within Flash, there are plenty of new technologies which compete with each other and improve the overall performance landscape. So, within the scope of our presentation, even a simple good old SLC drive tops the capability of a SAS line. If we improve on one drive, by switching the technology to a faster/newer/better variant (e.g., NVDIMM-F), or by stacking the drives, the resulting set will much more likely expose the limitations of the “regular” storage array.

Q3: I’d like to know which tool you are using to measure IOPS if possible.

Ken: The SNIA Solid State Storage Initiative (SSSI) has developed substantial expertise in the area of SSD performance and behavior. The SSS Performance Test Specifications were developed by the SNIA SSS Technical Work Group (TWG) and define how to measure SSD performance in a manner that is accurate, repeatable and enables comparison between different manufacturers’ products. Learn more about the SSD Performance Project here.

All of the Flash and HDD numbers at the beginning of the presentation were taken directly from the Solid State Storage Performance Test Specification summary results (SSS PTS). The SSS PTS provides a comprehensive method for measuring flash performance in the most vendor neutral approach that I’ve seen.

The Flash and HDD numbers at the end of the presentation were 80% of the starting numbers – scaled down to make them slightly more like what we’ve seen in a greater number of environments (that aren’t pushing their drives as hard).

Q4: Throughputs with SSD is not as much as one can get from a spinning drive when one keeps cost/GB on the axis. Comments please.

Ken: Now we have 3 axes? I’m not even sure how to visualize what you’re asking, but I’m pretty sure I understand the intent … and this is a harder question than it would appear on the surface. Why?

First off, prices aren’t my thing – I tend to focus on the internals and let the sales guys talk prices. Additionally, vendors often engage in significant discounting or bundling that makes it difficult for the average person (i.e., me) to understand true costs.
The astounding random I/O performance of flash enables support for compression and deduplication without dramatically increasing client-perceived latency. There’s a reason you see so many vendors offering inline deduplication and inline compression now when they did not even five years ago – flash is the enabler that makes this happen. So what is the true comparison? Raw HDD vs Raw flash? Or Raw HDD vs flash plus the storage efficiency (SE) savings it enables? If flash with SE features (dedupe and compression), then what is the savings that you can/should expect for your dataset? 1.5x? 5x? 50x? Knowing this is a prerequisite to answering the question, and the answer will be dependent both on the vendor’s features and your own data set characteristics.
As we discussed in the first session, if your application/user base have some sort of minimum performance expectations, particularly around latency, then HDDs may simply not be able to provide you the performance you need. You DID mention throughput (IOPS?) explicitly and with IOPS, OPS, or data rates (MB/s), you can always match flash data rates with HDDs – it just might take a LOT more HDD drives than flash devices. Latency/response time is different though – depending on whether you are drive bound and what your I/O characteristics look like (read vs write, random vs sequential), you may simply be unable to ever hit your latency targets with HDD.
The world, it is a-changing. Six years ago it was easy to say “SSD for performance sensitive niche applications!” and smile. Today, prices continue to drop, vendors are making new decisions around the use of consumer grade vs enterprise grade flash, and overall flash/SSD is moving much more mainstream. And … consider the new 16TB (yes 16 TERABYTE) SSD drives announced by Samsung. My personal view (and I’m explicitly disclaiming that I’m speaking on my own behalf, and not NetApp’s – which honestly, you should assume for all my answers) is that these are going to change the landscape almost as dramatically as SSD itself has.
There are definitely vendors that believe in the cost benefits of HDD. We chose not to mention specific vendors in the webcast, but consider Backblaze. In their blog, they are extremely open about how they have configured their data center – and they are an (all?) HDD shop. In fact, “by the end of 2015, the Backblaze datacenter had 56,224 spinning hard drives containing customer data.” Speaking of Backblaze, you might be interested in their assessment of the 16TB drive, for their shop.

You might also be interested in slide 21 of the following, which includes some price/performance numbers from EMC and Oracle.

Q5: Does NVMe drive technology move things to a higher level?

Ken: If you truly mean NAND-based flash accessed via NVMe instead of SAS/SATA, yes. Look at the perf results linked out of question 3. If you mean the use of next-generation non-volatile memory (NVM) instead of NAND-based flash, then yes. The following chart is contained in a lot of SNIA presentations; it does a good job of pointing out just how much faster we can get.

I also strongly recommend a look through of Advances in Non-Volatile Storage Technologies by Tom Coughlin from Coughlin Associates. If you care about these topics, the SNIA Storage Developer Conference is a great opportunity to learn more.

Q6: Why NAND gates and not AND gates?

Mark: NAND and NOR gates are known as “universal gates”–they can be combined in various groups and combinations to do any basic operations, i.e., AND, NOT, OR, etc. So, flash manufacturers had to choose between NAND and NOR. And just like with any technology, the price drove the choice. NAND gates are simply cheaper and slower. NORs are faster and more expensive. Actually, there are some NOR products in the market.

Q7: Mark accidentally said 15K was 15,000/sec when it’s 15,000/minute.

Ken: Thanks! (Shame on you Mark!)

Mark: Thank you… I can’t believe that I misspoke! I never do! Never! Ahh!!!

Mark’s Lawyer: On behalf of my client, I move to remove this question and the digital recording from Exhibit A to Exhibit B (aka “never again section”)

Q8: Do you guys have any data about how expensive an erase-modify-write operation is, compared with spinning disks in terms of performance?

Ken: This is what we were attempting to demonstrate in the first set of slides. The PTS (see question 3) forces flash devices into a steady state mode where they are continuously doing program-erase cycles. So the results shown there demonstrate the difference between HDD writes (seek, spin, write) and flash writes (erase and program).

Your question made me wonder though … so I also did a quick literature search. Interesting to see how rates have changed over time, and how they vary by device:

From M-Systems, in 2002: Erase cycle was 3ms

From Micron, in 2006: The erase time for a 128KB erase block was 500 µs

From AnandTech, in 2012: Erase time for SLC was 1.5-2ms, MLC was 3ms and TLC was ~4.5ms (huh? SLC vs MLC vs TLC?)

Q9: Why can’t the pointer be at the page level instead of a block level (say, metadata within a block)? I’m sure that there is a reason. What do we gain by treating an entire block as a monolithic?

Mark: This is an excellent question to ask Google. I think the reason for selecting NAND gate technology, for bundling a bunch of NAND gates into groups, and for creating blocks (in essence, super groups) is power. It takes less power to operate the drives with NAND gates and blocks.

Q10: I heard someone mention NOR gates, instead of NAND, are NOR gates persistent, over a power cycle?

Ken: Yes.

Mark: There are plenty of other “logic gates”; see this article on Wikipedia for more information.

Q11: So, there is no advantage in keeping IO sequentially in an SSD?

Ken: Technically, or practically? Technically speaking, I think it does matter. Micron documented this in 2006, noting that “Random access time on NOR Flash is specified at 0.075μs; on NAND Flash, random access time for the first byte only is significantly slower—25μs (see Table 2 on page 5). However, after initial access has been made, the remaining 2111 bytes are shifted out of NAND at a mere 0.025μs per byte.” The raw numbers have changed over the years, but I don’t believe the principle has. Violin Memory stated in 2013 that, “The idea of sequential I/O doesn’t exist with flash memory, because there is no physical concept of blocks being adjacent or contiguous. Logically, two blocks may have consecutive block addresses, but this has no bearing on where the actual information is electronically stored. You might therefore say that all flash I/O is random, but in truth the principles of random I/O versus sequential I/O are disk concepts so they don’t really apply.”

Practically speaking, I agree. Sequential vs random I/O is irrelevant for flash. Given (a) average I/O sizes for workloads and (b) the incredible performance of flash devices compared to the needs of the vast majority of people using them, it doesn’t much matter if you can access subsequent bytes in a NAND-based flash device faster than you can access the first bytes. They are plenty fast enough.

Note that it is hard to find public info on this. Sequential I/O tends to use larger I/O sizes, and random I/O uses smaller I/O sizes. So finding apples-to-apples comparisons between sequential and random I/O is difficult.

Mark: Yes, the flash drive doesn’t care anymore. But the hosts and application still do. Where it matters is in the workloads. Ken and I are still planning to dedicate an entire hour talking about workloads, and Random vs. Sequential will surely be a large part of it. However, we will admit that in the future, when all storage will be flash (which is, of course, a pipe dream) it won’t matter anymore.
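The Micron figures quoted in Ken's answer can be turned into a quick back-of-the-envelope calculation. The Python sketch below uses only the numbers from that 2006 quote (25μs for the first byte, 0.025μs for each of the remaining 2111 bytes); the function name is made up, and, as noted above, current devices will have different raw numbers.

```python
FIRST_BYTE_US = 25.0      # NAND random access time for the first byte
NEXT_BYTE_US = 0.025      # time to shift out each remaining byte
REMAINING_BYTES = 2111    # bytes after the first one in the quoted example


def page_read_time_us(first=FIRST_BYTE_US, per_byte=NEXT_BYTE_US,
                      remaining=REMAINING_BYTES):
    """Time to read the first byte plus all remaining bytes, in microseconds."""
    return first + remaining * per_byte


total = page_read_time_us()
print(f"whole read: {total:.3f} us")
print(f"average per byte: {total / (REMAINING_BYTES + 1):.4f} us")
print(f"share spent on the first byte: {FIRST_BYTE_US / total:.0%}")
```

With these numbers the initial access is roughly a third of the total, which is why reading the bytes that follow the first one is comparatively cheap at the device level.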

Q12: What is the acceptance level to Erasure Coding, and hence the change in the way Storage Performance testing will change?

Mark: As we said during the webcast, RAID is a special case of Erasure Coding. Therefore its acceptance rate is 100% :) But on a more serious note, Erasure Coding is necessary for any scale-out system, and every vendor uses their own N+M rules.

Q13: Is RAID-1 always half the write performance? If the writes go to both drives simultaneously, I could see write performance being less than 100% of what one drive can do, but not half.

Ken: This was asked in a dry run as well. You’ve hit on something that seems to be a sticking point for multiple people. Perhaps consider it this way. It looks mathy and complicated, but bear with me …

Consider two physical drives. Call them P1 and P2.

Let the write performance (in iops) of P1 be P1_w.

Let the write performance (in iops) of P2 be P2_w.

How fast can P1 write? P1_w.

How fast can P2 write? P2_w.

If you can write to both P1 and P2 at the same time, independently, and completely in parallel, how fast can you write in aggregate? P1_w + P2_w.

For the previous question, what if P1_w = P2_w?

Then P1_w + P2_w = P1_w + P1_w = (2)*P1_w.

Now …

Consider a RAID-1 pair comprised of the same P1 and P2. Call it R1.

Writes can be sent (in a good implementation) to both P1 and P2 at the same time.

But, before a write is considered complete, it must be acknowledged by BOTH P1 and P2.

If P1_w > P2_w, what is the best performance of R1? P2_w. P2 is slower, so we’ll always be waiting on it (assuming performance is consistent), so the best we can do is P2_w.

Same logic if P1_w < P2_w.

What if P1_w = P2_w? What is the best performance of R1? Same logic … but since they are the same speed, it is simply P1_w.

So …

In the non-RAID-1 case, our performance (assuming P1_w = P2_w) was 2 * P1_w.

In the RAID-1 case, our performance (assuming P1_w = P2_w) is P1_w.

50% reduction.

RAID-1 only achieves ½ of what the physical pair could.

Mark: What Ken said.
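Ken's argument boils down to a sum versus a minimum, which a short Python sketch makes explicit. The per-drive IOPS figure is a made-up number for illustration, and the function names are hypothetical.

```python
def independent_write_iops(p1_w, p2_w):
    """Two drives written independently and completely in parallel."""
    return p1_w + p2_w


def raid1_write_iops(p1_w, p2_w):
    """A mirrored write completes only when BOTH drives have acknowledged it."""
    return min(p1_w, p2_w)


p1_w = p2_w = 200   # hypothetical per-drive write IOPS
aggregate = independent_write_iops(p1_w, p2_w)
mirrored = raid1_write_iops(p1_w, p2_w)
print(aggregate, mirrored, mirrored / aggregate)
```

With equal drives the ratio is 0.5, the 50% reduction described above; with unequal drives the slower one sets the pace.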

Q14: Is there any kind of “asynch” RAID1 so that I can keep the performance of the disks but keep the mirroring?

Ken: See the previous answer also.

For reads, certainly. For writes, not that I know of, although you can make it much less visible. For example, if you have a caching RAID controller/system, your writes will go to memory and then go to disk whenever the controller/system decides to flush it. Perhaps it is big enough that it turns random I/O into sequential I/O (and you’re on HDDs), and the perf improvement from doing sequential instead of random I/O is enough that you don’t notice the effect of RAID itself.

Mark: I think that in reality, the behavior of a particular implementation is always vendor-dependent. Generally speaking, RAID1 does allow reading from both drives, but budgets or software bugs or just plain ignorance could result in an implementation where that is not true. Consult vendor documentation to know for sure.

Q15: Why do you need to read old parity to recalculate and write a new one? Isn’t the parity only calculated based on the data being written?

Ken: See answer to question #14.

Mark: It is a math trick… reading the parity saves reading the rest of the blocks on the full stripe. With 3 drives the savings are non-obvious, but with 5 or 14 they are significant.
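The math trick is XOR: the new parity equals the old parity XOR the old data XOR the new data. Here is a minimal Python sketch with made-up 4-byte blocks on a hypothetical stripe of four data drives plus parity; the helper name is invented for illustration.

```python
from functools import reduce


def xor_blocks(*blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical stripe: four data blocks and their parity.
data = [b"\x11\x22\x33\x44", b"\xaa\xbb\xcc\xdd",
        b"\x01\x02\x03\x04", b"\xf0\x0f\xf0\x0f"]
old_parity = xor_blocks(*data)

new_block = b"\xde\xad\xbe\xef"   # overwrite the block on the second drive

# Shortcut: read the old data and the old parity (two reads, whatever the stripe width).
shortcut_parity = xor_blocks(old_parity, data[1], new_block)

# Long way: re-read every other data block on the stripe and recompute.
data[1] = new_block
full_parity = xor_blocks(*data)

assert shortcut_parity == full_parity
print(shortcut_parity.hex())
```

The shortcut always costs two reads, while the long way costs one read per remaining data drive, so the gap widens as the stripe gets wider.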

Q16: This calculation is correct for 3 disks, right? If there are more disks and partial write is for stripe on single drive then you need to read more to calculate parity

Ken: No. There are some great write-ups about how RAID-5 works. Instead of pasting those here, I strongly encourage you to visit http://rickardnobel.se/how-raid5-works/ AND http://rickardnobel.se/raid-5-write-penalty/ and then tweet Mark (@markrogov) or Ken (@kencantrelljr) with questions/follow-up.

(I have no connection to Rickard … I just think he’s done a great job in his write-up.)

Mark: Yes, Rickard’s write up is spot on. Our goal is to introduce a fairly complex subject in a deceptively simple manner. There are many edge cases that we don’t address: partial write to a sector, partial write to a block, partial write to a stripe… all those have their own consequences, and storage vendors deal with those differently.

Q17: I am also interested in Data Recovery on NAND technology

Ken: Me too. It isn’t a topic we’re planning to cover though.

Q18: Does caching write data help when one uses SSD?

Ken: It can. Memory is still faster than flash. It depends entirely on how the memory is used. For example, with writes, if memory were used as a write-through cache (look it up if you need), it wouldn’t make things faster. If it were used as a write-back cache, it would. If it is used as a read cache, it will almost certainly make reads of data faster. But even there, life is never simple. Why? Because if you’re using memory to cache data, you’re not using it for something else … and it is possible that the memory could be better used for caching metadata, for example.

Mark: Here, I’d like to recall our good friend, Dr. J Metz, who created an excellent presentation comparing computer caches to pizza delivery in “Life of a Storage Packet (Walk).” In his example, caching will keep the pizza warmer, even if a flash drive is used.
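The write-through versus write-back distinction in Ken's answer can be shown with a tiny latency model in Python. Both latency figures are invented placeholders, not measurements, and the model ignores everything except when the write is acknowledged.

```python
MEMORY_WRITE_US = 1     # hypothetical latency of a write into memory
FLASH_WRITE_US = 100    # hypothetical latency of a write to the SSD


def write_through_latency_us():
    """Acknowledge only after the data is in memory AND on the SSD."""
    return MEMORY_WRITE_US + FLASH_WRITE_US


def write_back_latency_us():
    """Acknowledge once the data is in memory; flush to the SSD later."""
    return MEMORY_WRITE_US


print(write_through_latency_us())   # no faster than writing to the SSD directly
print(write_back_latency_us())      # the client sees memory speed
```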

Q19: If the customer is interested in throughput in MB/s then they probably won’t do IOs with 4KB size…

Ken: Agreed. I’m fairly certain that you’re referring to adding MB/s numbers on slide 41. We had a discussion about doing that when putting the slides together. The transition between slide 40 and 42 changed the I/O size from 4KiB to 128KiB, changed from writes to reads, and changed from random I/O to sequential I/O. Adding the MB/s numbers to slide 40/41 was meant to ease the transition between slide 40 and 42. You’re absolutely right though … rarely does anyone want to talk data rates (MB/s) when using small I/O sizes.

Mark: Agreed. Although a true performance guru would recognize that these are the two sides of the same coin.
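The "two sides of the same coin" point is just a multiplication: data rate equals I/O rate times I/O size. A short Python sketch follows, using a made-up I/O rate (not a number from the slides) and the two I/O sizes mentioned above.

```python
KIB = 1024


def data_rate_mb_per_s(iops, io_size_bytes):
    """Convert an I/O rate to a data rate in decimal megabytes per second."""
    return iops * io_size_bytes / 1_000_000


def iops_for_data_rate(mb_per_s, io_size_bytes):
    """Convert a data rate in decimal MB/s back to an I/O rate."""
    return mb_per_s * 1_000_000 / io_size_bytes


iops = 10_000   # hypothetical I/O rate
print(data_rate_mb_per_s(iops, 4 * KIB))     # small I/Os: a modest data rate
print(data_rate_mb_per_s(iops, 128 * KIB))   # same IOPS at 128KiB: 32x the data rate
```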

Ethernet RDMA Protocols Support for NVMe over Fabrics – Your Questions Answered

March 21, 2016 David Fair

Our recent SNIA Ethernet Storage Forum Webcast on How Ethernet RDMA Protocols iWARP and RoCE Support NVMe over Fabrics generated a lot of great questions. We didn’t have time to get to all of them during the live event, so as promised here are the answers. If you have additional questions, please comment on this blog and we’ll get back to you as soon as we can.

Q. Are there still actual (memory based) submission and completion queues, or are they just facades in front of the capsule transport?

A. On the host side, they’re “facades” as you call them. When running NVMe/F, host reads and writes do not actually use NVMe submission and completion queues. That data just goes to and comes from RNIC RDMA queues. On the target side, there could be real NVMe submission and completion queues in play. But the more accurate answer is that it is “implementation dependent.”

Q. Who places the command from NVMe queue to host RDMA queue from software standpoint?

A. This is managed by the kernel host software in code written to the NVMe/F specification. The idea is that any existing application that thinks it is writing to the existing NVMe host software will in fact cause the SQE to be encapsulated and placed in an RDMA send queue.

Q. You say “most enterprise switches” support NVMe/F over RDMA, I guess those are ‘new’ ones, so what is the exact question to ask a vendor about support in an older switch?

A. For iWARP, any switch that can handle Internet traffic will do. Mellanox and Intel have different answers for RoCE / RoCEv2. Mellanox says that for RoCE, it is recommended, but not required, that the switch support Priority Flow Control (PFC). Most new enterprise switches support PFC, but you should check with your switch vendor to be sure. Intel believes RoCE was architected around DCB. The name itself, RoCE, stands for “RDMA over Converged Ethernet,” i.e., Ethernet with DCB. Intel believes RoCE in general will require PFC (or some future standard that delivers equivalent capabilities) for efficient RDMA over Ethernet.

Q. Can you comment on when one should use RoCEv2 vs. iWARP?

A. We gave a high-level overview of some of the deployment considerations on slide 30. We refer you to some of the vendor links on slide 32 for “non-vendor neutral” perspectives.

Q. If you take RDMA out of equation, what is the key advantage of NVMe/F over other protocols? Is it that they are transparent to any application?

A. NVMe/F allows the application to bypass the SCSI stack and uses native NVMe commands across a network. Most other block storage protocols require using the SCSI protocol layer, translating the NVMe commands into SCSI commands. With NVMe/F you also gain parallelism, simplicity of the command set, a separation between administrative sessions and data sessions, and a reduction of latency and processing required for NVMe I/O operations.

Q. Is ROCE v1 compatible with ROCE v2?

A. Yes. Adapters speaking RoCEv2 can also maintain RDMA connections with adapters speaking RoCEv1 because RoCEv2 ports are backwards interoperable with RoCEv1. Most of the currently shipping NICs supporting RoCE support both RoCEv1 and RoCEv2.

Q. Are RoCE and iWARP the only way to use Ethernet as a fabric for NVMe/F?

A. Initially yes; only iWARP and RoCE are supported for NVMe over Ethernet. But the NVM Express Working Group is also targeting FCoE. We should have probably been clearer about that, though it is noted on slide 11.

Q. What about doing NVMe over Fibre Channel? Is anyone looking at, or doing this?

A. Yes. This is not in scope for the first spec release, but the NVMe WG is collaborating with the FCIA on this. So NVMe over Fibre Channel is expected as another standard in the near future, to be promoted by T11.

Q. Do RoCE and iWARP both use just IP addresses for management or is there a higher level addressing mechanism, and management?

A. RoCEv2 uses the RoCE Connection Manager, and iWARP uses TCP connection management. They both use IP for addressing.

Q. Are there other fabrics to run NVMe over fabrics? Can you do this over OmniPath or Infiniband?

A. InfiniBand is in scope for the first spec release. Also, there is a related effort by the FCIA to support NVMe over Fibre Channel in a standard that will be promoted by T11.

Q. You indicated NVMe stack is in kernel while RDMA is a user level verb. How are NVMe SQ/ CQ entries transferred from NVMe to RDMA and vice versa? Also, could smaller transfers in NVMe (e.g. SGL of 512B) combined to larger sizes before being sent to RDMA entries and vice versa?

A. NVMe/F supports multiple scatter-gather entries to combine multiple noncontiguous transfers. However, the protocol doesn’t support chaining multiple NVMe commands in the same command capsule; a command capsule contains only a single NVMe command. Please also refer to slide 18 from the presentation.
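
As a rough mental model of that answer, here is a small Python sketch. It is not the wire format from the specification: the class names, fields, and addresses are hypothetical and chosen only to show “one command per capsule, many scatter-gather entries per command.”

```python
from dataclasses import dataclass, field
from typing import List

# Opcodes from the NVMe NVM command set.
NVME_WRITE = 0x01
NVME_READ = 0x02


@dataclass
class SglEntry:
    """One scatter-gather entry: a contiguous buffer (hypothetical layout)."""
    address: int
    length: int


@dataclass
class NvmeCommand:
    """One NVMe command with any number of scatter-gather entries."""
    opcode: int
    sgl: List[SglEntry] = field(default_factory=list)

    def transfer_length(self) -> int:
        # Noncontiguous buffers are combined into a single transfer.
        return sum(entry.length for entry in self.sgl)


@dataclass
class CommandCapsule:
    """A capsule carries exactly one command; commands are never chained."""
    command: NvmeCommand


def build_capsules(commands: List[NvmeCommand]) -> List[CommandCapsule]:
    # N commands always produce N capsules.
    return [CommandCapsule(command=c) for c in commands]


write = NvmeCommand(NVME_WRITE, [SglEntry(0x1000, 512),
                                 SglEntry(0x8000, 512),
                                 SglEntry(0x20000, 1024)])
read = NvmeCommand(NVME_READ, [SglEntry(0x40000, 4096)])
capsules = build_capsules([write, read])
print(len(capsules), capsules[0].command.transfer_length())
```

In this sketch, the three small buffers of the write are combined through the scatter-gather list inside one command, while the write and the read still travel in two separate capsules.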

Q. 1) How do implementers and adopters today test NVMe deployments? 2) Besides latency, what other key performance indicators do implementers and adopters look for to determine whether the NVMe deployment is performing well or not?

A. 1) Like any other datacenter specification, testing is done by debugging, interop testing and plugfests. Local NVMe is well supported and can be tested by anyone. NVMe/F can be tested using pre-standard drivers or solutions from various vendors. UNH-IOL is an organization with an excellent reputation for helping here. 2) Latency, yes. But also sustained bandwidth, IOPS, and CPU utilization, i.e., the “usual suspects.”

Q. If RoCE CM supports ECN, why can’t it be used to implement a full solution without requiring PFC?

A. Explicit Congestion Notification (ECN) is an extension to TCP/IP defined by the IETF. First point is that it is a standard for congestion notification, not congestion management. Second point is that it operates at L3/L4. It does nothing to help make the L2 subnet “lossless.” Intel and Mellanox agree that generally speaking, all RDMA protocols perform better in a “lossless,” engineered fabric utilizing PFC (or some future standard that delivers equivalent capabilities). Mellanox believes PFC is recommended but not strictly required for RoCE, so RoCE can be deployed with PFC, ECN, or both. In contrast, Intel believes that for RoCE / RoCEv2 to deliver the “lossless” performance users expect from an RDMA fabric, PFC is in general required.
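
To make the “notification at L3” point concrete: per RFC 3168, ECN lives in the two low-order bits of the IP header’s TOS / Traffic Class byte. The sketch below only decodes and sets those bits; the function names are made up for illustration, and nothing here touches L2 or makes a subnet lossless.

```python
# ECN codepoints from RFC 3168 (two low-order bits of the TOS byte).
ECN_MASK = 0b11
ECN_CODEPOINTS = {
    0b00: "Not-ECT",  # sender is not ECN-capable
    0b10: "ECT(0)",   # ECN-capable transport
    0b01: "ECT(1)",   # ECN-capable transport
    0b11: "CE",       # congestion experienced
}


def ecn_codepoint(tos_byte: int) -> str:
    """Decode the ECN field of an IPv4 TOS / IPv6 Traffic Class byte."""
    return ECN_CODEPOINTS[tos_byte & ECN_MASK]


def mark_congestion(tos_byte: int) -> int:
    """What a congested router does: set CE, but only on ECN-capable packets."""
    if tos_byte & ECN_MASK == 0b00:
        # Not ECN-capable: the router cannot mark this packet.
        return tos_byte
    return tos_byte | ECN_MASK


packet_tos = 0x02  # DSCP 0, ECT(0)
print(ecn_codepoint(packet_tos), "->", ecn_codepoint(mark_congestion(packet_tos)))
```

The mark only tells the endpoints that congestion occurred; reacting to it is left to the transport, which is the notification versus management distinction made above.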

Q. How involved are Ethernet RDMA efforts with the SDN/OCP community? Is there a coming example of RoCE or iWarp on an SDN switch?

A. Good question, but neither RoCEv2 nor iWARP looks any different to switch hardware than any other Ethernet packets. So they’d both work with any SDN switch. On the other hand, it should be possible to use SDN to provide special treatment with respect to, say, congestion management for RDMA packets. Regarding the Open Compute Project (OCP), there are various Ethernet NICs and switches available in OCP form factors.

Q. Is there a RoCE v3?

A. No. There is no RoCEv3.

Q. iWARP and RoCE both fall back to TCP/IP in the lowest communication sense? So they are somewhat compatible?

A. They can speak sockets to each other. In that sense they are compatible. However, for the usage model we’re considering here, NVMe/F, RDMA is required. Because of L3/L4 differences, RoCE and iWARP RNICs cannot speak RDMA to each other.

Q. So in case of RDMA (ROCE or iWARP), the NVMe controller’s fabric port is Ethernet?

A. Correct. But it must be RDMA-enabled Ethernet.

Q. What if I am using soft RoCE, do I still need an RNIC?

A. Functionally, soft RoCE or soft iWARP should work on a regular NIC. Whether the performance is sufficient to keep up with NVMe SSDs without the hardware offloads is a different matter.

Q. How would the NVMe controller know that a command is placed in the submission queue by the Fabric host driver? Is the fabric host driver responsible for notifying the NVMe controller through remote doorbell trigger or the Fabric target driver should trigger the doorbell?

A. No separate notification by the host is required. The fabric’s host driver simply sends a command capsule to notify its companion subsystem driver that there is a new command to be processed. The way that the subsystem side notifies the backend NVMe drive is out of the scope of the protocol.

Q. I am chair of the ETSI NFV working group on NFV acceleration. We are working on virtual RDMA and how VMs can benefit from hardware-independent RDMA. One cornerstone of this is a virtual-RDMA pseudo device. But there is not yet consensus on the minimal set of verbs to be supported. Do you think this minimal verb set can be identified? Last, the transport address space is not consistent between IB and Ethernet. How can transport-independent RDMA be supported?

A. You know, the NVM Express Working Group is working on exactly these questions. They have to define a “minimal verb set” since NVMe/F generates the verbs. Similarly, I’d suggest looking to the spec to see how they resolve the transport address space differences.

Q. What’s the plan for Linux submission of NVMe over Fabric changes? What releases are being targeted?

A. The Linux Driver WG in the NVMe WG expects to submit code upstream within a quarter of the spec being finalized. At this time it looks like the most likely Linux target will be kernel 4.6, but it could end up being kernel 4.7.

Q. Are NVMe SQ/CQ transferred transparently to RDMA Queues or can they be modified?

A. The method defined in the NVMe/F specification entails a transparent transfer. If you wanted to modify an SQE or CQE, do so before initiating an NVMe/F operation.

Q. How common are rNICs for recent servers? i.e. What’s a quick check I can perform to find out if my NIC is an rNIC?

A. rNICs are offered by nearly all major server vendors. The best way to check is to ask your server or NIC vendor if your NIC supports iWARP or RoCE.
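
On a Linux host, one additional quick check is possible, assuming the NIC’s RDMA driver is already loaded: devices registered with the kernel RDMA subsystem are listed under /sys/class/infiniband, and despite the directory name this includes RoCE and iWARP adapters. A minimal sketch (the function name is ours; an empty result does not prove the hardware lacks RDMA support, only that no RDMA driver has registered a device):

```python
import os


def list_rdma_devices(sysfs_root: str = "/sys/class/infiniband") -> list:
    """Return the RDMA device names the Linux kernel has registered.

    Returns an empty list if the directory does not exist, for example
    because no RDMA driver is loaded or the host is not running Linux.
    """
    try:
        return sorted(os.listdir(sysfs_root))
    except (FileNotFoundError, NotADirectoryError):
        return []


devices = list_rdma_devices()
if devices:
    print("RDMA-capable devices:", ", ".join(devices))
else:
    print("No RDMA devices registered; check with your NIC vendor.")
```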

Q. This is most likely out of the scope of this talk but could you perhaps share about 30K level on the differences between “NVMe controller” hardware versus “NVMeF” hardware. It’s most likely a combination of R-NIC+NVMe controller, but would be great to get your take on this.

A. A goal of the NVMe/F spec is that it work with all existing NVMe controllers and all existing RoCE and iWARP RNICs. So even at a very low level, we can say “no difference.” That said, of course, nothing stops someone from combining NVMe controller and rNIC hardware into one solution.

Q. Are there any example Linux targets in the distros that exercise RDMA verbs? An iWARP or iSER target in a distro?

A. iSER allows this using a LIO or TGT SCSI target.

Q. Is there a standard or IP for RDMA NIC?

A. The various RNICs are based on IBTA, IETF, and IEEE standards, which are shown on slide 26.

Q. What is the typical additional latency introduced comparing NVMe over Fabric vs. local NVMe?

A. In the 2014 IDF demo, the prototype NVMe/F stack matched the bandwidth of local NVMe with a latency penalty of only 8µs over a local iWARP connection. Other demonstrations have shown an added fabric latency of 3µs to 15µs. The goal for the final spec is under 10µs.
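
To put those figures in perspective, the sketch below compares each quoted added latency against the under-10µs goal and against a local NVMe latency. The 90µs local latency is purely an assumed placeholder for illustration; substitute a measured value for your own device.

```python
# Added fabric latencies quoted in the answer above, in microseconds.
ADDED_FABRIC_LATENCY_US = {
    "2014 IDF demo (iWARP)": 8,
    "other demos, low end": 3,
    "other demos, high end": 15,
}
SPEC_GOAL_US = 10  # goal for the final spec: under 10 microseconds

# Hypothetical local NVMe latency, for illustration only.
ASSUMED_LOCAL_LATENCY_US = 90


def remote_latency_us(local_us: float, added_us: float) -> float:
    """Total latency seen over the fabric."""
    return local_us + added_us


def overhead_percent(local_us: float, added_us: float) -> float:
    """Added fabric latency as a percentage of the local latency."""
    return 100 * added_us / local_us


def meets_goal(added_us: float, goal_us: float = SPEC_GOAL_US) -> bool:
    """True if the added latency is under the stated goal."""
    return added_us < goal_us


for name, added in ADDED_FABRIC_LATENCY_US.items():
    total = remote_latency_us(ASSUMED_LOCAL_LATENCY_US, added)
    pct = overhead_percent(ASSUMED_LOCAL_LATENCY_US, added)
    print(f"{name}: {total}us total, +{pct:.1f}%, meets goal: {meets_goal(added)}")
```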

Q. How well is NVMe over RDMA supported for Windows?

A. It is not currently supported, but then the spec isn’t even finished. Contact Microsoft if you are interested in their plans.

Q. RDMA over Ethernet would not support Layer 2 switching? How do you deal with TCP overhead?

A. L2 switching is supported by both iWARP and RoCE. Both flavors of RNICs have MAC addresses, etc. iWARP deals with TCP/IP in hardware, using a TCP/IP Offload Engine (TOE). The TOE used in an iWARP RNIC is significantly constrained compared to a general-purpose TOE and therefore can operate with very high performance. See the Chelsio website for proof points. RoCE does not use TCP, so it does not need to deal with TCP overhead.

Q. Does RDMA not work with fibre channel?

A. They are totally different Transports (L4) and Networks (L3). That said, the FCIA is working with NVMe, Inc. on supporting NVMe over Fibre Channel in a standard to be promoted by T11.

Storage Performance Benchmarking Webcast Series Continues

January 22, 2016 David Fair

Attendees cannot get enough of the SNIA Ethernet Storage Forum’s Storage Performance Benchmarking Webcast series. On March 8, 2016 our experts, Mark Rogov and Ken Cantrell, will return for the third installment of our series with “Storage Performance Benchmarking: Block Components.” This session continues our effort to help anyone untrained in the storage performance arts ascend to a common base with the experts. In this Webcast, you will gain an understanding of the block components of modern storage arrays and learn storage block terminology, including:

How storage media affects block storage performance
Integrity and performance trade-offs for data protection: RAID, Erasure Coding, etc.
Terminology updates: seek time, rebuild time, garbage collection, queue depth and service time

As always, the event will be live and Mark and Ken will be on hand to answer your questions. I encourage you to register today. We hope to see you on March 8th!

How Ethernet RDMA Protocols iWARP and RoCE Support NVMe over Fabrics

January 6, 2016 David Fair

NVMe (Non-Volatile Memory Express) over Fabrics is of tremendous interest among storage vendors, flash manufacturers, and cloud and Web 2.0 customers. Because it offers efficient remote and shared access to a new generation of flash and other non-volatile memory storage, it requires fast, low latency networks, and the first version of the specification is expected to take advantage of RDMA (Remote Direct Memory Access) support in the transport protocol.

Many customers and vendors are now familiar with the advantages and concepts of NVMe over Fabrics but are not familiar with the specific protocols that support it. Join us on January 26th for this live Webcast that will explore and compare the Ethernet RDMA protocols and transports that support NVMe over Fabrics and the infrastructure needed to use them. You’ll hear:

Why NVMe Over Fabrics requires a low-latency network
How the NVMe protocol is mapped to the network transport
How RDMA-capable protocols work
Comparing available Ethernet RDMA transports: iWARP and RoCE
Infrastructure required to support RDMA over Ethernet
Congestion management methods

The event is live, so please bring your questions. We look forward to answering them.

Ethernet Roadmap for Networked Storage Q&A

July 17, 2015 David Fair

Almost 200 people attended our joint Webcast with the Ethernet Alliance: “The 2015 Ethernet Roadmap for Networked Storage.” We had a lot of great questions during the live event, but we did not have time to answer them all. As promised, we’ve compiled answers for all of the questions that came in. If you think of additional questions, please feel free to comment on this blog.

Q. What did you mean by parity of flash with HDD?

A. We were referring to the O’Reilly article in “Network Computing.” O’Reilly is predicting parity in BOTH capacity and price in 2016.

Q. When do we expect IEEE standards ratification for 25G speed?

A. 2016. You can see the exact schedule here.

Q. Do you envision the Enterprise, Cloud Providers, HPC, Financials getting rid of their 10/40GbE infrastructure and replacing that with 25/100GbE infrastructure in 2017? Will these customers deploy 100GbE/25GbE switch in the leaf layer in 2017?

A. Deployment will occur over a multi-year time span, if only because switch infrastructure is expensive to upgrade, as reflected in the Crehan Research forecast. New deployments will likely move to 25/100GbE as new switches with 100GbE downstream ports become available in 2016. Because the Cloud Service Providers are currently the most aggressive in driving new infrastructure purchases, they represent the largest early volumes for 25/100GbE. Enterprise is still in the midst of the transition from 1GbE to 10GbE.

Q. What are some of the developments on spanning-tree derivatives vs. Dijkstra-based derivatives such as OSPF, FSPF for switches?

A. Beyond the scope of this presentation on Ethernet. Ethernet is defined by the IEEE for L1 and L2 in the ISO model. Your questions are at L3 and L4, which are handled by organizations like IETF.

Q. With all the speeds possible who is working on flow control?

A. Flow control at the 802.1 level is supported in the Layer 1/2 PHY & MAC by setting upper bounds on the delay through each layer which allows higher layers to comprehend the delays & response times to pause frames. Each new speed & PHY in 802.3 is accompanied by delay constraint specifications to support this.

Q. Do you have an overlay graphic that shows the Ethernet RDMA roadmap? If so, is Ethernet storage the primary driver for that technology?

A. Beyond the scope of this presentation on Ethernet. Ethernet is defined by the IEEE for L1 and L2 in the ISO model. Your questions are at L3 and L4, which are handled by organizations like IETF and the InfiniBand Trade Association.

Q. The adoption of faster and new Ethernet always has to do with the costs of acquiring new technology. How long do you think it will take to adopt/acquire faster Ethernet in datacenters now that the development is happening much faster than the last 20 years?

A. Please see the chart on slide 7 where Crehan Research predicts how fast the technology will diffuse into deployments.

Q. What do you expect as cost comparison between Ethernet and InfiniBand going forward? Also, what work is being done to reduce latency?

A. Beyond the scope of this presentation. Latency is primarily a consequence of design methodologies and semiconductor process technology, and thus under the control of the silicon device manufacturers. Some vendors prioritize latency more than others.

Q. What’s the technical limitation as speeds go higher and higher?

A. A number of factors limit how fast speeds can go, but the main problem is that materials attenuate signals more strongly at higher frequencies.

Q. Will 1GbE used for manageability purposes disappear from public cloud? If so, what is the expected time frame?

A. This is a choice for end users. Most equipment is managed on a separate network for security concerns, but users can eliminate these management networks at any time.

Q. What are the relative market size predictions for the expanding number of standards (25G, 50G, 100G, 200G, etc.)?

A. See the Crehan Research forecast in the presentation.

Q. What is the major difference between SMF & MMF for the not so initiated?

A. SMF has a 9µm core while MMF has a 50µm core. Different lasers are used for each fiber type. At speeds above 10GbE, MMF typically reaches 100 meters, while SMF reaches from 500m to 10km.

Q. Will 25G be available through both copper and fibre connectivity?

A. Yes. IEEE 802.3 work is currently underway to specify 25Gb/s on twinax (“direct attach copper”) to 5 meters, printed circuit backplane up to ~1m, twisted pair copper to 30m, and multimode fiber to 100m. There is no technology barrier to 25G on SMF; a standards project to specify it simply has not started yet.

Q. This is interesting from a hardware viewpoint, but has nothing to do with storage yet. Are we going to get to how this relates to storage other than saying flash drives are fast and only Ethernet can keep up?

A. Beyond the scope of this presentation on Ethernet. Ethernet is defined by the IEEE for L1 and L2 in the ISO model. Your questions are directed at the higher layers. The key point of this webcast is that storage networking engineers need to pay much more attention to the Ethernet roadmap than they have historically, primarily because of NVM.

Q. How does “SFP 28” fit in this mix? Is it required for 25G?

A. SFP28 connectors and modules are required for 25GbE because they give better performance than SFP+, which only works up to 10GbE.

Q. Can you provide the quick difference between copper & optical on speed & costs?

A. Copper and optical Ethernet links are usually standardized at the same speed. The 400GbE project is not defining a copper link, but an active Direct Attached Cable (DAC) will probably support 400GbE. Cost depends on volume and many other factors and is beyond the scope of this presentation, but copper is usually a fraction of the cost of optical links.

Q. Do you think people will try to use multiple CAT 5e to get more aggregate bandwidth to the access points to avoid having to run Fibre to them?

A. IEEE is defining 2.5GBASE-T and 5GBASE-T to enable Cat5e to support faster wireless access points.

Q. When are higher speeds and PoE going to reach the point when copper based Ethernet will become a viable heat source for buildings thus helping the environment?

A. IEEE is defining 4-pair PoE to deliver at least 60W to end devices. You can find out more here.

Q. What are the use cases for 2.5Gb and 5.0Gb Base-T?

A. The leading use case for 2.5G/5GBASE-T is to provide the uplink for wireless LAN access points that support 802.11ac and future wireless technology. Wireless LAN technology has advanced to the point where >1Gb/s BW is needed upstream from the AP, and 2.5G/5G provide a higher speed uplink while preserving the user’s investment in Cat5e/Cat6 cabling.

Q. Why not have only CFP2 sockets right away with things disabled for lower speeds for all the intervening years leading to full-fledged CFP2?

A. CFP2 is defined for 100GbE, and 8 ports can be used on a 1U switch. 100GbE switches are shifting to QSFP28 so that 32 ports of 100GbE are supported in a 1U switch at low cost. CFP2 is much more expensive than QSFP28 and will not be used for lower speeds because of the high cost.
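
The density argument is simple arithmetic, sketched below using only the port counts from the answer above. Per-port cost is left out because no figures were given.

```python
PORT_SPEED_GBPS = 100  # both form factors carry 100GbE here

# Ports that fit on a 1U switch, per the answer above.
PORTS_PER_1U = {"CFP2": 8, "QSFP28": 32}


def faceplate_bandwidth_gbps(ports: int, speed_gbps: int = PORT_SPEED_GBPS) -> int:
    """Aggregate front-panel bandwidth of a 1U switch."""
    return ports * speed_gbps


for form_factor, ports in PORTS_PER_1U.items():
    total = faceplate_bandwidth_gbps(ports)
    print(f"{form_factor}: {ports} ports x {PORT_SPEED_GBPS}GbE = {total} Gb/s per 1U")
```

QSFP28 gives four times the 100GbE ports, and so four times the aggregate bandwidth, in the same 1U of rack space.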

Next Webcast: The 2015 Ethernet Roadmap for Networked Storage

May 5, 2015 David Fair

The ESF is excited to announce our next live Webcast, “The 2015 Ethernet Roadmap for Networked Storage.”

For over three decades, Ethernet has advanced through simple “powers-of-ten” speed increases, and this model has served the industry well. Ethernet is now changing in big ways, and the Ethernet Alliance has captured the latest changes in the 2015 Ethernet Roadmap.

On June 30th at 10:00 a.m. PT, an expert panel comprising Scott Kipp, President of the Ethernet Alliance; David Chalupsky, Chair of the IEEE P802.3bq/bz Task Forces and the Ethernet Alliance BASE-T Subcommittee; and myself will present the Ethernet Alliance’s 2015 Ethernet Roadmap for the networking technology that underlies most future networked storage.

SNIA has focused on protocols and usage models and more or less just takes Ethernet for granted. The biggest technology disruption in the storage space is the emergence into the mainstream of Non-Volatile Memory (NVM), FLASH in particular. NVM increasingly moves system bottlenecks from the storage subsystem to the network. Developments in NVM, most recently 3D FLASH, assure that the cost per GB will continue its aggressive decline and that demand for bandwidth will go up. NVM will become more prevalent, making the roadmap for Ethernet increasingly important to the storage networking community.

This will be a live and interactive session. I encourage you to register now and bring your questions for our experts. I hope to see you on June 30th.

SNIA ESF Leadership Welcomes Chad Hintz

March 12, 2015 David Fair

The ESF continues our busy schedule hosting informative Webcasts, writing and publishing articles and participating at industry conferences. 2015 also brings a change in ESF leadership. I’d like to welcome Chad Hintz of Cisco as our newest ESF board member. Chad has been elected as chair of our Storage over Ethernet Special Interest Group (SIG).

The Storage over Ethernet SIG is focused on a growing trend among modern data centers to deploy consolidated Ethernet networks as the primary network infrastructure for all LAN and storage traffic. Technologies such as Data Center Bridging (DCB) and Fibre Channel over Ethernet (FCoE), among others, offer organizations a robust environment to support mixed workloads with each using the most appropriate protocol (including NFS, SMB and iSCSI) over a shared Ethernet physical transport. The Storage over Ethernet SIG offers educational and thought leadership materials related to these technologies and the business value they offer to organizations of all sizes. Especially appreciated by our audience is that this information comes from SNIA and thus is vendor-neutral.

Chad brings a wealth of expertise to ESF. He is a Technical Solutions Architect focusing on designing enterprise solutions for customers around data center technologies. He holds 3 CCIEs in routing and switching, security and storage and has held certifications from Novell, VMware, and Cisco. We’re confident his expertise and passion for Ethernet Storage will be a big asset to our group.

We are looking forward to having Chad on our team to help guide the many activities we have planned this year. Other members of the 2015 board include myself as ESF chair, Alex McDonald (NetApp), and Mike Jochimsen (Emulex).

As I mentioned, the ESF is busy creating vendor-neutral educational content on Ethernet-connected storage networking technologies. I encourage you to check out some of our recent and upcoming content:

Upcoming Webcast – Visions for Ethernet Connected Drives

On-demand Webcast – Benefits of RDMA in Accelerating Ethernet Storage Connectivity

Article – Cloud File Services

Article – Weave Your Cloud with a Data Fabric

New Webcast: Visions For Ethernet Connected Drives

February 20, 2015 David Fair

Mark your calendar for March 25th as SNIA-ESF, together with the Dell’Oro Group, will be hosting a live Webcast, “Visions for Ethernet Connected Drives.” The arrival of mass-storage services, the emergence of analytics applications and the adoption of object storage by the cloud-services industry have provided an impetus for new storage hardware architectures. One such underlying hardware technology is the Ethernet connected hard drive, which is in early stages of availability.

Please join us on March 25th to hear Chris DePuy, Vice President of Dell’Oro Group, share findings from interviews with storage-related companies, including those selling hard drives, semiconductors, peripherals and systems. He will present some common themes uncovered, including:

What system-level architectural changes may be needed to support Ethernet connected drives
What capabilities may emerge as a result of the availability of these new drives
What part of the value chain spends the time and money to package working solutions

We will also present some revenue and unit statistics about the storage systems and hard drive markets and will discuss potential market scenarios that may unfold as a result of the object storage and Ethernet connected drive trends.

I’ll be hosting the event and, together with Chris, taking your questions. I hope you’ll join us.