Storage Performance Benchmarking Q&A

Our recent SNIA-ESF webcast, “Storage Performance Benchmarking: Introduction and Fundamentals” really struck a chord! It was our most highly rated and well-attended webcast to date, with more than 300 people at the live event. If you missed it, it’s now available on-demand. Thanks again to my colleagues, Ken Cantrell and Mark Rogov, who did an outstanding job explaining the basics and setting the stage for our next webcast in this series, “Storage Performance Benchmarking: Part 2” on October 21st. Mark your calendar! Our audience had many great questions and there just wasn’t enough time to answer them all, so as promised, here are answers to all the questions we received.

Q. Can you explain the difference between MiB and MB?

A. The difference lies in how the bytes are counted. A megabyte (MB) is defined in decimal (base-10), and a mebibyte (MiB) in binary (base-2). There is a similar relationship between kilobytes (KB, base-10) and kibibytes (KiB, base-2). So, if you begin with a single byte:

1 kilobyte (KB) = 1000 bytes (B)

1 kibibyte (KiB) = 1024 bytes (B)

1 megabyte (MB) = 1000 KB = 1000 * 1000 bytes = 1,000,000 bytes

1 mebibyte (MiB) = 1024 KiB = 1024 * 1024 bytes = 1,048,576 bytes
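To make the arithmetic above concrete, here is a minimal Python sketch (the helper names to_mb and to_mib are ours, purely for illustration):

```python
def to_mb(num_bytes):
    """Convert bytes to megabytes (decimal, base-10)."""
    return num_bytes / (1000 * 1000)

def to_mib(num_bytes):
    """Convert bytes to mebibytes (binary, base-2)."""
    return num_bytes / (1024 * 1024)

advertised = 500_000_000_000        # a "500 GB" drive, counted in decimal bytes
print(to_mb(advertised))            # 500000.0 MB
print(to_mib(advertised))           # ~476837.2 MiB -- same bytes, smaller-looking number
```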

The distinction is important because very few applications or vendors make use of the MiB notation and instead use the MB notation to refer to binary numbers. If vendor or tool A is using MB to refer to base-10 measurements, and vendor or tool B is using MB to refer to base-2 measurements, it may falsely appear that the two vendors or tools are performing differently (or the same), when in fact the difference (or similarity) is simply because they are labeling two different things (MB and MiB) as the same (MB).

Telecommunications and networking generally measure and report in decimal (MB). Disk drive manufacturers also tend to label storage capacity in decimal MB. Many operating systems report storage labeled as MB, but calculated as MiB. Storage performance (whether from an application’s view, or the storage vendor’s own tools) is generally measured in binary (MiB/s), but reported in decimal (MB/s).

Q. I disagree with the conflation of the terms “IOPS” and “throughput.” Throughput generally refers to overall concept of aggregate performance. IOPS measure throughput at a given request size, which most clients assume to be small blocks (~512B-4K). Many clients would say that bandwidth, measured in MB/s or GB/s, is a measure of throughput @ large block size (>~128K). Performance typically varies significantly @ different block sizes. So clients need to understand that all 3 of these concepts differ.

A. The SNIA dictionary equates throughput to IOPS, which is why we listed it as a common alternative for IOPS.  However, the problem with “throughput” is in the industry’s lack of agreement on what this term means. See the examples below, where sometimes it clearly refers to IOPS and other times to MB/s.  This is why, in the end, we recommend you simply don’t use the term at all, and use either IOPS or MB/s.

Throughput = MB/s

Techterms.com: “Throughput refers to how much data can be transferred from one location to another in a given amount of time. It is used to measure the performance of hard drives and RAM, as well as Internet and network connections.”

Merriam-Webster: “The amount of material, data, etc., that enters and goes through something (such as a machine or system)”

Throughput = IOPS

SNIA Dictionary: “Throughput:  [Computer System] the number of I/O requests satisfied per unit time.   Expressed in I/O requests/second, where a request is an application request to a storage subsystem to perform a read or write operation.”

Wikipedia:  “When used in the context of communication networks, such as Ethernet or packet radio, throughput or network throughput is the rate of successful message delivery over a communication channel. The data these messages belong to may be delivered over a physical or logical link, or it can pass through a certain network node. Throughput is usually measured in bits per second (bit/s or bps), and sometimes in data packets per second (p/s or pps) or data packets per time slot.”

TechTarget:  “Throughput is a term used in information technology that indicates how many units of information can be processed in a set amount of time.”

Throughput = Either

Dictionary.com:  “1. The rate at which a processor can work expressed in instructions per second or jobs per hour or some other unit of performance. 2. Data transfer rate.”

Q. So, did I hear it right? IOPS and Throughput are the same and thus can be used interchangeably?

A. Please see definitions above.

Q. The difference between FE/BE is the RAID level + write percentage = overhead IOPS. It differs at times due to RAID or any virtualization at the storage level; for example, if you have RAID 5, every front-end write results in four back-end I/Os. What happens if the throughput at the back end of a controller is more than at the front end?

A. The difference between Front End (FE) and Back End (BE) is generally the vantage point. One of the objectives of the presentation was to show that when comparing two results (or systems) the IOPS must be measured at the same point. BE IOPS depend, in part, on the protection overhead and on the software engine’s ability to counteract it. But instead of defining why there is a difference, we aimed at exposing that such a difference exists. The “whys” will be addressed in future webcasts.
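Purely as an illustration of why the measurement point matters (this is our own rough sketch, not material from the webcast), the common rule-of-thumb estimate of back-end IOPS from a front-end workload and a RAID write penalty looks like this:

```python
# Illustrative sketch only: back-end IOPS as commonly estimated from the
# front-end workload and a RAID write penalty. Real arrays may reduce this
# cost with caching, write coalescing, or other techniques.

def backend_iops(frontend_iops, write_fraction, write_penalty):
    reads = frontend_iops * (1 - write_fraction)   # reads pass through 1:1
    writes = frontend_iops * write_fraction        # each FE write costs more BE I/Os
    return reads + writes * write_penalty

# Example: 10,000 front-end IOPS, 30% writes, RAID 5 (write penalty of 4)
print(backend_iops(10_000, 0.30, 4))   # 19000.0 back-end IOPS
```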

Q. Also, another point that would dictate which system [to choose] is the storage capacity each has to offer. Very good compare & contrast illustration.

A. Agreed, although capacity itself has several facets: raw capacity, or the sum of the manufacturer-advertised capacities of all internal drives; usable capacity, or the amount of data that can be written to a system after all formatting and partitioning; and logical capacity, or the amount of data that external hosts can write onto the system. Logical capacity may differ from usable capacity because of data reduction services that a system could offer.

Q. Why is there a lower limit on your acceptable latency band? Can storage be too fast for your business needs?

A. We had some discussions looking at whether you show it as a cap, or a band. In many cases, customers will have a target latency cap, but allow some level of exceptions (either % of time it is exceeded, or % by which it is exceeded). That initially suggested a soft cap and a hard cap. One of us also has seen customers that highly value consistent behavior. In a service provider context, providing “too fast” performance could actually build a downstream customer expectation that will cause satisfaction issues if performance later degrades, even if it degrades to levels within the original targets.  You could certainly see this more generally as a cap though, which Ken stated verbally in the presentation. The band represents not only the cap, but also variance.

Q. What is an “OP”? An “Operation Per”? Shouldn’t this be “$/OPS” or “dollars per operation per second”?

A. This question was raised during the verbal Q&A. An “OP” in our context was any of the potential protocol-level operations. For example, in an NFSv3 context, this could be a CREATE, SETATTR, GETATTR, or LOOKUP operation. The number of these that occur in a second is what we referred to as OPS. There is certainly potential confusion here, and we struggled for consistency as we put the presentation together.

For example, sometimes we (and others) refer to operations per second as “op/s” instead of OPS, or use “Ops” or “OPs” or “ops” to mean “more than one operation” instead of “operations per second.” We chose to use OPS because it is consistent with IOPS, which we see frequently in this space, and are trying to push others towards this terminology for consistency.

Q. More IOPS in the same amount of time seems to imply that the system with more IOPS is performing each IO in less time than the other system. So how do we explain a system that can do more IOPS than another system but with a higher response time for each IO than the other system?

A. Excellent question! Two different angles on a reply follow …

First angle:

A simple I/O count doesn’t include the size of each I/O, or the work that the storage controller must perform to process it. Therefore, comparing just the IOPS does not provide a good comparison: what if system 1 were doing large I/Os (64KiB or more) and system 2 were doing small I/Os (4KiB or less)? To avoid that, most people fix the I/O size, or run controlled tests changing the I/O size in steps, and then compare the arrays.

The point that we are making with the webcast is that IOPS alone are not enough for a comparison; one must look beyond that into I/O size, response time and, above all, business requirements.

Second angle:

This is an excellent question, and there are at least two answers.

To begin, we need to explain the relationship between IOPS and response time.

We generally talk about response time (latency) in terms of seconds or milliseconds. We say things like “a response time of 5ms.” But this is actually a shortcut. It is really “seconds per I/O” or “milliseconds per I/O.” Also, remember that in “IOPS” the “p” stands for “per”, so this is “IOs per second.” So, ignoring the conversions between “seconds” and “milliseconds”, the relationship between IOPS and response time is easy…they are inverses of each other:

IOPS = I/Os per second = 1 / (seconds per I/O) = 1 / response time

So:

response time = 1 / IOPS

Implies:

a system completing I/Os with a lower response time delivers proportionally more IOPS.

Example:

response time = 10 ms per I/O = 0.010 seconds per I/O

IOPS = 1 / 0.010 seconds per I/O = 100 I/Os per second

Given this:

The first answer to your question is tied to idle time. Consider the following simple example.

System 1:

For system1, assume that the storage array service time is the only significant contributor to latency.

Client issues an I/O

Storage array services the I/O in 10ms

As soon as the client sees the response, it immediately issues a new I/O

Process continues, with all I/Os serviced by the storage array in 10ms each

How many IOPS is system 1 doing?  Measured at the client, or the storage array, it is doing 100 IOPS with an average response time of 10ms.  Remember the relationship between IOPS and response time from the intro above.

System 2:

For system2, assume that both the client and storage array contribute to latency.

Client issues an I/O

Storage array services the I/O in 5ms.

Client waits 5ms before issuing a new I/O.

Process continues, with all I/Os being serviced by the storage array in 5ms each, and the client waiting 5ms between issuing each I/O.

How many IOPS is system 2 doing? If you measure the total number of I/Os completed in a minute, it is completing the same number as system 1 and so doing 100 IOPS. But what is the latency? If you measured the latency at the storage array, it would be 5ms. If you measured at the client, it would depend on whether your measurement point was above or below the point where the 5ms delay per I/O was inserted; you would see either 5ms or 10ms.
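To make this first answer concrete, here is a tiny Python sketch of the two systems (our own construction, assuming a single outstanding I/O and latency measured at the array):

```python
def iops_and_array_latency(service_ms, client_wait_ms, duration_s=60):
    per_io_ms = service_ms + client_wait_ms        # time between successive I/Os
    ios_completed = duration_s * 1000 / per_io_ms  # I/Os completed in the window
    return ios_completed / duration_s, service_ms  # (IOPS, latency seen at the array)

print(iops_and_array_latency(10, 0))   # System 1: (100.0, 10) -> 100 IOPS, 10 ms
print(iops_and_array_latency(5, 5))    # System 2: (100.0, 5)  -> 100 IOPS, 5 ms at the array
```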

The second answer is tied to “concurrency,” an element we didn’t discuss in the webcast.

Concurrency might also be called “parallelism.”  In the examples above, the concurrency was ‘1’. There was only ‘1’ I/O active at a time.

What if we had more though?  Example:

For both systems, assume that the storage array service time is the only significant contributor to latency.

System 1 (same as above)

Client issues an I/O

Storage array services the I/O in 10ms

As soon as the client sees the response, it immediately issues a new I/O

Process continues, with all I/Os serviced by the storage array in 10ms each

How many IOPS is system 1 doing?  Measured at the client, or the storage array, it is doing 100 IOPS with an average response time of 10ms. Remember the relationship between IOPS and response time from the intro above.

System 2:  Increased concurrency

Client issues two I/Os

Storage array receives both and is able to handle both in parallel, but they take 20ms to complete

As soon as the client sees the response, it immediately issues two new I/Os

Process continues, with all I/Os serviced by the storage array in 20ms each, and the client always issuing two at a time

How many IOPS is system 2 doing? In the aggregate, measured at the client or the storage array, it is doing 100 IOPS. What is the response time? This is where the relationship between IOPS and response time introduced before can break down, depending on what you’re really asking for, and it really takes some queuing theory to explain this properly. Ignoring the different kinds of queuing models and simplifying for this discussion, at a high level you could say that the average response time is still 10ms. Some reporting tools are likely to do this. That is misleading, however, since each I/O really took 20ms. You’d only see a 20ms response if (a) you actually kept track of the real response times somewhere [which many systems do … they’ll create many different response time buckets and store counts of the number of I/Os with a given response time in a given bucket], or (b) you make use of Little’s Law. Little’s Law would at least give you a better mean response time. We didn’t discuss Little’s Law in the webcast, and I won’t go into much detail of it here (follow the Wikipedia link if you want to know more), but in short, Little’s Law can be rearranged to state that:

mean response time = (mean number of I/Os outstanding in the system) / (I/O completion rate, i.e., IOPS)

or, in our case above:

mean response time = 2 outstanding I/Os / 100 I/Os per second = 0.020 seconds = 20ms
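As a quick, hedged sketch (our own illustration, not part of the webcast), the same arithmetic in a few lines of Python:

```python
def mean_response_time_ms(outstanding_ios, iops):
    # Little's Law rearranged: W = L / lambda, converted to milliseconds
    return outstanding_ios / iops * 1000

print(mean_response_time_ms(1, 100))   # 10.0 ms -- the single-I/O examples above
print(mean_response_time_ms(2, 100))   # 20.0 ms -- the concurrency-of-2 example
```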

Q. What is valuable from SPEC’s SFS benchmark in this context?

A. Excellent question. We think very highly of SPEC SFS© 2014 for a variety of reasons, but in this context, a few stick out in particular:

1) The workloads provided by SPEC SFS 2014 are the result of multi-vendor research, actual workload traces, and efforts to target real business use cases. The VDI (virtual desktop infrastructure), SWBUILD (software build), DB (OLTP database) and VDA (video data acquisition) workloads are all distinct workloads with very different I/O patterns, I/O size mixes, and operation mixes. We’ll talk more in another session about these aspects of a workload, but the important element is that each workload is designed to help users understand performance in a specific business context.

2) To encourage folks to think in a business context, SPEC SFS 2014 primarily reports results not in IOPS or MB/s, but in the relevant “business units” at a given response time, along with a measure of overall response time, aptly referred to as ORT (Overall Response Time). For example, the SWBUILD workload reports load points as “number of concurrent software builds.” MB/s information is still available for those who want to dig in, but what is really important is not the MB/s but how many software builds, databases, virtual desktops, or video streams a system can support, and this is where the reporting focus is.

3) SPEC SFS 2014 recognizes that consistent performance is generally an important element of good performance. SPEC SFS 2014 implements a variety of checks during execution to make sure that each element doing I/O is doing roughly the same amount of I/O and that they are doing the amount of work that was requested of them and aren’t lagging behind.

More information on SPEC SFS 2014 (and SPEC in general) is available from http://www.spec.org

Q. Anything different/special about measuring and testing object storage?

A. Object Storage is a hot topic in the industry right now, but unfortunately fell outside the scope of this fundamentals presentation. We will be examining object storage – and corresponding network protocols – in a later webcast. Stay tuned!

Q. Does Queue depth not matter in response times?

A. We did not define Queue Depth as a term in this webcast because we classified it as an advanced subject. We are planning to address queue depth during a future webcast. In short, the queue depth does play an important role in the performance world, but response time is affected by it only indirectly. When the target Queue is full, the Initiator can’t send any more IOs and waits for the Queue to free up.

The time spent waiting is not response time. Response time measures the time the Target spends processing the I/O it was able to receive (place into its queue). However, some scenarios may keep an I/O inside a queue for a long time (due to the storage controller doing something else), and thus increase the response time for the waiting I/O. In the latter case, increasing the depth of the queue will increase the response time.

Keep in mind, though, that “it depends.” It depends on where you measure. For example, a client may have no visibility into what is happening at the storage array. It cares about how long the storage array takes to respond … it doesn’t know or care about the differences between the storage array’s view of its service time (how long the storage array is doing work) and its queuing delay / wait time (how long it is waiting to do work).

As far as the client is concerned, this is all response time. It just wants a response. Most “response time” metrics reported by tools, apps, and storage arrays really decompose into a set of true service times and queuing delays, although you may never get to see how those components come together without using other tools.
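A minimal, purely illustrative Python sketch of that decomposition (our own example, not from the webcast):

```python
def observed_response_ms(queue_wait_ms, service_ms):
    # The client only ever sees the sum of waiting time and actual service time.
    return queue_wait_ms + service_ms

# Same service time, different queue wait -- the client sees very different latency.
print(observed_response_ms(0.0, 5.0))    # 5.0 ms: the queue was empty
print(observed_response_ms(15.0, 5.0))   # 20.0 ms: the I/O sat in a full queue first
```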

Q. Order is important, in addition to mix ratios. For IOs, sequential vs. random (and specifics on randomness) is important. For file operations, where the protocol is stateful, the sequence of operations can strongly impact performance. Both block and file storage can have caches, which behave very differently depending on ordering and working set size. I suggest including the concept of order in a storage benchmarking intro.

A. These are excellent points, but given the amount of time we had, we had to make some hard choices about what to include and what to drop (or defer). This fell on the chopping block. Someone once told me that the best talks are those where you cry over what you’ve been forced to leave on the cutting room floor. There was a lot of good stuff we couldn’t include. I expect that we will address this idea in the planned discussion of workloads.

Q. The “S” in “OPS” and “IOPS” means seconds and isn’t a pluralization. Some people make a mistake here and use the terms “OP” and “IOP” to mean “1 operation/second” or “1 IO/second.” This mistake happens in this presentation, where “$/OP” is used. The numbers suggest what was intended is “number of dollars to get 1 operation/second.” The term “$/OP” can be misinterpreted to mean “dollars per operation,” which is a different but also relevant metric for perishable storage, like flash.

A. Good catch. We tried hard to be consistent in using “OPS” instead of “op/s” or some variant, when we meant “operations per second”.  There is another Q&A that addresses this. Technically, we slipped up in saying “$/OP” instead of “$/OPS” when we meant, as you noted, “Dollars per (operations per second)”. That said, I think you’ll find that nearly everyone makes this shortcut/mistake. That doesn’t make it right, but does mean you should watch out for it.

Q. Isn’t capacity reported in MiB, GiB, etc. and bandwidth in MB/s today?

A. Unfortunately, our impression is that capacity is also reported in decimal, at least in marketing literature. All we were trying to point out is that units of measurement must be consistent at all points. Many people mix decimal and binary numbers, creating ambiguities and errors (remember that at the TiB vs. TB level, the difference is about 10%!), for example, graphs that show I/O size in decimal and bandwidth in binary.
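A quick check of that gap in Python (illustrative only):

```python
TB = 1000 ** 4     # terabyte (decimal)
TiB = 1024 ** 4    # tebibyte (binary)
print(TiB / TB)    # ~1.0995 -- roughly the 10% gap mentioned above
```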

Q. Would you also please talk about different SAN protocols and how they differ/impact the performance?

A. There are additional webcasts under development that discuss the storage networking protocols, their performance trade-offs, and how performance can be affected by their use. Due to time constraints we needed to postpone this portion because it deserves more attention than we could provide in this session.

Q. So even though IOPS could be more, the response time could be less…so it’s important to consider both. Is that correct?

A. Absolutely! But just response time is not enough either! You need more load points, and a business objective to truly assess the value of a solution.


New Webcast: Storage Performance Benchmarking 101

There’s an art to making sense of storage performance benchmarks. That’s why ESF is hosting a live Webcast on this important topic. Please join us on July 30th for “Storage Performance Benchmarking: Introduction and Fundamentals.”   At this Webcast, you’ll gain an understanding of the complexities of benchmarking modern storage arrays and learn the terminology foundations necessary for the rest of the series as Mark Rogov, Advisory Systems Engineer at EMC, Ken Cantrell, Performance Engineering Manager at NetApp, and I discuss:

  • The different kinds of performance benchmarking engagements
  • An introduction to the variety of relevant metrics
  • How to determine the “right” metrics for your business
  • Terminology basics: IOPS, op/s, throughput, bandwidth, latency/response time

If you’re untrained in the storage performance arts, this Webcast will bring you up-to-speed on the basics. My colleagues and I hope it will be an informative and interactive hour. Please register today and bring your questions. I hope to see you there.

The Life of a Storage Packet

Keeping storage as close to the application as possible and reasonable is important, but different types of storage can make a big difference for performance, as can different types of workloads. Starting with the basics and working up to more complexity, find out how storage really works in this first installment of the Packet Walk series of “Napkin Dialogues.” Warning: You’re on your own when tipping the pizza delivery person!

Download (PDF, 1.71MB)

 

The Performance Impact of NVMe and NVMe over Fabrics – Q&A

More than 400 people have already seen our recent live ESF Webcast, “The Performance Impact of NVMe and NVMe over Fabrics.” If you missed it, it’s now available on-demand. It was a great session with a lot of questions from attendees. We did not have time to address them all – so here is a complete Q&A courtesy of our experts from Cisco, EMC and Intel. If you think of additional questions, please feel free to comment on this blog.

Q. Are you saying that just by changing the interface to NVMe for any SSD, one would greatly bump up the IOPS?

A. NVMe SSDs have higher IOPS than SAS or SATA SSDs due to several factors, including the low latency of PCIe and the efficiency of the NVMe protocol.

Q. How much of the speed of NVMe you have shown is due to the simpler NVMe protocol vs. using Flash? I.e., how would the SAS performance change when you are attaching an SSD to SAS?

A. The performance differences shown comparing NVMe to SAS to SATA were all using solid-state drives (all NAND Flash). Thus, the difference shown was due to the interface.

Q. Can you comment on the test conditions these results were obtained under and what are the main reasons NVMe outperforms the others?

A. The most important reason NVMe outperforms other interfaces is that it was architected for NVM – rather than inheriting the legacy of HDDs. NVMe is built on the foundation of a very efficient multi-queue model and a simple, hardware-automatable command set that results in very low latency and high performance. Details for the IOPS and bandwidth comparisons are shown in the footnotes of the corresponding foils. For the efficiency tests, the detailed setup information was inadvertently removed from the backup. This will be corrected.

Q. What is the IOPS difference between NVMe and SAS at the same queue depth?

A. At a queue depth of 32, for the particular devices shown with 4K random reads, NVMe delivers ~267K IOPS and SAS ~149K IOPS. SAS does not improve when the queue depth is increased to 128. NVMe performance increases to ~472K IOPS at a queue depth of 128.

Q. Why not use PCIe directly instead of the NVMe layer on PCIe?

A. PCI Express is used directly. NVM Express is the standard software interface for high-performance PCI Express storage devices. PCI Express does not define a register interface, DMA model, command set, or feature set for PCIe storage devices. NVM Express replaces the proprietary software interfaces and drivers used previously by PCIe SSDs in the market.

Q. Is the Working Group considering adding things like enclosure identification in the transport abstraction so the host/client can identify where the NVMe drives reside?

A. The NVM Express organization is developing a Management Interface specification set for release in Q1’2015 that will enable standardized enclosure management. The intent is that these features could be used regardless of fabric type (PCIe, RDMA, etc).

Q. Are there APIs in the software interface for device query information and device RAID configuration?

A. NVMe includes an Identify Controller and Identify Namespace command that provides information about the NVMe subsystem, controllers, and namespaces. It is possible to create a RAID controller that uses the NVMe interface if desired. Higher level software APIs are typically defined by the OSV.

Q. 1. Are NVMe drivers today multi-threaded? 2. If I were to buy an NVMe device today, can you suggest a list of vendors whose solutions are used today in data centers (i.e., production and not proof of concept or prototype)?

A. The NVM Express drivers are designed for multi-threading – each I/O queue may be owned/controlled by one thread without synchronization with other driver threads. A list of devices that have passed NVMe interoperability and conformance testing are on the NVMe Integrator’s List.

Q. When do you think the market will consolidate around NVMe/PCIe-based SSDs, and when will the SATA era end?

A. By 2018, IDC predicts that Enterprise SSDs by Interface will be PCIe=38%, SAS=28%, and SATA=34%. By 2018, Samsung predicts over 70% of client SSDs will be PCIe. Based on forecasts like this, we expect strong growth for NVMe as the standard PCIe SSD interface in both Enterprise and Client segments.

Q. Why can’t it be like a graphics card, which does memory transactions?

A. SSDs of today have much longer latency than memory – where a read from a typical NAND page takes > 50 microseconds. However, as next generation NVM comes to market over the next few years, there may be blurring of the lines between storage and memory, where next generation NVM may be used as very fast storage (like NVMe) or as memory as in NVDIMM type of usage models.

Q. It seems that most NVMe drive vendors supply proprietary drivers for their drives. What’s the value of NVMe over proprietary interfaces given this? Will we eventually converge on the open source driver?

A. As the NVMe ecosystem matures, we would expect most implementations would use inbox drivers that are present in many OSes, like Windows, Linux, and Solaris. However, in some Enterprise applications, a vendor may have a value added feature that could be delivered via their own software driver. OEMs and customers will decide whether to use inbox drivers or a vendor specific driver based on whether the value provided by the vendor is significant.

Q. To create an interconnect to a scale-out storage system with many NVMe drives, does that mean you would need an aggregated fabric link (with multiple RDMA links) to provide enough bandwidth for multiple NVMe drives?

A. It depends on the speed of the fabric links and the number of NVMe drives. Ideally, the target system would be configured such that the front-end fabric and back-end NVMe drives were bandwidth balanced. Scaling out multi-drive subsystems on a fabric may require the use of fat-tree switch topologies, which may be constructed using some form of link aggregation. The performance of the PCIe NVMe drives is expected to put high bandwidth demands on the front-end network interconnect. Each NVMe SFF-8639 2.5” drive has a PCIe Gen3 x4 interconnect with the capability to produce 3+ GB/s (24 Gbps) of sustained storage bandwidth. There are multiple production server systems with 4-8 NVMe SFF-8639 drive bays, which puts these platforms in the ~200 Gbps range when used as NVMe over Fabrics storage servers. The combination of PCIe NVMe drives and NVMe over Fabrics targets is going to have a significant impact on datacenter storage performance.
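The rough arithmetic behind those figures, sketched in Python (our own back-of-the-envelope calculation based on the numbers above, not a sizing tool):

```python
per_drive_gbps = 3 * 8       # ~3 GB/s per PCIe Gen3 x4 NVMe drive, i.e. ~24 Gb/s
drive_bays = 8               # a server with 8 NVMe SFF-8639 drive bays
print(per_drive_gbps * drive_bays)   # 192 Gb/s -- roughly the ~200 Gbps cited above
```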

Q. In other forums we heard about NVMe extensions to deliver vendor specific value add features. Do we have any updates?

A. Each vendor is allowed to add their own vendor specific features and value. It would be best to discuss any vendor specific features with the appropriate vendor.

Q. Given that PCIe is not a scalable fabric at least from a storage perspective, do you see the need for SAS SSDs to increase or diminish over time? Or is your view that NVMe SSDs will populate the tier between DRAM and say, rotating media like SAS HDDs?

A. NVMe SSDs are the highest performance SSDs available today. If there is a box of NVMe SSDs, the most appropriate connection to that JBOD may be Ethernet or another fabric that then fans out inside the JBOD to PCIe/NVMe SSDs.

Q. From a storage industry perspective, what deficiencies does NVMe have to displace SAS? Will that transition ever happen?

A. NVMe SSDs are seeing initial broad deployment primarily in server use cases that prize the high performance. Storage applications require a robust high-availability interface. NVMe has defined support for dual port, reservations, and other high-availability features. NVMe will be used in storage applications as these high-availability features mature in products.

Q. Will NVMe over fabrics allow to dma read/write the NVMe device directly (without going through system memory)?

A. The locality of the NVMe over Fabrics buffers on the target side is target-implementation specific. That said, one could construct a target that used a pool of PCIe NVMe subsystem controller-resident memory as the source and/or sink buffers of the fabric NIC’s NVMe data exchanges. This type of configuration has the limitation of having to pre-determine the fabric-data-to-NVMe-device locality, or else the data could end up in the wrong drive’s controller memory.

Q. Intel True Scale fabric technology was based on the Fulcrum ASIC. Could you please provide input on how Intel Omni Scale differs from the Intel True Scale fabric?

A. In the context of NVMe over Fabrics, Intel Omni-Path fabric is a possible future fabric candidate for an NVMe over Fabrics definition. Specifics on the fabric itself are outside the scope of the NVMe over Fabrics definition. For more information on Omni-Path, please refer to http://www.intel.com/content/www/us/en/omni-scale/intel-omni-scale-fabric-demo.html?wapkw=omni-scale.

Q. Can the host side NVMe client be in user mode since it is using RDMA?

A. It is possible since RDMA QP communications allow for both user and kernel mode access to the RDMA verbs. However, there are implications to consider. The NVMe host software currently resides in multiple operating systems as a kernel level block-storage driver. The goal is for NVMe over Fabrics to share common NVMe code between multiple fabric types in order to provide a consistent and sustainable core NVMe software. When NVMe over Fabrics is moved to user level, it essentially becomes a separate single-fabric software solution that must be maintained independently of the multi-fabric kernel NVMe software. There are some performance advantages of having a user-level interface, such as not having to go through the O/S system calls and the ability to poll the completions. These have to be weighed against the loss of kernel resident functionality, such as upper level kernel storage software, and the cost of sustaining the software.

Q. Which role will or could play the InfiniBand Protocol in the NVMe concept?

A. InfiniBand™ is one of the supported RDMA fabrics for NVMe over RDMA. NVMe over RDMA will support the family of RDMA fabrics through use of a common set of RDMA verbs. This will allow users to select the RDMA fabric type based on their fabric requirements and not be limited to any one RDMA fabric type for NVMe over RDMA.

Q. Where is this experimental code for NVMeOF for Driver and FIO available?

A. FIO is a common Linux storage benchmarking tool and is available from multiple Internet sites. The drivers used in the demo were developed specifically as a proof of concept and demonstration for Intel IDF 2014. They were based on a pre-standard implementation of the NVMe over RDMA wire protocol. The standard NVMe over RDMA wire protocol is currently under definition in NVM Express, Inc. Once the standard is complete, both Host and reference Target drivers for Linux will be developed.

Q. Was polling on the completion queue used on the target side in the prototype?

A. The target side POC implementation used a polling technique for both the NVMe over RDMA CQ and NVMe CQ. This was to minimize the latency by eliminating the interrupt latency on the target for both CQs. Depending on the O/S and both the RDMA and NVMe devices interrupt moderation settings, interrupt latency is typically around 2 microseconds. If polling is not the desired model, Intel processors enable another form of event signaling called Monitor/Mwait where latency is typically in the 500ns latency range.

Q. In the prototype over iWARP, did the remote device dma write/read the NVMe device directly, or did it go through remote system memory?

A. In the PoC, all NVMe commands and command data went through the remote system memory. Only the NVMe commands were accessed by the CPU, the data was not touched.

Q. Are there any dependencies between NVMEoF using RDMA and iWARP? Can standard software RDMA in Linux distributions be used without need for iWARP support?

A. As mentioned, the NVMe over RDMA will be RDMA type agnostic and work with all RDMA providers that support a common set of RDMA verbs.

Q. PCIe doesn’t support multi-host access to devices. Does NVMe over fabric require movement away from PCIe?

A. The NVMe 1.1 specification specifically added features for multi-host support – allowing NVMe subsystems to have multiple NVMe controllers and multiple fabric ports. This model is supported in PCI Express by multi-function/multi-port PCIe drives (typically referred to as dual-port). Depending on the fabric type, NVMe over Fabrics will extend to configurations with many hosts sharing a single NVMe subsystem consisting of multiple NVMe controllers.

Q. In light of the fact that NVMe over Fabrics reintroduces more of the SCSI architecture, can you compare and contrast NVMe over Fabrics with ‘SCSI Express’ (SAM/SPC/SBC/SOP/PQI)?

A. NVMe over Fabrics is not a SCSI model, it’s extending the NVMe model onto other fabric types. The goal is to maintain the simplicity of the NVMe model, such as the small amount of NVMe command types, multi-queue interface model, and efficient NVM oriented host and controller implementations. We chose the RDMA fabric as the first fabric because it too was architected with a small number of operations, multi-queue interface model, and efficient low-latency operations.

Q. Is there an open source for NVMe over Fabrics, which was used for the IDF demo? If not, can that be made available to others to see how it was done?

A. Most of the techniques used in the PoC drivers will be implemented in future open-source Host and reference Target drivers. The PoC was both a learning and demonstration vehicle. Because the PoC drivers used a pre-standard NVMe over RDMA protocol, we feel it’s best not to propagate the implementation.

Q. What is the overhead of the protocol? Did you try putting NVMe in front of just DRAM? I’d assume you’d get much better results, and understand the limitations of the protocol much better. In front of DRAM it won’t be NVM, but it will give good data regarding protocol latency.

A. The overhead of the protocol on the host-side matched the PCIe NVMe driver. On the target, the POC protocol efficiency was around 600ns of compute latency for a complete 4K I/O. For low-latency media, such as DRAM or next generation NVM, the reduced latency of a solution similar to the PoC will enable the effective use of the media’s low latency characteristics.

Q. Do you have FC performance comparison with NVMe?

A. We did implement a 100GbE/FCoE target with NVMe back-end storage for an Intel IDF 2012 demonstration. FCoE is a combination of two models, FCP and SCSI. Our experience with this target implementation showed that we were adding a significant amount of computational latency on both the host (initiator) and target FCoE/FC/SCSI storage stacks that reduced the performance and efficiency advantages gained in the back-end PCIe NVMe SSDs. A significant component of this computational latency was due to the multiple storage models and associated translations that occurred between the host application and the back-end NVMe drives. This experience led us down the path of enabling an end-to-end NVMe model by expanding the NVMe model onto a range of fabric types.