    Ethernet Networked Storage – FAQ

    December 8th, 2016

    At our SNIA Ethernet Storage Forum (ESF) webcast “Re-Introduction to Ethernet Networked Storage,” we provided a solid foundation on Ethernet networked storage, the move to higher speeds, challenges, use cases and benefits. Here are answers to the questions we received during the live event.

    Q. Within the iWARP protocol there is a layer called MPA (Marker PDU Aligned Framing for TCP) inserted for storage applications. What is the point of this protocol?

    A. MPA is an adaptation layer between the iWARP Direct Data Placement Protocol and TCP/IP. It provides framing and CRC protection for Protocol Data Units.  MPA enables packing of multiple small RDMA messages into a single Ethernet frame.  It also enables an iWARP NIC to place frames received out-of-order (instead of dropping them), which can be beneficial on best-effort networks. More detail can be found in IETF RFC 5044 and IETF RFC 5041.

    Q. What is the API for RDMA network IPC?

    A. The general API for RDMA is called verbs. The OpenFabrics Verbs Working Group oversees the development of the verbs definition and functionality in the OpenFabrics Software (OFS) code. You can find the training content from the OpenFabrics Alliance here. General information about RDMA over Converged Ethernet (RoCE) is available at the InfiniBand Trade Association website. Information about the Internet Wide Area RDMA Protocol (iWARP) can be found at the IETF: RFC 5040, RFC 5041, RFC 5042, RFC 5043, RFC 5044.
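
    For readers new to verbs, here is a minimal sketch in C of the object setup most libibverbs applications perform (device list, protection domain, completion queue, memory registration). It assumes a Linux host with libibverbs installed (compile with -libverbs); queue pair creation, connection establishment, and most error handling are omitted.

    /* Enumerate RDMA devices and set up the basic verbs objects:
     * context, protection domain, completion queue, registered memory. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num = 0;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) { fprintf(stderr, "no RDMA devices found\n"); return 1; }

        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);                     /* protection domain */
        struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0); /* completion queue  */

        void *buf = malloc(4096);
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096,              /* register memory   */
                                       IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);

        printf("using %s, lkey=0x%x rkey=0x%x\n",
               ibv_get_device_name(devs[0]), mr->lkey, mr->rkey);

        ibv_dereg_mr(mr); free(buf);
        ibv_destroy_cq(cq); ibv_dealloc_pd(pd);
        ibv_close_device(ctx); ibv_free_device_list(devs);
        return 0;
    }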

    Q. RDMA requires TCP/IP (iWARP), InfiniBand, or RoCE to operate on with respect to NVMe over Fabrics. Therefore, what are the advantages and disadvantages of iWARP vs. RoCE?

    A. Both RoCE and iWARP support RDMA over Ethernet. iWARP uses TCP/IP while RoCE uses UDP/IP. Debating which one is better is beyond the scope of this webcast, but you can learn more by watching the SNIA ESF webcast, “How Ethernet RDMA Protocols iWARP and RoCE Support NVMe over Fabrics.”
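
    As a practical aside, a host can usually tell which RDMA transport a local adapter speaks through the same verbs API. The C sketch below (again assuming libibverbs on Linux) reports iWARP devices via IBV_TRANSPORT_IWARP and treats devices that use the InfiniBand transport over an Ethernet link layer as RoCE.

    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num = 0;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs) return 1;

        for (int i = 0; i < num; i++) {
            struct ibv_context *ctx = ibv_open_device(devs[i]);
            struct ibv_port_attr port;
            ibv_query_port(ctx, 1, &port);      /* port numbers start at 1 */

            const char *kind = "InfiniBand";
            if (devs[i]->transport_type == IBV_TRANSPORT_IWARP)
                kind = "iWARP";                 /* RDMA over TCP/IP */
            else if (port.link_layer == IBV_LINK_LAYER_ETHERNET)
                kind = "RoCE";                  /* IB transport over UDP/IP/Ethernet */

            printf("%s: %s\n", ibv_get_device_name(devs[i]), kind);
            ibv_close_device(ctx);
        }
        ibv_free_device_list(devs);
        return 0;
    }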

    Q. 100Gb Ethernet Optical Data Center solution?

    A. 100Gb Ethernet optical interconnect products were first available around 2011 or 2012 in a 10x10Gb/s design (100GBASE-CR10 for copper, 100GBASE-SR10 for optical) which required thick cables and a CXP and a CFP MSA housing. These were generally used only for switch-to-switch links. Starting in late 2015, the more compact 4x25Gb/s design (using the QSFP28 form factor) became available in copper (DAC), optical cabling (AOC), and transceivers (100GBASE-SR4, 100GBASE-LR4, 100GBASE-PSM4, etc.). The optical transceivers allow 100GbE connectivity up to 100m, or 2km and 10km distances, depending on the type of transceiver and fiber used.

    Q. Where is FCoE being used today?

    A. FCoE is primarily used in blade server deployments where there could be contention for PCI slots and only one built-in NIC. These NICs typically support FCoE at 10Gb/s speeds, passing both FC and Ethernet traffic over a connection to a Top-of-Rack FCoE switch, which forwards traffic to the respective fabrics (FC and Ethernet). However, it has not gained much acceptance outside of the blade server use case.

    Q. Why did iSCSI start out mostly in lower-cost SAN markets?

    A. When it first debuted, iSCSI packets were processed by software initiators which consumed CPU cycles and showed higher latency than Fibre Channel. Achieving high performance with iSCSI required expensive NICs with iSCSI hardware acceleration, and iSCSI networks were typically limited to 100Mb/s or 1Gb/s while Fibre Channel was running at 4Gb/s. Fibre Channel is also a lossless protocol, while TCP/IP is lossy, which caused concerns for storage administrators. Now, however, iSCSI can run on 25, 40, 50 or 100Gb/s Ethernet with various types of TCP/IP acceleration or RDMA offloads available on the NICs.

    Q. What are some of the differences between iSCSI and FCoE?

    A. iSCSI runs SCSI protocol commands over TCP/IP (except iSER which is iSCSI over RDMA) while FCoE runs Fibre Channel protocol over Ethernet. iSCSI can run over layer 2 and 3 networks while FCoE is Layer 2 only. FCoE requires a lossless network, typically implemented using DCB (Data Center Bridging) Ethernet and specialized switches.

    Q. You pointed out at least twice that people incorrectly predicted the end of Fibre Channel, but it didn’t happen. What makes you say Fibre Channel is actually going to decline this time?

    A. Several things are different this time. First, Ethernet is now much faster than Fibre Channel instead of the other way around. Second, Ethernet networks now support lossless and RDMA options that were not previously available. Third, several new solutions–like big data, hyper-converged infrastructure, object storage, most scale-out storage, and most clustered file systems–do not support Fibre Channel. Fourth, none of the hyper-scale cloud implementations use Fibre Channel and most private and public cloud architects do not want a separate Fibre Channel network–they want one converged network, which is usually Ethernet.

    Q. Which storage protocols support RDMA over Ethernet?

    A. The Ethernet RDMA options for storage protocols are iSER (iSCSI Extensions for RDMA), SMB Direct, NVMe over Fabrics, and NFS over RDMA. There are also storage solutions that use proprietary protocols supporting RDMA over Ethernet.


    2017 Ethernet Roadmap for Networked Storage

    November 2nd, 2016

    When SNIA’s Ethernet Storage Forum (ESF) last looked at the Ethernet Roadmap for Networked Storage in 2015, we anticipated a world of rapid change. The list of advances in 2016 is nothing short of amazing:

    • New adapters, switches, and cables have been launched supporting 25, 50, and 100Gb Ethernet speeds including support from major server vendors and storage startups
    • Multiple vendors have added or updated support for RDMA over Ethernet
    • The growth of NVMe storage devices and release of the NVMe over Fabrics standard are driving demand for both faster speeds and lower latency in networking
    • The growth of cloud, virtualization, hyper-converged infrastructure, object storage, and containers is increasing the popularity of Ethernet as a storage fabric

    The world of Ethernet in 2017 promises more of the same. Now we revisit the topic with a look ahead at what’s in store for Ethernet in 2017. Join us on December 1, 2016 for our live webcast, “2017 Ethernet Roadmap for Networked Storage.”

    With all the incredible advances and learning vectors, SNIA ESF has assembled a great team of experts to help you keep up. Here are some of the things to keep track of in the upcoming year:

    • Learn what is driving the adoption of faster Ethernet speeds and new Ethernet storage models
    • Understand the different copper and optical cabling choices available at different speeds and distances
    • Debate how other connectivity options will compete against Ethernet for the new cloud and software-defined storage networks
    • And finally look ahead with us at what Ethernet is planning for new connectivity options and faster speeds such as 200 and 400 Gigabit Ethernet

    The momentum is strong with Ethernet, and we’re here to help you stay informed of the lightning-fast changes. Come join us to look at the future of Ethernet for storage and join this SNIA ESF webcast on December 1st. Register here.


    SNIA Storage Developer Conference-The Knowledge Continues

    October 13th, 2016

    SNIA’s 18th Storage Developer Conference is officially a success, with 124 general and breakout sessions; Cloud Interoperability, Kinetic Storage, and SMB3 plugfests; ten Birds-of-a-Feather Sessions; and amazing networking among 450+ attendees. Sessions on NVMe over Fabrics won the title of most attended, but Persistent Memory, Object Storage, and Performance were right behind. Many thanks to SDC 2016 Sponsors, who engaged attendees in exciting technology … Continue reading


    Questions Aplenty on NVMe over Fabrics

    April 12th, 2016

    Our live SNIA-ESF Webcast, “Under the Hood with NVMe over Fabrics,” generated more questions than we anticipated, proving to us that this topic is worthy of future discussions. Here are answers to both the questions we took during the live event as well as those we didn’t have time for.

    Q. So fabric is an alternative to PCIe, for those of us familiar with PCIe-attached NVMe devices, yes?

    A. Yes, fabric is the term used in the specification that represents a variety of physical interconnects and transports for NVM Express.

    Q. How are the namespaces shared in a fabric?

    A. Namespaces are NVM subsystem resources and are accessible by all controllers in the NVM subsystem. Multi-host access may be coordinated using reservations.

    Q. If there are multiple subsystems accessing the same NVMe devices over the fabric, then how is the namespace shared?

    A. The mapping of fabric NVM subsystem resources (namespaces and controllers) to PCIe NVMe device subsystems is implementation specific. They may be mapped 1 to 1 or N to 1, depending on the functionality of the NVMe bridge.

    Q. Are namespace reservations similar to SCSI reservations?

    A. Yes

    Q. Are there plans for defining bindings for Intel Omni Path fabric?

    A. Intel Omni-Path is a good candidate fabric for NVMe over Fabrics.

    Q. Is hybrid attachment allowed? Could a single namespace be attached to a fabric and PCIe (through two controllers) concurrently?

    A. At this moment, such a hybrid configuration is not permitted within the specification.

    Q. Is an NVM subsystem purpose-built, or is it commodity server hardware?

    A. This is a difficult question to answer. At the time of this writing there are not enough “off-the-shelf” commodity components to be able to construct NVMe over Fabric subsystems.

    Q. Does NVMEoF use the same NVMe PCIe controller register map?

    A. A subset of the NVMe controller register mapping was retained for fabrics but renamed to “Properties” to avoid confusion.

    Q. So does NVMe over Fabric act like an extension of the PCIe bus? Meaning that I see the same MMIO registers and queues remotely? Or is it a completely different protocol that is solely message based? Will current NVMe host drivers work on the fabric or does it really require a different driver stack?

    A. Fabrics is not an extension of PCIe; it’s an extension of NVMe. It uses the same NVMe Submission and Completion Queue model and descriptors as PCIe NVMe. Most of the original NVMe host driver stack is retained and shared between PCIe and Fabrics; the bottom side was modified to allow for multiple transports.

    Q. Does NVMe over Fabrics support immediate data for writes, or must write data always be fetched by the NVMe controller?

    A. Yes, immediate data is termed “in-capsule” and is used to send the NVMe command data with the NVMe submission entry.

    Q. As far as I know, Linux introduced a multi-queue model at the block layer recently. Is it the same thing you are mentioning?

    A. No, but NVMe uses the Linux Block-MQ layer. NVMe Multi-Queue is used between the host and the NVMe controller for both PCIe and fabric based controllers.

    Q. Are there situations where you might want to have more than one queue pair per CPU? What are they?

    A. Queue-Pairs are matched up by CPU cores, not CPUs, which allows the creation of multiple namespace entities per CPU. This, in turn, is very useful for virtualization and application separation.

    Q. What are three mandatory commands? Do they refer to read/write/sync cache?

    A. Actually, there are 13 required commands. Kevin Marks has a very good presentation from the Flash Memory Summit that provides a list of these commands within the broader NVMe context. You can download it here.  

    Q. Please talk about queue depths? Arbitrary? Limited?

    A. Queue depths are controller defined, up to a maximum of 64K entries.

    Q. Where will SQs and CQs be physically located? Are they on host memory or SSD memory?

    A. For fabrics, the SQ is located on the controller side to avoid the inefficiency of having to pull SQEs across a fabric. CQs reside on the host.

    Q. How do you create ordering guarantee when that is needed for correctness?

    A. For commands that require sequencing, there is a concept called “Fused Commands” which get sent as a single unit.

    Q. In NVMeoF how are devices discovered?

    A. NVMeoF devices are discoverable via a couple of different means, depending on whether you are using Fibre Channel (which has its own discovery and login process) or an iSCSI-like name server. Mike Shapiro goes over the discovery mechanism in considerable detail in this BrightTALK Webcast.

    Q. I guess all new drivers will be required for NVMeoF?

    A. Yes, new drivers are being written and will be required for NVMeoF.

    Q. Why can’t the doorbell+ communication model apply to PCIe? I mean, why doesn’t PCIe use doorbell+?

    A. NVMe 1.2 defines controller resident buffers that can be used for pushing SQ Entries from the host to the controller. Doorbells are still required for PCIe to inform the controller about the new SQ entries.

    Q. If there are two hosts connected to the same subsystem, will the NVMe controller have two queues, one for each host?

    A. Yes

    Q. So with your command and data description, does NVMe over Fabric require RDMA or does it have a “Data Ready” type message to tell the host when to send write data?

    A. Data transfer operations are fabric dependent. RDMA uses RDMA_READ, another transport may use some form of Data Ready model.
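
    To make the RDMA case concrete, here is a hedged C sketch of how a target-side implementation might pull write data from the host with an RDMA READ using the verbs API. The queue pair, the locally registered memory region, and the remote address and rkey (which the host would advertise, for example in the command capsule) are assumed to have been set up already; this is an illustration, not the code of any particular driver.

    /* Hypothetical helper: post an RDMA READ that pulls `len` bytes from the
     * remote buffer advertised by the peer into a locally registered buffer. */
    #include <stdint.h>
    #include <string.h>
    #include <infiniband/verbs.h>

    static int post_rdma_read(struct ibv_qp *qp, struct ibv_mr *local_mr,
                              void *local_buf, uint32_t len,
                              uint64_t remote_addr, uint32_t remote_rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)local_buf,    /* where the pulled data lands */
            .length = len,
            .lkey   = local_mr->lkey,
        };
        struct ibv_send_wr wr, *bad_wr = NULL;

        memset(&wr, 0, sizeof(wr));
        wr.wr_id      = 1;                     /* opaque id echoed in the completion */
        wr.opcode     = IBV_WR_RDMA_READ;      /* pull data from the remote memory   */
        wr.send_flags = IBV_SEND_SIGNALED;     /* generate a completion when done    */
        wr.sg_list    = &sge;
        wr.num_sge    = 1;
        wr.wr.rdma.remote_addr = remote_addr;  /* address advertised by the peer */
        wr.wr.rdma.rkey        = remote_rkey;  /* remote key advertised by the peer */

        return ibv_post_send(qp, &wr, &bad_wr);
    }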

    Q. Can you quantify the protocol translation overhead? In reality, it does not look that big from a performance perspective.

    A. Submission Queue entries are 64 bytes and Completion Queue entries are 16 bytes. These are sufficiently small for block storage traffic, which typically uses requests of 4KB or larger.
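
    Those sizes follow directly from the descriptor layouts. The C sketch below paraphrases the common NVMe entry formats (field names are ours, not copied from any driver) and simply shows that they pack to 64 and 16 bytes.

    /* Illustrative layout of the common NVMe descriptors shared by the PCIe
     * and Fabrics transports. */
    #include <stdint.h>
    #include <stdio.h>

    struct nvme_sqe {                  /* Submission Queue Entry: 64 bytes */
        uint8_t  opcode;               /* command opcode                    */
        uint8_t  flags;                /* FUSE (bits 1:0), PSDT (bits 7:6)  */
        uint16_t command_id;           /* echoed back in the completion     */
        uint32_t nsid;                 /* namespace identifier              */
        uint32_t cdw2, cdw3;           /* reserved                          */
        uint64_t metadata_ptr;
        uint64_t dptr1, dptr2;         /* PRP1/PRP2 or an SGL descriptor    */
        uint32_t cdw10, cdw11, cdw12,  /* command specific dwords 10..15    */
                 cdw13, cdw14, cdw15;
    };

    struct nvme_cqe {                  /* Completion Queue Entry: 16 bytes */
        uint32_t result;               /* command specific result           */
        uint32_t reserved;
        uint16_t sq_head;              /* controller's SQ head pointer      */
        uint16_t sq_id;                /* which SQ this entry completes     */
        uint16_t command_id;           /* matches nvme_sqe.command_id       */
        uint16_t status;               /* phase tag (bit 0) + status field  */
    };

    int main(void)
    {
        printf("SQE: %zu bytes, CQE: %zu bytes\n",
               sizeof(struct nvme_sqe), sizeof(struct nvme_cqe));  /* 64 and 16 */
        return 0;
    }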

    Q. Do Dual Port SSDs need to support two Admin Qs since they have two paths to the same host?

    A. Dual-Port or multi-path capable NVM subsystems require two NVMe controllers, each with one AdminQ and one or more I/O queues.

    Q. For a Dual Port SSD, does each port need to have its Submission Q on a different CPU core in the host? I assume the SQs for the two ports cannot be on the same CPU core.

    A. The mapping of controller queues to host CPU cores is typically per controller. If the host was connected to two controllers, there would be two queues per core. One queue to controller 1 and one queue to controller 2 per host core.

    Q. As you mentioned, the standard currently uses LBA addressing. What will happen when Intel goes to market with new media (3D XPoint), which is announced to be byte addressable?

    A. The NVMe NVM command set is block based and is independent of the type and access method of the NVM media used in a subsystem implementation.  

    Q. Is there a real benefit of this architecture in a NAS environment?

    A. There is a natural advantage to making any storage access more efficient. A network-attached system still requires block access at the lower levels, and NVMe (either local or over a Fabric) can improve NAS design and flexibility immensely. This is particularly true for pNFS and scale-out SMB paradigms.

    Q. How do you handle authentication across many servers (hosts) on the fabric? How do you decide what host can access what part of each device? Does it have to be namespace specific?

    A. The fabrics specification defines an authentication model and also defines the naming format for NVM subsystems and hosts. A target implementation can choose to provision NVM subsystems to specific hosts based on the naming format.

    Q. Does having the same structure at all layers mean that, at the transport layer of a flash appliance, we should also maintain the Submission and Completion Queue model, with those queues mapped to the physical queues of the NVMe sub-controller?

    A. The NVMe Submission Queue and Completion Queue entries are common between fabrics and PCIe NVMe. This simplifies the steps required to bridge between NVMe fabrics and NVMe PCIe. An implementation may choose to map the fabrics SQ directly to a PCIe NVMe SSD SQ to provide a very efficient, simple NVMe transport bridge.

    Q. With an RDMA based transport, how will each host discover the NVME controller(s) that it has been granted access to?

    A. Please see the answer above.

    Q. Traditionally SAS supports SAS expander for scaling purpose. How does NVMe over fabric solve the issue as there is no expander concept in NVMe world?

    A. Recall that SAS expanders compensate for SCSI’s inherent lack of scalability. NVMe perpetuates the multi-queue model (which does not exist for SCSI) natively, so SAS expander-like pieces are not required for scale-out.


    How Ethernet RDMA Protocols iWARP and RoCE Support NVMe over Fabrics

    January 6th, 2016

    NVMe (Non-Volatile Memory Express) over Fabrics is of tremendous interest among storage vendors, flash manufacturers, and cloud and Web 2.0 customers. To deliver efficient remote and shared access to a new generation of flash and other non-volatile memory storage, it requires fast, low-latency networks, and the first version of the specification is expected to take advantage of RDMA (Remote Direct Memory Access) support in the transport protocol.

    Many customers and vendors are now familiar with the advantages and concepts of NVMe over Fabrics but are not familiar with the specific protocols that support it. Join us on January 26th for this live Webcast that will explore and compare the Ethernet RDMA protocols and transports that support NVMe over Fabrics and the infrastructure needed to use them. You’ll hear:

    • Why NVMe Over Fabrics requires a low-latency network
    • How the NVMe protocol is mapped to the network transport
    • How RDMA-capable protocols work
    • Comparing available Ethernet RDMA transports: iWARP and RoCE
    • Infrastructure required to support RDMA over Ethernet
    • Congestion management methods

    The event is live, so please bring your questions. We look forward to answering them.


    The Performance Impact of NVMe and NVMe over Fabrics – Q&A

    December 3rd, 2014

    More than 400 people have already seen our recent live ESF Webcast, “The Performance Impact of NVMe and NVMe over Fabrics.” If you missed it, it’s now available on-demand. It was a great session with a lot of questions from attendees. We did not have time to address them all – so here is a complete Q&A courtesy of our experts from Cisco, EMC and Intel. If you think of additional questions, please feel free to comment on this blog.

    Q. Are you saying that just by changing the interface to NVMe for any SSD, one would greatly bump up the IOPS?

    A. NVMe SSDs have higher IOPs than SAS or SATA SSDs due to several factors, including the low latency of PCIe and the efficiency of the NVMe protocol.

    Q. How much of the NVMe speed you have shown is due to the simpler NVMe protocol vs. using flash? I.e., how would SAS performance change when attaching an SSD to SAS?

    A. The performance differences shown comparing NVMe to SAS to SATA were all using solid-state drives (all NAND Flash). Thus, the difference shown was due to the interface.

    Q. Can you comment on the test conditions these results were obtained under and what are the main reasons NVMe outperforms the others?

    A. The most important reason NVMe outperforms other interfaces is that it was architected for NVM – rather than inheriting the legacy of HDDs. NVMe is built on the foundation of a very efficient multi-queue model and a simple hardware automatable command set that results in very low latency and high performance. Details for IOPs and bandwidth comparisons are shown in the footnotes of the corresponding foils. For the efficiency tests, the detailed setup information was inadvertently removed from the backup. This will be corrected.

    Q. What is the IOPS difference between NVMe and SAS at the same queue depth?

    A. At a queue depth of 32 with 4K random reads, the particular devices shown deliver approximately 267K IOPs for NVMe and approximately 149K IOPs for SAS. SAS does not improve when the queue depth is increased to 128, while NVMe performance increases to approximately 472K IOPs at a queue depth of 128.

    Q. Why not use PCIe directly instead of the NVMe layer on PCIe?

    A. PCI Express is used directly. NVM Express is the standard software interface for high-performance PCI Express storage devices. PCI Express does not define the register interface, DMA model, command set, or feature set for PCIe storage devices. NVM Express replaces the proprietary software interfaces and drivers used previously by PCIe SSDs in the market.

    Q. Is the Working Group considering adding things like enclosure identification in the transport abstraction so the host/client can identify where the NVMe drives reside?

    A. The NVM Express organization is developing a Management Interface specification set for release in Q1’2015 that will enable standardized enclosure management. The intent is that these features could be used regardless of fabric type (PCIe, RDMA, etc.).

    Q. Are there APIs in the software interface for device query information and device RAID configuration?

    A. NVMe includes an Identify Controller and Identify Namespace command that provides information about the NVMe subsystem, controllers, and namespaces. It is possible to create a RAID controller that uses the NVMe interface if desired. Higher level software APIs are typically defined by the OSV.
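
    As an illustration of the query path, the C sketch below issues an Identify Controller command through the Linux NVMe passthrough ioctl and prints the model string. It assumes a Linux host with the in-kernel NVMe driver and a character device node such as /dev/nvme0, and it abbreviates error handling.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/nvme_ioctl.h>

    int main(void)
    {
        uint8_t id[4096];                        /* Identify data is a 4KB page */
        int fd = open("/dev/nvme0", O_RDONLY);   /* admin commands go to the char device */
        if (fd < 0) { perror("open"); return 1; }

        struct nvme_admin_cmd cmd = {
            .opcode   = 0x06,                    /* Identify */
            .nsid     = 0,
            .addr     = (uint64_t)(uintptr_t)id,
            .data_len = sizeof(id),
            .cdw10    = 1,                       /* CNS=1: Identify Controller */
        };

        if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) { perror("ioctl"); return 1; }

        char model[41] = {0};
        memcpy(model, &id[24], 40);              /* model number: bytes 24..63 */
        printf("model: %s\n", model);
        close(fd);
        return 0;
    }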

    Q. 1. Are NVMe drivers today multi-threaded? 2. If I were to buy an NVMe device today, can you suggest a list of vendors whose solutions are used today in data centers (i.e., production and not proof of concept or prototype)?

    A. The NVM Express drivers are designed for multi-threading – each I/O queue may be owned/controlled by one thread without synchronization with other driver threads. A list of devices that have passed NVMe interoperability and conformance testing is on the NVMe Integrator’s List.

    Q. When do you think the market will consolidate around NVMe/PCIe-based SSDs, and when will the SATA era end?

    A. By 2018, IDC predicts that Enterprise SSDs by Interface will be PCIe=38%, SAS=28%, and SATA=34%. By 2018, Samsung predicts over 70% of client SSDs will be PCIe. Based on forecasts like this, we expect strong growth for NVMe as the standard PCIe SSD interface in both Enterprise and Client segments.

    Q. Why can’t it be like a graphics card, which does memory transactions?

    A. SSDs of today have much longer latency than memory – where a read from a typical NAND page takes > 50 microseconds. However, as next generation NVM comes to market over the next few years, there may be blurring of the lines between storage and memory, where next generation NVM may be used as very fast storage (like NVMe) or as memory as in NVDIMM type of usage models.

    Q. It seems that most NVMe drive vendors supply proprietary drivers for their drives. What’s the value of NVMe over proprietary interfaces given this? Will we eventually converge on the open source driver?

    A. As the NVMe ecosystem matures, we would expect most implementations would use inbox drivers that are present in many OSes, like Windows, Linux, and Solaris. However, in some Enterprise applications, a vendor may have a value added feature that could be delivered via their own software driver. OEMs and customers will decide whether to use inbox drivers or a vendor specific driver based on whether the value provided by the vendor is significant.

    Q. To create an interconnect to a scale-out storage system with many NVMe drives, does that mean you would need an aggregated fabric link (with multiple RDMA links) to provide enough bandwidth for multiple NVMe drives?

    A. It depends on the speed of the fabric links and the number of NVMe drives. Ideally, the target system would be configured such that the front-end fabric and back-end NVMe drives were bandwidth balanced. Scaling out multi-drive subsystems on a fabric may require the use of fat-tree switch topologies, which may be constructed using some form of link aggregation. The performance of the PCIe NVMe drives is expected to put high bandwidth demands on the front-end network interconnect. Each NVMe SFF-8639 2.5” drive has a PCIe Gen3 x4 interconnect with the capability to produce 3+ GB/s (24 Gb/s) of sustained storage bandwidth. There are multiple production server systems with 4-8 NVMe SFF-8639 drive bays, which puts these platforms in the 200 Gb/s range when used as NVMe over Fabrics storage servers. The combination of PCIe NVMe drives and NVMe over Fabrics targets is going to have a significant impact on datacenter storage performance.
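
    For readers who want to sanity-check those figures, here is a small back-of-the-envelope calculation in C. The per-drive sustained number is the 24 Gb/s quoted above; real throughput will vary by drive and workload.

    #include <stdio.h>

    int main(void)
    {
        /* PCIe Gen3: 8 GT/s per lane with 128b/130b encoding. */
        double lane_gbps = 8.0 * 128.0 / 130.0;   /* ~7.88 Gb/s usable per lane   */
        double link_gbps = 4.0 * lane_gbps;       /* x4 link: ~31.5 Gb/s raw      */
        double sustained = 24.0;                  /* ~3 GB/s sustained, as quoted */

        printf("x4 Gen3 raw link rate : %4.1f Gb/s\n", link_gbps);
        printf("8 drives, sustained   : %4.0f Gb/s of front-end demand\n",
               8.0 * sustained);
        return 0;
    }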

    Q. In other forums we heard about NVMe extensions to deliver vendor specific value add features. Do we have any updates?

    A. Each vendor is allowed to add their own vendor specific features and value. It would be best to discuss any vendor specific features with the appropriate vendor.

    Q. Given that PCIe is not a scalable fabric at least from a storage perspective, do you see the need for SAS SSDs to increase or diminish over time? Or is your view that NVMe SSDs will populate the tier between DRAM and say, rotating media like SAS HDDs?

    A. NVMe SSDs are the highest performance SSDs available today. If there is a box of NVMe SSDs, the most appropriate connection to that JBOD may be Ethernet or another fabric that then fans out inside the JBOD to PCIe/NVMe SSDs.

    Q. From a storage industry perspective, what deficiencies does NVMe have to displace SAS? Will that transition ever happen?

    A. NVMe SSDs are seeing initial broad deployment primarily in server use cases that prize the high performance. Storage applications require a robust high-availability interface. NVMe has defined support for dual port, reservations, and other high availability features. NVMe will be used in storage applications as these high availability features mature in products.

    Q. Will NVMe over Fabrics allow DMA reads/writes of the NVMe device directly (without going through system memory)?

    A. The locality of the NVMe over Fabrics buffers on the target side is target implementation specific. That said, one could construct a target that used a pool of PCIe NVMe subsystem controller resident memory as the source and/or sink buffers of a fabric NIC’s NVMe data exchanges. This type of configuration has the limitation of having to pre-determine the fabric-data-to-NVMe-device locality, or else the data could end up in the wrong drive’s controller memory.

    Q. Intel True Scale fabric technology was based on the Fulcrum ASIC. Could you please provide input on how Intel Omni Scale differs from the Intel True Scale fabric?

    A. In the context of NVMe over Fabrics, Intel Omni-Path fabric is a possible future fabric candidate for an NVMe over Fabrics definition. Specifics on the fabric itself are outside of the scope of the NVMe over Fabrics definition. For information on Omni-Path, please refer to http://www.intel.com/content/www/us/en/omni-scale/intel-omni-scale-fabric-demo.html?wapkw=omni-scale.

    Q. Can the host side NVMe client be in user mode since it is using RDMA?

    A. It is possible since RDMA QP communications allow for both user and kernel mode access to the RDMA verbs. However, there are implications to consider. The NVMe host software currently resides in multiple operating systems as a kernel level block-storage driver. The goal is for NVMe over Fabrics to share common NVMe code between multiple fabric types in order to provide a consistent and sustainable core NVMe software. When NVMe over Fabrics is moved to user level, it essentially becomes a separate single-fabric software solution that must be maintained independently of the multi-fabric kernel NVMe software. There are some performance advantages of having a user-level interface, such as not having to go through the O/S system calls and the ability to poll the completions. These have to be weighed against the loss of kernel resident functionality, such as upper level kernel storage software, and the cost of sustaining the software.

    Q. What role will or could the InfiniBand protocol play in the NVMe concept?

    A. InfiniBand™ is one of the supported RDMA fabrics for NVMe over RDMA. NVMe over RDMA will support the family of RDMA fabrics through use of a common set of RDMA verbs. This will allow users to select the RDMA fabric type based on their fabric requirements and not be limited to any one RDMA fabric type for NVMe over RDMA.

    Q. Where is this experimental code for NVMeOF for Driver and FIO available?

    A. FIO is a common Linux storage benchmarking tool and is available from multiple Internet sites. The driver used in the demo was developed specifically as a proof of concept and demonstration for Intel IDF 2014. It was based on a pre-standardized implementation of the NVMe over RDMA wire protocol. The standard NVMe over RDMA wire protocol is currently under definition in NVM Express, Inc. Once the standard is complete, both Host and reference Target drivers for Linux will be developed.

    Q. Was polling on the completion queue used on the target side in the prototype?

    A. The target side POC implementation used a polling technique for both the NVMe over RDMA CQ and NVMe CQ. This was to minimize the latency by eliminating the interrupt latency on the target for both CQs. Depending on the O/S and both the RDMA and NVMe devices interrupt moderation settings, interrupt latency is typically around 2 microseconds. If polling is not the desired model, Intel processors enable another form of event signaling called Monitor/Mwait where latency is typically in the 500ns latency range.
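
    For illustration, a busy-poll loop over a verbs completion queue looks roughly like the C sketch below (a sketch only, not the PoC code). Because the loop never arms completion notifications, no interrupt is taken on the polled path.

    #include <stdio.h>
    #include <infiniband/verbs.h>

    static void poll_one_completion(struct ibv_cq *cq)
    {
        struct ibv_wc wc;
        int n;

        /* Spin until a work completion arrives; no ibv_req_notify_cq() and no
         * completion-channel read, so there is no interrupt in this path. */
        do {
            n = ibv_poll_cq(cq, 1, &wc);
        } while (n == 0);

        if (n < 0 || wc.status != IBV_WC_SUCCESS)
            fprintf(stderr, "completion error: status %d\n", wc.status);
    }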

    Q. In the prototype over iWARP, did the remote device dma write/read the NVMe device directly, or did it go through remote system memory?

    A. In the PoC, all NVMe commands and command data went through the remote system memory. Only the NVMe commands were accessed by the CPU; the data was not touched.

    Q. Are there any dependencies between NVMEoF using RDMA and iWARP? Can standard software RDMA in Linux distributions be used without need for iWARP support?

    A. As mentioned, the NVMe over RDMA will be RDMA type agnostic and work with all RDMA providers that support a common set of RDMA verbs.

    Q. PCIe doesn’t support multi-host access to devices. Does NVMe over fabric require movement away from PCIe?

    A. The NVMe 1.1 specification specifically added features for multi-host support – allowing NVMe subsystems to have multiple NVMe controllers and multiple fabric ports. This model is supported in PCI Express by multi-function/multi-port PCIe drives (typically referred to as dual-port). Depending on the fabric type, NVMe over Fabrics will extend to configurations with many hosts sharing a single NVMe subsystem consisting of multiple NVMe controllers.

    Q. In light of the fact that NVMe over Fabrics reintroduces more of the SCSI architecture, can you compare and contrast NVMe over Fabrics with ‘SCSI Express’ (SAM/SPC/SBC/SOP/PQI)?

    A. NVMe over Fabrics is not a SCSI model, it’s extending the NVMe model onto other fabric types. The goal is to maintain the simplicity of the NVMe model, such as the small amount of NVMe command types, multi-queue interface model, and efficient NVM oriented host and controller implementations. We chose the RDMA fabric as the first fabric because it too was architected with a small number of operations, multi-queue interface model, and efficient low-latency operations.

    Q. Is there an open source for NVMe over Fabrics, which was used for the IDF demo? If not, can that be made available to others to see how it was done?

    A. Most of the techniques used in the PoC drivers will be implemented in future open-source Host and reference Target drivers. The PoC was both a learning and demonstration vehicle. Because the PoC drivers use a pre-standard NVMe over RDMA protocol, we feel it’s best not to propagate the implementation.

    Q. What is the overhead of the protocol? Did you try putting NVMe in front of just DRAM? I’d assume you’ll get much better results, and understand the limitations of the protocol much better. In front of DRAM it won’t be NVM, but it will give good data regarding protocol latency.

    A. The overhead of the protocol on the host-side matched the PCIe NVMe driver. On the target, the POC protocol efficiency was around 600ns of compute latency for a complete 4K I/O. For low-latency media, such as DRAM or next generation NVM, the reduced latency of a solution similar to the PoC will enable the effective use of the media’s low latency characteristics.

    Q. Do you have FC performance comparison with NVMe?

    A. We did implement a 100GbE/FCoE target with NVMe back-end storage for an Intel IDF 2012 demonstration. FCoE is a combination of two models, FCP and SCSI. Our experience with this target implementation showed that we were adding a significant amount of computational latency on both the host (initiator) and target FCoE/FC/SCSI storage stacks, which reduced the performance and efficiency advantages gained in the back-end PCIe NVMe SSDs. A significant component of this computational latency was due to the multiple storage models and associated translations that occurred between the host application and the back-end NVMe drives. This experience led us down the path of enabling an end-to-end NVMe model by expanding the NVMe model onto a range of fabric types.