Are Ethernet-attached SSDs Brilliant?

Several solid state disk (SSD) and networking vendors have demonstrated ways to connect SSDs directly to an Ethernet network. They propose that deploying Ethernet SSDs will be more scalable, easier to manage, higher performance, and/or lower cost than traditional storage networking solutions that use a storage controller (or hyperconverged node) between the SSDs and the network. Who would want to attach SSDs directly to the network? Are these vendors brilliant or simply trying to solve a problem that doesn’t exist? What are the different solutions that could benefit from Ethernet SSDs? Which protocols would one use to access them? How will orchestration be used to enable applications to find assigned Ethernet SSDs? How will Ethernet SSDs affect server subsystems such as Ethernet RAID/mirroring and affect solution management such as Ethernet SAN orchestration? And how do Ethernet SSDs relate to computational storage?

Principles of Networked Solid State Storage – Q&A

At this month’s SNIA Ethernet Storage Forum Webcast, “Architectural Principles for Networked Solid State Storage Access,” Doug Voigt, Chair of the SNIA NVM Programming Technical Working Group, and a member of the SNIA Technical Council, outlined key architectural principles surrounding the application of networked solid state technologies. We had a flurry of questions near the end of the Webcast that we did not have enough time to answer. Here are Doug’s answers to all the questions we received during the event:

Q. Are there wait cycles in accessing persistent memory?

A. It depends entirely on which persistent memory (PM) technology is being accessed and how the memory interconnect is used.  Some technologies have write times that are quite different from read times.  When using tightly timed interconnects such as DDR with those technologies it may be difficult to avoid wait cycles.

Q. How do Pmalloc and malloc share the virtual address space of the application?

A. This is entirely up to the OS and other libraries operating within any constraints of the processor architecture-specific memory management units.  A good mental model would be fairly large regions of contiguous address space in both the physical and virtual domains, where each region will comprise a single type of memory. Capacity will be reserved for pmalloc and malloc in the appropriate regions.

Q. Always flush after doing your memory-mapped IO.  Is that simply good hygiene?

A. Not exactly. The term “Memory Mapped IO” is used to reference control plane (as opposed to data plane) access.  It is often reasonable to set up control plane memory as uncacheable. The need for strict order of access to physical control plane registers is so pervasive that caching is generally not useful. Uncacheable writes are always flushed by the processor, as opposed to the application.

Generally with memory mapped IO devices the data plane uses direct memory access (DMA).  With memory mapped files (as opposed to memory mapped IO) Load/Store (more commonly referred to as “Ld/St”), not DMA, is used in the data plane. Disabling caching in the data plane is generally a big performance sacrifice for small byte range access.

In the Ld/St datapath, strategically placed flushing is required to retain both performance and power failure recovery. The SNIA NVM Programming Model describes this type of functionality.
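As a rough illustration of "store, then flush at a strategic point," here is a minimal Python sketch. It uses an ordinary temporary file and the standard `mmap` module, so it only models the pattern; it is not an implementation of the SNIA NVM Programming Model, and the function names `store_and_flush` and `demo` are made up for this example.

```python
import mmap
import os
import tempfile


def store_and_flush(path: str, payload: bytes, offset: int = 0) -> bytes:
    """Write payload through a memory mapping, flush it, then read it back via normal file I/O."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mm:
            # "Store": a plain memory write into the mapped region.
            mm[offset:offset + len(payload)] = payload
            # Flush point: ask the OS to write the mapping back to the file.
            # Until this returns, the update should not be assumed durable.
            mm.flush()
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(len(payload))


def demo() -> bytes:
    # Assumption: a regular temporary file stands in for a persistent-memory-backed file.
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\0" * mmap.PAGESIZE)  # a mapping needs a non-empty file
        os.close(fd)
        return store_and_flush(path, b"commit-record")
    finally:
        os.remove(path)
```

The design point is that the flush is issued once, after a group of related stores, rather than after every byte written; that is what keeps caching useful while still giving a known recovery point.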

Q. Once NVDIMM support becomes pervasive with support from NVMe drives in the server box, should network storage be more focused on SAS Flash or just SAS HDDs?

A. Not necessarily.  NVMe over Fabric, Fibre Channel and iSCSI are also types of networked storage that will likely retain significant market share relative to SAS.

Q. Are the ‘Big Data’ Data Warehouse applications starting to use persistent memory and persistence domain technologies in their applications?

A. It is too early to see much of this yet. PM technologies might become a priority as a staging area for analytic applications with high ingest or checkpoint rates. NVDIMMs are likely to be too expensive to store anything “big” for quite a while.

Q. Also, are persistent memory and persistence domains being used in Hyper-converged and Converged hardware infrastructures?

A. Persistent memory is quintessentially (Hyper-) converged.  It wouldn’t be unreasonable to expect some traction with hyper-converged solutions that experience high storage-performance demand.

Q. What distance would you associate with 10’s of microseconds?

A. In terms of transmission delay, tens of µs align with a campus or small city scale, but the distance itself is often not the primary factor.  Switching delays, transmission line properties and software overhead are generally bigger factors.
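For a sense of scale, the propagation part of that delay can be worked out directly. The sketch below assumes signals travel through optical fiber at about two thirds of the speed of light in vacuum (an assumption, not a figure from the Webcast) and ignores switching and software overhead entirely.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458
FIBER_VELOCITY_FACTOR = 2 / 3  # assumption: typical value for glass fiber


def one_way_distance_km(delay_us: float,
                        velocity_factor: float = FIBER_VELOCITY_FACTOR) -> float:
    """Distance a signal covers in delay_us microseconds of pure propagation time."""
    metres = SPEED_OF_LIGHT_M_PER_S * velocity_factor * delay_us * 1e-6
    return metres / 1000.0


def round_trip_reach_km(delay_us: float) -> float:
    """Farthest target reachable if delay_us has to cover the trip there and back."""
    return one_way_distance_km(delay_us) / 2.0
```

Under these assumptions, 10 µs of propagation covers roughly 2 km one way, and half that if the budget has to include the return trip, which is consistent with the campus-to-small-city scale described above.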

Q. So latency would be the binding factor for distances…not a question, an observation.

A. Yes, in effect, either through transmission or relay.  See above.

Q. Aren’t there multi-threaded SSDs?

A. Yes, but since the primary metric in this presentation is latency we ignore multi-threading.  It can enable more work to get done, but it generally increases latency rather than reducing it.

Q. Is Pmalloc universal usage?

A. The term is starting to be recognized among developers and has been used in research. Various similar names have been used in early research prototypes, such as pmalloc in Mnemosyne and nvmalloc in SCMFS.

Q. So how would PM help in a (stock broking) requirement, where we currently prophesize an RDMA or iWARP solution?

A. With PM the answer is always lower latency.  PM can be integrated like memory or like flash. RDMA network paths for both of these options were discussed in the presentation. In either case, PM is low-latency enough that networking and software overheads will completely determine performance, even when using RDMA. The performance boost from PM is greatest when it is accessed locally.  If remote access is a requirement then the new work being done in the RDMA community should help.

Q. If data stored in memory needs to be copied to a different host’s memory (for consistency), how does PM assist, or is there an extension to PM? Coherency between multiple hosts in a cluster, if you will?

A. PM technology does not help with this; the methods of managing consistency across hosts remain unchanged by PM.  All PM offers is low latency persistence.

Coordination across hosts or nodes in a cluster must use existing clustering techniques such as locking and quorums. In addition, the relative timescales of memory access and network communication suggest the application of asynchronous remote replication techniques used in today’s storage solutions.

Regarding coherency, PM brings nothing new to the known techniques for managing coherency.  Classical cluster architecture must be applied outside of symmetric multi-processing coherency domains. Within coherency domains, all of the logic is above the PM level in a processor side memory controller or a software emulation of the same algorithms.

Questions Aplenty on NVMe over Fabrics

Our live SNIA-ESF Webcast, “Under the Hood with NVMe over Fabrics,” generated more questions than we anticipated, proving to us that this topic is worthy of future discussions. Here are answers to both the questions we took during the live event as well as those we didn’t have time for.

Q. So fabric is an alternative to PCIe, for those of us familiar with PCIe-attached NVMe devices, yes?

A. Yes, fabric is the term used in the specification that represents a variety of physical interconnects and transports for NVM Express.

Q. How are the namespaces shared in a fabric?

A. Namespaces are NVM subsystem resources and are accessible by all controllers in the NVM subsystem. Multi-host access may be coordinated using reservations.

Q. If there are multiple subsystems accessing the same NVMe devices over fabric, then how is the namespace shared?

A. The mapping of fabric NVM subsystem resources (namespaces and controllers) to PCIe NVMe device subsystems is implementation specific. They may be mapped 1 to 1 or N to 1, depending on the functionality of the NVMe bridge.

Q. Are namespace reservations similar to SCSI reservations?

A. Yes

Q. Are there plans for defining bindings for Intel Omni Path fabric?

A. Intel Omni-Path is a good candidate fabric for NVMe over Fabrics.

Q. Is hybrid attachment allowed? Could a single namespace be attached to a fabric and PCIe (through two controllers) concurrently?

A. At this moment, such a hybrid configuration is not permitted within the specification.

Q. Is an NVM subsystem purpose-built or commodity server hardware?

A. This is a difficult question to answer. At the time of this writing there are not enough “off-the-shelf” commodity components to be able to construct NVMe over Fabric subsystems.

Q. Does NVMEoF use the same NVMe PCIe controller register map?

A. A subset of the NVMe controller register mapping was retained for fabrics but renamed to “Properties” to avoid confusion.

Q. So does NVMe over Fabric act like an extension of the PCIe bus? Meaning that I see the same MMIO registers and queues remotely? Or is it a completely different protocol that is solely message based? Will current NVMe host drivers work on the fabric or does it really require a different driver stack?

A. Fabrics is not an extension of PCIe; it is an extension of NVMe. It uses the same NVMe Submission and Completion Queue model and descriptors as PCIe NVMe. Most of the original NVMe host driver stack is retained and shared between PCIe and Fabrics; only the bottom side was modified to allow for multiple transports.

Q. Does NVMe over Fabrics support immediate data for writes, or must write data always be fetched by the NVMe controller?

A. Yes, immediate data is termed “in-capsule” and is used to send the NVMe command data with the NVMe submission entry.

Q. As far as I know, Linux introduced a multi-queue model at the block layer recently. Is it the same thing you are mentioning?

A. No, but NVMe uses the Linux Block-MQ layer. NVMe Multi-Queue is used between the host and the NVMe controller for both PCIe and fabric based controllers.

Q. Are there situations where you might want to have more than one queue pair per CPU? What are they?

A. Queue-Pairs are matched up by CPU cores, not CPUs, which allows the creation of multiple namespace entities per CPU. This, in turn, is very useful for virtualization and application separation.

Q. What are three mandatory commands? Do they refer to read/write/sync cache?

A. Actually, there are 13 required commands. Kevin Marks has a very good presentation from the Flash Memory Summit that provides a list of these commands within the broader NVMe context. You can download it here.  

Q. Please talk about queue depths? Arbitrary? Limited?

A. Maximum queue depths are defined by the controller, up to a limit of 64K entries.

Q. Where will SQs and CQs be physically located? Are they on host memory or SSD memory?

A. For fabrics, the SQ is located on the controller side to avoid the inefficiency of having to pull SQEs across a fabric. CQs reside on the host.
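To make the queue placement concrete, here is a toy Python model of a single queue pair. It is only a sketch: the class and field names are invented for illustration, real queues are ring buffers with head and tail pointers, and nothing here reflects actual capsule or wire formats.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class SQE:
    """Toy submission queue entry (hypothetical fields, not the real 64-byte layout)."""
    cid: int       # command identifier chosen by the host
    opcode: str
    lba: int


@dataclass
class CQE:
    """Toy completion queue entry."""
    cid: int       # echoes the command identifier being completed
    status: int    # 0 means success in this toy model


class ToyQueuePair:
    def __init__(self, depth: int) -> None:
        self.depth = depth
        self.sq: deque = deque()   # lives on the controller side for fabrics
        self.cq: deque = deque()   # lives on the host side

    def submit(self, sqe: SQE) -> bool:
        """Host pushes an entry to the controller-resident SQ; False if the queue is full."""
        if len(self.sq) >= self.depth:
            return False
        self.sq.append(sqe)
        return True

    def controller_process_one(self) -> bool:
        """Controller consumes one SQE locally and posts a completion toward the host."""
        if not self.sq:
            return False
        sqe = self.sq.popleft()
        self.cq.append(CQE(cid=sqe.cid, status=0))
        return True

    def reap(self) -> Optional[CQE]:
        """Host collects the oldest completion, if any."""
        return self.cq.popleft() if self.cq else None
```

Because the host pushes each entry, the controller never has to reach back across the fabric to fetch a command, which is the inefficiency the answer refers to.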

Q. How do you create ordering guarantee when that is needed for correctness?

A. For commands that require sequencing, there is a concept called “Fused Commands” which get sent as a single unit.

Q. In NVMeoF how are devices discovered?

A. NVMeoF devices are discoverable via a couple of different means, depending on whether you are using Fibre Channel (which has its own discovery and login process) or an iSCSI-like name server. Mike Shapiro goes over the discovery mechanism in considerable detail in this BrightTALK Webcast.

Q. I guess all new drivers will be required for NVMeoF?

A. Yes, new drivers are being written and will be required for NVMeoF.

Q. Why can’t the doorbell+ plus communication model apply to PCIe? I mean, why doesn’t PCIe use doorbell+?

A. NVMe 1.2 defines controller resident buffers that can be used for pushing SQ Entries from the host to the controller. Doorbells are still required for PCIe to inform the controller about the new SQ entries.

Q. If there are two hosts connected to the same subsystem, will the NVMe controller have two queues, one for each host?

A. Yes.

Q. So with your command and data description, does NVMe over Fabric require RDMA or does it have a “Data Ready” type message to tell the host when to send write data?

A. Data transfer operations are fabric dependent. RDMA uses RDMA_READ; another transport may use some form of Data Ready model.

Q. Can you quantify the protocol translation overhead? In reality, it does not look that big from a performance perspective.

A. Submission Queue entries are 64 bytes and Completion Queue entries are 16 bytes. These are small relative to block storage traffic, which typically uses request sizes of 4K or more.
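The entry sizes quoted above can be turned into a quick back-of-the-envelope ratio. The sketch below counts only the 64-byte submission entry and the 16-byte completion entry per command; transport headers and any translation work in a bridge are deliberately left out, so treat it as a lower bound rather than a measurement.

```python
SQE_BYTES = 64   # submission queue entry size, from the answer above
CQE_BYTES = 16   # completion queue entry size, from the answer above


def queue_entry_overhead(payload_bytes: int) -> float:
    """Ratio of queue-entry bytes to data bytes for one command."""
    if payload_bytes <= 0:
        raise ValueError("payload_bytes must be positive")
    return (SQE_BYTES + CQE_BYTES) / payload_bytes


def overhead_table(sizes=(4096, 8192, 65536, 131072)) -> dict:
    """Overhead ratio for a few request sizes (sizes chosen here for illustration)."""
    return {size: queue_entry_overhead(size) for size in sizes}
```

For a 4K request this works out to 80 bytes of queue entries per 4096 bytes of data, a little under 2 percent, and the ratio shrinks as requests get larger.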

Q. Do Dual Port SSDs need to support two Admin Qs since they have two paths to the same host?

A. Dual-Port or multi-path capable NVM subsystems require using two NVMe controllers each with one AdminQ and one or more IO queues. 

Q. For a Dual Port SSD, does each port need to have its Submission Q on a different CPU core in the host? I assume the SQs for the two ports cannot be on the same CPU core.

A. The mapping of controller queues to host CPU cores is typically per controller. If the host was connected to two controllers, there would be two queues per core. One queue to controller 1 and one queue to controller 2 per host core.
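The per-controller mapping described here can be pictured as a small table. This is a sketch only: the function name and controller labels are invented, and the choice of queue ID `core + 1` simply reflects that queue ID 0 is the Admin queue, not how any particular driver assigns IDs.

```python
from typing import Dict, List


def build_queue_map(num_cores: int, controllers: List[str]) -> Dict[int, Dict[str, int]]:
    """Give every host core one I/O queue on each controller.

    Returns {core: {controller_label: io_queue_id}}.
    Assumption: I/O queue IDs start at 1 because ID 0 is the Admin queue.
    """
    return {
        core: {ctrl: core + 1 for ctrl in controllers}
        for core in range(num_cores)
    }


def total_io_queues(queue_map: Dict[int, Dict[str, int]]) -> int:
    """Count every (core, controller) queue in the map."""
    return sum(len(per_ctrl) for per_ctrl in queue_map.values())
```

With four cores and two controllers, each core ends up with two queues, one per controller, and the two submission queues for a dual-port device can sit on the same core without conflict because they belong to different controllers.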

Q. As you mentioned, the standard currently uses LBA addressing. What will happen when Intel goes to market with new media (3D XPoint), which is announced to be byte addressable?

A. The NVMe NVM command set is block based and is independent of the type and access method of the NVM media used in a subsystem implementation.  

Q. Is there a real benefit of this architecture in a NAS environment?

A. There is a natural advantage to making any storage access more efficient. A network-attached system still requires block access at the lower levels, and NVMe (either local or over a Fabric) can improve NAS design and flexibility immensely. This is particularly true for pNFS and scale-out SMB paradigms.

Q. How do you handle authentication across many servers (hosts) on the fabric? How do you decide what host can access what part of each device? Does it have to be namespace specific?

A. The fabrics specification defines an Authentication model and also defines the naming format for NVM subsystems and hosts. A target implementation can choose to provision NVM subsystems to specific hosts based on the naming format.
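One way a target could provision by name is a simple allow-list keyed on the subsystem name. The Python sketch below is hypothetical throughout: `ToyTarget` is not a real API, the example names are invented, and the pattern is a simplified check of the "nqn.yyyy-mm.reverse-domain:identifier" style of NVMe Qualified Name rather than a full validator. It also does not model the Authentication mechanism itself.

```python
import re
from collections import defaultdict

# Simplified shape check for an NVMe Qualified Name (assumption: not a complete validator).
NQN_PATTERN = re.compile(r"^nqn\.\d{4}-\d{2}\.[A-Za-z0-9.-]+(:.+)?$")
NQN_MAX_BYTES = 223  # assumed maximum length


def looks_like_nqn(name: str) -> bool:
    return len(name.encode("utf-8")) <= NQN_MAX_BYTES and NQN_PATTERN.match(name) is not None


class ToyTarget:
    """Hypothetical target that maps each subsystem name to the hosts allowed to use it."""

    def __init__(self) -> None:
        self._allowed = defaultdict(set)

    def allow(self, subsystem_nqn: str, host_nqn: str) -> None:
        for name in (subsystem_nqn, host_nqn):
            if not looks_like_nqn(name):
                raise ValueError(f"not an NQN-style name: {name!r}")
        self._allowed[subsystem_nqn].add(host_nqn)

    def can_connect(self, subsystem_nqn: str, host_nqn: str) -> bool:
        # Hosts that were never provisioned are rejected by default.
        return host_nqn in self._allowed.get(subsystem_nqn, set())
```

In this sketch access is decided per subsystem; whether an implementation narrows it further to individual namespaces is left to the target.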

Q. Does having the same structure at all layers mean that the transport layer of a flash appliance should also maintain the Submission and Completion Queue model, with these queues mapped to the physical queues of the NVMe sub controller?

A. The NVMe Submission Queue and Completion Queue entries are common between fabrics and PCIe NVMe. This simplifies the steps required to bridge between NVMe fabrics and NVMe PCIe. An implementation may choose to map the fabrics SQ directly to a PCIe NVMe SSD SQ to provide a very efficient, simple NVMe transport bridge.

Q. With an RDMA based transport, how will each host discover the NVME controller(s) that it has been granted access to?

A. Please see the answer above on how NVMeoF devices are discovered.

Q. Traditionally SAS supports SAS expander for scaling purpose. How does NVMe over fabric solve the issue as there is no expander concept in NVMe world?

A. Recall that SAS expanders compensate for SCSI’s inherent lack of scalability. NVMe perpetuates the multi-queue model (which does not exist for SCSI) natively, so SAS expander-like pieces are not required for scale-out.

Quick PTS Implementation

Need an abbreviated version of the SNIA SSD Performance Test Specification (PTS) in a hurry?  Jamon Bowen of Texas Memory Systems (TMS) whipped up a simple implementation of certain key parts of the PTS that can be run on a Linux system and interpreted in Excel.

It’s a free download on his Storage Tuning blog.

This is a boon for anyone who might want to run an internal preliminary test before pursuing a more formal route.

The bash script uses the Flexible I/O utility (FIO) to run through part of the SSSI PTS.  FIO does the heavy lifting, and the script manages it.  The script outputs comma separated (CSV) data and the download includes an Excel pivot table that helps format the results and select the measurement window.
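To give a feel for what "the script manages FIO" means, here is a small Python sketch that only builds fio command lines for a grid of block sizes and read/write mixes. It is not Mr. Bowen's script. The block-size list, the read-percentage list, the queue depth, the run time and the device path are all assumptions chosen for illustration; nothing is executed.

```python
from itertools import product
from typing import List

# Assumed test grid; check the PTS document itself for the normative values.
BLOCK_SIZES = ["1024k", "128k", "64k", "32k", "16k", "8k", "4k", "512"]
READ_PERCENTAGES = [100, 95, 65, 50, 35, 5, 0]


def fio_commands(device: str, runtime_s: int = 60, iodepth: int = 32) -> List[List[str]]:
    """Return one fio argument list per (read mix, block size) combination."""
    commands = []
    for read_pct, bs in product(READ_PERCENTAGES, BLOCK_SIZES):
        commands.append([
            "fio",
            f"--name=rw{read_pct}_bs{bs}",
            f"--filename={device}",      # caution: writes to this target if actually run
            "--ioengine=libaio",
            "--direct=1",                # bypass the page cache
            "--rw=randrw",
            f"--rwmixread={read_pct}",
            f"--bs={bs}",
            f"--iodepth={iodepth}",
            f"--runtime={runtime_s}",
            "--time_based",
            "--output-format=json",
        ])
    return commands
```

A caller could print these with `" ".join(cmd)` for review, or hand each list to `subprocess.run` on a test machine, with `/dev/sdX` replaced by a device whose contents can safely be destroyed.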

Since this is a bare-bones implementation the SSD must be initialized manually before the test script is run.

The test runs the IOPS Test from the PTS.  This test covers a range of block sizes and read/write ratios, and iterates until the steady state for the device is reached (with a maximum of 25 iterations).  Altogether the test takes over a day to run.
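The "iterate until steady state" logic can be sketched in a few lines of Python. The 25-round cap comes from the description above; the five-round window and the 20 percent range and 10 percent slope thresholds are assumptions made for this sketch, so consult the PTS for the normative definition. The function names are illustrative.

```python
from typing import List, Optional, Sequence


def is_steady_state(window: Sequence[float],
                    max_range_frac: float = 0.20,
                    max_slope_frac: float = 0.10) -> bool:
    """Check one measurement window (needs at least two rounds of positive IOPS values)."""
    n = len(window)
    avg = sum(window) / n
    data_range = max(window) - min(window)
    # Least-squares slope of IOPS against round index.
    x_mean = (n - 1) / 2
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    slope = sum((x - x_mean) * (y - avg) for x, y in enumerate(window)) / sxx
    slope_excursion = abs(slope) * (n - 1)   # drift of the fitted line across the window
    return data_range <= max_range_frac * avg and slope_excursion <= max_slope_frac * avg


def find_steady_round(iops_per_round: List[float],
                      window: int = 5,
                      max_rounds: int = 25) -> Optional[int]:
    """Return the 1-based round at which the trailing window first qualifies, else None."""
    last = min(len(iops_per_round), max_rounds)
    for end in range(window, last + 1):
        if is_steady_state(iops_per_round[end - window:end]):
            return end
    return None
```

Feeding this a series that starts high and then levels off returns the first round whose trailing window is flat enough; a series that is still falling returns `None`, which corresponds to a run that hits the iteration cap without settling.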

Once the test is complete, the downloadable pivot tables allow users to select the steady-state measurement window and report the data in a recommended format.

See Mr. Bowen’s blog at http://storagetuning.wordpress.com/2011/11/07/sssi-performance-test-specification/ for details on this valuable download.