At this month’s SNIA Ethernet Storage Forum Webcast, “Architectural Principles for Networked Solid State Storage Access,” Doug Voigt, Chair of the SNIA NVM Programming Technical Working Group, and a member of the SNIA Technical Council, outlined key architectural principles surrounding the application of networked solid state technologies. We had a flurry of questions near the end of the Webcast that we did not have enough time to answer. Here are Doug’s answers to all the questions we received during the event:
Q. Are there wait cycles in accessing persistent memory?
A. It depends entirely on which persistent memory (PM) technology is being accessed and how the memory interconnect is used. Some technologies have write times that are quite different from read times. When using tightly timed interconnects such as DDR with those technologies it may be difficult to avoid wait cycles.
Q. How do Pmalloc and malloc share the virtual address space of the application?
A. This is entirely up to the OS and other libraries operating within any constraints of the processor architecture-specific memory management units. A good mental model would be fairly large regions of contiguous address space in both the physical and virtual domains, where each region will comprise a single type of memory. Capacity will be reserved for pmalloc and malloc in the appropriate regions.
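The mental model above can be sketched as a toy in Python. This is only an illustration under stated assumptions: the `Region` class, the base addresses, the region sizes and the `malloc`/`pmalloc` wrappers are all hypothetical, and a real OS and allocator library would make these decisions very differently.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A contiguous range of virtual addresses backed by one type of memory."""
    name: str
    base: int
    size: int
    next_free: int = 0  # offset of the first unallocated byte

    def alloc(self, nbytes: int, align: int = 16) -> int:
        # Round the next free offset up to the requested alignment.
        offset = (self.next_free + align - 1) // align * align
        if offset + nbytes > self.size:
            raise MemoryError(f"{self.name} region exhausted")
        self.next_free = offset + nbytes
        return self.base + offset

    def contains(self, addr: int) -> bool:
        return self.base <= addr < self.base + self.size


# Hypothetical layout: one region per memory type, each fairly large and contiguous.
VOLATILE = Region("volatile", base=0x1000_0000, size=1 << 20)
PERSISTENT = Region("persistent", base=0x8000_0000, size=1 << 20)


def malloc(nbytes: int) -> int:
    """Reserve capacity in the volatile region."""
    return VOLATILE.alloc(nbytes)


def pmalloc(nbytes: int) -> int:
    """Reserve capacity in the persistent region."""
    return PERSISTENT.alloc(nbytes)
```

Both calls hand back addresses in the same virtual address space; which region an address falls in determines the type of memory behind it.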
Q. Always flush after doing your memory-mapped IO. Is that simply good hygiene?
A. Not exactly. The term “Memory Mapped IO” is used to reference control plane (as opposed to data plane) access. It is often reasonable to set up control plane memory as uncacheable. The need for strict order of access to physical control plane registers is so pervasive that caching is generally not useful. Uncacheable writes are always flushed by the processor, as opposed to the application.
Generally with memory mapped IO devices the data plane uses direct memory access (DMA). With memory mapped files (as opposed to memory mapped IO) Load/Store (more commonly referred to as “Ld/St”), not DMA, is used in the data plane. Disabling caching in the data plane is generally a big performance sacrifice for small byte range access.
In the Ld/St datapath, strategically placed flushing is required to retain both performance and power failure recovery. The SNIA NVM Programming Model describes this type of functionality.
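As a rough illustration of strategically placed flushing, the sketch below uses an ordinary memory-mapped file in Python. It is not the SNIA NVM Programming Model API: `RECORD_SIZE`, `write_record` and `demo` are hypothetical names, and `mmap.flush` (which calls `msync` on POSIX systems) stands in for whatever flush primitive a real PM platform provides. The idea it shows is that stores go through the mapping at memory speed, and a flush is issued only at the point where the record must survive a power failure, covering only the pages that record touches.

```python
import mmap
import os
import tempfile

RECORD_SIZE = 64  # hypothetical fixed record size, in bytes


def write_record(mm: mmap.mmap, index: int, payload: bytes) -> None:
    """Store one record through the mapping, then flush only the pages it touches."""
    if len(payload) > RECORD_SIZE:
        raise ValueError("payload larger than RECORD_SIZE")
    start = index * RECORD_SIZE
    # Plain store into the mapped range (the Ld/St data path).
    mm[start:start + RECORD_SIZE] = payload.ljust(RECORD_SIZE, b"\0")
    # mmap.flush requires a page-aligned offset, so round down to a page boundary.
    page = mmap.ALLOCATIONGRANULARITY
    flush_start = (start // page) * page
    flush_len = (start + RECORD_SIZE) - flush_start
    mm.flush(flush_start, flush_len)


def demo() -> bytes:
    """Write one record to a temporary mapped file and read it back from the file."""
    size = 2 * mmap.ALLOCATIONGRANULARITY
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, size)
        with mmap.mmap(fd, size) as mm:
            write_record(mm, 3, b"trade-0001")
        os.close(fd)
        fd = -1
        with open(path, "rb") as f:
            f.seek(3 * RECORD_SIZE)
            return f.read(RECORD_SIZE).rstrip(b"\0")
    finally:
        if fd != -1:
            os.close(fd)
        os.remove(path)
```

Flushing once per record rather than after every store keeps caching effective for small byte-range access while still marking a clear point at which the data is expected to be durable.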
Q. Once NVDIMM support becomes pervasive with support from NVMe drives in the server box, should network storage be more focused on SAS Flash or just SAS HDDs?
A. Not necessarily. NVMe over Fabric, Fibre Channel and iSCSI are also types of networked storage that will likely retain significant market share relative to SAS.
Q. Are the ‘Big Data’ Data Warehouse applications starting to use persistent memory and domain technologies?
A. It is too early to see much of this yet. PM technologies might become a priority as a staging area for analytic applications with high ingest or checkpoint rates. NVDIMMs are likely to be too expensive to store anything “big” for quite a while.
Q. Also, are persistent memory/domains being used in the Hyper-converged and Converged hardware infrastructures?
A. Persistent memory is quintessentially (Hyper-) converged. It wouldn’t be unreasonable to expect some traction with hyper-converged solutions that experience high storage-performance demand.
Q. What distance would you associate with 10’s of microseconds?
A. In terms of transmission delay, 10’s of µs align with a campus or small city scale, but the distance itself is often not the primary factor. Switching delays, transmission line properties and software overhead are generally bigger factors.
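A back-of-the-envelope calculation shows where the campus or small city scale comes from. The sketch below assumes signals travel through optical fiber at roughly two thirds of the speed of light in vacuum, about 0.2 km per microsecond; that figure, the function names and the overhead parameter are assumptions for illustration, not numbers from the Webcast.

```python
# Assumption: propagation speed in fiber is about 2e8 m/s, i.e. 0.2 km per microsecond.
KM_PER_US_IN_FIBER = 0.2


def one_way_distance_km(delay_us: float) -> float:
    """Distance a signal covers in delay_us microseconds of pure propagation."""
    return delay_us * KM_PER_US_IN_FIBER


def reachable_distance_km(budget_us: float, overhead_us: float) -> float:
    """One-way distance left after switching and software overhead use part of the budget."""
    remaining_us = max(budget_us - overhead_us, 0.0)
    return one_way_distance_km(remaining_us)
```

Under these assumptions, 10 µs of pure propagation corresponds to about 2 km one way and 50 µs to about 10 km. If a hypothetical 8 µs of a 10 µs budget goes to switching and software, only about 0.4 km remains, which is why distance is often not the primary factor.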
Q. So latency would be the binding factor for distances…not a question, an observation.
A. Yes, in effect, either through transmission or relay. See above.
Q. Aren’t there multi-threaded SSDs?
A. Yes, but since the primary metric in this presentation is latency we ignore multi-threading. It can enable more work to get done, but it generally increases latency rather than reducing it.
Q. Is Pmalloc universal usage?
A. The term is starting to be recognized among developers and has been used in research. Various similar names have been used in early research prototypes, such as pmalloc in Mnemosyne and nvmalloc in SCMFS.
Q. So how would PM help in a (stock broking) requirement, where we currently prophesize an RDMA or iWARP solution?
A. With PM the answer is always lower latency. PM can be integrated like memory or like flash. RDMA network paths for both of these options were discussed in the presentation. In either case, PM is low-latency enough that networking and software overheads will completely determine performance, even when using RDMA. The performance boost from PM is greatest when it is accessed locally. If remote access is a requirement, then the new work being done in the RDMA community should help.
Q. If data stored in memory needs to be copied to a different host’s memory (for consistency), how does PM assist, or is there an extension to PM? Coherency between multiple hosts in a cluster, if you will?
A. PM technology does not help with this; the methods of managing consistency across hosts remain unchanged by PM. All PM offers is low latency persistence.
Coordination across hosts or nodes in a cluster must use existing clustering techniques such as locking and quorums. In addition, the relative timescales of memory access and network communication suggest the application of asynchronous remote replication techniques used in today’s storage solutions.
Regarding coherency, PM brings nothing new to the known techniques for managing coherency. Classical cluster architecture must be applied outside of symmetric multi-processing coherency domains. Within coherency domains, all of the logic is above the PM level in a processor side memory controller or a software emulation of the same algorithms.