Everything You Want To Know About Storage Is On Again With Part Mauve – The Architecture Pod

The first installment of our “colorful” Webcast series, “Everything You Wanted To Know about Storage But Were Too Proud To Ask – Part Chartreuse,” covered the fundamental elements of storage systems. If you missed it, you can check it out on-demand. On November 1st, we’ll be back at it, focusing on the network aspect of storage systems with “Everything You Wanted To Know About Storage But Were Too Proud To Ask – Part Mauve.”

As with any technical field, it’s all too easy to dive into the jargon of the pieces and expect people to know exactly what you mean. Unfortunately, some of the terms may have alternative meanings in other areas of technology. In this Webcast, we look specifically at some of those terms and discuss them as they relate to storage networking systems.

In particular, you’ll find out what we mean when we talk about:

  • Channel versus Bus
  • Control Plane versus Data Plane
  • Fabric versus Network

Register now for Part Mauve of “Everything You Wanted To Know About Storage But Were Too Proud To Ask.”

For people who are familiar with data center technology, whether it be compute, programming, or even storage itself, some of these concepts may seem intuitive and obvious… until you start talking to people who are really into this stuff. This series of Webcasts will be your Secret Decoder Ring to unlock the mysteries of what is going on when you hear these conversations. We hope to see you there!

Everything You Wanted to Know about Storage, but were too Proud to Ask

Many times we know things without even realizing it, or remembering how we came to know them. In technology, this often comes from direct, personal experience rather than some systematic process. In turn, this leads to “best practices” that come from tribal knowledge, rather than any inherent codified set of rules to follow.

In the world of storage, for example, it’s very tempting to think of component parts that can be swapped out interchangeably. Swap out your spinning hard drives for solid state, for example, and you can generally expect better performance. Change the way you connect to the storage device, and you get better performance… or do you?

Storage is more holistic than many people realize, and as a result there are often unintended consequences for even the simplest of modifications. With the ‘hockey stick-like’ growth in innovation over the past couple of years, many people have found themselves facing terms and concepts in storage that they feel they should have understood, but don’t.

This series of webcasts is designed to help you with those troublesome spots: everything you thought you should know about storage but were afraid to ask.

Here, we’re going to go all the way back to basics and define the terms so that you can understand what people are talking about in those discussions. Not only are we going to define the terms, but we’re also going to talk about the terms that are affected by those concepts once you start mixing and matching.

For example, when we say that we have a “memory mapped” storage architecture, what does that mean? Can we have a memory mapped storage system at the other end of a network? If so, what protocol should we use – iSCSI? POSIX? NVMe over Fabrics? Would this be an idempotent system or an object-based storage system?

Now, if that above paragraph doesn’t send you into fits of laughter, then this series of webcasts is for you (hint: most of it was complete nonsense… but which part? Come watch to find out!).

On September 7th, we will start with the very basics – The Naming of the Parts. We’ll break down the entire storage picture and identify the places where most of the confusion falls. Join us in this first webcast – Part Chartreuse – where we’ll learn:

  • What an initiator is
  • What a target is
  • What a storage controller is
  • What a RAID is, and what a RAID controller is
  • What a Volume Manager is
  • What a Storage Stack is

With these fundamental parts, we’ll be able to place them into a context so that you can understand how all these pieces fit together to form a Data Center storage environment. Future webcasts will discuss:

Part Mauve – Architecture Pod:

  • Channel v. bus
  • Control plane v. data plane
  • Fabric v. network

Part Teal – Buffering Pod:

  • Buffering v. Queueing (with Queue Depth)
  • Flow Control
  • Ring Buffers

Part Rosé – iSCSI Pod:

  • iSCSI offload
  • TCP offload
  • Host-based iSCSI

Part Sepia – Getting-From-Here-To-There Pod:

  • Encapsulation v. Tunneling
  • IOPS v. Latency v. Jitter

Part Vermillion – The What-if-Programming-and-Networking-Had-A-Baby Pod:

  • Storage APIs v. POSIX
  • Block v. File v. Object
  • Idempotent
  • Coherence v. Cache Coherence
  • Byte Addressable v. Logical Block Addressing

Part Taupe – Memory Pod:

  • Memory Mapping
  • Physical Region Page (PRP)
  • Scatter Gather Lists
  • Offset

Part Turquoise – Where-Does-My-Data-Go Pod:

  • Volatile v. Non-Volatile v. Persistent Memory
  • NVDIMM v. RAM v. DRAM v. SLC v. MLC v. TLC v. NAND v. 3D NAND v. Flash v. SSDs v. NVMe
  • NVMe (the protocol)

Part Burgundy – Orphans Pod:

  • Doorbells
  • Controller Memory Buffers

Of course, you may already be familiar with some, or all, of these concepts. If you are, then these webcasts aren’t for you. However, if you’re a seasoned professional in technology in another area (compute, networking, programming, etc.) and you want to brush up on some of the basics without judgment or expectations, this is the place for you.

Oh, and why are the parts named after colors, instead of numbered? Because there is no order to these webcasts. Each is a standalone seminar on understanding some of the elements of storage systems that can help you learn about technology without admitting that you were faking it the whole time! If you are looking for a starting point – the absolute beginning place – please start with Part Chartreuse, “The Naming of the Parts.” We look forward to seeing you on September 7th at 10:00 a.m. PT. Register today.

Principles of Networked Solid State Storage – Q&A

At this month’s SNIA Ethernet Storage Forum Webcast, “Architectural Principles for Networked Solid State Storage Access,” Doug Voigt, Chair of the SNIA NVM Programming Technical Working Group, and a member of the SNIA Technical Council, outlined key architectural principles surrounding the application of networked solid state technologies. We had a flurry of questions near the end of the Webcast that we did not have enough time to answer. Here are Doug’s answers to all the questions we received during the event:

Q. Are there wait cycles in accessing persistent memory?

A. It depends entirely on which persistent memory (PM) technology is being accessed and how the memory interconnect is used.  Some technologies have write times that are quite different from read times.  When using tightly timed interconnects such as DDR with those technologies it may be difficult to avoid wait cycles.

Q. How do Pmalloc and malloc share the virtual address space of the application?

A. This is entirely up to the OS and other libraries operating within any constraints of the processor architecture-specific memory management units.  A good mental model would be fairly large regions of contiguous address space in both the physical and virtual domains, where each region will comprise a single type of memory. Capacity will be reserved for pmalloc and malloc in the appropriate regions.
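
To make that mental model concrete, here is a minimal C sketch (our own illustration, not from the Q&A) showing a volatile allocation and a persistent-memory mapping landing in different regions of the same process’s virtual address space. It assumes a Linux host with a DAX-capable filesystem mounted at the hypothetical path /mnt/pmem; a production application would more likely rely on a pmalloc-style library allocator.

```c
/* Minimal sketch: volatile and persistent allocations sharing one
 * virtual address space.  Assumes a Linux host with a DAX-capable
 * filesystem mounted at /mnt/pmem (hypothetical path). */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Ordinary volatile allocation: the OS places it in a DRAM-backed
     * region of the process's virtual address space. */
    char *volatile_buf = malloc(4096);

    /* Persistent allocation: map a file on a persistent-memory-aware
     * filesystem.  The mapping lands in a different region of the same
     * virtual address space, backed by PM instead of DRAM. */
    int fd = open("/mnt/pmem/region", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, 4096) != 0)
        return 1;
    char *pm_buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    if (pm_buf == MAP_FAILED)
        return 1;

    printf("volatile region at %p, persistent region at %p\n",
           (void *)volatile_buf, (void *)pm_buf);

    munmap(pm_buf, 4096);
    close(fd);
    free(volatile_buf);
    return 0;
}
```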

Q. Always flush after doing your memory-mapped IO.  Is that simply good hygiene?

A. Not exactly. The term “Memory Mapped IO” is used to reference control plane (as opposed to data plane) access.  It is often reasonable to set up control plane memory as uncacheable. The need for strict order of access to physical control plane registers is so pervasive that caching is generally not useful. Uncacheable writes are always flushed by the processor, as opposed to the application.

Generally with memory mapped IO devices the data plane uses direct memory access (DMA).  With memory mapped files (as opposed to memory mapped IO) Load/Store (more commonly referred to as “Ld/St”), not DMA, is used in the data plane. Disabling caching in the data plane is generally a big performance sacrifice for small byte range access.

In the Ld/St datapath, strategically placed flushing is required to retain both performance and power failure recovery. The SNIA NVM Programming Model describes this type of functionality.
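
As a rough illustration of what “strategically placed flushing” looks like in the Ld/St data path, here is a hedged x86 sketch (ours, not part of the Q&A): an ordinary store followed by a cache-line flush and a store fence. It assumes pm_buf points into a persistent-memory mapping and that the platform’s persistence domain accepts flushed cache lines; real code should follow the SNIA NVM Programming Model and the platform documentation.

```c
/* Sketch only: flush a store to persistent memory on x86.
 * Assumes pm_buf points into a persistent-memory mapping (see the
 * earlier sketch) and that flushed lines reach the persistence domain. */
#include <emmintrin.h>   /* _mm_clflush, _mm_sfence */

void persist_write(char *pm_buf, char value)
{
    pm_buf[0] = value;          /* ordinary store (Ld/St data path)   */
    _mm_clflush(pm_buf);        /* push the cache line toward PM      */
    _mm_sfence();               /* order the flush before later stores */
}
```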

Q. Once NVDIMM support becomes pervasive with support from NVMe drives in the server box, should network storage be more focused on SAS Flash or just SAS HDDs?

A. Not necessarily.  NVMe over Fabric, Fibre Channel and iSCSI are also types of networked storage that will likely retain significant market share relative to SAS.

Q. Are the ‘Big Data’ Data Warehouse applications starting to use persistent memory and domain technologies in their applications?

A. It is too early to see much of this yet. PM technologies might become a priority as a staging area for analytic applications with high ingest or checkpoint rates. NVDIMMs are likely to be too expensive to store anything “big” for quite a while.

Q. Also, are persistent memory/domains being used in the Hyper-converged and Converged hardware infrastructures?

A. Persistent memory is quintessentially (Hyper-) converged.  It wouldn’t be unreasonable to expect some traction with hyper-converged solutions that experience high storage-performance demand.

Q. What distance would you associate with 10’s of microseconds?

A. In terms of transmission delay, 10s of µs align with a campus or small-city scale, but the distance itself is often not the primary factor. Switching delays, transmission line properties and software overhead are generally bigger factors.
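
For a rough sense of scale, here is a back-of-the-envelope sketch (our own illustration) converting distance to one-way propagation delay, assuming light travels through fiber at roughly 200 km per millisecond (about two-thirds of c) and ignoring the switching and software overheads that usually dominate:

```c
/* One-way fiber propagation delay, ignoring switching and software
 * overheads (which are usually the bigger factors). */
#include <stdio.h>

int main(void)
{
    const double km_per_us = 0.2;            /* ~200,000 km/s in glass */
    const double distances_km[] = { 2.0, 5.0, 10.0 };

    for (int i = 0; i < 3; i++)
        printf("%5.1f km  ->  ~%3.0f us one way\n",
               distances_km[i], distances_km[i] / km_per_us);
    /* 2-10 km (campus / small-city scale) works out to 10-50 us. */
    return 0;
}
```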

Q. So latency would be the binding factor for distances…not a question, an observation.

A. Yes, in effect, either through transmission or relay.  See above.

Q. Aren’t there multi-threaded SSDs?

A. Yes, but since the primary metric in this presentation is latency we ignore multi-threading.  It can enable more work to get done, but it generally increases latency rather than reducing it.

Q. Is Pmalloc universal usage?

A. The term is starting to be recognized among developers and has been used in research. Various similar names have been used in early research prototypes, such as pmalloc in Mnemosyne and nvmalloc in SCMFS.

Q. So how would PM help in a (stock broking) requirement, where we currently prophesize an RDMA or iWARP solution?

A. With PM the answer is always lower latency. PM can be integrated like memory or like flash. RDMA network paths for both of these options were discussed in the presentation. In either case, PM is low-latency enough that networking and software overheads will completely determine performance, even when using RDMA. The performance boost from PM is greatest when it is accessed locally. If remote access is a requirement, then the new work being done in the RDMA community should help.

Q. If data stored in memory needs to be copied to a different host’s memory (for consistency), how does PM assist, or is there an extension to PM? Coherency between multiple hosts in a cluster, if you will?

A. PM technology does not help with this; the methods of managing consistency across hosts remain unchanged by PM.  All PM offers is low latency persistence.

Coordination across hosts or nodes in a cluster must use existing clustering techniques such as locking and quorums. In addition, the relative timescales of memory access and network communication suggest the application of asynchronous remote replication techniques used in today’s storage solutions.

Regarding coherency, PM brings nothing new to the known techniques for managing coherency.  Classical cluster architecture must be applied outside of symmetric multi-processing coherency domains. Within coherency domains, all of the logic is above the PM level in a processor side memory controller or a software emulation of the same algorithms.

Architectural Principles for Networked Solid State Storage Access

There are many permutations of technologies, interconnects and application level approaches in play with solid state storage today.  In fact, it is becoming increasingly difficult to reason clearly about which problems are best solved by various permutations of these. That’s why the SNIA Ethernet Storage Forum, together with the SNIA Solid State Storage Initiative, is hosting a live Webcast, “Architectural Principles for Networked Solid State Storage Access,” on June 2nd at 10:00 a.m. PT.

As our presenter, we are fortunate to have Doug Voigt, chair of the SNIA NVM Programming Technical Working Group and a member of the SNIA Technical Council. Doug will outline key architectural principles that may allow us to think about the application of networked solid state technologies more systematically, answering questions such as:

  • How do applications see IO and memory access differently?
  • What is the difference between a memory and an SSD technology?
  • How do application and technology views permute?
  • How do memory and network interconnects change the equation?
  • What are persistence domains and why are they important?

I hope you’ll register today and join us on June 2nd for an hour that is sure to be insightful.

Questions Aplenty on NVMe over Fabrics

Our live SNIA-ESF Webcast, “Under the Hood with NVMe over Fabrics,” generated more questions than we anticipated, proving to us that this topic is worthy of future discussions. Here are answers to both the questions we took during the live event as well as those we didn’t have time for.

Q. So fabric is an alternative to PCIe, for those of us familiar with PCIe-attached NVMe devices, yes?

A. Yes, fabric is the term used in the specification that represents a variety of physical interconnects and transports for NVM Express.

Q. How are the namespaces shared in a fabric?

A. Namespaces are NVM subsystem resources and are accessible by all controllers in the NVM subsystem. Multi-host access may be coordinated using reservations.

Q. If there are multiple subsystems accessing the same NVMe devices over a fabric, then how is the namespace shared?

A. The mapping of fabric NVM subsystem resources (namespaces and controllers) to PCIe NVMe device subsystems is implementation specific. They may be mapped 1 to 1 or N to 1, depending on the functionality of the NVMe bridge.

Q. Are namespace reservations similar to SCSI reservations?

A. Yes

Q. Are there plans for defining bindings for Intel Omni Path fabric?

A. Intel Omni-Path is a good candidate fabric for NVMe over Fabrics.

Q. Is hybrid attachment allowed? Could a single namespace be attached to a fabric and PCIe (through two controllers) concurrently?

A. At this moment, such a hybrid configuration is not permitted by the specification.

Q. Is an NVM subsystem purpose-built or commodity server hardware?

A. This is a difficult question to answer. At the time of this writing there are not enough “off-the-shelf” commodity components to be able to construct NVMe over Fabric subsystems.

Q. Does NVMEoF use the same NVMe PCIe controller register map?

A. A subset of the NVMe controller register mapping was retained for fabrics but renamed to “Properties” to avoid confusion.

Q. So does NVMe over Fabric act like an extension of the PCIe bus? Meaning that I see the same MMIO registers and queues remotely? Or is it a completely different protocol that is solely message based? Will current NVMe host drivers work on the fabric or does it really require a different driver stack?

A. Fabrics is not an extension of PCIe; it’s an extension of NVMe. It uses the same NVMe Submission and Completion Queue model and Descriptors as PCIe NVMe. Most of the original NVMe host driver stack is retained and shared between PCIe and Fabrics; the bottom side was modified to allow for multiple transports.

Q. Does NVMe over Fabrics support immediate data for writes, or must write data always be fetched by the NVMe controller?

A. Yes, immediate data is termed “in-capsule” and is used to send the NVMe command data with the NVMe submission entry.

Q. As far as I know, Linux introduced a multi-queue model at the block layer recently. Is it the same thing you are mentioning?

A. No, but NVMe uses the Linux Block-MQ layer. NVMe Multi-Queue is used between the host and the NVMe controller for both PCIe and fabric based controllers.

Q. Are there situations where you might want to have more than one queue pair per CPU? What are they?

A. Queue-Pairs are matched up by CPU cores, not CPUs, which allows the creation of multiple namespace entities per CPU. This, in turn, is very useful for virtualization and application separation.
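
Here is a hedged sketch (ours, not from any NVMe code base) of the “one I/O queue pair per core” idea described above; the queue count and the modulo fallback are illustrative, since a real driver negotiates the number of queues with the controller when the queues are created.

```c
/* Sketch: mapping I/O queue pairs to CPU cores.  MAX_QUEUES and the
 * modulo fallback are illustrative only. */
#include <stdio.h>

#define MAX_QUEUES 128

struct queue_pair { int id; /* SQ/CQ state would live here */ };
static struct queue_pair qps[MAX_QUEUES];

/* One queue pair per core when possible; otherwise cores share pairs. */
static struct queue_pair *queue_for_core(int core_id, int nr_queues)
{
    return &qps[core_id % nr_queues];
}

int main(void)
{
    int nr_cores = 8, nr_queues = 8;
    for (int i = 0; i < nr_queues; i++)
        qps[i].id = i;
    for (int core = 0; core < nr_cores; core++)
        printf("core %d -> queue pair %d\n",
               core, queue_for_core(core, nr_queues)->id);
    return 0;
}
```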

Q. What are three mandatory commands? Do they refer to read/write/sync cache?

A. Actually, there are 13 required commands. Kevin Marks has a very good presentation from the Flash Memory Summit that provides a list of these commands within the broader NVMe context. You can download it here.  

Q. Please talk about queue depths? Arbitrary? Limited?

A. Queue depths are controller defined, up to a maximum of 64K entries.

Q. Where will SQs and CQs be physically located? Are they on host memory or SSD memory?

A. For fabrics, the SQ is located on the controller side to avoid the inefficiency of having to pull SQEs across a fabric. CQs reside on the host.

Q. How do you create ordering guarantee when that is needed for correctness?

A. For commands that require sequencing, there is a concept called “Fused Commands” which get sent as a single unit.

Q. In NVMeoF how are devices discovered?

A. NVMeoF devices are discoverable via a couple of different means, depending on whether you are using Fibre Channel (which has its own discovery and login process) or an iSCSI-like name server. Mike Shapiro goes over the discovery mechanism in considerable detail in this BrightTALK Webcast.

Q. I guess all new drivers will be required for NVMeoF?

A. Yes, new drivers are being written and will be required for NVMeoF.

Q. Why can’t the doorbell+ communication model apply to PCIe? I mean, why doesn’t PCIe use doorbell+?

A. NVMe 1.2 defines controller resident buffers that can be used for pushing SQ Entries from the host to the controller. Doorbells are still required for PCIe to inform the controller about the new SQ entries.

Q. If there are two hosts connected to the same subsystem, will the NVMe controller have two queues – one for each host?

A. Yes

Q. So with your command and data description, does NVMe over Fabric require RDMA or does it have a “Data Ready” type message to tell the host when to send write data?

A. Data transfer operations are fabric dependent. RDMA uses RDMA_READ, another transport may use some form of Data Ready model.

Q. Can you quantify the protocol translation overhead? In reality, that does not look like that big from performance perspective.

A. Submission Queue entries are 64 bytes and Completion Queue entries are 16 bytes. These are sufficiently small for block storage traffic, which typically consists of 4K+ size requests.
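
To make those sizes concrete, here is a hedged C sketch of the entry layouts, with field names paraphrased from the NVMe specification; real code should use the canonical definitions in its NVMe driver headers.

```c
/* Sketch of NVMe queue entry sizes (field names paraphrased from the
 * NVMe specification; real code should use its driver's headers). */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

struct nvme_sqe {                 /* Submission Queue Entry */
    uint32_t cdw0;                /* opcode, flags, command identifier */
    uint32_t nsid;                /* namespace ID */
    uint32_t cdw2, cdw3;
    uint64_t metadata_ptr;
    uint64_t dptr[2];             /* PRP1/PRP2 or an SGL descriptor */
    uint32_t cdw10_15[6];         /* command-specific dwords */
};

struct nvme_cqe {                 /* Completion Queue Entry */
    uint32_t result;
    uint32_t reserved;
    uint16_t sq_head;
    uint16_t sq_id;
    uint16_t command_id;
    uint16_t status;
};

int main(void)
{
    /* 64-byte submissions and 16-byte completions, as noted above. */
    assert(sizeof(struct nvme_sqe) == 64);
    assert(sizeof(struct nvme_cqe) == 16);
    printf("SQE %zu bytes, CQE %zu bytes\n",
           sizeof(struct nvme_sqe), sizeof(struct nvme_cqe));
    return 0;
}
```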

Q. Do Dual Port SSDs need to support two Admin Qs since they have two paths to the same host?

A. Dual-Port or multi-path capable NVM subsystems require using two NVMe controllers each with one AdminQ and one or more IO queues. 

Q. For a Dual Port SSD, does each port need to have its Submission Q on a different CPU core in the host? I assume the SQs for the two ports cannot be on the same CPU core.

A. The mapping of controller queues to host CPU cores is typically per controller. If the host was connected to two controllers, there would be two queues per core. One queue to controller 1 and one queue to controller 2 per host core.

Q. As you mentioned, currently there is LBA addressing in the standard. What will happen when Intel goes to market with new media (3D XPoint), which is announced to be byte addressable?

A. The NVMe NVM command set is block based and is independent of the type and access method of the NVM media used in a subsystem implementation.  

Q. Is there a real benefit of this architecture in a NAS environment?

A. There is a natural advantage to making any storage access more efficient. A network-attached system still requires block access at the lower levels, and NVMe (either local or over a Fabric) can improve NAS design and flexibility immensely. This is particularly true for pNFS and scale-out SMB paradigms.

Q. How do you handle authentication across many servers (hosts) on the fabric? How do you decide what host can access what part of each device? Does it have to be namespace specific?

A. The fabrics specification defines an Authentication model and also defines the naming format for NVM subsystems and hosts. A target implementation can choose to provision NVM subsystems to specific hosts based on the naming format.

Q. Does having the same structure at all layers mean that, at the transport layer of a flash appliance, we should also maintain the Submission and Completion Queue model, with these queues mapped to the physical queues of the NVMe sub-controller?

A. The NVMe Submission Queue and Completion Queue entries are common between fabrics and PCIe NVMe. This simplifies the steps required to bridge between NVMe fabrics and NVMe PCIe. An implementation may choose to map the fabrics SQ directly to a PCIe NVMe SSD SQ to provide a very efficient, simple NVMe transport bridge.

Q. With an RDMA based transport, how will each host discover the NVME controller(s) that it has been granted access to?

A. Please see the answer above.

Q. Traditionally SAS supports SAS expander for scaling purpose. How does NVMe over fabric solve the issue as there is no expander concept in NVMe world?

A. Recall that SAS expanders compensate for SCSI’s inherent lack of scalability. NVMe perpetuates the multi-queue model (which does not exist for SCSI) natively, so SAS expander-like pieces are not required for scale-out.

Storage Performance Benchmarking Q&A – Take 2

Our recent Ethernet Storage Forum Webcast, “Storage Performance Benchmarking: Part 2,” has already been viewed by more than 500 people. If you haven’t seen it yet, it’s now available on-demand. Our expert presenters, Ken Cantrell and Mark Rogov, did a great job fielding questions during the live event, but of course there wasn’t time to get to them all. So, as promised, here are their answers to all of them. If you have additional questions or thoughts, please comment on this blog and we’ll get back to you as soon as we can.

Q: “As an example, am I right to presume workloads are generated by VMs”

A: Ken: It is probably a good idea at this point to define a workload, since we continue to use the term. At a very high level, think of a workload as the mix of operations issued by an application related to the accessing of data. In our case, data stored or made available by a storage solution. With that in mind, absolutely, workloads can be generated by VMs. But they don’t have to be. In other words, “it depends.”

For example, consider these 3 cases:

1) If your SUT (solution under test) was just a simple laptop with no hypervisor and a traditional OS, then there would be no VM in the mix. Your workload would be generated by the application you were measuring (whether that was a simple file copy or something complex like a local database installation).

2) Your SUT is composed of a physical client (like the laptop above) attached to a machine with a hypervisor installed on it and a local guest OS installation that is capable of exporting NFS or SMB shares. The laptop sends I/O via Ethernet to the guest OS. In this example, there is a VM, but it is acting as the storage system, not the workload generator.

3) Now reverse the I/O of example 2. Have the laptop export an SMB share and have the guest OS issue I/Os to that share. Now you finally have VMs generating workloads.

A: Mark: If one examines the solution under test (SUT), and considers the general data flow, then the workload is generated by the clients/hosts layer. Yes, we indicated that the clients/hosts can be VMs, but they also could be physical systems, and, in the case of a SUT consisting of just one laptop, the workload is generated by the application.

Q: Are you going to get around to file performance benchmarking? This infrastructure stuff is not new to me. I have done block all my life, I am interested in stuff about file.

A: Ken: That’s the plan. We are still working out the exact sequence, timing and content for future presentations, but had a dedicated section on both block and file on the original roadmap. If you have specific topics within “file” that you’d like covered, respond in the comments. No promises to cover them, but knowing the desires of the audience is always a good thing.

Keep in mind the intention of the webcast series – lay a strong, but simple foundation, for storage performance fundamentals and then build on that foundation.

A: Mark: The main intent of the series is to lay down basic performance principles first, then build on them to go to more complex topics. Both Ken and I refer to ourselves as “File Heads” and we can’t wait to concentrate on just file, but it would only make sense given that the infrastructure foundations are firm and understood by our audience.

Q: Why doesn’t SPEC SFS have performance testing for such failure models?

A: Ken: Brighttalk provided this comment in isolation, so it isn’t entirely clear which failure models you’re asking about. I’m assuming you mean something like a drive or controller failure. With that assumption, the SFS subcommittee welcomes publications that illustrate a failure condition. SPEC SFS 2014 provides an excellent opportunity for someone to publish once in a non-failure scenario and then again in a failure condition of some sort – as long as that failure condition doesn’t violate any of the run rules regarding stable storage and the failure condition doesn’t generate user visible errors.

Note that SPEC SFS 2014 doesn’t mandate any demonstration of failure scenarios. We’ve discussed this in the past, but it has never been a priority for those that participate in SPEC (which is open to all – see http://www.spec.org for instructions on how to join the SPEC Open Systems Group).

Q: Why is write cache turned off for enterprise drives?

A: Ken: I knew this question was coming. :) This is related to “stable storage” – the guarantee from your storage provider that data they say is safely stored on disk is actually stored on disk. I should clarify that the comment is a little dated and refers to caches designed around volatile memory; this wouldn’t apply to a hybrid SSD/spinning media drive that used the SSD to cache/stage data, since SSDs are non-volatile.

Consider the failure scenario where the enterprise drive has write caching enabled and then experiences a power failure. In most every system sold, the storage controller treats drives pretty much as black boxes – they tell the drive to read or write data at a certain location and expect the drive to do as told. So, when the drive says “yup, I got that data, you’re good!” the storage solution trusts the drive and, when it doesn’t need it in memory any longer, throws it away (that data is safely stored on disk, so this is ok). If the drive chose to cache that information in volatile memory, and loses power, the information is gone.

Midrange and enterprise storage vendors often (I think I can say generally) provide some sort of battery backup in case of power failures. These battery units keep power to at least some of the drives – but remember that drives (especially spinning ones) suck down a lot of power, and often the implementation chooses to keep power only to certain drives that the storage controller uses to flush its own volatile memory structures to.

A quick Internet search shows some specific comments on this topic:

From Seagate (http://knowledge.seagate.com/articles/en_US/FAQ/187751en):

Windows 2000 Professional / Server, Windows XP Home / Professional, Windows Vista and Windows 7 have a nifty little feature called write caching buried within the depths of property tabs. Normally, this type of feature is used with SCSI drives in server applications to provide greater data integrity.

When drives employ write-back cache, any interruption of power to the drive or system may cause lost or corrupted data because the drive does not have time to write the cached data to the disk before the power is lost. However, when write cache is turned off, drive performance slows down.

From Microsoft (https://support.microsoft.com/en-us/kb/259716):

…In addition, enabling disk write caching may increase operating system performance. This article describes how to enable or disable disk write caching…

NOTE: Enabling write caching generates the following warning. This is normal:

By enabling write caching, file system corruption and/or data loss could occur if the machine experiences a power, device or system failure and cannot be shutdown properly.

Q: What’s the difference between CPU and ASIC? When to use which word?

A: Ken: Unfortunately, the SNIA dictionary doesn’t define either term. At the easiest level, both are acronyms. CPU = central processing unit, and ASIC = application specific integrated circuit. At the next level, think of a CPU as general purpose processing element and an ASIC as a custom designed microchip designed for a special application or purpose. Once created, ASICs are non-programmable – they do something very specific (and hopefully very well and very quickly/efficiently). A CPU can run your bitcoin mining program overnight, wake you to Spotify in the morning, let you use your favorite word processor in between games of Plants vs. Zombies, and still let you watch Hulu before you head off to bed.

An ASIP (Application Specific Instruction-Set Processor) bridges the gap between a general purpose processor (CPU) and the highly specific, targeted design of an ASIC. An ASIP will have a much reduced instruction set and a more targeted design towards a specific application (say, digital signal processing), but still allow the execution of a specific instruction set given to it.

Q: Can you mention tools to identify the bottlenecks?

A: Ken: We are trying very hard, particularly in the webcasts themselves, to stay vendor neutral. I don’t mind violating that though here in the Q&A a little bit.

From an open source standpoint, there are a number of tools. One of the more popular approaches now is to use something like Grafana as a front-end to Graphite, and use that to monitor a set of open source (or privately designed) sensors, including sensors from OPM below, that you place throughout your environment.

Here are a few other open-source benchmarking and performance tools, and what aspect of performance to which they apply. Please note that this is not a comprehensive list, nor is this a recommendation for their use. We are providing the link as a convenience, not an endorsement. [http://www.opensourcetesting.org/performance.php]

Still free, but NetApp-centric, is OnCommand Performance Manager (OPM), specifically OPM v2.0 and later. In addition to providing performance metrics for your NetApp storage array, OPM offers up the concept of a “bully” and “victim” scenario – it specifically watches for components that are performing poorly (the “victims”) and helps identify which other components are causing that poor behavior (the “bullies”). My team helps develop OPM.

Not free, and not NetApp centric, but a NetApp product, is OnCommand Insight (OCI). This is a premier product for looking at the performance of the components across your datacenter.

A: Mark: I didn’t want to break the vow of neutrality. :) EMC has a number of tools as well: the vRealize suite, ViPR SRM, Unisphere, plus platform-specific tools… However, it has been my experience that the most important tool a performance expert has is still the critical mind. One observes the problem, and then walks the entire set of the SUT layers looking for incongruences. Too often, the perceived bottleneck is not the problem, but a manifestation of a problem somewhere else. For example, as we pointed out in the “MiB/s section” of this webcast, the network layer was a bottleneck due to badly configured OS multipath drivers. Deciphering cause from symptom requires several things: a good understanding of the SUT and its layers; a critical mind to analyze problem conditions; and a large dose of curiosity. The latter is a personal trait that drives us to question “what if I change this to that?” Asking questions while troubleshooting is, IMHO, a cornerstone requirement and, inherently, a very human trait. My personal view is that tools are just tools, and they require a human hand to operate and a human mind to analyze the results.

Q: Did all-flash arrays almost eliminate the bottleneck… at least the storage controller bottleneck can be eliminated if an enterprise can afford all-flash arrays?

A: Ken: Actually, almost exactly the opposite. Spinning drives are (now, at least) relatively slow. Over the past 10 years the drives have gotten much bigger, although HDD drive speeds haven’t really changed all that much. Because of this, what I’ve observed is that the IOPS/GB ratio for HDD has, if anything, been getting worse*, and the most common bottleneck for an HDD-based customer turns out to be the speed of their drives.

Now consider what happens when a customer moves to SSDs. The SSDs that are sold (and folks can afford) are generally much smaller than the HDDs they are used to, so customers buy as many of them as they can in order to meet their capacity requirements. And the SSDs are, one-for-one, much faster. So what happens? High drive counts + really fast drives = the drives aren’t your bottleneck anymore. Instead, the bottleneck shifts upstream … in a well architected solution, generally to the storage controller or clients.

*For those that know the terms, we could have a long discussion about working set sizes over the years, how fast data ages, tiered storage and such, and the effect that these have on observed iops/GB … but I think we could agree that since HDD speeds aren’t increasing, the iops/GB ratio isn’t generally getting better.

Q: Can I download these slides?

A: Ken: Absolutely. Here are the links to Part 1 of our series:

PPT and PDF: http://www.snia.org/forums/esf/knowledge/webcasts (look for “Storage Performance Benchmarking: Introduction and Fundamentals (July 2015)”)

Presentation Recording: https://www.brighttalk.com/webcast/663/164323

Q&A Blog: http://sniaesfblog.org/?p=447

Here are the links to Part 2:

PPT and PDF: http://www.snia.org/forums/esf/knowledge/webcasts (look for “Storage Performance Benchmarking: Part 2 (October 2015)”)

Presentation Recording: https://www.brighttalk.com/webcast/663/164335

Q&A Blog: That’s what you’re reading now. :)

Q: A storage controller is a compute node, right? And for hyper-converged systems, storage controller and compute nodes are the same, right?

A: Mark: Most certainly, a Storage Controller can be a compute node, but in our webcast it is not. The term “compute node” is typically interpreted as part of the client/hosts layer. A compute node computes for the application, and that application generates the workload (please see the question above about where the workload is initiated).

A good example of a compute node would be a system that renders cartoons, or geodesic fields. Such a compute node computes something (the application does the work) and stores the results onto the storage controller.

However, in the case of hyper-converged infrastructure, the storage controller is often virtualized among the client hosts, making every compute node a part of a larger storage controller.

Q: Are the performance numbers that vendors publish typically front-end?

A: Mark: I don’t want to generalize published numbers as being one way or another. I recommend reading every publication for specific details. Vendors publish numbers to cover use cases, and each use case may come with its own set of expected measurement points and metrics. Ken and I talked about how metrics matter in the first Storage Performance Benchmarking webcast. :)

Q: “We did an R&D PoC using 32 flash 400GB elements attached on DIMM slots (not through SAS controller, not a direct PCIe attach) and seven 40Gbps cards. We were able to pump 5.5M 4KB-IOPS resulting in 30GB/s (240Gbps) of traffic on the front-end connect. When do you expect the front-end connect be the bottleneck for more standard environment?”

A: Ken: Woot! That sounds like a lot of fun. If you’re in the Raleigh, NC area and can talk about that not under NDA, we should have lunch. I’d like to hear more.

I have a suspicion that this answer won’t satisfy you, because it isn’t going to be as empirical as your example. The problem in answering with raw numbers is that there isn’t a standard configuration for a SUT. An enterprise class storage array with a mix of 40GbE and 32Gb FC connections (with traffic over both) will look very different than someone using their old Windows XP box with a single 100Mbit link to share out an SMB share, and both will look different than someone accessing their photos on their favorite cloud provider. So, I’ll answer the question by saying that I expect the front-end connect to be the bottleneck anytime the rest of the components in the SUT are capable of hitting your performance metrics (whether that be in terms of response time, IOPS, or data rates), and the front-end connect isn’t.

THAT said, you’d be astounded how often, even today, that MTU mismatches result in terrible front-end performance (and functionality).

Q: “Example of cache in front end connection?”

A: Ken: I’ll cheat and note that SUTs can be a lot more complicated than we showed. For example, our picture looked like this:

[Benchmarking Image 1]

Consider a SUT then where you have:

[Benchmarking Image 2]

At one level, “the internet” is just a big black box acting as our front-end connect. But if we zoom in on it, perhaps we find that somewhere along the line there’s a caching server. Then we have an easy answer to where you find cache in the front-end connect.

In the much simpler model that you’ll find in many enterprise data environments though, you’ll find that the front-end connect consists of some relatively short length cables and a set of switches – either SAN or NAS switches. And in those environments, you won’t find a lot of cache. You will find memory, but you’ll find it used for buffering more than for caching.

We tried to minimize this in the presentation since there’s not a universally agreed upon distinction between these two terms. I think of a buffer (in this context) primarily as memory set aside to hold data very briefly, after which it is consumed and removed from the buffer. I think of cache, on the other hand, as a storage medium that holds data specifically to speed up data access (storage or retrieval). Data held in a cache can be held for a very long time, and not all data in a cache may ever be consumed/used.

Q: Even SSDs suck at random writes, but they are good for random reads. Is there that much difference?

A: Ken: Yes. The data we pulled for drive speeds was real data. And keep in mind that “sucks” is pretty relative here. Enterprise SSDs still tend to be at least 4x faster than spinning media. And, very importantly, their performance is much more consistent and deterministic since seek time is irrelevant with an SSD. New NVRAM technologies, like 3D XPoint, promise to dramatically improve write performance.

Q: DRAM is volatile though, so replacing HDDs with that wouldn’t really work, right? But if capacity requirements are high, we cannot replace disks with the cache, right?

A: Mark: Cache should never replace capacity. Cache is temporary storage, and by design it must eventually move its data to a permanent storage location. The size of the cache should be matched to the size of the data that the application uses, the so-called “working area.” For example, if an application writes to a 4GB file (think VMware vmdk), then for best performance the entire 4GB should fit into cache. However, capacity requirements for a VMware datastore can be as high as several TB. If the application (ESX server) is running many VMs, perhaps only the performance-critical few need to fit into cache, while all other VMs would use cache for sub-portions of their vmdks.

A: Ken: You’re right, not permanently. As Mark points out, it isn’t to permanently replace the slower storage with cache necessarily, just supplement it enough that the working set fits in.

Q: How can we make the client do less IO? Will it make sense?

A: Mark: A client does less IO by using a larger IO size, for example. A classic use case is read- and write-sizes within the NFS protocol. It is possible to increase the read- and write-size of the NFS protocol from the NFS client mount options side. By default, some Linux environments use 32KB for reads and writes. Reading a 1GB file then takes 32768 32KB IOs. If the read size is increased to 1MB, it takes only 1024 IOs – a 32x reduction!
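
A quick arithmetic check of that example, as a sketch (binary units assumed; actual NFS rsize/wsize limits depend on the client and server):

```c
/* How many read IOs it takes to fetch a 1 GiB file at two NFS read
 * sizes.  Units are binary (1 GiB = 1024 MiB = 1,048,576 KiB). */
#include <stdio.h>

int main(void)
{
    const unsigned long long file_bytes = 1ULL << 30;   /* 1 GiB  */
    const unsigned long long small_io   = 32ULL << 10;  /* 32 KiB */
    const unsigned long long large_io   = 1ULL  << 20;  /* 1 MiB  */

    printf("32 KiB reads: %llu IOs\n", file_bytes / small_io);  /* 32768 */
    printf("1 MiB reads : %llu IOs\n", file_bytes / large_io);  /* 1024  */
    return 0;
}
```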

A: Ken: Other options involve app re-writes (yes, sometimes these ARE possible) and OS upgrades. Perhaps in the “app-rewrite” category, or maybe a new category, I’ve also worked with developers to rewrite their DB queries to be much less disk intensive, for example.

Q: Can you elaborate on some of the client level cache types other than file system or OS?

A: Mark: Other than file system and OS? Hmm… Let’s see: PCI-based cards that provide block-device level caching, e.g., EMC VFCache, SanDisk Fusion-io, NetApp Flash Cache. Native network protocol (CIFS and NFS) caches. Local database caching, e.g., SafePeak, TimesTen, Windows Azure Caching.

Q: Please add more sessions, which go into every detail.

A: Mark: Absolutely! We will! Promise! We’re shooting for Part 3 in Q1 2016.

Under the Hood with NVMe over Fabrics

Non-Volatile Memory Express (NVMe) has piqued the interest of many people in the storage world. Using a robust, efficient, and highly flexible transport protocol for SSDs, Flash, and future Non-Volatile Memory storage devices, the NVM Express group is working on extending these advantages over a networked Fabric.

Our first Webcast on The Performance Impact of NVMe over Fabrics was very well received. If you missed it, check it out on-demand. On December 15th, Dave Minturn, Storage Architect at Intel, will join me for a deeper dive in a live Webcast, “Under the Hood with NVMe over Fabrics.” In this Webcast we’ll explain not only what NVMe over Fabrics is, but also, specifically, how it works. We’ll be exploring:

  • Key terms and concepts
  • Differences between NVMe-based fabrics and SCSI-based fabrics
  • Practical examples of NVMe over Fabrics solutions
  • Important future considerations

Register now and join us as we discuss the next iteration of NVMe. I hope to “see” you on the 15th when Dave and I will be anxious to answer your questions.

Life of a Storage Packet (Walk): Q&A

We got a lot of great questions at our recent Ethernet Storage Forum webcast “The Life of a Storage Packet (Walk).” As promised, we’ve compiled all the questions with fairly detailed answers. We hope this blog helps to clear up any confusion or uncertainties. If you think of additional questions, please comment below and we’ll get back to you as soon as we can. Thanks to everyone who watched the live webcast. If you missed it, it’s now available on-demand.

Q. Does size of a block depend on OS?

A. Yes, some OSes only support 512 byte blocks. Some OSes provide a method to support both 512 byte blocks and 4096 (aka 4K) byte blocks; for example, via a qualifier in their format command. Some devices are built using 4K block sizes, but then emulate 512 byte blocks to the host (aka 512E devices). Many modern OSes automatically detect the block size of each device they discover and do the right thing based on what they discover. You have to check the documentation for your OS to know its capabilities.
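
On Linux, for example, you can ask a block device for its logical and physical sector sizes directly. A minimal sketch (the device path is illustrative, and the ioctl names are Linux-specific):

```c
/* Sketch: query a Linux block device's logical and physical sector
 * sizes.  /dev/sda is illustrative; point it at your own device. */
#include <fcntl.h>
#include <linux/fs.h>     /* BLKSSZGET, BLKPBSZGET */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/sda", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    int logical = 0;
    unsigned int physical = 0;
    ioctl(fd, BLKSSZGET, &logical);    /* what the host addresses, e.g. 512 */
    ioctl(fd, BLKPBSZGET, &physical);  /* what the media uses, e.g. 4096    */

    printf("logical %d bytes, physical %u bytes%s\n", logical, physical,
           (physical > (unsigned)logical) ? " (512E-style emulation)" : "");
    close(fd);
    return 0;
}
```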

Q. CIFS is not a file system.

A. Not in the same way that ext, ntfs, FAT, HFS, UFS, etc. are, no. However, certain functionalities that we discussed in the presentation – that is, the ability to manipulate files with operations such as read, write, create, delete, and rename – are all file system functionalities. The difference being, of course, that the files are not on the local computer and are actually on a remote computer.

For what it’s worth, the term ‘CIFS’ has been deprecated from usage, and SMB is the preferred term for precisely the reasons that it should not be confused with local OS systems like the ones mentioned above.

Q. Is it safe to say that a “block” at the file system level is equal to an IO?

A. Not specifically. The difference in the use of those terms, is that a “block” is a place that data is stored – it has an address and it contains data (512 byte of data or 4K bytes of data). An IO is an operation that requests access to a block. An IO may perform a read operation on a block, or it may perform a write operation on a block.

Remember, the term IOPS is I/O operations per second – so it is not about blocks or bytes or bits – it is operations. If the operations are performed on a 512 byte block, they produce a different number of MB/sec than if they operate on a 4K byte block.

So, let’s take an example. A 1MB/second bandwidth on a 4K block size device is the same speed as 1MB/second bandwidth on a 512 byte block size device (observe, however, that the 4K block size device will have only 1/8 the IOPS that the 512 byte device has – because it will take only 1/8 the operations to transfer the same number of megabytes). However, 1M IOPS on a 4K block size device is much better than 1M IOPS on a 512 byte block size device (because the 4K block size device is moving 8X the amount of data that the 512 byte block size device moves in each operation).
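
Here is that arithmetic worked out as a short sketch (decimal units, matching the loose usage in the text):

```c
/* Bandwidth = IOPS x block size, using the two cases from the text. */
#include <stdio.h>

int main(void)
{
    /* Case 1: equal bandwidth (1 MB/s = 1,000,000 bytes/s). */
    double bw = 1e6;
    printf("1 MB/s needs %.0f IOPS at 512 B, but only %.0f IOPS at 4 KB\n",
           bw / 512, bw / 4096);          /* ~1953 vs ~244: a 1/8 ratio */

    /* Case 2: equal IOPS (1,000,000 operations per second). */
    double iops = 1e6;
    printf("1M IOPS moves %.0f MB/s at 512 B, but %.0f MB/s at 4 KB\n",
           iops * 512 / 1e6, iops * 4096 / 1e6);   /* 512 vs 4096 MB/s */
    return 0;
}
```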

There is an excellent explanation and walk-through of IOPS in our Storage Performance Benchmarking webinar.

Q. Don’t all those Inodes also live on the disk and so don’t the IOs to read those blocks also have to go to the SCSI controller?

A. Well-spotted!

The communication back and forth between the file system and the Inodes traverses the controller for all access to blocks on disk. This is one of the reasons why needing to access disk is considered “expensive.” When you add a network to the mix, these kinds of situations need particularly careful consideration.

Q. Shall we allocate blocks and inodes or is it an automatic process?

A. It’s an automated process, and there is no user intervention at all.

Q. Are Inodes created during OS installation?

A. Inodes are a particular block type, among some other block types (e.g., data blocks, boot blocks, superblocks, group descriptor blocks). These block types are combined into functional groups. These groups are OS-dependent. The block layout therefore, including Inodes, is created during OS installation. If the filesystem needs more Inodes after the OS installation, the OS dynamically adds them to the Inode pool.

Q. On which physical hardware does the volume manager reside?

A. Volume managers are not hardware. They are software layers that create pseudo devices that are presented to layers of the OS above them (typically to the file system layer). The volume manager software fits into the OS to accept requests from the file system and pass those requests down to the device driver. In some systems they might be called “filter drivers”.

Q. In flash media, is there also an iSCSI controller that converts PCIe into iSCSI to interact with the flash?

A. I want to make sure that the answer to the question is clear.

On the host, we need to convert PCIe commands to SCSI, so we send them to an iSCSI controller/adapter to be sent across the Ethernet wire. That adapter can be either software or hardware.

Flash drives are basic media, just like spinning disk drives. Flash will have its own controller, which can be SCSI. If you wish to access the drive over Ethernet using the iSCSI protocol, you will have an adapter on the flash drive (which can be either software or hardware) that will do the SCSI translation between the Flash media and the SCSI commands. This is often called the FTL – or the flash translation layer. Again, the SCSI commands are translated into an Ethernet-friendly packet to be sent along the wire.

There are other types of communication forms for working with Flash, too. The most recent is NVMe. You can see the SNIA webinar on The Performance Impact of NVMe and NVMe over Fabrics for more information.

Q. Why is there a SCSI language in between Storage and the hosts?

A. Before storage standards (like SCSI and ATA), you would purchase storage from Vendor X, and you would also buy a storage controller for that Vendor X storage. If you wanted Vendor Y storage, you could not use the Vendor X controller; you had to purchase a new controller from Vendor Y. Every vendor had their own language, and you had to purchase matching components. Once you got locked into one vendor, you were stuck – at the hardware level.

Today with standards (such as SCSI), you can buy a SCSI device from any vendor and connect it to any storage controller that you buy from any vendor – and it just works.  That is the point of the SCSI standard. When the storage standardization efforts began, there were many competing ideas. It just so happened that SCSI won out, and so now it’s everywhere.

Q. If the storage side is flash, do we still need a SCSI controller between host and flash storage or is the controller different for a flash storage?

A. Yes… and no.

Yes, to use the SCSI part of the OS, the flash device must continue to speak the SCSI protocol and so a SCSI controller is needed. This method of communicating with flash devices enables all the existing S/W on the host OS to just continue working without even knowing it is flash.

No, flash memory chips can be connected to the system using non-SCSI methods. Most of those methods are generally special purpose applications and so the existing OS S/W simply cannot use that device (only the special purpose S/W designed for that device can use it). However, NVMe is a new protocol that is enabling more general use of the flash memory technology with the hopes to provide new capabilities that are beyond what SCSI provides. This is also discussed in our webcast on The Performance Impact of NVMe and NVMe over Fabrics.

Q. What’s the difference between partition, logical disk, volume, LUN, etc?

A. Excellent question!

Let’s work this one backwards – LUN – that is a SCSI term (actually an acronym) that refers to the Logical Unit Number – it is part of the address used to access a logical unit. Small SCSI devices (such as a single spindle disk drive) have only a single logical unit. Large storage arrays may contain 100s of logical units. Each of those logical units appears to the host OS as if it were a single spindle disk. So, the logical unit is the SCSI object that contains the blocks where the data is stored (where that object may be an individual piece of hardware, or a logical entity within a larger SCSI device). To access a SCSI logical unit, the OS must specify the address of the SCSI device, and then the LUN (logical unit number) for the logical unit within that SCSI device.

The other terms (partition, logical disk, and volume) are OS terms that have to do with virtualization and how the storage blocks are managed. When a SCSI (or other storage device) is formatted by the OS, it may be broken into multiple partitions. Each partition is then treated by the OS as if it were a unique device (a virtual device, or a logical disk). Each of those partitions may then be used independently.

For example, on a Unix system, the “a” partition may contain a file system that has all the files that are necessary to boot. The “b” partition may be setup without a file system and used as the swap or paging storage for use by the virtual memory subsystem. The “d” partition may then contain a file system that contains all the user’s data files. Each partition is unique storage space and may even use a different file system to organize the data located there.

Notice, that I skipped the “c” partition. That partition is often setup to access all the blocks of the physical device. So, on a 500GB disk, maybe “a” contains 10GB, “b” contains 90GB, and “d” contains 400GB; while “c” contains all 500GB. Partitions “a” and “d” may be backed up or restored independently, and partition “c” may be used to perform an image copy of the entire device.

Now, to the other terms:

Logical disk and volume are terms often related to volume managers.

  • Logical disk may be a term used to refer to a partition, but usually that is not the case. Logical disks are typically the devices created by the volume management layer when they combine individual devices into a single larger device (a.k.a. a logical disk).
  • Volume managers also may be used to divide up a large device into small chunks (just like partitions), and those smaller chunks are referred to as logical disks.
  • Volume is a more vague term that typically is used as another term for a logical disk. In some circles, the term Volume is used to refer to a RAID set in a SCSI storage controller (but this is a much less often used definition for Volume).

Q. Will the “complete” presentation somewhere we can go review?

A. Yes. Click here to access the on-demand webcast as well as a PDF of the webcast slides.

New Webcast: The Life of a Storage Packet (Walk)

Wonder how storage really works? When we talk about “Storage” in the context of data centers, it can mean different things to different people. Someone who is developing applications will have a very different perspective than someone who is responsible for managing that data on some form of media. Moreover, someone who is responsible for transporting data from one place to another has their own view that is related to, and yet different from, the previous two.

Add in virtualization and layers of abstraction, from file systems to storage protocols, and things can get very confusing very quickly. Pretty soon people don’t even know the right questions to ask! That’s why we’re hosting our next SNIA Ethernet Storage Webcast, “Life of a Storage Packet (Walk).”

Join us on November 19th to learn how applications and workloads get information. Find out what happens when you need more of it, or faster access to it, or move it far away. This Webcast will take a step back and look at “storage” with a “big picture” perspective, looking at the whole piece and attempting to fill in some of the blanks for you. We’ll be talking about:

  • Applications and RAM
  • Servers and Disks
  • Networks and Storage Types
  • Storage and Distances
  • Tools of the Trade/Offs

The goal of the Webcast is not to make specific recommendations, but equip you with information that will help you ask the relevant questions, as well as get a keener insight to the consequences of storage choices. As always, this event is live, so please bring your questions, we’ll answer as many as we can on the spot. I encourage you to register today. Hope to see you on November 19th!

Storage Performance Benchmarking – The Sequel

We at the Ethernet Storage Forum heard you loud and clear. You need more info on storage performance benchmarking. Our first Webcast, “Storage Performance Benchmarking: Introduction and Fundamentals,” was tremendously popular – reaching over 2x the average audience, while hundreds more have read our Q&A blog on the same topic. So, back by popular demand, Mark Rogov, Advisory Systems Engineer at EMC, Ken Cantrell, Performance Engineering Manager at NetApp, and I will move past the basics to the second Webcast in this series, “Storage Performance Benchmarking: Part 2.” With a focus on the System Under Test (SUT), we’ll cover:

  • Commonalities and differences between basic Block and File terminology
  • Basic file components and the meaning of data workloads
  • Main characteristics of various workloads and their respective dependencies, assumptions and environmental components
  • The complexity of the technology benchmark interpretations
  • The importance of the System Under Test:
    • What are the elements of a SUT?
    • Why are caches so important to understanding performance of a SUT?
    • Bottlenecks and threads and why they matter

I hope you’ll join us on October 21st at 9:00 a.m. PT. to learn why file performance benchmarking truly is an art. My colleagues and I plan to deliver another informative and interactive hour. Please register today and bring your questions. I hope to see you there.