Persistent Memory Featured at Open Server Summit and in New NVDIMM Webcast

April’s Open Server Summit brought thought leaders together for two days of keynotes, sessions, and a demonstration showcase on converged server-storage-networking infrastructures and open specifications shaping the data center. SNIA board member Rob Peglar of Micron Technology delivered a keynote on new persistent memory directions that create new approaches for system architects.

A Summit highlight was a panel sponsored by SNIA’s Solid State Storage Initiative on Providing Storage at Memory Speed Using NVDIMMs, where panelists reviewed how NVDIMMs operate in new interest areas for persistent memory like databases, Web 2.0, analytics, OLTP, and video and image processing. NVDIMM technologies were also featured on the show floor with demonstrations from SNIA NVDIMM Special Interest Group (SIG) members Diablo Technologies, Netlist, and SMART Modular.  Download the presentation from Open Server Summit here.

SNIA’s NVDIMM SIG followed up the interest at Open Server Summit with a comprehensive webcast answering today’s questions on NVDIMM and Non-Volatile Memory (NVM).  Jeff Chang, Co-Chair of the NVDIMM SIG from AgigA Tech, provided a quick refresh on NVDIMM types.  NVDIMM SIG member Mat Young from Netlist covered NVDIMM Performance Benchmarking.  Doug Voigt, Chair of the SNIA NVM Programming Technical Work Group from Hewlett Packard Enterprise, reviewed NVM Programming Model Updates, and Arthur Sainio, Co-Chair of the NVDIMM SIG from SMART Modular, wrapped up the session with answers to the NVDIMM questions raised at the January 2016 webcast and at Open Server Summit.  The webcast is now available for download on the SNIA BrightTALK channel at https://www.brighttalk.com/webcast/663/197009.

Next up from the SNIA Solid State Storage Initiative will be a keynote and demonstration at the In-Memory Computing Summit, May 23-24 at the Grand Hyatt in San Francisco.  Join us there!

A Topical, Timely, and Tuned Agenda – It’s Time for Data Storage Innovation 2016

A sense of anticipation surrounds the unveiling of SNIA’s Data Storage Innovation (DSI) conference agenda. DSI is unique in its focus as a conference for those who use information technology.  Each year, the agenda expands as hundreds of submissions are received, reviewed, and considered for addition to the schedule.

2016 brings the most extensive topic agenda to date for DSI, set for June 13-15, 2016 and newly located in San Mateo for the convenience of those flying in to SFO, San Francisco and East Bay attendees, as well as those in Silicon Valley.  DSI’s agenda committee, composed of SNIA volunteer leaders from leading IT organizations, is attuned to the changing needs of an IT audience confronting next generation data centers, enterprise computing’s shift to the cloud, and the drive to data analytics for business decisions. The agenda’s 20 themed topics include hyperconvergence, cloud, data protection, big data, security, persistent memory, software defined storage, containers, and distributed storage.  And the 75 sessions within these topics are a mix of industry leadership, SNIA’s vendor neutral tutorial education, and sponsored sessions.

The DSI General Sessions emcee, reprising her well-received 2015 role, will be Camberley Bates, Managing Director and Senior Analyst, the Evaluator Group, who will also co-host a keynote on enterprise storage architectures.  Additional keynotes include “Storage Architectures for Next Generation Cognitive Analytics” presented by Huawei, and an IBM industry perspective on Enterprise Computing and Data Storage.

Also new for 2016 will be a session on results from a survey co-sponsored by SNIA and the Evaluator Group on enterprise deployment of hyperconverged solutions.

We invite you to peruse the DSI Conference agenda online, and make your plans to attend and participate in a conference rich in one-on-one discussions, “hallway networking” opportunities, and up-to-date technical knowledge presented without vendor bias. Register here.  If you want to understand more about how DSI works, a new video available on the DSI conference website walks you through the highlights of the conference, with perspectives from Wayne Adams, DSI Conference agenda chair, on how the sessions provide a roadmap of where the industry is going.  Camberley Bates also offers her perspective on why the conference is a valuable opportunity to communicate with peers and learn firsthand their approaches to solving today’s enterprise issues.

Break Out of the Mold at Interop 2016

From SNIA Event Partner, Interop

Leadership isn’t just for managers and executives – it’s also an essential skill for technologists who want to have a positive impact on their businesses. Interop Las Vegas provides a wide range of programs, including its IT Leadership Summit, workshops, and extensive conference track, that can help IT professionals find their voices and advocate for innovation. Check out the following videos on Network Computing:

In this video, Rob Cordova, Co-Founder and CEO at Creativity, provides a sneak peek into what attendees will learn at his workshops: Fueling Innovation and Leading Innovation.

In this video, Dan Roberts, leader of Interop’s IT Leadership Summit and CEO and President of Ouellette & Associates Consulting, discusses how to be successful in a world of increased complexity, rising expectations, and accelerating change. Find out how and why you can move away from the tired idea of “aligning business with IT” toward a place where IT is integral to your business and driving new initiatives forward.

Learn more about the IT Leadership Track and register for Interop, May 2-6 in Las Vegas. Use discount code SNIA20 to claim your Free Expo Pass or save 20% off Conference Passes.

A Q&A on Storage Performance Benchmarking: Block Components

For the third time, our storage performance benchmarking experts, Ken Cantrell and Mark Rogov, have generated an abundance of interest (in the form of questions) on block storage performance. If you missed the Webcast, “Storage Performance Benchmarking: Block Components,” it’s available on demand. It was no small effort to answer all the great questions that we received. And for those of you who have been waiting, we apologize, but we think the detailed and thoughtful answers Mark and Ken have put together are well worth the wait.

Q1: Are these numbers applicable to the 90th percentile for any given storage array, please?

Mark: These are HDD/SSD performance numbers. They aren’t meant to represent any particular storage array vendor’s performance. See the end of our presentation (bottleneck analysis) as to why it is really, really hard to answer your question.

Q2: How about NVDIMM-F or NVDIMM-P or NVDIMM-X claiming 3-4M IOPS type of Enterprise storage devices?

Ken: Yup. They’re fast.

There’s a great presentation by Jim Handy titled “Understanding the Intel/Micron 3D XPoint Memory” presented at SDC2015 that I’d recommend you take a look at to understand more about this kind of memory and its possible positioning.

Mark: Great question. I think the conclusion of our presentation answers it. Flash (and we use flash as a collective term, defining everything that is not spinning storage to be “flash”) is drastically faster than spinning drives. But even within flash, there are plenty of new technologies which compete with each other and improve the overall performance landscape. So, within the scope of our presentation, even a simple good old SLC drive tops the capability of a SAS line. If we improve on one drive, by switching the technology to a faster/newer/better variant (e.g., NVDIMM-F), or by stacking the drives, the resulting set is much more likely to expose the limitations of the “regular” storage array.

Q3: I’d like to know which tool you are using to measure IOPS if possible. 

Ken: The SNIA Solid State Storage Initiative (SSSI) has developed substantial expertise in the area of SSD performance and behavior. The SSS Performance Test Specifications were developed by the SNIA SSS Technical Work Group (TWG) and define how to measure SSD performance in a manner that is accurate, repeatable and enables comparison between different manufacturers’ products. Learn more about the SSD Performance Project here.

All of the Flash and HDD numbers at the beginning of the presentation were taken directly from the Solid State Storage Performance Test Specification summary results (SSS PTS). The SSS PTS provides a comprehensive method for measuring flash performance in the most vendor neutral approach that I’ve seen.

The Flash and HDD numbers at the end of the presentation were 80% of the starting numbers – scaled down to make them slightly more like what we’ve seen in a greater number of environments (that aren’t pushing their drives as hard).
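
If you want to generate rough numbers of your own outside of a formal PTS run, a general-purpose load generator such as fio is a common choice. Here’s a minimal sketch of a 4KiB random-write test; the device path and parameters are illustrative only, and this is not a PTS-compliant methodology:

    # CAUTION: writing to a raw device destroys its contents
    fio --name=randwrite-test \
        --filename=/dev/nvme0n1 \
        --ioengine=libaio --direct=1 \
        --rw=randwrite --bs=4k --iodepth=32 \
        --runtime=60 --time_based \
        --group_reporting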

Q4: Throughputs with SSD is not as much as one can get from a spinning drive when one keeps cost/GB on the axis. Comments please.

Ken: Now we have 3 axes? I’m not even sure how to visualize what you’re asking, but I’m pretty sure I understand the intent … and this is a harder question than it would appear on the surface. Why?

  • First off, prices aren’t my thing – I tend to focus on the internals and let the sales guys talk prices. Additionally, vendors often engage in significant discounting or bundling that makes it difficult for the average person (i.e., me) to understand true costs.
  • The astounding random I/O performance of flash enables support for compression and deduplication without dramatically increasing client-perceived latency. There’s a reason you see so many vendors offering inline deduplication and inline compression now when they did not even five years ago – flash is the enabler that makes this happen. So what is the true comparison? Raw HDD vs Raw flash? Or Raw HDD vs flash plus the storage efficiency (SE) savings it enables? If flash with SE features (dedupe and compression), then what is the savings that you can/should expect for your dataset? 1.5x? 5x? 50x? Knowing this is a prerequisite to answering the question, and the answer will be dependent both on the vendor’s features and your own data set characteristics.
  • As we discussed in the first session, if your application/user base have some sort of minimum performance expectations, particularly around latency, then HDDs may simply not be able to provide you the performance you need. You DID mention throughput (IOPS?) explicitly and with IOPS, OPS, or data rates (MB/s), you can always match flash data rates with HDDs – it just might take a LOT more HDD drives than flash devices. Latency/response time is different though – depending on whether you are drive bound and what your I/O characteristics look like (read vs write, random vs sequential), you may simply be unable to ever hit your latency targets with HDD.
  • The world, it is a-changing. Six years ago it was easy to say “SSD for performance sensitive niche applications!” and smile. Today, prices continue to drop, vendors are making new decisions around the use of consumer grade vs enterprise grade flash, and overall flash/SSD is moving much more mainstream. And … consider the new 16TB (yes 16 TERABYTE) SSD drives announced by Samsung. My personal view (and I’m explicitly disclaiming that I’m speaking on my behalf, and not NetApp’s – which honestly, you should assume for all my answers) is that these are going to change the landscape almost as dramatically as SSD itself has.
  • There are definitely vendors that believe in the cost benefits of HDD. We chose not to mention specific vendors in the webcast, but consider BackBlaze. In their blog, they are extremely open about how they have configured their data center – and they are an (all?) HDD shop. In fact, “by the end of 2015, the Backblaze datacenter had 56,224 spinning hard drives containing customer data.” Speaking of Backblaze, you might be interested in their assessment of the 16TB drive, for their shop.

You might also be interested in slide 21 of the following, which includes some price/performance numbers from EMC and Oracle.

Q5: Does NVMe drive technology move things to a higher level?

Ken: If you truly mean NAND-based flash accessed via NVMe instead of SAS/SATA, yes. Look at the perf results linked out of question 3. If you mean the use of next-generation non-volatile memory (NVM) instead of NAND-based flash, then also yes. The following chart is contained in a lot of SNIA presentations; it does a good job of pointing out just how much faster we can get.

I also strongly recommend a look through of Advances in Non-Volatile Storage Technologies by Tom Coughlin from Coughlin Associates. If you care about these topics, the SNIA Storage Developer Conference is a great opportunity to learn more.


Performance Benchmarking 3 Graphic

Q6: Why NAND gates and not AND gates?

Mark: NAND and NOR gates are known as “universal gates”: they can be combined in various groups and combinations to perform any basic operation, i.e., AND, NOT, OR, etc. So, flash manufacturers had to choose between NAND and NOR. And just like with any technology, price drove the choice. NAND gates are simply cheaper and slower. NORs are faster and more expensive. Actually, there are some NOR products on the market.

Q7: Mark accidentally said 15K was 15,000/sec when it’s 15,000/minute.

Ken: Thanks! (Shame on you Mark!)

Mark: Thank you… I can’t believe that I misspoke! I never do! Never! Ahh!!!

Mark’s Lawyer: On behalf of my client, I move to remove this question and the digital recording from Exhibit A to Exhibit B (aka “never again section”)
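
For the record, the difference matters. A quick back-of-the-envelope sketch (Python) of what 15,000 revolutions per minute actually implies:

    # Average rotational latency of a 15K RPM drive: one revolution
    # takes 60/15000 seconds, and on average the target sector is
    # half a revolution away.
    rpm = 15_000
    ms_per_rev = 60 / rpm * 1000          # 4.0 ms per revolution
    avg_rotational_latency = ms_per_rev / 2
    print(ms_per_rev, avg_rotational_latency)   # 4.0 2.0 (ms)
    # Misreading 15K as 15,000 revolutions per *second* would make
    # the drive look 60x faster than it really is.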

Q8: Do you guys have any data about how expensive an erase-modify-write operation is, compared with spinning disks in terms of performance?

Ken: This is what we were attempting to demonstrate in the first set of slides. The PTS (see question 3) forces flash devices into a steady state mode where they are continuously doing program-erase cycles. So the results shown there demonstrate the difference between HDD writes (seek, spin, write) and flash writes (erase and program).

Your question made me wonder though … so I also did a quick literature search. Interesting to see how rates have changed over time, and how they vary by device:

From M-Systems, in 2002: Erase cycle was 3ms

From Micron, in 2006: The erase time for a 128KB erase block was 500 µs

From AnandTech, in 2012: Erase time for SLC was 1.5-2ms, MLC was 3ms and TLC was ~4.5ms (huh? SLC vs MLC vs TLC?)
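
To put those erase times in context, here is a rough service-time comparison (Python). The flash numbers come from the sources above; the HDD seek and rotation figures are assumed, illustrative values for a 15K RPM drive, not measurements from the webcast:

    # Illustrative per-operation service times, in milliseconds.
    flash_erase_ms = {"SLC": 1.75, "MLC": 3.0, "TLC": 4.5}  # AnandTech, 2012
    hdd_seek_ms = 3.5        # assumed average seek (15K RPM class)
    hdd_rotate_ms = 2.0      # average rotational latency at 15K RPM
    hdd_service_ms = hdd_seek_ms + hdd_rotate_ms

    for cell, erase_ms in flash_erase_ms.items():
        print(f"{cell}: erase {erase_ms} ms vs HDD seek+rotate {hdd_service_ms} ms")
    # Caveat: an HDD pays seek+rotate on every random write, while flash
    # amortizes one block erase across many page programs, so sustained
    # flash writes come out far ahead of this naive comparison.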

Q9: Why can’t the pointer be at the page level instead of a block level (say, metadata within a block)? I’m sure that there is a reason. What do we gain by treating an entire block as a monolithic unit?

Mark: This is an excellent question to ask Google. I think the reason for selecting NAND gate technology, for bundling a bunch of NAND gates into groups, and for creating blocks (in essence, super groups) is power. It takes less power to operate drives with NAND gates and blocks.

Q10: I heard someone mention NOR gates instead of NAND. Are NOR gates persistent over a power cycle?

Ken: Yes.

Mark: There are plenty of other logic gates; see this article on Wikipedia for more information.

Q11: So, there is no advantage in keeping IO sequentially in an SSD?

Ken: Technically, or practically? Technically speaking, I think it does matter. Micron documented this in 2006, noting that “Random access time on NOR Flash is specified at 0.075μs; on NAND Flash, random access time for the first byte only is significantly slower—25μs (see Table 2 on page 5). However, after initial access has been made, the remaining 2111 bytes are shifted out of NAND at a mere 0.025μs per byte.” The raw numbers have changed over the years, but I don’t believe the principle has. Violin Memory stated in 2013 that, “The idea of sequential I/O doesn’t exist with flash memory, because there is no physical concept of blocks being adjacent or contiguous. Logically, two blocks may have consecutive block addresses, but this has no bearing on where the actual information is electronically stored. You might therefore say that all flash I/O is random, but in truth the principles of random I/O versus sequential I/O are disk concepts so they don’t really apply.”

Practically speaking, I agree. Sequential vs random I/O is irrelevant for flash. Given (a) average I/O sizes for workloads and (b) the incredible performance of flash devices compared to the needs of the vast majority of people using them, it doesn’t much matter if you can access subsequent bytes in a NAND-based flash device faster than you can access the first bytes. They are plenty fast enough.

Note that it is hard to find public info on this. Sequential I/O tends to use larger I/O sizes, and random I/O uses smaller I/O sizes. So finding apples-to-apples comparisons between sequential and random I/O is difficult.

Mark: Yes, the flash drive doesn’t care anymore. But the hosts and applications still do. Where it matters is in the workloads. Ken and I are still planning to dedicate an entire hour to talking about workloads, and Random vs. Sequential will surely be a large part of it. However, we will admit that in the future, when all storage is flash (which is, of course, a pipe dream), it won’t matter anymore.

Q12: What is the acceptance level of Erasure Coding, and how will storage performance testing change as a result?

Mark: As we said during the webcast, RAID is a special case of Erasure Coding. Therefore its acceptance rate is 100%. :-) But on a more serious note, Erasure Coding is necessary for any scale-out system, and every vendor uses their own N+M rules.

Q13: Is RAID-1 always half the write performance? If the writes go to both drives simultaneously, I could see write performance being less than 100% of what one drive can do, but not half.

Ken: This was asked in a dry run as well. You’ve hit on something that seems to be a sticking point for multiple people. Perhaps consider it this way. It looks mathy and complicated, but bear with me …

Consider two physical drives. Call them P1 and P2.

Let the write performance (in iops) of P1 be P1w.

Let the write performance (in iops) of P2 be P2w.

How fast can P1 write? P1w.

How fast can P2 write? P2w.

If you can write to both P1 and P2 at the same time, independently, and completely in parallel, how fast can you write in aggregate? P1w + P2w.

For the previous question, what if P1w = P2w?

Then P1w + P2w = P1w + P1w = (2)*P1w.

Now …

Consider a RAID-1 pair comprised of the same P1 and P2. Call it R1.

Writes can be sent (in a good implementation) to both P1 and P2 at the same time.

But, before a write is considered complete, it must be acknowledged by BOTH P1 and P2.

If P1w > P2w, what is the best performance of R1? P2w. P2 is slower, so we’ll always be waiting on it (assuming performance is consistent), so the best we can do is P2w.

Same logic if P1w < P2w.

What if P1w = P2w? What is the best performance of R1? Same logic … but since they are the same speed, it is simply P1w.

So …

In the non-RAID-1 case, our performance (assuming P1w = P2w) was 2 * P1w.

In the RAID-1 case, our performance (assuming P1w = P2w) is P1w.

50% reduction.

RAID-1 only achieves ½ of what the physical pair could. 

Mark: What Ken said.
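
Ken’s arithmetic is easy to sanity-check in code. A minimal sketch (Python) of the independent-pair vs. RAID-1 write math above:

    def independent_writes(p1w, p2w):
        # fully parallel, independent writes: throughputs add
        return p1w + p2w

    def raid1_writes(p1w, p2w):
        # a write completes only when BOTH drives acknowledge it,
        # so the pair runs at the speed of the slower drive
        return min(p1w, p2w)

    p1w = p2w = 200  # e.g., two identical drives at 200 write IOPS each
    print(independent_writes(p1w, p2w))  # 400
    print(raid1_writes(p1w, p2w))        # 200 -> 50% of the raw pair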

Q14: Is there any kind of “asynch” RAID1 so that I can keep the performance of the disks but keep the mirroring?

Ken: See the previous answer also.

For reads, certainly. For writes, not that I know of, although you can make it much less visible. For example, if you have a caching RAID controller/system, your writes will go to memory and then go to disk whenever the controller/system decides to flush them. Perhaps the cache is big enough that it turns random I/O into sequential I/O (and you’re on HDDs), and the performance improvement from doing sequential instead of random I/O is enough that you don’t notice the effect of RAID itself.

Mark: I think that in reality, the behavior of a particular implementation is always vendor-dependent. Generally speaking, RAID1 does allow reading from both drives, but budgets or software bugs or just plain ignorance could result in an implementation where that is not true. Consult vendor documentation to know for sure.

Q15: Why do you need to read old parity to recalculate and write a new one? Isn’t the parity only calculated based on the data being written?

Ken: See answer to question #14.

Mark: It is a math trick… reading the parity saves reading the rest of the blocks on the full stripe. With 3 drives the savings are non-obvious, but with 5 or 14 they are significant.
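
The “math trick” is that parity is a bytewise XOR, so a partial-stripe write needs only the old data and the old parity: new parity = old parity XOR old data XOR new data. A toy sketch (Python):

    # RAID-5 read-modify-write parity update for a partial-stripe write.
    def update_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
        return bytes(p ^ od ^ nd
                     for p, od, nd in zip(old_parity, old_data, new_data))

    # Five-drive stripe: four data blocks plus parity. Updating one block
    # needs just 2 reads (old data, old parity), not the 3 untouched blocks.
    data = [bytes([i] * 4) for i in range(4)]
    parity = bytes(a ^ b ^ c ^ d for a, b, c, d in zip(*data))
    new_block = bytes([9] * 4)
    parity = update_parity(parity, data[0], new_block)
    data[0] = new_block
    assert parity == bytes(a ^ b ^ c ^ d for a, b, c, d in zip(*data))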

Q16: This calculation is correct for 3 disks, right? If there are more disks and a partial write hits a stripe on a single drive, then you need to read more to calculate parity.

Ken: No. There are some great write-ups about how RAID-5 works. Instead of pasting those here, I strongly encourage you to visit http://rickardnobel.se/how-raid5-works/ AND http://rickardnobel.se/raid-5-write-penalty/ and then tweet Mark (@markrogov) or Ken (@kencantrelljr) with questions/follow-up.

(I have no connection to Rickard … I just think he’s done a great job in his write-up.)

Mark: Yes, Rickard’s write-up is spot on. Our goal is to introduce a fairly complex subject in a deceptively simple manner. There are many edge cases that we don’t address: a partial write to a sector, a partial write to a block, a partial write to a stripe… all of those have their own consequences, and storage vendors deal with them differently.

Q17: I am also interested in Data Recovery on NAND technology

Ken: Me too. It isn’t a topic we’re planning to cover though.

Q18: Does caching write data help when one uses SSD?

Ken: It can. Memory is still faster than flash. It depends entirely on how the memory is used. For example, with writes, if memory were used as a write-through cache (look it up if you need), it wouldn’t make things faster. If it were used as a write-back cache, it would. If it is used as a read cache, it will almost certainly make reads of data faster. But even there, life is never simple. Why? Because if you’re using memory to cache data, you’re not using it for something else … and it is possible that the memory could be better used for caching metadata, for example.

Mark: Here, I’d like to recall our good friend, Dr. J Metz, who created an excellent presentation comparing computer caches to pizza delivery in “Life of a Storage Packet (Walk).” In his example, caching will keep the pizza warmer, even if a flash drive is used.
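
A toy model of Ken’s write-through vs. write-back point (Python; the latencies are made-up illustrative numbers, not measurements):

    MEM_LATENCY_US = 0.1      # DRAM write (assumed)
    FLASH_LATENCY_US = 100.0  # flash program (assumed)

    def write_through_latency():
        # data must reach cache AND flash before the ack,
        # so the client still sees roughly flash speed
        return MEM_LATENCY_US + FLASH_LATENCY_US

    def write_back_latency():
        # ack once the data is safely in (protected) memory;
        # the flash write is deferred off the latency path
        return MEM_LATENCY_US

    print(write_through_latency(), write_back_latency())  # 100.1 0.1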

Q19: If the customer is interested in throughput in MB/s then they probably won’t do IOs with 4KB size…

Ken: Agreed. I’m fairly certain that you’re referring to adding MB/s numbers on slide 41. We had a discussion about doing that when putting the slides together. The transition between slide 40 and 42 changed the I/O size from 4KiB to 128KiB, changed from writes to reads, and changed from random I/O to sequential I/O. Adding the MB/s numbers to slide 40/41 was meant to ease the transition between slide 40 and 42. You’re absolutely right though … rarely does anyone want to talk data rates (MB/s) when using small I/O sizes.

Mark: Agreed. Although a true performance guru would recognize that these are the two sides of the same coin.
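
The “two sides of the same coin” point is just arithmetic: data rate = IOPS × I/O size. A quick sketch (Python, with illustrative IOPS figures):

    def mb_per_s(iops, io_size_kib):
        return iops * io_size_kib * 1024 / 1_000_000

    print(mb_per_s(100_000, 4))   # 4KiB random:        409.6 MB/s
    print(mb_per_s(4_000, 128))   # 128KiB sequential: ~524.3 MB/s
    # The same device can look "IOPS-bound" or "bandwidth-bound"
    # depending purely on the I/O size you drive it with.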

Explore Interop’s Storage Track

From SNIA Event Partner, Interop

The skyrocketing increase in data, and the desire to better understand and use that data, has focused new attention on storage technologies with higher performance and capacity. Flash, cloud storage, software-defined storage, and converged and hyper-converged infrastructure offer new opportunities for businesses but also change the way organizations must plan for and manage storage. And businesses still must ensure they provide the critical services of backup, disaster recovery, and data protection in the most reliable way for the least cost possible. Interop’s Storage track presents independent experts who will help IT organizations understand the new technologies available and evaluate how and why they might fit into their enterprise storage strategy.

Want to learn more about balancing real-world storage needs? Watch this video from Network Computing with Greg Schulz, chair of Interop’s Storage Track and Founder and Sr. Analyst of independent IT advisory consultancy firm Server and StorageIO. Greg talks about the perennial challenges of enterprise storage, how software abstraction fits in, and how to manage legacy technology while unlocking the potential of new developments. Learn more about the Storage Track and register for Interop, May 2-6 in Las Vegas. Use discount code SNIA20 to claim your Free Expo Pass or save 20% off Conference Passes.

SNIA’s Persistent Memory Education To Be Featured at Open Server Summit 2016

If you are in Silicon Valley or the Bay Area this week, SNIA welcomes you to join them and the Solid State Storage Initiative April 13-14 at the Santa Clara Convention Center for Open Server Summit 2016, the industry’s premier event that focuses on the design of next-generation servers with topics on data center efficiency, SSDs, core OS, cloud server design, the future of open server and open storage, and other efforts toward combining industry-standard hardware with open-source software.

The SNIA NVDIMM Special Interest Group is featured at OSS 2016, and will host a panel Thursday April 14 on NVDIMM technology, moderated by Bill Gervasi of JEDEC and featuring SIG members Diablo Technologies, Netlist, and SMART Modular. The panel will highlight the latest activities in the three “flavors” of NVDIMM, and offer a perspective on the future of persistent memory in systems. Also, SNIA board member Rob Peglar of Micron Technology will deliver a keynote on April 14, discussing how new persistent memory directions create new approaches for system architects and enable entirely new applications involving enormous data sets and real-time analysis.

SSSI will also be in booth 403 featuring demonstrations by the NVDIMM SIG, discussions on SSD data recovery and erase, and updates on solid state storage performance testing.  SNIA members and colleagues can register for $100 off using the code SNIA at http://www.openserversummit.com.

Questions Aplenty on NVMe over Fabrics

Our live SNIA-ESF Webcast, “Under the Hood with NVMe over Fabrics,” generated more questions than we anticipated, proving to us that this topic is worthy of future discussions. Here are answers to both the questions we took during the live event as well as those we didn’t have time for.

Q. So fabric is an alternative to PCIe, for those of us familiar with PCIe-attached NVMe devices, yes?

A. Yes, fabric is the term used in the specification that represents a variety of physical interconnects and transports for NVM Express.

Q. How are the namespaces shared in a fabric?

A. Namespaces are NVM subsystem resources and are accessible by all controllers in the NVM subsystem. Multi-host access may be coordinated using reservations.

Q. If there are multiple subsystems accessing the same NVMe devices over the fabric, then how is the namespace shared?

A. The mapping of fabric NVM subsystem resources (namespaces and controllers) to PCIe NVMe device subsystems is implementation specific. They may be mapped 1 to 1 or N to 1, depending on the functionality of the NVMe bridge.

Q. Are namespace reservations similar to SCSI reservations?

A. Yes.

Q. Are there plans for defining bindings for Intel Omni Path fabric?

A. Intel Omni-Path is a good candidate fabric for NVMe over Fabrics.

Q. Is hybrid attachment allowed? Could a single namespace be attached to a fabric and PCIe (through two controllers) concurrently?

A. At this moment, such a hybrid configuration is not permitted within the specification.

Q. Is an NVM sub-system purpose-built or commodity server hardware?

A. This is a difficult question to answer. At the time of this writing there are not enough “off-the-shelf” commodity components to be able to construct NVMe over Fabric subsystems.

Q. Does NVMeoF use the same NVMe PCIe controller register map?

A. A subset of the NVMe controller register mapping was retained for fabrics but renamed to “Properties” to avoid confusion.

Q. So does NVMe over Fabric act like an extension of the PCIe bus? Meaning that I see the same MMIO registers and queues remotely? Or is it a completely different protocol that is solely message based? Will current NVMe host drivers work on the fabric or does it really require a different driver stack?

A. Fabrics is not an extension of PCIe; it’s an extension of NVMe. It uses the same NVMe Submission and Completion Queue model and Descriptors as PCIe NVMe. Most of the original NVMe host driver stack is retained and shared between PCIe and Fabrics; the bottom side was modified to allow for multiple transports.
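
A highly simplified sketch (Python) of the shared submission/completion queue model described above. Real NVMe uses fixed-size binary entries, doorbells, and phase bits, but the producer/consumer shape is the same regardless of transport:

    from collections import deque

    class QueuePair:
        # Toy NVMe-style queue pair: the host submits commands on the
        # SQ, the controller posts results on the CQ. The transport
        # (PCIe or fabric) changes how entries move, not the model.
        def __init__(self, depth=64):
            self.sq = deque(maxlen=depth)  # submission queue
            self.cq = deque(maxlen=depth)  # completion queue

        def submit(self, command):         # host side
            self.sq.append(command)

        def process_one(self):             # controller side
            cmd = self.sq.popleft()
            self.cq.append({"cid": cmd["cid"], "status": "SUCCESS"})

        def reap(self):                    # host consumes a completion
            return self.cq.popleft()

    qp = QueuePair()
    qp.submit({"cid": 1, "opcode": "READ", "lba": 0, "nlb": 8})
    qp.process_one()
    print(qp.reap())   # {'cid': 1, 'status': 'SUCCESS'}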

Q. Does NVMe over Fabrics support immediate data for writes, or must write data always be fetched by the NVMe controller?

A. Yes, immediate data is termed “in-capsule” and is used to send the NVMe command data with the NVMe submission entry.

Q. As far as I know, Linux introduced a multi-queue model at the block layer recently. Is it the same thing you are mentioning?

A. No, but NVMe uses the Linux Block-MQ layer. NVMe Multi-Queue is used between the host and the NVMe controller for both PCIe and fabric based controllers.

Q. Are there situations where you might want to have more than one queue pair per CPU? What are they?

A. Queue-Pairs are matched up by CPU cores, not CPUs, which allows the creation of multiple namespace entities per CPU. This, in turn, is very useful for virtualization and application separation.

Q. What are three mandatory commands? Do they refer to read/write/sync cache?

A. Actually, there are 13 required commands. Kevin Marks has a very good presentation from the Flash Memory Summit that provides a list of these commands within the broader NVMe context. You can download it here.  

Q. Please talk about queue depths? Arbitrary? Limited?

A. Queue depths are controller defined, up to a maximum of 64K entries.

Q. Where will SQs and CQs be physically located? Are they on host memory or SSD memory?

A. For fabrics, the SQ is located on the controller side to avoid the inefficiency of having to pull SQEs across a fabric. CQs reside on the host.

Q. How do you create ordering guarantee when that is needed for correctness?

A. For commands that require sequencing, there is a concept called “Fused Commands” which get sent as a single unit.

Q. In NVMeoF how are devices discovered?

A. NVMeoF devices are discoverable via a couple of different means, depending on whether you are using Fibre Channel (which has its own discovery and login process) or an iSCSI-like name server. Mike Shapiro goes over the discovery mechanism in considerable detail in this BrightTALK Webcast.

Q. I guess all new drivers will be required for NVMeoF?

A. Yes, new drivers are being written and will be required for NVMeoF.

Q. Why can’t the doorbell+ communication model apply to PCIe? I mean, why doesn’t PCIe use doorbell+?

A. NVMe 1.2 defines controller resident buffers that can be used for pushing SQ Entries from the host to the controller. Doorbells are still required for PCIe to inform the controller about the new SQ entries.

Q. If there are two hosts connected to the same subsystem, will the NVMe controller have two queues, one for each host?

A. Yes.

Q. So with your command and data description, does NVMe over Fabric require RDMA or does it have a “Data Ready” type message to tell the host when to send write data?

A. Data transfer operations are fabric dependent. RDMA uses RDMA_READ, another transport may use some form of Data Ready model.

Q. Can you quantify the protocol translation overhead? In reality, it does not look that big from a performance perspective.

A. Submission Queue entries are 64 bytes and Completion Queue entries are 16 bytes. These are sufficiently small for block storage traffic, which typically comes in 4K+ size requests.

Q. Do Dual Port SSDs need to support two Admin Qs since they have two paths to the same host?

A. Dual-Port or multi-path capable NVM subsystems require using two NVMe controllers each with one AdminQ and one or more IO queues. 

Q. For a Dual Port SSD, does each port need to have its Submission Q on a different CPU core in the host? I assume the SQs for the two ports cannot be on the same CPU core.

A. The mapping of controller queues to host CPU cores is typically per controller. If the host was connected to two controllers, there would be two queues per core. One queue to controller 1 and one queue to controller 2 per host core.

Q. As you mentioned, currently there is LBA addressing in the standard. What will happen when Intel goes to market with new media (3D XPoint), which is announced to be byte addressable?

A. The NVMe NVM command set is block based and is independent of the type and access method of the NVM media used in a subsystem implementation.  

Q. Is there a real benefit of this architecture in a NAS environment?

A. There is a natural advantage to making any storage access more efficient. A network-attached system still requires block access at the lower levels, and NVMe (either local or over a Fabric) can improve NAS design and flexibility immensely. This is particularly true for pNFS and scale-out SMB paradigms.

Q. How do you handle authentication across many servers (hosts) on the fabric? How do you decide what host can access what part of each device? Does it have to be namespace specific?

A. The fabrics specification defines an Authentication model and also defines the naming format for NVM subsystems and hosts. A target implementation can choose to provision NVM subsystems to specific hosts based on the naming format.

Q. Does having the same structure at all layers mean that, at the transport layer of a flash appliance, we should also maintain the Submission and Completion Queue model, with these mapped to the physical queues of the NVMe sub-controller?

A. The NVMe Submission Queue and Completion Queue entries are common between fabrics and PCIe NVMe. This simplifies the steps required to bridge between NVMe fabrics and NVMe PCIe. An implementation may choose to map the fabrics SQ directly to a PCIe NVMe SSD SQ to provide a very efficient, simple NVMe transport bridge.

Q. With an RDMA based transport, how will each host discover the NVME controller(s) that it has been granted access to?

A. Please see the answer above.

Q. Traditionally, SAS supports SAS expanders for scaling purposes. How does NVMe over fabric solve this issue, as there is no expander concept in the NVMe world?

A. Recall that SAS expanders compensate for SCSI’s inherent lack of scalability. NVMe perpetuates the multi-queue model (which does not exist for SCSI) natively, so SAS expander-like pieces are not required for scale-out.


NFS FAQ – Test Your Knowledge

How would you rate your NFS knowledge? That’s the question Alex McDonald and I asked our audience at our recent live Webcast, “What is NFS?” From those who considered themselves to be an NFS expert to those who thought NFS was a bit of a mystery, we got some great questions. As promised, here are answers to all of them. If you think of additional questions, please comment on this blog and we’ll get back to you as soon as we can.

Q. I hope you touch on dNFS in your presentation

A. Oracle Direct NFS (dNFS) is a client built into Oracle’s database system that Oracle claims provides faster and more scalable access to NFS servers. As it’s proprietary, SNIA doesn’t really have much to say about it; we’re vendor neutral, and it’s not the only proprietary NFS client out there. But you can read more at the Oracle site if you wish.

Q. Will you be talking about pNFS?

A. We did a series of NFS presentations that covered pNFS a little while ago. You can find them here.

Q. What is the difference between SMB and CIFS? And what is Samba? Is it a type of SMB protocol?

A. It’s best explained in this tutorial that covers SMB. Samba is the open source implementation of SMB for Linux. Information on Samba can be found here.

Q. Will you touch upon how file permissions are maintained when users come from an SMB or a non-SMB connection? What are best practices?

A. Although NFS and SMB share some common terminology for security (ACLs, or Access Control Lists), the implementations are different. The ACL security model in SMB is richer than the NFS file security model. I touched on some of those differences during the Webcast, but my advice is: don’t expect the two security domains of SMB (or Samba, the open source equivalent) and NFS to perfectly overlap. Where possible, try to avoid the requirement, but if you do need the ability to share files across both, talk to your NFS server supplier. A Google search on “nfs smb mixed mode” will also bring up tips and best practices.

Q. How do you tune and benchmark NFSv4?

A. That’s a topic in its own right! This paper gives an overview and how-to of benchmarking NFS; but it doesn’t explain what you might do to tune the system. It’s too difficult to give generic advice here, except to say that vendors should be relied on to provide their experience. If it’s a commercial solution, they will have lots of experience based on a wide variety of use cases and workloads.

Q. Is using NFS to provide block storage a common use case?

A. No, it’s still fairly unusual. The most common use case is for files in directories. Object and block support are relatively new, and there are more NFS “personalities” being developed, see our ESF Webcast on NFSv4.2 for more information. 

Q. Can you comment about file locking issues over NFS?

A. Locking is needed by NFS to maintain file consistency in the face of multiple readers and writers. Locking in NFSv3 was difficult to manage; if a server failed or clients went AWOL, then the lock manager would be left with potentially thousands of stale locks. They often required manual purging. NFSv4 simplifies that by being a stateful protocol, and by integrating the lock management functions and employing timeouts and state, it can manage client and server recovery much more gracefully. Locks are, in the main, automatically released or refreshed after a failure.

Q. Where do things like AFS come into play? Above NFS? Below NFS? Something completely different?

A. AFS is another distributed file system, but it is not POSIX compliant. It influenced but is not directly related to NFS. Its use is relatively small; SMB and NFS dominate. Wikipedia has a good overview.

Q. As you said NFSv4 can hide some of the directories when exporting to clients. Can this operation hide different folders for different clients?

A. Yes. It’s possible to maintain completely different exports to expose or hide whatever directories on the server you wish. The pseudo file system is built separately for each server export. So you can have export X with subdirectories A B and C; or export Y with subdirectories B and C only.
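
As a concrete illustration, here is what that might look like in /etc/exports on a Linux NFS server; the paths and client names are hypothetical:

    # Export X: clientX sees subdirectories A, B, and C.
    /export/X/A   clientX(ro,sync,no_subtree_check)
    /export/X/B   clientX(ro,sync,no_subtree_check)
    /export/X/C   clientX(ro,sync,no_subtree_check)
    # Export Y: clientY sees only B and C.
    /export/Y/B   clientY(ro,sync,no_subtree_check)
    /export/Y/C   clientY(ro,sync,no_subtree_check)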

Q. Similar to DFS-N and DFS-R in combination, if a user moves to a different location, does NFS have a similar methodology?

A. I’m not sure what DFS-N and DFS-R do in terms of location transparency. NFS can be set up such that if you can contact a particular server, and if you have the correct permissions, you should be able to see the same exports regardless of where the client is running.

Q. Which daemons should be running on server side and client side for accessing filesystem over NFS?

A. This is NFS server and client specific. You need to look at the documentation that comes with each.

Q. Regarding VMware 6.0. Why use NFS over FC?

A. Good question but you’ll need to speak to VMware to get that question answered. It depends on the application, your infrastructure, your costs, and the workload.

Curious about Your Storage Knowledge? It’s a Quick “Test” with SNIA Storage Foundations Certification Practice Exam

Whether you’ve recently mastered the basics, or are a storage technology expert, letting the industry know you are credentialed can (and probably should) be part of your career development process.  SNIA’s Storage Networking Certification Program (SNCP) provides a strong foundation of vendor-neutral, systems-level credentials that integrate with and complement individual vendor certifications. SNCP’s three knowledge “domains” – Concepts, Standards, and Solutions – each provide a standard by which your knowledge and skill set can be assessed on a consistent, industry-wide basis without any vendor specializations.

Many storage professionals choose to begin with the SNIA Storage Foundations Certification, according to Michael Meleedy, SNIA’s Director of Education.  “The SNIA Foundations Exam (S10-110), newly revised to integrate new technologies and industry practices, is the entry-level exam within the SNIA Storage Networking Certification Program (SNCP),” Meleedy explained. “It has been widely accepted by the storage industry as the benchmark for basic vendor-neutral storage credentials.  In fact, vendors like Dell require this certification.”

Try the Practice Exam!

We recommend considering Spring as the best time to test your skills – and a NEW SNIA Storage Foundations Certification Practice exam makes it very easy.  This practice exam is short (easy to squeeze into your busy day) and the sample of questions from the real exam will help you quickly determine if you have the skills required to pass the industry’s only vendor-neutral certification exam. It’s open to everyone free of charge with the results available immediately.  Take the practice exam.

Why Should I Explore the SNCP?

Professionals often wonder about the real value of IT related certifications.  Is it worth your time and money to become certified?  “Yes, especially in today’s global marketplace,” said Paul Talbut, SNIA Global Education and Regional Affiliate Program Director. “SNIA certifications provide storage and data management practitioners worldwide with an industry recognised uniform standard by which individual knowledge and skill-sets can be judged.   We’re reaching a variety of professional audiences; for example, SNIA’s Foundations Exam is available both in English and Japanese, and is offered at all Prometric testing centers worldwide.”

Learn more about the new SNIA Foundations exam (S10–110) and study materials, the entire range of SNIA Certification Testing, and the six good reasons why you should be SNIA certified! Visit http://www.snia.org/education/certification.