Fibre Channel vs. iSCSI – The Great Debate Generates Questions Galore

The SNIA Ethernet Storage Forum recently hosted the first of our “Great Debates” webcasts on Fibre Channel vs. iSCSI. The goal of this series is not to have a winner emerge, but rather provide vendor-neutral education on the capabilities and use cases of these technologies so that attendees can become more informed and make educated decisions. And it worked! Over 1,200 people have viewed the webcast in the first three weeks! And the comments from attendees were exactly what we had hoped for:

“A good and frank discussion about the two technologies that don’t always need to compete!”

“Really nice and fair comparison guys. Always well moderated, you hit a lot of material in an hour. Thanks for your work!”

“Very fair and balanced overview of the two protocols.”

“Excellent coverage of the topic. I will have to watch it again.”

If you missed the webcast, you can watch it on-demand at your convenience and download a copy of the slides. The debate generated many good questions, and our expert speakers have answered them all.

Ethernet Networked Storage – FAQ

At our SNIA Ethernet Storage Forum (ESF) webcast “Re-Introduction to Ethernet Networked Storage,” we provided a solid foundation on Ethernet networked storage, the move to higher speeds, challenges, use cases and benefits. Here are answers to the questions we received during the live event.

Q. Within the iWARP protocol there is a layer called MPA (Marker PDU Aligned Framing for TCP) inserted for storage applications. What is the point of this protocol?

A. MPA is an adaptation layer between the iWARP Direct Data Placement Protocol and TCP/IP. It provides framing and CRC protection for Protocol Data Units.  MPA enables packing of multiple small RDMA messages into a single Ethernet frame.  It also enables an iWARP NIC to place frames received out-of-order (instead of dropping them), which can be beneficial on best-effort networks. More detail can be found in IETF RFC 5044 and IETF RFC 5041.
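
To make the framing concrete, here is a rough, illustrative Python calculation of MPA overhead based on RFC 5044: each FPDU carries a 2-byte ULPDU length field, up to 3 pad bytes, and a 4-byte CRC-32c, and (when markers are enabled) a 4-byte marker roughly every 512 octets of the TCP stream. The payload sizes and the per-FPDU marker approximation are for illustration only, not a byte-accurate rendering of the RFC.

```python
def mpa_fpdu_size(ulpdu_len: int, markers_enabled: bool = True) -> int:
    """Approximate on-the-wire size of one MPA FPDU carrying ulpdu_len bytes."""
    length_field, crc = 2, 4            # 16-bit ULPDU length field + CRC-32c
    body = length_field + ulpdu_len
    pad = (-body) % 4                   # pad the FPDU to a 4-byte boundary
    fpdu = body + pad + crc
    if markers_enabled:                 # roughly one 4-byte marker per 512 octets of stream
        fpdu += 4 * (fpdu // 512)
    return fpdu

for payload in (64, 512, 1408):
    print(f"{payload}-byte ULPDU -> ~{mpa_fpdu_size(payload)} bytes on the wire")
```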

Q. What is the API for RDMA network IPC?

A. The general API for RDMA is called verbs. The OpenFabrics Verbs Working Group oversees the development of the verbs definition and functionality in the OpenFabrics Software (OFS) code. You can find the training content from the OpenFabrics Alliance here. General information about RDMA over Converged Ethernet (RoCE) is available at the InfiniBand Trade Association website. Information about the Internet Wide Area RDMA Protocol (iWARP) can be found at the IETF: RFC 5040, RFC 5041, RFC 5042, RFC 5043, RFC 5044.
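
For a feel of what verbs code looks like, here is a minimal sketch using the pyverbs Python bindings that ship with rdma-core (open a device, allocate a protection domain, register a memory region). The class names, import paths, and device name below are assumptions from memory, so check them against the pyverbs documentation for your rdma-core release before relying on them.

```python
from pyverbs.device import Context   # assumed module layout -- verify against the pyverbs docs
from pyverbs.pd import PD
from pyverbs.mr import MR
import pyverbs.enums as e

ctx = Context(name="rocep0s1")       # hypothetical device name; list yours with the ibv_devices tool
pd = PD(ctx)                         # protection domain scoping the resources below
mr = MR(pd, 4096,                    # register a 4 KiB buffer for local/remote RDMA access
        e.IBV_ACCESS_LOCAL_WRITE | e.IBV_ACCESS_REMOTE_WRITE)
print("lkey:", mr.lkey, "rkey:", mr.rkey)   # keys a peer needs to address this region
```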

Q. RDMA requires TCP/IP (iWARP), InfiniBand, or RoCE to operate on with respect to NVMe over Fabrics. Therefore, what are the advantages and disadvantages of iWARP vs. RoCE?

A. Both RoCE and iWARP support RDMA over Ethernet. iWARP uses TCP/IP while RoCE uses UDP/IP. Debating which one is better is beyond the scope of this webcast, but you can learn more by watching the SNIA ESF webcast, “How Ethernet RDMA Protocols iWARP and RoCE Support NVMe over Fabrics.”

Q. 100Gb Ethernet Optical Data Center solution?

A. 100Gb Ethernet optical interconnect products were first available around 2011 or 2012 in a 10x10Gb/s design (100GBASE-CR10 for copper, 100GBASE-SR10 for optical), which required thick cables and a CXP or CFP MSA housing. These were generally used only for switch-to-switch links. Starting in late 2015, the more compact 4x25Gb/s design (using the QSFP28 form factor) became available in copper (DAC), optical cabling (AOC), and transceivers (100GBASE-SR4, 100GBASE-LR4, 100GBASE-PSM4, etc.). The optical transceivers allow 100GbE connectivity up to 100m, 2km, or 10km distances, depending on the type of transceiver and fiber used.

Q. Where is FCoE being used today?

A. FCoE is primarily used in blade server deployments where there could be contention for PCI slots and only one built-in NIC. These NICs typically support FCoE at 10Gb/s speeds, passing both FC and Ethernet traffic via a connection to a Top-of-Rack FCoE switch, which splits the traffic out to the respective fabrics (FC and Ethernet). However, it has not gained much acceptance outside of the blade server use case.

Q. Why did iSCSI start out mostly in lower-cost SAN markets?

A. When it first debuted, iSCSI packets were processed by software initiators which consumed CPU cycles and showed higher latency than Fibre Channel. Achieving high performance with iSCSI required expensive NICs with iSCSI hardware acceleration, and iSCSI networks were typically limited to 100Mb/s or 1Gb/s while Fibre Channel was running at 4Gb/s. Fibre Channel is also a lossless protocol, while TCP/IP is lossy, which caused concerns for storage administrators. Now, however, iSCSI can run on 25, 40, 50 or 100Gb/s Ethernet with various types of TCP/IP acceleration or RDMA offloads available on the NICs.

Q. What are some of the differences between iSCSI and FCoE?

A. iSCSI runs SCSI protocol commands over TCP/IP (except iSER which is iSCSI over RDMA) while FCoE runs Fibre Channel protocol over Ethernet. iSCSI can run over layer 2 and 3 networks while FCoE is Layer 2 only. FCoE requires a lossless network, typically implemented using DCB (Data Center Bridging) Ethernet and specialized switches.
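
Because iSCSI rides on ordinary TCP/IP, you can probe a target portal across a routed (Layer 3) network with nothing more than a plain socket, something FCoE, being Layer 2 only, cannot do. The sketch below uses an example portal address; iSCSI targets listen on TCP port 3260 by default.

```python
import socket

portal = ("192.0.2.10", 3260)        # example "documentation" IP; 3260 is the default iSCSI port
try:
    with socket.create_connection(portal, timeout=3):
        print("iSCSI portal reachable at", portal)
except OSError as err:
    print("portal not reachable:", err)
```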

Q. You pointed out at least twice that people incorrectly predicted the end of Fibre Channel, but it didn’t happen. What makes you say Fibre Channel is actually going to decline this time?

A. Several things are different this time. First, Ethernet is now much faster than Fibre Channel instead of the other way around. Second, Ethernet networks now support lossless and RDMA options that were not previously available. Third, several new solutions–like big data, hyper-converged infrastructure, object storage, most scale-out storage, and most clustered file systems–do not support Fibre Channel. Fourth, none of the hyper-scale cloud implementations use Fibre Channel and most private and public cloud architects do not want a separate Fibre Channel network–they want one converged network, which is usually Ethernet.

Q. Which storage protocols support RDMA over Ethernet?

A. The Ethernet RDMA options for storage protocols are iSER (iSCSI Extensions for RDMA), SMB Direct, NVMe over Fabrics, and NFS over RDMA. There are also storage solutions that use proprietary protocols supporting RDMA over Ethernet.

Common Questions on Clustered File Systems

More than 350 people have already seen our SNIA Ethernet Storage Forum (ESF) webcast “Clustered File Systems: No Limits.” Our presenters, James Coomer and Jerry Lotto, did a great job explaining what clustered file systems are, key considerations, choices and performance. As we expected, there were plenty of questions, so as promised, here are answers to them all.

Q: Parallel NFS (pNFS) has been in development/standards efforts for a long time, and I believe pNFS is not in the Linux kernel; it appears pNFS has yet to reach prime time.

A: pNFS has been in Linux for over a decade! Clients and servers are widely available, and you should look at the SNIA White Paper “An Updated Overview of NFSv4; NFSv4.0, NFSv4.1, pNFS, and NFSv4.2” for more information on the current state of play.

Q: Why the emphasis on parallel I/O? Any single storage server can feed results at link capacity, so you do not need multiple storage servers to feed a client at full speed. Isn’t the more critical issue the bottleneck on access to metadata for a single directory or file? Federated NAS bottlenecks updates for each directory behind a single master server?

A: Any one storage server can usually saturate one client, but often there are multiple hungry clients making requests simultaneously. Using parallel I/O allows multiple servers to feed multiple high-bandwidth clients across a narrow or wide set of data. This smooths out the I/O load on the servers in a near-perfect manner regardless of the number of clients performing I/O. It is absolutely true that metadata serving can become a bottleneck, so parallel file systems use cached and/or distributed metadata to overcome this; every client takes part in that interaction and shares some responsibility for managing and communicating metadata updates.
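
A back-of-the-envelope example (with assumed numbers) shows why striping data across servers matters once several clients are hungry at the same time:

```python
link_gbps = 100                      # assumed per-port link speed
servers, clients = 8, 24             # assumed cluster sizes
aggregate = servers * link_gbps      # bandwidth the server tier can source in total
per_client = aggregate / clients     # what each client sees if the load spreads evenly
print(f"{aggregate} Gb/s aggregate across {servers} servers, "
      f"~{per_client:.0f} Gb/s available per client")
```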

Q: Can any application access a parallel file system (i.e., through an agent at the driver level)? Or does it require specific code within the application?

A: Native access to a parallel file system requires a specific client or agent in the host, but many parallel file systems allow any client to access the data through a NAS protocol gateway. No changes are needed for applications to use a parallel file system; these file systems are mounted as POSIX-compliant file systems and therefore adhere to essentially the same standards as, for example, an NFS mount.
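
For example, an unmodified Python program reads a file on a parallel file system exactly as it would on an NFS mount or a local disk; the mount point below is hypothetical:

```python
path = "/mnt/parallel_fs/results.dat"    # hypothetical file on a parallel file system mount

with open(path, "rb") as f:              # ordinary POSIX open/read -- no special API needed
    header = f.read(4096)
print(f"read {len(header)} bytes through the standard file interface")
```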

Q: Are parallel file system clients compatible with scale-out NAS servers?

A: Nearly all scale-out NAS servers speak a standard NAS protocol like NFS or SMB. Clients running a parallel file system client can also access NAS via these standard protocols. Exceptions may conceivably occur (though none that we know of) for scale-out NAS servers that require a modified NFS/SMB protocol or a custom NAS client, which could conflict with the parallel file system client installed on the same OS.

Q: Of course I am biased, but I am fond of the AFS (Andrew File System) family of file systems. There is OpenAFS, but there is also what we are doing at AuriStor, extending beyond the core AFS global namespace model (security functionality and performance).

A: AFS is another distributed file system which supports large-scale deployments, native clients for many platforms, and strong security features. It also uses local caching of files to improve performance. It uses a weakly consistent file locking system, so multiple clients can access the same file simultaneously but they cannot both update the same file at the same time. OpenAFS is an open-source implementation of AFS. AuriStor (formerly Your File System, Inc.) is a startup providing a commercial parallel file system that is compatible with AFS.

Q: I am more familiar with Veritas Cluster File System, could you please do a quick compare with Lustre or GPFS?

A: The Veritas Cluster File System (formerly VxCFS, now part of Veritas InfoScale) is a distributed file system that runs on Linux and popular flavors of Unix. It supports up to 64 nodes and allows multiple nodes to share the same back-end storage hardware. Comparing it to Lustre and GPFS is beyond the scope of this webinar, but in basic terms, parallel file systems can offer far greater scalability and bandwidth, for example through the use of optimized RDMA clients for high-performance networks.

Q: Why do file apps need shared access to data, but block apps do not?

A: Traditionally block storage did not offer shared access to data (except when used as shared back-end storage for a clustered file system), while apps that needed shared access to data usually chose to use a NAS protocol such as SMB or NFS. So in many cases file-based apps use file sharing protocols because they need shared access to data from multiple clients. (In other cases file-based applications do not require sharing but the storage administrators believe it’s easier to manage or less expensive than networked block storage.)

Q: Do Lustre and GPFS have SMB Direct support?

A: Not today. SMB Direct is an option to use RDMA and multi-channel with the SMB 3 protocol. Both Lustre and GPFS support the ability to export a file system via NFS or SMB, but generally they do not support SMB Direct yet. Both Lustre and GPFS support RDMA access through their clients.

Q: How do clients avoid doing simultaneous writes to the same file?

A: Some parallel file systems allow this by letting different clients write to different parts of the same file. Others do not allow this. In either case, distributed file locking is used to prevent two clients from writing simultaneously to the same part of a file (or to the same file if it’s not allowed).
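
As a local illustration of the idea, POSIX byte-range locks let two writers safely share one file by locking different regions; a parallel file system provides the same semantics but coordinates the locks through a distributed lock or token manager rather than a single kernel. The file path below is hypothetical:

```python
import fcntl, os

fd = os.open("/mnt/parallel_fs/shared.dat", os.O_RDWR | os.O_CREAT, 0o644)

# Exclusively lock bytes 0 .. 1 MiB-1; another client may lock a different
# range of the same file and write to it concurrently.
fcntl.lockf(fd, fcntl.LOCK_EX, 1 << 20, 0, os.SEEK_SET)
os.pwrite(fd, b"client A owns this region", 0)   # safe: no other writer holds these bytes
fcntl.lockf(fd, fcntl.LOCK_UN, 1 << 20, 0, os.SEEK_SET)

os.close(fd)
```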

Q: How can you say that the application “does not have to worry about” how the clustered file system serializes writes? Doesn’t this require continuous end-to-end connectivity?

A: When the application writes data it generally writes to a POSIX-compliant file system and does not need to worry about how the parallel file system serializes, distributes, or protects the data because this is virtualized (managed) by the file system. It usually does require continuous end-to-end connectivity from the clients to the servers, though in some cases caching could allow for brief gaps in connectivity and in some systems not every client needs to have network connectivity to every server. There are multiple mechanisms within parallel file systems to manage the various cases of clients/servers disappearing from the network, temporarily or permanently (whilst for example holding a lock).

Q: How does a parallel file system handle sequences of writes to the same file? Just append one by one? What if a client modifies a line?

A: This is the biggest challenge for, and a key reason to use, a parallel file system. Beneath the covers, coherency is maintained by Spectrum Scale using a token management server process which issues locks for object requests. Similar functionality is implemented in Lustre using a distributed lock manager. These objects are most commonly blocks within files rather than entire files, but this is application controlled. The end result is a POSIX-compliant interface that scales to thousands of clients.

Q: What does FPO stand for?

A: File Placement Optimizer – a shared-nothing architecture and licensing model for IBM Spectrum Scale (aka GPFS). Learn more here.

Q: Is there a concept in parallel file systems for “auto-tuning” yet? Seems like the early days of SAN management and tuning…

A: Default tuning values are optimized for general-purpose workloads, but the whole purpose of tuning parameters is to adjust away from those defaults to optimize the file system for a particular application workload or file system architecture. Both IBM and OpenSFS, with the support of Intel, have published extensive documentation on best practices for optimization and tuning for either file system. We are not aware of any work on “automating” that process, but there has been recent work (e.g., in Spectrum Scale) to simplify the tuning process.

Q: Which is better as an interconnect between disks and servers, shared access or shared-nothing?

A: The use of shared access in the interconnect between disks and servers is limited to providing HA functionality in Lustre or Spectrum Scale: the ability to service I/O requests to a storage device if the server with primary responsibility for that device is not available. This usually involves multiple servers attached to the same external storage, which can add cost to building the file system. The alternative approach to HA is to replicate blocks of data to different disks on different servers, cutting into the usable capacity of the file system. If HA is not a requirement, a shared-nothing architecture will generally involve less hardware and therefore be less expensive to build.
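
A simple capacity illustration of that trade-off, using assumed numbers:

```python
raw_tb = 8 * 100            # assumed: 8 servers with 100 TB of raw disk each
replicas = 2                # assumed replication factor for shared-nothing HA
print("shared-storage HA usable:", raw_tb, "TB (before any RAID overhead)")
print("shared-nothing HA usable:", raw_tb // replicas, "TB")
```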

If you have more questions, please comment on this blog. And I encourage you to check out the SNIA ESF webcast library for educational, vendor-neutral content on Ethernet networked storage topics.

2017 Ethernet Roadmap for Networked Storage

When SNIA’s Ethernet Storage Forum (ESF) last looked at the Ethernet Roadmap for Networked Storage in 2015, we anticipated a world of rapid change. The list of advances in 2016 is nothing short of amazing:

  • New adapters, switches, and cables have been launched supporting 25, 50, and 100Gb Ethernet speeds including support from major server vendors and storage startups
  • Multiple vendors have added or updated support for RDMA over Ethernet
  • The growth of NVMe storage devices and release of the NVMe over Fabrics standard are driving demand for both faster speeds and lower latency in networking
  • The growth of cloud, virtualization, hyper-converged infrastructure, object storage, and containers is increasing the popularity of Ethernet as a storage fabric

The world of Ethernet in 2017 promises more of the same. Now we revisit the topic with a look ahead at what’s in store for Ethernet in 2017.  Join us on December 1, 2016 for our live webcast, “2017 Ethernet Roadmap to Networked Storage.”

With all the incredible advances and learning vectors, SNIA ESF has assembled a great team of experts to help you keep up. Here are some of the things to keep track of in the upcoming year:

  • Learn what is driving the adoption of faster Ethernet speeds and new Ethernet storage models
  • Understand the different copper and optical cabling choices available at different speeds and distances
  • Debate how other connectivity options will compete against Ethernet for the new cloud and software-defined storage networks
  • And finally look ahead with us at what Ethernet is planning for new connectivity options and faster speeds such as 200 and 400 Gigabit Ethernet

The momentum is strong with Ethernet, and we’re here to help you stay informed of the lightning-fast changes. Come join us to look at the future of Ethernet for storage and join this SNIA ESF webcast on December 1st. Register here.

SNIA Storage Developer Conference-The Knowledge Continues

SNIA’s 18th Storage Developer Conference is officially a success, with 124 general and breakout sessions; Cloud Interoperability, Kinetic Storage, and SMB3 plugfests; ten Birds-of-a-Feather sessions; and amazing networking among 450+ attendees. Sessions on NVMe over Fabrics won the title of most attended, but Persistent Memory, Object Storage, and Performance were right behind. Many thanks to the SDC 2016 Sponsors, who engaged attendees in exciting technology discussions.

For those not familiar with SDC, this technical industry event is designed for a variety of storage technologists at various levels from developers to architects to product managers and more.  And, true to SNIA’s commitment to educating the industry on current and future disruptive technologies, SDC content is now available to all – whether you attended or not – for download and viewing.

You’ll want to stream keynotes from Citigroup, Toshiba, DSSD, Los Alamos National Labs, Broadcom, Microsemi, and Intel – they’re available now on demand on SNIA’s YouTube channel, SNIAVideo.

All SDC presentations are now available for download, and over the next few months you can continue to download SDC podcasts, which combine audio and slides. The first podcast from SDC 2016 – on hyperscalers – is available here, as are all the 2015 SDC podcasts, and more will be added in the coming weeks.

SNIA thanks all its members and colleagues who contributed to make SDC a success! A special thanks goes out to the SNIA Technical Council, a select group of acknowledged industry experts who work to guide SNIA technical efforts. In addition to driving the agenda and content for SDC, the Technical Council oversees and manages SNIA Technical Work Groups, reviews architectures submitted by Work Groups, and is the SNIA’s technical liaison to standards organizations. Learn more about these visionary leaders at http://www.snia.org/about/organization/tech_council.

And finally, don’t forget to mark your calendars now for SDC 2017 – September 11-14, 2017, again at the Hyatt Regency Santa Clara. Watch for the Call for Presentations to open in February 2017.

Find out How iSCSI is Evolving

The next Ethernet Storage Forum webcast, “Evolution of iSCSI including iSER, iSCSI over RDMA Ethernet,” will focus on developments with iSCSI – the Internet Protocol standard for transferring SCSI commands across an Ethernet network, enabling hosts to link to storage devices wherever they may be. At this webcast on May 24th, I will be joined by Fred Knight, Standards Technologist at NetApp, and Andy Banta, Storage Janitor at SolidFire/NetApp, who will discuss the evolution of iSCSI up to iSER, which takes advantage of Ethernet RDMA fabric technologies to enhance performance. Register now to hear:

  • A brief history of iSCSI
  • How iSCSI works
  • IETF refinements to the specification
  • Enhancing iSCSI performance with iSER

The Webcast will be live, so please bring your questions for Andy and Fred. We hope to see you there!

Ethernet RDMA Protocols Support for NVMe over Fabrics – Your Questions Answered

Our recent SNIA Ethernet Storage Forum webcast on How Ethernet RDMA Protocols iWARP and RoCE Support NVMe over Fabrics generated a lot of great questions. We didn’t have time to get to all of them during the live event, so as promised here are the answers. If you have additional questions, please comment on this blog and we’ll get back to you as soon as we can.

Q. Are there still actual (memory based) submission and completion queues, or are they just facades in front of the capsule transport?

A. On the host side, they’re “facades,” as you call them. When running NVMe/F, host reads and writes do not actually use NVMe submission and completion queues; the data simply flows to and from RNIC RDMA queues. On the target side, there could be real NVMe submission and completion queues in play. But the more accurate answer is that it is “implementation dependent.”

Q. Who places the command from the NVMe queue into the host RDMA queue, from a software standpoint?

A. This is managed by the kernel host software in code written to the NVMe/F specification. The idea is that any existing application that thinks it is writing to the existing NVMe host software will in fact cause the SQE entry to be encapsulated and placed in an RDMA send queue.
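
Here is a rough sketch of that encapsulation: the 64-byte NVMe submission queue entry (SQE) sits at the front of a command capsule, optionally followed by in-capsule data, and the whole capsule is posted to an RDMA send queue. The field layout is deliberately simplified and is not a byte-accurate rendering of the NVMe over Fabrics specification:

```python
import struct

SQE_SIZE = 64                                        # NVMe submission queue entries are 64 bytes

def build_capsule(opcode: int, command_id: int, data: bytes = b"") -> bytes:
    sqe = struct.pack("<BxH", opcode, command_id)    # opcode + command identifier (simplified)
    sqe = sqe.ljust(SQE_SIZE, b"\x00")               # remaining SQE fields left zeroed here
    return sqe + data                                # capsule = SQE + optional in-capsule data

capsule = build_capsule(opcode=0x01, command_id=7, data=b"\x00" * 512)
print("capsule length:", len(capsule), "bytes")      # 64-byte SQE + 512 bytes of data
```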

Q. You say “most enterprise switches” support NVMe/F over RDMA, I guess those are ‘new’ ones, so what is the exact question to ask a vendor about support in an older switch?

A. For iWARP, any switch that can handle Internet traffic will do. Mellanox and Intel have different answers for RoCE / RoCEv2. Mellanox says that for RoCE, it is recommended, but not required, that the switch support Priority Flow Control (PFC). Most new enterprise switches support PFC, but you should check with your switch vendor to be sure. Intel believes RoCE was architected around DCB. The name itself, RoCE, stands for “RDMA over Converged Ethernet,” i.e., Ethernet with DCB. Intel believes RoCE in general will require PFC (or some future standard that delivers equivalent capabilities) for efficient RDMA over Ethernet.

Q. Can you comment on when one should use RoCEv2 vs. iWARP?

A. We gave a high-level overview of some of the deployment considerations on slide 30. We refer you to some of the vendor links on slide 32 for “non-vendor neutral” perspectives.

Q. If you take RDMA out of the equation, what is the key advantage of NVMe/F over other protocols? Is it that they are transparent to any application?

A. NVMe/F allows the application to bypass the SCSI stack and use native NVMe commands across a network. Most other block storage protocols require using the SCSI protocol layer, translating the NVMe commands into SCSI commands. With NVMe/F you also gain parallelism, simplicity of the command set, a separation between administrative sessions and data sessions, and a reduction of latency and processing required for NVMe I/O operations.

Q. Is ROCE v1 compatible with ROCE v2?

A. Yes. Adapters speaking RoCEv2 can also maintain RDMA connections with adapters speaking RoCEv1 because RoCEv2 ports are backwards interoperable with RoCEv1. Most of the currently shipping NICs supporting RoCE support both RoCEv1 and RoCEv2.

Q. Are RoCE and iWARP the only way to use Ethernet as a fabric for NVMe/F?

A. Initially yes; only iWARP and RoCE are supported for NVMe over Ethernet. But the NVM Express Working Group is also targeting FCoE. We should have probably been clearer about that, though it is noted on slide 11.

Q. What about doing NVMe over Fibre Channel? Is anyone looking at, or doing this?

A. Yes. This is not in scope for the first spec release, but the NVMe WG is collaborating with the FCIA on this. So NVMe over Fibre Channel is expected as another standard in the near future, to be promoted by T11.

Q. Do RoCE and iWARP both use just IP addresses for management or is there a higher level addressing mechanism, and management?

A. RoCEv2 uses the RoCE Connection Manager, and iWARP uses TCP connection management. They both use IP for addressing.

Q. Are there other fabrics to run NVMe over fabrics? Can you do this over OmniPath or Infiniband?

A. InfiniBand is in scope for the first spec release. Also, there is a related effort by the FCIA to support NVMe over Fibre Channel in a standard that will be promoted by T11.

Q. You indicated the NVMe stack is in the kernel while RDMA is a user-level verb. How are NVMe SQ/CQ entries transferred from NVMe to RDMA and vice versa? Also, could smaller transfers in NVMe (e.g., an SGL of 512B) be combined into larger sizes before being sent to RDMA entries, and vice versa?

A. NVMe/F supports multiple scatter-gather entries to combine multiple non-contiguous transfers; nevertheless, the protocol doesn’t support chaining multiple NVMe commands in the same command capsule. A command capsule contains only a single NVMe command. Please also refer to slide 18 from the presentation.

Q. 1) How do implementers and adopters today test NVMe deployments? 2) Besides latency, what other key performance indicators do implementers and adopters look for to determine whether the NVMe deployment is performing well or not?

A. 1) Like any other datacenter specification, testing is done by debugging, interop testing and plugfests. Local NVMe is well supported and can be tested by anyone. NVMe/F can be tested using pre-standard drivers or solutions from various vendors. UNH-IOL (the University of New Hampshire InterOperability Laboratory) is an organization with an excellent reputation for helping here. 2) Latency, yes. But also sustained bandwidth, IOPS, and CPU utilization, i.e., the “usual suspects.”
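
As a minimal illustration of measuring those “usual suspects,” the sketch below times a loop of small reads and reports latency, IOPS, and bandwidth; the device path and parameters are assumptions, and a real evaluation would normally use a dedicated tool such as fio:

```python
import os, time

path, io_size, ops = "/dev/nvme0n1", 4096, 1000      # assumed device and parameters
fd = os.open(path, os.O_RDONLY)                      # note: page-cache effects are not excluded here
start = time.perf_counter()
for i in range(ops):
    os.pread(fd, io_size, i * io_size)               # sequential 4 KiB reads
elapsed = time.perf_counter() - start
os.close(fd)

print(f"avg latency {elapsed / ops * 1e6:.1f} us, "
      f"{ops / elapsed:.0f} IOPS, "
      f"{ops * io_size / elapsed / 1e6:.1f} MB/s")
```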

Q. If RoCE CM supports ECN, why can’t it be used to implement a full solution without requiring PFC?

A. Explicit Congestion Notification (ECN) is an extension to TCP/IP defined by the IETF. First point is that it is a standard for congestion notification, not congestion management. Second point is that it operates at L3/L4. It does nothing to help make the L2 subnet “lossless.” Intel and Mellanox agree that generally speaking, all RDMA protocols perform better in a “lossless,” engineered fabric utilizing PFC (or some future standard that delivers equivalent capabilities). Mellanox believes PFC is recommended but not strictly required for RoCE, so RoCE can be deployed with PFC, ECN, or both. In contrast, Intel believes that for RoCE / RoCEv2 to deliver the “lossless” performance users expect from an RDMA fabric, PFC is in general required.

Q. How involved are Ethernet RDMA efforts with the SDN/OCP community? Is there a coming example of RoCE or iWARP on an SDN switch?

A. Good question, but neither RoCEv2 nor iWARP look any different to switch hardware than any other Ethernet packets. So they’d both work with any SDN switch. On the other hand, it should be possible to use SDN to provide special treatment with respect to say congestion management for RDMA packets. Regarding the Open Compute Project (OCP), there are various Ethernet NICs and switches available in OCP form factors.

Q. Is there a RoCE v3?

A. No. There is no RoCEv3.

Q. iWARP and RoCE both fall back to TCP/IP in the lowest communication sense? So they are somewhat compatible?

A. They can speak sockets to each other. In that sense they are compatible. However, for the usage model we’re considering here, NVMe/F, RDMA is required. Because of L3/L4 differences, RoCE and iWARP RNICs cannot speak RDMA to each other.

Q. So in case of RDMA (ROCE or iWARP), the NVMe controller’s fabric port is Ethernet?

A. Correct. But it must be RDMA-enabled Ethernet.

Q. What if I am using soft RoCE, do I still need an RNIC?

A. Functionally, soft RoCE or soft iWARP should work on a regular NIC. Whether the performance is sufficient to keep up with NVMe SSDs without the hardware offloads is a different matter.

Q. How would the NVMe controller know that a command has been placed in the submission queue by the fabric host driver? Is the fabric host driver responsible for notifying the NVMe controller through a remote doorbell trigger, or should the fabric target driver trigger the doorbell?

A. No separate notification by the host is required. The fabric host driver simply sends a command capsule to notify its companion subsystem driver that there is a new command to be processed. The way the subsystem side notifies the back-end NVMe drive is outside the scope of the protocol.

Q. I am chair of the ETSI NFV working group on NFV acceleration. We are working on virtual RDMA and how a VM can benefit from hardware-independent RDMA. One cornerstone of this is a virtual-RDMA pseudo device, but there is not yet consensus on the minimal set of verbs to be supported. Do you think this minimal verb set can be identified? Lastly, the transport address space is not consistent between IB and Ethernet. How can transport-independent RDMA be supported?

A. You know, the NVM Express Working Group is working on exactly these questions. They have to define a “minimal verb set” since NVMe/F generates the verbs. Similarly, I’d suggest looking to the spec to see how they resolve the transport address space differences.

Q. What’s the plan for Linux submission of NVMe over Fabric changes? What releases are being targeted?

A. The Linux Driver WG in the NVMe WG expects to submit code upstream within a quarter of the spec being finalized. At this time it looks like the most likely Linux target will be kernel 4.6, but it could end up being kernel 4.7.

Q. Are NVMe SQ/CQ transferred transparently to RDMA Queues or can they be modified?

A. The method defined in the NVMe/F specification entails a transparent transfer. If you wanted to modify an SQE or CQE, do so before initiating an NVMe/F operation.

Q. How common are rNICs for recent servers? i.e. What’s a quick check I can perform to find out if my NIC is an rNIC?

A. rNICs are offered by nearly all major server vendors. The best way to check is to ask your server or NIC vendor if your NIC supports iWARP or RoCE.
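
On Linux there is also a quick local check you can run in addition to asking your vendor: RDMA-capable devices with a loaded driver (iWARP, RoCE, or InfiniBand) appear under /sys/class/infiniband. An empty or missing directory suggests no rNIC (or no RDMA driver) is present.

```python
import os

sysfs = "/sys/class/infiniband"                      # RDMA devices register here on Linux
rdma_devs = sorted(os.listdir(sysfs)) if os.path.isdir(sysfs) else []
print("RDMA devices:", rdma_devs or "none found")
```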

Q. This is most likely out of the scope of this talk, but could you perhaps share, at a 30,000-foot level, the differences between “NVMe controller” hardware versus “NVMe over Fabrics” hardware? It’s most likely a combination of R-NIC + NVMe controller, but it would be great to get your take on this.

A. A goal of the NVMe/F spec is that it work with all existing NVMe controllers and all existing RoCE and iWARP RNICs. So even at a very low level, we can say “no difference.” That said, of course, nothing stops someone from combining NVMe controller and rNIC hardware into one solution.

Q. Are there any example Linux targets in the distros that exercise RDMA verbs? An iWARP or iSER target in a distro?

A. iSER allows this using a LIO or TGT SCSI target.

Q. Is there a standard or IP for RDMA NIC?

A. The various RNICs are based on IBTA, IETF, and IEEE standards, as shown on slide 26.

Q. What is the typical additional latency introduced comparing NVMe over Fabric vs. local NVMe?

A. In the 2014 IDF demo, the prototype NVMe/F stack matched the bandwidth of local NVMe with a latency penalty of only 8µs over a local iWARP connection. Other demonstrations have shown an added fabric latency of 3µs to 15µs. The goal for the final spec is under 10µs.
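
To put those figures in perspective, here is a small worked calculation with an assumed local NVMe read latency (adjust it for your own device):

```python
local_read_us = 90                       # assumed local NVMe read latency in microseconds
for fabric_penalty_us in (3, 8, 15):     # figures quoted in the answer above
    total = local_read_us + fabric_penalty_us
    print(f"+{fabric_penalty_us} us fabric -> {total} us total "
          f"({fabric_penalty_us / local_read_us * 100:.0f}% overhead)")
```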

Q. How well is NVMe over RDMA supported for Windows?

A. It is not currently supported, but then the spec isn’t even finished. Contact Microsoft if you are interested in their plans.

Q. RDMA over Ethernet would not support Layer 2 switching? How do you deal with TCP overhead?

A. L2 switching is supported by both iWARP and RoCE. Both flavors of RNICs have MAC addresses, etc. iWARP had to deal with TCP/IP in hardware, a TCP/IP Offload Engine or TOE. The TOE used in an iWARP RNIC is significantly constrained compared to a general purpose TOE and therefore can operate with very high performance. See the Chelsio website for proof points. RoCE does not use TCP so does not need to deal with TCP overhead.

Q. Does RDMA not work with fibre channel?

A. They are totally different Transports (L4) and Networks (L3). That said, the FCIA is working with NVMe, Inc. on supporting NVMe over Fibre Channel in a standard to be promoted by T11.

How Ethernet RDMA Protocols iWARP and RoCE Support NVMe over Fabrics

NVMe (Non-Volatile Memory Express) over Fabrics is of tremendous interest among storage vendors, flash manufacturers, and cloud and Web 2.0 customers. Because it offers efficient remote and shared access to a new generation of flash and other non-volatile memory storage, it requires fast, low latency networks, and the first version of the specification is expected to take advantage of RDMA (Remote Direct Memory Access) support in the transport protocol.

Many customers and vendors are now familiar with the advantages and concepts of NVMe over Fabrics but are not familiar with the specific protocols that support it. Join us on January 26th for this live Webcast that will explore and compare the Ethernet RDMA protocols and transports that support NVMe over Fabrics and the infrastructure needed to use them. You’ll hear:

  • Why NVMe Over Fabrics requires a low-latency network
  • How the NVMe protocol is mapped to the network transport
  • How RDMA-capable protocols work
  • Comparing available Ethernet RDMA transports: iWARP and RoCE
  • Infrastructure required to support RDMA over Ethernet
  • Congestion management methods

The event is live, so please bring your questions. We look forward to answering them.

Ethernet Roadmap for Networked Storage Q&A

Almost 200 people attended our joint Webcast with the Ethernet Alliance: “The 2015 Ethernet Roadmap for Networked Storage.” We had a lot of great questions during the live event, but we did not have time to answer them all. As promised, we’ve compiled answers for all of the questions that came in. If you think of additional questions, please feel free to comment on this blog.

Q. What did you mean by parity of flash with HDD?

A. We were referring to the O’Reilly article in “Network Computing.”  O’Reilly is predicting parity in BOTH capacity and price in 2016.

Q. When do we expect IEEE standards ratification for 25G speed?

A. 2016.  You can see the exact schedule here.

Q. Do you envision the Enterprise, Cloud Providers, HPC, Financials getting rid of their 10/40GbE infrastructure and replacing that with 25/100GbE infrastructure in 2017? Will these customers deploy 100GbE/25GbE switch in the leaf layer in 2017?

A. Deployment will occur over a multi-year time span overall, if only because switch infrastructure is expensive to upgrade, as reflected in the Crehan Research forecast. New deployments will likely move to 25/100GbE as new switches with 100GbE downstream ports become available in 2016. Because the Cloud Service Providers are currently the most aggressive in driving new infrastructure purchases, they represent the largest early volumes for 25/100GbE. Enterprise is still in the midst of the transition from 1GbE to 10GbE.

Q. What are some of the developments on spanning-tree derivatives vs. Dijkstra-based derivatives such as OSPF and FSPF for switches?

A. Beyond the scope of this presentation on Ethernet.  Ethernet is defined by the IEEE for L1 and L2 in the ISO model.  Your questions are at L3 and L4, which are handled by organizations like the IETF.

Q. With all the speeds possible who is working on flow control?

A. Flow control at the 802.1 level is supported in the Layer 1/2 PHY & MAC by setting upper bounds on the delay through each layer which allows higher layers to comprehend the delays & response times to pause frames. Each new speed & PHY in 802.3 is accompanied by delay constraint specifications to support this.

Q.  Do you have an overlay graphic that shows the Ethernet RDMA roadmap?  If so, is Ethernet storage the primary driver for that technology?

A.  Beyond the scope of this presentation on Ethernet.  Ethernet is defined by the IEEE for L1 and L2 in the ISO model.  Your questions are at L3 and L4, which are handled by organizations like the IETF and the InfiniBand Trade Association.

Q. The adoption of faster and new Ethernet always has to do with the costs of acquiring new technology. How long do you think it will take to adopt/acquire faster Ethernet in datacenters now that the development is happening much faster than the last 20 years?

A. Please see the chart on slide 7 where Crehan Research predicts how fast the technology will diffuse into deployments.

Q. What do you expect as cost comparison between Ethernet and InfiniBand going forward?
Also, what work is being done to reduce latency?

A. Beyond the scope of this presentation.  Latency is primarily a consequence of design methodologies and semiconductor process technology, and thus under the control of the silicon device manufacturers.  Some vendors prioritize latency more than others.

Q. What’s the technical limitation as speeds go higher and higher?

A. A number of factors limit speeds going faster and faster, but the main problem is that materials attenuate signals as they travel at higher frequencies.

Q. Will 1GbE used for manageability purposes disappear from public cloud? If so, what is the expected time frame?

A. This is a choice for end users.  Most equipment is managed on a separate network for security concerns, but users can eliminate these management networks at any time.

Q. What are the relative market size predictions for the expanding number of standards (25G, 50G, 100G, 200G, etc.)?

A. See the Crehan Research forecast in the presentation.

Q. What is the major difference between SMF & MMF for the not so initiated?

A. SMF has a 9um core while MMF has a 50um core. Different lasers are used for each fiber type; at speeds above 10GbE, MMF typically reaches 100 meters, while SMF reaches from 500m to 10km.

Q. Will 25G be available through both copper and fibre connectivity?

A. Yes. IEEE 802.3 work is currently underway to specify 25Gb/s on twinax (“direct attach copper”) to 5 meters, printed circuit backplane up to ~1m, twisted pair copper to 30m, and multimode fiber to 100m. There is no technology barrier to 25G on SMF, just that a standards project to specify it has not started yet.

Q. This is interesting from a hardware viewpoint, but has nothing to do with storage yet.  Are we going to get to how this relates to storage other than saying flash drives are fast and only Ethernet can keep up?

A. Beyond the scope of this presentation on Ethernet.  Ethernet is defined by the IEEE for L1 and L2 in the ISO model.  Your questions are directed at the higher layers.  The key point of this webcast is that storage networking engineers need to pay much more attention to the Ethernet roadmap than they have historically, primarily because of NVM.

Q. How does “SFP28” fit in this mix?  Is it required for 25G?

A. SFP28 connectors and modules are required for 25GbE because they give better performance than SFP+, which only works up to 10GbE.

Q. Can you provide the quick difference between copper & optical on speed & costs?

A. Copper and optical Ethernet links are usually standardized at the same speed. The 400GbE standard is not defining a copper link, but an active Direct Attach Cable (DAC) will probably support 400GbE. Cost depends on volume and many other factors and is beyond the scope of this presentation. Copper is usually a fraction of the cost of optical links.

Q. Do you think people will try to use multiple CAT 5e to get more aggregate bandwidth to the access points to avoid having to run Fibre to them?

A. IEEE is defining 2.5GBASE-T and 5GBASE-T to enable Cat5e to support faster wireless access points.

Q. When are higher speeds and PoE going to reach the point when copper based Ethernet will become a viable heat source for buildings thus helping the environment?

A. :)  IEEE is defining 4 wire PoE to deliver at least 60W to end devices.  You can find out more here.

Q. What are the use cases for 2.5Gb and 5.0Gb Base-T?

A. The leading use case for 2.5G/5GBASE-T is to provide the uplink for wireless LAN access points that support 802.11ac and future wireless technology.  Wireless LAN technology has advanced to the point where >1Gb/s BW is needed upstream from the AP, and 2.5G/5G provide a higher speed uplink while preserving the user’s investment in Cat5e/Cat6 cabling.

Q. Why not have only CFP2 sockets right away with things disabled for lower speeds for all the intervening years leading to full-fledged CFP2?

A. CFP2 is defined for 100GbE and 8 ports can be used on a 1U switch. 100GbE switches are shifting to QSFP28 so that 32 ports of 100GbE is supported in a 1U switch at low cost.  The CFP2 is much more expensive than QSFP28 and will not be used for lower speeds because of the high cost.

Benefits of RDMA in Accelerating Ethernet Storage Q&A

At our recent live Webcast “Benefits of RDMA in Accelerating Ethernet Storage Connectivity” experts from Emulex, Intel and Microsoft had an insightful discussion on the ways RDMA is having an impact on Ethernet storage. The live event was attended by nearly 200 people and feedback was overwhelmingly positive, with several attendees thanking us for our vendor-neutral presentation and one attendee commenting that it was, “Probably the most clearly comprehensible yet comprehensive webinar I’ve attended in some time.” If you missed the Webcast, it’s now available on demand. We did not have time to get to everyone’s questions, so as promised, below are answers to all of them. If you have additional questions, please ask them in the comments section of this blog and we’ll get back to you as soon as possible.

Q. Is RDMA over RoCEv2 in production?

A. The IBTA released the RoCEv2 Specification in September 2014. In order to support that specification, changes may be required across the RDMA stack, including firmware, drivers and operating systems. Schedules for implementation of that specification will vary by operating system. For example, the OpenFabrics Alliance (OFA) has not released an OpenFabrics Enterprise Distribution (OFED) version that implements that standard yet, although it is in process now. Once OFA completes their OFED stack implementation, the Linux distribution vendors will then incorporate and support the updated OFED stack. Implementations provided prior to full OFA and distro vendor support would be preliminary, potentially incompatible with the OFED release, and would require confirmation by the distro vendor with regard to the nature and level of support they would be providing.

Q. I would have liked a list of Windows applications that take advantage of SMB Direct – either in a Hyper-V host or on bare metal.

A. In Windows, any file-based application can make use of SMB3 and SMB Direct due to the native file-based programming interface support. No application changes are required. For certain enterprise applications such as Hyper-V and SQL Server, SMB3 is officially supported, and more information can be found in the product catalog at www.microsoft.com.

Q. Are there any particular benefits in using one network protocol over another for SMB Direct/RDMA (iWARP vs. RoCE vs. IB)?

A. There are no hard and fast rules; any adapter or protocol can be suitable for many scenarios. Of the Ethernet-based protocols we considered in today’s webcast:

  • iWARP offers the benefit of operation over TCP with its reliability and routability, well-suited to a broad range of installed infrastructure.
  • RoCE offers a lightweight, efficient protocol when a DCB-enabled switched fabric is available. RoCE, however, is not routable.
  • RoCEv2 offers similar properties to RoCE, with the possibility to scale to larger routed and DCB-enabled fabrics.

Q. Who are the vendors offering iWARP capable RNICs?

A. Chelsio Communications has production iWARP adapters today, and both Intel and QLogic have publicly committed to future iWARP controllers.

Q. How much testing has been done with SMB3, and in particular SMB direct, over WAN connections?

A. The SMB2 protocol was originally designed to adapt to WAN scenarios, and supports a credit-based management of large amounts of data to be outstanding, to make best use of WAN-type long pipes. The SMB3 protocol retains these design attributes, and the SMB Direct protocol also supports similar deep pipelining. The iWARP protocol, being layered on standard TCP, is well suited to such deployments, and RoCE WAN adapters are potentially available. Please contact the respective technology vendors for information on any available testing results.

Q. I’d love a future webcast on RDMA-enabled distributed file systems.

A. Thanks for the suggestion! We’re always looking for ideas for future webcasts and SNIA-ESF will consider this as a potential follow-on.

Q. Is Live Migration the scenario where “packet size” is 1MB?

A. All SMB Direct scenarios have workloads that range anywhere up to 8MB. For large file copies, most SMB3 clients request from 1MB to 8MB per operation; for Hyper-V Live Migration, transfers are typically similar during the bulk transfer phase.

Q. SMB3 is being compared to FC for enterprise. If Ethernet based protocols are of interest, wouldn’t FCoE give the same performance as FC (same stack) vs. SMB3?

A. SMB3 with SMB Direct enables many workloads not possible with Fibre Channel over Ethernet, and performance comparisons are therefore difficult. Perhaps another SNIA webcast could investigate this!

Q. Regarding your SMB direct example with lots of small operations, how do you deal with the overhead of registering and unregistering buffers for the RDMA operations?

A. As answered later in the session, the registration and unregistration is not a protocol matter, but in the case of the Windows implementation, it is strictly performed for the specific buffers of each operation, which is critical for security, data integrity, and system protection. The standard “Fast Register Work Request” method is used, and careful implementation has shown that the overhead does not negatively impact performance, even for small I/O (4KB/operation). Check out Jose Barreto’s blog, which contains many benchmark results.

Q. But isn’t Live Migration done in 1MB “chunks”? So not “small” I/Os?

A. As answered later in the session, Hyper-V Live Migration is done in several phases, the first phase is the initial bulk copy of memory, done in large chunks, but immediately after it a second phase of copying individual pages which were dirtied by the live-running VM is performed. These operations are typically 4KB. Note: The faster the initial phase goes, the less work there is in this second phase, but in both phases, the faster, the better, and RDMA accelerates both.

Q. Are iSER and iWARP alternatives to one another?

A. They are complementary rather than alternatives: iWARP is an RDMA protocol, and iSER is a mapping of iSCSI onto RDMA transports, including iWARP as well as RoCE and InfiniBand.

Q. What’s Intel’s roadmap for RoCE and/or iWARP?

A. Intel is committed to iWARP and plans to incorporate it in future server chipsets and SOCs. See http://www.intel.com/content/www/us/en/ethernet-products/accelerating-ethernet-iwarp-video.html for more information.

Q. Is any transport other than IB being used to create a reliable transport for RoCEv2? Is it possible in principle?

A. RoCE was developed to leverage InfiniBand as much as possible. For that reason, the InfiniBand transport was chosen when the RoCE standard was developed. As the RoCEv2 standard was developed, the underlying InfiniBand network protocol was replaced with IPv4/IPv6 in order to provide Layer 3 routability, and UDP was used to provide stateless encapsulation (and indication) of the InfiniBand transport header that was retained. While it may be possible to develop a reliable transport to replace InfiniBand, the RoCE standards body has elected not to go that route as of this writing.