• Home
  • About

    The Great Debates – Our Next Webcast Series

    February 5th, 2018
    The SNIA ESF is announcing a new series of webcasts, following our hugely successful “Everything You Wanted To Know About Storage But Were Too Proud To Ask” webcasts. Those focussed on explaining storage technology from the ground up, and while they were pretty all encompassing in their storage technology coverage, they didn’t compare or contrast similar technologies that perform broadly similar functions. That’s what we’re going to do in our new “Great Debates” series, the first of which was “FC vs. iSCSI.” It’s now available on-demand. I encourage you to check it out. It’s a great debate with experts who really know their stuff. But wait… FC vs. iSCSI? That “versus” sounds more like an argument than a discussion. Was there a winner? Was this a technology fight, with a clear-cut winner and a loser? The answer is an emphatic “No!”  Continue Reading...

    FC vs. iSCSI – The Debate Continues

    December 31st, 2017
    It’s one of the great IT debates: Fibre Channel (FC) or iSCSI. We at the SNIA Ethernet Storage Forum thought this was this perfect way to kick off the New Year, so we’re hosting a live webcast “FC vs. iSCSI” on January 31st with experts who will not be afraid to highlight differences and compare and contrast these two storage protocols.  Continue Reading...

    Comparing iSCSI, iSER, and NVMe over Fabrics (NVMe-oF): Ecosystem, Interoperability, Performance, and Use Cases

    August 17th, 2017
    iSCSI is one of the most broadly supported storage protocols, but traditionally has not been associated with the highest performance. Newer protocols like iSER and NVMe over Fabrics promise extreme performance but are still maturing and lack the broad feature and platform support of iSCSI. Storage vendors and customers face interesting tradeoffs and options when evaluating how to achieve the highest block storage performance on Ethernet networks, while preserving the major software and hardware investment in iSCSI.  Continue Reading...

    Q&A on All Things iSCSI

    April 7th, 2017
    In the recent SNIA Ethernet Storage Forum iSCSI pod webcast, from our “Everything You Wanted To Know About Storage Part Were Too Proud to Ask” series, we discussed all things iSCSI. If you missed the live event, it’s now available on-demand. As promised, we’ve compiled all the webcast questions with answers from our panel of experts. If you have additional questions, please feel free to ask them in the comment field of this blog. I also encourage you to check out the other on-demand webcasts in this “Too Proud To Ask” series here and stay informed on upcoming events in this series by following us on Twitter @SNIAESF.  Continue Reading...

    Would You Like Some Rosé with Your iSCSI?

    February 3rd, 2017

    Would you like some rosé with your iSCSI? I’m guessing that no one has ever asked you that before. But we at the SNIA Ethernet Storage Forum like to get pretty colorful in our “Everything You Wanted To Know about Storage But Were Too Proud To Ask” webcast series as we group common storage terms together by color rather than by number.

    In our next live webcast, Part Rosé – The iSCSI Pod, we will focus entirely on iSCSI, one of the most used technologies in data centers today. With the increasing speeds for Ethernet, the technology is more and more appealing because of its relative low cost to implement. However, like any other storage technology, there is more here than meets the eye.

    We’ve convened a great group of experts from Cisco, Mellanox and NetApp who will start by covering the basic elements to make your life easier if you are considering using iSCSI in your architecture, diving into:

    • iSCSI definition
    • iSCSI offload
    • Host-based iSCSI
    • TCP offload

    Like nearly everything else in storage, there is more here than just a protocol. I hope you’ll register today to join us on March 2nd and learn how to make the most of your iSCSI solution. And while we won’t be able to provide the rosé wine, our panel of experts will be on-hand to answer your questions.

    Q&A on Exactly How iSCSI has Evolved

    June 3rd, 2016

    Our recent SNIA ESF Webcast, “The Evolution of iSCSI” drew a big and diverse group of attendees. From beginners looking for iSCSI basics, to experts with a lot of iSCSI deployment experience, there were plenty of good questions. Our presenters, Andy Banta and Fred Knight, did a great job answering as many as they could during the live event, but we didn’t have time to get to them all. So here are answers to them all. And by the way, if you missed the Webcast, it’s now available on-demand.

    Q. What are the top 3 reasons to choose iSCSI over FC SAN?

    A. 1. Use of commodity equipment and protocols. It means that you don’t have to set up a completely separate network. It means you don’t have to buy separate HBAs. 2. Inherent networking capability. Built on top of TCP/IP, it benefits from any networking technology to come along. These include routing, tunneling, authentication, encryption, etc. 3. Ease of automation and configuration. In it’s simplest form, an iSCSI host only needs to know the IP address of the target system. In more complex systems, hosts and storage provide APIs to allow automation through scripting or management tools.

    Q. Please comment on why SCSI went from being a widely used protocol for all sorts of devices to being focused as only essentially a storage protocol?

    A. SCSI was originally designed as both a protocol and a bus (original Parallel SCSI). Because there were no other busses, the SCSI bus did it all; disks, tapes, scanners, printers, Optical (CDs), media changers, etc. As other busses came onto the market (think USB), many of those devices moved to the new bus (CDs, printers, scanners, etc.) Commodity devices used commodity busses (IDE, SATA, USB), and enterprise devices used enterprise busses (FC, SAS); and so, disks, tapes, and media changers mostly stayed on SCSI.

    The name SCSI can be confusing for some, as the term originally was used for both the SCSI protocol and the SCSI bus. The term for the SCSI protocol is all that remains today; the SCSI bus (the old SCSI parallel bus) is no longer in wide use. Today, the FC bus, or the SAS bus, or the SoP bus, or the SRP bus are used to carry the SCSI protocol. The SCSI Architecture Model (SAM) describes a very distinct separation between the device layer (the SCSI protocol) and the transport layer (the bus).

    And, the SCSI command set has become the basis for many subsequent command sets. The JEDEC group used the SCSI command set as a model (JEDEC devices are in your cell phone), the ATAPI devices used SCSI commands, and many SCSI commands and SATA commands have a common heritage. The Mt. Fuji group (a standards group in Japan) also uses SCSI as the basis for new DVD and BlueRay devices. So, while not widely known, the SCSI command family has grown well beyond what is managed by the ANSI/INCITS T10 committee that originally defined SCSI in to a broad set of capabilities that are used across the industry, by a broad group of organizations. But, that all said, scanners and printers are still on USB, and SCSI is almost all about storage in one form or another.

    Q. How does iSCSI support software-defined storage?

    A. Answered during the talk. SDS provides more automation and knobs on the storage capabilities. But SDS still needs a way to transport the storage and iSCSI works perfectly fine for that. They are complementary technologies, not competing.

    Q. With 40Gb and faster coming soon to a server near you, what kind of impact will that have on CPU utilization? Will smaller servers be able to push that much traffic?

    A. More throughput simply requires more CPU. With good multithreaded drivers available, this can mean simply adding cores to keep the pipe as full as possible. As we mentioned near the end, using iSCSI with RDMA lightens the load on the CPU even more, so you’ll probably be seeing more of that.

    Q. Is IPSec commonly supported on iSCSI targets?  

    A. Yes, IPsec is required to be implemented on an iSCSI target to be a compliant device.  However, it is not commonly enabled by customers. If they MUST provide IPsec there are a lot of non-compliant initiators and targets on the market.

    Q. I’m told direct connect with iSCSI is discouraged, that there should be a switch in place to handle the buffering, latency, acknowledgement etc….. Is this true or a best practice to make sure switches are part of the design?

    A. If you have no need to connect to multiple targets or multiple initiators, there’s no harm in direct connections.

    Q. Ethernet was not designed to support storage traffic. The TCP/IP protocol suite was not designed to support storage traffic. SCSI was not designed to be encapsulated. So TCP/IP FTW? I think not. The reason iSCSI is exists is [perceived] cost savings. I get fed up with people constantly looking for ways to squeeze another penny out of something. To me it illustrates that they’re not very creative. Fibre Channel is a stupid name, but it is a purpose built protocol that works as designed to.

    A. Ethernet is a general purpose network. It is capable of handling lots of different traffic (including storage). By putting iSCSI onto an existing Ethernet infrastructure, it can (as you point out) create a substantial cost savings over installing a FC network (although that infrastructure savings comes with other costs – such as the impact of a shared wire). However, installing a dedicated Ethernet network provides many of the advantages of a dedicated FC network, but at an added cost over that of a shared Ethernet infrastructure. While most consider FC a purpose-built storage network, it is worth pointing out that some also consider it a general purpose network (for example FC-Avionics is built into Fighter Jets, and it’s not for storage). And while not designed to be encapsulated, (it was designed for a parallel bus), SCSI today is encapsulated on every transport that carries it (yes, that includes FCP and SAS).
    There are many kinds of storage at different price points, USB storage, SATA devices, rotating media (at different RPMs), SSD devices, SAS devices, FC devices, single spindles, arrays, cloud, drop boxes, etc., all with the corresponding transport wires. iSCSI is one of those wires. Each protocol and wire offer specific advantages and disadvantages.   There can be a lot of confusion about which to use, but just as everyone does not drive the same type car (a FORD FUSION for example), everyone does not need the same type of storage (FC devices/arrays). Yes, I drive a FORD FUSION, and I like FC storage, but I use a USB stick on my laptop, and I pray my bank never puts my financial records out in the cloud. Selecting the right storage (and wire) for the job at hand can be one of a system administrators most interesting problems to solve.As for the name – that is often what happens in committees…

    Q. As a best practice for Windows servers, disable hardware acceleration features in NICs (TOE etc.)? Are any NIC features valuable given modern multicore CPUs?

    A. Yes. Typically the only reason to disable TOE is that multiple or virtual TCP/IP stacks are going to be using the same NIC. TSO, LRO and jumbo frames will benefit any OS that can take advantage of them.

    Q. What is the advantage of iSCSI when compared with NVMe?

    A. NVMe and iSCSI are very different protocols. NVMe started life as a direct attach protocol to communicate to native PCIe devices (not even outside the box). iSCSI was a network protocol from day one. iSCSI has to deal with the potential for long network induced delays, and complex out of order error recovery issues. NVMe operates over an interlocked bus, and as such, does not have those issues.

    But, NVMe is now being extended over fabrics. NVMe over a RoCE V1 transport will be a data center network (since there is no IP routing). NVMe over a RoCE V2 transport or an iWARP transport will have the same routing capabilities that iSCSI has.   When it comes to the raw command set, they are very similar (but there are some differences). SCSI is a more full featured command set than NVMe – it has been developed over a span of over 25 years, and has developed solutions for all the problems that have been discovered during that time span. NVMe has a more limited (or more focused) command set (for example, there are no tape commands in the NVMe command set). iSCSI is available today, as is direct attach NVMe, but NVMe over Fabrics is still in the development phases (the specification is expected to be available the first week of June, 2016). NVMe products will take some time to mature and to develop solutions for the problems they have not discovered yet. Another example of this is the ability to support shared storage – it existed on day one in iSCSI, but did not exist in the first NVMe specification. To support shared storage in NVMe over Fabrics, that capability has since been added, and it was done using a SCSI compatible method (to make it easier for host S/W that already performs this function).

    There is a large community working to develop NVMe over Fabrics. As memory based storage device get cheaper, and the solution space matures, NVMe will become more attractive.

    Q. How often do iSCSI installations provide encryption of data in flight? How: IPsec, IKEv2-SCSI + ESP-SCSI, etc.?

    A. Rarely. More often than not, if in-flight data security is needed, it will be run on an isolated network. Well under 100% of installations are 100% compliant.  VMware never qualified IPsec with iSCSI and didn’t have any obvious switch to turn it on. Side note: We standards guys can be overly picky about words.  Since the question is “provide” the answer is – 100% of compliant installations PROVIDE encryption (IPsec V2 – see above), however, in practice, installations that require that type of security typically run on isolated networks, rather than turn on encryption.

    Q. How do multiple independent applications inside the same initiator map to iSCSI sessions to the same target? E.g., iSCSI session one-to-one with application?

    A. There is no relationship between applications and sessions. When an iSCSI initiator discovers a target, the initiator logs in and establishes a session. If iSCSI MCS (multi connection session) is being used, multiple TCP connections may be established and used in parallel to process operations for that session.

    Applications send reads and writes to the operating system. Those IO requests make their way through the file system and caching layers into the device driver. The device driver issues the IO request to the device (over the iSCSI session) and retains information about that IO. When a completion is received from a device (the WRITE command or READ command completed), it is matched up with the request. That completion status (success or error) is passed back through the operating system (file system, etc.) to the application. So it is the responsibility of the device driver to mux/demux the requests from all the applications out over the iSCSI session and track the responses as the operations are completed.

    When an operating system is using MPIO (multi-pathing), then the device driver may create multiple sessions between the initiator and the target. This is where operating system MPIO policies such as round-robin, shortest queue, LRU, etc. come into play. In this case, the MPIO driver will send an IO operation to the device using what it considers to be the most appropriate path (based on the selected policy). But again, there is no relationship between the application and the path used for IO (any application can have it’s IO send via any path).

    Today, MPIO is used more commonly than MCS.

    Q. Will Microsoft iSCSI implement iSER?

    A. This is a question for Microsoft or iSER-capable NIC vendor that provides Microsoft drivers.

    Q.Zadara has some iSER deployments using Linux and VMware clients going to the Zadara cloud storage.

    A. There’s an answer, all by itself.

    Q. In the case of iWARP, the TCP layer takes care of out-of-order IP packet receptions. What layer does the out-of-order management of packets in ROCE ?

    A. RoCE headers contain a 24 bit “Packet Sequence Number” that is used to validate the required ordering and detect lost packets. As such, ordering still occurs, just in a different way.

    Q. Correction: RoCE is over Ethernet packets and is not routable. RoCEv2 is the one over UDP/IP and *is* routable.

    A. You are correct. RoCE is not routable by IP. RoCE transmits raw Ethernet frames with just Ethernet MAC headers and no IP headers, and as such, it is not routable by IP. RoCE V2 puts the information into UDP packets (with appropriate IP headers), and therefore it is routable by IP.

    Q. How prevalent is iSER today in deployment? And what are some of the typical applications that leverage iSER?

    A. Not terribly prevalent today, but higher speed Ethernet might drive more adoption, due to the CPU savings demonstrated.




    Find out How iSCSI is Evolving

    May 4th, 2016

    The next Ethernet Storage Forum Webcast. “Evolution of iSCSI including iSER, iSCSI over RDMA Ethernet,” will focus on developments with iSCSI – the Internet Protocol standard for transferring SCSI commands across an Ethernet network, enabling hosts to link to storage devices wherever they may be.  At this Webcast on May 24th, I will be joined by Fred Knight, Standards Technologist at NetApp, and Andy Banta, Storage Janitor at SolidFire/NetApp, who will discuss the evolution of iSCSI up to iSER, which takes advantage of Ethernet RDMA fabric technologies to enhance performance. Register now to hear:

    • A brief history of iSCSI
    • How iSCSI works
    • IETF refinements to the specification
    • Enhancing iSCSI performance with iSER

    The Webcast will be live, so please bring your questions for Andy and Fred. We hope to see you there!

    Ethernet RDMA Protocols Support for NVMe over Fabrics – Your Questions Answered

    March 21st, 2016

    Our recent SNIA Ethernet Storage Forum Webcast on How Ethernet RDMA Protocols iWARP and RocE Support NVMe over Fabrics generated a lot of great questions. We didn’t have time to get to all of them during the live event, so as promised here are the answers. If you have additional questions, please comment on this blog and we’ll get back to you as soon as we can.

    Q. Are there still actual (memory based) submission and completion queues, or are they just facades in front of the capsule transport?

    A. On the host side, they’re “facades” as you call them. When running NVMe/F, host reads and writes do not actually use NVMe submission and completion queues. That data just comes from and to RNIC RDMA queues. On the target side, there could be real NVMe submissions and completion queues in play. But the more accurate answer is that it is “implementation dependent.”

    Q. Who places the command from NVMe queue to host RDMA queue from software standpoint?

    A. This is managed by the kernel host software in code written to the NVMe/F specification. The idea is that any existing application that thinks it is writing to the existing NVMe host software will in fact cause the SQE entry to be encapsulated and placed in an RDMA send queue.

    Q. You say “most enterprise switches” support NVMe/F over RDMA, I guess those are ‘new’ ones, so what is the exact question to ask a vendor about support in an older switch?

    A. For iWARP, any switch that can handle Internet traffic will do. Mellanox and Intel have different answers for RoCE / RoCEv2. Mellanox says that for RoCE, it is recommended, but not required, that the switch support Priority Flow Control (PFC). Most new enterprise switches support PFC, but you should check with your switch vendor to be sure. Intel believes RoCE was architected around DCB. The name itself, RoCE, stands for “RDMA over Converged Ethernet,” i.e., Ethernet with DCB. Intel believes RoCE in general will require PFC (or some future standard that delivers equivalent capabilities) for efficient RDMA over Ethernet.

    Q. Can you comment on when one should use RoCEv2 vs. iWARP?

    A. We gave a high-level overview of some of the deployment considerations on slide 30. We refer you to some of the vendor links on slide 32 for “non-vendor neutral” perspectives.

    Q. If you take RDMA out of equation, what is the key advantage of NVMe/F over other protocols? Is it that they are transparent to any application?

    A. NVMe/F allows the application to bypass the SCSI stack and uses native NVMe commands across a network. Most other block storage protocols require using the SCSI protocol layer, translating the NVMe commands into SCSI commands. With NVMe/F you also gain parallelism, simplicity of the command set, a separation between administrative sessions and data sessions, and a reduction of latency and processing required for NVMe I/O operations.

    Q. Is ROCE v1 compatible with ROCE v2?

    A. Yes. Adapters speaking RoCEv2 can also maintain RDMA connections with adapters speaking RoCEv1 because RoCEv2 ports are backwards interoperable with RoCEv1. Most of the currently shipping NICs supporting RoCE support both RoCEv1 and RoCEv2.

    Q. Are RoCE and iWARP the only way to use Ethernet as a fabric for NMVe/F?

    A. Initially yes; only iWARP and RoCE are supported for NVMe over Ethernet. But the NVM Express Working Group is also targeting FCoE. We should have probably been clearer about that, though it is noted on slide 11.

    Q. What about doing NVMe over Fibre Channel? Is anyone looking at, or doing this?

    A. Yes. This is not in scope for the first spec release, but the NVMe WG is collaborating with the FCIA on this. So NVMe over Fibre Channel is expected as another standard in the near future, to be promoted by T11.

    Q. Do RoCE and iWARP both use just IP addresses for management or is there a higher level addressing mechanism, and management?

    A. RoCEv2 uses the RoCE Connection Manager, and iWARP uses TCP connection management. They both use IP for addressing.

    Q. Are there other fabrics to run NVMe over fabrics? Can you do this over OmniPath or Infiniband?

    A. InfiniBand is in scope for the first spec release. Also, there is a related effort by the FCIA to support NVMe over Fibre Channel in a standard that will be promoted by T11.

    Q. You indicated NVMe stack is in kernel while RDMA is a user level verb. How are NVMe SQ/ CQ entries transferred from NVMe to RDMA and vice versa? Also, could smaller transfers in NVMe (e.g. SGL of 512B) combined to larger sizes before being sent to RDMA entries and vice versa?

    A. NVMe/F supports multiple scatter gather entries to combine multiple incontinuous transfers, nevertheless, the protocol doesn’t support chaining multiple NVMe commands on the same command capsule. A command capsule contains only a single NVMe command. Please also refer to slide 18 from the presentation.

    Q. 1) How do implementers and adopters today test NVMe deployments? 2) Besides latency, what other key performance indicators do implements and adopters look for to determine whether the NVMe deployment is performing well or not?

    A. 1) Like any other datacenter specification, testing is done by debugging, interop testing and plugfests. Local NVMe is well supported and can be tested by anyone. NVMe/F can be tested using pre-standard drivers or solutions from various vendors. UNH-IOH is an organization with an excellent reputation for helping here. 2) Latency, yes. But also sustained bandwidth, IOPS, and CPU utilization, i.e., the “usual suspects.”

    Q. If RoCE CM supports ECN, why can’t it be used to implement a full solution without requiring PFC?

    A. Explicit Congestion Notification (ECN) is an extension to TCP/IP defined by the IETF. First point is that it is a standard for congestion notification, not congestion management. Second point is that it operates at L3/L4. It does nothing to help make the L2 subnet “lossless.” Intel and Mellanox agree that generally speaking, all RDMA protocols perform better in a “lossless,” engineered fabric utilizing PFC (or some future standard that delivers equivalent capabilities). Mellanox believes PFC is recommended but not strictly required for RoCE, so RoCE can be deployed with PFC, ECN, or both. In contrast, Intel believes that for RoCE / RoCEv2 to deliver the “lossless” performance users expect from an RDMA fabric, PFC is in general required.

    Q. How involved are Ethernet RDMA efforts with the SDN/OCP community? Is there a coming example of RoCE or iWarp on an SDN switch?

    A. Good question, but neither RoCEv2 nor iWARP look any different to switch hardware than any other Ethernet packets. So they’d both work with any SDN switch. On the other hand, it should be possible to use SDN to provide special treatment with respect to say congestion management for RDMA packets. Regarding the Open Compute Project (OCP), there are various Ethernet NICs and switches available in OCP form factors.

    Q. Is there a RoCE v3?

    A. No. There is no RoCEv3.

    Q. iWARP and RoCE both fall back to TCP/IP in the lowest communication sense? So they are somewhat compatible?

    A. They can speak sockets to each other. In that sense they are compatible. However, for the usage model we’re considering here, NVMe/F, RDMA is required. Because of L3/L4 differences, RoCE and iWARP RNICs cannot speak RDMA to each other.

    Q. So in case of RDMA (ROCE or iWARP), the NVMe controller’s fabric port is Ethernet?

    A. Correct. But it must be RDMA-enabled Ethernet.

    Q. What if I am using soft RoCE, do I still need an RNIC?

    A. Functionally, soft RoCE or soft iWARP should work on a regular NIC. Whether the performance is sufficient to keep up with NVMe SSDs without the hardware offloads is a different matter.

    Q. How would the NVMe controller know that a command is placed in the submission queue by the Fabric host driver? Is the fabric host driver responsible for notifying the NVMe controller through remote doorbell trigger or the Fabric target driver should trigger the doorbell?

    A. No separate notification by the host required. The fabric’s host driver simply sends a command capsule to notify its companion subsystem driver that there is a new command to be processed. The way that the subsystem side notifies the backend NVMe drive is out of the scope of the protocol.

    Q. I am chair of ETSI NFV working group on NFV acceleration. We are working on virtual RDMA and how VM can benefit from hardware independent RDMA. One corner stone of this is virtual-RDMA pseudo device. But there is not yet consensus on minimal set of verbs to be supported: Do you think this minimal verb set can be identified? Last, the transport address space is not consistent between IB, Ethernet. How supporting transport independent RDMA?

    A. You know, the NVM Express Working Group is working on exactly these questions. They have to define a “minimal verb set” since NVMe/F generates the verbs. Similarly, I’d suggest looking to the spec to see how they resolve the transport address space differences.

    Q. What’s the plan for Linux submission of NVMe over Fabric changes? What releases are being targeted?

    A. The Linux Driver WG in the NVMe WG expects to submit code upstream within a quarter of the spec being finalized. At this time it looks like the most likely Linux target will be kernel 4.6, but it could end up being kernel 4.7.

    Q. Are NVMe SQ/CQ transferred transparently to RDMA Queues or can they be modified?

    A. The method defined in the NVMe/F specification entails a transparent transfer. If you wanted to modify an SQE or CQE, do so before initiating an NVMe/F operation.

    Q. How common are rNICs for recent servers? i.e. What’s a quick check I can perform to find out if my NIC is an rNIC?

    A. rNICs are offered by nearly all major server vendors. The best way to check is to ask your server or NIC vendor if your NIC supports iWARP or RoCE.

    Q. This is most likely out of the scope of this talk but could you perhaps share about 30K level on the differences between “NVMe controller” hardware versus “NVMeF” hardware. It’s most likely a combination of R-NIC+NVMe controller, but would be great to get your take on this.

    A goal of the NVMe/F spec is that it work with all existing NVMe controllers and all existing RoCE and iWARP RNICs.  So on even a very low level, we can say “no difference.”  That said, of course, nothing stops someone from combining NVMe controller and rNIC hardware into one solution.

    Q. Are there any example Linux targets in the distros that exercise RDMA verbs? An iWARP or iSER target in a distro?

    A. iSER allows this using a LIO or TGT SCSI target.

    Q. Is there a standard or IP for RDMA NIC?

    A. The various RNICs are based on IBTA, IETF, and IEEE standards are shown on slide 26.

    Q. What is the typical additional latency introduced comparing NVMe over Fabric vs. local NVMe?

    A. In the 2014 IDF demo, the prototype NVMe/F stack matched the bandwidth of local NVMe with a latency penalty of only 8µs over a local iWARP connection. Other demonstrations have shown an added fabric latency of 3µs to 15µs. The goal for the final spec is under 10µs.

    Q. How well is NVME over RDMA supported for Windows ?

    A. It is not currently supported, but then the spec isn’t even finished. Contract Microsoft if you are interested in their plans.

    Q. RDMA over Ethernet would not support Layer 2 switching? How do you deal with TCP over head?

    A. L2 switching is supported by both iWARP and RoCE. Both flavors of RNICs have MAC addresses, etc. iWARP had to deal with TCP/IP in hardware, a TCP/IP Offload Engine or TOE. The TOE used in an iWARP RNIC is significantly constrained compared to a general purpose TOE and therefore can operate with very high performance. See the Chelsio website for proof points. RoCE does not use TCP so does not need to deal with TCP overhead.

    Q. Does RDMA not work with fibre channel?

    A. They are totally different Transports (L4) and Networks (L3). That said, the FCIA is working with NVMe, Inc. on supporting NVMe over Fibre Channel in a standard to be promoted by T11.

    Life of a Storage Packet (Walk): Q&A

    November 27th, 2015

    We got a lot of great questions at our recent Ethernet Storage Forum webcast “The Life of a Storage Packet (Walk).” As promised, we’ve compiled all the questions with fairly detailed answers. We hope this blog helps to clear up any confusion or uncertainties. If you think of additional questions, please comment below and we’ll get back to you as soon as we can. Thanks to everyone who watched the live webcast. If you missed it, it’s now available on-demand.

    Q. Does size of a block depend on OS?

    A. Yes, some OS’ only support 512 byte blocks. Some OS’ provide a method to support both 512 byte blocks and 4096 (aka 4K) byte blocks; for example, via a qualifier in their format command. Some devices are built using 4K block sizes, but then emulate 512 byte blocks to the host (aka 512E devices). Many modern versions of OS’ automatically detect the block size of each device they discover and do the right thing based on what they discover. You have to check the documentation for your OS to know its capabilities.

    Q. CIFS is not a file system.

    A. Not in the same way that ext, ntfs, FAT, HFS, UFS, etc. are, no. However, in terms of certain functionalities that we discussed on the presentation – that is, the ability to manipulate files with operations such as read, write, create, delete, and rename – are all file system functionalities. The difference being, of course, that the files are not on the local computer and are actually on a remote computer.

    For what it’s worth, the term ‘CIFS’ has been deprecated from usage, and SMB is the preferred term for precisely the reasons that it should not be confused with local OS systems like the ones mentioned above.

    Q. Is it safe to say that a “block” at the file system level is equal to an IO?

    A. Not specifically. The difference in the use of those terms, is that a “block” is a place that data is stored – it has an address and it contains data (512 byte of data or 4K bytes of data). An IO is an operation that requests access to a block. An IO may perform a read operation on a block, or it may perform a write operation on a block.

    Remember, the term IOPS is I/O operations per second – so it is not about blocks or bytes or bits – it is operations. If the operations are performed on a 512 byte block, they produce a different number of MB/sec than if they operate on a 4K byte block.

    So, let’s take an example. A 1MB/second bandwidth on a 4K block size device is the same speed as 1MB/second bandwidth on a 512 byte block size device (observe, however, that the 4K block size device will have only 1/8 the IOPS that the 512 byte device has – because it will take only 1/8 the operations to transfer the same number of Mega-bytes). However, 1M IOPS on a 4K block size device is much better than 1M IOPS on a 512 byte block size device (because the 4K block size device is moving 8X the amount of data than the 512 byte block size device in each operation.

    There is an excellent explanation and walk-through of IOPS in our Storage Performance Benchmarking webinar.

    Q. Don’t all those Inodes also live on the disk and so don’t the IOs to read those blocks also have to go to the SCSI controller?

    A. Well-spotted!

    The communication back and forth between the file system and the Inodes traverses the controller for all access to blocks on disk. This is one of the reasons why needing to access disk is considered “expensive.” When you add in a network to the mix, these kinds of situations need particular careful consideration.

    Q. Shall we allocate blocks and inodes or is it an automatic process?

    A. It’s an automated process, and there is no user intervention at all.

    Q. Are Inodes created during OS installation?

    A. Inodes are a particular block type, among some other block types (e.g., data blocks, boot blocks, superblocks, group descriptor blocks). These block types are combined into functional groups. These groups are OS-dependent. The block layout therefore, including Inodes, is created during OS installation. If the filesystem needs more Inodes after the OS installation, the OS dynamically adds them to the Inode pool.

    Q. Which physical hardware does volume manager reside?

    A. Volume managers are not hardware. They are software layers that create pseudo devices that are presented to layers of the OS above them (typically to the file system layer). The volume manager software fits into the OS to accept requests from the file system and pass those requests down to the device driver. In some systems they might be called “filter drivers”.

    Q. In flash media, is there also an iSCSI controller that converts PCIe into iSCSI to interact with the flash?

    A. I want to make sure that the answer to the question is clear.

    On the host, we need to convert PCIe commands to SCSI, so we send them to an iSCSI controller/adapter to be sent across the Ethernet wire. . That adapter can be either software or hardware.

    Flash drives are basic media, just like spinning disk drives. Flash will have its own controller, which can be SCSI. If you wish to access the drive over Ethernet using the iSCSI protocol, you will have an adapter on the flash drive (which can be either software or hardware) that will do the SCSI translation between the Flash media and the SCSI commands. This is often called the FTL – or the flash translation layer. Again, the SCSI commands are translated into an Ethernet-friendly packet to be sent along the wire.

    There are other types of communication forms for working with Flash, too. The most recent is NVMe. You can see the SNIA webinar on The Performance Impact of NVMe and NVMe over Fabrics for more information.

    Q. Why is there a SCSI language in between Storage and the hosts?

    A. Before storage standards (like SCSI and ATA), you would purchase storage from Vendor X, and you would also buy a storage controller for that vendor X storage.  If you wanted Vendor Y storage, you could not use the vendor X controller, you had to purchase a new controller from vendor Y. Every vendor had their own language, and you had to purchase matching components.  Once you got locked into one vendor, you were stuck – at the hardware level.

    Today with standards (such as SCSI), you can buy a SCSI device from any vendor and connect it to any storage controller that you buy from any vendor – and it just works.  That is the point of the SCSI standard. When the storage standardization efforts began, there were many competing ideas. It just so happened that SCSI won out, and so now it’s everywhere.

    Q. If the storage side is flash, do we still need a SCSI controller between host and flash storage or is the controller different for a flash storage?

    Yes… and no.

    Yes, to use the SCSI part of the OS, the flash device must continue to speak the SCSI protocol and so a SCSI controller is needed. This method of communicating with flash devices enables all the existing S/W on the host OS to just continue working without even knowing it is flash.

    No, flash memory chips can be connected to the system using non-SCSI methods. Most of those methods are generally special purpose applications and so the existing OS S/W simply cannot use that device (only the special purpose S/W designed for that device can use it). However, NVMe is a new protocol that is enabling more general use of the flash memory technology with the hopes to provide new capabilities that are beyond what SCSI provides. This is also discussed in our webcast on The Performance Impact of NVMe and NVMe over Fabrics.

    Q. What’s the difference between partition, logical disk, volume, LUN, etc?

    Excellent question!

    Let’s work this one backwards – LUN – that is a SCSI term (actually an acronym) that refers to the Logical Unit Number – it is part of the address used to access a logical unit. Small SCSI devices (such as a single spindle disk drive) have only a single logical unit. Large storage arrays may contain 100s of logical units. Each of those logical units appears to the host OS as if it were a single spindle disk. So, the logical unit is the SCSI object that contains the blocks where the data is stored (where that object may be an individual piece of hardware, or a logical entity within a larger SCSI device). To access a SCSI logical unit, the OS must specify the address of the SCSI device, and then the LUN (logical unit number) for the logical unit within that SCSI device.

    The other terms (partition, logical disk, and volume) are OS terms that have to do with virtualization and how the storage blocks are managed. When a SCSI (or other storage device) is formatted by the OS, it may be broken into multiple partitions. Each partition is then treated by the OS as if it were a unique device (a virtual device, or a logical disk). Each of those partitions may then be used independently.

    For example, on a Unix system, the “a” partition may contain a file system that has all the files that are necessary to boot. The “b” partition may be setup without a file system and used as the swap or paging storage for use by the virtual memory subsystem. The “d” partition may then contain a file system that contains all the user’s data files. Each partition is unique storage space and may even use a different file system to organize the data located there.

    Notice, that I skipped the “c” partition. That partition is often setup to access all the blocks of the physical device. So, on a 500GB disk, maybe “a” contains 10GB, “b” contains 90GB, and “d” contains 400GB; while “c” contains all 500GB. Partitions “a” and “d” may be backed up or restored independently, and partition “c” may be used to perform an image copy of the entire device.

    Now, to the other terms:

    Logical disk and volume are terms often related to volume managers.

    • Logical disk may be a term used to refer to a partition, but usually that is not the case. Logical disks are typically the devices created by the volume management layer when they combine individual devices into a single larger device (a.k.a. a logical disk).
    • Volume managers also may be used to divide up a large device into small chunks (just like partitions), and those smaller chunks are referred to as logical disks.
    • Volume is a more vague term that typically is used as another term for a logical disk. In some circles, the term Volume is used to refer to a RAID set in a SCSI storage controller (but this is a much less often used definition for Volume).

    Q. Will the “complete” presentation somewhere we can go review?

    Yes. Click here to access the on-demand webcast as well as a PDF of the webcast slides

    Upcoming Plugfests at SDC

    September 4th, 2014

    This year’s SNIA Storage Developer Conference (SDC) will take place in Santa Clara, CA Sept. 15-18.  In addition to an exciting agenda with great speakers, there is an opportunity for vendors to participate in SNIA Plugfests. Two Plugfests that I think are worth noting are: SMB2/SMB3 and iSCSI.

    These Plugfests enable a vendor to bring their implementations of SMB2/SMB3 and/or iSCSI to test, identify, and fix bugs in a collaborative setting with the goal of providing a forum in which companies can develop interoperable products. SNIA provides and supports networks and infrastructure for the Plugfest, creating a collaborative framework for testing. Plugfest participants work together to define the testing process, assuring that objectives are accomplished.

    Still Time to Register

    Great news! There is still time to register. Setup for the Plugfest begins on September 13, 2014 and testing begins on the September 14th.

    Register here for the SMB2/SMB3 Plugfest

    Register here for the iSCSI Plugfest

    What to Expect at a Plugfest

    Learn more about what takes place at the Plugfests by watching the video interview of Jeremy Allison, Co-Creator of Samba, as he candidly talks about what to expect at an SDC Plugfest.

    Learn more about the Plugfest registration process. If you have additional questions, please contact Arnold Jones (arnold@snia.org).