The IETF, Consensus and NFSv4

The Internet Engineering Task Force (IETF) is one of the older, and more unusual, internet organizations. It first met in 1986 and has met several times a year ever since. The most recent meeting was IETF89, held March 2-7, 2014 in London, and I was fortunate to attend.

What Makes the IETF Unique

What’s unusual about the IETF? From my perspective as someone who spends most of his working day dealing with more traditional standards bodies, two things stand out.

One, (in its own words) “it exists as a collection of happenings, but is not a corporation and has no board of directors, no members, and no dues.” The non-members divide themselves into loosely organized groups that agree on an agenda, discuss the stuff of the internet on mailing lists, generate documents that reflect consensus, and then agree to them as standards.

Two, the London IETF89 meeting was not a conference. The IETF doesn’t do conferences; there are no formal papers given by luminaries or industry experts. There is an agenda, agreed beforehand by consensus (there’s that word again), and then a few brief presentations on topics of interest. There are questions from the floor, discussions, and agreement of one form or another. I didn’t see a single formal vote; just that ill-defined and unquantifiable consensus where the outcome is just, well, agreed on.

Why the IETF Works

Revolution! Anarchy! This is unusual for a standards body, and it sounds like a recipe for disaster. But strangely, it isn’t, and from what I saw of the process, I think I see why.

It’s because it’s attended by software and network engineers who see code as the concrete representation of a good idea. They value running code, or stuff that works. That’s a powerful advantage over academic discussions, or codifying and formalizing a good (sometimes not-so-good) idea that no-one has yet implemented or is ever likely to.

Why face to face though? I reckon that even revolutionaries and anarchists need validation and a sense of community, and there was much of that in evidence in the corridors and public spaces outside of the formal meeting. Everyone talks like there’s no tomorrow. Ideas everywhere, grounded in what can be shown to actually work.

I attended, amongst others, the NFSv4 workgroup meetings. The agenda and notes from the meeting give some flavor of this consensus, and I am truly impressed by the process. I’m also thankful that there is some organization; Sorin Faibish (EMC) took notes, Tom Haynes (NetApp) chaired the meeting and kept it moving along, and all in all it was a great illustration of the best the industry can do.

As to the technical content… well, you can read the minutes. There are notes on security discussions led by Andy Adamson, on features proposed for NFSv4.2, and getting an RFC in place that accurately reflects implementations of earlier versions of NFSv4 and more. I’ll be blogging about this and more over the next few months. In the meanwhile, in the spirit of the IETF that favors working code over ideas and the concrete over the abstract, I’ll be presenting “Practical Steps to Implementing pNFS and NFSv4.1” at DSIcon on April 22-24 in Santa Clara, CA. OK, this one’s a conference, and anarchy will be in short supply, but we can still have great discussions and arguments in the corridors and public spaces outside of the formal meetings. I look forward to seeing you there!

SMB 3.0 – Your Questions Asked and Answered

Last week we had a large and highly-engaged audience at our live Webcast: “SMB 3.0 – New Opportunities for Windows Environments.” We ran out of time before we could answer all the questions during the event so, as promised, here is a recap of the attendees’ questions and our answers. The Webcast is now available on demand at http://snia.org/forums/esf/knowledge/webcasts. You can also download a copy of the presentation slides there.

Q. Have you tested SMB Direct over 40Gb Ethernet or using RDMA?

A. SMB Direct has been demonstrated over 40Gb Ethernet (using TCP or RDMA) and over InfiniBand (using RDMA).

Q. 100 iops, really?

A. If you look at the bottom right of slide 27 (Performance Test Results) you will see that the vertical axis is IOPs/sec (Normalized). This is a common method for comparing alternative storage access methods on the same storage server. I think we could have done a better job in making this clear by labeling the vertical axis as “IOPs (Normalized).”

Q. How does SMB 3.0 weigh against NFS-4.1 (with pNFS)?

A. That’s a deep question that probably deserves a webcast of its own. SMB 3 doesn’t have anything like pNFS. However many Windows workloads don’t need the sophisticated distributed namespace that pNFS provides. If they do, the namespace is stitched together on the client using mounts and DFS-N.

Q. In the iSCSI ODX case, how does server1 (source) know about the filesystem structure being stored on the LUN (server2) i.e. how does it know how to send the writes over to the LUN?

A. The SMB server (source) does not care about the filesystem structure on the LUN (destination). The token mechanism only loosely couples the two systems. They must agree that the client has permission to do the copy and then they perform the actual copy of a set of blocks. Metadata for the client’s file system representing the copied file on the LUN is part of the client workflow. Client drag/drops file from share to mounted LUN. Client subsystem determines that ODX is available. Client modifies file system metadata on the LUN as part of the copy operation including block maps. ODX is invoked and the servers are just moving blocks.

Q. Can ODX copies be within the same share or only between?

A. There is no restriction to ODX in this respect. The source and destination of the copy can be on the same share, on different shares, or even on completely different protocols as illustrated in the presentation.

Q. Does SMB 3 provide an API for integration with storage vendor snapshots other than MS VSS?

A. Each storage vendor has to support the Microsoft Remote VSS protocol, which is part of the SMB 3.0 protocol specification. In Windows 2012 or Windows 8 the VSS APIs were extended to support UNC share paths.

Q. How does SMB 3 compare to iSCSI rather than FC?

A. Please examine slide 27, which compares SMB 3, FC and iSCSI on the same storage server configuration.

Q. I have a question between SMB and CIFS. I know both are the protocols used for sharing. But why is CIFS adopted by most of the storage vendors? We are using CIFS shares on our NetApps, and I have seen that most of the other storage vendors are also using CIFS on their NAS devices.

A. There has been confusion between the terms “SMB” and “CIFS” ever since CIFS was introduced in the 90s. Fundamentally, the protocol that manages the data transfer between client and server is SMB. Always has been. IMO CIFS was a marketing term created in response to Sun’s WebNFS. CIFS became popularized, with most SMB server vendors calling their product a CIFS server. Usage is slowly changing, but if you have a CIFS server, it talks SMB.

Q. What is required on the client? Is this a driver with multi-path capability? Is this at the file system level in the client? What is needed in transport layer for the failover?

A. No special software or driver is required on the client side as long as it is running Windows 8 or a later operating environment.

Q. Are all these new features cross-platform or is it something only supported by Windows?

A. SMB 3 implementations by different storage vendors will have some set of these features.

Q. Are virtual servers (cloud based) vs. non-virtual transition speeds greatly different?

A. The speed of a transition, i.e. failover is dependent on two steps. The first is the time needed to detect the failure and the second is the time needed to recover from that failure. While both a virtual and a physical server support transition the speed can significantly vary due to different network configurations. See more with next question.

Q. Is there latency as it fails over?

A. Traditionally SMB timeouts were associated with lower-level (i.e. TCP) timeouts. Client behavior has varied over the years, but a rule of thumb was detection of a failure in 45 seconds. This error would be passed up the stack to the user or application. With SMB 3 there is a new protocol called SMB Witness. A scale-out SMB server will include nodes providing SMB shares as well as those providing the Witness service. A client connects to both SMB and Witness. If the node hosting the SMB share fails, the Witness node will notify the client, indicating the new location for the SMB share. This can significantly reduce the time needed for detection. The scale-out SMB server can implement a proprietary mechanism to quickly detect node failure and trigger a Witness notification.

Q. Sync or Async?

A. Whether state movement between server nodes is sync or async depends on vendor implementation. Typically all updated state needs to be committed to stable storage before returning completion to the client.

Q. How fast is this transition with passing state id’s between hosts?

A. The time taken for the transition includes the time needed to detect the failure of Client A and the time needed to re-establish things using Client B. The time taken for both is highly dependent on the nature of the clustered app as well as the supported use case.

Q. We already have FC (using VMware), why drop down to SMB?

A. If you are using VMware with FC, then moving to SMB is not an option. VMware supports the use of NFS for hypervisor storage but not SMB.

Q. What are the top applications on SMB 3.0?

A. Hyper-V, MS-SQL, IIS

Q. How prevalent is true “multiprotocol sharing” taking place with common datasets being simultaneously accessed via SMB and NFS clients?

A. True “multiprotocol sharing” i.e. simultaneous access of a file by NFS & SMB clients is extremely rare. The NFS and SMB locking models don’t lend themselves to that. Sharing of a multiprotocol directory is an important use case. Users may want access to a common area from Linux, OS X and Windows. But this is sequential access by one OS/protocol at a time not all at once.

Q. Do we know growth % split between NFS and SMB?

A. There is no explicit industry tracker for the protocol split, and probably not much point in collecting such figures either, as the protocols aren’t really in competition. There is affinity among applications, OSes and protocols: Microsoft products tend toward SMB (Hyper-V, SQL Server, …) and non-Microsoft products toward NFS (VMware, Oracle, …). Cloud products at the point of consumption normally use RESTful HTTP protocols.

SUSE Announces NFSv4.1 and pNFS Support

SUSE, founded in 1992, provides an enterprise-ready Linux distribution in the form of SLES, the SUSE Linux Enterprise Server. Late last month (October 22, 2013), SUSE announced that SLES 11 with Service Pack 3 now supports the Linux client for NFSv4.1 and pNFS. This major distribution joins Red Hat’s RHEL (Red Hat Enterprise Linux) 6.4 as an enterprise-quality Linux distribution with support for files-based NFSv4.1 and pNFS.

For the adventurous, block and object pNFS support is available in the upstream kernel. Most regularly maintained distributions based on a Linux 3.1 or better kernel (if not all distributions now – check with the supplier of the distribution if you’re unsure) should provide the files, block and object compliant client directly in the download.

The future of pNFS looks very exciting. We now have a fully pNFS compliant Linux client, and a number of commercial files, block and object servers. Remember that although pNFS block and object support is available upstream, currently these distributions support only the pNFS files layout. For those users who do not need pNFS with block or object support and who require enterprise quality support, SUSE and Red Hat are an excellent solution.

pNFS and Future NFSv4.2 Features

In this third and final blog post on NFS (see previous blog posts Why NFSv4.1 and pNFS are Better than NFSv3 Could Ever Be and The Advantages of NFSv4.1) I’ll cover pNFS (parallel NFS), an optional feature of NFSv4.1 that improves the bandwidth available for NFS protocol access, and some of the proposed features of NFSv4.2 – some of which are already implemented in commercially available servers, but will be standardized with the ratification of NFSv4.2 (for details, see the IETF NFSv4.2 draft documents).

Finally, I’ll point out where you can get NFSv4.1 clients with support for pNFS today.

Parallel NFS (pNFS) and Layouts

Parallel NFS (pNFS) represents a major step forward in the development of NFS. Ratified in January 2010 and described in RFC 5661, pNFS depends on the NFS client understanding how a clustered filesystem stripes and manages data. It’s not an attribute of the data, but an arrangement between the server and the client, so data can still be accessed via non-pNFS and other file access protocols. pNFS benefits workloads with many small files, or very large files, especially those run on compute clusters requiring simultaneous, parallel access to data.

Clients request information about data layout from a Metadata Server (MDS), and get returned layouts that describe the location of the data. (Although often shown as separate, the MDS may or may not be standalone nodes in the storage system depending on a particular storage vendor’s hardware architecture.) The data may be on many data servers, and is accessed directly by the client over multiple paths. Layouts can be recalled by the server, as in the case for delegations, if there are multiple conflicting client requests. 

By allowing the aggregation of bandwidth, pNFS relieves performance issues that are associated with point-to-point connections. With pNFS, clients access data servers directly and in parallel, ensuring that no single storage node is a bottleneck. pNFS also ensures that data can be better load balanced to meet the needs of the client.
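To make the idea of direct, parallel access more concrete, here is a minimal sketch of how a client might split a byte range across data servers under plain round-robin striping. The `locate` function, the server names and the 64 KiB stripe unit are all hypothetical, and real pNFS layouts as described in RFC 5661 carry considerably more detail than this.

```python
from typing import List, Tuple

def locate(offset: int, length: int, stripe_unit: int,
           servers: List[str]) -> List[Tuple[str, int, int]]:
    """Split a byte range into (server, file_offset, length) pieces.

    Assumes simple round-robin striping: stripe i lives on server i % N.
    """
    pieces = []
    end = offset + length
    while offset < end:
        stripe = offset // stripe_unit
        # Bytes left before the current stripe unit ends.
        chunk = min(end - offset, (stripe + 1) * stripe_unit - offset)
        pieces.append((servers[stripe % len(servers)], offset, chunk))
        offset += chunk
    return pieces

# Hypothetical layout: three data servers, 64 KiB stripe unit, 200 KiB read.
layout = locate(offset=0, length=200 * 1024, stripe_unit=64 * 1024,
                servers=["ds1", "ds2", "ds3"])
for server, file_offset, size in layout:
    print(server, file_offset, size)
```

Each piece can be requested from its data server independently, which is where the aggregate bandwidth comes from.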

The pNFS specification also accommodates support for multiple layouts, defining the protocol used between clients and data servers. Currently, three layouts are specified; files as supported by NFSv4, objects based on the Object-based Storage Device Commands (OSD) standard (INCITS T10) approved in 2004, and block layouts (either FC or iSCSI access). The layout choice in any given architecture is expected to make a difference in performance and functionality. For example, pNFS object based implementations may perform RAID parity calculations in software on the client, to allow RAID performance to scale with the number of clients and to ensure end-to-end data integrity across the network to the data servers.

So although pNFS is new to the NFS standard, the experience of users with proprietary precursor protocols to pNFS shows that high bandwidth access to data with pNFS will be of considerable benefit.

Potential performance of pNFS is definitely superior to that of NFSv3 for similar configurations of storage, network and server. The management is definitely easier, as NFSv3 automounter maps and hand-created load balancing schemes are eliminated; and by providing a standardized interface, pNFS ensures fewer issues in supporting multi-vendor NFS server environments.

Some Proposed NFSv4.2 features

NFSv4.2 promises many features that end-users have been requesting, and that make NFS more relevant as not only an “every day” protocol, but one that has application beyond the data center. As the requirements document for NFSv4.2 puts it, there are requirements for:

  • High efficiency and utilization of resources such as capacity, network bandwidth, and processors.
  • Solid state flash storage which promises faster throughput and lower latency than magnetic disk drives and lower cost than dynamic random access memory.

Server Side Copy

Server-Side Copy (SSC) removes one leg of a copy operation. Instead of reading entire files or even directories of files from one server through the client, and then writing them out to another, SSC permits the destination server to communicate directly to the source server without client involvement, and removes the limitations on server to client bandwidth and the possible congestion it may cause.

Application Data Blocks (ADB)

ADB allows definition of the format of a file; for example, a VM image or a database. This feature will allow initialization of data stores; a single operation from the client can create a 300GB database or a VM image on the server.

Guaranteed Space Reservation & Hole Punching

As storage demands continue to increase, various efficiency techniques can be employed to give the appearance of a large virtual pool of storage on a much smaller storage system. Thin provisioning (where space appears available and reserved, but is not committed) is commonplace, but often problematic to manage in fast growing environments. The guaranteed space reservation feature in NFSv4.2 will ensure that, regardless of the thin provisioning policies, individual files will always have space available for their maximum extent.

While such guarantees are a reassurance for the end-user, they don’t help the storage administrator who wants to fully utilize all available storage. In support of better storage efficiencies, NFSv4.2 will introduce support for sparse files. With this feature, commonly called “hole punching”, deleted and unused parts of files are returned to the storage system’s free space pool.
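For readers who have not met sparse files before, the following is a small local illustration of the difference between a file's logical size and the space actually allocated to it. It runs against a local temporary directory rather than an NFS mount, says nothing about the NFSv4.2 operations themselves, and the helper name `make_sparse_file` is made up for the example. Whether any space is saved depends on the underlying filesystem.

```python
import os
import tempfile

def make_sparse_file(path: str, size: int) -> os.stat_result:
    """Create a file of `size` bytes whose contents are one large hole."""
    with open(path, "wb") as f:
        # Extending an empty file with truncate() writes no data; on
        # filesystems that support sparse files no data blocks are allocated.
        f.truncate(size)
    return os.stat(path)

with tempfile.TemporaryDirectory() as tmp:
    info = make_sparse_file(os.path.join(tmp, "sparse.img"), 64 * 1024 * 1024)
    logical = info.st_size
    # st_blocks is usually counted in 512-byte units (Linux, macOS);
    # the attribute is not available on Windows, hence getattr().
    allocated = getattr(info, "st_blocks", 0) * 512
    print(f"logical size: {logical} bytes, allocated: {allocated} bytes")
```

Hole punching goes one step further than this sketch: it releases blocks inside a file that has already been written, so the allocated figure shrinks while the logical size stays the same.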

Obtaining Servers and Clients

With this background on the features of NFS, there is considerable interest in the end-user community for NFSv4.1 support from both servers and clients. Many Network Attached Storage (NAS) vendors now support NFSv4, and in the last 12 months, there has been a flurry of activity and many developments in server support of NFSv4.1 and pNFS.

For NFS server vendors, there are NFSv4.1 and files based, block based and object based implementations of pNFS available; refer to the vendor websites, where you will get the latest up-to-date information.

On the client side, there is Red Hat Enterprise Linux 6.4, which includes full support for NFSv4.1 and pNFS (see www.redhat.com); Novell SUSE Linux Enterprise Server 11 SP2 with NFSv4.1 and pNFS, based on the 3.0 Linux kernel (see www.suse.com); and Fedora, available at fedoraproject.org.

Conclusion     

NFSv4.1 includes features intended to enable its use in global wide area networks (WANs).  These advantages include:

  • Firewall-friendly single port operations
  • Advanced and aggressive cache management features
  • Internationalization support
  • Replication and migration facilities
  • Optional cryptography quality security, with access control facilities that are compatible across UNIX® and Windows®
  • Support for parallelism and data striping

The goal for NFSv4.1 and beyond is to define how you get to storage, not what your storage looks like. That has meant inevitable changes. Unlike earlier versions of NFS, the NFSv4 protocol integrates file locking, strong security, operation coalescing, and delegation capabilities to enhance client performance for data sharing applications on high-bandwidth networks.

NFSv4.1 servers and clients provide even more functionality, such as wide striping of data to enhance performance. NFSv4.2 and beyond promise further enhancements to the standard that increase its applicability to today’s application requirements. NFSv4.2 is due to be ratified in August 2012, and we can expect to see server and client implementations that provide NFSv4.2 features soon after this; in some cases, the features are already being shipped now as vendor specific enhancements.

With careful planning, migration to NFSv4.1 (and NFSv4.2 when it becomes generally available) from prior versions can be accomplished without modification to applications or the supporting operational infrastructure, for a wide range of applications; home directories, HPC storage servers, backup jobs and a variety of other applications.

FOOTNOTE: Parts of this blog were originally published in Usenix ;login: February 2012 under the title The Background to NFSv4.1. Used with permission.

 

The Advantages of NFSv4.1

In a previous blog post, Why NFSv4.1 and pNFS are Better than NFSv3 Could Ever Be, some of the issues with NFSv3 that made it difficult to implement as a WAN based or data center wide protocol were discussed. The question then becomes: why not move to NFSv4 instead of NFSv4.1? Isn’t that a bigger leap from NFSv3?

Well, practical experience and some issues with NFSv4 made NFSv4.1 a necessity; for one, it introduces the key concept of sessions, and provides a foundation for pNFS (parallel NFS), which we’ll discuss in a later blog post. And all the features of NFSv4 were carried over into NFSv4.1, since it was a minor version update; there’s little more to do to take advantage of NFSv4.1, so that’s where you should focus your evaluation and implementation.

TCP for Transport

NFSv3 supports both TCP (Transmission Control Protocol) and UDP (User Datagram Protocol), and UDP is sometimes employed (for those applications that support it) because it is perceived to be lightweight and faster in comparison to TCP.

The downside of UDP is that it’s connectionless (that is, stateless) and an unreliable protocol. There is no guarantee that the datagrams will be delivered in any given order to the destination host — or even delivered at all — so applications must be specifically designed to handle missing, duplicate or incorrectly ordered data. UDP is also not a good network citizen; there is no concept of congestion or flow control, and no ability to apply quality of service (QoS) criteria.

The NFSv4 specification requires that any transport used provides congestion control. The easiest way to do this is via TCP. By using TCP, NFSv4 clients and servers are able to adapt to known frequent spikes in unreliability on the Internet; and retransmission is managed in the transport layer instead of in the application layer, greatly simplifying applications and their management on a shared network.

NFSv4 also introduces strict rules about retries over TCP in contrast to the complete lack of rules in NFSv3 for retries over TCP. As a result, if NFSv3 clients have timeouts that are too short, NFSv3 servers may drop requests. NFSv4 uses the timers that are built into the connection-oriented transport.

Network Ports

To access an NFS server, an NFSv3 client must contact the server’s portmapper to find the port of the mountd server. It then contacts the mount server to get an initial file handle, and again contacts the portmapper to get the port of the NFS server. Finally, the client can access the NFS server.

This creates problems for using NFS through firewalls, because firewalls typically filter traffic based on well-known port numbers. If the client is inside a firewalled network, and the server is outside the network, the firewall needs to know what ports the portmapper, mountd and nfsd servers are listening on. The mount server can listen on any port, so telling the firewall what port to permit is not practical. While the NFS server usually listens on port 2049, sometimes it does not. While the portmapper always listens on the same port (111), many firewall administrators, out of excessive caution, block requests to port 111 from inside the firewalled network to servers outside the network. As a result, NFSv3 is not practical to use through firewalls. (Aside from which, without security, it’s risky too.)

NFSv4 uses a single port number by mandating the server will listen on port 2049. There are no “auxiliary” protocols like statd, lockd and mountd required as the mounting and locking protocols have been incorporated into the NFSv4 protocol. This means that NFSv4 clients do not need to contact the portmapper, and do not need to access services on floating ports.

As NFSv4 uses a single TCP connection with a well-defined destination TCP port, it traverses firewalls and network address translation (NAT) devices with ease, and makes firewall configuration as simple as configuration for HTTP servers.
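Because only one well-known TCP port is involved, checking whether a firewall lets NFSv4 traffic through comes down to a single connection attempt. The sketch below shows one way to test that from a client; `tcp_port_open` is a hypothetical helper, and the demonstration connects to a throwaway local listener rather than a real NFS server, so it proves reachability only and not that an NFS service is answering.

```python
import socket

NFS_PORT = 2049  # the single well-known port used by NFSv4

def tcp_port_open(host: str, port: int = NFS_PORT, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, filtered, timed out or unresolvable: treat as closed.
        return False

# Demonstration against a temporary local listener (not an NFS server).
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]
reachable = tcp_port_open("127.0.0.1", demo_port)
listener.close()
print("reachable:", reachable)
```

In practice you would call `tcp_port_open("your.nfs.server")` with the default port; an equivalent check for NFSv3 would have to cover the portmapper, mountd and nfsd ports separately.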

Mounts and Automounter

The automounter daemons and the utilities on different flavors of UNIX and Linux are capable of identifying different NFS versions. However, using the automounter will require at least port 111 to be permitted through any firewall between server and client, as it uses the portmapper.

This is undesirable if you are extending the use of NFSv4 beyond traditional NFSv3 environments, so in preference the widely available “mirror mount” facility can be used. It enhances the behavior of the NFSv4 client by creating a new mountpoint whenever it detects that a directory’s fsid differs from that of its parent, and automatically mounts filesystems when they are encountered at the NFSv4 server.

This enhancement does not require the use of the automounter and therefore does not rely on the content or propagation of automounter maps, the availability of NFSv3 services such as mountd, or opening firewall ports beyond the single port 2049 required for NFSv4.

Internationalization Support; UTF-8

Yes, those funny characters outside of US-ASCII are supported. In a welcome recognition that the ASCII character set no longer provides the descriptive capabilities demanded by languages with larger alphabets or those that use an extensive range of non-Roman glyphs, NFSv4 uses UTF-8 for file names, directories, symlinks and user and group identifiers. As UTF-8 is backwards compatible with 7 bit encoded ASCII, any names that are 7 bit ASCII will continue to work.
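The backwards compatibility claim is easy to check for yourself. The short sketch below uses made-up file names to show that a 7 bit ASCII name has identical bytes in ASCII and UTF-8, while names outside ASCII simply take more than one byte per character.

```python
ascii_name = "report.txt"
# A 7 bit ASCII name encodes to the same bytes in ASCII and in UTF-8.
same = ascii_name.encode("ascii") == ascii_name.encode("utf-8")

# Hypothetical non-ASCII file names; \u00e9 is a precomposed e-acute.
names = ["r\u00e9sum\u00e9.doc", "日本語.txt"]
# Map each name to (number of characters, number of UTF-8 bytes).
sizes = {name: (len(name), len(name.encode("utf-8"))) for name in names}
print(same, sizes)
```

The difference between character count and byte count matters wherever name length limits are expressed in bytes.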

Compound RPCs

Latency in a wide area network (WAN) is a perennial issue, and is very often measured in tenths of a second to seconds. NFS uses Remote Procedure Calls (RPCs) to undertake all its communication with the server, and although the payload is normally small, meta-data operations are largely synchronous and serialized. Operations such as file lookup (LOOKUP), the fetching of attributes (GETATTR) and so on, make up the largest percentage by count of the average traffic load on NFS.

In versions prior to NFSv4, this typical mix of NFS RPC calls requires that each RPC call be a separate transaction over the wire. NFSv4 avoids the expense of single RPC requests and the attendant latency issues by allowing these calls to be bundled together. For instance, a lookup, open, read and close can be sent once over the wire, and the server can execute the entire compound call as a single entity. The effect is to reduce latency considerably for multiple operations.
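A back-of-the-envelope calculation shows why this matters on a WAN. The sketch below assumes a hypothetical 100 ms round trip, counts only wire latency, and ignores server processing time and payload transfer; `total_latency` is an illustrative helper, not part of any NFS client.

```python
def total_latency(operations: int, round_trip_s: float, compound: bool) -> float:
    """Wire latency for a run of dependent operations, ignoring server time."""
    # A compound request costs one round trip however many operations it holds.
    round_trips = 1 if compound else operations
    return round_trips * round_trip_s

# Hypothetical WAN with a 100 ms round trip and the four calls from the text.
ops = ["LOOKUP", "OPEN", "READ", "CLOSE"]
separate = total_latency(len(ops), 0.100, compound=False)
bundled = total_latency(len(ops), 0.100, compound=True)
print(f"separate RPCs: {separate:.1f} s, one compound: {bundled:.1f} s")
```

Under these assumptions the saving grows linearly with the number of operations bundled into the compound.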

Delegations

Servers are employing ever greater quantities of RAM and flash technologies, and very large caches on the order of terabytes are not uncommon. Applications running over NFSv3 can’t take advantage of these caches unless they have specific application support. With increasing WAN latencies, doing every IO over the wire introduces significant delay.

NFSv4 allows the server to delegate certain responsibilities to the client, a feature that allows caching locally where the data is being accessed. Once delegated, the client can act on the file locally with the guarantee that no other client has a conflicting need for the file; it allows the application to have locking, reading and writing requests serviced on the application server without any further communication with the NFS server. To prevent deadlocking conditions, the server can recall the delegation via an asynchronous callback to the client should there be a conflicting request for access to the file from a different client.

Migration, Replicas and Referrals

For broader use within a datacenter, and in support of high availability applications such as databases and virtual environments, copying data for backup and disaster recovery purposes, or the ability to migrate it to provide VM location independence are essential. NFSv4 provides facilities for both transparent replication and migration of data, and the client is responsible for ensuring that the application is unaware of these activities. An NFSv4 referral allows servers to redirect clients from this server’s namespace to another server; it allows the building of a global namespace while maintaining the data on discrete and separate servers.

Sessions

Perhaps one of the most significant features of NFSv4.1 is the introduction of stateful sessions. Sessions bring the advantages of correctness and simplicity to NFS semantics. In order to improve on the correctness of NFSv4, NFSv4.1 sessions introduce “exactly-once” semantics.

Servers maintain one or more session states in agreement with the client; they maintain the server’s state relative to the connections belonging to a client. Clients can be assured that their requests to the server have been executed, and that they will never be executed more than once.
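One way to picture exactly-once semantics is a per-slot reply cache: the server remembers the last sequence number it executed and the reply it sent, and replays that reply if the same request arrives again. The class below is a toy model of that idea only; `SessionSlot` and `append_record` are invented for illustration and do not reflect the NFSv4.1 wire protocol or any particular server.

```python
from typing import Callable, Optional

class SessionSlot:
    """Toy reply cache for a single session slot (not the wire protocol)."""

    def __init__(self) -> None:
        self.last_seq = 0
        self.cached_reply: Optional[str] = None

    def handle(self, seq: int, operation: Callable[[], str]) -> str:
        if seq == self.last_seq and self.cached_reply is not None:
            # Retransmission: replay the cached reply without re-executing.
            return self.cached_reply
        if seq != self.last_seq + 1:
            raise ValueError("sequence number out of order")
        self.cached_reply = operation()
        self.last_seq = seq
        return self.cached_reply

counter = {"appends": 0}

def append_record() -> str:
    """A non-idempotent operation: running it twice would be a bug."""
    counter["appends"] += 1
    return f"appended record {counter['appends']}"

slot = SessionSlot()
first = slot.handle(1, append_record)
retry = slot.handle(1, append_record)   # client resends after a lost reply
print(first, "|", retry, "|", counter["appends"])
```

The retry returns the same reply as the first attempt and the operation runs only once, which is the guarantee the paragraph above describes.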

Sessions extend the idea of NFSv4 delegations, which introduced server-initiated asynchronous callbacks; clients can initiate session requests for connections to the server. For WAN based systems, this simplifies operations through firewalls.

Security

Security is an area of great confusion; many believe that NFSv4 requires the use of strong security. The NFSv4 specification simply states that implementation of strong RPC security by servers and clients is mandatory, not the use of strong RPC security. This misunderstanding may explain the reluctance of users to migrate to NFSv4, given the additional work of implementing or modifying their existing Kerberos security.

Security is increasingly important as NFSv4 makes data more easily available over the WAN. This feature was considered so important by the IETF NFS working group that the security specification using Kerberos v5 was “retrofitted” to the NFSv2 and NFSv3 specifications.

Although access to an NFS filesystem without strong security such as provided by Kerberos is possible, across a WAN it should really be considered only as a temporary measure. In that spirit, it should be noted that NFSv4 can be used without implementing Kerberos security. The fact that it is possible does not make it desirable! A fuller description of the issues and some migration considerations can be found in the SNIA White Paper “Migrating from NFSv3 to NFSv4”.

Many of the practical issues faced in implementing robust Kerberos security in a UNIX environment can be eased by using a Windows Active Directory (AD) system. Windows uses the standard Kerberos protocol as specified in RFC 1510; AD user accounts are represented to Kerberos in the same way as accounts in UNIX realms. This can be a very attractive solution in mixed-mode environments.

In the next post, we’ll discuss one of the primary features of NFSv4.1; pNFS, or parallelized NFS, and some of the new work being done in support of NFSv4.2.
FOOTNOTE: Parts of this blog were originally published in Usenix ;login: February 2012 under the title The Background to NFSv4.1. Used with permission.

The Advantages of NFSv4.1

In a previous blog post Why NFSv4.1 and pNFS are Better than NFSv3 Could Ever Be, some of the issues with NFSv3 that made it difficult to implement as a WAN based or data center wide protocol were discussed. The question then becomes; why not move to NFSv4 instead of NFSv4.1? Isn’t that a bigger leap from NFSv3?

Well, practical experience and some issues with NFSv4 made NFSv4.1 a necessity; for one, it introduces the key concept of sessions, and provides a foundation for pNFS (parallel NFS) which we’ll discuss in a later blog post. And all the features of NFSv4 were carried over into NFSv4.1, since it was a minor version update; there’s little more to do to take advantage of NFSv4.1, so that’s where your focus evaluation and implementation should be.

TCP for Transport

NFSv3 supports both TCP (Transmission Control Protocol) and UDP (User Datagram Protocol), and UDP is sometimes employed (for those applications that support it) because it is perceived to be lightweight and faster in comparison to TCP.

The downside of UDP is that it’s connectionless (that is, stateless) and an unreliable protocol. There is no guarantee that the datagrams will be delivered in any given order to the destination host — or even delivered at all — so applications must be specifically designed to handle missing, duplicate or incorrectly ordered data. UDP is also not a good network citizen; there is no concept of congestion or flow control, and no ability to apply quality of service (QoS) criteria.

The NFSv4 specification requires that any transport used provides congestion control. The easiest way to do this is via TCP. By using TCP, NFSv4 clients and servers are able to adapt to known frequent spikes in unreliability on the Internet; and retransmission is managed in the transport layer instead of in the application layer, greatly simplifying applications and their management on a shared network.

NFSv4 also introduces strict rules about retries over TCP, whereas NFSv3 has no such rules at all. As a result, if NFSv3 clients have timeouts that are too short, NFSv3 servers may drop requests. NFSv4 uses the timers that are built into the connection-oriented transport.

Network Ports

To access an NFS server, an NFSv3 client must contact the server’s portmapper to find the port of the mountd server. It then contacts the mount server to get an initial file handle, and again contacts the portmapper to get the port of the NFS server. Finally, the client can access the NFS server.

This creates problems for using NFS through firewalls, because firewalls typically filter traffic based on well-known port numbers. If the client is inside a firewalled network, and the server is outside the network, the firewall needs to know what ports the portmapper, mountd and nfsd servers are listening on. The mount server can listen on any port, so telling the firewall what port to permit is not practical. While the NFS server usually listens on port 2049, sometimes it does not. While the portmapper always listens on the same port (111), many firewall administrators, out of excessive caution, block requests to port 111 from inside the firewalled network to servers outside the network. As a result, NFSv3 is not practical to use through firewalls. (Aside from which, without security, it’s risky too.)

NFSv4 uses a single port number by mandating that the server listens on port 2049. No “auxiliary” protocols like statd, lockd and mountd are required, as the mounting and locking protocols have been incorporated into the NFSv4 protocol. This means that NFSv4 clients do not need to contact the portmapper, and do not need to access services on floating ports.

As NFSv4 uses a single TCP connection with a well-defined destination TCP port, it traverses firewalls and network address translation (NAT) devices with ease, and makes firewall configuration as simple as configuration for HTTP servers.
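The difference in firewall exposure can be sketched as a small model. This is only an illustration: ports 111 and 2049 come from the text above, while the mountd port (20048 here) is an assumed value, since the mount server can listen on any port. The lockd and statd ports that NFSv3 also needs are left out for brevity.

```python
# Toy model of firewall requirements; not a real firewall configuration.
PORTMAPPER = 111
NFSD = 2049
ASSUMED_MOUNTD = 20048  # hypothetical: mountd may listen on any port


def required_ports(version: int, mountd_port: int = ASSUMED_MOUNTD) -> set:
    """Return the server ports a client must reach for an NFS major version."""
    if version >= 4:
        # Mounting and locking are part of the protocol: one well-known port.
        return {NFSD}
    # NFSv3: ask the portmapper, then talk to mountd, then to nfsd.
    return {PORTMAPPER, mountd_port, NFSD}


def can_traverse(version: int, allowed: set) -> bool:
    """True if every port the client needs is permitted by the firewall."""
    return required_ports(version) <= allowed


firewall_allows = {NFSD}  # a firewall that permits only port 2049
print(can_traverse(4, firewall_allows))  # True
print(can_traverse(3, firewall_allows))  # False
```

With a single rule for port 2049 the NFSv4 client gets through, while the NFSv3 client in this model would also need the portmapper and a floating mountd port opened.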

Mounts and Automounter

The automounter daemons and the utilities on different flavors of UNIX and Linux are capable of identifying different NFS versions. However, using the automounter will require at least port 111 to be permitted through any firewall between server and client, as it uses the portmapper.

This is undesirable if you are extending the use of NFSv4 beyond traditional NFSv3 environments, so in preference the widely available “mirror mount” facility can be used. It enhances the behavior of the NFSv4 client by creating a new mountpoint whenever it detects that a directory’s fsid differs from that of its parent, so that filesystems are mounted automatically as they are encountered at the NFSv4 server.

This enhancement does not require the use of the automounter and therefore does not rely on the content or propagation of automounter maps, the availability of NFSv3 services such as mountd, or opening firewall ports beyond the single port 2049 required for NFSv4.
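The fsid comparison behind mirror mounts has a rough local analogue that can be tried on any UNIX-like system: a directory whose device number differs from its parent’s sits on a different filesystem. The sketch below is only that analogy, using `st_dev` from `os.stat` as a stand-in for the fsid; it is not how an NFSv4 client is implemented, and the function name is made up for illustration.

```python
import os


def crosses_filesystem(path: str) -> bool:
    """True if `path` is on a different filesystem from its parent directory.

    st_dev stands in for the NFSv4 fsid attribute in this local analogy.
    The root directory is its own parent, so it always returns False here.
    """
    absolute = os.path.abspath(path)
    parent = os.path.dirname(absolute)
    return os.stat(absolute).st_dev != os.stat(parent).st_dev


# Report which of a few common directories are filesystem boundaries.
for candidate in ("/", "/proc", "/home", "/tmp"):
    if os.path.exists(candidate):
        print(candidate, crosses_filesystem(candidate))
```

A mirror-mounting client makes the same kind of comparison as it walks the server’s namespace, and creates a mountpoint where the two identifiers differ.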

Internationalization Support; UTF-8

Yes, those funny characters outside of US-ASCII are supported. In a welcome recognition that the ASCII character set no longer provides the descriptive capabilities demanded by languages with larger alphabets or those that use an extensive range of non-Roman glyphs, NFSv4 uses UTF-8 for file names, directories, symlinks and user and group identifiers. As UTF-8 is backwards compatible with 7 bit encoded ASCII, any names that are 7 bit ASCII will continue to work.
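The backwards compatibility is easy to check for yourself. In this small sketch (the file names are invented examples), a 7 bit ASCII name produces identical bytes under both encodings, while a name with accented characters simply takes more bytes in UTF-8.

```python
def same_bytes_in_ascii_and_utf8(name: str) -> bool:
    """True if `name` is pure ASCII and encodes to the same bytes in UTF-8."""
    try:
        return name.encode("ascii") == name.encode("utf-8")
    except UnicodeEncodeError:
        # The name contains characters outside 7 bit ASCII.
        return False


ascii_name = "report_2012.txt"
accented_name = "r\u00e9sum\u00e9.txt"  # "resume.txt" with two e-acute characters

print(same_bytes_in_ascii_and_utf8(ascii_name))     # True
print(same_bytes_in_ascii_and_utf8(accented_name))  # False
print(len(accented_name), len(accented_name.encode("utf-8")))  # 10 12
```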

Compound RPCs

Latency in a wide area network (WAN) is a perennial issue, and is very often measured in tenths of a second to seconds. NFS uses Remote Procedure Calls (RPCs) to undertake all its communication with the server, and although the payload is normally small, meta-data operations are largely synchronous and serialized. Operations such as file lookup (LOOKUP), the fetching of attributes (GETATTR) and so on, make up the largest percentage by count of the average traffic load on NFS.

In versions prior to NFSv4, each RPC call in this typical mix is a separate transaction over the wire. NFSv4 avoids the expense of single RPC requests and the attendant latency issues by allowing these calls to be bundled together. For instance, a lookup, open, read and close can be sent once over the wire, and the server can execute the entire compound call as a single entity. The effect is to reduce latency considerably for multiple operations.
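A back-of-the-envelope calculation shows why this matters on a WAN. The sketch below counts round trips only; it assumes a 100 ms round-trip time as an example figure and ignores server processing and data transfer time.

```python
def wire_latency_ms(operations, rtt_ms, compound):
    """Estimate time spent waiting on round trips for a list of operations."""
    if not operations:
        return 0
    # A compound request carries every operation in one round trip;
    # otherwise each operation pays for its own round trip.
    round_trips = 1 if compound else len(operations)
    return round_trips * rtt_ms


ops = ["LOOKUP", "OPEN", "READ", "CLOSE"]
print(wire_latency_ms(ops, rtt_ms=100, compound=False))  # 400
print(wire_latency_ms(ops, rtt_ms=100, compound=True))   # 100
```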

Delegations

Servers are employing ever greater quantities of RAM and flash technologies, and very large caches on the order of terabytes are not uncommon. Applications running over NFSv3 can’t take advantage of these caches unless they have specific application support. With increasing WAN latencies, doing every IO over the wire introduces significant delay.

NFSv4 allows the server to delegate certain responsibilities to the client, a feature that allows caching locally where the data is being accessed. Once delegated, the client can act on the file locally with the guarantee that no other client has a conflicting need for the file; it allows the application to have locking, reading and writing requests serviced on the application server without any further communication with the NFS server. To prevent deadlocking conditions, the server can recall the delegation via an asynchronous callback to the client should there be a conflicting request for access to the file from a different client.
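The grant-and-recall cycle can be pictured with a toy model. This is a deliberately simplified sketch, not the protocol: the class and method names are invented, there is one delegation per file, and the recall is recorded in a list rather than sent as a callback.

```python
class ToyDelegationServer:
    """Grants one delegation per file and recalls it on a conflicting open."""

    def __init__(self):
        self.delegations = {}  # file path -> client currently holding it
        self.recalls = []      # (client, path) pairs the server called back

    def open(self, client, path):
        holder = self.delegations.get(path)
        if holder is None or holder == client:
            # No conflict: the client may cache and lock locally.
            self.delegations[path] = client
            return "delegated"
        # Conflict: recall the delegation from the current holder.
        self.recalls.append((holder, path))
        del self.delegations[path]
        return "not delegated"


server = ToyDelegationServer()
print(server.open("client-a", "/data/file"))  # delegated
print(server.open("client-b", "/data/file"))  # not delegated
print(server.recalls)                         # [('client-a', '/data/file')]
```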

Migration, Replicas and Referrals

For broader use within a datacenter, and in support of high availability applications such as databases and virtual environments, the ability to copy data for backup and disaster recovery purposes, or to migrate it to provide VM location independence, is essential. NFSv4 provides facilities for both transparent replication and migration of data, and the client is responsible for ensuring that the application is unaware of these activities. An NFSv4 referral allows a server to redirect clients from its own namespace to another server; this allows the building of a global namespace while the data remains on discrete and separate servers.

Sessions

Perhaps one of the most significant features of NFSv4.1 is the introduction of stateful sessions. Sessions bring the advantages of correctness and simplicity to NFS semantics. In order to improve on the correctness of NFSv4, NFSv4.1 sessions introduce “exactly-once” semantics.
Servers maintain one or more session states in agreement with the client; they maintain the server’s state relative to the connections belonging to a client. Clients can be assured that their requests to the server have been executed, and that they will never be executed more than once.
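One way to picture “exactly-once” semantics is a reply cache keyed by a per-slot sequence number: a new sequence number is executed, a repeated one is answered from the cache. The sketch below is a simplified model under that assumption; the class name and error handling are invented for illustration and leave out most of what a real NFSv4.1 session tracks.

```python
class ToySessionSlot:
    """One session slot with a single-entry reply cache (simplified)."""

    def __init__(self):
        self.last_seq = 0         # highest sequence number executed so far
        self.cached_reply = None  # reply to that request

    def handle(self, seq, operation):
        if seq == self.last_seq + 1:
            # New request: execute it once and remember the reply.
            self.cached_reply = operation()
            self.last_seq = seq
            return self.cached_reply
        if seq == self.last_seq and seq > 0:
            # Retransmission: replay the cached reply, do not re-execute.
            return self.cached_reply
        raise ValueError(f"misordered sequence number {seq}")


executions = []


def append_record():
    """A non-idempotent operation: running it twice would be a bug."""
    executions.append("write")
    return len(executions)


slot = ToySessionSlot()
first = slot.handle(1, append_record)
retry = slot.handle(1, append_record)  # client retransmits after a timeout
print(first, retry, len(executions))   # 1 1 1
```

The retried request gets the same answer as the original, and the operation itself has run only once.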

Sessions extend the idea of NFSv4 delegations, which introduced server-initiated asynchronous callbacks; clients can initiate session requests for connections to the server. For WAN based systems, this simplifies operations through firewalls.

Security

Security is an area of great confusion; many believe that NFSv4 requires the use of strong security. The NFSv4 specification simply states that implementation of strong RPC security by servers and clients is mandatory, not the use of strong RPC security. This misunderstanding may explain the reluctance of users to migrate to NFSv4, given the additional work in implementing or modifying their existing Kerberos security.

Security is increasingly important as NFSv4 makes data more easily available over the WAN. This feature was considered so important by the IETF NFS working group that the security specification using Kerberos v5 was “retrofitted” to the NFSv2 and NFSv3 specifications.

Although access to an NFS filesystem without strong security such as provided by Kerberos is possible, across a WAN it should really be considered only as a temporary measure. In that spirit, it should be noted that NFSv4 can be used without implementing Kerberos security. The fact that it is possible does not make it desirable! A fuller description of the issues and some migration considerations can be found in the SNIA White Paper “Migrating from NFSv3 to NFSv4”.

Many of the practical issues faced in implementing robust Kerberos security in a UNIX environment can be eased by using a Windows Active Directory (AD) system. Windows uses the standard Kerberos protocol as specified in RFC 1510 (since obsoleted by RFC 4120); AD user accounts are represented to Kerberos in the same way as accounts in UNIX realms. This can be a very attractive solution in mixed-mode environments.

In the next post, we’ll discuss one of the primary features of NFSv4.1: pNFS, or parallel NFS, and some of the new work being done in support of NFSv4.2.
FOOTNOTE: Parts of this blog were originally published in Usenix ;login: February 2012 under the title The Background to NFSv4.1. Used with permission.

Why NFSv4.1 and pNFS are Better than NFSv3 Could Ever Be

NFSv4 has been a standard file sharing protocol since 2003, but has not been widely adopted, partly because NFSv3 was “just good enough”. Yet NFSv4 improves on NFSv3 in many important ways, and NFSv4.1 is a further improvement on that. In this post, I explain how NFSv4.1 is better suited to a wide range of datacenter and HPC use than its predecessors NFSv3 and NFSv4, and provide resources for migrating from NFSv3 to NFSv4.1. And, most importantly, I make the argument that users should, at the very least, be evaluating and deploying NFSv4.1 for use in new projects; and ideally, should be using it wholesale in their existing environments.

The background to NFSv4.1
NFSv3 (specified in RFC-1813, but never an Internet standard), the popular successor to NFSv2, was first released in 1995 by Sun. NFSv3 has proved a popular and robust protocol over the 15 years it has been in use, and with wide adoption it soon eclipsed some of the early competitive UNIX-based filesystem protocols such as DFS and AFS. NFSv3 was extensively adopted by storage vendors and OS implementers beyond Sun’s Solaris; it was available on an extensive list of systems, including IBM’s AIX, HP’s HP-UX, Linux and FreeBSD. Even non-UNIX systems adopted NFSv3: Mac OS, OpenVMS, Microsoft Windows, Novell NetWare, and IBM’s AS/400 systems. In recognition of the advantages of interoperability and standardization, Sun relinquished control of future NFS standards work; the work leading to NFSv4 proceeded by agreement between Sun and the Internet Society (ISOC), and is undertaken under the auspices of the Internet Engineering Task Force (IETF).

In April 2003, the Network File System (NFS) version 4 Protocol was ratified as an Internet standard, described in RFC-3530, which superseded NFSv3. This was the first open filesystem and networking protocol from the IETF. NFSv4 introduces the concept of state to ameliorate some of the less desirable features of NFSv3, along with other enhancements to improve usability, management and performance.

But shortly following its release, an Internet draft written by Garth Gibson and Peter Corbett outlined several problems with NFSv4; specifically, that of limited bandwidth and scalability, since NFSv4 like NFSv3 requires that access is to a single server. NFSv4.1 (as described in RFC-5661, ratified in January 2010) was developed to overcome these limitations, and new features such as parallel NFS (pNFS) were standardized to address these issues.

NFSv4.2 is now moving towards ratification. In a change to the original IETF NFSv4 development work, where each revision took a significant amount of time to develop and ratify, the workgroup charter was modified to ensure that there would be no large standards documents that took years to develop, such as RFC-5661, and that additions to the standard would be an on-going yearly process. With these changes in the processes leading to standardization, features that will be ratified in NFSv4.2 (expected in early 2013) are available from many vendors and suppliers now.

Adoption of NFSv4.1
Every so often, I and others in the industry run Birds-of-a-Feather (BoF) sessions on the availability of NFSv4.1 clients and servers, and on the adoption of NFSv4.1 and pNFS. At our latest BoF at LISA ’12 in San Diego in December 2012, many of the attendees agreed: it’s time to move to NFSv4.1.

While there have been many advances and improvements to NFS, many users have elected to continue with NFSv3. NFSv4.1 is a mature and stable protocol with many advantages in its own right over its predecessors NFSv3 and NFSv2, yet adoption remains slow. Adequate for some purposes, NFSv3 is a familiar and well understood protocol; but with the demands being placed on storage by exponentially increasing data and compute growth, NFSv3 has become increasingly difficult to deploy and manage.

In essence, NFSv3 suffers from problems associated with statelessness. While some protocols such as HTTP and other RESTful APIs see benefit from not associating state with transactions – it considerably simplifies application development if no transaction from client to server depends on another transaction – in the NFS case, statelessness has led, amongst other downsides, to performance and lock management issues.

NFSv4.1 and parallel NFS (pNFS) address well-known NFSv3 “workarounds” that are used to obtain high bandwidth access; users that employ (usually very complicated) NFSv3 automounter maps and modify them to manage load balancing should find pNFS provides comparable performance that is significantly easier to manage.

So what’s the problem with NFSv3?
Extending the use of NFS across the WAN is difficult with NFSv3. Firewalls typically filter traffic based on well-known port numbers, but if the NFSv3 client is inside a firewalled network, and the server is outside the network, the firewall needs to know what ports the portmapper, mountd and nfsd servers are listening on. As a result of this promiscuous use of ports, the multiplicity of “moving parts” and a justifiable wariness on the part of network administrators to punch random holes through firewalls, NFSv3 is not practical to use in a WAN environment. By contrast, NFSv4 integrates many of these functions, and mandates that all traffic (now exclusively TCP) uses the single well-known port 2049.


Plus, NFSv3 is very chatty for WAN usage; and there may be many messages sent between the client and the server to undertake simple activities, such as finding, opening, reading and closing a file. NFSv4 can compound these operations into a single RPC (Remote Procedure Call) and reduce considerably the back-and-forth traffic across the network. The end result is reduced latency.

One of the most annoying NFSv3 “features” has been its handling of locks. Although NFSv3 is stateless, the essential addition of lock management (NLM) to prevent file corruption by competing clients means NFSv3 application recovery is slowed considerably. Very often stale locks have to be manually released, and the lock management is handled external to the protocol. NFSv4’s built-in lock leasing, lock timeouts, and client-server negotiation on recovery simplify management considerably.
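The idea behind lock leasing can be sketched in a few lines: a lock stays valid only while its holder keeps renewing it, so a crashed client’s locks expire on their own instead of needing manual release. The model below is an illustration only; the class is invented, the 90-second lease is an assumed example value, and time is passed in explicitly to keep the example deterministic.

```python
class LeasedLock:
    """A lock that stays valid only while its lease is renewed in time."""

    def __init__(self, owner, lease_seconds, now):
        self.owner = owner
        self.lease_seconds = lease_seconds
        self.expires_at = now + lease_seconds

    def is_valid(self, now):
        return now < self.expires_at

    def renew(self, now):
        """Extend the lease; fails if the lease has already expired."""
        if not self.is_valid(now):
            return False
        self.expires_at = now + self.lease_seconds
        return True


lock = LeasedLock("client-a", lease_seconds=90, now=0)
print(lock.renew(now=60))      # True: renewed in time, now expires at 150
print(lock.is_valid(now=149))  # True
print(lock.is_valid(now=150))  # False: the client went quiet, lock lapses
print(lock.renew(now=200))     # False: too late to renew
```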

In a change from NFSv3, these locking and delegation features make NFSv4 stateful, but the simplicity of the original design is retained through well-defined recovery semantics in the face of client and server failures and network partitions. These are just some of the benefits that make NFSv4.1 desirable as a modern datacenter protocol, and for use in HPC, database and highly virtualized applications.

NFSv3 is extremely difficult to parallelize, and often takes some vendor-specific “pixie dust” to accomplish. In contrast, pNFS with NFSv4.1 brings parallelization directly into the protocol; it allows many streams of data to multiple servers simultaneously, and it supports files as per usual, along with block and object support through an extensible layout mechanism. The management is definitely easier, as NFSv3 automounter maps and hand-created load-balancing schemes are eliminated and, by providing a standardized interface, pNFS ensures fewer issues in supporting multi-vendor NFS server environments.

Next post; the Advantages of NFSv4.1

FOOTNOTE: Parts of this blog were originally published in Usenix ;login: February 2012 under the title The Background to NFSv4.1. Used with permission.

NFSv4.1 Webcast Q&A

Our recent Webcast: NFSv4.1 – Plan for a Smooth Migration was very well received and well attended. We thank everyone who was able to make the live event. For those of you who couldn’t make it, it’s now available on demand. Check it out here.

There wasn’t enough time to respond to all of the questions during the Webcast, so we have consolidated answers to all of them in this blog post from the presentation team. Feel free to comment and provide your input.

Q. Will NFS 4.2 be any easier to migrate to than 4.1? Would it be worth waiting for?

A. NFSv4.2 is a set of additional functionality that will be easy to take advantage of – if you’re on NFSv4.1. The first move is to NFSv4.1, as it offers a wealth of features over and above NFSv3. Waiting for NFSv4.2 features wouldn’t be advisable; it’s unlikely to be ratified until the end of 2012, and enterprise server solutions and the required downstream client distributions will be a lot further out than that.

Q. Since NFS 4.1 is out, what is the uptake in the industry?

A. There aren’t any global figures, since not all suppliers collect detailed information about protocol usage, and of those that do, many can’t differentiate between NFS versions. Anecdotally, it’s slow. That’s because NFSv4.1 servers (particularly for file-layout) have only been available for less than a year, and the needed Linux client support has only recently made it through to the enterprise distributions. NFSv4 (as opposed to 4.1) is more widely used; but the only figures I have are anecdotal, and would be misleading.

Q. Are there any network architecture design considerations that need to be taken before implementing NFSv4.1?

A. No. In fact, (if you’re not using pNFS) NFSv4.1 should get you more “bang for your buck” as there’s a reduction in network traffic compared with NFSv3. pNFS requires a different architecture; your storage vendor should be able to assist in the planning.

Q. Clustered servers – you mentioned that vendors had to provide a special server for this… are these enhancements going to be ported into the general linux nfs server stream?

A. I’m not sure to what this refers; perhaps the MDS (metadata server)? Although this server is often shown as a separate box in diagrams for simplicity, that’s not how it is normally implemented. The MDS is normally part of the cluster, running on one or more of the data servers.

Q. If you recommend AD for kerberos, do all of the NFS clients need to be joined to the same AD domain as well? Or only the servers?

A. Any time a client in one domain (or realm) attempts to access a server, the server must be in the same realm as the client, or if it’s in another realm, there must be cross realm trust so that the principal (the client) can be correctly authenticated.

Q. Can you talk about any difficulties in using Active Directory with NFS? Are there changes needed on AD?

A. No changes are needed to AD. It’s relatively straightforward security administration, and storage vendors should be able to provide you with implementation checklists.

Q. What is the impact on clustering and failover by introducing statefulness?

A. Significant! And much better. Recovery is much improved, as the server and client after a failure can attempt to agree on what locks were held, what files were open, what data had been written and so on. It’s a big improvement on NFSv3.

Q. Will it be possible to mount root file systems from NFSV4? Like boot from the SAN that we already have in FC or iSCSI?

A. Yes, that doesn’t change.

Q. Can you explain the reasons why home dir and hpc would benefit with v4.1?

A. Home directories are an easy win; no application (well, at least that you care about) and easily migrated. The same is often true of HPC. For example where the data is transient – served from a store to local disk, computed and crunched, and then sent back to the store – the store could be migrated to NFSv4 and the app later; or the app first and the store later.

Beyond Potatoes – Migrating from NFSv3

“It is a mistake to think you can solve any major problems just with potatoes.”
Douglas Adams (1952-2001, English humorist, writer and dramatist)

While there have been many advances and improvements to NFS over the last decade, some IT organizations have elected to continue with NFSv3 – like potatoes, it’s the staple filesystem protocol that just about any UNIX administrator understands.

Although adequate for many purposes and a familiar and well understood protocol, choosing and continuing to deploy NFSv3 has become increasingly difficult to justify in a modern datacenter. For example, NFSv3 makes promiscuous use of ports, something that is unsuitable for a variety of security reasons for use over a wide area network (WAN); in addition, increased server and client bandwidth demands and the improved functionality of Network Attached Storage (NAS) arrays have outstripped NFSv3’s ability to deliver high throughput.

NFSv4 and the minor versions that follow it are designed to address many of the issues that NFSv3 poses. NFSv4 also includes features intended to enable its use in global WANs, and to improve the performance and resilience of NAS:

  • Firewall-friendly single port operations
  • Internationalization support
  • Replication and migration facilities
  • Mandatory implementation of strong RPC security flavors that depend on cryptography, with support of access control that is compatible with both UNIX® and Windows®
  • Use of character strings instead of integers to represent user and group identifiers
  • Advanced and aggressive cache management features with delegations
  • (with NFSv4.1 pNFS, or parallel NFS) Trunking

In April 2003, the Network File System (NFS) version 4 Protocol was ratified as an Internet standard, described in RFC-3530, which superseded NFS Version 3 (NFSv3, specified in RFC-1813). Since the ratification of NFSv4, further advances have been made to the standard, notably NFSv4.1 (as described in RFC-5661, ratified in January 2010) that included several new features such as parallel NFS (pNFS). And further work is currently underway in the IETF for NFSv4.2.

Delegations with NFSv4

In NFSv3, clients have to function as if there is contention for the files they have opened, even though this is often not the case. As a result of this conservative approach to file locking, there are frequently many unneeded requests from the client to the server to find out whether an open file has been modified by some other client. Even worse, all write I/O in this scenario is required to be synchronous, further impacting client-side performance.

NFSv4 differs by allowing the server to delegate specific actions on a file to the client; this enables more aggressive client caching of data and the locks. A server temporarily cedes control of file updates and the locking state to a client via a delegation, and promises to notify the client if other clients are accessing the file. Once the client holds a delegation, it can perform operations on files whose data has been cached locally, and thereby avoid network latency and optimize its use of I/O.

Trunking with pNFS

Many additional enhancements to NFSv4 are available with NFSv4.1, of which pNFS is a part. pNFS adds the capability to perform trunking at the NFS level by adding a session layer. The client establishes a session with an NFSv4.1 server, and can then create multiple TCP connections to the NFSv4.1 server, each potentially going over a different network interface on the client, and arriving on a different interface on the NFSv4.1 server. Now different requests sent over the same session identifier can go over different network paths, dramatically improving latency and increasing bandwidth.
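As a rough picture of session trunking, the sketch below spreads requests that share one session identifier across several connections in round-robin order. It is a toy model: the class, the connection labels and the round-robin policy are all assumptions made for illustration, and a real client is free to schedule requests differently.

```python
from itertools import cycle


class ToyTrunkedSession:
    """One session identifier shared by several network connections."""

    def __init__(self, session_id, connections):
        if not connections:
            raise ValueError("a session needs at least one connection")
        self.session_id = session_id
        self._next_connection = cycle(connections)

    def send(self, request):
        """Pick the next connection in turn and 'send' the request on it."""
        connection = next(self._next_connection)
        return (self.session_id, connection, request)


# Hypothetical connection labels: client interface -> server interface.
session = ToyTrunkedSession("session-1", ["eth0->nic-a", "eth1->nic-b"])
routes = [session.send(f"READ block {i}")[1] for i in range(4)]
print(routes)  # ['eth0->nic-a', 'eth1->nic-b', 'eth0->nic-a', 'eth1->nic-b']
```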
Although client and server implementations of NFSv4.1 are available, they are in early stages of implementation and adoption. However, to take advantage of them in the future, it is important to plan now for the move to NFSv4 and beyond – and there are many servers and clients available now that support NFSv4. NFSv4 is a mature and stable protocol with many advantages in its own right over its predecessors NFSv3 and NFSv2.

Potatoes and Beyond

Now is the time to make the switchover; there really is no justification for not pursuing NFSv4 as the first NFS protocol version of choice. Although migrating from earlier versions of NFS requires some planning as there are significant differences between the two protocols, the benefits are impressive. To ensure a smooth migration to NFSv4 and beyond, the SNIA Ethernet Storage Forum NFS Special Interest Group has recently published an overview white paper “Migrating to NFSv4”. This covers internationalization support, automatic mounting of NFSv4 filesystems on demand, TCP protocol support amongst other considerations.
NFSv4 and NFSv4.1 have been developed for a reason; and NFSv4.2 is on the horizon. Like the potato, NFSv3 is a staple of the network filesystem world. But as Douglas Adams said: “It is a mistake to think you can solve any major problems just with potatoes.” NFSv4 fixes many of NFSv3’s deficiencies, and represents a major advance that brings improved availability, performance and security; all the check-list items beyond potatoes that today’s users of network attached storage demand.