Lustre – SNIA on Storage

Common Questions on Clustered File Systems

November 18, 2016November 18, 2016 John Kim

More than 350 people have already seen our SNIA Ethernet Storage Forum (ESF) webcast “Clustered File Systems: No Limits.” Our presenters, James Coomer and Jerry Lotto, did a great job explaining what clustered file systems are, key considerations, choices and performance. As we expected, there were plenty of questions, so as promised, here are answers to them all.

Q: Parallel NFS (pNFS) has been in development/standard effort for a long time, and I believe pNFS is not in the Linux kernel it appears pNFS is yet to be prime time.

A: pNFS has been in Linux for over a decade! Clients and server are widely available, and you should look at the SNIA White Paper “An Updated Overview of NFSv4; NFSv4.0, NFSv4.1, pNFS, and NFSv4.2” for more information on the current state of play.

Q: Why the emphasis on parallel I/O? Any single storage server can feed results at link capacity, so you do not need multiple storage servers to feed a client at full speed. Isn’t the more critical issue the bottleneck on access to metadata for a single directory or file? Federated NAS bottlenecks updates for each directory behind a single master server?

A: Any one storage server can usually saturate one client, but often there are multiple hungry clients making requests simultaneously. Using parallel I/O allows multiple servers to feed multiple high-bandwidth clients across a narrow or wide set of data. This smooths out the I/O load on the servers in a near-perfect manner regardless of the number of clients performing I/O. It is absolutely true that metadata serving can become a bottleneck, so parallel file systems use cached and/or distributed metadata to overcome this and again, every client takes part in that interaction and shares some responsibility for managing communicating metadata updates.

Q: Can any application access parallel file system (i.e. through an agent in the driver level)? Or does it require specific code within the application?

A: Native access to a parallel file system requires a specific client or agent in the host, but many parallel file systems allow any client to access the data through a NAS protocol gateway. No changes are needed to applications to use a parallel file system – These parallel file systems are mounted as a POSIX compliant file system and therefore adhere to basically the same standards as an NFS mount for example.

Q: Are parallel file system clients compatible with scale-out NAS servers?

A: Nearly all scale-out NAS servers speak a standard NAS protocol like NFS or SMB. Clients running a parallel file system client can also access NAS via these standard protocols. Exceptions to this may possibly (but none that we know of) occur for scale-out NAS servers that support a modified NFS/SMB protocol or a custom NAS client which might conceivably conflict with the parallel file system client when installed on an OS.

Q: Of course I am biased, but I am fond of the AFS (Andrew File System) Family of File Systems. There is OpenAFS, but there is also what we are doing at AuriStor extending beyond the core AFS global namespace model (security functionality, and performance)

A: AFS is another distributed file system which supports large scale deployments, native clients for many platforms, and strong security features. It also uses local caching of files to improve performance. It uses a weakly consistent file locking system so multiple clients can access the same file simultaneously but they cannot both update the same file at the same time. OpenAFS is an open-source implementation of AFS. Auristor (formerly Your File System, Inc.) is a startup providing a commercial parallel file system that is compatible with AFS.

Q: I am more familiar with Veritas Cluster File System, could you please do a quick compare with Lustre or GPFS?

A: The Veritas Cluster File System (formerly VxCFS, now part of Veritas InfoScale) is a distributed file system that runs on Linux and popular flavors of Unix. It supports up to 64 nodes and allows multiple nodes to share the same back-end storage hardware. Comparing it to Lustre and GPFS is beyond the scope of this webinar, but in basic terms, parallel file systems can offer far greater scalability and bandwidth for example, through the use of optimized RDMA clients for high performance networks.

Q: Why do file apps need shared access to data, but block apps do not?

A: Traditionally block storage did not offer shared access to data (except when used as shared back-end storage for a clustered file system), while apps that needed shared access to data usually chose to use a NAS protocol such as SMB or NFS. So in many cases file-based apps use file sharing protocols because they need shared access to data from multiple clients. (In other cases file-based applications do not require sharing but the storage administrators believe it’s easier to manage or less expensive than networked block storage.)

Q: Do Lustre and GPFS have SMB Direct support?

A: Not today. SMB Direct is an option to use RDMA and multi-channel with the SMB 3 protocol. Both Lustre and GPFS support the ability to export a file system via NFS or SMB, but generally they do not support SMB Direct yet. Both Lustre and GPFS support RDMA access through their clients.

How to the clients avoid doing simultaneous writes to the same file?

A: Some parallel file systems allow this by letting different clients write to different parts of the same file. Others do not allow this. In either case, distributed file locking is used to prevent two clients from writing simultaneously to the same part of a file (or to the same file if it’s not allowed).

Q: How can you say that the application “does not have to worry about” how the clustered file system serializes writes? Doesn’t this require continuous end-to-end connectivity?

A: When the application writes data it generally writes to a POSIX-compliant file system and does not need to worry about how the parallel file system serializes, distributes, or protects the data because this is virtualized (managed) by the file system. It usually does require continuous end-to-end connectivity from the clients to the servers, though in some cases caching could allow for brief gaps in connectivity and in some systems not every client needs to have network connectivity to every server. There are multiple mechanisms within parallel file systems to manage the various cases of clients/servers disappearing from the network, temporarily or permanently (whilst for example holding a lock).

Q: How does a parallel file system handle the sequences of write on a same file? Just append one by one? What if a client modified a line?

A: This is the biggest challenge for and reason to use a parallel file system. Beneath the covers, coherency is maintained by Spectrum Scale using a token management server process which issues locks for object requests. Similar functionality is implemented in Lustre using a distributed lock manager. These objects are most commonly blocks within files rather than entire files, but this is application controlled. The end result is a POSIX-compliant interface that scales to thousands of clients.

Q: What does FPO stand for?

A: File Placement Optimizer – a shared-nothing architecture and licensing model for IBM Spectrum Scale (aka GPFS). Learn more here.

Q: Is there a concept in parallel file systems for “auto-tuning” yet? Seems like the early days of SAN management and tuning…

A: Default tuning values are optimized for general purpose workloads, but the whole purpose of tuning parameters is to adjust away from those defaults to optimize the file system for a particular application workload or fil esystem architecture. Both IBM and OpenSFS with the support of Intel have published extensive documentation on best practices for optimization and tuning for either file system. We are not aware of any work on “automating” that process but there has been recent work (e.g. in spectrum scale) to simplify the tuning process.

Q: Which is better as interconnect between disk and servers, shared access or share-nothing?

A: The use of shared access in the interconnect between disks and servers is limited to providing HA functionality in Lustre or Spectrum Scale, the ability to service I/O requests to a storage device if the server which has primary responsibility for that device is not available. This usually involves multiple server-attached external storage which can add cost to building the file system. The alternative approach to HA is to replicate blocks of data to different disks on different servers, cutting back on the usable capacity of the file system. If HA is not a requirement, a share-nothing architecture will generally involve less hardware and therefore be less expensive to build.

If you have more questions, please comment on this blog. And I encourage you to check out the SNIA ESF webcast library for educational, vendor-neutral content on Ethernet networked storage topics

Clustered File Systems: No Limits

October 7, 2016October 7, 2016 John Kim

Today’s storage world would appear to have been divided into three major and mutually exclusive categories: block, file and object storage. The marketing that shapes much of the user demand would appear to suggest that these are three quite distinct animals, and many systems are sold as exclusively either SAN for block, NAS for file or object. And object is often conflated with cloud, a consumption model that can in reality be block, file or object.

A fixed taxonomy that divides the storage world this way is very limiting, and can be confusing; for instance, when we talk about cloud. How should providers and users buy and consume their storage? Are there other classifications that might help in providing storage solutions to meet specific or more general application needs? What about customers who need file access performance beyond what one storage box can provide? Which options support those who want scale-out solution like object storage with file protocol semantics?

To clear up the confusion, the SNIA Ethernet Storage Forum is hosting a live Webcast, “Clustered File Systems: No Limits.” In this Webcast we will explore clustered storage solutions that not only provide multiple end users access to shared storage over a network, but allow the storage itself to be distributed and managed over multiple discrete storage systems. You’ll hear:

General principles and specific clustered and distributed systems and the facilities they provide built on the underlying storage
Better known file systems like NFS, IBM Spectrum Scale (GPFS) and Lustre, along with a few of the less well known
How object based systems like S3 have blurred the lines between them and traditional file based solutions

This Webcast should appeal to those interested in exploring some of the different ways of accessing & managing storage, and how that might affect how storage systems are provisioned and consumed. POSIX and other acronyms may be mentioned, but no rocket science beyond a general understanding of the principles of storage will be assumed. Contains no nuts and is suitable for vegans!

As always, our experts will be on hand to answer your questions on the spot. Register now for this October 25^th event.

OpenStack File Services for HPC Q&A

October 5, 2015October 5, 2015 Glyn Bowden

We got some great questions during our Webcast on how OpenStack can consume and control file services appropriate for High Performance Computing (HPC) in a cloud and multi-tenanted environment. Here are answers to all of them. If you missed the Webcast, it’s now available on-demand. I encourage you to check it out and please feel free to leave any additional questions at this blog.

Q. Presumably we can use other than ZFS for the underlying filesystems in Lustre?

A. Yes, there a plenty of other filesystems that can be used other than ZFS. ZFS was given as an example of a scale up and modern filesystem that has recently been integrated, but essentially you can use most filesystem types with some having more advantages than others. What you are looking for is a filesystem that addresses the weaknesses of Lustre in terms of self-healing and scale up. So any filesystem that allows you to easily grow capacity whilst also being capable of protecting itself would be a reasonable choice. Remember, Lustre doesn’t do anything to protect the data itself. It simply places objects in a distributed fashion of the Object Storage Targets.

Q. Are there any other HPC filesystems besides Lustre?

A. Yes there are and depending on your exact requirements Lustre might not be appropriate. Gluster is an alternative that some have found slightly easier to manage and provides some additional functionality. IBM has GPFS which has been implemented as an HPC filesystem and other vendors have their scale-out filesystems too. An HPC filesystem is simply a scale-out filesystem capable of very good throughput with low latency. So under that definition a flash array could be considered a High Performance storage platform, or a scale out NAS appliance with some fast disks. It’s important to understand you’re workloads characteristics and demands before making the choice as each system has pro’s and con’s.

Q. Does “embarrassingly parallel” require bandwidth or latency from the storage system?

A. Depending on the workload characteristics it could require both. Bandwidth is usually the first demand though as data is shipped to the nodes for processing. Obviously the lower the latency the fast though jobs can start and run, but its not critical as there is limited communication between nodes that normally drives the low latency demand.

Q. Would you suggest to use Object Storage for NFV, i.e Telco applications?

A. I would for some applications. The problem with NFV is it actually captures a surprising breadth of applications so of which have very limited data storage needs. For example there is little need for storage in a packet switching environment beyond the OS and binaries needed to stand up the VM’s. In this case, object is a very good fit as it can be easily, geographically distributed ensuring the same networking function is delivered in the same manner. Other applications that require access to filtered data (so maybe billing based applications or content distribution) would also be good candidates.

Q. I missed something in the middle; please clarify, your suggestion is to use ZFS (on Linux) for the local file system on OSTs?

A. Yes, this was one example and where some work has recently been done in the Lustre community. This affords the OSS’s the capability of scaling the capacity upwards as well as offering the RAID-like protection and self-healing that comes with ZFS. Other filesystems can offer those some things so I am not suggesting it is the only choice.

Q. Why would someone want/need scale-up, when they can scale-out?

A. This can often come down to funding. A lot of HPC environments exist in academic institutions that rely on grant funding and sponsorship to expand their infrastructure. Sometimes it simply isn’t feasible to buy extra servers in order to add capacity, particularly if there is already performance headroom. It might also be the case that rack space, power and cooling could be factors in which case adding drives to cope with bigger workloads might be the only option. You do need to consider if the additional capacity would also provoke the need for better performance so we can’t just assume that adding disk is enough, but it’s certainly a good option and a requirement I have seen a number of times.

A Deep Dive into the SNIA Storage Developer Conference – The File Systems Track

September 9, 2015September 9, 2015 khauser Leave a comment

SNIA Storage Developer Conference (SDC) 2015 is two weeks away, and the SNIA Technical Council is finalizing a strong, comprehensive agenda of speakers and sessions. Wherever your interests lie, you’re sure to find experts and topics that will expand your knowledge and fuel your professional development!

For the next two weeks, SNIA on Storage will highlight exciting interest areas in the 2015 agenda. If you have not registered, you need to! Visit www.storagedeveloper.org to see the four day overview and sign up.

Year after year, the File Systems Track at SDC provides in depth information on the latest technologies, and 2015 is no exception. The track kicks off with Vinod Eswaraprasad, Software Architect, Wipro, on Creating Plugin Modules for OpenStack Manila Services. He’ll discuss his work on integrating a multi-protocol NAS storage device to the OpenStack Manila service, looking at the architecture principle behind the scalability and modularity of Manila services, and the analysis of interface extensions required to integrate a typical NAS head.

Ankit Agrawal and Sachin Goswami of TCS will discuss How to Enable a Reliable and Economic Cloud Storage Solution by Integrating SSD with LTFS. They will share views on how to integrate SSD as a cache with a LTFS tape system to transparently deliver the best benefits for Object Base storage, and talk about the potential challenges in their approach and best practices that can be adopted to overcome these challenges.

Jakob Homan, Distributed Systems Engineer, Microsoft, will present Apache HDFS: Latest Developments and Trends. He’ll discuss the new features of HDFS, which has rapidly developed to meet the needs of enterprise and cloud customers, and take a look at HDFS at implementations and how they address previous shortcomings of HDFS.

James Cain, Principal Software Architect, Quantel Limited, will discuss a Pausable File System, using his own implementation of an SMB3 server (running in user mode on Windows) to demonstrate the effects of marking messages as asynchronously handled and then delaying responses in order to build up a complete understanding of the semantics offered by a pausable file system.

Ulrich Fuchs, Service Manager, CERN, will talk about Storage Solutions for Tomorrow’s Physics Projects, suggesting possible architectures for tomorrow’s storage implementations in this field, and showing results of first performance tests done on various solutions (Lustre, NFS, Block Object storage, GPFS ..) for typical application access patterns.

Neal Christiansen, Principal Development Lead, Microsoft, will present Support for New Storage Technologies by the Windows Operating System, describing the changes being made to the Windows OS, its file systems, and storage stack in response to new evolving storage technologies.

Richard Morris and Peter Cudhea of Oracle will discuss ZFS Async Replication Enhancements, exploring design decisions around enhancing the ZFS send and ZFS receive commands to transfer already compressed data more efficiently and to recover from failures without re-sending data that has already been received.

J.R. Tipton, Development Lead, Microsoft, will discuss ReFS v2: Cloning, Projecting, and Moving Data File Systems, presenting new abstractions that open up greater control for applications and virtualization, covering block projection and cloning as well as in-line data tiering.

Poornima Gurusiddaiah and Soumya Koduri of Red Hat will present Achieving Coherent and Aggressive Client Caching in Gluster, a Distributed System, discussing how to implement file system notifications and leases in a distributed system and how these can be leveraged to implement a client side coherent and aggressive caching.

Sriram Rao, Partner Scientist Manager, Microsoft, will present Petabyte-scale Distributed File Systems in Open Source Land: KFS Evolution, providing an overview of OSS systems (such as HDFS and KFS) in this space, and describing how these systems have evolved to take advantage of increasing network bandwidth in data center settings to improve application performance as well as storage efficiency.

Richard Levy, CEO and President, Peer Fusion, will discuss a High Resiliency Parallel NAS Cluster, including resiliency design considerations for large clusters, the efficient use of multicast for scalability, why large clusters must administer themselves, fault injection when failures are the nominal conditions, and the next step of 64K peers.

Join your peers as well – register now at www.storagedeveloper.org. And stay tuned for tomorrow’s blog on Cloud topics at SDC!

OpenStack File Services Options

August 17, 2015August 17, 2015 Glyn Bowden

How can OpenStack consume and control file services appropriate to High Performance Compute (HPC) in a cloud and multi-tenanted environment? Find out on September 22^nd when SNIA Cloud hosts a live Webcast and examines two approaches to integration.

One approach is to have OpenStack manage the storage infrastructure services using Cinder, Nova and Neutron to provide HPC Filesystem as a Service.

A second option is to use Manila file services for OpenStack to control the HPC File system deployment and manage the exports etc. This part also looks at the creation (in progress) of the Lustre Manila driver and its current progress.

I hope you’ll join Alex McDonald and me as we discuss the pros and cons of each approach. Register today and