Ceph Q&A

May 6, 2024May 6, 2024 Erin Farr

In a little over a month, more than 1,500 people have viewed the SNIA Cloud Storage Technologies Initiative (CSTI) live webinar, “Ceph: The Linux of Storage Today,” with SNIA experts Vincent Hsu and Tushar Gohad. If you missed it, you can watch it on-demand at the SNIA Educational Library. The live audience was extremely engaged with our presenters, asking several interesting questions. As promised, Vincent and Tushar have answered them here. Given the high level of this interest in this topic, the CSTI is planning additional sessions on Ceph. Please follow us @SNIACloud or at SNIA LinkedIn for dates. Q: How many snapshots can Ceph support per cluster? Q: Does Ceph provide Deduplication? If so, is it across objects, file and block storage? A: There is no per-cluster limit. In the Ceph filesystem (cephfs) it is possible to create snapshots on a per-path basis, and currently the configurable default limit is 100 snapshots per path. The Ceph block storage (rbd) does not impose limits on the number of snapshots. However, when using the native Linux kernel rbd client there is a limit of 510 snapshots per image. Read More

Here’s Why Ceph is the Linux of Storage Today

February 14, 2024February 14, 2024 Erin Farr

Data is one of the most critical resources of our time. Storage for data has always been a critical architectural element for every data center, requiring careful considerations for storage performance, scalability, reliability, data protection, durability and resilience. A decade ago, the market was aggressively embracing public storage because of its agility and scalability. In the last few years, people have been rethinking that approach, moving toward on-premises storage with cloud consumption models. The new cloud native architecture on-premises has the promise of the traditional data center’s security and reliability with cloud agility and scalability. Ceph, an Open Source project for enterprise unified software-defined storage, represents a compelling solution for this cloud native on-premises architecture and will be the topic of our next SNIA Cloud Storage Technologies Initiative webinar, “Ceph: The Linux of Storage Today.” This webinar will discuss: Read More

Edge AI Q&A

November 14, 2023November 15, 2023 Erin Farr

At our recent SNIA Cloud Storage Technologies (CSTI) webinar “Why Distributed Edge Data is the Future of AI” our expert speakers, Rita Wouhaybi and Heiko Ludwig, explained what’s new and different about edge data, highlighted use cases and phases of AI at the edge, covered Federated Learning, discussed privacy for edge AI, and provided an overview of the many other challenges and complexities being created by increasingly large AI models and algorithms. It was a fascinating session. If you missed it you can access it on-demand along with a PDF of the slides at the SNIA Educational Library. Our live audience asked several interesting questions. Here are answers from our presenters. Q. With the rise of large language models (LLMs) what role will edge AI play? Read More

Confidential AI Q&A

September 14, 2023September 18, 2023 Erin Farr

Confidential AI is a new collaborative platform for data and AI teams to work with sensitive data sets and run AI models in a confidential environment. It includes infrastructure, software, and workflow orchestration to create a secure, on-demand work environment that meets organization’s privacy requirements and complies with regulatory mandates. It’s a topic the SNIA Cloud Storage Technologies Initiative (CSTI) covered in depth at our webinar, “The Rise in Confidential AI.” At this webinar, our experts, Parviz Peiravi and Richard Searle provided a deep and insightful look at how this dynamic technology works to ensure data protection and data privacy. Here are their answers to the questions from our webinar audience. Q. Are businesses using Confidential AI today? A. Absolutely, we have seen a big increase in adoption of Confidential AI particularly in industries such as Financial Services, Healthcare and Government, where Confidential AI is helping these organizations enhance risk mitigation, including cybercrime prevention, anti-money laundering, fraud prevention and more. Q: With compute capabilities on the Edge increasing, how do you see Trusted Execution Environments evolving? Read More

How Edge Data is Impacting AI

September 6, 2023September 6, 2023 Erin Farr

AI is disrupting so many domains and industries and by doing so, AI models and algorithms are becoming increasingly large and complex. This complexity is driven by the proliferation in size and diversity of localized data everywhere, which creates the need for a unified data fabric and/or federated learning. It could be argued that whoever wins the data race will win the AI race, which is inherently built on two premises: 1) Data is available in a central location for AI to have full access to it, 2) Compute is centralized and abundant.

The impact of edge AI is the topic for our next SNIA Cloud Storage Technologies Initiative (CSTI) live webinar, “Why Distributed Edge Data is the Future of AI,” on October 3, 2023. If centralized (or in the cloud), AI is a superpower and super expert, but edge AI is a community of many smart wizards with the power of cumulative knowledge over a central superpower. In this webinar, our SNIA experts will discuss: Read More

Training Deep Learning Models Q&A

May 19, 2023August 29, 2023 Erin Farr

The estimated impact of Deep Learning (DL) across all industries cannot be understated. In fact, analysts predict deep learning will account for the majority of cloud workloads, and training of deep learning models will represent the majority of server applications in the next few years. It’s the topic the SNIA Cloud Storage Technologies Initiative (CSTI) discussed at our webinar “Training Deep Learning Models in the Cloud.” If you missed the live event, it’s available on-demand at the SNIA Educational Library where you can also download the presentation slides. The audience asked our expert presenters, Milind Pandit from Habana Labs Intel and Seetharami Seelam from IBM several interesting questions. Here are their answers: Q. Where do you think most of the AI will run, especially training? Will it be in the public cloud or will it be on-premises or both [Milind:] It’s probably going to be a mix. There are advantages to using the public cloud especially because it’s pay as you go. So, when experimenting with new models, new innovations, new uses of AI, and when scaling deployments, it makes a lot of sense. But there are still a lot of data privacy concerns. There are increasing numbers of regulations regarding where data needs to reside physically and in which geographies. Because of that, many organizations are deciding to build out their own data centers and once they have large-scale training or inference successfully underway, they often find it cost effective to migrate their public cloud deployment into a data center where they can control the cost and other aspects of data management. [Seelam]: I concur with Milind. We are seeing a pattern of dual approaches. There are some small companies that don’t have the right capital necessary nor the expertise or teams necessary to acquire GPU based servers and deploy them. They are increasingly adopting public cloud. We are seeing some decent sized companies that are adopting this same approach as well. Keep in mind these GPU servers tend to be very power hungry and so you need the right floor plan, power, cooling, and so forth. So, public cloud definitely helps you have easy access and to pay for only what you consume. We are also seeing trends where certain organizations have constraints that restrict moving certain data outside their walls. In those scenarios, we are seeing customers deploy GPU systems on-premises. I don’t think it’s going to be one or the other. It is going to be a combination of both, but by adopting more of a common platform technology, this will help unify their usage model in public cloud and on-premises. Q. What is GDR? You mentioned using it with RoCE. [Seelam]: GDR stands for GPUDirect RDMA. There are several ways a GPU on one node can communicate to a GPU on another node. There are three different ways (at least) of doing this: The GPU can use TCP where GPU data is copied back into the CPU which orchestrates the communication to the CPU and GPU on another node. That obviously adds a lot of latency going through the whole TCP protocol. Another way to do this is through RoCEv2 or RDMA where CPUs, FPGAs and/or GPUs actually talk to each other through industry standard RDMA channels. So, you send and receive data without the added latency of traditional networking software layers. A third method is GDR where a GPU on one node can talk to a GPU on another node directly. This is done through network interfaces where basically the GPUs are talking to each other, again bypassing traditional networking software layers. Q. When you are talking about RoCE do you mean RoCEv2? [Seelam]: That is correct I’m talking only about RoCEv2. Thank you for the clarification. Q. Can you comment on storage needs for DL training and have you considered the use of scale out cloud storage services for deep learning training? If so, what are the challenges and issues? [Milind]: The storage needs are 1) massive and 2) based on the kind of training that you’re doing, (data parallel versus model parallel). With different optimizations, you will need parts of your data to be local in many circumstances. It’s not always possible to do efficient training when data is physically remote and there’s a large latency in accessing it. Some sort of a caching infrastructure will be required in order for your training to proceed efficiently. Seelam may have other thoughts on scale out approaches for training data. [Seelam]: Yes, absolutely I agree 100%. Unfortunately, there is no silver bullet to address the data problem with large-scale training. We take a three-pronged approach. Predominantly, we recommend users put their data in object storage and that becomes the source of where all the data lives. Many training jobs, especially training jobs that deal with text data, don’t tend to be huge in size because these are all characters so we use object store as a source directly to read the data and feed the GPUs to train. So that’s one model of training, but that only works for relatively smaller data sets. They get cached once you access the first time because you shard it quite nicely so you don’t have to go back to the data source many times. There are other data sets where the data volume is larger. So, if you’re dealing with pictures, video or these kinds of training domains, we adopt a two-pronged approach. In one scenario we actually have a distributed cache mechanism where the end users have a copy of the data in the file system and that becomes the source for AI training. In another scenario, we deployed that system with sufficient local storage and asked users to copy the data into that local storage to use that local storage as a local cache. So as the AI training is continuing once the data is accessed, it’s actually cached on the local drive and subsequent iterations of the data come from that cache. This is much bigger than the local memory. It’s about 12 terabytes of cache local storage with the 1.5 terabytes of data. So, we could get to these data sets that are in the 10-terabyte range per node just from the local storage. If they exceed that, then we go to this distributed cache. If the data sets are small enough, then we just use object storage. So, there are at least three different ways, depending on the use case on the model you are trying to train. Q. In a fully sharded data parallel model, there are three communication calls when compared to DDP (distributed data parallel). Does that mean it needs about three times more bandwidth? [Seelam]: Not necessarily three times more, but you will use the network a lot more than you would use in a DDP. In a DDP or distributed data parallel model you will not use the network at all in the forward pass. Whereas in an FSDP (fully sharded data parallel) model, you use the network both in forward pass and in backward pass. In that sense you use the network more, but at the same time because you don’t have parts of the model within your system, you need to get the model from the other neighbors and so that means you will be using more bandwidth. I cannot give you the 3x number; I haven’t seen the 3x but it’s more than DDP for sure. The SNIA CSTI has an active schedule of webinars to help educate on cloud technologies. Follow us on Twitter @SNIACloud and sign up for the SNIA Matters Newsletter, so that you don’t miss any.