Solving Cloud Object Storage Incompatibilities in a Multi-Vendor Community

The SNIA Cloud Storage Technologies Initiative (CSTI) conducted a poll early in 2024 during the live webinar “Navigating the Complexities of Object Storage Compatibility,” which revealed that 72% of organizations have encountered incompatibility issues between various object storage implementations. These results prompted a call to action for SNIA to create an open expert community dedicated to resolving these issues and building best practices for the industry. Since then, SNIA CSTI has partnered with the SNIA Cloud Storage Technical Work Group (TWG) and successfully organized, hosted, and completed the first SNIA Cloud Object Storage Plugfest (multi-vendor interoperability testing), co-located with the SNIA Developer Conference (SDC) in September 2024 in Santa Clara, CA. Participating Plugfest companies included engineers from Dell, Google, Hammerspace, IBM, Microsoft, NetApp, VAST Data, and Versity Software. Three days of Plugfest testing discovered and resolved issues, and included a Birds of a Feather (BoF) session to gain consensus on next steps for the industry. Plugfest contributors are now planning two 2025 Plugfest events: Denver in April and Santa Clara in September. It’s a collaborative effort that we’ll discuss in detail on November 21, 2024 at our next live SNIA CSTI webinar, “Building a Community to Tackle Cloud Object Storage Incompatibilities.” At this webinar, we will share insights into industry best practices, explain the benefits your implementation may gain from improved compatibility, and provide an overview of how a wide range of vendors is uniting to address real customer issues, discussing: Read More

Complexities of Object Storage Compatibility Q&A

72% of organizations have encountered incompatibility issues between various object storage implementations, according to a poll at our recent SNIA Cloud Storage Technologies Initiative webinar, “Navigating the Complexities of Object Storage Compatibility.” If you missed the live presentation, or would like to see the answers to the other poll questions we asked the audience, you can view it on-demand at the SNIA Educational Library. The audience was highly engaged during the live event and asked several great questions. Here are answers to them all.

Q. Do you see the need for fast object storage for AI workloads?

A. Yes, the demand for fast object storage in AI workloads is growing. Initially, object storage was mainly used for backup or archival purposes. However, its evolution into data lakes and the introduction of features like the S3 SELECT API have made it more suitable for data analytics. The launch of Amazon’s S3 Express, a faster yet more expensive tier, is a clear indication of this trend. Other vendors are following suit, suggesting a shift towards object storage as a primary data storage platform for specific workloads.

Q. As object storage becomes more prevalent in the primary storage space, could you talk about data protection, especially functionalities like synchronous replication and multi-site deployments? Or is your view that this is not needed for object storage deployments? Read More
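To make the S3 SELECT point concrete, here is a minimal sketch, assuming boto3 is installed and credentials are configured; the bucket and key names are hypothetical placeholders. The idea is that only matching rows leave the object store, rather than the whole object.

```python
# A minimal sketch of server-side filtering with the S3 SELECT API.
# "my-bucket" and "logs/requests.csv" are hypothetical names.
import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="my-bucket",
    Key="logs/requests.csv",
    ExpressionType="SQL",
    Expression="SELECT * FROM S3Object s WHERE s.status = '500'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"JSON": {}},
)

# The response is an event stream; Records events carry the filtered rows.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())
```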

It’s All About Cloud Object Storage Interoperability

Object storage has firmly established itself as a cornerstone of modern data centers and cloud infrastructure. Ensuring API compatibility has become crucial for object storage developers who want to benefit from the wide ecosystem of existing applications. However, achieving compatibility can be challenging due to the complexity and variety of the APIs, access control mechanisms, and performance and scalability requirements. The SNIA Cloud Storage Technologies Initiative, together with the SNIA Cloud Storage Technical Work Group, is working to address the issues of cloud object storage complexity and interoperability. We’re kicking off 2024 with two exciting initiatives: 1) a webinar on June 9, 2024, and 2) a Plugfest in September 2024. Here are the details: Read More
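For a taste of what compatibility testing looks like in practice, here is a minimal sketch, assuming boto3 and a hypothetical S3-compatible endpoint, of the kind of check a Plugfest exercises: the same client code should behave identically against any compatible implementation.

```python
# A minimal sketch of a cross-vendor compatibility check. The endpoint URL
# and credentials below are hypothetical placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://objects.example.com",  # hypothetical non-AWS endpoint
    aws_access_key_id="ACCESS_KEY",              # placeholder credentials
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="interop-test")
s3.put_object(Bucket="interop-test", Key="hello.txt", Body=b"hello")

obj = s3.get_object(Bucket="interop-test", Key="hello.txt")
assert obj["Body"].read() == b"hello"
assert obj["ETag"]  # ETag formats are a common source of incompatibility
```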

An Open Standard for Namespace Management

The days of simple, static, self-contained file systems have long passed. Today, we have complex, dynamic namespaces, mixing block, file, object, key-value, queue, and graph-based resources, accessed via multiple protocols, and distributed across multiple systems and geographic sites. These complexities create new challenges for simplifying management. There is good news on addressing this issue, and the SNIA Cloud Storage Technologies Initiative (CSTI) will explain how in our live webinar “Simplified Namespace Management – The Open Standards Way” on October 18, 2023. There, David Slik, Chair of the SNIA Cloud Storage Technical Work Group, will demonstrate how the SNIA Cloud Data Management Interface (CDMI™), an open ISO standard (ISO/IEC 17826:2022) for managing data objects and containers, already includes extensive capabilities for simplifying the management of complex namespaces. In this webinar, you’ll learn Read More
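For a flavor of how CDMI works, here is a minimal sketch, assuming the requests library and a hypothetical CDMI-capable server, of listing a container’s children over plain HTTP. The CDMI-specific header and media type come from the standard; the endpoint and path are placeholders.

```python
# A minimal sketch of a CDMI container listing. The server URL and
# namespace path are hypothetical.
import requests

BASE = "https://cdmi.example.com"  # hypothetical CDMI server

resp = requests.get(
    f"{BASE}/namespace/projects/",
    headers={
        "X-CDMI-Specification-Version": "2.0.0",
        "Accept": "application/cdmi-container",
    },
)
resp.raise_for_status()

# A CDMI container response enumerates its children by name.
print(resp.json().get("children", []))
```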

Data Fabric Q&A

Unification of structured and unstructured data has long been both a goal and a challenge for organizations. Data Fabric is an architecture, set of services, and platform that standardizes and integrates data across the enterprise regardless of data location (on-premises, cloud, multi-cloud, hybrid cloud), enabling self-service data access to support various applications, analytics, and use cases. The data fabric leaves data where it lives and applies intelligent automation to govern it, secure it, and bring AI to your data.

How a data fabric abstraction layer works and the benefits it delivers was the topic of our recent SNIA Cloud Storage Technologies Initiative (CSTI) webinar, “Data Fabric: Connecting the Dots between Structured and Unstructured Data.” If you missed it, you can watch it on-demand and access the presentation slides at the SNIA Educational Library.

We did not have time to answer audience questions at the live session. Here are answers from our expert, Joseph Dain. Read More

Training Deep Learning Models Q&A

The estimated impact of Deep Learning (DL) across all industries cannot be understated. In fact, analysts predict deep learning will account for the majority of cloud workloads, and training of deep learning models will represent the majority of server applications in the next few years. It’s the topic the SNIA Cloud Storage Technologies Initiative (CSTI) discussed at our webinar “Training Deep Learning Models in the Cloud.” If you missed the live event, it’s available on-demand at the SNIA Educational Library, where you can also download the presentation slides. The audience asked our expert presenters, Milind Pandit from Habana Labs (an Intel company) and Seetharami Seelam from IBM, several interesting questions. Here are their answers:

Q. Where do you think most of the AI will run, especially training? Will it be in the public cloud, or will it be on-premises, or both?

[Milind]: It’s probably going to be a mix. There are advantages to using the public cloud, especially because it’s pay as you go. So, when experimenting with new models, new innovations, new uses of AI, and when scaling deployments, it makes a lot of sense. But there are still a lot of data privacy concerns. There are increasing numbers of regulations regarding where data needs to reside physically and in which geographies. Because of that, many organizations are deciding to build out their own data centers, and once they have large-scale training or inference successfully underway, they often find it cost effective to migrate their public cloud deployment into a data center where they can control the cost and other aspects of data management.

[Seelam]: I concur with Milind. We are seeing a pattern of dual approaches. There are some small companies that don’t have the capital, expertise, or teams necessary to acquire GPU-based servers and deploy them. They are increasingly adopting public cloud. We are seeing some decent-sized companies adopting this same approach as well. Keep in mind these GPU servers tend to be very power hungry, so you need the right floor plan, power, cooling, and so forth. So, public cloud definitely gives you easy access and lets you pay for only what you consume. We are also seeing trends where certain organizations have constraints that restrict moving certain data outside their walls. In those scenarios, we are seeing customers deploy GPU systems on-premises. I don’t think it’s going to be one or the other. It is going to be a combination of both, but adopting more of a common platform technology will help unify the usage model in public cloud and on-premises.

Q. What is GDR? You mentioned using it with RoCE.

[Seelam]: GDR stands for GPUDirect RDMA. There are at least three ways a GPU on one node can communicate with a GPU on another node:

1) The GPU can use TCP, where GPU data is copied back into the CPU, which orchestrates the communication to the CPU and GPU on another node. That obviously adds a lot of latency going through the whole TCP protocol.

2) Another way is RoCEv2 or RDMA, where CPUs, FPGAs, and/or GPUs actually talk to each other through industry-standard RDMA channels. So, you send and receive data without the added latency of traditional networking software layers.

3) A third method is GDR, where a GPU on one node can talk to a GPU on another node directly. This is done through network interfaces where the GPUs are essentially talking to each other, again bypassing traditional networking software layers.
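As an illustration (not from the presenters), here is a minimal sketch of how GPUDirect RDMA typically gets enabled in practice: NCCL, the communication library behind PyTorch’s distributed backend, selects the transport. The environment variable names below are real NCCL knobs; the script layout itself is illustrative and assumes a torchrun-style multi-node launch.

```python
# A minimal sketch: enabling RDMA/GPUDirect transports for NCCL, which
# PyTorch's "nccl" distributed backend uses under the hood.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_IB_DISABLE", "0")       # allow the RDMA (InfiniBand/RoCE) transport
os.environ.setdefault("NCCL_NET_GDR_LEVEL", "SYS")  # permit direct GPU-to-NIC paths (GPUDirect RDMA)

dist.init_process_group("nccl")  # rank/world size come from the launcher env
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# This all-reduce travels over RoCEv2 with GDR when the fabric supports it,
# falling back to CPU-mediated paths otherwise.
t = torch.ones(1, device="cuda")
dist.all_reduce(t)
```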
Q. When you are talking about RoCE, do you mean RoCEv2?

[Seelam]: That is correct, I’m talking only about RoCEv2. Thank you for the clarification.

Q. Can you comment on storage needs for DL training? Have you considered the use of scale-out cloud storage services for deep learning training? If so, what are the challenges and issues?

[Milind]: The storage needs are 1) massive and 2) based on the kind of training that you’re doing (data parallel versus model parallel). With different optimizations, you will need parts of your data to be local in many circumstances. It’s not always possible to do efficient training when data is physically remote and there’s a large latency in accessing it. Some sort of caching infrastructure will be required in order for your training to proceed efficiently. Seelam may have other thoughts on scale-out approaches for training data.

[Seelam]: Yes, absolutely, I agree 100%. Unfortunately, there is no silver bullet to address the data problem with large-scale training. We take a three-pronged approach. Predominantly, we recommend users put their data in object storage, and that becomes the source of where all the data lives. Many training jobs, especially training jobs that deal with text data, don’t tend to be huge in size, because these are all characters, so we use the object store as a source directly to read the data and feed the GPUs to train. So that’s one model of training, but it only works for relatively smaller data sets. They get cached once you access them the first time, because you shard them quite nicely, so you don’t have to go back to the data source many times.

There are other data sets where the data volume is larger. So, if you’re dealing with pictures, video, or these kinds of training domains, we adopt a two-pronged approach. In one scenario, we have a distributed cache mechanism where the end users have a copy of the data in the file system, and that becomes the source for AI training. In another scenario, we deployed the system with sufficient local storage and asked users to copy the data into that local storage to use as a local cache. So, as the AI training continues, once the data is accessed, it’s cached on the local drive, and subsequent iterations of the data come from that cache. This is much bigger than the local memory: it’s about 12 terabytes of local cache storage alongside the 1.5 terabytes of data. So, we could get to data sets that are in the 10-terabyte range per node just from the local storage. If they exceed that, then we go to the distributed cache. If the data sets are small enough, then we just use object storage. So, there are at least three different ways, depending on the use case and the model you are trying to train.

Q. In a fully sharded data parallel model, there are three communication calls when compared to DDP (distributed data parallel). Does that mean it needs about three times more bandwidth?

[Seelam]: Not necessarily three times more, but you will use the network a lot more than you would in a DDP. In a DDP, or distributed data parallel model, you will not use the network at all in the forward pass. Whereas in an FSDP (fully sharded data parallel) model, you use the network in both the forward pass and the backward pass. In that sense, you use the network more, but at the same time, because you don’t have all parts of the model within your system, you need to get the model from your neighbors, which means you will be using more bandwidth. I cannot give you the 3x number; I haven’t seen the 3x, but it’s more than DDP for sure.
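For readers who want to see the DDP/FSDP distinction in code, here is a minimal sketch, assuming PyTorch in a torchrun-style multi-GPU launch; the model itself is a placeholder.

```python
# A minimal sketch contrasting the two wrappers discussed above. DDP keeps
# a full replica per rank and communicates only gradients (backward pass);
# FSDP shards parameters, so shards are all-gathered over the network in
# both the forward and backward passes.
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")  # assumes a torchrun-style launch

model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 10)).cuda()

# Pick one wrapper or the other for a given model:
# ddp_model = DDP(model)   # network traffic in backward pass only
fsdp_model = FSDP(model)   # network traffic in forward and backward passes
```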
The SNIA CSTI has an active schedule of webinars to help educate on cloud technologies. Follow us on Twitter @SNIACloud and sign up for the SNIA Matters Newsletter so that you don’t miss any.

Web 3.0 – The Future of Decentralized Storage

Decentralized storage is bridging the gap between Web 2.0 and Web 3.0, and its impact on enterprise storage is significant. The topic of decentralized storage and Web 3.0 will be the focus of an expert panel discussion the SNIA Networking Storage Forum is hosting on June 1, 2023, “Why Web 3.0 is Important to Enterprise Storage.” In this webinar, we will provide an overview of enterprise decentralized storage and explain why it is more relevant now than ever before. We will delve into the benefits and demands of decentralized storage and discuss the evolution from on-premises, to cloud, to decentralized storage (cloud 2.0). We will also explore various use cases of decentralized storage, including its role in data privacy and security and the potential for decentralized applications (dApps) and blockchain technology. Read More

Storage Threat Detection Q&A

Stealing data, compromising data, and holding data hostage have always been the main goals of cybercriminals. Threat detection and response methods continue to evolve as the bad guys become increasingly sophisticated, but for the most part, storage has been missing from the conversation. Enter “Cyberstorage,” a topic the SNIA Cloud Storage Technologies Initiative recently covered in our live webinar, “Cyberstorage and XDR: Threat Detection with a Storage Lens.” It was a fascinating look at enhancing threat detection at the storage layer. If you missed the live event, it’s available on-demand along with the presentation slides. We had some great questions from the live event as well as interesting results from our audience poll questions that we wanted to share here. Q. You mentioned antivirus scanning is redundant for threat detection in storage, but could provide value during recovery. Could you elaborate on that? Read More

Survey Says…Here are Data & Cloud Storage Trends Worth Noting

With the continuing move to cloud, application modernization, and related challenges such as hybrid and multi-cloud adoption and regulatory compliance requirements, enterprises must ensure they understand the current data and storage landscape. The SODA Foundation’s annual comprehensive global survey on data and storage trends does just that, providing a detailed look at the intersection of cloud computing, data and storage management, the configuration of environments that end-user organizations are gravitating to, and priorities of selected capabilities over the next several years. On April 13, 2023, the SNIA Cloud Storage Technologies Initiative (CSTI) is pleased to host SODA in a live webcast, “Top 12 Trends in Data and Cloud Storage,” where SODA members who led this research will share key findings. I hope you will join us for a live discussion and in-depth look at this important research to hear the trends that are driving data and storage decisions, including: Read More

Kubernetes Trials & Tribulations Q&A: Cloud, Data Center, Edge

Kubernetes cloud orchestration platforms offer flexibility, elasticity, and ease of use on-premises, in a private or public cloud, and even at the edge. The ability to turn services on when you want them and off when you don’t is an enticing prospect for developers as well as application deployment teams, but it has not been without its challenges. At our recent SNIA Cloud Storage Technologies Initiative webcast “Kubernetes Trials & Tribulations: Cloud, Data Center, Edge” our experts, Michael St-Jean and Pete Brey, discussed both the challenges and advantages of Kubernetes. If you missed the session, it is available on-demand along with the presentation slides. The live audience raised several interesting questions. Here are answers to them from our presenters.

Q: Are all these trends coming together? Where will Kubernetes be in the next 1-3 years? Read More
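As a small illustration of that turn-it-on, turn-it-off elasticity, here is a minimal sketch using the official kubernetes Python client; the deployment and namespace names are hypothetical placeholders.

```python
# A minimal sketch of elastic service scaling with the official kubernetes
# Python client. Assumes a reachable cluster and a configured kubeconfig;
# "web-frontend" and "demo" are hypothetical names.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in a pod
apps = client.AppsV1Api()

def scale(name: str, namespace: str, replicas: int) -> None:
    """Set the replica count on a Deployment's scale subresource."""
    apps.patch_namespaced_deployment_scale(
        name, namespace, {"spec": {"replicas": replicas}}
    )

scale("web-frontend", "demo", 0)  # switch the service off when idle
scale("web-frontend", "demo", 3)  # and back on when it is needed again
```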