How Edge Data is Impacting AI

AI is disrupting a wide range of domains and industries, and as it does, AI models and algorithms are becoming increasingly large and complex. This complexity is driven by the growing size and diversity of data generated and stored locally at the edge, which creates the need for a unified data fabric and/or federated learning. It could be argued that whoever wins the data race will win the AI race, an argument that rests on two premises: 1) data is available in a central location where AI has full access to it, and 2) compute is centralized and abundant.
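To make the federated-learning idea concrete, here is a minimal, illustrative Python sketch of federated averaging across a few edge sites. It is not from the webinar; the toy linear model, synthetic site data, and round count are assumptions chosen purely for illustration.

```python
# Minimal federated-averaging sketch (illustrative only; the model, data,
# and number of sites are hypothetical, not a production federated stack).
import numpy as np

def local_update(weights, local_data, lr=0.01):
    """One toy gradient step on a site's private data (linear model, MSE loss)."""
    X, y = local_data
    grad = 2 * X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

def federated_average(site_weights, site_sizes):
    """Aggregate per-site weights, weighting each site by its sample count."""
    total = sum(site_sizes)
    return sum(w * (n / total) for w, n in zip(site_weights, site_sizes))

# Three edge sites, each keeping its raw data local.
rng = np.random.default_rng(0)
sites = [(rng.normal(size=(100, 4)), rng.normal(size=100)) for _ in range(3)]
global_w = np.zeros(4)

for _ in range(10):  # communication rounds
    updates = [local_update(global_w, d) for d in sites]
    global_w = federated_average(updates, [len(d[1]) for d in sites])
```

Each site trains on data that never leaves it; only model weights travel between rounds, which is the property that makes distributed edge data usable without centralizing it.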

The impact of edge AI is the topic of our next SNIA Cloud Storage Technologies Initiative (CSTI) live webinar, “Why Distributed Edge Data is the Future of AI,” on October 3, 2023. Centralized (or cloud) AI is a single superpower and super expert, whereas edge AI is a community of many smart wizards whose cumulative knowledge can outweigh that central superpower. In this webinar, our SNIA experts will discuss: Read More

Training Deep Learning Models Q&A

The estimated impact of Deep Learning (DL) across all industries cannot be overstated. In fact, analysts predict deep learning will account for the majority of cloud workloads, and that training of deep learning models will represent the majority of server applications in the next few years. It’s the topic the SNIA Cloud Storage Technologies Initiative (CSTI) discussed at our webinar “Training Deep Learning Models in the Cloud.” If you missed the live event, it’s available on-demand at the SNIA Educational Library, where you can also download the presentation slides. The audience asked our expert presenters, Milind Pandit from Habana Labs (an Intel company) and Seetharami Seelam from IBM, several interesting questions. Here are their answers:

Q. Where do you think most of the AI will run, especially training? Will it be in the public cloud, or will it be on-premises, or both?

[Milind]: It’s probably going to be a mix. There are advantages to using the public cloud, especially because it’s pay as you go. So, when experimenting with new models, new innovations, and new uses of AI, and when scaling deployments, it makes a lot of sense. But there are still a lot of data privacy concerns, and there are increasing numbers of regulations regarding where data needs to reside physically and in which geographies. Because of that, many organizations are deciding to build out their own data centers, and once they have large-scale training or inference successfully underway, they often find it cost effective to migrate their public cloud deployment into a data center where they can control the cost and other aspects of data management.

[Seelam]: I concur with Milind. We are seeing a pattern of dual approaches. There are some small companies that don’t have the capital, expertise, or teams necessary to acquire GPU-based servers and deploy them; they are increasingly adopting public cloud, and we are seeing some decent-sized companies adopting the same approach as well. Keep in mind these GPU servers tend to be very power hungry, so you need the right floor plan, power, cooling, and so forth. Public cloud definitely gives you easy access and lets you pay for only what you consume. We are also seeing trends where certain organizations have constraints that restrict moving certain data outside their walls; in those scenarios, we see customers deploy GPU systems on-premises. I don’t think it’s going to be one or the other. It is going to be a combination of both, but adopting a common platform technology will help unify the usage model in public cloud and on-premises.

Q. What is GDR? You mentioned using it with RoCE.

[Seelam]: GDR stands for GPUDirect RDMA. There are several ways a GPU on one node can communicate with a GPU on another node; there are at least three. First, the GPU can use TCP, where GPU data is copied back into the CPU, which orchestrates the communication with the CPU and GPU on the other node; that obviously adds a lot of latency going through the whole TCP protocol stack. Second, communication can go through RoCEv2 or RDMA, where CPUs, FPGAs, and/or GPUs talk to each other through industry-standard RDMA channels, so you send and receive data without the added latency of traditional networking software layers. Third is GDR, where a GPU on one node talks to a GPU on another node directly through the network interfaces, again bypassing traditional networking software layers.
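To see why the transport choice is largely invisible to training code, here is a minimal, hedged PyTorch sketch of a multi-node all-reduce. The ranks, world size, and rendezvous environment are placeholders supplied by whatever launcher you use; whether NCCL moves the data over plain TCP sockets, RoCEv2, or GPUDirect RDMA is decided by the fabric and NCCL settings, not by this code.

```python
# Minimal multi-node all-reduce sketch (illustrative; ranks, world size, and
# rendezvous address are placeholders set by the launcher, e.g. torchrun).
# The application code is the same whether NCCL moves data over TCP sockets,
# RoCEv2, or GPUDirect RDMA; commonly used NCCL knobs include, for example:
#   NCCL_IB_DISABLE=1        -> fall back to plain TCP sockets
#   NCCL_NET_GDR_LEVEL=SYS   -> allow GPUDirect RDMA system-wide
# (consult the NCCL documentation for your version)
import os
import torch
import torch.distributed as dist

def main():
    rank = int(os.environ["RANK"])            # set by the launcher
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])

    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

    # Each GPU contributes a gradient-sized tensor; all_reduce sums them
    # across every GPU on every node.
    grad = torch.ones(1024, device="cuda") * rank
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```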
Q. When you are talking about RoCE, do you mean RoCEv2?

[Seelam]: That is correct, I’m talking only about RoCEv2. Thank you for the clarification.

Q. Can you comment on storage needs for DL training? Have you considered the use of scale-out cloud storage services for deep learning training? If so, what are the challenges and issues?

[Milind]: The storage needs are 1) massive and 2) dependent on the kind of training that you’re doing (data parallel versus model parallel). With different optimizations, you will need parts of your data to be local in many circumstances. It’s not always possible to do efficient training when data is physically remote and there’s a large latency in accessing it. Some sort of caching infrastructure will be required in order for your training to proceed efficiently. Seelam may have other thoughts on scale-out approaches for training data.

[Seelam]: Yes, absolutely, I agree 100%. Unfortunately, there is no silver bullet to address the data problem with large-scale training, so we take a three-pronged approach. Predominantly, we recommend users put their data in object storage, and that becomes the source where all the data lives. Many training jobs, especially those that deal with text data, don’t tend to be huge in size because these are all characters, so we use the object store as a source directly to read the data and feed the GPUs for training. That’s one model of training, but it only works for relatively small data sets; the data gets cached after the first access because it is sharded nicely, so you don’t have to go back to the data source many times. For other data sets the volume is larger. If you’re dealing with pictures, video, or similar training domains, we adopt a two-pronged approach. In one scenario we have a distributed cache mechanism where the end users have a copy of the data in a file system, and that becomes the source for AI training. In the other scenario, we deploy the system with sufficient local storage and ask users to copy the data into that local storage, which then acts as a local cache: as training proceeds, data is cached on the local drive on first access, and subsequent iterations read it from that cache. This cache is much bigger than local memory, about 12 terabytes of local cache storage versus roughly 1.5 terabytes of memory, so we can serve data sets in the 10-terabyte range per node just from local storage. If they exceed that, we go to the distributed cache; if the data sets are small enough, we just use object storage. So, there are at least three different ways, depending on the use case and the model you are trying to train.

Q. In a fully sharded data parallel model, there are three communication calls when compared to DDP (distributed data parallel). Does that mean it needs about three times more bandwidth?

[Seelam]: Not necessarily three times more, but you will use the network a lot more than you would in DDP. In a DDP, or distributed data parallel, model you will not use the network at all in the forward pass, whereas in an FSDP (fully sharded data parallel) model you use the network in both the forward pass and the backward pass. In that sense you use the network more, and because you don’t have all parts of the model within your system, you need to get them from your neighbors, which means you will be using more bandwidth. I cannot give you the 3x number; I haven’t seen the 3x, but it’s more than DDP for sure. (A minimal code sketch contrasting the two wrappers follows at the end of this post.)

The SNIA CSTI has an active schedule of webinars to help educate on cloud technologies. Follow us on Twitter @SNIACloud and sign up for the SNIA Matters Newsletter, so that you don’t miss any.
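Here is that sketch: a minimal, illustrative PyTorch comparison of DDP and FSDP wrapping. The model, layer sizes, and launcher environment are placeholders, not a recipe from the webinar; the comments summarize the communication difference described above.

```python
# Illustrative DDP-vs-FSDP wrapping sketch (model, sizes, and launcher
# environment are placeholders). DDP keeps a full parameter replica on every
# GPU and only all-reduces gradients in the backward pass; FSDP shards
# parameters across ranks, so it all-gathers the shards in the forward pass
# and again in the backward pass, then reduce-scatters gradients -- which is
# where the extra network traffic comes from.
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def wrap_model(local_rank: int, use_fsdp: bool) -> nn.Module:
    dist.init_process_group(backend="nccl")   # assumes torchrun-style env vars
    torch.cuda.set_device(local_rank)

    model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(),
                          nn.Linear(4096, 4096)).cuda()

    if use_fsdp:
        # Sharded parameters: network used in forward AND backward passes.
        return FSDP(model)
    # Full replica: network used only for the backward-pass gradient all-reduce.
    return DDP(model, device_ids=[local_rank])
```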

Digital Twins Q&A

A digital twin (DT) is a virtual representation of an object, system or process that spans its lifecycle, is updated from real-time data, and uses simulation, machine learning and reasoning to help decision-making. Digital twins can be used to help answer what-if AI-analytics questions, yield insights on business objectives and make recommendations on how to control or improve outcomes. It’s a fascinating technology that the SNIA Cloud Storage Technologies Initiative (CSTI) discussed at our live webcast “Journey to the Center of Massive Data: Digital Twins.” If you missed the presentation, you can watch it on-demand and access a PDF of the slides at the SNIA Educational Library. Our audience asked several interesting questions which are answered here in this blog. Q. Will a digital twin make the physical twin more or less secure? Read More

You’ve Been Framed! An Overview of Programming Frameworks

With the emergence of GPUs, xPUs (DPU, IPU, FAC, NAPU, etc.) and computational storage devices for host offload and accelerated processing, a wild west of frameworks is emerging, each vying to be among the preferred programming software stacks that best integrate the application layer with these underlying processing units. On October 26, 2022, the SNIA Networking Storage Forum will break down what’s happening in the world of frameworks in our live webcast, “You’ve Been Framed! An Overview of xPU, GPU & Computational Storage Programming Frameworks.” We’ve convened an impressive group of experts who will provide an overview of programming frameworks that support: Read More

5G Industrial Private Networks and Edge Data Pipelines

The convergence of 5G, Edge Compute and Artificial Intelligence (AI) promises to be a catalyst for continued digital transformation. For many industries, it will be a game-changer in terms of how business is conducted. On January 27, 2022, the SNIA Cloud Storage Technologies Initiative (CSTI) will take on this topic at our live webcast “5G Industrial Private Networks and Edge Data Pipelines.” Advanced 5G is specifically designed to address the needs of verticals with capabilities like enhanced mobile broadband (eMBB), ultra-reliable low-latency communications (URLLC), and massive machine-type communications (mMTC) to enable near real-time distributed intelligence applications. Examples include automated guided vehicles and autonomous mobile robots (AGVs/AMRs), wireless cameras, augmented reality for connected workers, and smart sensors across many verticals ranging from healthcare and immersive media to factory automation. Read More

Storage for AI Q&A

What types of storage are needed for different aspects of AI? That was one of the many topics covered in our SNIA Networking Storage Forum (NSF) webcast “Storage for AI Applications.” It was a fascinating discussion, and I encourage you to check it out on-demand. Our panel of experts answered many questions during the live roundtable Q&A. Here are answers to those questions, as well as to the ones we didn’t have time to address.

Q. What are the different data set sizes and workloads in AI/ML in terms of data set size, sequential/random access, and write/read mix?

A. Data sets vary incredibly from use case to use case. They may range from GBs to possibly hundreds of PBs. In general, the workloads are very heavily reads, perhaps 95%+. While sequential reads would be preferable, in practice the access patterns tend to be closer to random. In addition, different use cases will have very different item sizes; some may be GBs large, while others may be under 1 KB. These different sizes have a direct impact on storage performance and may change how you decide to store the data. Read More
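As a rough way to picture that access pattern (this is not a benchmark from the webcast; the file list, 4 KB transfer size, and 95/5 split below are placeholders), here is a small Python sketch that issues a mostly-read, random-offset workload across a set of files.

```python
# Toy sketch of the workload shape described above (illustrative only):
# ~95% random reads over a set of items with widely varying sizes.
import os
import random

def simulate_ai_io(files, n_ops=10_000, read_fraction=0.95, seed=0):
    """Issue mostly random reads (with occasional writes) across a file set."""
    rng = random.Random(seed)
    for _ in range(n_ops):
        path = rng.choice(files)                  # random, not sequential, access
        if rng.random() < read_fraction:
            size = os.path.getsize(path)
            with open(path, "rb") as f:
                f.seek(rng.randrange(max(size, 1)))   # random offset
                f.read(4096)                      # small read; real items vary widely
        else:
            with open(path, "ab") as f:           # rare write (labels, checkpoints)
                f.write(b"\0" * 4096)

# Hypothetical usage: simulate_ai_io(["data/shard-000.bin", "data/shard-001.bin"])
```

In practice you would characterize such a workload with a real benchmark tool, but the shape of the loop mirrors the read-heavy, near-random pattern the panel described.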

Storage for Applications Webcast Series

Everyone enjoys having storage that is fast, reliable, scalable, and affordable. But it turns out different applications have different storage needs in terms of I/O requirements, capacity, data sharing, and security. Some need local storage, some need a centralized storage array, and others need distributed storage, which itself could be local or networked. One application might excel with block storage while another performs best with file or object storage. For example, an OLTP database might require small amounts of very fast flash storage; a media or streaming application might need vast quantities of inexpensive disk storage with extra security safeguards; while a third application might require a mix of different storage tiers with multiple servers sharing the same data. This SNIA Networking Storage Forum “Storage for Applications” webcast series will cover the storage requirements of specific uses such as artificial intelligence (AI), database, cloud, media & entertainment, automotive, edge, and more. With limited resources, it’s important to understand the storage needs of your applications in order to choose the right storage and storage networking strategy, rather than discovering the hard way that you’ve chosen the wrong solution for your application. We kick off this series on October 5, 2020 with “Storage for AI Applications.” AI is a technology that itself encompasses a broad range of use cases, largely divided into training and inference. Read More

Can Cloud Storage and Big Data Live Happily Ever After?

“Big Data” has pushed the storage envelope, creating a seemingly perfect relationship with Cloud Storage. But local storage is the third wheel in this relationship, and it won’t go down easily. Can this marriage survive when Big Data is being pulled in two directions? Should Big Data pick one, or can the three of them live happily ever after? This will be the topic of discussion on October 21, 2021 at our live SNIA Cloud Storage Technologies webcast, “Cloud Storage and Big Data, A Marriage Made in the Clouds.” Join us as our SNIA experts cover: Read More

Q&A on the Ethics of AI

Earlier this month, the SNIA Cloud Storage Technologies Initiative (CSTI) hosted an intriguing discussion on the Ethics of Artificial Intelligence (AI). Our experts, Rob Enderle, Founder of The Enderle Group, and Eric Hibbard, Chair of the SNIA Security Technical Work Group, shared their experiences and insights on what it takes to keep AI ethical. If you missed the live event, it is available on-demand along with the presentation slides at the SNIA Educational Library. As promised during the live event, our experts have provided written answers to the questions from this session, many of which we did not have time to get to. Q. The webcast cited a few areas where AI as an attacker could make a potential cyber breach worse. Are there also areas where AI as a defender could make cybersecurity or general welfare more dangerous for humans? Read More

Cloud Analytics Drives Airplanes-as-a-Service Business

On-demand flying through an app sounds like something for only the rich and famous, yet the use of cloud analytics is making flexible flying a reality at start-up airline KinectAir. On April 7, 2021, the CTO of KinectAir, Ben Howard, will join the SNIA Cloud Storage Technologies Initiative (CSTI) for a fascinating discussion on first-hand experiences of leveraging cloud analytics methods to bring competitive and profitable new business models to life. And since start-up companies may not have legacy data and analytics to consider, we’ll also explore what established businesses using traditional analytics methods can learn from this use case. Join us on April 7th for our live webcast, “Adapting Cloud Analytics for Practical Business Use,” for views from both start-up and established companies on how to revisit the analytics decision process, with a discussion on: Read More