Podcast: AI and its impact on data storage

We talk to Shawn Rosemarin, vice-president for R&D at Pure Storage, about the requirements of AI workloads, and what data storage needs to cope with AI

In this podcast, we look at the impact of artificial intelligence (AI) processing on data storage with Shawn Rosemarin, vice-president for R&D in customer engineering at Pure Storage.

We talk about how AI turns enterprise data into a vital source of insight for the business, but also about the challenges it brings: the complexity of AI operations, the need for data portability, rapid storage access and the ability to extend capacity to the cloud.

Rosemarin also talks about the particular forms of data found in AI, such as vectors and checkpoints, and the need for dense, fast, sustainable and easy-to-manage data storage infrastructure.

Antony Adshead: What’s different about AI workloads?

Shawn Rosemarin: I think the most interesting part of this is, first of all, let’s align AI with the next iteration of analytics.

We saw business intelligence. We saw analytics. We saw what we called modern analytics. Now, we’re seeing AI.

What’s different is ultimately now we’re looking at a corpus of data, not just the general corpus like we look at in ChatGPT, but the individual corpuses of data within each enterprise actually now becoming essentially the gold that gets harvested into these models; the libraries that now train all of these models.

And so, when you think about the volume of data that represents, that’s one element. The other thing is you now have to think about the performance element of actually taking all these volumes of data and being able to learn from them.

Then you’ve got another element which says, ‘I’ve got to integrate all of those data sources across all the different silos of my organisation: not just data that’s sitting on-premises, but data that’s sitting in the cloud, data that I’m buying from third-party sources, data that’s sitting in SaaS [software as a service]’.

And lastly, I would say there’s a huge human element to this. This is a new technology. It’s quite complex at this particular point in time, although we all believe it will be standardised, and it’s going to require staffing. It’s going to require skill sets that most organisations don’t have at their fingertips.

Adshead: What does storage need to cope with AI workloads?

Rosemarin: At the end of the day, when we think of the evolution of storage, we’ve seen a couple of things.

First of all, there’s no doubt, I think, in anyone’s mind at this point, that hard drives are pretty much going the way of the dodo. And we’re moving on to all flash, for reasons of reliability, for reasons of performance, for reasons of, ultimately, environmental economics.

But, when we think about storage, the biggest obstacle in AI is actually moving storage around. It’s taking blocks of storage and moving them to satisfy certain high-performance workloads.

What we really want is a central storage architecture that can be used not just for the gathering of information, but the training, and the interpretation of that training in the marketplace.

Ultimately, what I’m talking to you about is performance to feed hungry GPUs. We’re talking about latency, so that when we’re running inference models, our consumers are getting answers as quickly as they possibly can without waiting. We’re talking about capacity and scale. We’re talking about non-disruptive upgrades and expansions.

As our needs change and these services become more important to our users, we don’t have to bring down the environment just to be able to add additional storage.

Last but not least would be the cloud consumption element: the ability to easily extend those volumes to the cloud if we wish to do that training or inference in the cloud. And then, obviously, consuming them as a service: getting away from these massive CapEx injections up front and instead looking to consume the storage that we need as we need it, 100% via service level agreements and as-a-service.

Adshead: Is there anything about the ways in which data is held for AI, such as the use of vectors, checkpointing or the frameworks used in AI like TensorFlow and PyTorch, that dictates how we need to hold data in storage for AI?

Rosemarin: Yeah, absolutely it does, especially if we compare it with the way storage has been used historically in relational databases or data protection.

When you think about vector databases, when you think about all of the AI frameworks, and you think about how these datasets are being fed to GPUs, let me give you an analogy.

In essence, if you think of the GPUs, these very expensive investments that enterprises and clouds have made, think of them as PhD students. Think of them as very expensive, very talented, very smart folks who work in your environment. And what you want to do is ensure they always have something to do, and more importantly, that as they complete their work, you’re there to collect that work and ensure you’re bringing the next volume of work to them.

And so, in the AI world, you’ll hear this concept of vector databases and checkpoints. What that essentially says is, ‘I’m moving from a relational database to a vector database’. And essentially, as my information is getting queried, it’s getting queried across multiple dynamics.

We call these parameters, but essentially we’re looking at the data from all angles. And the GPUs are telling storage what they’ve looked at and where they are in their particular workload.

The impact on storage is that it does force significantly more writes. And when you think of reads versus writes, those are very important from a performance profile. When you think about the writes in particular, these are very small writes. These are essentially bookmarks of where they are in their work.

And that is actually forcing a very different performance profile than what many have been used to. It is building new performance profiles for what we’re considering specifically in training.

Now, inference is all about latency, and training is all about IOPS. But to answer your question very specifically, this is forcing a much higher write ratio than we have traditionally looked at. And I would suggest to your audience that looking at 80% writes, 20% reads in a training environment is much more appropriate than the 50/50 we would traditionally have looked at.
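
To put that shift in mix into numbers, here is a minimal Python sketch that splits a total IOPS budget into writes and reads for the two ratios Rosemarin mentions. The 100,000 IOPS total is a made-up figure chosen only for illustration, and the function name `split_iops` is hypothetical; neither comes from the podcast.

```python
def split_iops(total_iops, write_fraction):
    """Split a total IOPS budget into (write_iops, read_iops)."""
    if not 0.0 <= write_fraction <= 1.0:
        raise ValueError("write_fraction must be between 0 and 1")
    write_iops = round(total_iops * write_fraction)
    return write_iops, total_iops - write_iops


# Hypothetical total budget, chosen only to make the arithmetic easy to follow.
TOTAL_IOPS = 100_000

# The 80% write / 20% read mix suggested for training environments.
training_mix = split_iops(TOTAL_IOPS, 0.80)

# The 50/50 mix described as the traditional starting point.
traditional_mix = split_iops(TOTAL_IOPS, 0.50)

for label, (writes, reads) in (("training 80/20", training_mix),
                               ("traditional 50/50", traditional_mix)):
    print(f"{label}: {writes} write IOPS, {reads} read IOPS")
```

Under these assumed numbers, the same total budget has to absorb considerably more writes in the training case, which is the change in performance profile described above.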

Adshead: What do you think enterprise storage is going to look like in five years as AI increases in use?

Rosemarin: I like to think of storage a little bit like the tyres on your car.

Right now, everybody’s very focused on the chassis of their car. They’re very focused on the GPUs and the performance, and how fast they can go and what they can deliver.

But the reality is, the real value in all of this is the data that you’re mining; the quality of that data, the use of that data in these training models to actually give you an advantage – be it personalisation and marketing, be it high-frequency trading or know-your-customer if you’re a bank, be it patient care within a healthcare facility.

When we look to the future of storage, I think storage will be recognised and acknowledged for being absolutely critical in driving the end value of these AI projects.

I think clearly what we’re seeing is denser and denser storage arrays. Here at Pure, we’ve already committed to the market that we’ll have 300TB drives by 2026. I think we’re seeing the commodity solid-state drive industry significantly behind that. I think they’re aiming for about 100TB in the same time frame, but I think we’ll continue to see denser and denser drives.

I think we’ll also see, in tandem with that density, lower and lower energy consumption. There’s no doubt that energy and access to energy is the silent killer in the build-out of AI, so getting to a point where we can consume less energy to drive more computing will be crucial.

Lastly, I would get to this point of autonomous storage. Putting less and less energy – human energy, human manpower – into the day-to-day operations, the upgrades, the expansions, the tuning of storage is really what enterprises are asking for, to ultimately allow them to focus their human energy on building out the systems of tomorrow.

So, when you think about it, really: density, energy efficiency and simplicity.

Then, I think you’ll continue to see the cost per gigabyte, the cost per TB, fall in the marketplace, allowing for more and more of the consumerisation of storage and allowing organisations to actually light up more and more of their data for the same amount of investment.
