AI Leadership Insights: Managing Data Pipelines

Synopsis: In this Techstrong AI Leadership Insights video, Hammerspace CEO David Flynn explains why managing data pipelines is going to be a bigger challenge as IT teams deploy more large language models (LLMs) to drive artificial intelligence (AI) applications.

Mike Vizard: Hello and welcome to the latest edition of the Techstrong AI video series. I’m your host, Mike Vizard. Today we’re with David Flynn, who is CEO for Hammerspace, and we’re talking about how we’re going to get all these data pipelines constructed for all these large language models we hope to deploy one day because well, they don’t just magically appear. David, welcome to the show.

David Flynn: Hey. Thanks for having me, Michael. Really appreciate it.

Mike Vizard: Let’s get started first with, well, how many LLMs do we think that the average enterprise is going to be trying to work with and how big are these things ultimately going to be? Because some people, I think they have it in their heads that each one is petabytes maybe, but turns out maybe they’re in terabytes, but what’s the current state of the thing?

David Flynn: Yeah. Well, it’s true that with LLMs, language models, text is rather compact, so there’s not a whole lot of space taken up there. It’s when we start going multimodal that I think we see the very large capacity points show up. So I think you have a very valid point that text-only isn’t as demanding of storage capacity.

Mike Vizard: To be blunt about it, I don’t think we’ve been very good at data management historically in most organizations. And I think maybe AI is going to force that issue.

David Flynn: I like to say that AI is causing a reckoning across the enterprise with the traditional model of store and copy as a way to serve and preserve data. The store and copy model is inherently a silo, and so AI is forcing us to introduce agility and performance levels that were not attainable using the traditional storage architectures and the store and copy management model.

Mike Vizard: So what needs to replace the store and copy model, because that’s all we’ve known?

David Flynn: Well, data orchestration. And data orchestration is not a fancy new word for data management. It’s when the movement of data is performed behind the facade of the data presentation, so that data can be continuously presented without interruption even while the data moves across infrastructure, and not just across different storage systems or services, but even across whole data centers. We have to get to where data can move without disrupting the accessibility of the data. And that is the binary thing. Right now, data is moved from outside and so it inherently disrupts the ongoing access to the storage because you’re moving to different storage. It’s actually different data. So of course you have to interrupt the access to point to different data. So the real trick is how do you solve the seeming paradox: how can you have data local to different data centers, local on different infrastructure clusters, but without ever having copied it?
Now think about that. How can you have data move without ever having copied it? Well, the answer is once the metadata is singular and separated from the data, now the data can move even while you keep a singular view of the data. So this is a fundamentally different architecture where it takes the data presentation layer, what in tech we call a file system, and takes that out of the infrastructure and builds an overlay file system. This isn’t just a catalog and copy. Cataloging things from the outside is just another form of copy. So we’re not talking about building a catalog and copying stuff around and then requiring that you access through the catalog just to then access through the underlying file system.
Adding another layer of access isn’t the solution. We have to take the most fundamental layer, the file system, and pull that out and put it atop everything, building a whole new type of file system actually. And then behind that, you can now make the data movement hidden, and that’s what we call data orchestration. That new file system is Hyperscale NAS, a hyper scalable NAS. And once you have that, then the data movement function can be made non-disruptive and transparent to ongoing access. And that’s data orchestration. So these two things come hand in glove.
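
The idea described above, a single metadata view kept separate from where the bytes live, can be illustrated with a toy Python sketch. This is an assumption-laden illustration, not Hammerspace's implementation: the `Namespace` class, the backend names, and the path are all hypothetical, and a real system would also have to handle concurrency, failures, and in-flight I/O.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Namespace:
    """Toy metadata layer: a stable logical path maps to wherever the bytes live."""
    # logical path -> name of the backend currently holding the bytes
    location: Dict[str, str] = field(default_factory=dict)
    # backend name -> {logical path -> bytes}; stands in for real storage systems
    backends: Dict[str, Dict[str, bytes]] = field(default_factory=dict)

    def write(self, path: str, data: bytes, backend: str) -> None:
        self.backends.setdefault(backend, {})[path] = data
        self.location[path] = backend

    def read(self, path: str) -> bytes:
        # Callers only use the logical path; placement is looked up in metadata.
        return self.backends[self.location[path]][path]

    def move(self, path: str, target: str) -> None:
        """Relocate the bytes, then flip the metadata pointer; the path never changes."""
        source = self.location[path]
        if source == target:
            return
        data = self.backends[source][path]
        self.backends.setdefault(target, {})[path] = data  # stage at the target
        self.location[path] = target                       # switch the pointer
        del self.backends[source][path]                    # drop the old instance


ns = Namespace()
ns.write("/datasets/train/shard-000", b"tokens", backend="nas-site-a")
before = ns.read("/datasets/train/shard-000")
ns.move("/datasets/train/shard-000", target="object-store-site-b")
after = ns.read("/datasets/train/shard-000")
print(before == after, ns.location["/datasets/train/shard-000"])
```

The reader keeps using the same path before and after the move; only the metadata entry changes, which is the property the overlay file system is described as providing.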

Mike Vizard: How dynamic is that going to be? Because when I was much younger, they taught me that moving data was always a career threatening proposition because that could happen. So how is the way we think about managing data changing and needs to change?

David Flynn: Well, I like to learn from consumer electronics what technologies and what approaches might end up disrupting the world of enterprise. And that was actually what led to creating my previous company, Fusion-io, and a realization that solid state had the potential to transform data centers and move away from hard drives. Because in the world of consumer electronics, we had already been using NAND flash solid state storage for quite a while, and it took a long time before you had solid state devices that were capable of disrupting the enterprise disk and disk array business.
Well, similarly, we are already in a data orchestrated world when it comes to our consumer data. You don’t think twice when you move from one cell phone to the next. You don’t think twice when you move between your cell phone and your laptop and your tablet. All of your personal data is orchestrated for you to where you have the appearance of everything everywhere without it being a copy. Now that’s because the iOS platform and Android platform have done the orchestration for you. Is it because the data’s stored in the cloud, or maybe there’s a backup in the cloud and the data’s on the device, or a combination of the two? Maybe it’s held on a server. You don’t really care. The data is physically distributed to where it’s more robust, it’s more permanent than the device in your hand.
And the ubiquitous connectivity of that device, and the software that leverages that connectivity to make sure the data is physically distributed safely, gives you the appearance of all of your stuff on all of your devices without having to do that data migration process. So how do we get to that point with enterprise IT, with petabyte and even exabyte scale data sets, to where the data simply exists in a way that’s just as independent of these enterprise storage systems as your data is independent of your cell phone, and more permanent? And that’s the irony: once you put data in motion, once data is presumed to be orchestrated for you and physically distributed, it actually becomes not just more readily accessible, not just higher performance, but more permanent, because it can outlive any of those storage things.
We get rid of the whole data wrangling and copying, the career-ending task of copying stuff. It’s crazy that data migrations and copying are still such a manual process. It’s even crazy that it’s a manual process to decide what type of storage to put your data sets on. I mean, you have to negotiate. The app owner has to sit down with the CFO and sit down with the infrastructure team and decide, “Okay. Are we going to go file? Are we going to go object? Whose storage system, where are we going to put it?” The repercussions of that decision last for so long because once your data is in that store, you become a prisoner to it. You’re owned by it, because the data’s very existence is a rendering of that storage system. And that’s the problem. We have to make it so the data is not a rendering of the storage system, but the data’s existence transcends that.
And that’s what Hyperscale NAS and data orchestration give you: a data presentation layer that’s now independent of all infrastructure and yet able to move data across any of it, so that you can have that appearance of everything being potentially anywhere and everywhere without having to think about it. And by the way, it’s the same thing. Data centers now have massive pipes connecting them. We have plenty of raw capability. We just haven’t had the software layer to give the data its identity, its existence independent of the storage system. We hadn’t pulled the file system out of the infrastructure. Anyhow, that’s my story and I’m sticking to it.

Mike Vizard: Are we replacing the existing storage systems we have, or is this a new capability that we’re adding on top of what we already have? And how do we get from where we are today to where we need to be tomorrow?

David Flynn: Well, that has become a really interesting story. First of all, this has to be an overlay and it has to be able to use all different forms of existing storage. You have to be able to go into brownfield opportunities and overlay. You can think of it as an abstraction layer where you’re abstracting the data from the very storage storing it. So it is an overlay, and that allows you to go into brownfield but also greenfield opportunities and choose whatever storage you want, including just commodity off-the-shelf hardware. It doesn’t have to be expensive storage systems.
But when you overlay, the interesting dynamic here is that by introducing the abstraction layer and automating the data movement, you’re reducing the friction to having data move. And you would think that, “Okay. That’s going to allow me to evolve more quickly. It’s going to allow me to get off of this old stuff and get onto this new stuff and to go to the next generation more quickly.” And while that’s true, the real net effect is that it lets you continue to use your sunk cost investments for longer, because now you can delay, delay, delay when you need to move off of them, and when you do need to move off, it’s not disruptive. So it actually has the net effect of letting you continue to utilize your sunk cost investments much longer than you otherwise would have.
So once you get rid of the friction for data moving, now you can maximize the value potential of the existing investments and mix in newer design points, higher performance capabilities at the same time. That’s an interesting paradox: making it easy to evolve the infrastructure means that you don’t have to evolve it as soon. And I’ll give you a very concrete example of that. With Hammerspace, you can put it in front of any of your existing storage infrastructure and get GPUDirect capability. And now you can feed your GPU farms with the utmost of performance and have the bulk of your data where it always was, on the existing storage. So it is a very profound impact to not be saying you have to forklift upgrade and migrate all your stuff over here just to be able to feed it into the GPUs, because that’s what other vendors would tell you: “Put in our shiny new silo and we’ll be the silo to end all silos and we’re capable of feeding GPUDirect.” That’s not our story. Our story is use what you have and augment the capability.

Mike Vizard: So who’s in charge of making this transition? Because we see on the AI side, there’s all these data scientists running around, we see data engineers trying to work with them, and then there’s all the traditional storage folks in the IT department.

David Flynn: The infrastructure guys. Yeah.

Mike Vizard: So who’s taking the lead here?

David Flynn: Well, I’d say that how these different groups have related to each other is one of the biggest handicaps of having data be storage centric in how it’s served and managed. When data is a platform-level thing, it’s a higher-level abstraction, but when the data is being rendered by the infrastructure, where its existence is a mirage and it’s really just something that the infrastructure is serving to you, you’ve flipped that. Now infrastructure’s on top and platform is subordinate, and the net impact of that is organizational: those who are working at a platform layer or even an app layer are not able to operate without the infrastructure being the one to wear the pants, frankly, with the whole thing. And you can tell that in a very simple way. I would ask you to consider how the data is laid out, how you organize, what data sets are next to other data sets in the same directory or even in the same file system, how we group our data.
Note that how the data is grouped and organized is first and foremost about how we’re going to put it on the infrastructure. That’s the tail wagging the dog. The data ought to be organized from the perspective of how it’s going to be used and how the people using it want to see it and perceive it, not how it’s going to be packed on infrastructure. So it has very real repercussions that data is infrastructure bound, and it compounds and confuses the relationship between the app and the platform folks and the infrastructure teams. What they really need is an abstraction layer so that data can move freely and continue to exist under the same identity, even though it’s moving between whole different classes of storage and between whole different facilities.

Mike Vizard: Is this like the revenge of the days when we used to optimize where data was put on a disk to get the best performance, so now we need to figure this out at a much higher level of scale?

David Flynn: It’s very insightful for you to think of it that way, because file systems were born inside of the OS on a local machine, on a PC or mainframe. That file system’s whole function was to map and manage the layout of data onto a disk. And then it wasn’t one disk, it was a RAID set of many disks, and then it wasn’t a RAID set, it was an array of disks. And then we took the file system out of a single OS and we put it into an appliance on the network so that all the OSes could share that same perception, that same logical view of data, and thus was born NAS. And okay, that progression needs to be taken to its logical conclusion, where the thing that maps and manages data is now mapping and managing data across all forms of infrastructure, across the full range of different design points in storage, and across different physical geographies, different data centers and facilities.
It’s the same purpose though, mapping and managing data. And it has to supply the most foundational of interfaces, a POSIX file system, not some object store, not some high-level web-y thing. You’ve got to be able to load your programs off of it. It has to be a real data supply to OSes. And that has never happened before. File systems never made that leap to truly be all encompassing until Hyperscale NAS and data orchestration from Hammerspace. And this is what we talk about as a global data environment. It’s where the file system, the data presentation layer that’s mapping and managing data across anything and moving it around, is now independent of all of those infrastructure choices. Yeah. So this is a new class of file system, entirely new.
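
As a rough illustration of what "the most foundational of interfaces" means from an application's point of view, the Python sketch below uses only POSIX-style calls (open, write, fsync, fstat, read, close) through the standard `os` module. The `mount_point` here is a temporary directory standing in for a mounted file system, and the file name is hypothetical; nothing in the code identifies which storage system sits underneath.

```python
import os
import tempfile

# Stand-in for a mount point such as a NAS export; a temp directory is used
# so the sketch runs anywhere. The file name is hypothetical.
mount_point = tempfile.mkdtemp()
path = os.path.join(mount_point, "model-config.txt")

# Plain POSIX-style calls: open, write, fsync, close.
fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
os.write(fd, b"batch_size=32\n")
os.fsync(fd)
os.close(fd)

# Read it back with open/fstat/read; nothing here names a storage system.
fd = os.open(path, os.O_RDONLY)
size = os.fstat(fd).st_size
content = os.read(fd, size)
os.close(fd)
print(size, content)
```

If the same path were served from a different backend, this code would not need to change, which is the point of keeping the interface at the file system level.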

Mike Vizard: So do I have to go find some semi-retired IT professional and pull them off a beach somewhere to remind everybody of this stuff or is there some intelligent way to go about getting everybody to relearn this?

David Flynn: Well, the beauty of it is what we did was we took NAS, network attached storage, shared networked file systems, and we used that same protocol. Just like we don’t ask the OSes to do something other than POSIX, so the OS knows natively how to read and write data at the richest, highest performance levels foundational to the OS, by that same token, this is NAS, network attached storage. It’s the protocols that everybody knows, and it comes built in and is plug and play. It’s not something like a file system in the supercomputing world, where it’s like trying to raise and care for unicorns. You’ve got to hire a whole team of staff just to know how to take care of these magical creatures.
Folks need stuff that is turnkey, plug and play and just works. And that’s why what we’ve done is we’ve used the existing NAS protocols and enhanced them, and built the stuff into the OSes, into Linux in particular. One of the things that has made what we’re doing possible is that Linux, from a performance and scale-out perspective, won the Unix wars. And while people still use desktops and workstations with a variety of different OSes like Mac and Windows, when it comes to the data center and scaling out compute, when it comes to AI and LLMs, it’s Linux. So we’ve been able to put these capabilities into Linux so that any Linux admin can use the standard process for connecting to network storage. But now they’re connecting to a networked Hyperscale NAS data orchestrator that’s capable of moving things across whole data centers and across different storage systems and services.

Mike Vizard: So do you think that this is going to be the year of not so much the benefits of AI, but more the year of operationalizing AI because we have so much of this data out there that we need to figure out how to organize before we feed it into these LLMs and then start building all these awesome apps?

David Flynn: Well, like I said, it’s a day of reckoning for the industry because we were living with the crutch that an application put its data set onto a piece of storage. Another application put its data on another piece of storage. There was a one-to-one relationship between the app, its data and its storage. With AI, we can now extract value from any data, and that data is generally coming from other applications. So now we have a cross product problem. It’s not just one model. It’s potentially many models. It’s not just one training session. We might do retraining and, as we continue to get new information, do more fine-tuning. So now it becomes an ETL problem: extract, transform, load. It becomes: how do I take data that used to be in this tightly coupled relationship between an app and its storage silo, collect it up from many different apps across their many different storage silos, and feed it into potentially many different models in different facilities?
Because the other angle is this is bound to hardware, and not a small amount of hardware. That hardware has to have a physical location. You’ve got to plug a power plant into the thing, because if you’re running hundreds of GPUs or thousands of GPUs, this is a serious affair. Getting your hands on GPUs is hard enough. So I like to say it’s gone from a one-to-one relationship to a many to many to many. Many data sets from their different applications need to go into many different models. And that needs to happen in many different physical locations just to have access to the compute resources that those models take. So that’s the reckoning: it’s not one-to-one, it’s a cross product of many to many. So manually copying crap around between file systems simply won’t cut it.
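
The "many to many to many" point can be made concrete with a small Python sketch that counts the data flows in each model. The data set, model, and site names below are made up for illustration; the only claim is the arithmetic of the cross product.

```python
from itertools import product

# Hypothetical names, purely for illustration.
datasets = ["crm-exports", "support-tickets", "design-docs"]
models = ["chat-assistant", "code-helper"]
sites = ["dc-east", "cloud-gpu-west"]

# Old world: each app's data lives on its own storage, one flow per data set.
one_to_one = len(datasets)

# New world: any data set may feed any model at any site.
flows = list(product(datasets, models, sites))

print(one_to_one, len(flows))
```

With three data sets, two models, and two sites, the one-to-one world has three flows while the cross product has twelve, and each added model or site multiplies that count rather than adding to it.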

Mike Vizard: All right, folks. You heard it here. Despite the rise of AI, the laws of physics have not been suspended when it comes to data. So we need to figure out another approach and a way to orchestrate all this stuff. David, thanks for being on the show.

David Flynn: Really appreciate it. Thanks for having me, Mike.

Mike Vizard: Thank you all for watching the latest episode of the Techstrong AI series. You can find this episode and others on our website. We invite you to check them all out. Until then, we’ll see you next time.
