Synopsis: In this AI Leadership Insights video interview, Amanda Razani speaks with Steven Hillion, SVP of data and AI at Astronomer, about data orchestration in the age of AI.
Amanda Razani: Hello, I’m Amanda Razani with Techstrong.ai, and I’m excited to be here today with Steven Hillion. He is the senior vice president of data and AI at Astronomer. How are you doing today?
Steven Hillion: I’m great. Hey there, Amanda. How are you doing?
Amanda Razani: Doing well. Can you tell me about Astronomer? And what services do you provide?
Steven Hillion: Yeah, for sure. We’re the commercial developer behind the data platform, Apache Airflow. I’m not sure if you’ve heard of Apache Airflow. It’s pretty widely known. It was created by Airbnb some years ago to handle all of their data pipelines. It quickly grew into, really, the backbone of their business. And they decided to open source it so that they could get the benefit of community contributions.
And it’s quickly become ubiquitous. It really is the de facto standard for orchestrating data pipelines throughout modern organizations and gets something like 12 to 15 million downloads every month. It’s used by most companies, really, to manage the flow of data throughout the company. And we’re the commercial developer behind it. We provide a cloud service so that you can easily run Airflow on our infrastructure, lights out. You don’t have to worry about maintaining it, you just worry about actually producing data. That’s really the service that we provide.
Amanda Razani: Okay. Well, that is a great segue into our topic today, which is data orchestration in the age of AI. From your experience, what are you hearing or seeing from business leaders when it comes to AI and implementing that as far as the data orchestration?
Steven Hillion: Maybe let me unpack just those initials, AI, artificial intelligence. It means so many different things to different people. In a sense, it’s an umbrella term for the use of advanced analytics to make predictions and to generate new insights. These days, I think people often associate it with still more advanced methods, those based on neural networks and deep learning, and especially, of course, lately the use of generative AI and large language models and so on.
If I think just broadly about machine learning, I think in the last 15, 20 years, it’s really become a commonplace tool for doing business at a level beyond traditional business intelligence. And the flow of data to supply machine learning models is critical, and so orchestrating those data pipelines has just become part of what now people call MLOps, the operationalization of machine learning pipelines. That’s essential for most modern businesses. And at Astronomer, we see that Airflow is not just the default platform for doing data engineering, as if that wasn’t a big enough job, but is also really becoming the default platform for orchestrating machine learning.
Now, what’s been interesting to me over, really, the last couple of years, especially the last few months as generative AI and large language models and their applications have really increased exponentially, is the degree to which Airflow has also become dominant there as well. Andreessen Horowitz put us right at the heart of their reference stack for managing the flow of data for generative AI and other advanced applications. And so if you regard orchestration as key to the way that you get value out of your data with machine learning and AI, then Airflow really sits at the heart of that. Orchestration is key. I don’t think you can just use machine learning models in isolation, you have to have them wired into your business processes. And that’s really what orchestration is. How do you get machine learning models and generative AI to produce value to your consumers, to the website, to your business applications for compliance and so on? Well, you do that through orchestration.
Amanda Razani: What are some of the roadblocks or the barriers that you’re seeing business leaders come upon when they’re trying to implement that?
Steven Hillion: Well, first of all, I think that traditionally data scientists and machine learning teams have sat somewhat apart from the data ops, the DevOps folks. If you think about the way that software engineers develop applications, there’s a pretty standard process around that for taking your code and pushing that to production. You would never think about developing applications separately from production concerns. Developers are constantly thinking about how they integrate and deploy their applications. That’s the whole point of DevOps and CI/CD processes. That’s just now de rigueur; that is standard.
And yet data scientists still sometimes will be developing separately in notebooks and will then throw those over the wall to the ops team or sometimes just never end up deploying them. That, to me, is the biggest hurdle. That’s been true for far too long, that there’s a little bit of a war between what the data scientists and machine learning teams do and what the ops folks do. There’s the rise now of the persona of MLOps and ML engineers, and so I think there are some best practices starting to be established around this. But it is still fairly uncommon that organizations have implemented those best practices and really have a tight cycle around the deployment of machine learning models.
Amanda Razani: Breaking down those silos, that’s still a big issue.
Steven Hillion: I think so. I think it’s critical because in the end, if you’re developing models in a notebook, well that’s cute, but if they’re not powering your website, if they’re not making product recommendations, if they’re not preventing fraud and running lights out to do that, then what are you doing, really? That process of orchestrating from the ingestion of data from your website and from your user activity to the training of models to the deployment of those models to actually make recommendations based on that user behavior to then monitoring the success of those models, if that’s not happening lights out, then you are wasting your time and you’re wasting your resources.
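Editor’s note: The four stages described above (ingestion, training, deployment, monitoring) can be sketched as a small dependency graph that runs unattended. This is only an illustrative sketch in plain Python, not Airflow code and not anything Astronomer ships; the task names, the sample events, and the 0.5 health threshold are all assumptions made up for the example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def ingest(ctx):
    # Hypothetical user-activity events: (user, product, clicked)
    ctx["events"] = [
        ("u1", "shoes", 1), ("u2", "shoes", 1), ("u3", "hat", 0),
        ("u4", "hat", 1), ("u5", "shoes", 0),
    ]


def train(ctx):
    # Toy "model": click-through rate per product
    totals = {}
    for _, product, clicked in ctx["events"]:
        shown, clicks = totals.get(product, (0, 0))
        totals[product] = (shown + 1, clicks + clicked)
    ctx["model"] = {p: clicks / shown for p, (shown, clicks) in totals.items()}


def deploy(ctx):
    # "Deploying" here just means publishing the top recommendation
    ctx["recommendation"] = max(ctx["model"], key=ctx["model"].get)


def monitor(ctx):
    # Assumed health rule: the recommended product's rate must be at least 0.5
    ctx["healthy"] = ctx["model"][ctx["recommendation"]] >= 0.5


TASKS = {"ingest": ingest, "train": train, "deploy": deploy, "monitor": monitor}

# Each task maps to the set of tasks that must finish before it starts
DEPENDS_ON = {
    "ingest": set(),
    "train": {"ingest"},
    "deploy": {"train"},
    "monitor": {"deploy"},
}


def run_pipeline():
    ctx = {}
    order = list(TopologicalSorter(DEPENDS_ON).static_order())
    for name in order:
        TASKS[name](ctx)  # every stage runs without manual hand-offs
    return order, ctx


if __name__ == "__main__":
    order, ctx = run_pipeline()
    print(order, ctx["recommendation"], ctx["healthy"])
```

The point of declaring dependencies rather than calling the functions by hand is that the run order is derived from the graph, which is the basic idea an orchestrator builds on.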
Amanda Razani: I know these tools and technologies are really speeding up this process, and with that speed and efficiency, how do business leaders ensure that the data is clean, viable data?
Steven Hillion: I think it’s important, first of all, to make sure that you have centralized data pipelines and centralized standards for executing those pipelines and a set of standards around testing of your data sources. I so often will talk to data leaders, and they say secretly, “Honestly, our data engineering or our data prep for our machine learning models is the wild west. Everybody’s doing it their own way.” You have different teams who are each deploying their models in different ways. There’s no sharing of common standards and best practices around that. I think just identifying what the right platform should be and getting everybody to use that so that you get very naturally then this network effect of improvements in standards is the first ingredient in order to do that, in order to make sure that the data is of the quality that you need.
There’s plenty of tools to do this sort of thing. At Astronomer, we integrate with a number of different platforms, Great Expectations and so on, as well as just standard test suites that people implement themselves. The point is if everybody is using the same pipeline, if you like, to deploy their machine learning work, then you’re going to get people sharing the way that they enforce these standards. That, to me, is the critical thing. It’s really about having a common library, if you like, of tests that you apply.
I was chatting to somebody at a major social network, a very large data engineering team, for example, and they said they literally have hundreds of thousands of data sets that they use for monitoring the health of the business and clickthrough rates and engagement and so on. And I said, “Well, how do you know when you’re creating a new data set? Maybe it already exists if you’ve got hundreds of thousands of tables. Then maybe you don’t need to create a new clickthrough rate metric.” And I said, “How do you document this?” And he’s like, “Frankly, we don’t.” It’s just, again, the wild west. And it’s because, again, there’s no standard way of documenting and testing your data sets. And it just seems like madness to me. That’s the mission that we’re on at Astronomer, really, is to say there is already a framework for establishing a common method for your data pipelines. It’s called Airflow. And you should be using it for your machine learning as well.
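Editor’s note: The “common library of tests” idea mentioned above can be illustrated with a few reusable checks that any team applies to its data sets. This is a hedged sketch in plain Python; it does not use the Great Expectations or Airflow APIs, and the check names, the `page` and `ctr` columns, and the sample rows are hypothetical.

```python
from functools import partial


def not_null(column, rows):
    # Indices of rows where the column is missing or None
    bad = [i for i, row in enumerate(rows) if row.get(column) is None]
    return f"not_null({column})", bad


def in_range(column, low, high, rows):
    # Indices of rows whose non-null value falls outside [low, high]
    bad = [
        i for i, row in enumerate(rows)
        if row.get(column) is not None and not (low <= row[column] <= high)
    ]
    return f"in_range({column})", bad


# A shared, named suite that every clickthrough-rate table could reuse
CTR_CHECKS = [
    partial(not_null, "page"),
    partial(in_range, "ctr", 0.0, 1.0),
]


def validate(rows, checks):
    # Run every check and report only the ones that found bad rows
    report = {}
    for check in checks:
        name, bad = check(rows)
        if bad:
            report[name] = bad
    return report


if __name__ == "__main__":
    rows = [
        {"page": "home", "ctr": 0.12},
        {"page": None, "ctr": 0.30},
        {"page": "cart", "ctr": 1.7},
    ]
    print(validate(rows, CTR_CHECKS))
```

Because the suite is just a list of checks, a team creating a new table can reuse it as is instead of writing its own variant, which is the sharing effect described in the interview.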
Amanda Razani: Okay. As this technology advances very quickly, what do you foresee in the next six months to a year as it relates to the enterprise?
Steven Hillion: It’s moving so quickly, sometimes I wonder if I can even predict what’s going to happen next week. I think maybe what I would predict is in some sense, the mundane applications of generative AI… I was a little skeptical about the power of AI. It seemed to me just a glorified search engine. But no, the more I talk to our customers and to the Airflow community, I realize that they’re finding 1,000 different applications of these technologies, especially large language models, especially the ability to have enhanced support, contextual help within your applications, auto generation of documentation, code generation, well, for your data engineering pipelines and code generation for your applications and code generation for your SQL. You see these applications just sprouting everywhere amongst major software providers, but even internally within local development teams. And so everybody is quickly learning what the power of these applications is, and there are a ton of examples out there, and so I see a proliferation of those, but of course that’s going to introduce its own challenges as well.
Amanda Razani: If there’s one key takeaway you can leave the business leaders with today, what would that be?
Steven Hillion: Well, again, I think it’s about saying should my data engineering and machine learning teams be operating independently or should they be working against a common platform? Because in the end, and I almost hate to say this to my data science friends and to the machine learning engineers even on my own team, but the two disciplines are not actually different. You’re both processing data in order to generate valuable metrics and insights and actionable models and actionable data. It’s the processing and cleansing of data, the generation of valuable features and the production of useful actionable insights out of that. It’s the manipulation of data along the pipeline.
And in my mind, the discipline that has accrued over the last few years in data engineering, data engineers are getting really good about producing their pipelines, needs to spread to the machine learning side of the house as well. And do you want two platforms for doing that, do you want a million platforms for doing that, or do you want one that encourages, again, the spread of best practice? That would be my advice is look at how your machine learning developers create their models, look at how your data engineers create their data sets and say, “Are these fundamentally different exercises? And should they be using different technologies?” And I think the answer’s clear.
Amanda Razani: Definitely something they need to consider. Thank you so much for coming on our show today and sharing your insights with us.
Steven Hillion: Yeah, my pleasure. Thanks very much.
Amanda Razani: Thank you. Look forward to speaking with you again soon.