
Data has been driving business operations for many years, but that’s about to change in a profound way. Agentic AI, the next step in the evolution of artificial intelligence (AI) applications, takes generative AI beyond producing answers to natural language questions.
This new kind of AI automates operations such as:
Handling of customer service requests – Executing workflows for processing returns, scheduling service visits, preventing fraud, and escalating exceptional cases to humans in the loop.
Automatic adjustment of inventory – Predicting demand and balancing inventory across the supply chain to avoid excess inventory and shipping inefficiencies.
Execution of market trades in real time – Whether you trade securities, currencies, or raw materials, automating market transactions can give your company a first-mover advantage and higher profits as market changes are detected.
Increasing manufacturing efficiency – Detecting equipment issues and adjusting production rates, scheduling maintenance, and ordering replacement parts.
By automating workflows like these, businesses can significantly increase operational agility, scalability, and profitability. Gartner predicts that by 2028, 15% of day-to-day work decisions will be made autonomously via agentic AI.
The Four A’s of Responsible AI
Agentic AI systems work by using generative AI to create plans to achieve specific goals, and then executing those plans to automate operational tasks. As business operations become driven by agentic AI systems, businesses will need to ensure the four A’s of responsible AI automation:
Actual-time – Agentic AI depends on real-time interactions, so fresh data and fast responses are key to relevance.
Accuracy – Agentic AI systems must produce accurate results, or the business may fail to execute properly.
Always On – Systems must run continuously, or AI-automated operations cease to function.
Auditability – A record of the AI system inputs and outputs, and the automated actions they precipitate, must be auditable to deal with customer disputes and regulatory compliance.
Observability Is Fundamental to Responsible AI
No responsible business should let AI drive operations without observability. Real-time observability is the cornerstone for ensuring the four A’s of responsible AI. Systems that observe the behavior of agentic AI applications at a very detailed level can deliver the visibility this requires.
Whether developed in-house or designed by an observability software vendor, the systems must collect the following information about your agentic AI applications:
System Metrics – Information about system and infrastructure status, such as how much RAM and storage are in use at any given moment, how busy the server CPUs are, how many users are connected to the system, or how many LLM calls are made per second. Many crashes occur because software tries to use more resources than the computing infrastructure provides. Observing metrics can make this clear and even alert you in time to avoid failures, which helps ensure always-on performance.
Logs – Information about what system resources were accessed by a user or machine. For example, what queries were submitted by which users, and what responses were returned by an LLM. Logs facilitate auditing, troubleshooting, and improving accuracy.
Traces – Information about the execution of each programming step in the application software: what inputs were received, how the inputs were transformed or acted upon, and what outputs were passed to the next step. Traces are essential to debugging agentic AI decision making and workflow execution.
Lineage – Visibility into the chain of programs that originated or processed data during its journey from source to final destination. Lineage should also record the identities of the developers who created each program. If an issue arises, lineage helps avoid finger pointing and gets the best person working on the fix as soon as possible.
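To make the four kinds of information concrete, here is a minimal Python sketch of one request emitting a metric, a log, a trace, and a lineage record that all share a single ID. Everything in it is an assumption for illustration: the Signal record, the fake_llm stand-in, the field names, and the "team-a" owner are hypothetical, not any vendor's API.

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict


@dataclass
class Signal:
    """One observability record; the field names are illustrative only."""
    kind: str       # "metric", "log", "trace", or "lineage"
    trace_id: str   # shared ID that ties together the records of one request
    name: str
    payload: dict
    ts: float = field(default_factory=time.time)


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call."""
    return f"echo: {prompt}"


def observed_llm_call(prompt: str, user: str, sink: list) -> str:
    """Call the model and append one record of each signal type to sink."""
    trace_id = uuid.uuid4().hex
    start = time.perf_counter()
    answer = fake_llm(prompt)
    elapsed_ms = (time.perf_counter() - start) * 1000.0

    # Metric: how the system behaved (here, call latency).
    sink.append(Signal("metric", trace_id, "llm_latency_ms", {"value": elapsed_ms}))
    # Log: who submitted what, and what came back.
    sink.append(Signal("log", trace_id, "llm_request",
                       {"user": user, "prompt": prompt, "response": answer}))
    # Trace: input and output of this execution step.
    sink.append(Signal("trace", trace_id, "step:fake_llm",
                       {"input": prompt, "output": answer}))
    # Lineage: which program produced the data and who owns it (made-up owner).
    sink.append(Signal("lineage", trace_id, "produced_by",
                       {"program": "fake_llm", "owner": "team-a"}))
    return answer


records: list = []
reply = observed_llm_call("Where is my order?", user="u-123", sink=records)
for record in records:
    print(json.dumps(asdict(record)))
```

Because every record carries the same trace_id and a timestamp, the four signals for a request can later be tied back together.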
Avoid These Agentic AI Observability Pitfalls
Don’t Confuse Monitoring and Observability
Many people make the mistake of equating system monitoring with observability. Monitoring systems typically capture only system metrics and raise a flag when something may have gone wrong. These flags usually concern problems that have already happened and for which a metric has already been defined. They don’t help you understand why the problem occurred. Logs, traces, and lineage are also required to maximize actual-time response, accuracy, always-on performance, and auditability.
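A monitoring-only check can be sketched in a few lines of Python. The threshold value and the check_memory function are hypothetical, chosen only to show how little a metric alert carries by itself.

```python
from typing import Optional

MEMORY_ALERT_THRESHOLD = 0.90  # hypothetical: alert above 90% memory use


def check_memory(used_fraction: float, ts: str) -> Optional[str]:
    """Return an alert message if the metric crosses the threshold, else None."""
    if used_fraction > MEMORY_ALERT_THRESHOLD:
        return f"[{ts}] ALERT memory at {used_fraction:.0%}"
    return None


alert = check_memory(0.97, "14:29:50")
print(alert)  # [14:29:50] ALERT memory at 97%
```

The alert says what happened and when, but nothing about which request, code path, or program caused it. That gap is what logs, traces, and lineage fill.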
Make Sure Metrics, Logs, and Traces Are Correlated
Collecting metrics, logs, traces, and data lineage is only one part of the observability picture. The data must also be correlated, along a timeline at a minimum, and preferably not collected separately by different observability systems. Correlation lets developers and auditors easily see the root causes of events. For example, metrics can tell you that a system crash at 2:30 PM was preceded by a memory fault. If the program traces and logs are correlated with the metrics timeline, it can be easy to discover that an error in the code execution trace caused the memory fault. And with lineage data, you can see who developed the faulty code and assign them to fix it.
If your observability data is uncorrelated, developers can correlate it manually, but it makes troubleshooting harder and more time-consuming.
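As a rough illustration of timeline correlation, the Python sketch below merges three separately collected streams into one timeline and lists what happened in the 30 seconds before a crash. The events, their timestamps, and the resize_buffer step name are invented to mirror the 2:30 PM example above. Real systems would typically also join on a shared request or trace ID.

```python
import heapq
from datetime import datetime, timedelta


def t(hhmmss: str) -> datetime:
    """Parse a time of day; the date part is irrelevant for this sketch."""
    return datetime.strptime(hhmmss, "%H:%M:%S")


# Each stream is already sorted by time, as if collected by a separate tool.
metrics = [
    (t("14:29:50"), "metric", "memory fault"),
    (t("14:30:00"), "metric", "system crash"),
]
logs = [(t("14:29:40"), "log", "user u-123 submitted query")]
traces = [(t("14:29:45"), "trace", "error in step resize_buffer")]

# Merge the sorted streams into one timeline ordered by timestamp.
timeline = list(heapq.merge(metrics, logs, traces))


def events_before(timeline, event_text, window):
    """Return the events inside the window leading up to the named event."""
    anchor = next(ts for ts, _, text in timeline if text == event_text)
    return [e for e in timeline if anchor - window <= e[0] < anchor]


preceding = events_before(timeline, "system crash", timedelta(seconds=30))
for ts, kind, text in preceding:
    print(ts.time(), kind, text)
```

Read top to bottom, the merged view shows the query, then the trace error, then the memory fault, all ahead of the crash. This is the kind of sequence a developer would otherwise have to piece together by hand.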
Real-Time Processing Requires Specialized Observability
Many agentic AI applications are driven by real-time data that streams from connected machinery, point of sale (PoS) devices, and e-commerce and customer service sites. Streaming data presents unique observability challenges because it can be unpredictable. This data can be incomplete, can get corrupted as it travels through the network, and can arrive out of order. In addition, most streaming data is ephemeral; it is processed and then often discarded.
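Two of these problems, incomplete and out-of-order events, can be flagged as the data flows past. The Python sketch below does so under simplifying assumptions: the required fields, the sample events, and the rule that an event is out of order if its event time is older than the newest one already seen are all hypothetical.

```python
from typing import Iterable, Iterator, Tuple

REQUIRED_FIELDS = ("device_id", "event_time", "value")  # hypothetical schema


def classify(events: Iterable[dict]) -> Iterator[Tuple[str, dict]]:
    """Yield (status, event): 'ok', 'incomplete', or 'out_of_order'."""
    latest_seen = float("-inf")  # newest event time observed so far
    for event in events:
        # Incomplete: a required field is missing or empty.
        if any(event.get(name) is None for name in REQUIRED_FIELDS):
            yield ("incomplete", event)
            continue
        # Out of order: older than an event that has already been processed.
        if event["event_time"] < latest_seen:
            yield ("out_of_order", event)
        else:
            latest_seen = event["event_time"]
            yield ("ok", event)


stream = [
    {"device_id": "pos-1", "event_time": 100, "value": 9.5},
    {"device_id": "pos-1", "event_time": 105, "value": 9.7},
    {"device_id": "pos-2", "event_time": 102, "value": 3.1},  # arrived late
    {"device_id": "pos-1", "event_time": 110},                # missing value
]

statuses = [status for status, _ in classify(stream)]
print(statuses)  # ['ok', 'ok', 'out_of_order', 'incomplete']
```

Because streaming data is often discarded after processing, checks like these have to record their findings as the events pass through. There may be no stored copy to inspect afterward.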
Open-source Apache Flink is a popular choice for stream data processing. It enables the analysis, summarization, cleansing and enrichment of data as it flows to your agentic AI applications, giving them the data and understanding they need to take action. It’s vitally important to ensure you’re capturing and correlating metrics, logs, traces, and data lineage from Flink programs, allowing Flink teams to detect, isolate, and resolve issues quickly.
Closing Thoughts
Agentic AI is a transformative technology that holds the potential to automate many business use cases. The more dependent on AI a business becomes, the more it is expected to ensure accuracy, availability, and auditability of agentic AI systems. Observability is your first line of assurance that reveals the inputs and outputs of agentic AI applications and the large language models (LLMs) that drive them. If your organization actively uses AI, or it’s on your technology roadmap, then begin work today on your real-time and AI observability strategy based on correlated AI application metrics, logs, traces, and lineage.