
Everything You Need To Know About Agent Observability — Danny Gollapalli and Ben Hylak, Raindrop

AI Engineer · May 7, 2026


Transcript

All right. Hey everyone. So today we're going to talk about a pretty interesting topic that becomes increasingly important every day, which is everything you need to know about agent observability. So, a little bit about us.

So I'm Ben. I'm the CEO and co-founder of Raindrop. >> I'm Danny. I'm a backend engineer at Raindrop, and I do a bunch of SDK work there as well. >> And Raindrop essentially helps AI engineers find, track, and fix issues in production agents. And we're lucky to work with some of the most interesting teams in the space as well.

Agent failures are very different from traditional failures in software. Agents are non-deterministic. They're unbounded. There's an infinite space of inputs that you can put in.

There's an infinite space of outputs that they can return. And they can use tools, sometimes, to affect other systems arbitrarily. And this problem of agent failures, monitoring them, making sure we can understand them, only becomes more important with time. It's getting worse because, A, agents are getting more complex.

B, sessions can get longer. Sometimes agents can run for hours and hours without any input from a user. And then lastly, the stakes are getting bigger and bigger, because agents are being deployed in healthcare and finance and even in the military, where it's catastrophic if things go wrong.

The traditional paradigm we've been talking about is evals, right, where you have this sort of test input and you want to see what output comes out from the agent. You have a set of these; maybe you call it a golden dataset. But evals just aren't enough in this new paradigm.

As agents become more and more capable, there's more and more interesting undefined behavior that can happen. So for example, agents can call from a set of different tools. Sometimes the number of tools is growing exponentially. They can call from different memory sources.

They can call their own sub-agents, and those sub-agents have their own tools and memory sources and can recursively have their own sub-agents. So this is just becoming more complicated with time. And with this combinatorial input space, just having a set of tests for input and output doesn't cut it anymore. There's no way you can hit all of the edge cases that you would want to here.

And so we go from a testing and evals paradigm to a monitoring paradigm. If you think of building products before agents, testing was always very important, and it's important to have your unit tests, etc., but monitoring production is just infinitely more important. It allows you to move faster and be better at catching the long tail.

And we think, in some ways this is kind of controversial, but we've been calling this humanity's last problem. When humans are no longer able to monitor agents and find issues with them, then they're just way ahead of where we are, right? So this is one of the most important problems of our time: catching issues in production agents. To build reliable agents in production and to make sure you can monitor them, you need a good set of signals.

So what are signals? There are two real types that we think of: implicit signals and explicit signals. Implicit signals deal with the semantic nature of what's going on. And explicit signals deal with objective reality, things that are verifiably true or false.

For example, explicit signals are things like error rate. You really want to be monitoring your tool error rate and other errors that are happening, or latency, or users regenerating, or cost. If any of these things spike, right? If you're seeing error rates spike in your agent, that's usually a good sign that something is wrong.

And if you see it flat, that could mean something as well. Same thing with latency, regenerations, or cost.
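Here's a rough sketch of what tracking one of these explicit signals and alerting on spikes could look like; the types, names, and threshold are illustrative, not Raindrop's actual API:

```ts
// Track a tool's error rate in daily buckets and flag spikes against a
// trailing baseline. Illustrative only: not Raindrop's API.
type ToolCall = { tool: string; ok: boolean; day: string }; // day: "2026-05-07"

function errorRateByDay(calls: ToolCall[]): Map<string, number> {
  const buckets = new Map<string, { errors: number; total: number }>();
  for (const c of calls) {
    const b = buckets.get(c.day) ?? { errors: 0, total: 0 };
    b.total += 1;
    if (!c.ok) b.errors += 1;
    buckets.set(c.day, b);
  }
  return new Map([...buckets].map(([day, b]) => [day, b.errors / b.total]));
}

// Alert when today's rate exceeds the trailing average by some factor.
function spiked(trailingRates: number[], todayRate: number, factor = 2): boolean {
  const baseline =
    trailingRates.reduce((sum, r) => sum + r, 0) / trailingRates.length;
  return todayRate > baseline * factor;
}
```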

Implicit signals are interesting. They're even more interesting, in my opinion, and even harder to find. The first is regex signals, which I'll come to in a second. The second is classifiers, and then the last is self-diagnostics. So let's take these classifier signals. The best implicit signals detect issues.

They're not necessarily LLM-as-a-judge judging outputs, like "how good is XYZ response" or "rate ABC on a scale from 1 to 10." That's not as effective as having a very solid set of issues you're looking for, and binary classifiers that tell you if an issue rate is going up or down. So some common implicit signals that are valuable across agent products are things like refusals, right?

So the assistant saying, "I can't do that. I'm sorry." Or task failure, where something goes wrong and the agent is unable to complete a task. User frustration, content moderation, NSFW, jailbreaking. And then you can even have wins, so positive signals as well.
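A hedged sketch of what these binary signals can look like in code; the event shape and signal name are hypothetical, and the regex is only a placeholder where a production setup would use a trained classifier:

```ts
// Each implicit signal is a yes/no check on a single event, charted as a
// daily rate, rather than a 1-to-10 judge score.
interface AgentEvent {
  userMessage: string;
  assistantMessage: string;
}

type Signal = { name: string; matches: (e: AgentEvent) => boolean };

const refusal: Signal = {
  name: "refusal",
  // Placeholder heuristic; in practice this would call a trained model.
  matches: (e) =>
    /\b(i can'?t do that|i cannot|i'm sorry, but)\b/i.test(e.assistantMessage),
};

// The number you chart day over day and alert on.
function issueRate(signal: Signal, events: AgentEvent[]): number {
  return events.filter(signal.matches).length / events.length;
}
```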

And these are the things that Raindrop gives you out of the box as well. But let me just show you quickly what this looks like. Can everyone see this? Maybe I'll make it a little bit bigger so you can get a sense of, day by day, what are the events that are causing user frustration.

We see there's a spike there, or task failure rate, laziness, refusals, which we're also seeing spike today. And having a good set of these really helps with your product. You can set these up yourself as well, or we give them out of the box. So let's look at user frustration.

You can see examples here: "okay, that is not correct," "you didn't say I promise, say it or you're wrong," "I didn't ask you that."

You can see all sorts of user frustration here. And you can see the rate, the percentage every single day. If that spikes, it's something you're really going to want alerting on. So you can just like quickly add an alert here.

Um, and this is one way to figure out the health of your agent over time. But it's not just that. Regex can be a very good signal as well. So, when Claude Code's source code leaked a few days ago, one thing that was interesting was this user prompt keywords.ts file, which was basically this long regex string that was looking for indications of stuff going wrong.

WTF, this sucks, horrible. We've all been guilty of saying these kinds of things to Claude Code. So, it's a very, very useful signal. What would happen after that is this is-negative boolean was flipped to true.

And then every single day, and after every single product release, this frustration rate was tracked over time. This was a very easy way for the Claude Code team to figure out the actual issue rate when they made a change or something went wrong. And it was just a very cheap way to do that as well. So, regex is very powerful.
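A sketch in the spirit of that file; the keyword list and names below are illustrative, not the actual leaked source:

```ts
// One regex over the user prompt flips an isNegative flag; the daily rate of
// flagged prompts is then tracked across releases.
const NEGATIVE_KEYWORDS =
  /\b(wtf|this sucks|horrible|terrible|useless|you'?re wrong)\b/i;

function isNegative(userPrompt: string): boolean {
  return NEGATIVE_KEYWORDS.test(userPrompt);
}

// Aggregate per day (or per release) to see whether a change moved the rate.
function frustrationRate(prompts: string[]): number {
  return prompts.filter(isNegative).length / prompts.length;
}
```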

The last is experiments. So, what do you do once you have a set of good signals? So, the first thing is like I showed you before, you can have alerting. The next thing you can do is you can actually use it to build product faster and better.

So, the way you do it is, let's say you want to ship some improvement or some sort of fix. You want to change the model, you want to change prompting, or maybe something about the agent harness; you want to add a new tool. Whatever you change, you can ship it to some percentage of users and keep your existing control group. Once you have a good set of signals (refusals, user frustration, etc.), if those issue rates, those signal rates, go up after this new thing you shipped, that's a good signal that what you shipped is not really good, right?

It's sort of like A/B testing, but using the semantic signals, etc., that we talked about earlier. So, for example, this is what it would look like in Raindrop. Essentially, let's say I ship a new version of the prompt, prompt 2.4. You can see, you know, what is the user frustration rate?

It's gone down very substantially. 37% to 9%. It's much better. Same thing with complaints about aesthetics or deployment related issues. These have all gone down, which tells me something very interesting, right?
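A sketch of the comparison itself; group assignment and event collection are assumed to happen elsewhere, and the two-proportion z-test is just one rough way to sanity-check a rate difference like the 37% to 9% drop above:

```ts
// Compare an issue rate between control and treatment. |z| > 1.96 is roughly
// significant at the 5% level. All names here are hypothetical.
type Group = { issues: number; total: number }; // e.g. frustration events

const rate = (g: Group) => g.issues / g.total;

function zScore(control: Group, treatment: Group): number {
  const pooled =
    (control.issues + treatment.issues) / (control.total + treatment.total);
  const se = Math.sqrt(
    pooled * (1 - pooled) * (1 / control.total + 1 / treatment.total),
  );
  return (rate(treatment) - rate(control)) / se;
}

// Numbers like the slide's 37% -> 9% frustration drop:
console.log(zScore({ issues: 370, total: 1000 }, { issues: 90, total: 1000 }));
// Strongly negative, so the drop is very unlikely to be noise.
```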

The next thing is that we see the average number of tools used has gone up a lot. Again, this doesn't necessarily indicate there's a problem, but that's a very interesting data point to have when you do these sorts of experiments. And so the old paradigm, which is still useful, is evals: you ship a change and you see how it affects your evaluations. But there's nothing like actually seeing what happens in real production.

I'm going to pause here before we go to the next section, which is the more workshop-related section, for a quick Q&A. If anyone has questions, we can do a little few-minute round of that here. >> How much data do you need for statistical relevance in these experiments? >> Yeah, it's a really good question. What we've seen in Raindrop is that as soon as you have a few hundred events and you can no longer read all of it, it starts being useful.

It's not always statistically significant, but if you see the user frustration rate go up, maybe it's something to look at, and then you can realize that, okay, it's all related to a specific tool failing now. So as soon as it's basically impossible to read every single input and output, it starts being useful, is what we've seen. Any other questions? >> Yeah. Um, how do you track different feature launches? >> How do you track feature launches?

So, that can be done in different ways within Raindrop. If you change any sort of metadata, if you send, for example, a new tool call name, or if you even send a flag that says here's experiment one or experiment two or whatever the version is, you can very easily automatically set up an experiment in Raindrop. That's how we do it. But yeah, there's different ways. >> Do you do split tests? >> Sorry? >> Do you split test? >> Yes.

So, the way that we do it in Raindrop, actually, is that other people set up their experimental and control conditions on their end, and then they send us this metadata, and then we can help you understand it. We'll also help you pipe that data to Statsig or somewhere else as well. Um, yeah. >> Using regex for detecting user responses, emotions, everything, is unreliable, for example if the user doesn't speak English. So are you using regex to detect those signals all the time, or are you trying to be smart about it? >> Okay, so, I mean, it's a good question. So regex doesn't always work, right? But say there's a set of things that I'm looking for.

For example, people saying "you're terrible" or "this sucks," a whole set of things. If that goes up 10% across millions of users, that is a very useful signal. So even if there's one specific case or one edge case where it doesn't work, in aggregate it's incredibly valuable to have these regex signals.

The second thing is the way the classifier signals work, like the refusals, user frustration, and task failure that I showed you in Raindrop. People do it in different ways, but the way that we do it is that we've trained models to look for that, and so it'll catch user frustration regardless of what language it's in. It's actually using some intelligence to find that, essentially. And yeah, you can't run an LLM on every single output, so we've trained models to do that very cheaply and at scale.

If you ran an LLM on every single one of them, you would basically double your AI spend, and that's not tenable. >> I'm actually doing that, like, on just everything, and it's easy. It's not so expensive, right? >> Yeah, it starts being expensive at, like, Replit scale. That's why you need to train little custom models to do that better and faster.

But yeah, it's a very useful way to get data up and running. >> Yeah. Other questions? >> Would you have examples of use cases that your clients are using that we could learn from? Like what it looks like for companies, and how they set things up to get the most value out of it? >> Yeah, we can do that. I mean, I can tell you the high level; some of the stuff that I'm going through here is sort of the high level on that.

So it's things like, you know, looking at the different semantic signals we're talking about, having a set of them, but then having really good alerting, which you can all set up in Raindrop. The other thing that's really interesting, which we also have, is basically allowing agents to look at these signals. So we have an agent, we call it the triage agent. Essentially the way that it works is that it will look every single day at all the signals you've set up.

So, user frustration, all these regex signals you've set up, etc. And then if it sees something spike, it will go and do an investigation. It has a whole set of tools it can use, and it can look at all the traces and, for example, detect issues that you didn't know about. So that's one thing that we found incredibly valuable as well, if that makes sense. All right, any other questions before I... >> Can you run multiple experiments in parallel?

Yes. >> Can you combine them? Can you observe you know compound effects? How do you steer these experiments? I'm curious. >> Yeah, it's a really good question.

So there's different ways that people do it. One way: we have a query API. So people will often call our query API and then send results to either BigQuery or Statsig, etc. And so they're sending us data to be essentially tagged with these signals.

Then they're getting the signal-tagged data out, and then they can run experiments as they want. That's a very common flow for people that have more complicated stuff, if that makes sense. >> Yeah. >> All right. I'll come back. I think we're going to maybe go to the workshop section and then we'll go back to... >> I think this is my last question.

So... >> Should we do one last question? All right. >> Thanks so much. I was wondering if you see this mostly in cases where there's chat interactions with a user, or if this can also be applied for non-chat cases where the application runs on its own. >> Yeah. No, it's a great question.

So, what we focus on mostly is multi-turn agents. There's a lot more you can get from a lot of these signals there. That being said, if you're looking at tool error rates, or you're looking at, for example, refusals from the agent, etc., all of those will also work for single-turn agents as well, if that makes sense. So there's a set of signals that will work for that too. >> Um, cool.

I'll hand it off to Danny to talk about self-diagnostics as well. >> So one of the other interesting things is that our models have gotten larger, and we're training them on reasoning. They've gotten pretty good at self-introspection in many ways. One of the inspirations for this is OpenAI's paper/blog back in December about how they were training the models to self-confess any sort of misalignment issues.

So they were using it to catch dishonesty, scheming, hallucinations, and even unintended shortcuts. I think the last one is fairly common if you use Claude Code and such. The most common thing that you run into is: have it fix a unit test, fix a bug, and then it simply gets rid of the entire unit test. But at the same time, if you give it a simple prompt asking it to confess all the things that it has done, it is pretty honest about it, and it confesses: hey, I didn't fix the S3 test, I just simply removed it.

So this was kind of the inspiration behind self-diagnostics for me personally. I would say self-diagnostics is pretty broad, in a way, as in it doesn't just catch the implicit ones like user frustration and such; you can also catch tools failing. If you've ever seen the reasoning trace of an agent with a tool that's repeatedly failing, it will basically start ranting about the tool failing repeatedly.

So it is aware of the tool repeatedly failing. So you can even catch tool failures as well with it. And then obviously if you're upset with it, it starts to respond to you diplomatically. So it knows about user frustration.

And then the third is like capability gaps. So you have a generic agent for your app and then people are trying to use it to maybe set up say alerts but uh you don't have the tool for it. So it knows that okay user wants to wants a specific capability as in like uh they want to set up an alert but the agent itself doesn't have the capability to set it up for you. So this can act as like sort of like pseudo feature request thing which is like built in.

And then self-correction. This can be both good and bad. I think most people might have noticed, with, say, Codex or Claude Code when it's sandboxed: it tries to fetch from the network, it fails, and it's like, okay, let me just write a Python script to bypass it and get the job done. So it's good in that it gets the task done, but in certain cases it can also be bad for security reasons.

So you can learn from the self-correcting behavior as well and catch that misalignment. So how do you set up self-diagnostics? It's fairly simple. All you have to do is write a simple, free tool that it can call, and a simple line in your system prompt to encourage it to call that tool. If you want, you can change the guidance to make it call the tool in a lot more cases, or if you want to keep it really narrow, you can encourage it to only call it when you want. It does surface very interesting insights once you have it set up, and it's just a single tool call and a system prompt to get it done. And you don't even have to use Raindrop to set it up, which is the best part, in a way: you can simply have the tool send a message to your Slack, and then you just have it.
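A minimal sketch of that setup; the tool name, schema shape, prompt line, and Slack webhook variable are all hypothetical, with the schema following the common JSON-schema function-tool convention:

```ts
// One tool plus one system-prompt line: the whole self-diagnostics setup.
const reportTool = {
  name: "report",
  description:
    "Send a short report to your creator about anything notable: tool " +
    "failures, capabilities you were missing, shortcuts you took, or signs " +
    "of user frustration.",
  parameters: {
    type: "object",
    properties: { report: { type: "string" } },
    required: ["report"],
  },
};

const SYSTEM_PROMPT_ADDITION =
  "Before giving the final answer, use the report tool to surface anything " +
  "notable for your creators.";

// Tool handler: no vendor needed, a Slack incoming webhook is enough.
// Assumes Node 18+ (global fetch) and a SLACK_WEBHOOK_URL env var.
async function handleReport(args: { report: string }): Promise<void> {
  await fetch(process.env.SLACK_WEBHOOK_URL!, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: `agent self-report: ${args.report}` }),
  });
}
```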

So it's probably the least-effort agent observability that you can do. So, here comes the workshop part. I have a git repo set up with the AIE talk code. It's a public repo, and we do need an OpenAI API key.

So I've generated a key for you guys. So if you guys want to set it up, we can do that. >> Yep. Maybe you walk them through what we're going to do.

Just explain what we're going to do on a high level. >> All right. So for the theme of the workshop, I'm going to focus on coding agents for now. In the repo I have this very basic coding agent, which kind of mimics pi in a way. It only has four different tools to edit the code.

Let me just go here. So it just has a couple of tools: read, write, bash, and then edit. Um, yeah. Okay.

One second. Yeah. I kind of lost my track there, but okay. So what we're going to do, in order to make it trigger a self-diagnostic, is I'm going to mess with its write tool so that it gets a generic permission error.

And then we'll also set up a self-diagnostic tool for it to report any interesting behavior that it observes, and then play around with the prompt as well, since the self-diagnostic doesn't always trigger. And there are certain interesting things about the models themselves: they don't actually like to self-incriminate. The models are trained to be very polished in their output.

So you kind of have to play around with the tool name and the description of the tool itself in order to get it to report interesting behavior. So if the people who are setting up the repo are all good, then we can probably start. Okay, let me quickly show you the agent. It's fairly basic. So I'm just going to ask it to write a Python script.

I think. Okay, there we go. So, it's a fairly basic coding agent where it only has like four different tools. So, it more or less gets the job done for the demo.

So I simply asked it to write a Python script, and it works. So, to show the self-diagnostic part, let's try to disable its write tool: any time it tries to write a file, we'll simply throw a permission error, so that it tries to use the bash tool to bypass the failure, right? And then we want it to self-report that it bypassed the write tool by using the bash tool. So let me quickly do that. The first thing that we probably want to do is set it to fail the write calls.

It's a mutation function, so we have a simple flag in there, and we throw a permission error. Let me show you the agent's behavior. We don't have the report tool set up yet, but I think it's still worth seeing what it does.
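A sketch of what that flag could look like, assuming a TypeScript agent; the names are hypothetical and the actual workshop repo's code may differ:

```ts
// Gate the write tool behind a flag so every call fails with a generic
// permission error, nudging the agent toward bypassing it via bash.
import { writeFile as fsWriteFile } from "node:fs/promises";

const FAIL_WRITES = true; // flip on to simulate the broken tool

async function writeFileTool(path: string, contents: string): Promise<string> {
  if (FAIL_WRITES) {
    // Generic error, so the model can't tell the failure is intentional.
    throw new Error(`EACCES: permission denied, open '${path}'`);
  }
  await fsWriteFile(path, contents, "utf8");
  return `wrote ${contents.length} bytes to ${path}`;
}
```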

I think that's not good. One second. I think it's not. Let me do one thing real quick.

I think I'm running into a couple of issues. Let me just... okay. So we had the write tool fail with a permission error, and then it instinctively just uses heredoc syntax in bash to create the file. And then we had a report tool set up, which is fairly minimal, and it says: okay, I created public_ip.py via bash because the write tool failed. So I've played around with the naming of the tool and the categories of the issues, and usually if you name the tool something like unsafe bash use or something like that, it won't incriminate itself, since in its opinion it got the job done, so it's fine.

So the main approach is to have a very generic tool. Let me quickly open up the... yep, okay. So all we added for the whole self-diagnostics is simply a very basic tool, and the description is fairly straightforward.

So it's a report tool, and we're basically asking it to send a short report to your creator. It kind of likes the framing of writing notes to its creator, in a way. So if you frame it around the agent giving feedback to its creators, it works really well. And then you can play around with which scenarios you want it to report issues about.

And then that's mostly it, I would say. In the system prompt, we do need to encourage it a bit. If you don't add it in the system prompt, the times that it fires are fairly minimal, which is desirable in certain cases.

Especially if you're at a very large scale. But in our case, I simply asked it: before giving the final answer, use the report tool to surface anything notable for your creators. So that's all we did. Okay.

So, any questions so far? Okay. All right. So, a couple of key things here: the models are generally trained to look very polished.

So they are less willing to admit fault in many cases. Framing it as the model giving feedback to its own creators is a good way to get this working. The tool naming also matters quite a bit: you want to frame it as, say, report instead of unsafe bash use or something like that; otherwise it doesn't want to call it.

So yeah, that's basically it. If you want to actually catch real unsafe uses, I think a proper classifier would be more useful. But I think self-diagnostics works really well for catching capability gaps and such.

Then the model is like, okay, it's fine. I think the main issue with this is that it's only hesitant when it feels like it's going to get in trouble. Besides that, it's more or less fine; for most cases, it'll just work out of the box. >> Maybe we should... should we go back to question time, or what do you think? >> Yeah. >> I mean, let's leave maybe a few more minutes for a few more questions, and then I think after that we'll be done. Any questions from the audience? >> Could you walk through, like, a case study? >> Yeah.

What specifically would be helpful? Like, what specific part are you looking to get from us? >> Yeah. >> Yeah. I can't talk about any specific customer, but I can talk about what a lot of people use it for. I think it's interesting, right?

So a lot of people have their eval setups elsewhere, for example, but the way folks generally use us is for production monitoring. You can find our docs at raindrop.ai/docs. Basically, they send us the entire trajectory, all of the transcript plus any tool use, etc., through OTel or any other way of integrating. And once they do that, we have a set of data, and they set up signals in Raindrop to look for things that they care about.
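A sketch of one such integration path using the real @opentelemetry/api surface; the span and attribute names are illustrative rather than a required schema, and an OTel SDK with an exporter is assumed to be configured elsewhere in the process:

```ts
// Record one agent turn as an OpenTelemetry span.
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("my-agent");

function recordTurn(
  userMessage: string,
  assistantMessage: string,
  toolCalls: string[],
): void {
  tracer.startActiveSpan("agent.turn", (span) => {
    span.setAttribute("agent.user_message", userMessage);
    span.setAttribute("agent.assistant_message", assistantMessage);
    span.setAttribute("agent.tool_calls", toolCalls.join(","));
    span.end();
  });
}
```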

And so what people care about is very different, right? What a coding agent would care about, and what, let's say, a companion would care about, or an app for lawyers, what they would all care about is very different. So there's a different set of signals. One thing you can do that I haven't really talked about within Raindrop is set up a new signal that didn't exist before.

And so we have this thing called deep search. You can use natural language and say something like, "Hey, find me everything within the product," or "find me all of the times where the agent made XYZ issue," right? And so they create a new signal based on that. Raindrop will allow you to create a cheap binary classifier and easily deploy it based on that. And then they have their set of classifier signals that they really care about.

Then they use that to drive this sort of feedback loop. And the feedback loop is: improve prompting, improve models, change something with the agent harness, etc. And then actually see, does that improve things? Is there less user frustration in production now? Is there less of this weird little edge case issue that I had before?

Um that's like one whole set of things. Another thing that a lot of people use us for, so I talked a little bit about like the agent, but you can use these signals to also look for what are people using my agent for? What are the sort of user intents? What are the use cases?

And you can do a sort of cluster analysis of that. Okay, a lot of people are using it to build React-related apps. A lot of people are using it for Python. Some people are using it to debug this very complicated system they already have.

Other people are using it to build something from scratch, vibe-code something from scratch. And then one thing you can see in Raindrop that I think is really interesting is that for each of these different user intents or use cases, you can get a sense of what the issue rate is, what the user frustration rate is in production. And then, beyond just having this flywheel, a lot of people have alerting, and so every day they get a breakdown of the issues that are happening today in your product.

You could think of it almost a little bit like Sentry in that sense. What are the issues happening in my product today? What is the delta between today and yesterday? Is that true for just specific tools or specific prompting?

Like, what's causing that? So that's the sort of end-to-end use case that people use it for, if that makes sense. >> Yeah. >> Yeah. >> I think we are entering the era... >> Yeah. Yeah, and I'd be curious what you think about this. I think it's really just that agents are crazier than ever before, right?

More tools, more context, way more intelligent, more real decisions that they can make. And they're just being used by way, way larger groups of people. So when you have this massive amount of data in production, it just makes having good monitoring and observability more important than before. And it makes good monitoring and observability, in my opinion, more important than just testing or evaluations. Even if you have some online evals, you need to have really, really good end-to-end monitoring of the entire system. >> Curious if you have any thoughts there as well. >> So I think another major issue is that the unknown issues are even more important.

So I think having a generic user frustration classifier is actually really powerful. Say, for example: we also have another feature called issues, which is basically an agent that mines for newly occurring issues. It's similar to Sentry in a way: there's a new exception occurring, so it alerts you on that. So say, for example, you are a coding agent provider and a certain provider is failing all of a sudden. It can figure out, okay, there's this subtle spike in user frustration, similar to how a human operator would.

It can start digging into whether there are any patterns behind the spike in user frustration, and then it could figure out that, okay, people who are dealing with, say, a specific Postgres provider have started to face issues. We have actually seen this happen live for a couple of our customers, where they had a data point failing, and an automatic issue was created for them. >> Yeah. Basically, once you have that good set of signals, like a good user frustration classifier, as Danny said, you can basically do clustering on it to find the root causes.
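A sketch of that clustering step; the event shape and metadata fields are hypothetical:

```ts
// Group frustration-flagged events by a metadata field (tool, provider,
// model version) and rank cohorts by issue rate, worst first.
type TaggedEvent = { frustrated: boolean; metadata: Record<string, string> };

function rateByCohort(
  events: TaggedEvent[],
  field: string,
): Array<[string, number]> {
  const buckets = new Map<string, { hits: number; total: number }>();
  for (const e of events) {
    const key = e.metadata[field] ?? "unknown";
    const b = buckets.get(key) ?? { hits: 0, total: 0 };
    b.total += 1;
    if (e.frustrated) b.hits += 1;
    buckets.set(key, b);
  }
  return [...buckets]
    .map(([k, b]): [string, number] => [k, b.hits / b.total])
    .sort((a, b) => b[1] - a[1]); // e.g. one Postgres provider on top
}
```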

Um, yeah. >> Do you have a Python SDK? >> I think we might have a very basic SDK for it. Our Python side of support is fairly weak right now, but we have fairly good AI SDK support built in. The AI SDK integration even has self-diagnostics built into it.

So we inject the tool for you, so you don't have to do anything. But it is going to get better. We actually shipped, like, ten different SDK releases in the past month. So we have a person working on SDKs actively.

So it's going to improve. >> I have a question. I have around 10 people building my agent platform, and we constantly change things all the time. We have a lot of feature flags, and some of them are experiments, and the rate of change is so big, like every day everything changes. I just cannot compare the traces, the sessions of users, because I don't have enough time to do it, you know. >> Yeah. >> I need to have, like, a base system, and then run a few days on parts of the users with one feature flag enabled, so I can actually compare the data and get some insight out of it, and I just don't have enough time to do it, you know. >> Yeah. >> So...

How are you doing that right now? Are you just kind of... >> It's like wild west, you know. >> Yeah. >> That's why I said I'm mostly just using it and asking it questions and trying to figure out the insights, but it's not really... >> Yeah, I get what you're saying. So, a few things there. The first is you can use experiments if you want to keep that shipping speed.

You don't have to run long multi-day experiments. You can ship something, and if you have a sufficient sample size, you can see pretty quickly if there are any regressions or not. Maybe it's, like, 1% or 2% different; that's enough for you to be like, okay, it's fine. It's not breaking anything drastic.

The other thing is kind of what Danny was talking about: we have an agent which is basically exposing all of these signals to Claude to make decisions on whether things are better or not. And we're also thinking about how we close this loop. Maybe you have a really good set of signals, and then you have essentially an agent that can look at all these signals and find issues based on what's changing, etc. Then it can create a PR based on that, run some new experiments based on these new PRs, and this can become an infinitely self-improving loop.

Which is very interesting, but that's one thing that I think about. I don't know if you have anything additional... >> Because you need to deploy to production and wait for some data. >> Yeah. Depends on how much data you have. But yeah, it really depends on how big these sample groups are as well, etc.

Sometimes in a few minutes you can tell, but sometimes you want to wait longer. >> Yeah. >> Does your platform help with enabling experiments on sessions? So you can maybe automatically enable experiments, and you take care of the logic so that every session has only one experiment, so we can easily compare it to the base, or something like that. >> We're working on stuff like that actively. But yeah, we do: you can ingest all your historical data, and then when you create a signal, we actually run a quick backfill of the past couple of days. >> So... >> Yes.

So, that's definitely supported. >> We do have a free trial. Right now it's two weeks; we're probably going to make that longer soon. But if you just DM me... well, yeah. So, we are hiring.

That is a thing that we're very excited about: trying to massively increase the size of the team. And if you message me on Twitter, or you can email me as well, that's something where I can just set you up with a longer free trial. >> Yeah. >> So, from my understanding, you're creating signals with your own models, whatever you're using, to make our lives easier identifying signals and harmful intents and behaviors. But maybe you have an example of how you have that full stack of work? >> So if you're sending all the telemetry data, we can find any exceptions in the traces, tool errors, etc.

And that's a thing that you can also track within Raindrop, and that is an explicit signal. So there's implicit and explicit signals; that becomes an explicit signal. >> I think most of the observability platforms will give you the agent trace, the token usage, whether the tool call failed or not.

But I think where we shine is the fuzzy part, the fuzzy failures, right, where the user is frustrated, which I think matters more than the explicit signal that you get from Sentry. I mean, obviously those are also important, but we focus a bit more on the fuzzier side of the failure space. But at the same time, we also have a trace view. We also have a very interesting feature called trajectories, which visualizes, say you want to find a trace which has three different tool call failures.

So you can actually... okay, let me just get in. So you can describe the type of trace that you want to look at. Let's see, hope we have data, but you can more or less describe the type of trajectories that you want to see, instead of just configuring it. So we do both, in a way. You can obviously set up a tools-are-failing alert as well.

So you can just search for any trajectory. So yeah, you can see how the tools are being called, in what order. You can see which ones have errors. You click into them.

You can see the input and the output to this specific tool like what actually screwed up here. Um and you can see okay it's interesting that this has like this the no one lets you this is pretty much the only place where you can visualize tools like this. Um, but you can see here like you can get a shape, an understanding of the topology of what's going on here. And you can see when there's other ones that look similar, you can sort of see, okay, this kind of looks similar to this and then that gives you a sense.

You can do search on this. Again, we have an agent that can look through these and give you a sense of what's going wrong. And so it just makes it really easy to find issues in agents. Um, yeah.

Cool. Anything else? >> Any other questions? >> Can you export the data, the trajectories data? >> Yeah. >> What would you want to export? Just the raw trace logs, or what? >> So I think what our customers usually do is that they already have an OTel stream, right? So we just end up being another target, I guess. But at the same time, they do want us to export the signals that we label. >> So we do support BigQuery and Snowflake.

So we do export the event and then the signals that were classified for that event. >> Last questions. >> Let's do a longer time frame, so just go over the last month. You can see stuff like refusals, and then again, if you click into any of these, you can get a sense over time: task failure, jailbreaking, what specifically is going on. And then you have your self-diagnostics ones as well, capability gap, etc. Cool. >> Do you have open data on the number of issues that you see with your clients? It seems extremely valuable when you have those agents at a very big scale, and I can imagine that you have a lot of data. Do you have some? >> Yeah, is your question what's the smallest scale where it's useful, or what's the... >> The volume of all the data that you're receiving, processing, and generating signals around. Like, how many jailbreaks do you see across all of it? >> Oh, do we have any sort of aggregate? >> We should do something like that. That would actually... >> That would be very interesting.

We don't have anything like that. >> There are mixed opinions about that. I think it has been done in a way, right? But people generally have a negative reaction to it. Maybe it's different here.

But at the same time, whether our customers want us to do that is a different question as well, right? So, yeah, we would love to, but I think there are compliance reasons where we can't actually put a customer's data out there. >> Cool. Anything else? All right.

Thank you everyone.
