I have a search/alert, based on _internal metrics, that runs once an hour and alerts me when certain indexes have more than the usual amount of event data. Then I have this search, which I run for the previous hour and which shows me where the spike occurred:
index=_internal source=*metrics* group="per_index_thruput" series="winevent_index"
| rename series as index
| eval MB=round(kb/1024,3)
| where MB > 1
| stats sum(MB) as MB by index date_hour date_minute
| sort date_hour, date_minute
| addtotals col=true row=false MB
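For reference, here is a minimal sketch of the kind of hourly alert I mean, against the same metrics source; the 500 MB threshold and the one-hour window are illustrative placeholders, not my actual trigger condition:

index=_internal source=*metrics* group="per_index_thruput" earliest=-1h@h latest=@h
| eval MB=round(kb/1024,3)
| stats sum(MB) as MB by series
| where MB > 500

The idea is simply that the alert fires when any index's total thruput for the previous hour crosses the threshold.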
Alternatively, I might run this search:
index=_internal source=*metrics* group="per_index_thruput" series="winevent_index"
| eval MB=round(kb/1024,3)
| bucket _time span=1m
| stats sum(MB) as MB by _time
| eval mtime=strftime(_time, "%Y-%m-%d %H:%M")
| table mtime MB
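For what it's worth, an equivalent per-minute view can be had with timechart (same filters as above assumed):

index=_internal source=*metrics* group="per_index_thruput" series="winevent_index"
| eval MB=round(kb/1024,3)
| timechart span=1m sum(MB) as MB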
Either way, this lets me see when the spike occurred and how many minutes it lasted. Then I run a search against winevent_index itself for the time frame where the metrics showed the spike, padded a bit on either side. There, what I am looking for is a spike in events per minute, which I can slice and dice by host or whatever, as in the sketch below. This has worked well for me in identifying where the unusual log data is coming from and what sort of events were involved.
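A sketch of that drill-down, with the time window as a placeholder for the spike period plus padding:

index=winevent_index earliest=-2h@h latest=-1h@h
| bucket _time span=1m
| stats count as events by _time host
| sort - events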
Recently, however, I saw a spike in the metrics for winevent_index that I could not correlate with any spike in events per minute in the actual index. This has me puzzled. After some reflection, I began to wonder whether the metrics include events that were dropped into the nullQueue via a transform.
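If the metrics do count data that a transform later routes to nullQueue, then a per-sourcetype view of the same metrics might at least show which sourcetype drove the spike even though those events never reached the index. A sketch (per_sourcetype_thruput is the analogous metrics group; there, the series values are sourcetypes rather than indexes):

index=_internal source=*metrics* group="per_sourcetype_thruput"
| eval MB=round(kb/1024,3)
| timechart span=1m sum(MB) by series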
MY QUESTION: Is the data for these metrics captured post-index? Does it reflect what was actually indexed, or what was received?
Thanks for any insights on this whole issue!