I am a Splunk newbie, so some things that seem obvious may need further explanation.
What I have:
An advanced medical imaging system of systems that produces a global output log in a specific format (example given later).
I apply a repetitive task to this system. Example: start up until all statuses are reported, issue a shutdown, and repeat. This goes on for days without operator intervention. (There are many other tests I run, but this is the one I am using to try out Splunk.)
What I am trying to do (big picture):
Index/chop up log files based on cycle period. [Cycle test started to cycle test ended]
Index/chop up log files based on cycle. [startup to shutdown would be one cycle]
Index all output messages. [I will get about 5 cycles per hour, with 200-400 timestamped events reported per cycle]
Goal: find out which events are not supposed to happen and investigate them so they can be fixed.
Types of outputs: count how many times each event_identifier occurs in each cycle, to create a baseline/statistical prediction based on event_identifier and event_identifier content. Find errors that indicate something needs to be fixed.
I am not expecting someone to do my job for me; I am looking to be led in the right direction. I am still learning the Splunk data mining lingo.
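To make the baseline idea concrete, here is a minimal Python sketch of the counting step, done outside Splunk. It assumes the log has already been split into cycles and that each cycle is simply a list of event_identifier strings. The function names and the choice of the median as the "expected" count are assumptions made for illustration, not anything the system or Splunk prescribes.

```python
from collections import Counter
from statistics import median

def per_cycle_counts(cycles):
    """Count how often each event_identifier occurs in each cycle."""
    return [Counter(ids) for ids in cycles]

def baseline(counts):
    """Median occurrence count of every identifier across all cycles."""
    all_ids = set().union(*counts)
    return {i: median(c.get(i, 0) for c in counts) for i in all_ids}

def deviations(counts, base):
    """List (cycle_index, identifier, seen, expected) where a cycle
    differs from the baseline count."""
    report = []
    for n, c in enumerate(counts):
        for ident, expected in base.items():
            seen = c.get(ident, 0)
            if seen != expected:
                report.append((n, ident, seen, expected))
    return report
```

An identifier that shows up in only one cycle, or more often than usual, would appear in the deviations list and be a candidate for investigation.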
What I am currently doing:
I am using the source log file for the cycle period
[This is what I cannot figure out] For "cycle", I want a new cycle to start every time the log outputs an event with the message "System shutdown started from desktop button".
Each event is delimited as in the example message below, an event being everything from the start message to the end message.
My props.conf (users\admin\search\local\props.conf) is as follows:
[Test 1]
EXTRACT-event_source = (?im)^\t(?P<event_source>[^\t]+)
EXTRACT-event_identifier = (?im)^(?:[^\t\n]*\t){4}(?P<event_identifier>[^\t]+)
EXTRACT-event_location = (?im)^\t\w+\t(?P<event_location>.+)
EXTRACT-event_start_ID = (?im)^(?P<event_start_id>.+)
My props.conf (\etc\system\local\props.conf) is as follows:
[Test 1]
BREAK_ONLY_BEFORE = SR \d\d\d
MUST_BREAK_AFTER = EN \d\d\d
NO_BINARY_CHECK = 1
SHOULD_LINEMERGE = true
pulldown_type = 1
I am testing this on my own time and hope to eventually present it to my supervisor to try and implement it as a common tool within our engineering department, especially when trying to prove system reliability.
Example Message:
SR 145
1371027603 1 1 Wed Jun 12 09:00:03 2013 200002348 4
bay90ct cupMonitor
ssProcStop.c 1509
The System Software has terminated.
EN 145
SR ### (event_start_ID) <--start message
1371027603(unique ID for specific time) 1(ignore) 1(ignore) Wed Jun 12 09:00:03 2013(tstamp) 200002348(event_identifier)
bay90ct(event_source) cupMonitor(Process)
ssProcStop.c(event_location) 1509(line in source)
The System Software has terminated.(message, can be multi-lined)
EN ### <---end message
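The field layout annotated above can be checked with a small Python sketch that parses one event block. It is only an illustration under stated assumptions: fields are separated by whitespace or tabs, the message starts on the fifth line, and the last number on the header line (not described above) is ignored. The names parse_event and HEADER are made up for this example.

```python
import re

# Header line: unique time ID, two ignored fields, timestamp,
# event_identifier, and one trailing field that is not described above.
HEADER = re.compile(
    r"^(?P<epoch>\d+)\s+\d+\s+\d+\s+"
    r"(?P<tstamp>\w{3} \w{3} +\d+ \d\d:\d\d:\d\d \d{4})\s+"
    r"(?P<event_identifier>\d+)\s+\d+$"
)

def parse_event(block):
    """Split one SR ... EN block into named fields."""
    lines = [ln.strip() for ln in block.strip().splitlines()]
    if len(lines) < 5:
        raise ValueError("event block is too short")
    start = re.match(r"^SR (\d+)$", lines[0])
    end = re.match(r"^EN (\d+)$", lines[-1])
    header = HEADER.match(lines[1])
    if not (start and end and header):
        raise ValueError("block does not look like an SR/EN event")
    source, process = lines[2].split(None, 1)
    location, line_no = lines[3].rsplit(None, 1)
    return {
        "event_start_id": start.group(1),
        "epoch": header.group("epoch"),
        "tstamp": header.group("tstamp"),
        "event_identifier": header.group("event_identifier"),
        "event_source": source,
        "process": process,
        "event_location": location,
        "line": line_no,
        # The message may span several lines; keep them together.
        "message": "\n".join(lines[4:-1]),
    }
```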
Each cycle will be differentiated by an event message that begins the next cycle immediately afterwards. (Two different event_identifiers output this message.)
example:
SR 153
1371086430 0 1 Thu Jun 13 01:20:30 2013 200002387 4
bay92ct cupMonitor
ssProcStart.c 906
System shutdown started from desktop button
EN 153
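The cycle grouping being asked for can be sketched in Python outside Splunk. This is an illustration only: it assumes every event starts on a line beginning with "SR" plus a number and ends with a line containing only "EN" plus a number, and that any event whose text contains the shutdown message opens a new cycle. The function names are made up for this example.

```python
import re

CYCLE_MARKER = "System shutdown started from desktop button"

def split_events(log_text):
    """Return every SR ... EN block in the log, in order of appearance."""
    pattern = re.compile(r"^SR \d+\b.*?^EN \d+$", re.M | re.S)
    return pattern.findall(log_text)

def group_into_cycles(events):
    """Group events into cycles; an event containing the marker
    message becomes the first event of a new cycle."""
    cycles = [[]]
    for event in events:
        if CYCLE_MARKER in event:
            cycles.append([])
        cycles[-1].append(event)
    # Drop the empty leading group if the log starts with a marker event.
    return [c for c in cycles if c]
```

Events that appear before the first marker end up in a leading group of their own, which is a partial cycle and probably should be left out of any statistics.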
Example cycle test period
SR 261
1370995620 0 1 Wed Jun 12 00:07:00 2013 0 7
bay90ct Svc_Notepad
Notepad.c 44
This message was added by the OPERATOR to report on a problem: RstHast Enabled - start command: startrsthast -shutdown . Type stoprsthast in unix shell to disable
EN 261
/////PLACE A BUNCH OF Cycles with messages HERE
SR 179
1371027942 0 1 Wed Jun 12 09:05:42 2013 0 7
bay90ct Svc_Notepad
Notepad.c 44
This message was added by the OPERATOR to report on a problem: Rsthast Disabled
EN 179