Optimizing dashboard performance, looking for the best design

Hi,

While finalizing a Splunk application for my company, I am looking for the best way to optimize dashboard performance.

My application processes raw monitoring data collected by various Nagios collectors (networking and security components) to produce complex reports with charts; this can represent a very large number of event lines for Splunk to analyse.

Here are some examples of the searches I use to build my reporting dashboards (I use lookups against CSV files to define various fields and values):

Example 1: a simple aggregation of stats

index="xxx_index" sourcetype="xxx_source" technical_zone="INTERNET" leveltechnical_zone="N1" monitor="CONNEXIONS" monitor_label="connexions" | dedup _time hour hostname monitor monitor_label value | bucket _time span=5m | stats sum(value) As value by _time | timechart span=5m eval(round(mean(value),0)) As Datacenter_Average_Session eval(round(max(value),0)) As Datacenter_Max_Session

Example 2: a more standard representation by host

index="xxx_index" sourcetype="xxx_source" technical_zone="INTERNET" leveltechnical_zone="N1" monitor="CPU" monitor_label="cpu" | dedup _time hour hostname monitor monitor_label value | bucket _time span=5m | timechart span=5m eval(round(mean(value),2)) As Average_CPU eval(round(max(value),2)) As Max_CPU by hostname

Example 3: a more complex one handling network counter-type data with multiple series:

index="xxx_index" sourcetype="xxx_source" technical_zone="INTERNET" leveltechnical_zone="N1" functionnal_zone="XXXXX" traffic_sense="IN" | dedup _time hour hostname monitor monitor_label value | streamstats current=f global=f window=1 first(value) as next_value, first(_time) as next_time by monitor_label, hostname | eval dt=next_time-_time | eval deltavalue=next_value-value | eval realvalue=deltavalue/dt | where realvalue>=0 | eval realvalue=round(realvalue,2) | eval value=realvalue | eval value=value*8/1000000 | bucket _time span=5m | stats max(value) As ValMax by _time,monitor_label,hostname | eval ValMax=round(ValMax,2) | eval s1="max" | makemv s1 | mvexpand s1 | eval yval=case(s1=="max",ValMax) | eval series=hostname+":"+monitor_label+":"+s1 | xyseries _time,series,yval | makecontinuous _time

About span / bin:

As I was not fully satisfied with automatic span charting (and I need charts to be as granular as possible), after testing various approaches I found that the best solution was JavaScript (via a Sideview custom behavior) that sets the span value depending on the selected time range (the span value is passed downstream). This requires inline searches inside advanced XML views:

Content of application.js (I got the main code from http://pastebin.com/jqDktMhC):

//Assign CustomBehavior triggers
if(typeof(Sideview)!="undefined"){
        $(document).bind("allModulesInHierarchy",function(){

                Sideview.utils.forEachModuleWithCustomBehavior("GatherBins",function(b,a){

//isReadyForContextPush -- don't push to the next modules, since the bins aren't assigned yet.
                        a.isReadyForContextPush = function(){
                                if(!this.RetrievedBinCount) return Splunk.Module.DEFER;
                                if (this.getLoadState() < Splunk.util.moduleLoadStates.HAS_CONTEXT) return false;
                                return true;
                        }

//onJobProgress -- Actually figure out the number of bins.
                        a.onJobProgress = function() {
                                var c=this.getContext();
                                //This will be the upstream * | head 1 search job, which will give us absolute values for the TimeRangePicker          
                                var d=c.get("search").job;

                                var Bins = 0;
                                var Binsize = "";
                                var Span = "";
                                var Showspan = "";
                                var latest = new Date(d._latestTime);
                                var earliest = new Date(d._earliestTime);
                                //Handle latestTime = 0 (Not sure how often this should happen -- came up when I was testing)
                                if(latest.valueOf() == 0){
                                        latest = new Date();
                                }

                                //Calculate difference in seconds
                                var Difference = (latest.valueOf() - earliest.valueOf()) / 1000;

                                //Figure out how many bins to assign, based on the range. The below is for 10 minute data increments.
                                //If you had only hourly data, and were searching over 10 years, you might need to add an additional layer of summary.

                                if(Difference > (730*24*60*60)){
                                        //alert("More than 730 days -- summarize four days");
                                        Bins = parseInt(Difference / (96*60*60))+2;
                                        Binsize = "Four Day";
                                        Showspan = "4 jours";
                                        Span = "4d";
                                }else if(Difference > (450*24*60*60)){
                                        //alert("More than 450 days -- summarize two days");
                                        Bins = parseInt(Difference / (48*60*60))+2;
                                        Binsize = "Two Day";
                                        Showspan = "2 jours";
                                        Span = "2d";
                                }else if(Difference > (150*24*60*60)){
                                        //alert("More than 150 days -- summarize daily");
                                        Bins = parseInt(Difference / (24*60*60))+2;
                                        Binsize = "One Day";
                                        Showspan = "1 jour";
                                        Span = "1d";
                                }else if(Difference > (100*24*60*60)){
                                        //alert("More than 100 days -- summarize 12 hourly");
                                        Bins = parseInt(Difference / (12*60*60))+2;
                                        Binsize = "12 Hour";
                                        Showspan = "12 heures";
                                        Span = "12h";
                                }else if(Difference > (50*24*60*60)){
                                        //alert("More than 50 days -- summarize 8 hourly");
                                        Bins = parseInt(Difference / (8*60*60))+2;
                                        Binsize = "8 Hour";
                                        Showspan = "8 heures";
                                        Span = "8h";
                                }else if(Difference > (14*24*60*60)){
                                        //alert("More than 14 days -- summarize 4 hourly");
                                        Bins = parseInt(Difference / (4*60*60))+2;
                                        Binsize = "4 Hour";
                                        Showspan = "4 heures";
                                        Span = "4h";
                                }else if(Difference > (6*24*60*60)){
                                        //alert("More than 6 days -- summarize hourly");
                                        Bins = parseInt(Difference / (60*60))+2;
                                        Binsize = "One Hour";
                                        Showspan = "1 heure";
                                        Span = "1h";
                                }else if(Difference > (2*24*60*60)){
                                        //alert("More than 2 days -- summarize half-hourly");
                                        Bins = parseInt(Difference / (30*60))+2;
                                        Binsize = "30 Minute";
                                        Showspan = "30 minutes";
                                        Span = "30m";
                                }else if(Difference > (1*24*60*60)){
                                        //alert("More than 1 day -- summarize 10 minutes");
                                        Bins = parseInt(Difference / (10*60))+2;
                                        Binsize = "10 Minute";
                                        Showspan = "10 minutes";
                                        Span = "10m";       
                                }else{
                                        //alert("Less or equal to 1 day -- summarize to 5 minutes");
                                        Bins = parseInt(Difference / (5*60))+2;
                                        Binsize = "5 Minute";
                                        Showspan = "5 minutes";
                                        Span = "5m";
                                }

                                // Assign to context                           
                                this.Bins = Bins;
                                this.Binsize = Binsize;
                                this.Span = Span;
                                this.Showspan = Showspan;
                                this.RetrievedBinCount = true;

                                //Now that we have everything we need, we're ready to roll on to the next modules.
                                this.pushContextToChildren();

                        }

//getModifiedContext -- expose Bins, Binsize, Span and Showspan as $Bins$, $Binsize$, $Span$ and $Showspan$

                        a.getModifiedContext=function(){
                                var context=this.getContext();
                                context.set("Bins", this.Bins);
                                context.set("Binsize", this.Binsize);
                                context.set("Span", this.Span);
                                context.set("Showspan", this.Showspan);
                                return context;
                        }
                })
        })
}

Here is an example of an XML view (using search example 1):

Note: to understand the view, a home page with a time range picker lets the user choose the time range, which is passed downstream to the view being called.

<view autoCancelInterval="90" isVisible="False" onunloadCancelJobs="True" template="dashboard.html" stylesheet="dashboard_customsize.css" isSticky="False">

<!-- Version = 0.1 / Last update = March 9, 2013 -->

  <label>INTERNET - FW N1</label>

<!-- standard splunk chrome at the top -->
  <module name="AccountBar" layoutPanel="appHeader"/>
  <module name="AppBar" layoutPanel="navigationHeader"/>
  <module name="SideviewUtils" layoutPanel="appHeader" />

  <module name="Message" layoutPanel="messaging">
    <param name="filter">*</param>
    <param name="clearOnJobDispatch">False</param>
    <param name="maxSize">1</param>
  </module>

  <module name="URLLoader" layoutPanel="panel_row1_col1" autoRun="True">

   <module name="HTML" layoutPanel="panel_row1_col1">
      <param name="html"><![CDATA[

       <p></p>
       <h1>Capacity Planning: $title$</h1>

      ]]></param>
    </module>

<!-- Global TimeRangePicker --> 
    <module name="TimeRangePicker" layoutPanel="splSearchControls-inline">
        <param name="searchWhenChanged">True</param>

  <module name="Search" autoRun="True">
            <param name="search">* | head 1</param>

     <module name="CustomBehavior">
    <param name="customBehavior">GatherBins</param>
     <param name="requiresDispatch">True</param>

    <module name="HTML" layoutPanel="panel_row1_col1">
      <param name="html"><![CDATA[

       <p></p>
       <h2>Laps de temps d'analyse: $Showspan$</h2>

      ]]></param>
    </module>

    <module name="SearchControls" layoutPanel="panel_row1_col1">
        <param name="sections">print</param>
    </module>

<!-- ########################################       BEGIN OF SECTIONS           ######################################## -->

<!-- ########################################       FIREWALL N1         ######################################## -->

<!-- #####################    SESSIONS    ##################### -->

<!-- Using custom size -->

        <module name="Search" layoutPanel="panel_row2_col1" autoRun="True">
            <param name="search">index="xxx_index" sourcetype="xxx_source" technical_zone="INTERNET" leveltechnical_zone="N1" monitor="CONNEXIONS" monitor_label="connexions" | dedup _time hour hostname monitor monitor_label value | bucket _time span=5m | stats sum(value) As value by _time | timechart span=$Span$ eval(round(mean(value),0)) As Datacenter_Average_Session eval(round(max(value),0)) As Datacenter_Max_Session
            </param>

            <module name="HTML" layoutPanel="panel_row2_col1">
                <param name="html"><![CDATA[
                <h3>Vision Datacenter - Pics (Valeur Max) du nombre de sessions simultanées</h3>
                ]]></param>
            </module>

            <module name="HiddenFieldPicker">
            <param name="strictMode">True</param>
            <module name="JobProgressIndicator">
              <module name="EnablePreview">
                <param name="display">False</param>
                <param name="enable">True</param>
                <module name="HiddenChartFormatter">
                  <param name="charting.legend.placement">bottom</param>
                  <param name="charting.chart.nullValueMode">connect</param>
                  <param name="charting.chart">line</param>line
                  <param name="charting.axisTitleX.text">Periode</param>
                  <param name="charting.axisTitleY.text">Sessions</param>
                  <module name="JSChart">
                    <param name="width">100%</param>
            <param name="height">300px</param>
                    <module name="ConvertToDrilldownSearch">
                      <module name="ViewRedirector">
                        <param name="viewTarget">flashtimeline</param>
                      </module>
                    </module>
                  </module>
                  <module name="ViewRedirectorLink">
                    <param name="viewTarget">flashtimeline</param>
                  </module>
                </module>
              </module>
            </module>
          </module>
        </module>

<!-- ########################################       END OF SECTIONS         ######################################## -->

        </module> <!-- CustomBehavior -->
      </module> <!-- Search -->

    </module> <!-- TimeRangePicker -->

</module> <!-- URLLoader -->

</view>

This works very well and perfectly meets my needs regarding chart granularity.

Now I am looking for the best approach to optimize dashboard performance and reduce the number of jobs and their CPU cost.

1. Scheduled saved searches

This was my first approach: defining specific time ranges to be scheduled (for example, All time and Last 30 days).

Whenever the user selects one of the defined scheduled time ranges, a specific version of the view is called and executed (that view references the corresponding saved searches).

Any other time range selected by the user calls a "time range" version of the view that uses inline (ad hoc) searches.

Advantages:
- Works well; the dashboard loads very quickly when it reuses previously executed jobs
- Keeps CPU as free as possible for other users

Constraints:
- Several XML file versions to maintain for the same dashboard
- Many saved searches, which become hard to maintain and implement as the number of dashboards grows

This is definitely too complex and too limited an approach; it is very hard to keep clean as time goes by and dashboards are added, so it is not satisfying.
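
To illustrate the idea, a per-timerange variant could look roughly like the sketch below. The saved search name and the 30-day window are placeholders, and the view side is sketched with the core HiddenSavedSearch module (with useHistory so the scheduled job is reused) rather than my exact XML:

[INTERNET_FW_N1_sessions_last30d]
enableSched = 1
cron_schedule = 30 2 * * *
dispatch.earliest_time = -30d@d
dispatch.latest_time = @d
search = index="xxx_index" sourcetype="xxx_source" technical_zone="INTERNET" leveltechnical_zone="N1" monitor="CONNEXIONS" monitor_label="connexions" | dedup _time hour hostname monitor monitor_label value | bucket _time span=5m | stats sum(value) As value by _time | timechart span=30m eval(round(mean(value),0)) As Datacenter_Average_Session eval(round(max(value),0)) As Datacenter_Max_Session

In the "Last 30 days" copy of the view, the inline Search module is then replaced by something like:

<module name="HiddenSavedSearch" layoutPanel="panel_row2_col1" autoRun="True">
  <param name="savedSearch">INTERNET_FW_N1_sessions_last30d</param>
  <param name="useHistory">True</param>
  <!-- same HTML / chart formatting child modules as in the inline version -->
</module>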

2. Summary indexing

As far as I understand how Splunk works, summary indexing is one of the logical ways to achieve this optimization.

Unfortunately, all my configuration tests seem to show worse performance than using the normal index and searches... (perhaps my fault!)

I have tried with and without the "si" commands, with almost the same results.

I defined a scheduled saved search to generate data into a dedicated summary index, let's call it "xxx_summary".

All data is collected each night around 2 AM (my dashboards report on day -1), so I don't need to run the saved searches very often to populate the summary index.

For search example 1, I used the following scheduled search to populate the summary index with the lowest span value I need (5 minutes):

[INTERNET_FW_N1_sessions_sum_XXX]
action.email.inline = 1
alert.digest_mode = True
alert.suppress = 0
alert.track = 1
cron_schedule = */55 * * * * 
description = INTERNET_FW_N1_sessions_sum_XXX
dispatch.earliest_time = -1d@d
dispatch.latest_time = now
enableSched = 1
realtime_schedule = 0
auto_summarize = 0
auto_summarize.dispatch.earliest_time = 0
action.summary_index = 1
action.summary_index._name = xxx_summary
action.summary_index.report = INTERNET_FW_N1_sessions_sum_XXX
search = index="xxx_index" sourcetype="xxx_source" technical_zone="INTERNET" leveltechnical_zone="N1" monitor="CONNEXIONS" monitor_label="connexions" | dedup _time hour hostname monitor monitor_label value | bucket _time span=5m | stats sum(value) As value by _time | timechart span=5m eval(round(mean(value),0)) As Datacenter_Average_Session eval(round(max(value),0)) As Datacenter_Max_Session

Then, in my view, I use the following inline search:

<param name="search">
index="xxx_summary" report="INTERNET_FW_N1_sessions_sum_XXX" | timechart span=$Span$ eval(round(mean(value),0)) As Datacenter_Average_Session eval(round(max(value),0)) As Datacenter_Max_Session by hostname
</param>

I used the Python script to backfill previous periods, for example:

./splunk cmd python fill_summary_index.py -app My_Application -name "INTERNET_FW_N1_sessions_sum_XXX" -et -7d@d -lt @d -j 8

When the search is called, everything works fine and I get my chart as expected.

But performance is strangely worse than with the raw-data index, even though the raw index contains millions of events and the summary only contains one backfilled report covering a few days!

Performance test with search example 1:

Using the normal index and the normal search, I get this execution time:

This search has completed and has returned 276 results by scanning 1,654 events in 0.838 seconds

Using the summary-indexed search (populated with the normal, non-si commands), I get:

This search has completed and has returned 276 results by scanning 19,021 events in 4.027 seconds.

I get almost the same kind of performance using the "si" commands.
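
For reference, an "si" variant of the same pair of searches would look roughly like this (just a sketch; with the si commands, the rounding and renaming move to the dashboard-side timechart):

Scheduled search:

index="xxx_index" sourcetype="xxx_source" technical_zone="INTERNET" leveltechnical_zone="N1" monitor="CONNEXIONS" monitor_label="connexions" | dedup _time hour hostname monitor monitor_label value | bucket _time span=5m | stats sum(value) As value by _time | sitimechart span=5m mean(value) max(value)

Dashboard search:

index="xxx_summary" report="INTERNET_FW_N1_sessions_sum_XXX" | timechart span=$Span$ mean(value) As Datacenter_Average_Session max(value) As Datacenter_Max_Session | eval Datacenter_Average_Session=round(Datacenter_Average_Session,0) | eval Datacenter_Max_Session=round(Datacenter_Max_Session,0)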

What am I missing? I don't really understand why Splunk has to scan so many more events when using the summary index search, nor why the request takes so long to execute.

In this performance test, my index contains 22.82 million events while my summary index only contains 0.06 million events; shouldn't we expect better results with the summary?

It seems Splunk is scanning all events in the summary before producing the expected result; isn't the "report" filter enough to prevent this?
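
As a sanity check (a trivial sketch), I can verify what the summary actually contains and whether the report field is populated as expected with something like:

index="xxx_summary" | stats count by source, sourcetype, report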

Thank you in advance for any help you can provide.

