Hello,
I'm having performance and result-limit issues when trying to merge large data volumes from different sources that share common key (foreign key) fields. The scenario is as follows:
- We have 3 database inputs, incrementally indexed daily with the dbx app. These data sources are also updated once per day (during the night).
- These three sources (sourcetypes A, B and C) are composed of events with common fields (foreign keys): A shares one with B, and B shares one with C. Those would be, for example, the fields AtoB and BtoC.
- The total volume of events is close to 4M. A and B have about 700k events in common to be merged, and B and C have about 800k. About 600k events are common to both A+B and B+C.
- The goal is to have all events with common fields grouped/joined/merged together so that all their fields are accessible, while also retaining access to the events without common fields. That is, one source that contains all the unrelated events, the A+B-only events grouped, the B+C-only events grouped, and the A+B+C common events grouped. Like an "outer join".
- We can use the night hours each day to run heavy searches or batch processes and have the results ready to use in the morning.
- The use case is a specific search panel that returns all the results from the A, B and C sources, correlated together when applicable. All the data must be searchable by specific field values, and good search performance is required.
Given this scenario, I have several approaches in mind:
A - Use subsearch-related commands (join, append, etc.) to build a pre-executed, fully searchable A+B+C dataset. This is impractical at search time due to performance, and also in nightly schedules due to the large volume of data to be joined or appended.
B - Use subsearches at search time (the searches look in large datasets, but the results are a small number of events...), with something like this to correlate A+B, for example:
search sourcetype=A <A-arguments> [search sourcetype=B <B-arguments> | fields AtoB ] | join type=outer AtoB [search sourcetype=B <B-arguments> | fields AtoB <interesting-B-fields>]
This would not require heavy nightly searches, but it is a bit slow and makes the panel search logic very complex, as we would need to distinguish between arguments for A, B and C, whereas we would like to treat them as generic parameters.
C - Use transaction or stats in a heavy scheduled nightly search, grouping all sourcetypes by common fields. The search would be similar to
index=myindex (sourcetype=A OR sourcetype=B OR sourcetype=C) | transaction AtoB BtoC keepevicted=true
The job results would be ready in the morning for direct use in the search panels. This is my current approach, but I'm afraid it would run into search limits as well.
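For the stats variant of C, I imagine something like the following for the A+B part. This is an untested sketch: as far as I understand, stats with a "by" clause drops events that have no AtoB value, so the unrelated events would have to be handled separately, and the B+C side would need a similar search on BtoC.
index=myindex (sourcetype=A OR sourcetype=B) | stats values(*) as * by AtoB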
D - Summary indexes do not seem useful here, as we are not calculating aggregated stats or summarized values. But I'm not familiar with summary indexes, so correct me if I'm wrong.
E - Export the results to CSV nightly and append them via lookups, or use a similar lookup-based approach...
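For E, I picture the nightly job ending in outputlookup and the panel reading the file back with inputlookup, roughly as follows. The file name abc_merged.csv and the <nightly-merge-search> and <panel-arguments> placeholders are only illustrative, and I haven't tested this at our volume.
<nightly-merge-search> | outputlookup abc_merged.csv
| inputlookup abc_merged.csv | search <panel-arguments>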
Any help in finding the best option would be greatly appreciated!
Thanks,