The latest edition of the Communications of the ACM had an article covering CEP from the BI perspective, written by researchers from Microsoft and HP. On Complex Event Processing they write:
The competitive pressure of today’s businesses has led to the increased need for near real-time BI. The goal of near real-time BI (also called operational BI or just-in-time BI) is to reduce the latency between when operational data is acquired and when analysis over that data is possible. … A class of systems that enables such real-time BI is Complex Event Processing (CEP) engines…
They then quickly make the mistake of assuming CEP = Event Stream Processing (ESP)…
Applications define declarative queries that can contain operations over streaming data such as filtering, windowing, aggregations, unions, and joins. The arrival of events in the input stream(s) triggers processing of the query. These are referred to as “standing” or “continuous” queries …
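For the uninitiated, here is roughly what such a “standing” query looks like, as a minimal Python sketch (the trade stream, symbol and field layout are invented for illustration, not taken from the paper) combining a filter, a sliding window and an aggregation:

```python
from collections import deque
from datetime import datetime, timedelta

def rolling_average(events, window=timedelta(minutes=5), symbol="ACME"):
    """A 'standing' query: filter + sliding window + aggregation.

    It runs for as long as the input stream yields events, emitting an
    updated rolling average every time a matching event arrives.
    """
    buf = deque()                                    # events inside the window
    for ts, sym, price in events:
        if sym != symbol:                            # filtering
            continue
        buf.append((ts, price))
        while ts - buf[0][0] > window:               # windowing: evict old events
            buf.popleft()
        yield ts, sum(p for _, p in buf) / len(buf)  # aggregation

# A toy, finite stream; a real engine would consume a live feed "forever".
t0 = datetime(2011, 1, 1, 9, 30)
stream = [(t0 + timedelta(minutes=i), "ACME", 100.0 + i) for i in range(10)]
for ts, avg in rolling_average(stream):
    print(ts.time(), round(avg, 2))
```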
Of course, rule-based CEP, using constructs like Event-Condition-Action (ECA) rules, and indeed other languages and algorithms, can be used to define aggregate events to support real-time BI tasks. Oops.
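To make that concrete, here is a toy Event-Condition-Action rule in Python. Everything here (the event shape, the threshold, the rule registry) is invented for illustration; a real rule-based CEP engine gives you a declarative rule language rather than hand-written callbacks:

```python
rules = []

def rule(event_type, condition):
    """Register an ECA rule: on Event of event_type, if Condition holds, fire Action."""
    def register(action):
        rules.append((event_type, condition, action))
        return action
    return register

@rule("order", condition=lambda e: e["amount"] > 10_000)
def flag_large_order(e):                     # the Action part of the rule
    print(f"ALERT: large order {e['id']} for {e['amount']}")

def dispatch(event):
    """On each incoming Event, test each rule's Condition and run its Action."""
    for event_type, condition, action in rules:
        if event["type"] == event_type and condition(event):
            action(event)

dispatch({"type": "order", "id": 42, "amount": 25_000})   # fires the alert
```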
There are several open technical problems in CEP; we touch upon a few of them here.
This should be interesting…
One important challenge is to handle continuous queries that reference data in the database (for example, the query references a table of customers stored in the database) without affecting near real-time requirements.
I’m pretty sure most stream processing engines provide some ability to do this (if you want to). But normally the last thing you want to do in a (near) real-time business application is waste time querying an RDBMS. Common practice is to load the data you need into memory, or into a data grid / cache, up front, where access times are much lower.
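By way of illustration, here is the shape of that pattern in Python, using an in-memory SQLite table as a stand-in for the operational database (the table and field names are mine, purely for the example): the reference data is pulled once in bulk, and per-event enrichment then touches only the in-memory dictionary.

```python
import sqlite3

# Stand-in for the operational RDBMS; names invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, tier TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'ACME Corp', 'gold')")

def load_customer_cache(conn):
    """One bulk query at startup (or on a refresh schedule),
    instead of one RDBMS round trip per event."""
    return {cid: {"name": name, "tier": tier}
            for cid, name, tier in conn.execute(
                "SELECT id, name, tier FROM customers")}

customers = load_customer_cache(conn)

def enrich(event):
    """Per-event enrichment hits only the in-memory dict, never the database."""
    event["customer"] = customers.get(event["customer_id"])
    return event

print(enrich({"customer_id": 1, "amount": 250.0}))
```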
The problem of optimizing query plans over streaming data has several open challenges. In principle, the benefit of an improved execution plan for the query is unlimited since the query executes “forever.” This opens up the possibility of more thorough optimization than is feasible in a traditional DBMS. Moreover, the ability to observe execution of operators in the execution plan over an extended period of time can be potentially valuable in identifying suboptimal plans.
Again, I’m pretty sure all the main CEP technology providers have optimised CEP engines – for example, TIBCO uses a high-performance version of the Rete algorithm. No doubt Microsoft’s stream processing engine has optimisations too!
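Vendor specifics aside, the intuition in the quoted paragraph is easy to show. Below is a deliberately simplified Python sketch (no real optimiser works like this; the operators and selectivity are invented) of the same logical standing query under two physical plans. Because the query runs indefinitely, the per-event saving from pushing the cheap, selective filter below the expensive operator is banked on every event, forever:

```python
import time

def expensive_enrich(e):
    """Stand-in for a costly operator, e.g. a join against reference data."""
    time.sleep(0.001)                        # simulate the per-event cost
    return {**e, "enriched": True}

def is_emea(e):
    return e["region"] == "EMEA"             # cheap, selective predicate

# Plan A: enrich every event, then filter -- pays the cost for all events.
def plan_a(stream):
    return (e for e in map(expensive_enrich, stream) if is_emea(e))

# Plan B: filter pushed below the enrichment -- pays only for the matches.
def plan_b(stream):
    return map(expensive_enrich, filter(is_emea, stream))

events = [{"region": r} for r in ["EMEA", "APAC", "AMER", "EMEA"]]
print(list(plan_b(events)))                  # enriches 2 events, not 4
```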
Finally, the increasing importance of real-time analytics implies that many traditional data mining techniques may need to be revisited in the context of streaming data. For example, algorithms that require multiple passes over the data are no longer feasible for streaming data.
This is actually an issue for the analytics guys: continuous analytics do indeed imply computing statistical models event-at-a-time, rather than “all at once against some vast data store”. The good news is that the analytics world has been doing some of this for some time, in order to accommodate complex processing of large data sets by batching and recombining data as it is processed.
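The textbook example of that restatement is Welford’s one-pass algorithm for mean and variance: the naive formula needs the mean up front (a first pass) before it can sum the squared deviations (a second pass), whereas Welford’s update folds each event in as it arrives. A minimal Python sketch:

```python
class RunningStats:
    """Welford's one-pass algorithm: mean and variance are updated
    event-at-a-time, so no second pass over the data is ever needed."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0            # running sum of squared deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):          # sample variance
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for x in [4.0, 7.0, 13.0, 16.0]:      # in practice: an unbounded event stream
    stats.update(x)
print(stats.mean, stats.variance)     # -> 10.0 30.0
```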
The BI Overview also mentions several other “BI technology” aspects that are often combined in CEP solutions: in-memory data management, distribution (MapReduce, a.k.a. “divide and conquer”), and analytics. So although it wasn’t emphasised in this paper, the interesting development here seems to be a certain amount of convergence in using these technologies together!