The True Solution to Big Data

Reading Time: 4 minutes

The most widely accepted definition of Big Data comes from Wikipedia:

“Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.”

To put it another way:

Big Data is the term for a collection of data sets so large and complex that it becomes financially impractical to process using on-hand database management tools or traditional data processing applications.

In the world of IT and business, there really isn’t a better synonym for “difficult” than “financially impractical.” Most problems can be solved; the issue is that there are rarely enough resources or time to solve them.

Leave Room to Scale

The definition of Big Data is almost a self-fulfilling prophecy, because Big Data cannot be solved in a financially practical manner. In a world of ever-increasing information, no matter how many Hadoop clusters an organization provisions, there will always be more data than the available resources can process. In a world where 90% of today’s data was generated in the last two years, how could a CIO ever expect to buy enough Hadoop clusters to keep up?

With full respect to Hadoop and other Big Data tools, they are simply point-in-time tools. They crunch the data as it exists at that moment and provide an answer for that moment. If the data has grown or evolved since it was collected, or while it was being processed, then the answer these tools give is already slightly incorrect, and it becomes more inaccurate as time passes and more data is collected. This means that, eventually, the tools will have to be applied to the data set again to update the answer.

Ensuring Your Data Provides Relevant Information

Here is an example that makes this point obvious. If you program your Hadoop cluster to answer this simple question, your answer will never change:

How many customers, in 2013, purchased blue scarves and leather handbags, but never purchased earrings?

While this answer never changes, its value decreases as time passes. In early 2014, this answer may have significant business value, but by 2016 it would be only a mildly curious fact. This gets to a point that TIBCO understands very well: the value of your information decays over time. The entire Two-Second Advantage philosophy is about getting the right information to the right person in time for that person to make a business decision; all of the information in the world delivered after the decision has been made is almost worthless.

To keep this question continually relevant to the business, we need to rephrase it a bit: How many customers, in the last year, purchased blue scarves and leather handbags, but never purchased earrings?

Now that the question has been rephrased, we will get a business-relevant answer every time the question is asked. The next obvious step is to decide how often the business needs this information.
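To make the rolling window concrete, here is a minimal sketch in Python (the purchase records, product names, and dates are all invented for illustration) showing how the rephrased question produces a different answer depending on when it is asked:

```python
from datetime import date, timedelta

# Hypothetical purchase records: (customer_id, product, purchase_date)
purchases = [
    ("c1", "blue scarf", date(2014, 3, 10)),
    ("c1", "leather handbag", date(2014, 5, 2)),
    ("c2", "blue scarf", date(2013, 1, 15)),
    ("c2", "earrings", date(2014, 4, 1)),
]

def rolling_year_answer(records, as_of):
    """Customers who bought blue scarves AND leather handbags in the last
    year, but never bought earrings in that same window."""
    window_start = as_of - timedelta(days=365)
    bought = {}
    for customer, product, day in records:
        if window_start <= day <= as_of:
            bought.setdefault(customer, set()).add(product)
    return {c for c, items in bought.items()
            if {"blue scarf", "leather handbag"} <= items
            and "earrings" not in items}

# Asked in mid-2014, c1 qualifies; asked two years later, nobody does.
print(rolling_year_answer(purchases, date(2014, 6, 1)))  # {'c1'}
print(rolling_year_answer(purchases, date(2016, 6, 1)))  # set()
```

The same function asked on a different day returns a different, still-relevant answer, which is exactly the point of the rephrasing.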

Strategy Behind Information Delivery

If the business needs this answer every four weeks, IT staff can set up their Hadoop cluster to crunch it every month. For the sake of simple arithmetic, let’s assume this ties up the cluster for roughly four hours. With 672 hours in four weeks, the job consumes roughly 0.6% of the cluster’s capacity, so the business needs to get at least that fraction of the cluster’s cost back in value from this one question.

What happens if the business suddenly needs this answer every week? Now the cluster is tied up 2.4% of the time with this same question. Once again, that is fine as long as the business gets more value out of the answer than the answer costs to produce. When the updated answer is required daily, almost 17% of the cost of the cluster must be justified by the business benefit of this one question.
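The arithmetic behind those percentages is easy to check; this sketch simply divides the assumed four-hour job against each scheduling period:

```python
# Fraction of cluster time consumed by one recurring question,
# assuming each run ties up the cluster for four hours (as above).
JOB_HOURS = 4

schedules = {
    "every four weeks": 28 * 24,  # 672 hours
    "weekly": 7 * 24,             # 168 hours
    "daily": 24,
}

for name, period_hours in schedules.items():
    utilization = JOB_HOURS / period_hours
    print(f"{name}: {utilization:.1%} of cluster capacity")
# every four weeks: 0.6% of cluster capacity
# weekly: 2.4% of cluster capacity
# daily: 16.7% of cluster capacity
```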

As the number, complexity, and frequency of questions increase, the IT department once again faces the cost of a “collection of data sets so large and complex that it becomes difficult to process using on-hand Big Data tools.” This means business people will inevitably be told by IT that any given frequently updated answer is too expensive to produce.

Business people rarely like it when IT tells them they cannot have what they need!

If Big Data Tools Are Not the Solution, What Is?

The solution lies in combining Big Data tools with tools that can process information as it flows. A bit of simple math makes the answer easy to see:

current_answer = historic_information_answer + new_information_answer

If we take the answer to the question as calculated one month ago and modify it with the answer for only the latest month, we get the new, current answer. While it may be possible to use Big Data tools for this partial processing, it is likely a waste of resources. A more efficient implementation is an enhanced ESB (what TIBCO sometimes calls an event services bus).
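Here is a minimal sketch of that idea, assuming (purely for illustration) that the monthly Big Data run produces a compact per-customer summary rather than just the final answer, so that only the newest month’s events need processing:

```python
# Sketch of: current_answer = historic_information_answer + new_information_answer
# The historic summary records which of the three products each customer
# has bought; folding in the latest month avoids re-crunching all history.

def fold_in(summary, new_events):
    """Merge the latest month's purchases into the historic summary."""
    for customer, product in new_events:
        summary.setdefault(customer, set()).add(product)
    return summary

def answer(summary):
    """Customers with blue scarves and leather handbags but no earrings."""
    return {c for c, items in summary.items()
            if {"blue scarf", "leather handbag"} <= items
            and "earrings" not in items}

# Hypothetical summary produced by last month's full Hadoop run.
historic = {"c1": {"blue scarf"}, "c2": {"blue scarf", "leather handbag"}}

# Only this month's events need processing.
new_events = [("c1", "leather handbag"), ("c2", "earrings")]

current = answer(fold_in(historic, new_events))
print(current)  # {'c1'} -- c2 dropped out after buying earrings
```

A production version would also have to expire events that age out of the rolling one-year window, but the principle stands: only the delta gets processed.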

By combining a traditional ESB with a complex event processing capability (the resulting event services bus), your Big Data answers become real-time Big Data answers at a fraction of the cost and frustration. Better still, the Big Data tools the company uses never have to touch that question again; they are free to work on new Big Data problems. That is a true repurposing of the investment and a maximizing of its ROI.
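Purely as an illustration (this is a toy handler in the spirit of an event services bus, not TIBCO’s actual API), per-event processing could keep that answer continuously current:

```python
# Each purchase event flowing across the bus updates per-customer state,
# so the answer is always current and the cluster never runs this job again.
summary = {}          # per-customer products, seeded from the last batch run
current_answer = set()

def on_purchase(customer, product):
    items = summary.setdefault(customer, set())
    items.add(product)
    qualifies = ({"blue scarf", "leather handbag"} <= items
                 and "earrings" not in items)
    if qualifies:
        current_answer.add(customer)
    else:
        current_answer.discard(customer)

on_purchase("c3", "blue scarf")
on_purchase("c3", "leather handbag")
print(current_answer)  # {'c3'} -- updated the instant the event arrived
```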

The resulting architecture is far more nimble and less costly to maintain. It also allows the organization to finally solve the true problem of Big Data: big problems that thwart standard analysis tools can be solved, and the solution can be maintained and kept current.