Connecting ActiveSpaces and Hadoop

Reading Time: 2 minutes

TIBCO is pleased to announce that we’re making code available that connects TIBCO ActiveSpaces to Apache Hadoop MapReduce, Apache Pig, and Apache Hive.  Hadoop and ActiveSpaces both support data-intensive distributed applications, and so they’re the perfect match!

Hadoop’s HDFS™ is designed to hold files of petabyte size, but the architectural tradeoff is that you don’t get random access to a file: you have to read the whole thing.  Other Hadoop storage technologies, such as HBase™, offer a degree of random access, but nothing like the in-memory data grid that ActiveSpaces provides.  With ActiveSpaces, you get random access to all of your data without ever having to hit a disk, and with peer-to-peer replication, keeping Big Data in memory is quite feasible.

The code that we’re making available comes in three parts.  The first part integrates ActiveSpaces into the core MapReduce functionality, and provides an InputFormat and an OutputFormat for ActiveSpaces.  Take the class that describes your Tuples and implement the simple provided ASWritable interface.  Now you can use MapReduce to run operations over all the Tuples in your space, and chain them together in all the standard MapReduce ways.
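To make that concrete, here is a minimal sketch of what a Tuple class might look like. The `ASWritable` interface defined inline is a stand-in so the example compiles on its own (the real one ships with the integration jar), and the `OrderTuple` class and its fields are illustrative, not taken from the release:

```java
import java.io.*;

// Stand-in for the integration's ASWritable interface (an assumption for
// this sketch); it mirrors Hadoop's Writable read/write contract.
interface ASWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical class describing the Tuples in a space.
class OrderTuple implements ASWritable {
    String orderId;
    double amount;

    public void write(DataOutput out) throws IOException {
        out.writeUTF(orderId);
        out.writeDouble(amount);
    }

    public void readFields(DataInput in) throws IOException {
        orderId = in.readUTF();
        amount = in.readDouble();
    }
}

public class Demo {
    public static void main(String[] args) throws IOException {
        OrderTuple original = new OrderTuple();
        original.orderId = "A-1001";
        original.amount = 42.5;

        // Round-trip through the Writable-style contract, which is what the
        // provided InputFormat/OutputFormat do on your behalf during a job.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        OrderTuple copy = new OrderTuple();
        copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(copy.orderId + " " + copy.amount);  // A-1001 42.5
    }
}
```

In a real job you would point `job.setInputFormatClass(...)` and `job.setOutputFormatClass(...)` at the ActiveSpaces formats from the integration rather than serializing by hand.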

If, on the other hand, you’d rather script your data flows in Pig than write Java™ code, the code supplies a LoadFunc and a StoreFunc that allow full interoperability between Pig and ActiveSpaces.  Now the dataflows you’ve designed in Pig can read from and write to ActiveSpaces, taking advantage of all ActiveSpaces has to offer.
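A Pig dataflow using the integration might look roughly like this; the jar name, the fully qualified LoadFunc/StoreFunc class names, and the space names here are placeholders, not the actual ones shipped with the code:

```pig
-- Class names and space names below are illustrative assumptions.
REGISTER activespaces-pig-integration.jar;

sales   = LOAD 'orders' USING com.example.ASLoadFunc()
          AS (orderId:chararray, amount:double);
big     = FILTER sales BY amount > 1000.0;
grouped = GROUP big ALL;
totals  = FOREACH grouped GENERATE COUNT(big) AS n, SUM(big.amount) AS total;

STORE totals INTO 'order_summary' USING com.example.ASStoreFunc();
```

The point is that only the LOAD and STORE lines change: everything in between is ordinary Pig Latin, so existing scripts can be repointed at ActiveSpaces.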

Finally, we understand that Hive and its HiveQL are where it’s at.  The code supplies a Hive StorageHandler, so you can write all the HiveQL you want to slice and dice your ActiveSpaces data according to your needs.  It’s got nice little features like predicate pushdown: the part of the WHERE clause that makes sense for ActiveSpaces to evaluate gets handed off to it, so analysis over only part of your data doesn’t have to scan the whole data set.
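In HiveQL terms, the pattern looks something like the following; the table layout and the StorageHandler class name are illustrative assumptions, since the release defines the real ones:

```sql
-- Handler class and column names below are placeholders for illustration.
CREATE EXTERNAL TABLE pos_sales (store_id STRING, sku STRING, amount DOUBLE)
STORED BY 'com.example.ASStorageHandler';

-- A predicate like this one on store_id can be pushed down to ActiveSpaces,
-- so only the matching Tuples are fetched instead of the whole space.
SELECT sku, SUM(amount)
FROM pos_sales
WHERE store_id = 'NYC-42'
GROUP BY sku;
```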

If you’re wondering what you can accomplish by combining Hadoop and ActiveSpaces, let me give you a few quick examples.  The first is to store all your Point of Sale data in ActiveSpaces, and run day-by-day or minute-by-minute analysis on it.  A second would be to combine that with TIBCO Spotfire and be able to interactively peruse your Big Data, drilling down to the minute details when you need to.  That way when you’ve come across something interesting in your analysis, you have instant access to the details that help you determine whether the analysis was correct.

In one customer case, a car rental fleet analyst noticed that one particular model of car got significantly worse mileage than advertised. In this case, it turns out that one customer was stealing gas from the same car all the time.  This one behavior, repeated over time, completely skewed the aggregated gas mileage computed for every car of that model.  The ability to drill down into the details allowed the analyst to stop focusing on the car, and start focusing on the fraud.

Grab the code from TIBCOmmunity > Products > Event Processing > ActiveSpaces Data Grid and let us know what you think!