In this blog post we’ll have a look at how easy it is to do some (near-)realtime analytics with your (big) data. I will use some well-known technologies like MongoDB and node.js and a lesser known JavaScript library called Smoothies Charts for realtime visualization.
Streaming Data from MongoDB
There is no official API to stream data out of MongoDB in an asynchronous manner. But it is possible to do so by trailing the oplog, MongoDB’s internal collection for replication (which is basically an implementation of an distributed log). Each document (and each other update or remove operation) will show up in that collection when using a replica set. So let’s set up a minimal replica set with 2 members:
$ mkdir -p /data/rs01 $ mkdir -p /data/rs02 $ mongod --port 27017 --dbpath /data/rs01 --replSet rs $ mongod --port 27018 --dbpath /data/rs02 --replSet rs |
After the instances are up, we have to init the replica set. We connect to one of the instances using the mongo shell and issue the following command:
$ mongo ... > rs.initiate({ _id: "rs", members: [ {_id: 0, host: "localhost:27017"}, {_id: 1, host: "localhost:27018"} ]}) |
After some time, one of the instance becomes the primary node, the other the secondary.
Watching the Oplog
There are several node modules out there that allow you to easily watch the oplog. We’ll use …
npm install mongo-oplog-watcher |
This module allows us to register a callback for each replication operation (insert, update, delete) quite easily:
var OplogWatcher = require('mongo-oplog-watcher'); var oplog = new OplogWatcher({ host:"127.0.0.1" ,ns: "test.orders" }); oplog.on('insert', function(doc) { console.log(doc); }); |
We connect to the local replica set and are only interested in insert operations on the collection test.orders
. Right now, we are just logging the received document.
Simple Analytics
For our demo, we use one of the simplest aggregation available: we will count things. To be more specific, we’ll count product orders grouped by category (books, electronics, shoes, etc.). Each document in test.orders
has field cid
that holds the product category:
{ _id: ObjectId("..."), cid: "BK", // "BK" = books, "SH" = shoes, "EL" = electronics ... } |
The aggregation implementation is not that complicated. If we get an order for a yet unknown category, the counter is initialized with 1 and incremented otherwise:
var categoryCounter = {}; exports.addData = function( document ) { cat = document.cid; if ( cat ) { catCount = categoryCounter[cat]; if (catCount) { categoryCounter[cat]++; } else { categoryCounter[cat] = 1; } } }; exports.aggregate = function() { return categoryCounter; } |
This aggregation can easily extended with a timestamp to count all orders in a time frame, like orders per minute, orders per hour and so on. Do this as an exercise.
Client Notification
With each emitted document (that frequency maybe reduced in productio with heavy write load), the dispatcher pushes the updates to all registered web clients which use web sockets to connect to the node server process. We are using the node module socket.io
:
npm install socket.io |
The code on the server side looks basically like this:
// init web sockets var clients = []; var io = require('socket.io').listen(httpd); io.sockets.on('connection', function(socket) { clients.push(socket); }); io.sockets.on('disconnect', function(socket) { clients.pull( clients.indexOf(socket) ); }); // client notification var async = require('async'); if ( clients.length > 0 ) { console.info("Starting push to clients ..."); async.eachSeries( clients, function(socket, callback) { socket.emit(data_key, data); callback(); }, function(err) { console.info(err); } ); } |
The array of connected client get its notification asynchronously by use of the async
module.
Visualization – Smoothies Charts
The visualization is done by a JavaScript library called Smoothies Charts which supports drawing reatime graphs in an easy manner. After opening a web socket to the node process, the graph is initialized. We are plotting one line for each product category:
var CATEGORIES = [ 'BK', // books 'EL', // electronics 'SH' // shoes ]; var STROKES = {'BK': 'rgba(0, 255, 0, 1)', 'EL': 'rgba(255, 0, 0, 1)', 'SH': 'rgba(0, 0, 255, 1)' }; var FILLS = {'BK': 'rgba(0, 255, 0, 0.2)', 'EL': 'rgba(255, 0, 0, 0.2)', 'SH': 'rgba(0, 0, 255, 0.2)'}; var orders = []; var current_state = null; function initChart() { var chart = new SmoothieChart(); forEachCategory( function(category) { orders[category] = new TimeSeries(); chart.addTimeSeries(orders[category], { strokeStyle : STROKES[category], fillStyle : FILLS[category], lineWidth : 2 }); }); chart.streamTo(document.getElementById("orders-category"), 500); setInterval( autoUpdate, 2000 ); } function initWebsocket() { var socket = io.connect('http://localhost:8080'); socket.on('order_aggregates', function (data) { current_state = data; updateChart(data); }); } function updateChart(data) { if (data) { forEachCategory( function(category) { orders[category].append(new Date().getTime(), data[category] ); }); } } |
This results in three monotonically increasing graphs that may look like this:
Conclusion
Of course, this example is simple. But it illustrates how to put up a (near) realtime view to your data. After streaming your data out of your NoSQL datastore, you may want to use more mature solutions like Storm or Splunk to process that data. Your system may not only visualize your data, it can also perform actions, something like increasing the amount of ad banners for the product category that does not sell very well etc.
The full source code for our demo application can be found at this github repo.
The post Near-Realtime Analytics with MongoDB, Node.js & SmoothieCharts appeared first on codecentric Blog.