I have been working with the SMACK stack for a while now and it is great fun from a developer’s point of view. Kafka is a very robust data buffer, Spark is great at streaming all that buffered data and Cassandra is really fast at writing and retrieving it. Unless, of course, your data analysts come up with new queries for which there are no optimized Cassandra tables. In this case, you have to make a choice. You could use Spark itself to do those computations dynamically, for example with a notebook tool like Zeppelin. Depending on your data, this might take a while. Or you can store the data in a suitable form in Cassandra during ingestion – taking the penalty of duplicated data. And with the next new query, the cycle starts again…
Druid promises Online Analytical Processing (OLAP) capability in fast data contexts with realtime stream ingestion and sub-second queries. In this post, I am going to introduce the basic concepts behind Druid and show the tools in action.
Druid Segments
The equivalent of a table in Druid is called a “data source”. The data source defines how data is stored and sharded.
Druid relies on the time dimension for every query, so data sets require a timestamp column. Other than the primary timestamp, data sets can contain a set of dimensions and metrics (measures in OLAP). Let’s use an example. We’re tracking registrations for events – the time the registration occurred, the name of the event, the country the event takes place in and the number of guests registered in that reservation:
timestamp | event name | country | guests |
---|---|---|---|
2016-08-14T14:01:35Z | XP Cologne 2017 | Germany | 12 |
2016-08-14T14:01:45Z | XP Cologne 2017 | Germany | 2 |
2016-08-14T14:02:36Z | data2day 2016 | Germany | 1 |
2016-08-14T14:02:55Z | JavaOne | USA | 9 |
“Event name” and “country” are dimensions – these are data fields that analysts might want to filter on. “guests” is a metric – we can do calculations on that field, such as “How many guests have registered for the event ‘XP Cologne 2017’?” or “How many guests are attending conferences in Germany?”
Druid stores the data in immutable “segments”. A segment contains all data for a configured period of time. By default, a segment contains data for a day, but this can be set higher or lower. So depending on the configuration, the data above could be in a single segment if the segment granularity is “day”, or in two different segments if the granularity is “minute”. Besides segment granularity, there is also “query granularity”. This defines how data is rolled up in Druid. With the above example, an analyst might not care about single registration events in a segment but about the aggregation. So going by a query granularity of “minute”, we would get the following result for the data above:
timestamp | event name | country | guests |
---|---|---|---|
2016-08-14T14:01:00Z | XP Cologne 2017 | Germany | 14 |
2016-08-14T14:02:00Z | data2day 2016 | Germany | 1 |
2016-08-14T14:02:00Z | JavaOne | USA | 9 |
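To make the rollup tangible, here is a minimal Python sketch that reproduces the table above from the four example rows. It only illustrates the idea (truncate the timestamp to the query granularity, group by all dimensions, sum the metric); it is not how Druid implements rollup internally, and the names `truncate_to_minute` and `roll_up` are made up for this example.

```python
from collections import defaultdict
from datetime import datetime

# The four registration rows from the first table in this post.
ROWS = [
    ("2016-08-14T14:01:35Z", "XP Cologne 2017", "Germany", 12),
    ("2016-08-14T14:01:45Z", "XP Cologne 2017", "Germany", 2),
    ("2016-08-14T14:02:36Z", "data2day 2016", "Germany", 1),
    ("2016-08-14T14:02:55Z", "JavaOne", "USA", 9),
]

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def truncate_to_minute(timestamp: str) -> str:
    """Drop the seconds, i.e. apply a query granularity of 'minute'."""
    parsed = datetime.strptime(timestamp, TIME_FORMAT)
    return parsed.replace(second=0).strftime(TIME_FORMAT)


def roll_up(rows):
    """Sum the guests metric per (truncated timestamp, event name, country)."""
    totals = defaultdict(int)
    for timestamp, event_name, country, guests in rows:
        key = (truncate_to_minute(timestamp), event_name, country)
        totals[key] += guests
    return [key + (guests,) for key, guests in sorted(totals.items())]


for row in roll_up(ROWS):
    print(row)
```

The two “XP Cologne 2017” registrations fall into the same minute and share all dimension values, so they collapse into a single row with 14 guests. The other two rows stay separate because their dimensions differ.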
Data rollup can save you quite a bit of storage capacity, but of course you lose the ability to query individual events. Now that we know about the basic data structures in Druid, let’s look at the components that make up a Druid cluster.
Cluster components
Well, I never promised that this was going to be easy. Druid requires quite a few components and external dependencies that need to work together. Let’s step through them:
Historical Nodes
These are nodes that do three things really well: loading immutable data segments from deep storage, dropping those segments and serving queries about loaded segments. In a production environment, deep storage will typically be Amazon S3 or HDFS. A historical node only cares about its own loaded segments. It does not know anything about any other segment. How does it know what segments to load? For that, we have …
Coordinator nodes
Coordinator nodes are the housekeepers in a Druid cluster. They make sure that all serviceable segments are loaded by at least one Historical Node – depending on configured replication factors. They also make sure Historicals drop segments that are no longer needed, and they rebalance segments for a somewhat even distribution within the cluster. So that’s historical data covered. The communication between Historicals and Coordinators happens indirectly using another external dependency: Apache Zookeeper, which you may already know from Kafka or Mesos. You can run multiple Coordinators for high availability; a leader will be elected.
So what about the promised realtime ingestion?
Realtime Ingestion
While Druid can be used to ingest batches of data, realtime data ingestion is one of its strong points. At the moment, there are two options available for realtime ingestion. The first option is Realtime Nodes, but I will not go into detail about them because they seem to be on their way out. Superseding Realtime Nodes is the Indexing Service. Now, the Indexing Service is a bit more involved in itself. It is centered around the notion of tasks. The component that accepts tasks from the outside is the “Overlord”. It then forwards those tasks to so-called “Middle Managers”. These nodes in turn spawn so-called “Peons”, each of which runs in a separate JVM and can only run one task at a time. A Middle Manager can run a specified number of Peons. If all Middle Managers are occupied and the Overlord is not able to assign a new task anywhere, there is some autoscaling capability built in to create new Middle Managers in suitable environments (e.g. AWS). So for realtime ingestion, a task is created for every new segment that is to be created. Data becomes queryable immediately after ingestion. The segments are announced in the third external dependency – a *gasp* relational database. The database contains the metadata about the segments and some rules that decide when this data should be loaded or dropped by Historicals. Once a segment is complete, it is handed off to deep storage, metadata is written and the task is completed.
The Indexing Service API is a bit low level. To make working with it easier, the “Tranquility” project seems to have established itself as the entrypoint for realtime ingestion. Tranquility servers can provide HTTP endpoints for data ingestion or read data from Kafka.
OK then, we have the data ingested and stored. How can you query it?
Broker Nodes
Broker Nodes are your gateway to the data stored in Druid. They know which segments are available on which nodes (from Zookeeper), query the responsible nodes and merge the results together. The primary query “language” is a REST API, but tool suites like the Imply Analytics Platform extend that support with tools like Pivot and PlyQL. We will see those later.
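The “query the responsible nodes and merge the results” step can be pictured with a small toy model. The Python sketch below assumes that each node returns partial sums per time bucket and that the aggregation is a plain sum, so partial results can simply be added. The function name and the numbers are invented for illustration and say nothing about how Druid’s Broker is actually implemented.

```python
from collections import Counter


def merge_partial_results(partials):
    """Merge per-node results by summing the values of equal time buckets."""
    merged = Counter()
    for node_result in partials:
        for timestamp, value in node_result.items():
            merged[timestamp] += value
    # Return the buckets in ascending time order.
    return dict(sorted(merged.items()))


# Made-up partial results from two nodes that hold data for the same hours.
node_a = {"2016-08-18T19:00:00.000Z": 100, "2016-08-18T20:00:00.000Z": 50}
node_b = {"2016-08-18T19:00:00.000Z": 149, "2016-08-18T20:00:00.000Z": 31}

print(merge_partial_results([node_a, node_b]))
```

This only works so directly because sums can be combined without looking at the raw rows again. Aggregations such as averages need intermediate state (for example a sum and a count) to be merged correctly.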
As you can hardly be expected to remember all this prose, here is a diagram that shows the components and their relations:
Example
In this example, we’ll set up a Druid cluster that ingests RSVP events from Meetup.com – these are published on a very accessible WebSocket API.
Instead of starting each Druid component on its own, we’ll take a shortcut to set up the Imply Analytics Platform mentioned above. That way we won’t gain too much insight about the inner workings, but we’ll get our hands on a working “cluster”. I put cluster in quotes because we will run all processes on a single machine. Deep storage will be simulated by the local file system and the RDBMS is an in-memory Derby.
The data
Let’s take a look at the data first. We receive events of the following type from Meetup:

    {
      "venue": {
        "venue_name": "Rosslyn Park FC",
        "lon": -0.226583,
        "lat": 51.462914,
        "venue_id": 24685117
      },
      "visibility": "public",
      "response": "no",
      "guests": 0,
      "member": {
        "member_id": 4711,
        "member_name": "Paula"
      },
      "rsvp_id": 1624399275,
      "mtime": 1471357877547,
      "event": {
        "event_name": "Pre-season training",
        "event_id": "233241355",
        "time": 1471457700000,
        "event_url": "http:\/\/www.meetup.com\/Rosslyn-Park-Womens-Rugby\/events\/233241355\/"
      },
      "group": {
        "group_topics": [
          { "urlkey": "rugby", "topic_name": "Rugby" },
          { "urlkey": "playing-rugby", "topic_name": "Playing Rugby" }
        ],
        "group_city": "London",
        "group_country": "gb",
        "group_id": 20231882,
        "group_name": "Rosslyn Park Womens Rugby",
        "group_lon": -0.1,
        "group_urlname": "Rosslyn-Park-Womens-Rugby",
        "group_lat": 51.52
      }
    }

For this example, I decided that the following fields could be of interest:
- mtime: The time of the RSVP in milliseconds since the epoch
- response: The response of the RSVP – “yes” or “no”
- guests: The number of additional guests covered by this RSVP
- event_name: The name of the event
- group_name: The name of the Meetup group
- group_city: The city of the Meetup group
- group_country: The country of the Meetup group
- member_name: The name of the member issuing the RSVP
- member_id: The ID of the member issuing the RSVP
- member.other_services.twitter.identifier: The Twitter handle of the member if available
- venue.lat: Latitude of the venue
- venue.lon: Longitude of the venue
A simple Akka HTTP client connects to the WebSocket (see this Gist) and transforms the data into a flat structure (expected by Druid). A single data row looks like this:

    {
      "rsvpTime": 1471549945000,
      "response": "yes",
      "guests": 0,
      "eventName": "Missouri Patriot Paws",
      "eventTime": 1472601600000,
      "groupName": "Meet me at the library. No library card needed!",
      "groupCity": "O Fallon",
      "groupCountry": "us",
      "memberName": "Hans Dampf",
      "memberId": 4711,
      "twitterName": null,
      "venueName": "St. Charles County Library Middendorf-Kredell Branch",
      "venueLat": 38.767715,
      "venueLong": -90.69902
    }

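The actual client is the Akka HTTP one from the Gist. Purely as an illustration of the flattening step, here is a rough Python equivalent. The target field names are taken from the flat row above, the source paths from the field list and the raw event. The function name `flatten_rsvp` is made up, and the handling of missing fields (falling back to `None`) is an assumption, not something taken from the original client.

```python
def flatten_rsvp(event: dict) -> dict:
    """Turn a nested Meetup RSVP event into a flat row for Druid."""
    member = event.get("member", {})
    venue = event.get("venue", {})
    group = event.get("group", {})
    meetup_event = event.get("event", {})
    # Not every member has linked a Twitter account, so this may be empty.
    twitter = member.get("other_services", {}).get("twitter", {})
    return {
        "rsvpTime": event.get("mtime"),
        "response": event.get("response"),
        "guests": event.get("guests", 0),
        "eventName": meetup_event.get("event_name"),
        "eventTime": meetup_event.get("time"),
        "groupName": group.get("group_name"),
        "groupCity": group.get("group_city"),
        "groupCountry": group.get("group_country"),
        "memberName": member.get("member_name"),
        "memberId": member.get("member_id"),
        "twitterName": twitter.get("identifier"),
        "venueName": venue.get("venue_name"),
        "venueLat": venue.get("lat"),
        "venueLong": venue.get("lon"),
    }


# A shortened version of the raw event shown earlier.
raw_event = {
    "venue": {"venue_name": "Rosslyn Park FC", "lon": -0.226583, "lat": 51.462914},
    "response": "no",
    "guests": 0,
    "member": {"member_id": 4711, "member_name": "Paula"},
    "mtime": 1471357877547,
    "event": {"event_name": "Pre-season training", "time": 1471457700000},
    "group": {
        "group_city": "London",
        "group_country": "gb",
        "group_name": "Rosslyn Park Womens Rugby",
    },
}

print(flatten_rsvp(raw_event))
```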
Simple local setup
Before we can send this data to Druid, we need to start it up. The guys at Imply make this really easy for us. Following their quickstart guide for a local setup, we just need to execute these commands:

    curl -O https://static.imply.io/release/imply-1.3.0.tar.gz
    tar -xzf imply-1.3.0.tar.gz
    cd imply-1.3.0
    bin/supervise -c conf/supervise/quickstart.conf

Yet before we do that, we need to tell Tranquility about our Meetup datasource. To do this, we edit conf-quickstart/tranquility/server.json to add the following datasource:

    {
      "spec": {
        "dataSchema": {
          "dataSource": "meetup",
          "parser": {
            "type": "string",
            "parseSpec": {
              "timestampSpec": {
                "column": "rsvpTime",
                "format": "auto"
              },
              "dimensionsSpec": {
                "dimensions": [
                  "response",
                  "eventName",
                  "eventTime",
                  "groupName",
                  "groupCity",
                  "groupCountry",
                  "memberName",
                  "memberId",
                  "twitterName",
                  "venueName"
                ],
                "dimensionExclusions": [
                  "guests",
                  "rsvpTime"
                ],
                "spatialDimensions": [
                  {
                    "dimName": "venueCoordinates",
                    "dims": [
                      "venueLat",
                      "venueLong"
                    ]
                  }
                ]
              },
              "format": "json"
            }
          },
          "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "hour",
            "queryGranularity": "none"
          },
          "metricsSpec": [
            {
              "type": "count",
              "name": "count"
            },
            {
              "name": "guestsSum",
              "type": "doubleSum",
              "fieldName": "guests"
            },
            {
              "fieldName": "guests",
              "name": "guestsMin",
              "type": "doubleMin"
            },
            {
              "type": "doubleMax",
              "name": "guestsMax",
              "fieldName": "guests"
            }
          ]
        },
        "ioConfig": {
          "type": "realtime"
        },
        "tuningConfig": {
          "type": "realtime",
          "maxRowsInMemory": "100000",
          "intermediatePersistPeriod": "PT10M",
          "windowPeriod": "PT10M"
        }
      },
      "properties": {
        "task.partitions": "1",
        "task.replicants": "1"
      }
    }

This spec basically tells Tranquility and Druid which fields in our dataset are event timestamps, dimensions and metrics. We can also combine latitude and longitude into a spatial dimension. In the “granularitySpec” section, we specify that we want our segments to cover an hour and the query granularity to be “none” – this preserves single records and is nice for our example, but probably not what you would do in production. After editing the file, we can start Druid with the steps shown above. Once it is running, we can ingest data by posting HTTP requests to http://localhost:8200/v1/post/meetup. To check if the meetup ingestion is running, we can open the Overlord console:
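For reference, posting a single row to the Tranquility endpoint mentioned above might look roughly like the following Python sketch. It uses only the standard library. The helper names are made up, and sending the row as a JSON body with a `Content-Type: application/json` header is an assumption about what the Tranquility server expects, so treat it as a starting point rather than a verified client.

```python
import json
import urllib.request

# Endpoint of the local Tranquility server for the "meetup" datasource.
TRANQUILITY_URL = "http://localhost:8200/v1/post/meetup"


def build_post_request(row: dict, url: str = TRANQUILITY_URL) -> urllib.request.Request:
    """Wrap one flat row into an HTTP POST request with a JSON body."""
    body = json.dumps(row).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_row(row: dict) -> str:
    """Send the row and return the raw response body (needs a running server)."""
    with urllib.request.urlopen(build_post_request(row), timeout=10) as response:
        return response.read().decode("utf-8")


row = {"rsvpTime": 1471549945000, "response": "yes", "guests": 0}
request = build_post_request(row)
print(request.get_method(), request.full_url)
# With the quickstart running, send_row(row) would perform the actual POST.
```

Keep in mind that the spec above sets a `windowPeriod` of ten minutes, so the `rsvpTime` of a test row should be close to the current time.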
Accessing the data
So now we’re ingesting the data, how can we get to it? I am going to show three different ways.
Druid queries
The basic way to get data out of Druid is to run a plain Druid query. This means posting the query in JSON format against the broker node. If for example we would like to get the count of all RSVPs from Germany or the US per hour in a specified period, we’d post the following data to http://localhost:8082/druid/v2/?pretty:

    {
      "queryType": "timeseries",
      "dataSource": "meetup",
      "granularity": "hour",
      "descending": "true",
      "filter": {
        "type": "or",
        "fields": [
          { "type": "selector", "dimension": "groupCountry", "value": "de" },
          { "type": "selector", "dimension": "groupCountry", "value": "us" }
        ]
      },
      "aggregations": [
        { "type": "longSum", "name": "rsvpSum", "fieldName": "count" }
      ],
      "postAggregations": [ ],
      "intervals": [ "2016-08-14T00:00:00.000/2016-08-20T00:00:00.000" ]
    }

This yields the following response (extract):

    [
      {
        "timestamp": "2016-08-18T20:00:00.000Z",
        "result": { "rsvpSum": 81 }
      },
      {
        "timestamp": "2016-08-18T19:00:00.000Z",
        "result": { "rsvpSum": 249 }
      },
      {
        "timestamp": "2016-08-17T11:00:00.000Z",
        "result": { "rsvpSum": 316 }
      }
    ]

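If you would rather build such a query in code than by hand, a small Python sketch could look like this. It assembles the same timeseries query as above for an arbitrary list of countries and posts it to the Broker. The helper names are invented, and the JSON content type header is an assumption on my part.

```python
import json
import urllib.request

BROKER_URL = "http://localhost:8082/druid/v2/?pretty"


def country_filter(countries):
    """Build an 'or' filter with one selector per country code."""
    return {
        "type": "or",
        "fields": [
            {"type": "selector", "dimension": "groupCountry", "value": country}
            for country in countries
        ],
    }


def rsvp_count_query(countries, interval):
    """Hourly RSVP counts for the given countries within the interval."""
    return {
        "queryType": "timeseries",
        "dataSource": "meetup",
        "granularity": "hour",
        "descending": "true",
        "filter": country_filter(countries),
        "aggregations": [
            {"type": "longSum", "name": "rsvpSum", "fieldName": "count"}
        ],
        "postAggregations": [],
        "intervals": [interval],
    }


def run_query(query, url=BROKER_URL):
    """POST the query to the Broker and parse the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


query = rsvp_count_query(
    ["de", "us"], "2016-08-14T00:00:00.000/2016-08-20T00:00:00.000"
)
print(json.dumps(query, indent=2))
# With the quickstart running, run_query(query) would return the hourly buckets.
```

Note that the aggregation is a `longSum` over the `count` metric rather than a new `count` aggregation: if rollup is enabled, counting stored rows would undercount the original events.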
The queries that Druid performs best are time series and TopN queries. The documentation gives you an idea of what is possible.
Pivot
Imply provides “Pivot” at http://localhost:9095. Pivot is a GUI for analyzing a Druid datasource and is very accessible. You are greeted by something like this:
If we want to see the events with the biggest number of RSVPs in the last week including the split between “yes” and “no”, we certainly can do that:
We can also look at the raw data:
Playing around with Pivot to get a feeling for the tool and your data is certainly fun, and it works like this out of the box. Pivot is based on “Plywood” – a JavaScript library that acts as an integration layer between Druid data and visualization frontends and is also part of Imply.
PlyQL
Another part of Imply is “PlyQL”. As you can imagine from the name, it aims to provide SQL-like access to the data. For our Meetup data, we start by looking at the set of tables that we can query:

    bin/plyql --host localhost:8082 -q 'SHOW TABLES'
    ┌────────────────────┐
    │ Tables_in_database │
    ├────────────────────┤
    │ COLUMNS            │
    │ SCHEMATA           │
    │ TABLES             │
    │ meetup             │
    └────────────────────┘

Describing the table “meetup” gives the following overview:

    bin/plyql --host localhost:8082 -q 'DESCRIBE meetup'
    ┌──────────────────┬────────┬──────┬─────┬─────────┬───────┐
    │ Field            │ Type   │ Null │ Key │ Default │ Extra │
    ├──────────────────┼────────┼──────┼─────┼─────────┼───────┤
    │ __time           │ TIME   │ YES  │     │         │       │
    │ count            │ NUMBER │ YES  │     │         │       │
    │ eventName        │ STRING │ YES  │     │         │       │
    │ eventTime        │ STRING │ YES  │     │         │       │
    │ groupCity        │ STRING │ YES  │     │         │       │
    │ groupCountry     │ STRING │ YES  │     │         │       │
    │ groupName        │ STRING │ YES  │     │         │       │
    │ guestsMax        │ NUMBER │ YES  │     │         │       │
    │ guestsMin        │ NUMBER │ YES  │     │         │       │
    │ guestsSum        │ NUMBER │ YES  │     │         │       │
    │ memberId         │ STRING │ YES  │     │         │       │
    │ memberName       │ STRING │ YES  │     │         │       │
    │ response         │ STRING │ YES  │     │         │       │
    │ twitterName      │ STRING │ YES  │     │         │       │
    │ venueCoordinates │ STRING │ YES  │     │         │       │
    │ venueName        │ STRING │ YES  │     │         │       │
    └──────────────────┴────────┴──────┴─────┴─────────┴───────┘

Finding the five events where the positive RSVPs have the highest average of additional guests is possible using this query:

    bin/plyql --host localhost:8082 -q \
    'SELECT eventName, avg(guestsSum) as avgGuests \
    FROM meetup \
    WHERE "2015-09-12T00:00:00" <= __time \
    AND __time < "2016-09-13T00:00:00" \
    AND response = "yes" \
    GROUP BY eventName \
    ORDER BY avgGuests DESC \
    LIMIT 5'
    ┌───────────────────────────────────────────────────────────────────────────────┬────────────────────┐
    │ eventName                                                                     │ guests             │
    ├───────────────────────────────────────────────────────────────────────────────┼────────────────────┤
    │ 8/19 NEC Business & Social Networking Meetup at the JW Marriott Hotel         │ 69                 │
    │ Southwest Suburbs Sports, Outdoors, & Social Group Fall Picnic                │ 63                 │
    │ Chicago Caming, Canoeing, and Outdoors Adventure Group Fall Picnic            │ 60                 │
    │ Saturday, September 10th 2016 Dance @ Dance New York Studio!                  │ 49.5               │
    │ Mingle & 90s/00s Piccadilly Party with 1 x FREE FOOD!! & Happy Hour until 9pm │ 39                 │
    │ Calpe Beach and Guadalest Castle (Option to climb Penyon de Ifach)            │ 30                 │
    └───────────────────────────────────────────────────────────────────────────────┴────────────────────┘

We could also use the -o switch to get the output of our queries as JSON.
Summary
This concludes our quick walkthrough of Druid. We talked about the basic concepts of Druid, ran an ingestion of realtime Meetup data and looked at ways to access the data with plain Druid and the very interesting Imply Analytics Platform. Druid is a very promising piece of technology that warrants an evaluation if you’re trying to run OLAP queries on realtime fast data.
For further reading, I suggest:
The post Realtime Fast Data Analytics with Druid appeared first on codecentric AG Blog.