
I don't know Riak, other than that it's a distributed NoSQL key-value data store.

Time series data has been prevalent in fintech, quantitative finance, and other disciplines for decades. I read a book in the early 1990s on music as time series data, financial tickers, and so on.

How is Riak different from, or better suited to use than, kdb+ with q[1], J with JDB (free), Jd (a commercial J database like kdb+/q)[2], or the new Kerf language/database being developed by Kevin Lawler[3]?

Kevin also wrote kona, an open-source version of the K programming language[4].

Kdb is very fast at time series analysis on large datasets, and has many years of proven value in the financial industry.

[1] https://kx.com/

[2] http://www.jsoftware.com/jdhelp/overview.html

[3] https://github.com/kevinlawler/kerf

[4] https://github.com/kevinlawler/kona



Minor quibble: the article is about RiakTS, their time-series enhanced version of riak_core.

riak_core's main strength is that it does key-value storage in a distributed, resilient manner, keeping multiple copies of each value (at least 3) spread across a cluster of servers. Kill one server, no problem. Need more capacity? Add servers and it rebalances itself.

The TS part is just an optimization built on top of that, so that values near each other in time are also near each other in storage, for faster range-based retrieval.
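
A rough sketch of that idea in Python (the key layout, quantum size, and names here are illustrative, not Riak TS's actual on-disk format): encode the timestamp big-endian after a series prefix, so byte-wise key order matches chronological order and a range scan reads one contiguous run of storage.

```python
import struct

def ts_key(series_id: str, epoch_ms: int, quantum_ms: int = 15 * 60 * 1000) -> bytes:
    """Encode (series, time) so keys in the same time quantum sort together,
    and quanta sort chronologically. Big-endian packing makes byte-wise
    comparison agree with numeric timestamp order."""
    bucket = epoch_ms - (epoch_ms % quantum_ms)
    return series_id.encode() + b"\x00" + struct.pack(">QQ", bucket, epoch_ms)

# Keys for nearby timestamps land next to each other when sorted, so a
# range query over a time window touches one contiguous region.
keys = sorted(ts_key("sensor-42", t) for t in (1000, 5000, 2000))
assert keys == [ts_key("sensor-42", t) for t in (1000, 2000, 5000)]
```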


Yes, exactly. (Author)


TS enhanced version of riak_core, got it.

Most TS (tick) DBs are columnar, memory-mapped, fast and light systems.

Are there any benchmarks similar to STAC-M3, which runs a year's worth of NYSE data on different hardware configurations to gauge kdb+'s effectiveness [1]? It's a great way to gauge performance and TCO.

Does it do both memory (streaming data) and disk-based (historical) storage for big data set analytics in realtime?

I'd be interested to see numbers there.

A lot of people think kdb+ is only for finance. There is a conference coming up in May with talks on q (the language of the kdb+ database), including natural language processing and machine learning in q, to name a few. Another is about using it at a power plant to route power most efficiently based on realtime data [2].

I only got into kdb+ and q with the free, non-commercial 32-bit version. I usually use J and sometimes APL, which has had MapReduce-style operations since at least the '80s. Check out this post from 2009 [3]. I guess the 'new shiny' bit threw me in your chosen title.

[1] https://stacresearch.com/news/2014/02/13/stac-reports-intel-...

[2] https://kxcon2016.com/agenda/

[3] http://blog.data-miners.com/2009/04/mapreduce-hadoop-everyth...


You might find these benchmarks interesting: http://kparc.com/q4/readme.txt


I was inquiring about benchmarks for RiakTS, but your link was perfect. I am a J/APL dabbler, and quite recently learning kdb+/q (I prefer k).

As much as I step away from these languages, I always find my way back to them in strange ways. I was studying music, and there was a great J article in Vector magazine from August 2006 [1] that walks through scales and other musical concepts in J.

A Forth-based music language called Sporth [2] has a kona ugen in it, so you can generate scales or other musical material in kona and then use it in the stack-based Sporth audio language.

My interests in kdb+/q, k, J and APL are in applying them to mathematical investigations of music, visuals, doing data analysis, and then just code golfing, or toying around. They're so much fun!

I need more time with large streaming datasets (time series data) than with large disk-based datasets to really test latencies. I am building a box much better suited for that than my current machine. The goal is to stay in RAM as much as possible.

[1] http://archive.vector.org.uk/art10010610

[2] https://github.com/PaulBatchelor/Sporth


You should definitely check out JohnEarnest/RodgerTheGreat's iKe, built on his open source k interpreter in JS. Fun examples: http://johnearnest.github.io/ok/ike/ike.html?gist=bbab46d613... and http://johnearnest.github.io/ok/ike/ike.html?gist=b741444d04...

https://github.com/JohnEarnest/ok/tree/gh-pages/ike

https://github.com/JohnEarnest/ok

And related APL/J/K subreddit: https://www.reddit.com/r/apljk/


I had stumbled upon John's work before. I am currently dabbling with a stack-based audio language called Sporth [1], and messing with the idea of somehow mashing it up with John's ike project.

See, vector/array languages aren't just for FinTech or Time Series!

[1] https://github.com/PaulBatchelor/Sporth


As far as I know, this and KDB have very different use cases. KDB is a single box timeseries database - the closest open source analogue would probably be using Pandas + a big folder of CSV files. (KDB performs a lot better than this, however.)

The use case is storing tick data + economic data for all the symbols. I.e., one timeseries per publicly traded company. The primary use case is loading a significant chunk of that data and running some statistical analysis on it.
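
That "Pandas + a big folder of CSV files" setup might look like this minimal sketch (the file layout and column names are made up for illustration): one CSV per symbol, loaded into a single frame and analyzed in bulk.

```python
import glob
import os
import pandas as pd

def load_ticks(folder: str) -> pd.DataFrame:
    """Load every per-symbol CSV (e.g. ticks/AAPL.csv with date,price
    columns) into one DataFrame, tagging rows with the symbol name."""
    frames = []
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        df = pd.read_csv(path, parse_dates=["date"])
        df["symbol"] = os.path.splitext(os.path.basename(path))[0]
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

def daily_returns(ticks: pd.DataFrame) -> pd.DataFrame:
    """A typical analysis pass: per-symbol daily returns."""
    ticks = ticks.sort_values(["symbol", "date"]).copy()
    ticks["ret"] = ticks.groupby("symbol")["price"].pct_change()
    return ticks
```

KDB performs far better than this, as noted, but the access pattern is similar: pull a big chunk of history, then compute over it.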

Also KDB pricing starts at $100k or something like that.

In contrast, Riak's timeseries product is distributed. It could store millions of timeseries if you throw enough boxes at it. You probably won't be loading all the data for analysis, probably you'll be processing some of the data one series at a time, and the rest is just for reference.

The main use case here is sensor networks (aka "internet of things") more than financial data. I.e., one timeseries per rotor on a drone, or per sensor on your phone, something like that.


> KDB is a single box timeseries

It’s not limited to a single box. “Single box” is the easiest setup, but by no means the common use case. kdb+ can be, and is, used across multiple servers and multiple storage devices.

> the closest open source analogue would probably be using Pandas + a big folder of CSV files

I’m not familiar with Pandas, but I know what “a big folder of CSV files” looks like and that’s pretty far from kdb+. Historical/disk data is typically organized as “splayed/parted” tables, which means it’s column oriented, sorted, grouped/hashed by key columns, can be distributed across multiple devices[1], and has built-in compression. So yes, operating over binary data, with a well-organized physical layout, and some optimized data structures is going to be much faster, not to mention the implied storage/server cost savings.
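
To make the splayed layout concrete, here's a toy analogue in Python (an illustration of the file-per-column idea only, not kdb+'s actual on-disk format): each column lives in its own binary file and is memory-mapped on read, so a query touches only the columns it needs.

```python
import os
import numpy as np

def splay(folder: str, table: dict[str, np.ndarray]) -> None:
    """Write each column of the table to its own binary file."""
    os.makedirs(folder, exist_ok=True)
    for name, col in table.items():
        col.tofile(os.path.join(folder, name))

def read_column(folder: str, name: str, dtype) -> np.ndarray:
    # Memory-map the column so only the pages actually scanned by the
    # query are faulted in from disk.
    return np.memmap(os.path.join(folder, name), dtype=dtype, mode="r")

# Usage: store price and size columns, then answer a query that reads
# price alone, leaving the size file untouched.
# splay("trades", {"price": prices, "size": sizes})
# avg_price = read_column("trades", "price", np.float64).mean()
```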

> Also KDB pricing starts at $100k or something like that.

I wish Kx were more open about this, but this is not very accurate. You pay per CPU core, or for query throughput. The size of your data or the number of users doesn’t matter, so you can start out for 1/10th of this cost. Of course your need for query throughput will correlate with things like the number of users/clients and the size of your data. But the licensing model is pretty simple, and seems to fit well with how user access scales.

I know Kx makes the argument that the licensing costs are more than offset by the reduced server footprint. Personally, I’d love to see an actual case study that shows costs for an open-source database solution (with some performance metrics/benchmarks) alongside a performance-equivalent kdb+ solution that costs less (and is projected to scale for less). I know that’s a complicated comparison, where you’d really need to account for things like developer/admin time/salaries, not just equipment/bandwidth/storage/licensing costs. But it would be great to at least see a cost comparison on the latter dimensions.

[1] - http://code.kx.com/wiki/JB:KdbplusForMortals/kdbplus_databas...


So the file format is a lot better than CSV files, but in principle it's basically just a bunch of files. Maybe a better analogy would have been a big folder of feather/hdf5/etc files.

(Incidentally, I'm a big fan of the folder/s3 bucket/etc full of CSV/binary files and use it whenever possible.)

I agree - it's absolutely better to use than that, but it's a lot closer to that model than to the Riak model of querying a distributed system to send you the data.

I stand corrected on single box and pricing - it's been a while since I've used it.


kdb+/q/k are used for IoT applications [1], not just fintech. After all, it is all time series data.

The benchmarks given in a response above by srpeck [2] show Spark/Shark to be 230 times slower than a k4 query, using 50GB of RAM vs. 0.2GB for k4. If RiakTS relies on Spark/Shark as its in-memory database engine, it is already at a big disadvantage compared to k in terms of speed, and in all the RAM that will be required on those distributed servers.

I will have to look at the DDL/math functions available in RiakTS too, since that is how you get your work done regardless of speed of access.

[1] http://www.kdnuggets.com/2016/04/kxcon2016-kdb-conference-ma...

[2] http://kparc.com/q4/readme.txt


Very cool, I stand corrected. I hope one day I have another opportunity to play with KDB.

As for the speed advantage, you'll have a similar speed advantage with python/pandas/big folder of CSV files. For all of Spark's claims on "speed", it's really just reducing the speed penalty of Hadoop from 500x to 50x. (Here 500x and 50x refer to the performance of loading flat files from a disk.)


Do you really mean flat CSV text files? I get the simplicity of that, but it seems really expensive (in speed and size). But I'm used to tables with more than a dozen columns, and with kdb+ you pull in only the columns of interest, and the rows of interest (due to on-disk sorting and grouping), which is a smaller subset, often much smaller.


By number, my data sets are usually in CSV. I could probably get some additional advantage via HDF5, but a gzipped CSV is usually good enough and simpler. By volume (i.e. for my 2 or 3 biggest data sets) it'll probably be mostly HDF5. I haven't tried feather yet, but it looks pretty nice.

KDB would probably be better, but don't underestimate what you can do with just a bunch of files.


RiakTS does not rely on any external data storage (other than our fork of Google's leveldb) or processing tool, so Spark's performance is irrelevant.


KDB is not distributed, and K (an APL descendant) is not particularly pleasant to work with. While it has a proven track record in fintech, no one I know of is particularly fond of working with this technology. Not to mention the cost of K developers ($200K+). Riak is simply offering a free alternative to these systems that is very palatable.


I’m not sure what you mean by "not distributed". With kdb+ you have a lot of flexibility in how to set up the database. You can organize the data to be stored in a distributed fashion (across multiple devices, multiple servers), you can set up query load balancers to distribute workloads, and you can replicate to multiple servers/devices. You don’t have to use k; you code in q, which most people find far easier to read/write. There’s a wealth of information to help with all this[1], and a very responsive user group.

But yes, you do need kdb+ expertise to get full use out of the tool. And yes, feelings seem to run strong towards kdb+, in both directions, love/hate it. And correct again, it’s not free and the licensing cost is definitely a hurdle to wider adoption.

[1] - http://code.kx.com


>I’m not sure what you mean by "not distributed". With kdb+ you have a lot of flexibility in how to setup the database. You can organize the data to be stored in a distributed fashion (across multiple devices, multiple servers).

So I was under the impression (feel free to correct me) that kdb+ horizontal scaling was something akin to Oracle RAC, i.e. horizontal in name only: the data is only ever available from one physical instance at a time.


Having worked with both Oracle and KDB, the thing to understand is that KDB (or k or q) is a fully fledged language... you can do whatever you like, really. If you want something distributed, you just design and build it yourself. It's not like Oracle, where everything is opaque and buried in query plans and the like.


> How is Riak different from, or better suited to use than, kdb+ with q, J with JDB (free), Jd (a commercial J database like kdb+/q)[2], or the new Kerf lang/db being developed by Kevin Lawler[3]?

I would really be interested to see an informed answer to this question!


Time series analysis is also an important component of weather and climate research.



