A data streaming engine in C/C++ would be very interesting. We have started in Java, but there might be parts it would make sense to write in C++.
The challenge is not for others to write a data streaming engine and give it to us. The challenge is for us to write a data streaming engine and give it to you! ... but if you want to try to beat the 1BRS milestone yourself - that would be fun too :-)
I think you misunderstood. WE will implement this data streaming engine and give it away as open source! ... so WE get the bragging rights, and you get the data streaming engine.
What we mean by that rule (general purpose data streaming engine) is just that the product must be usable for use cases other than this challenge. You can write your own data streaming engine, but it should be able to handle a wide variety of use cases, not just this one.
For instance, simply writing a program that loads 1 billion bytes into memory, iterates over them and sums them would not count as a "general purpose data streaming engine". But you don't have to use Spark, Kafka or anything like that. You can write your own.
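To make that concrete, here is roughly the kind of single-purpose program we mean - a minimal Java sketch (the class name and file name are made up):

```java
import java.nio.file.Files;
import java.nio.file.Path;

// A deliberately trivial program that would NOT qualify:
// it solves exactly one hard-coded task and nothing else.
public class NotAStreamingEngine {
    public static void main(String[] args) throws Exception {
        // Load 1 billion bytes into memory...
        byte[] data = Files.readAllBytes(Path.of("records.bin"));
        long sum = 0;
        for (byte b : data) {
            sum += b; // ...iterate and sum. Fast, but useful for nothing else.
        }
        System.out.println(sum);
    }
}
```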
Hi Varrakesh, the reason it is not "well specified" is that all of your suggestions are interesting to try out and benchmark. Rather than saying "it has to be exactly like this", we have left it more open ended by asking "what would it take to get to 1 billion records per second?".
The answer might be different on different hardware, with different data sets, and with different kinds of data set sculpting. Yes, it is okay to have one benchmark with no more than e.g. 255 products or 255 customers, but then we should probably also benchmark with e.g. up to 65,536 products and 65,536 customers, and up. Part of achieving high performance data streaming is the ability to make your data small.
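As one small illustration of "making your data small" (a sketch with names of our own, not code from any actual engine): with at most 255 distinct products, a product ID fits in a single byte instead of a 4-byte int, cutting the stream size of that field by 75%.

```java
import java.nio.ByteBuffer;

public class CompactIds {
    // At most 255 distinct products: the ID fits in 1 unsigned byte.
    static void writeByteId(ByteBuffer buf, int productId) {
        buf.put((byte) productId);
    }

    // Up to 65,536 distinct products: the ID needs 2 bytes.
    static void writeShortId(ByteBuffer buf, int productId) {
        buf.putShort((short) productId);
    }

    // Reading the 1-byte form back as an unsigned value.
    static int readByteId(ByteBuffer buf) {
        return buf.get() & 0xFF;
    }
}
```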
It would also be okay to use a GPU - although we have no plans to do that (yet). Still, it would be very interesting to see what kind of results you could get with that design.
We just have the requirement that the data streaming engine must not be designed exclusively for this challenge. It must be a reasonably functional, general purpose data streaming engine.
By the way, we hope to reach the 1 BRS milestone on a single server with an i7-6700 quad-core Skylake CPU and 2 NVMe SSDs mounted in RAID 1. 1 GB of memory should be enough to run the benchmark app, but the server will probably have 64 GB by default.
The main differences between Boards and other web browsers will not really become clear until a few releases from now. The current version just shows the core principles.
Boards is not designed to replace a standard web browser, but to supplement it for the use cases traditional web browsers do not support very well.
Eventually, we will not even classify Boards as a web browser, but as an alternative Internet Client - targeting the Internet of Data, Internet of Services, Internet of Things etc. However, we have to start somewhere, and this is our embarrassingly simplistic and not-so-pretty MVP :-)
You are right, the term "self-describing" as used in our docs could be clearer. Being self-describing means that you do not need a schema to make sense of a stream of data in that format.
However, there are also degrees to which a data format can be self-describing. A CSV file is reasonably self-describing because you can see where one field ends and the next begins (at the comma or other separator), and where one record ends and the next begins (the new line). With a header line of column names, a CSV file becomes more self-describing, as you now also have a name indicating the semantic meaning of the fields in each column. If a CSV file could somehow contain a specification of the data type of each column, it would be even more self-describing, and so on.
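To make those degrees concrete, here is a tiny made-up CSV fragment at each level (the typed header in the last variant is hypothetical - plain CSV has no such feature):

```
42,Alice,99.5                      <- no header: fields and records are visible, but unnamed
id,name,score                      <- header line: fields now have semantic names
42,Alice,99.5
id:int,name:string,score:float    <- hypothetical typed header: data types are visible too
42,Alice,99.5
```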
This is what we are trying to achieve with ION. If you need speed, you can omit most of the metadata, like property names. If you need messages to be self-describing, you can add a lot of metadata (like class / schema names + versions, property names etc.).
I apologize for having written incorrect documentation. Then again, if you wrote those docs for Google Protocol Buffers, part of that is on you. They are not exactly crystal clear ;-) (our docs aren't either - still working on them!)
Thank you for clarifying that Protobuf fields can be distinguished in a stream even without a schema. That was unclear to me until now. By the way, that is pretty clear in Cap'n Proto - your invention, right? So - better docs already!
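For anyone else following along: the reason this works is that in the standard Protobuf wire format, every field is prefixed with a varint key = (field number << 3) | wire type, and the wire type alone tells you how many bytes the field occupies. A minimal Java sketch of skipping one unknown field (our own names, not production code):

```java
import java.nio.ByteBuffer;

public class ProtoSkip {
    // Read one base-128 varint (each byte carries 7 bits; high bit = "more bytes follow").
    static int readVarint(ByteBuffer buf) {
        int result = 0, shift = 0;
        byte b;
        do {
            b = buf.get();
            result |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return result;
    }

    // Skip one field without knowing the schema: the wire type in the
    // low 3 bits of the key is enough to know the field's size.
    static void skipField(ByteBuffer buf) {
        int key = readVarint(buf);
        int wireType = key & 0x07;
        switch (wireType) {
            case 0 -> readVarint(buf);                                 // varint
            case 1 -> buf.position(buf.position() + 8);                // fixed 64-bit
            case 2 -> buf.position(buf.position() + readVarint(buf));  // length-delimited
            case 5 -> buf.position(buf.position() + 4);                // fixed 32-bit
            default -> throw new IllegalArgumentException("unsupported wire type " + wireType);
        }
    }
}
```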
And - thank you for clearing up the difference in the encoding of Cap'n Proto. Any link to where I can read about that encoding style in more detail?
The format is, of course, a lot like how in-memory data structures are laid out in C (fields of a struct have fixed offsets; variable-size fields are behind pointers). Unlike native pointers, though, Cap'n Proto's pointers are designed to be relocatable and easy to bounds-check, and they contain just enough type information for the message to be minimally self-describing (so that you can e.g. make a copy of a particular sub-object without knowing its schema).
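To illustrate the general idea (this is NOT Cap'n Proto's actual pointer encoding - just a toy Java sketch of fixed field offsets plus a relocatable, bounds-checkable offset to variable-size data):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class FixedOffsetRecord {
    // Toy layout: [0..3] int id | [4..7] offset of name data | [offset..] name length + bytes.
    static ByteBuffer write(int id, String name) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(8 + 4 + nameBytes.length);
        buf.putInt(0, id);               // fixed-size field at a fixed offset
        buf.putInt(4, 8);                // "pointer": name data starts at offset 8
        buf.putInt(8, nameBytes.length); // length prefix for the variable-size field
        buf.put(12, nameBytes);          // absolute bulk put (Java 13+)
        return buf;
    }

    static String readName(ByteBuffer buf) {
        int nameOffset = buf.getInt(4);  // follow the offset; trivially bounds-checkable
        int len = buf.getInt(nameOffset);
        byte[] out = new byte[len];
        buf.get(nameOffset + 4, out);    // absolute bulk get (Java 13+)
        return new String(out, StandardCharsets.UTF_8);
    }
}
```

Because the "pointer" is an offset within the buffer rather than a native memory address, the whole record can be copied or moved without rewriting anything - which, as I understand it, is the relocatability property described above.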
Yes, a guy told us that Amazon has an internal data format called ION. We googled for it, but didn't find it, so we assumed Amazon wants to keep it internal.