theikkila's comments | Hacker News

Does WarpStream guarantee correct order inside a partition only for messages acknowledged in the same batch, or also across acknowledged messages in different batches? If the latter, how do you keep clocks synchronized between the agents?


(WarpStream founder)

It guarantees correct ordering inside a partition for all acknowledged messages, regardless of which batch they originated from. We don't synchronize clocks; the agents call out to our cloud service, which runs a per-cluster metadata store that assigns offsets to messages at commit time. Effectively, "committing" data involves the following steps:

1. Write a file to S3.
2. "Commit" that file to the metadata store, which assigns the offsets at commit time.
3. Return the assigned offsets to the client.
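
In very rough sketch form, that produce path looks something like this (the names here, such as upload_to_s3 and MetadataStore, are made up for illustration and are not our actual code):

    # Rough sketch of the produce path described above. All names here
    # (upload_to_s3, MetadataStore, ...) are made up for illustration.
    import uuid

    class MetadataStore:
        """Per-cluster store that assigns offsets at commit time."""
        def __init__(self):
            self.next_offset = {}      # (topic, partition) -> next offset to hand out
            self.committed_files = []

        def commit(self, file_id, records_per_partition):
            assigned = {}
            for tp, records in records_per_partition.items():
                start = self.next_offset.get(tp, 0)
                assigned[tp] = list(range(start, start + len(records)))
                self.next_offset[tp] = start + len(records)
            self.committed_files.append(file_id)
            return assigned            # offsets only exist once the commit succeeds

    def upload_to_s3(records_per_partition):
        # Stand-in for the real object-store write; returns the object key.
        return "segment-" + uuid.uuid4().hex

    def produce(metadata_store, records_per_partition):
        file_id = upload_to_s3(records_per_partition)                     # 1. write file to S3
        offsets = metadata_store.commit(file_id, records_per_partition)   # 2. commit, offsets assigned
        return offsets                                                    # 3. return offsets to the client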


Then the order is at the batch level? Say batch1 is committed with a smaller timestamp than batch2; are all messages in batch1 then considered earlier than any message in batch2?


It’s getting late and I’m not 100% sure what you’re asking, but I believe the answer to your question is yes. When you produce a message/batch you get back offsets for each message you produced. If you produce again after receiving that acknowledgement, the next set of offsets you receive is guaranteed to be larger than any of the offsets you received from your previous request.


Related to 1. If I understood correctly, the agent generates a single object per flush interval containing all the data across all topics it has received. Does this mean that when reading, a consumer needs to read data for multiple partitions simultaneously just to access a single partition? And about scaling consumers horizontally: how does the WarpStream Agent handle horizontal partitioning of the stream on the consuming side?


[WarpStream co-founder here]

That is correct about flushing. RE: consuming, the TL;DR is that the agents in an availability zone cluster with each other to form a distributed file cache, such that no matter how many consumers you attach to a topic, you will almost never pay for more than 1 GET request per 4MiB of data, per zone. Basically when a consumer fetches a block of data for a single partition, that will trigger an "over read" of up to 4MiB of data that is then cached for subsequent requests. This cache is "smart" and will deduplicate all concurrent requests for the same 4MiB blocks across all agents within an AZ.

It's a bit difficult to explain succinctly in an HN comment, but hopefully that helps.
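
In toy sketch form, the deduplicating block cache idea looks something like the following (illustration only, not our actual implementation):

    # Toy sketch of the per-zone block cache idea: reads are rounded up to
    # whole 4 MiB blocks, and concurrent requests for the same block share a
    # single GET.
    import threading

    BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB

    class BlockCache:
        def __init__(self, fetch_block):
            self.fetch_block = fetch_block  # fetch_block(key, offset, length) -> bytes (one GET)
            self.blocks = {}                # (object_key, block_index) -> bytes
            self.inflight = {}              # (object_key, block_index) -> threading.Event
            self.lock = threading.Lock()

        def read(self, object_key, offset, length):
            """Serve a small range by over-reading and caching whole blocks."""
            first = offset // BLOCK_SIZE
            last = (offset + length - 1) // BLOCK_SIZE
            data = b"".join(self._get_block(object_key, i) for i in range(first, last + 1))
            start = offset - first * BLOCK_SIZE
            return data[start:start + length]

        def _get_block(self, object_key, index):
            key = (object_key, index)
            with self.lock:
                if key in self.blocks:
                    return self.blocks[key]          # cache hit, no GET at all
                event = self.inflight.get(key)
                owner = event is None
                if owner:
                    event = self.inflight[key] = threading.Event()
            if owner:
                data = self.fetch_block(object_key, index * BLOCK_SIZE, BLOCK_SIZE)
                with self.lock:
                    self.blocks[key] = data
                    del self.inflight[key]
                event.set()
            else:
                event.wait()                          # piggyback on the in-flight GET
            return self.blocks[key]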


Is there a reason you built that cache layer yourself (rather than each node "just" running its own sidecar MinIO instance that writes through to the origin object store)?


(WarpStream co-founder)

The cache is for reads, not writes. There is no cache for writes.

We built our own because it needed to behave in a very specific way to meet our cost/latency goals. Running a MinIO sidecar instance would mean that every agent effectively has to download every file in its entirety, which would not scale well and would be expensive. We also have a pretty hard and fast rule about keeping the deployment of WarpStream as simple as rolling out a single stateless binary.


I somehow have the feeling that this does not answer the original question. Could you give some examples of tasks this makes possible or easy that other solutions don't?


Take the recently popular actor model as an example. Each player is an actor, which not only keeps development simple but also guarantees the atomicity of a single player's data within one operation. But if you design a player-to-player transaction system, it is difficult for this actor model to make that transaction atomic.

But Lockval Engine can do all of this easily, because Lockval Engine is data-oriented. It provides a pair of APIs, GetAndLock and PutAndUnlock: you lock the data when you fetch it and unlock it when you have completed your modification.

And this pair of APIs will also synchronize the modified data to the front end. That way you don't need to worry about how to send data to the front end when designing features. Other frameworks or codebases have to implement this themselves.
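
For example, a player-to-player trade on top of this pair of APIs could be sketched roughly like this (the signatures below are simplified for illustration and are not the exact API):

    # Simplified sketch of an atomic player-to-player trade built on a
    # GetAndLock/PutAndUnlock pair. Signatures are illustrative only.
    def transfer_item(store, from_uid, to_uid, item, price):
        # Lock both players' data up front; in a real system you would lock
        # in a consistent order (e.g. sorted by UID) to avoid deadlocks.
        seller = store.get_and_lock(from_uid)
        buyer = store.get_and_lock(to_uid)
        try:
            if item not in seller["items"] or buyer["gold"] < price:
                return False                 # nothing modified, locks still released
            seller["items"].remove(item)
            buyer["items"].append(item)
            buyer["gold"] -= price
            seller["gold"] += price
            return True
        finally:
            # Writing back unlocks, and (as described above) the engine can
            # push the modified data to the front end at this point.
            store.put_and_unlock(to_uid, buyer)
            store.put_and_unlock(from_uid, seller)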

Other advantages include, but are not limited to, atomic hot updates of code and configuration in this distributed architecture: it will never happen that half the system is running an old script and half is running a new one.


Can you elaborate with an example of games that have some form of "fog of war"? Or rather, where only some state is seen by certain players?


In Lockval Engine, a player's data is keyed by a UID. A map block's data can also be mapped to a UID, and an observation point's data can likewise correspond to a UID.

Lockval Engine provides a Watch function. Players can watch the UID of a certain map block or the UID of a certain observation point, so that something like fog of war can be implemented.

A demo of this can be viewed under "Prepare a globalChat struct" and "send A message to globalChat" on the apidemo page. That demo can be evolved into a variety of similar features.
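
As a rough sketch of the idea (the names below are invented for illustration, not the engine's real API): each client watches only the UIDs of the map blocks it can currently see, so updates to unseen blocks are never pushed to it.

    # Rough illustration of watching map-block UIDs for a fog-of-war view.
    class WatchHub:
        def __init__(self):
            self.watchers = {}                    # uid -> set of callbacks

        def watch(self, uid, callback):
            self.watchers.setdefault(uid, set()).add(callback)

        def unwatch(self, uid, callback):
            self.watchers.get(uid, set()).discard(callback)

        def publish(self, uid, state):
            # Called after data under `uid` is modified; only watchers of that
            # UID (players who can "see" that map block) receive the update.
            for callback in self.watchers.get(uid, ()):
                callback(uid, state)

    # When a player moves, they simply unwatch the old block's UID and watch
    # the new one, so hidden areas are never sent to them.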


wasn't that 4k requests per minute?


I'm a little bit troubled by the model created for classifying the vein patterns. With a corpus of only 40 samples and two classes, training without augmentation will most likely end up overfitting the model. I would say the model is currently learning to classify left vs. right hand and doesn't really care about the veins much.

Have you tried how it performs with some other user?

I would also probably use data augmentation, e.g. flipping and rotating the images, varying contrast, etc. That might prevent some amount of overfitting.
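
Something along these lines with tf.image (TF 2.x style) would do it on the fly; the ranges are just placeholders to tune for vein images:

    # Minimal on-the-fly augmentation sketch with tf.image; the ranges here
    # are placeholders, not tuned values.
    import tensorflow as tf

    def augment(image, label):
        image = tf.image.random_flip_left_right(image)
        image = tf.image.random_flip_up_down(image)
        image = tf.image.rot90(image, k=tf.random.uniform([], 0, 4, dtype=tf.int32))
        image = tf.image.random_contrast(image, lower=0.7, upper=1.3)
        image = tf.image.random_brightness(image, max_delta=0.2)
        return image, label

    # Assuming `dataset` is a tf.data.Dataset of (image, label) pairs:
    # dataset = dataset.map(augment).shuffle(64).batch(8)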

Classification models are usually not very well suited to these kinds of problems. Basically, a neural network classifier partitions the whole input space into the classes, so you can expect that there is a practically unlimited number of different patterns the model considers equal to your hand (the class you have trained to be 'your hand').

For a better model you need more data. It can be labeled, of course, but there are also unsupervised options you could consider, such as autoencoders. For facial recognition, Siamese networks and triplet-loss-based networks are pretty popular, and you could maybe take a look at them.
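
The core of the triplet loss itself is small enough to sketch: pull an anchor embedding towards a positive (same hand) and push it away from a negative (someone else's hand) by at least a margin. A generic TensorFlow version:

    # Standard triplet loss over batches of embeddings; margin is a hyperparameter.
    import tensorflow as tf

    def triplet_loss(anchor, positive, negative, margin=0.2):
        pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
        neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
        # Only penalize triplets where the negative is not at least
        # `margin` farther away than the positive.
        return tf.reduce_mean(tf.maximum(pos_dist - neg_dist + margin, 0.0))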


Not at all. Deep nets are difficult to train, and they need lots of processing before they learn the same kind of classifying features that e.g. an SVM has. So yes, you can simulate an SVM with a deep network, but usually it's not a very good solution.

An SVM can also be used as part of a neural network, for example as the classification layer.
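
In practice that usually means taking the penultimate-layer activations as features and fitting the SVM on those, e.g. with scikit-learn (the random `features` array below is just a stand-in for activations extracted from the net):

    # Fitting an SVM on features extracted from a network's penultimate layer.
    # The random data is a placeholder for real extracted activations.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    features = np.random.rand(40, 2048)        # stand-in for extracted activations
    labels = np.random.randint(0, 2, size=40)  # stand-in for class labels

    X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2)
    clf = SVC(kernel="rbf", C=1.0)
    clf.fit(X_train, y_train)
    print("accuracy:", clf.score(X_test, y_test))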


That sounds interesting; it's funny if you really can use their own models as a base and do that. For the platform's sake, Google also offers a SaaS where you can train and evaluate your own models, but there the base model is something you have to provide yourself.

EDIT: I tried to google that but couldn't find anything. Could you provide a link for that?



Hmm ok, yeah, apparently you can leverage their APIs and teach new labels with your own data.

https://azure.microsoft.com/en-us/services/cognitive-service...

That is great and will definitely help with problems where the task isn't just recognizing cats and dogs. The only downside is that you are giving your data away, and it will also help your competitors.


I wanted to try out different types of final layers (SVM, logistic regression), so the direct example didn't suit that kind of experimentation very well. For starters it's good, though.


Hello author of the article here!

I will probably publish some code samples at some point (when I'm not so busy with other projects), but until then I can give you some tips.

You should start by inspecting how the pre-trained models (Inception v3/v4) work and what kinds of layers they have, and then decide which layers you want to use and which not. In the case of TensorFlow, TensorBoard is a very good tool for inspecting a model's inner layers.

If you want an even easier start, you should probably take a look at the TensorFlow Slim models (https://github.com/tensorflow/models/tree/master/slim)

They have quite beginner-friendly instructions for simple fine-tuning of the models, and those should take you pretty far.
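
If you prefer something higher level than Slim, the same fine-tuning pattern in Keras looks roughly like this (an illustrative sketch, not the article's code): load Inception v3 without its top, freeze the convolutional base, and train a new head for your own classes.

    # Transfer-learning sketch with Keras: reuse Inception v3's convolutional
    # base and train only a new classification head. Illustrative values only.
    import tensorflow as tf

    base = tf.keras.applications.InceptionV3(weights="imagenet", include_top=False, pooling="avg")
    base.trainable = False                     # freeze the pre-trained layers

    num_classes = 2                            # placeholder for your own label count
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(num_classes, activation="softmax"),   # new head
    ])

    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    # model.fit(train_dataset, validation_data=val_dataset, epochs=10)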


I find this interesting to use as a design pattern even in common web applications. It's usually easy to just dispatch events, but parsing them is always complicated, and if you need to do that in every client you're essentially decoupling the logic. If you can provide already-parsed state, you are providing the view into that data.
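
A toy version of the difference (with hypothetical event and state shapes):

    # Toy illustration: instead of broadcasting a raw event that every client
    # has to interpret, apply it once and push the derived state.
    raw_event = {"type": "item_purchased", "user": 42, "item": "sword", "price": 10}

    def apply_event(state, event):
        state = dict(state)
        if event["type"] == "item_purchased":
            state["gold"] -= event["price"]
            state["inventory"] = state["inventory"] + [event["item"]]
        return state

    current = {"gold": 100, "inventory": []}
    current = apply_event(current, raw_event)
    # Push `current` (the already-parsed view) to clients instead of `raw_event`.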

