We reviewed Redis back in 2018 as a potential solution for our use case. In the end, we opted for a less sexy solution (not Redis) that never failed us, no joke.

Our use case: handing out a ticket (something with an identifier) from a campaign's finite pool of tickets. It's akin to Ticketmaster allocating seats in a venue for a concert. Our operation was as you might expect: provide a ticket to a request if one is available, assign some metadata from the request to the allocated ticket, and remove that ticket from consideration for future client requests.

We had failed campaigns in the past (over-allocation, under-allocation, duplicate allocation, etc.), so our concern was accuracy. Clients would connect and request a ticket; we wanted to distribute only the tickets available in the pool, each exactly once. If the number of client requests exceeded the number of tickets, the system should protect against that.

We tried Redis, including the naive implementation: get the lock, check the lock, do our thing, release the lock. It was OK, but the administrative overhead was a lot for us at the time. I'm glad we didn't go that route, though.
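For context, the naive pattern looks roughly like this (a sketch with redis-py; the key name, token, and timeout are invented, not our actual code):

  import redis

  r = redis.Redis()

  # Acquire: set the key only if it doesn't exist, with an expiry so a
  # crashed client can't hold the lock forever.
  if r.set("campaign:123:lock", "worker-1", nx=True, px=5000):
      try:
          pass  # allocate the ticket here
      finally:
          # Naive release; a safer release compares the token in a Lua
          # script before deleting, so you never drop someone else's lock.
          r.delete("campaign:123:lock")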

We ultimately settled on...Postgres. Our "distributed lock" was just a composite UPDATE statement using some Postgres-specific features. We effectively turned requests into a SET operation, where the database would return either a record that indicated the request was successful, or something that indicated it failed. ACID transactions for the win!
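The shape was roughly this (a sketch, not the production statement; table/column names are invented here, and FOR UPDATE SKIP LOCKED is my stand-in for the Postgres-specific features):

  import psycopg2

  # One statement claims a ticket and returns it atomically. If no
  # ticket is free, the subquery finds no row and the UPDATE touches
  # nothing, so the caller sees "failed" instead of an over-allocation.
  CLAIM_SQL = """
      UPDATE tickets
         SET claimed_by = %s, claimed_at = now()
       WHERE id = (SELECT id FROM tickets
                    WHERE campaign_id = %s AND claimed_by IS NULL
                    LIMIT 1
                    FOR UPDATE SKIP LOCKED)
      RETURNING id
  """

  conn = psycopg2.connect("dbname=tickets")
  with conn, conn.cursor() as cur:  # commits on success, rolls back on error
      cur.execute(CLAIM_SQL, ("user-42", 123))
      row = cur.fetchone()
      print(row[0] if row else "sold out")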

With accuracy solved, we next looked at scale/performance. We didn't need to support millions of requests/sec, but we did have spiky traffic to absorb. We were able to optimize read/write db instances within our cluster and strategically place larger/higher-demand campaigns on dedicated systems. We continued optimizing over the next two years, but not once did we have a campaign with ticket-distribution failures.

Note: I am not an expert of any kind in distributed-lock technology. I'm just someone who did their homework, focused on the problem to be solved, and found a solution after trying a few things.


You are right that anything that needs up to 50000 atomic, short-lived transactions per second can just use Postgres.

Your UPDATE transaction lasts microseconds to a few milliseconds, so you can just centralise the problem, which is good because it's simpler, faster and safer.

But this is not a _distributed_ problem, as the article explains:

> remember that a lock in a distributed system is not like a mutex in a multi-threaded application. It’s a more complicated beast, due to the problem that different nodes and the network can all fail independently in various ways

You need distributed locking if the transactions can take seconds or hours, and the machines involved can fail while they hold the lock.


You could just have multiple clients attempt to update a row that defines the lock. Postgres transactions have no built-in time limit and will unwind (roll back) on client failure. Since connections are persistent, there's no need to play games to determine the state of a client.
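A sketch of that pattern (hypothetical table; the point is that the row lock is held until commit or rollback):

  import psycopg2

  conn = psycopg2.connect("dbname=app")
  with conn, conn.cursor() as cur:
      # The first client to get here wins the row lock; everyone else
      # blocks on this UPDATE until the winner commits or rolls back.
      cur.execute("UPDATE locks SET holder = %s, taken_at = now() "
                  "WHERE name = %s", ("worker-1", "campaign-123"))
      ...  # do the protected work inside the same transaction
  # Exiting the block commits. If the client dies, Postgres notices the
  # dropped connection, rolls back, and the lock frees itself.

(Session-level advisory locks, pg_advisory_lock, are the other built-in route and release on disconnect the same way.)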


Your scenario still uses a centralised single postgres server. Failure of that server takes down the whole locking functionality. That's not what people usually mean by "distributed".

"the machines involved can fail" must also include the postgres machines.

To get that, you need to coordinate multiple postgres servers, e.g. using ... distributed locking. Postgres does not provide that out of the box -- neither multi-master setups, nor master-standby synchronous replication with automatic failover. Wrapper software that provides that, such as Stolon and Patroni, use distributed KV stores / lock managers such as etcd and Consul to provide it.
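For a taste of what that building block looks like, here's an etcd lease-backed lock via the python-etcd3 client (a sketch; the host and lock name are made up):

  import etcd3

  etcd = etcd3.client(host="etcd1", port=2379)

  # The lock is tied to a lease with a TTL, so if the holder dies the
  # lock expires instead of being held forever -- exactly the failure
  # mode a single Postgres server can't handle by itself.
  with etcd.lock("pg-failover-leader", ttl=10):
      pass  # act as leader / drive the failover here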


> up to 50000 atomic, short-lived transactions per second

50000?

> You need distributed locking if the transactions can take seconds or hours, and the machines involved can fail while they hold the lock.

From my experience, locks are needed to ensure synchronized access to resources. Distributed locks are a form of that isolation being held across computing processes, as opposed to the mutex example provided.

And while our implementation definitely did not use a distributed lock, we could still see those machines fail.

I fail to understand why a distributed lock would be needed for anything simply because of its duration.


Mostly guessing, but: duration is usually inversely correlated with throughput.

If you require high throughput and have a high duration then partitioning/distribution are the normal solution.


I think this illustrates something important: you don't need locking. You need <some high-level business constraint that might or might not require some form of locking>.

In your case, the constraint is "don't sell more than N tickets". For most realistic traffic volumes for that kind of problem, you can solve it with traditional RDBMS transactional behavior and let the database manage whatever locking it uses internally.
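For the ticket case specifically, a single guarded UPDATE is often the whole story (a sketch; the sold/capacity schema is hypothetical):

  import psycopg2

  conn = psycopg2.connect("dbname=tickets")
  with conn, conn.cursor() as cur:
      # The WHERE clause *is* the business constraint. Concurrent
      # requests serialize on the row lock, so at most `capacity`
      # of them ever match.
      cur.execute("""
          UPDATE campaigns
             SET sold = sold + 1
           WHERE id = %s AND sold < capacity
          RETURNING capacity - sold AS remaining
      """, (123,))
      row = cur.fetchone()
      print(("%d remaining" % row[0]) if row else "sold out")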

I wish developers were a lot slower to reach for "I'll build distributed locks". There's almost always a better answer, but it's specific to each application.


This is exactly how we arrived at our solution. We needed to satisfy the constraint; locking was one means of addressing the constraint.

Maybe we were lucky in our implementation, but a key factor for our decision was understanding how to manage the systems in our environment. We would have skilled up with Redis, but we felt our Postgres solution would be a good first step. We just haven't had a need to go to a second step yet.


So basically your answer (and the correct answer most of the time) was that you don't really need distributed locks even if you think you do :)


Heh, in my local developer community I have a bit of a reputation for being “the guy” to talk to about distributed systems. I’d done a bunch of work in the early days of the horizontal-scaling movement (vs just buying bigger servers) and did an M.Sc focused on distributed systems performance.

Whenever anyone would come and ask for help with a planned distributed system, the first question I would always ask is: does this system actually need to be distributed?! In my 15 years of consulting I think the answer was only actually "yes" 2 or 3 times. Much more often I was helping them solve the performance problems in their single-server system; without doing that they would usually just have ended up with a slow, complex distributed system.

Edit: lol this paper was not popular in the Distributed Systems Group at my school: https://www.usenix.org/system/files/conference/hotos15/hotos...

“You can have a second computer once you’ve shown you know how to use the first one.”


I wanted to post the same paper. With Adrian Colyer’s explanations: https://blog.acolyer.org/2015/06/05/scalability-but-at-what-...


I guess this is embarrassingly parallelizable, in that you can shard by concert to different instances. Might even be a job for that newfangled Cloudflare SQLite thing.


This is the best way, and actually the only sensible way to approach the problem. I first read about it here https://code.flickr.net/2010/02/08/ticket-servers-distribute...


> only sensible way

That's a bit strong. Like most of engineering, it depends. Postgres is a good solution if you only have maybe 100k QPS, the locks are logically (even if not entirely physically) independent, and they aren't held for long. Break any of those constraints, or add anything weird (inefficient Postgres clients, high DB load, ...), and you have to start exploring either removing those constraints or using other solutions.


Ok, fair; I'm not really talking about Postgres (the link I shared uses MySQL). I'm saying that creating a ticket server that just issues and persists unique tokens is a way to provide coordination between loosely coupled applications.
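Concretely, the scheme in that Flickr post is a tiny MySQL table whose only job is minting IDs. A sketch with PyMySQL (connection details invented; the Tickets64/stub names come from the post itself):

  import pymysql

  conn = pymysql.connect(host="ticketdb", user="app", database="tickets")
  with conn.cursor() as cur:
      # REPLACE INTO on a one-row table bumps the auto-increment counter
      # atomically; lastrowid is the freshly minted ticket.
      cur.execute("REPLACE INTO Tickets64 (stub) VALUES ('a')")
      new_id = cur.lastrowid
  conn.commit()
  print(new_id)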


Yeah that's cookies. They are great.


Interesting. We went through a similar process and ended up with a Yugabyte cluster to deal with the locks.

It's based on Postgres, but the performance was not good enough for us.

We’re now moving to RDMA.


Classic tech interview question


Congrats on writing and completing a book! I was involved in a few myself long ago, when I had the time available to contribute to those endeavors. In a world that often measures "the juice being worth the squeeze", I'm not sure authoring technical manuals would ever meet the criteria.

One of my personal photos I keep around was taken long ago in what was then the biggest bricks-and-mortar bookseller. I was looking at the selection of books on Java available at the time. O'Reilly was the dominant publisher, and thus had several offerings on the wall. Most of the books were at least 2 inches thick. (If you were ever involved with writing a technical book in the early 2000s, you'll understand: publisher metrics at the time were based on the width of the spine on the shelf.)

Among the many Java manuals of significant girth was a small, THIN book with the title "Java -- the Good Parts". :-{}


> The vulnerability was addressed with the release of Docker Engine v18.09.1, but it was not included in subsequent major versions, causing a regression.

Without further information, this sounds like a fix that shipped from a hotfix branch but was never merged back into the mainline.

Surely it's not that simple?


> Resume fraud is rampant.

So is interview fraud. The remote-interviewee-answers-questions-while-her-face-reflects-windows-popping-up-on-her-screen routine is tiring at this point. So I decided to find a way to tell whether someone was being fed answers in a tech interview.

Behold, the low-tech whiteboard. Also known as a piece of paper and a pencil. With the candidates I've run into that do not pass the "smell" test -- where I think they are being fed answers -- I ask them to draw some things, on paper. It's not a true validation, but it gives me something of a clue.

I ask for a simple diagram. Different services in a network, for example. Or a mini-architecture. For their level, I'll ask for something that should be drop-dead easy.

I ask them to show me their drawing.

The responses I've received run the gamut, from "I don't know" (after 5 seconds of deliberation) to "I don't understand the purpose" (after 5 minutes of silence) to "I need to shut off my screen for a while" (while refusing to explain why) to "it depends if your cloud is AWS" (not in any way remotely related to the question). I did have one candidate follow up with a series of questions about the drawing, which were feasibly legitimate.

This hand-written diagram is not an absolute filter (I've only used it maybe four times), but rather it can confirm some suspicions. I think I can generally gauge honesty from questions/tasks like this. And that's really what I'm after -- are you being honest with me?

It's imperfect, but it has been helpful.


Maybe easier: just ask that they show their hands while you ask a short question, until they give the answer. You could even be up front about it and say you suspect they're looking up answers, since it's not like you care much if they get upset at a false suspicion; or just say "to avoid looking up answers, our standard procedure involves this".

The drawing approach also sounds like a good idea, though software will no doubt evolve to draw answers graphically for the candidate to copy down. With the candidate unable to type anything into the machine, the only remaining option is someone listening in and feeding the answer on screen. Plausible, but at that level of preparation the helper could just as well prepare to draw things out. Or they type with their feet, but that's also a scenario where I'd be happy to have them come in for a final interview and demonstrate this amazing ability!


A while ago I ran across some team members so bad, I could virtually guarantee they would not have passed even the fizzbuzz phone screens we use before the stricter interview gauntlet. It made me wonder if they got a friend or paid a stand-in to do the interviews. When you think about it, who will check that it's the same person? The only person who might see the candidate in different contexts is the hiring manager, who doesn't do the actual interview.


At all the places I've interviewed, you talked either to the person who was going to be your boss (or team lead or whatever the word is), or at least to someone who would be a direct colleague on a daily basis. If a sibling or cousin could do my voice and mannerisms reasonably (as well as the job I want to get), perhaps that could pass, but otherwise I don't really see this happening.

Hm. Unless the employees don't want to ask because it would be so awkward if they're wrong about the candidate being a different person from who shows up for the job?


ChatGPT can ace any pre-interview screen, sadly. You really need video of the person, with back-and-forth questions, to detect whether they're copying and pasting from AI.

All of this could be mitigated with in person interviews, but I’m forced to hire abroad for cost.


If an interviewer asked me to "show me your hands", I'd laugh in their face and immediately disconnect.


Can I interview at your company? :P

I wish interviews were like this. Instead, most I've found either have you trying to read the interviewer's mind about how to approach a vague situation and answer the way they want, or reimplementing a full library in 30 minutes without any of the resources you'd normally use to look it up, solve it in minutes, and move on.

I wish more took your path and literally just tested for actual industry experience: general architecture, asking questions when the situation is unclear and explaining unexpected/interesting findings from a previous project. And anyway, if they end up actually being a fraud, get rid of them after the initial probation time is up.


Glad to hear it. Whiteboards remain the ultimate interview tool, even remotely.


What I find more important than the timestamp format: the timestamp source.

Centralize where the timestamp is set (the database is a great place to do this). Don't let client code set the timestamp.
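In Postgres terms, that can be as simple as a column default, so client code never supplies the value (a sketch; the table is hypothetical):

  import psycopg2

  conn = psycopg2.connect("dbname=app")
  with conn, conn.cursor() as cur:
      # The database stamps the row; clients have no say in the matter.
      cur.execute("""
          CREATE TABLE IF NOT EXISTS events (
              id         bigserial PRIMARY KEY,
              payload    text,
              created_at timestamptz NOT NULL DEFAULT now()
          )
      """)
      # Inserts simply omit created_at.
      cur.execute("INSERT INTO events (payload) VALUES (%s)", ("hello",))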


You learned some great lessons there, but I would challenge one item early in the "script":

> 2. Verify if it’s a problem from search volume.

It depends on the context, but correlating a problem-to-be-solved with search analytics can be really tenuous. I'd suggest a different phrasing:

  Verify if it's a problem by speaking with customers.

You can still use all the tools, but in the end you want to talk to those you intend to serve. At that point, you'll have zeroed in on the actual problem they may have and are willing to pay you to solve.

Do it better the next time!


I really like your on-device storage approach. I did something similar with an app for my smartwatch a few years ago to prompt/track my gym workouts. At the time and on the particular platform, other apps felt very spammy/scammy, and I just didn't care to go through the hassle of pushing/pulling my data with another application for what seemed like little to no value to me.

I haven't touched it in a while, but I've been thinking of a sync-to-another-device capability that I could then use with a locally-running AI instance to dig into. (Disclaimer: this is purely exploratory on my part; I'm sure someone will say 'why not do so-and-so'). As always, your mileage may vary. :-)


Way back in the day, I picked up a cheap ($5!!!) copy of Readings in Database Systems from Michael Stonebraker. I found it fascinating to read the original papers that proposed concepts, and then to see those concepts implemented and become the norm.

What I didn't expect was the amount of drama within the context of those papers. In Codd's original paper on relational theory from the late '60s, he spends a chunk of time talking about IBM as "the man". Hilarious.


> concepts implemented and become the norm

I had a coworker some time in the late 90's who was reading the book "Design Patterns". We were working in Java at the time, and he remarked how odd it seemed that a book targeted at C++ programmers used terms like "interfaces", which were "Java concepts".


This is generally good advice with depth, but I would add a disclaimer: organizational practices and idioms should be taken into account. A few examples where these points would need some adjustment in my org:

We're not crazy about irregular cadences for project-update communications. Why? Because many projects of complex size and shape are going on at once. The stakeholders of my project care about many of those other projects as well, and they plan their schedules expecting to consume your update at a set time.

Another example of how we operate: the unpleasant surprise. Don't sugarcoat it; bring it to attention as soon as possible. Rip the bandaid off, as we say.

As for tone and content, we focus on the project's purpose and on delivering against that purpose. This is established early in the project, and our stakeholders understand both before the project ever takes off. We anchor our updates on those aspects, which gives stakeholders an understanding of progress no matter what the update includes.


As April Fools jokes go, this is really well done.

On one hand, this could easily be read as the musings of some overzealous, re-awakened and re-charged techbro/middle-manager who has tried and failed a few times. But this time, things are gonna be different....

But also, there's a modicum of truth in there. Only the experienced will recognize the red flags buried in the brush-away commentary.

