How do sync engines address issues where we need something to be more dynamic? Currently I'm building a language learning app and we need to display your "learning path" - what lessons you have finished and what your next lessons are. The next lessons aren't fixed/the same for everyone; they change depending on the scores of the completed lessons. Is any query language dynamic enough to support use cases like this? Or is it expected to recalculate the next lessons whenever the user completes a lesson and write them out to a table which can then be queried easily?
Seems like a lot of extra work: if we change the scoring mechanism, we then have to invalidate the existing entries, recalculate, and write everything out again, compared to just having an endpoint that takes all previous lessons and generates the next lessons on demand.
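Roughly what I'm imagining for the "write it to a table" option, as a sketch (all the names here - `LessonResult`, `computeNextLessons`, `db` - are made up, not any particular sync engine's API):

```typescript
// Hypothetical types and storage layer; the point is the shape of the flow,
// not a specific sync engine.
interface LessonResult {
  userId: string;
  lessonId: string;
  score: number; // 0-100
  completedAt: Date;
}

interface NextLesson {
  userId: string;
  lessonId: string;
  reason: string; // why this lesson was recommended
}

// Sequencing logic lives in one place, so a rule change only requires
// re-running this function, not touching the sync layer.
function computeNextLessons(history: LessonResult[]): NextLesson[] {
  const next: NextLesson[] = [];

  // Illustrative rule: re-queue a review lesson for anything scored poorly.
  for (const r of history.filter(r => r.score < 70)) {
    next.push({ userId: r.userId, lessonId: `${r.lessonId}-review`, reason: "low score" });
  }
  // ...plus whatever "advance the path" rules apply.
  return next;
}

// Called from the lesson-completion endpoint. The derived rows land in a
// plain table, so the client/sync engine only ever does ordinary reads.
async function onLessonCompleted(userId: string, db: any) {
  const history: LessonResult[] = await db.lessonResults.where({ userId });
  const next = computeNextLessons(history);
  await db.nextLessons.replaceForUser(userId, next); // delete-and-insert, or upsert
}
```

If the scoring rules change, a backfill job can just re-run `computeNextLessons` for every user; the client never notices because it only reads the derived table.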
I wonder if the reason the models have problems with this is that their tokens aren't the same as our characters. It's like asking someone who can speak English (but doesn't know how to read) how many R's there are in strawberry. They are fluent in English audio tokens, but not written ones.
The way LLMs get it right by counting the letters, then change their answer at the last second, makes me feel like there might be a large amount of text somewhere in the dataset (e.g. a reddit thread) that repeats over and over that there is the wrong number of Rs. We've seen many weird glitches like this before (e.g. a specific reddit username that would crash ChatGPT).
The amazing thing continues to be that they can ever answer these questions correctly.
It's very easy to write a paper in the style of "it is impossible for a bee to fly" for LLMs and spelling. The incompleteness of our understanding of these systems is astonishing.
Is that really true? Like, the data scientists making these tools are not confident why certain patterns are revealing themselves? That’s kind of wild.
Yeah that’s my understanding of the root cause. It can also cause weirdness with numbers because they aren’t tokenized one digit at a time. For good reason, but it still causes some unexpected issues.
I believe DeepSeek models do split numbers up into digits, and this provides a large boost to their ability to do arithmetic. I would hope that it's the standard now.
Could be the case, I'm not familiar with their specific tokenizers. IIRC llama 3 tokenizes in chunks of three digits. That seems better than arbitrarily sized chunks with BPE, but still kind of odd. The embedding layer has to learn the semantics of 1000 different number tokens, some of which overlap in meaning in some cases and not in others, e.g. 001 vs 1.
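A toy illustration of what I mean (not any real tokenizer - real ones are BPE-based, and whether the chunking runs left-to-right or right-to-left differs between models; this groups from the right like thousands separators):

```typescript
// Toy version of "tokenize numbers in chunks of up to three digits".
function chunkDigits(n: string): string[] {
  const chunks: string[] = [];
  // Group from the right so place value lines up with the chunks.
  for (let end = n.length; end > 0; end -= 3) {
    chunks.unshift(n.slice(Math.max(0, end - 3), end));
  }
  return chunks;
}

console.log(chunkDigits("1"));       // ["1"]
console.log(chunkDigits("1001"));    // ["1", "001"]
console.log(chunkDigits("1000000")); // ["1", "000", "000"]
console.log(chunkDigits("123456"));  // ["123", "456"]
```

The embedding layer has to learn that the "1" in ["1", "001"] and the "001" chunk can both stand for the quantity one, which is the kind of overlap I'm talking about.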
Would be interesting, considering TigerBeetle is written in Zig, and Uber is probably one of the rare big companies that has a support contract with the Zig Foundation.
Only Uber can present cross-compilation as if that hasn't been done for decades. I think the newer generation is just more stupid. Must be the plastic in the water supply.
A very much needed feature. Had a nightmare scenario at my previous startup where Google Cloud just killed all our servers and yanked our access. We got access back in an hour or so, but we had to recreate all the servers. At that point we were taking Postgres base backups (to Google Cloud Storage) daily at 2:30 AM. The incident happened at around 15:00, so we had to replay the WAL for a period of about 12.5 hours. That was the slowest part and it took about 6-7 hours to get the DB back up. After that incident we started taking base backups every 6 hours.
That's incorrect. You definitely do want backups in the same location as production if possible to enable rapid restore. You just don't want that to be your only copy.
The canonical strategy is the 3-2-1 rule: three copies, two different media, one offsite; but there are variations, so I'd consider this the minimum.
Historically tape, but in practice these days it means "not on the same storage as your production data". For example: in addition to a snapshot on your production system (rapid point-in-time recovery if the data is hosed), keep a local copy on deduplicated storage (recovery if the production volume is hosed) and an offsite copy derived from replicated deltas (disaster recovery if your site is hosed).
The same principle can be applied to cloud hosted workloads.
Backups on a pgbackrest node directly next to the postgres cluster. This way, if an application figures a good migration would include TRUNCATE and DROP TABLE or terrible UPDATEs, a restore can be done in some 30-60 minutes for the larger systems.
This dataset is pushed to an archive server at the same hoster. This way, if e.g. all our VMs die because someone made a bad change in terraform, we can relatively quickly restore the pgbackrest dataset from the morning of that day, usually in an hour or two.
And this archive server both mirrors, and is mirrored by, archive servers at entirely different hosters, also geographically far apart. This way, even if a hoster cuts a contract right now without warning, we'd lose at most 24 hours of archives, which can be up to 48 hours of data (excluding things like offsite replication for important data sets).
In the original version that means tape, yes. It's the point most startups skip, but it has some merit. A hacker or smart ransomware might infect all your backup infrastructure, but most attackers can't touch the tapes sitting on a shelf somewhere. Well, unless they just wait until you overwrite them with a newer backup.
Don't forget to test the tapes, ideally in an air-gapped tape drive. One attack scenario I posed in a tabletop exercise was to silently alter the encryption keys on the tape backups, wait a few weeks or months, then zero the encryption keys at the same time the production data was ransomed. If the tape testing is done on the same servers where the backups are taken, you might never notice your keys have been altered.
(The particular Customer I was working with went so far as to send their tapes out to a third party who restored them and verified the output of reports to match production. It was part of a DR contract and was very expensive but, boy, the peace of mind was nice.)
I thought papyrus lasted a really long time, as long as you sealed it in huge stone tombs in the desert.
I think we should build a big library in a lava tube on the Moon to store all the most important data humanity has generated (important works of art and literature, Wikipedia, etc.). That's probably our best hope of really preserving so much knowledge.
Depending on the size of your data corpus a few USB disks w/ full disk encryption could be a cheap insurance policy. Use a rotating pool of disks and make sure only one set is connected at once.
Force the attacker to resort to kinetic means to completely wipe out your data.
Yes, the egress fees on the base backups alone were higher than the cost of the DB VMs. If we had replicated the WAL as well, it would have been way higher. In the post, the example DB was 4.3 GB, but the WAL created was 77 GB.
I have forgotten the exact reason, but it had something to do with not having a valid payment method. Some change on Google Cloud's end triggered it: they were initially billing through the Singapore subsidiary, and when they changed it to the India one, something had to be done on our end. We hardly got any notices, and we had around 100k USD in credits at the time. Got it resolved by reaching out to a high-level executive contact we got via our investor. Their normal support is pretty useless.
I've read about this happening a lot with Google Cloud.
If your payments fail for whatever reason, Google will happily kill your entire account after a few weeks with nothing other than a few email warnings (which obviously routinely get ignored).
You shouldn't because a filesystem snapshot should be equivalent to hard powering off the system. So any crash-safe program should be able to be backed up with just filesystem snapshots.
There will likely be some recovery process after restoring/rolling back, as it is effectively an unclean shutdown, but this is unlikely to be much slower than regular backup restoration.
Nope, CoW is wonderful. Postgres will start up in crash recovery mode if you recover from a snapshot, but as long as you don’t have an insane amount of WAL to chew through, it’s fine.
Don't remember the size, but the disk we were using had the highest IOPS available on Google Cloud. That was one of the reasons why we had to restore from GCS, since these disks don't persist if the VM shuts down. I think they're called Local SSDs [0]. We were aware of this limitation and had 2 standbys in place, but we never considered the situation where Google Cloud would lock us out of our account without any warning.
It has become very practical/doable in the last year or so. In my experience, if you have a lot of frontend web experience, the easiest way to ship a RN app is by using Solito [0]. Also check out Nativewind [1], which allows you to style native apps the same way you would on the web. I was able to ship the first version of our app in about 1.5 weeks with this stack. Also check out Tamagui [2].
Solito seems to let you share component code, but I assume that it does not share data fetching code? It doesn't do SSR for the native app? And in expo you'd be expected to `fetch` or otherwise get the data yourself (compared to next land where you can just use the next data loading stuff)?
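To make the question concrete, the pattern I'd expect is something like a shared hook that both the Next.js page and the Expo screen import (made-up names and endpoint, nothing Solito-specific):

```typescript
import { useEffect, useState } from "react";

// Hypothetical shape and endpoint, just to show the pattern: the hook is
// plain React, so the same file can be used on web and native.
interface Profile {
  id: string;
  name: string;
}

export function useProfile(id: string) {
  const [data, setData] = useState<Profile | null>(null);
  const [error, setError] = useState<Error | null>(null);

  useEffect(() => {
    let cancelled = false;
    fetch(`https://api.example.com/profiles/${id}`)
      .then(res => res.json())
      .then(json => { if (!cancelled) setData(json); })
      .catch(err => { if (!cancelled) setError(err); });
    // Avoid setting state after unmount or after the id changes.
    return () => { cancelled = true; };
  }, [id]);

  return { data, error };
}
```

On web you can still layer Next's data loading (or react-query) on top of this; on native there's no SSR, so the screen just renders a loading state until the hook resolves - which is basically the assumption I'm making above.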