How do sync engines address issues where we need something to be more dynamic? Currently I'm building a language learning app and we need to display your "learning path" - what lessons you have finished and what your next lessons are. The next lessons aren't fixed/the same for everyone; they change depending on the scores of the completed lessons. Is any query language dynamic enough to support use cases like this? Or is it expected to recalculate the next lessons whenever the user completes a lesson and write them out to a table which can then be queried easily?
Seems like a lot of extra work: if we change the scoring mechanism, we then have to invalidate the existing entries, recalculate, and write everything out again, compared to just having an endpoint that takes all previous lessons and generates the next lessons on demand.
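Roughly what I'm imagining for the "write it to a table" option, as a sketch (all the names here - `LessonResult`, `computeNextLessons`, `db` - are made up, not any particular sync engine's API):

```typescript
// Hypothetical types and storage layer; the point is the shape of the flow,
// not a specific sync engine.
interface LessonResult {
  userId: string;
  lessonId: string;
  score: number; // 0-100
  completedAt: Date;
}

interface NextLesson {
  userId: string;
  lessonId: string;
  reason: string; // why this lesson was recommended
}

// Sequencing logic lives in one place, so a rule change only requires
// re-running this function, not touching the sync layer.
function computeNextLessons(history: LessonResult[]): NextLesson[] {
  const next: NextLesson[] = [];

  // Illustrative rule: re-queue a review lesson for anything scored poorly.
  for (const r of history.filter(r => r.score < 70)) {
    next.push({ userId: r.userId, lessonId: `${r.lessonId}-review`, reason: "low score" });
  }
  // ...plus whatever "advance the path" rules apply.
  return next;
}

// Called from the lesson-completion endpoint. The derived rows land in a
// plain table, so the client/sync engine only ever does ordinary reads.
async function onLessonCompleted(userId: string, db: any) {
  const history: LessonResult[] = await db.lessonResults.where({ userId });
  const next = computeNextLessons(history);
  await db.nextLessons.replaceForUser(userId, next); // delete-and-insert, or upsert
}
```

If the scoring rules change, a backfill job can just re-run `computeNextLessons` for every user; the client never notices because it only reads the derived table.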
I wonder if the reason the models have problems with this is that their tokens aren't the same as our characters. It's like asking someone who can speak English (but doesn't know how to read) how many R's there are in strawberry. They are fluent in English audio tokens, but not written ones.
The way LLMs get it right by counting the letters, then change their answer at the last second, makes me feel like there might be a large amount of text somewhere in the dataset (e.g. a reddit thread) that repeats over and over that there is the wrong number of Rs. We've seen many weird glitches like this before (e.g. a specific reddit username that would crash ChatGPT).
The amazing thing continues to be that they can ever answer these questions correctly.
It's very easy to write a paper in the style of "it is impossible for a bee to fly" for LLMs and spelling. The incompleteness of our understanding of these systems is astonishing.
Is that really true? Like, the data scientists making these tools are not confident why certain patterns are revealing themselves? That’s kind of wild.
Yeah that’s my understanding of the root cause. It can also cause weirdness with numbers because they aren’t tokenized one digit at a time. For good reason, but it still causes some unexpected issues.
I believe DeepSeek models do split numbers up into digits, and this provides a large boost to their ability to do arithmetic. I would hope that it's the standard now.
Could be the case, I'm not familiar with their specific tokenizers. IIRC llama 3 tokenizes in chunks of three digits. That seems better than arbitrarily sized chunks with BPE, but still kind of odd. The embedding layer has to learn the semantics of 1000 different number tokens, some of which overlap in meaning in some cases and not in others, e.g. 001 vs 1.
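A toy illustration of what I mean (not any real tokenizer - real ones are BPE-based, and whether the chunking runs left-to-right or right-to-left differs between models; this groups from the right like thousands separators):

```typescript
// Toy version of "tokenize numbers in chunks of up to three digits".
function chunkDigits(n: string): string[] {
  const chunks: string[] = [];
  // Group from the right so place value lines up with the chunks.
  for (let end = n.length; end > 0; end -= 3) {
    chunks.unshift(n.slice(Math.max(0, end - 3), end));
  }
  return chunks;
}

console.log(chunkDigits("1"));       // ["1"]
console.log(chunkDigits("1001"));    // ["1", "001"]
console.log(chunkDigits("1000000")); // ["1", "000", "000"]
console.log(chunkDigits("123456"));  // ["123", "456"]
```

The embedding layer has to learn that the "1" in ["1", "001"] and the "001" chunk can both stand for the quantity one, which is the kind of overlap I'm talking about.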
Would be interesting, considering TigerBeetle is written in Zig, and Uber is probably one of the rare big companies that has a support contract with the Zig Foundation.
Only Uber can present cross-compilation as if that hasn't been done for decades. I think the newer generation is just more stupid. Must be the plastic in the water supply.
A very much needed feature. Had a nightmare scenario at my previous startup where Google Cloud just killed all our servers and yanked our access. We got access back in an hour or so, but we had to recreate all the servers. At that point we were taking Postgres base backups (to Google Cloud Storage) daily at 2:30 AM. The incident happened at around 15:00, so we had to replay the WAL for a period of about 12.5 hours. That was the slowest part and it took about 6-7 hours to get the DB back up. After that incident we started taking base backups every 6 hours.
That's incorrect. You definitely do want backups in the same location as production if possible to enable rapid restore. You just don't want that to be your only copy.
The canonical strategy is the 3-2-1 rule: three copies, two different media, one offsite; but there are variations, so I'd consider this the minimum.
Historically tape, but in practice these days it means "not on the same storage as your production data". For example: in addition to a snapshot on your production system (rapid point-in-time recovery if the data is hosed), keep a local copy on deduplicated storage (recovery if the production volume is hosed) and an offsite copy derived from replicated deltas (disaster recovery if your site is hosed).
The same principle can be applied to cloud hosted workloads.
Backups on a pgbackrest node directly next to the postgres cluster. This way, if an application figures a good migration would include TRUNCATE and DROP TABLE or terrible UPDATEs, a restore can be done in some 30-60 minutes for the larger systems.
This dataset is pushed to an archive server at the same hoster. This way, if e.g. all our VMs die because someone made a bad change in terraform, we can relatively quickly restore the pgbackrest dataset from the morning of that day, usually in an hour or two.
And this archive server both mirrors, and is mirrored by, archive servers at entirely different hosters, also geographically far apart. This way, even if a hoster cuts a contract right now without warning, we'd lose at most 24 hours of archives, which can be up to 48 hours of data (excluding things like offsite replication for important data sets).
In the original version that means tape, yes. It's the point most startups skip, but it has some merit. A hacker or smart ransomware might infect all your backup infrastructure, but most attackers can't touch the tapes sitting on a shelf somewhere. Well, unless they just wait until you overwrite them with a newer backup.
Don't forget to test the tapes, ideally in an air-gapped tape drive. One attack scenario I posed in a tabletop exercise was to silently alter the encryption keys on the tape backups, wait a few weeks or months, then zero the encryption keys at the same time the production data was ransomed. If the tape testing is done on the same servers where the backups are taken, you might never notice your keys have been altered.
(The particular Customer I was working with went so far as to send their tapes out to a third party who restored them and verified the output of reports to match production. It was part of a DR contract and was very expensive but, boy, the peace of mind was nice.)
I thought papyrus lasted a really long time, as long as you sealed it in huge stone tombs in the desert.
I think we should build a big library in a lava tube on the Moon to store all the most important data humanity has generated (important works of art and literature, Wikipedia, etc.). That's probably our best hope of really preserving so much knowledge.
Depending on the size of your data corpus a few USB disks w/ full disk encryption could be a cheap insurance policy. Use a rotating pool of disks and make sure only one set is connected at once.
Force the attacker to resort to kinetic means to completely wipe out your data.
Yes, the egress fees on the base backups alone were higher than the cost of the DB VMs. If we had replicated the WAL as well, it would have been way higher. In the post, the example DB was 4.3 GB, but the WAL created was 77 GB.
I have forgotten the exact reason, but it had something to do with not having a valid payment method. Some change on Google Cloud's end triggered it: they were initially billing through the Singapore subsidiary, and when they changed it to the India one, something had to be done on our end. We hardly got any notices, and we had around 100k USD in credits at the time. Got it resolved by reaching out to a high-level executive contact we got via our investor. Their normal support is pretty useless.
I've read about this happening a lot with Google Cloud.
If your payments fail for whatever reason, Google will happily kill your entire account after a few weeks with nothing other than a few email warnings (which obviously routinely get ignored).
You shouldn't because a filesystem snapshot should be equivalent to hard powering off the system. So any crash-safe program should be able to be backed up with just filesystem snapshots.
There will likely be some recovery process after restoring/rolling back, as it is effectively an unclean shutdown, but this is unlikely to be much slower than regular backup restoration.
Nope, CoW is wonderful. Postgres will start up in crash recovery mode if you recover from a snapshot, but as long as you don’t have an insane amount of WAL to chew through, it’s fine.
Don't remember the size, but the disk we were using had the highest IOPS available on Google Cloud. That was one of the reasons why we had to restore from GCS, since these disks don't persist if the VM shuts down. I think they're called Local SSDs [0]. We were aware of this limitation and had 2 standbys in place, but we never considered the situation where Google Cloud would lock us out of our account without any warning.
It has become very practical/doable in the last year or so. In my experience, if you have a lot of frontend web experience, the easiest way to ship a RN app is by using Solito [0]. Also check out Nativewind [1], which allows you to style native apps the same way you would on the web. I was able to ship the first version of our app in about 1.5 weeks with this stack. Also check out Tamagui [2].
Solito seems to let you share component code, but I assume that it does not share data fetching code? It doesn't do SSR for the native app? And in expo you'd be expected to `fetch` or otherwise get the data yourself (compared to next land where you can just use the next data loading stuff)?
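To make the question concrete, the pattern I'd expect is something like a shared hook that both the Next.js page and the Expo screen import (made-up names and endpoint, nothing Solito-specific):

```typescript
import { useEffect, useState } from "react";

// Hypothetical shape and endpoint, just to show the pattern: the hook is
// plain React, so the same file can be used on web and native.
interface Profile {
  id: string;
  name: string;
}

export function useProfile(id: string) {
  const [data, setData] = useState<Profile | null>(null);
  const [error, setError] = useState<Error | null>(null);

  useEffect(() => {
    let cancelled = false;
    fetch(`https://api.example.com/profiles/${id}`)
      .then(res => res.json())
      .then(json => { if (!cancelled) setData(json); })
      .catch(err => { if (!cancelled) setError(err); });
    // Avoid setting state after unmount or after the id changes.
    return () => { cancelled = true; };
  }, [id]);

  return { data, error };
}
```

On web you can still layer Next's data loading (or react-query) on top of this; on native there's no SSR, so the screen just renders a loading state until the hook resolves - which is basically the assumption I'm making above.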