Question: with the Yelp API now being paywalled (and obviously you'd have to write code to handle scraping), can you legally scrape Yelp to do this?
Does the ruling in the LinkedIn scraping lawsuit permit scraping Yelp?
> Scanning and updating every value because you need to update some subset sucks.
Mirrors my experience exactly. Querying JSON to get info out of the DB can get complex. SQLite is kind of forgiving because sequences of queries (I mean query, modify in application code that fully supports JSON, i.e. JS, then query again) are less painful, meaning it's less important to do everything in the database for performance reasons. But if you're trying to do everything in one query, I think you pay for it at application-writing time over and over.
> These days, my go-to solution is SQLite with two fields (well, three, if you count the implicit ROWID, which is invaluable for paging!): ID and Data, the latter being a JSONB blob.
Really!? Are you building applications by chance, or something else? Are you doing raw SQL mostly, or an ORM/ORM-like library? This surprises me because my experience dabbling in JSON fields for CRUD apps has been mostly trouble stemming from the lack of typechecks. SQLite's fluid type system has been a nice middle ground for me personally. For reference, my application layer is Kysely/TypeScript.
> my experience dabbling in json fields for CRUD apps has been mostly trouble stemming from the lack of typechecks
Well, you move the type checks from the database to the app, effectively, which is not a new idea by any means (and a bad idea in many cases), but with JSON, it can actually work out nicely-ish, as long as there are no significant relationships between tables.
Practical example: I recently wrote my own SMTP server (bad idea!), mostly to be able to control spam (even worse idea! don't listen to me!). Initially, I thought I would be really interested in remote IPs, reverse DNS domains, and whatever was claimed in the (E)HLO.
So, I designed my initial database around those concepts. Turns out, after like half a million session records: I'm much more interested in things like the Azure tenant ID, the Google 'groups' ID, the HTML body tag fingerprint, and other data points.
Fortunately, my session database is just 'JSON(B) in a single table', so I was able to add those additional fields without the need for any migrations. And SQLite's `json_extract` makes adding indexes after-the-fact super-easy.
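Roughly like this, I'd guess — a minimal sketch with made-up table and field names (`sessions`, `data`, the tenant-ID path), since SQLite can index an expression directly:

```sql
-- Table name and JSON path are invented; the point is that SQLite allows
-- indexes on expressions, so a JSON field can get an index long after the fact.
CREATE INDEX IF NOT EXISTS idx_sessions_tenant
    ON sessions (json_extract(data, '$.azure_tenant_id'));

-- Queries that repeat the same expression can then use the index:
SELECT id, data
  FROM sessions
 WHERE json_extract(data, '$.azure_tenant_id') = 'some-tenant-guid';
```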
Of course, these additional fields need to be explicitly nullable, and I need to skip processing based on them if they're absent, but fortunately modern C# makes that easy as well.
And, no, no need for an ORM, except `JsonSerializer.Deserialize<T>`... (And yeah, all of this is just a horrible hack, but one that seems surprisingly resilient so far; YMMV.)
> Fortunately, my session database is just 'JSON(B) in a single table', so I was able to add those additional fields without the need for any migrations. And SQLite's `json_extract` makes adding indexes after-the-fact super-easy.
Our solution for a similar situation involving semi-structured data (in postgres) was to double it up: put all the json we send/receive with a vendor into a json field, then anything we actually need to work on gets extracted into regular table/columns. We get all the safety/performance guarantees the database would normally give us, plus historical data for debugging or to extract into a new column if we now need it. The one thing we had to monitor in code reviews was to never use the json field directly for functionality.
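Concretely, the shape might look something like this (a sketch only — table and field names are invented, not theirs):

```sql
-- The raw vendor payload is kept verbatim in a jsonb column, while the fields
-- the app actually works with live in ordinary typed columns with real constraints.
CREATE TABLE vendor_orders (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    payload     jsonb       NOT NULL,  -- everything sent/received, kept for debugging
    external_id text        NOT NULL UNIQUE,
    status      text        NOT NULL,
    total_cents integer     NOT NULL,
    received_at timestamptz NOT NULL DEFAULT now()
);

-- The working columns are extracted once, at insert time (here in SQL, but it
-- could just as well happen in application code):
INSERT INTO vendor_orders (payload, external_id, status, total_cents)
VALUES ($1::jsonb,
        $1::jsonb ->> 'id',
        $1::jsonb ->> 'status',
        ($1::jsonb ->> 'total_cents')::integer);
```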
This is exactly what I've tried (and failed at) doing! Can I ask how you handle normalization from vendor data when it contains relationships and multilevel nesting? How do you know when to create a new child table, which ones to create, their relationships, etc.? I haven't found a good balance yet.
Basically what the other reply said - handle it the same as you would any complex data. You just don't need to handle all of the json immediately, only the parts you plan on using for the moment.
Odd-shaped miscellaneous data that you only need to retrieve is a good candidate for a JSON field. Once you're heavily using some piece of data, or if you need to index it (which means you are heavily using it), you should insert the data in the database "properly".
If some vendor is giving you a list of categories you don't care about, there's no need to make a vendor categories table and a many-to-many link table until you actually need them.
The point is that putting data properly in the database lets you use database features on it and get database performance.
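For instance, promoting a field later is just an ordinary migration (reusing the hypothetical `vendor_orders` table from the sketch above):

```sql
-- Once 'category' is needed for real queries, give it a real column,
-- backfill it from the blob, and index it; the raw JSON stays around.
ALTER TABLE vendor_orders ADD COLUMN category text;

UPDATE vendor_orders
   SET category = payload ->> 'category'
 WHERE category IS NULL;

CREATE INDEX vendor_orders_category_idx ON vendor_orders (category);
```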
> Well, you move the type checks from the database to the app, effectively, which is not a new idea by any means (and a bad idea in many cases), but with JSON, it can actually work out nicely-ish, as long as there are no significant relationships between tables.
That way you're throwing away 50% of the reason you use a relational database in the first place. Has it occurred to you that MongoDB exists?
Also I don't understand why you're afraid of migrations, especially since you're the only developer on your own SMTP server.
> My original comment started with "but it feels "prior to the MongoDB-is-webscale memes""
Which feels off by six generations or so of memes. It feels prior to "memes" existing in the first place (at least this modern iteration, pics with captions in them; not Dawkins' original version). I'd guess it is, ironically, chronologically closer to, well, your username here.
You certainly would lose a lot of things, like a well-supported path to linking the database engine into your application, and a straightforward way to start introducing relational tables as the project matures. Nothing completely insurmountable, of course, but it carries a lot of extra effort for what benefit?
How does MongoDB handle someone pulling the power cord out of the server? Because that’s another reason to use something like SQLite, and it often gets used in embedded systems.
> That way you're throwing away 50% of the reason you use a relational database in the first place. Has it occurred to you that MongoDB exists?
Did you miss that he’s using SQLite? The dev experience with a SQLite DB is way better than running yet another service, especially for personal projects.
SQLite is used just as much as an application file format as it is a relational database.
> Fortunately, my session database is just 'JSON(B) in a single table', so I was able to add those additional fields without the need for any migrations.
> And SQLite's `json_extract` makes adding indexes after-the-fact super-easy.
That's a migration.
> Of course, these additional fields need to be explicitly nullable, and I need to skip processing based on them if they're absent
That's an effect of not migrating - having to process both null and absent fields instead of just null fields. After doing more of these, you'll run into the same thing that made people stop using NoSQL databases: with no schema, your code has to parse all previous versions of the data format, and they probably aren't even well-documented. An RDBMS, by contrast, can just set the new column to null in existing rows.
> And, no, no need for an ORM, except `JsonSerializer.Deserialize<T>`... (And yeah, all of this is just a horrible hack, but one that seems surprisingly resilient so far, but YMMV)
I do the same thing with serde_json in Rust for a desktop app's SQLite DB and it works great, so +1 on that technique.
In Rust you can also tell serde to ignore unknown fields and use individual view structs to deserialize part of the JSON instead of the whole thing and use string references to make it zero copy.
“Without the need for any migrations” seems like a weird one to me. Of all of the things I don’t like having to do, migrations are pretty low on my list - far below any scripting-in-prod or applying hacks, or even just rudimentary business logic changes. Granted, I used to fear them a lot back when I was expected to apply them by hand rather than use CI/CD and testing to validate/apply them.
Recently I’ve been tempted to make an SMTP server that translates emails into a webhook. Please tell me more horror stories so that I might be convinced not to do it.
Because they put everything in JSON. Migration then means running a script to parse and edit each JSON item instead of letting the database do database things automatically.
In my current company, we're using a similar approach: just shove everything into a JSON blob. If you need a constraint or an index, you can create a computed column (in PostgreSQL) that pulls out a field from JSON.
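A rough sketch of what that can look like (PostgreSQL 12+ generated columns; the table and field names here are invented):

```sql
-- A generated column pulls one field out of the blob so it can carry an
-- index (and, via the unique index, a constraint).
CREATE TABLE documents (
    id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    data  jsonb  NOT NULL,
    email text   GENERATED ALWAYS AS (data ->> 'email') STORED
);

CREATE UNIQUE INDEX documents_email_idx ON documents (email);
```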
For the data schema, we're using Protobufs with buf validate. This works surprisingly well; you can use the same types in the backend API and on the frontend. We even have a cron job that reads all the data periodically and verifies that the JSON blobs conform to the schema. Our next intern is going to write a PostgreSQL extension to do that on commit/update :)
One real advantage of this approach is that it's easy to do stuff like "search everywhere".
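For example (illustrative only, not necessarily how they do it), "search everywhere" can be a single query over the serialized blob:

```sql
-- Crude substring match across the whole JSON document:
SELECT id
  FROM documents
 WHERE data::text ILIKE '%acme%';

-- Or full-text search over the serialized blob:
SELECT id
  FROM documents
 WHERE to_tsvector('simple', data::text) @@ plainto_tsquery('simple', 'acme invoice');
```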
Do you go to the trouble of updating individual values in objects using some kind of deep/partial updating function or do you just accept race conditions that come with updating full objects?
We use optimistic versioning, with a dedicated "version" field (that is actually always pulled out of the blob in all tables).
Classic fine-grained schemas are not that much different. A lot of high-level ORM frameworks simply save all the objects' fields on update, without doing fine-grained diffs.
In addition, our frontend apps also support offline mode. They can get all the relevant objects, and then operate on them locally. So our API was designed from the start to deal with conflicts.
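The version check itself is presumably the usual compare-and-swap-style UPDATE; a minimal sketch with assumed column/parameter names:

```sql
-- $1 = id, $2 = new blob, $3 = the version the client last read.
-- If 'version' is a plain extracted column (rather than one generated from the
-- blob), the application would also set it here alongside the new blob.
UPDATE documents
   SET data = $2
 WHERE id = $1
   AND version = $3;
-- 0 rows affected => someone else got there first: re-read, resolve, retry.
```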
> We use optimistic versioning, with a dedicated "version" field (that is actually always pulled out of the blob in all tables).
All well and good, but you do need to handle failures elegantly in code. The nice thing about flat DB tables and SQL is that you don't really have to care if your goal is to update a single column authoritatively. The state of the other values in the table is often immaterial. It gets even more complicated reconciling deeply nested conflicts in open schemas.
Not knocking your approach, it's just a trade-off I guess.
Yeah, makes sense. To write safe code at that point, though, don't you have to enforce rigid schemas?
An example would be updating Kubernetes resources: the admission controller will verify the correctness of the change you are trying to make and reject the patch if it's non-conformant. Nested values have... value... in terms of contextualizing and isolating/localizing data leaves, but at the end of the day aren't you still dealing with strict schemas, and when those schemas change, don't you have to reconcile schema migrations?
> To write safe code though at that point don't you have to enforce rigid schemas?
Certainly. But you have to do that with SQLite anyway, because (by default, without strict mode [1] enabled) it won't stop you from putting the wrong type of data into a column.
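A quick illustration of the difference STRICT makes (SQLite 3.37+; table name made up):

```sql
-- Without STRICT, SQLite would store the string as-is in an INTEGER column;
-- with STRICT, only values that convert losslessly are accepted.
CREATE TABLE prices (
    id    INTEGER PRIMARY KEY,
    cents INTEGER NOT NULL
) STRICT;

INSERT INTO prices (cents) VALUES ('499');           -- ok: converted to INTEGER 499
INSERT INTO prices (cents) VALUES ('not a number');  -- error: cannot store TEXT value in INTEGER column
```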
I've worked on a system where a guy "migrated" a database from NoSQL to SQL as a proof-of-concept. Except he really didn't migrate anything. He basically created tables with "ID" (UUID) and "data" (json) columns everywhere. No indexes.
The actual "data" is a mess: different data types for the same JSON field. Imagine storing "price" as both a float and a string, depending on whatever buggy code happened to do the inserts.
It worked enough for a prototype and that was enough for management to believe the project was a success. I can't wait until we actually try and deploy it.
Assume there's just a 0.1% chance of a meltdown and you're better off sticking to Vanguard ETFs.
https://archive.is/G3cYF