If you’re interested in this ‘Personal Data Warehouse’ concept, you might also be interested in @karlicoss’s articles[0][1][2] about his infrastructure for saving personal data.

There are various technical differences (for example, @simonw prefers lightweight SQLite databases, while @karlicoss prefers dumping the raw data and parsing it when needed[3]), but the purposes are similar enough that I think they deserve a mention.

There was also some very constructive HN discussion[4] of these articles, where @simonw has introduced Dogsheep before :-)

[0]: Building data liberation infrastructure — https://beepb00p.xyz/exports.html

[1]: Human Programming Interface — https://beepb00p.xyz/hpi.html

[2]: The sad state of personal data and infrastructure — https://beepb00p.xyz/sad-infra.html

[3]: Against unnecessary databases — https://beepb00p.xyz/unnecessary-db.html

[4]: https://news.ycombinator.com/item?id=21844105



I'm building something like that, although my goal is to create a certain UI, not to warehouse my own data. I want the usual photos timeline, but extended to include other artefacts of my daily life. The goal is to roll back to a point in time and find my photos, conversations, transactions, actions and location history.

Working on my own files is easy: pull the files incrementally to a central place with rsync, then process the changed files, using each file's checksum to create previews without duplicates.
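
A minimal sketch of that checksum step (the paths and the preview function are hypothetical):

  import hashlib
  from pathlib import Path

  ARCHIVE = Path("/srv/archive")    # rsync target (placeholder path)
  PREVIEWS = Path("/srv/previews")  # previews keyed by content hash

  def checksum(path: Path) -> str:
      # Hash the file in chunks so large files don't load into memory
      h = hashlib.sha256()
      with path.open("rb") as f:
          for chunk in iter(lambda: f.read(1 << 20), b""):
              h.update(chunk)
      return h.hexdigest()

  for f in ARCHIVE.rglob("*"):
      if not f.is_file():
          continue
      preview = PREVIEWS / (checksum(f) + ".jpg")
      if not preview.exists():
          render_preview(f, preview)  # hypothetical thumbnailer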

Working on my own websites is also easy. I just need to add an RSS feed.
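
Consuming those feeds is then trivial; a rough sketch with feedparser (the URL is a placeholder):

  import feedparser

  feed = feedparser.parse("https://example.com/feed.xml")
  for entry in feed.entries:
      # Store title, link and timestamp alongside the other artefacts
      print(entry.title, entry.link, entry.get("published"))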

Unfortunately, fetching data from other sources is much harder. It made me realise how much of my data is held hostage. Most of it can be retrieved manually, but not with a script that runs regularly. Instant messaging and location history are two big examples.

Repeatability is another problem. For example, Reddit only lets you access your last 1000 comments.

And finally, you must deal with updates. If you go back and revise a comment or a post, the stored data should be updated too.
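
For what it's worth, fetching on a schedule and upserting by ID softens both problems: edits overwrite the stored row, and items are captured before they fall out of the 1000-item window. A rough sketch with PRAW and SQLite (credentials and file names are placeholders):

  import sqlite3
  import praw  # Reddit API wrapper; credentials below are placeholders

  reddit = praw.Reddit(client_id="...", client_secret="...",
                       username="...", password="...",
                       user_agent="personal-archive")

  db = sqlite3.connect("reddit.sqlite")
  db.execute("""CREATE TABLE IF NOT EXISTS comments
                (id TEXT PRIMARY KEY, body TEXT, created_utc REAL)""")

  # Run this regularly; upserting by id means a later edit
  # replaces the previously stored body.
  for c in reddit.user.me().comments.new(limit=None):
      db.execute("""INSERT INTO comments VALUES (?, ?, ?)
                    ON CONFLICT(id) DO UPDATE SET body = excluded.body""",
                 (c.id, c.body, c.created_utc))
  db.commit()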


I feel you.

For IM specifically, one thing you could do (if the IM platforms you want are supported) is set up a Synapse homeserver with bridges; then you have everything (encrypted or unencrypted, depending on config) in a SQL database. It may not be worth the overhead, depending on your hassle tolerance, if you're not otherwise interested in using matrix.org (I recommend it, though).
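
Once the bridges are writing into Synapse's database, plain SQL gets the messages back out. A rough sketch against a SQLite-backed homeserver (table and column names vary between Synapse versions, so treat them as assumptions to check against your schema):

  import json
  import sqlite3

  # Placeholder path; Synapse's default SQLite database is homeserver.db
  db = sqlite3.connect("/path/to/homeserver.db")

  # event_json holds the raw event payloads; filter down to chat messages
  rows = db.execute("""SELECT ej.json FROM event_json ej
                       JOIN events e ON e.event_id = ej.event_id
                       WHERE e.type = 'm.room.message'""")
  for (raw,) in rows:
      event = json.loads(raw)
      print(event["sender"], event["content"].get("body"))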

At least a lot of the annoying groundwork is already done. If you do hit some snags, bug reports and PRs are generally very much appreciated in the bridge projects.


I'm also interested in working on something like that, but something that lets you access your data on the go, even when it sits behind an ISP NAT layer.


Have you tried Tailscale? It uses WireGuard to set up a secure mesh network between your devices that punches through NAT, and it's incredibly easy to set up.


That's really cool. I'm also in this space and have been thinking about this recently, with no answers.

Tailscale seems like a great option, but I'm mildly concerned about relying on a company for this. Plus the solo plan with a family option sounds a little... meh.

Definitely an interesting take on this problem, appreciated :)


My idea is to use WebRTC cleverly: you could make money by running the default handshake (signalling) server. It's an interesting model for disruption, because you don't have to worry about legal exposure to the rapidly eroding safe-harbor landscape.


You could always host the files elsewhere. I don't do that because this is first and foremost a file backup tool (rsync-based). Otherwise I'd put it on a cheap DigitalOcean VPS.


It's great that you're building an interface for that, please keep us updated!

I think it's great if we solve complementary problems (i.e. I've been heavily on the 'liberate and access data' side so far) and plug into each other's solutions.


It's at github.com/nicbou/backups. It's already live on my home server, but not really built to be distributed. I build some software like meals: meant to be enjoyed by myself and a few guests.


Thanks! Are there screenshots or something like that?

"not really built to be distributed" -- completely understandable... sharing has been much harder than I imagined! For myself I'd probably be better off with some huge monorepo.


Late to the thread, but there's an open source app called fluxtream (no S) that was designed as a personal data logger and aggregator. It might be useful for inspiration. I haven't gotten around to trying it, and don't know how well maintained it is now.


I too have wanted something like this for a long time.


> Most of it can be retrieved manually, but not with a script that runs regularly

This may come down to the ability of the script or its author. Most things retrievable manually are retrievable by scripts, bots, scrapers, etc. The bigger data captors also provide APIs, which can help to a degree, and where limits are imposed there are usually workarounds.


Sure, but it's hard to build something reliable and long-lasting that relies on scraping websites that actively try to prevent scrapers. It's not that I can't build it, but rather that I won't.


Thanks for sharing, @simonw and I seem to be bumping into each other regularly :)

I've been meaning to give Datasette a try and plug it into my system! Even though I prefer to rely on code as the main interface, in most cases I already have SQLite for free, because it's used as a cache [0]. If a function is marked with the @cachew decorator, its results are cached on disk, invalidated when the arguments change, etc.

[0] https://github.com/karlicoss/cachew#what-is-cachew
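
Roughly, usage looks like this (the types and values are just illustrative):

  from dataclasses import dataclass
  from typing import Iterator
  from cachew import cachew

  @dataclass
  class Photo:
      path: str
      lat: float
      lon: float

  @cachew  # results are persisted to a SQLite cache on first call
  def photos() -> Iterator[Photo]:
      # Imagine slow EXIF parsing here; later calls read back
      # from the cache until the inputs change
      yield Photo("/photos/a.jpg", 52.5, 13.4)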


P.S. just gave it a go!

  pipx install datasette && datasette install datasette-cluster-map
  ## warm up HPI (to invalidate and dump a newer sqlite cache file)
  python3 -c 'import my.photos as P; P.print_all()'
  ## run Datasette
  datasette /path/to/cache/photos.sqlite

Tada! https://imgur.com/a/N25Zirs very nice :)

Only had to adjust the query in the last step to conform to the naming (i.e. `select _cachew_union_repr_Photo_geo_lat as latitude, _cachew_union_repr_Photo_geo_lon as longitude`)


That's really cool!


The other day I was considering a mix of these: DDL/DML as a serialization/synchronization format, which gets processed locally to form a SQLite DB. Then use RAD tools to craft a custom UI for your needs.
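
A minimal sketch of that idea with Python's sqlite3, where iterdump() serializes a DB as SQL statements and executescript() rebuilds it locally (file names are placeholders):

  import sqlite3

  # Producing side: serialize the whole DB as DDL/DML text
  src = sqlite3.connect("source.sqlite")
  with open("sync.sql", "w") as f:
      for statement in src.iterdump():
          f.write(statement + "\n")

  # Consuming side: replay the statements into a fresh local DB
  dst = sqlite3.connect("local.sqlite")
  with open("sync.sql") as f:
      dst.executescript(f.read())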


I usually prefer to do both: store the original documents, and also parse them into the database for the ease and power of SQL.



