If you’re interested in this ‘Personal Data Warehouse’ concept, you might also be interested in @karlicoss’s articles[0][1][2] about his infrastructure for saving personal data.

There are various technical differences (for example, @simonw prefers lightweight SQLite databases, while @karlicoss prefers dumping the raw data and parsing it when needed[3]), but the purposes are similar enough that I think they deserve a mention.

There was also some very constructive HN discussion[4] of these articles, where @simonw has introduced Dogsheep before :-)

[0]: Building data liberation infrastructure — https://beepb00p.xyz/exports.html

[1]: Human Programming Interface — https://beepb00p.xyz/hpi.html

[2]: The sad state of personal data and infrastructure — https://beepb00p.xyz/sad-infra.html

[3]: Against unnecessary databases — https://beepb00p.xyz/unnecessary-db.html

[4]: https://news.ycombinator.com/item?id=21844105



I'm building something like that, although my goal is to create a certain UI, not to warehouse my own data. I want the usual photos timeline, but extended to include other artefacts of my daily life. The goal is to roll back to a point in time and find my photos, conversations, transactions, actions and location history.

Working on my own files is easy: pull the files incrementally to a central place with rsync, then process the changed files, using each file's checksum to create previews without duplicates.
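
A minimal sketch of that checksum step (the paths and the preview function are hypothetical):

  import hashlib
  from pathlib import Path

  ARCHIVE = Path("/srv/archive")    # rsync target (placeholder path)
  PREVIEWS = Path("/srv/previews")  # previews keyed by content hash

  def checksum(path: Path) -> str:
      # Hash the file in chunks so large files don't load into memory
      h = hashlib.sha256()
      with path.open("rb") as f:
          for chunk in iter(lambda: f.read(1 << 20), b""):
              h.update(chunk)
      return h.hexdigest()

  for f in ARCHIVE.rglob("*"):
      if not f.is_file():
          continue
      preview = PREVIEWS / (checksum(f) + ".jpg")
      if not preview.exists():
          render_preview(f, preview)  # hypothetical thumbnailer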

Working on my own websites is also easy. I just need to add an RSS feed.
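
Consuming those feeds is then trivial; a rough sketch with feedparser (the URL is a placeholder):

  import feedparser

  feed = feedparser.parse("https://example.com/feed.xml")
  for entry in feed.entries:
      # Store title, link and timestamp alongside the other artefacts
      print(entry.title, entry.link, entry.get("published"))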

Unfortunately, fetching data from other sources is much harder. It made me realise how much of my data is held hostage. Most of it can be retrieved manually, but not with a script that runs regularly. Instant messaging and location history are two big examples.

Repeatability is another problem. For example, Reddit only lets you access your last 1000 comments.

And finally, you must deal with updates. If you go back and revise a comment or a post, the stored data should be updated too.
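
For what it's worth, fetching on a schedule and upserting by ID softens both problems: edits overwrite the stored row, and items are captured before they fall out of the 1000-item window. A rough sketch with PRAW and SQLite (credentials and file names are placeholders):

  import sqlite3
  import praw  # Reddit API wrapper; credentials below are placeholders

  reddit = praw.Reddit(client_id="...", client_secret="...",
                       username="...", password="...",
                       user_agent="personal-archive")

  db = sqlite3.connect("reddit.sqlite")
  db.execute("""CREATE TABLE IF NOT EXISTS comments
                (id TEXT PRIMARY KEY, body TEXT, created_utc REAL)""")

  # Run this regularly; upserting by id means a later edit
  # replaces the previously stored body.
  for c in reddit.user.me().comments.new(limit=None):
      db.execute("""INSERT INTO comments VALUES (?, ?, ?)
                    ON CONFLICT(id) DO UPDATE SET body = excluded.body""",
                 (c.id, c.body, c.created_utc))
  db.commit()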


I feel you.

For IM specifically, one thing you could do (if the IM platforms you want are supported) is set up a Synapse homeserver with bridges; then you have everything (encrypted or unencrypted, depending on config) in a SQL database. It may not be worth the overhead, depending on your hassle tolerance, if you're not otherwise interested in using matrix.org (I recommend it, though).
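
Once the bridges are writing into Synapse's database, plain SQL gets the messages back out. A rough sketch against a SQLite-backed homeserver (table and column names vary between Synapse versions, so treat them as assumptions to check against your schema):

  import json
  import sqlite3

  # Placeholder path; Synapse's default SQLite database is homeserver.db
  db = sqlite3.connect("/path/to/homeserver.db")

  # event_json holds the raw event payloads; filter down to chat messages
  rows = db.execute("""SELECT ej.json FROM event_json ej
                       JOIN events e ON e.event_id = ej.event_id
                       WHERE e.type = 'm.room.message'""")
  for (raw,) in rows:
      event = json.loads(raw)
      print(event["sender"], event["content"].get("body"))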

At least a lot of the annoying groundwork is already done. If you do hit some snags, bug reports and PRs are generally very much appreciated in the bridge projects.


I'm also interested in working on something like that, but something that lets you access your data on the go, even when it sits behind an ISP NAT layer.


Have you tried Tailscale? It uses WireGuard to set up a secure mesh network between your devices that punches through NAT, and it's incredibly easy to set up.


That's really cool. I'm also in this space and have been thinking about this recently, with no answers.

Tailscale seems like a great option, but I'm mildly concerned about relying on a company for this. Plus the solo plan with a family option sounds a little... meh.

Definitely an interesting take on this problem, appreciated :)


My idea is to use WebRTC cleverly: you could make money by running the default handshake (signalling) server. It's an interesting model for disruption, because you don't have to worry about legal exposure to the rapidly eroding safe-harbor landscape.


You could always host the files elsewhere. I don't do that because this is first and foremost a file backup tool (rsync-based). Otherwise I'd put it on a cheap DigitalOcean VPS.


It's great that you're building an interface for that, please keep us updated!

I think it's great if we solve complementary problems (i.e. I've been heavily on the 'liberate and access data' side so far) and plug into each other's solutions.


It's at github.com/nicbou/backups. It's already live on my home server, but not really built to be distributed. I build some software like meals: meant to be enjoyed by myself and a few guests.


Thanks! Are there screenshots or something like that?

"not really built to be distributed" -- completely understandable... sharing has been much harder than I imagined! For myself I'd probably be better off with some huge monorepo.


Late to the thread, but there's an open source app called fluxtream (no S) that was designed as a personal data logger and aggregator. It might be useful for inspiration. I haven't gotten around to trying it, and don't know how well maintained it is now.


I too have wanted something like this for a long time.


> Most of it can be retrieved manually, but not with a script that runs regularly

This may come down to the ability of the script or its author. Most things retrievable manually are retrievable by scripts, bots, scrapers, etc. The bigger data captors also provide APIs, which can help to a degree, and where limits are imposed there are usually workarounds.


Sure, but it's hard to build something reliable and long-lasting that relies on scraping websites that actively try to prevent scrapers. It's not that I can't build it, but rather that I won't.


Thanks for sharing, @simonw and I seem to be bumping into each other regularly :)

I've been meaning to give Datasette a try and plug it into my system! Even though I prefer to rely on code as the main interface, in most cases I already have SQLite for free, because it's used as a cache [0]. If a function is marked with the @cachew decorator, its results are cached on disk, invalidated when the arguments change, etc.

[0] https://github.com/karlicoss/cachew#what-is-cachew
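
Roughly, usage looks like this (the types and values are just illustrative):

  from dataclasses import dataclass
  from typing import Iterator
  from cachew import cachew

  @dataclass
  class Photo:
      path: str
      lat: float
      lon: float

  @cachew  # results are persisted to a SQLite cache on first call
  def photos() -> Iterator[Photo]:
      # Imagine slow EXIF parsing here; later calls read back
      # from the cache until the inputs change
      yield Photo("/photos/a.jpg", 52.5, 13.4)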


P.S. just gave it a go!

  pipx install datasette && datasette install datasette-cluster-map
  ## warm up HPI (to invalidate and dump a newer sqlite cache file)
  python3 -c 'import my.photos as P; P.print_all()'
  ## run Datasette
  datasette /path/to/cache/photos.sqlite

Tada! https://imgur.com/a/N25Zirs very nice :)

Only had to adjust the query in the last step to conform to the naming (i.e. `select _cachew_union_repr_Photo_geo_lat as latitude, _cachew_union_repr_Photo_geo_lon as longitude`)


That's really cool!


The other day I was considering a mix of these: DDL/DML as a serialization/synchronization format, which gets processed locally to form a SQLite DB. Then use RAD tools to craft a custom UI for your needs.
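
A minimal sketch of that idea with Python's sqlite3, where iterdump() serializes a DB as SQL statements and executescript() rebuilds it locally (file names are placeholders):

  import sqlite3

  # Producing side: serialize the whole DB as DDL/DML text
  src = sqlite3.connect("source.sqlite")
  with open("sync.sql", "w") as f:
      for statement in src.iterdump():
          f.write(statement + "\n")

  # Consuming side: replay the statements into a fresh local DB
  dst = sqlite3.connect("local.sqlite")
  with open("sync.sql") as f:
      dst.executescript(f.read())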


I usually prefer to do both: store the original documents, and also parse them into the database for the ease and power of SQL.



