I'm not sure why noone in 150 comments recommended RDBMS yet, but it's great way...

johntash · on Sept 2, 2018

I'd be interested in hearing more about the system you use.

Have you written anything about it before, I'm curious about more specific examples around what you're doing.

Do you automate most/some of what you put in?

I've thought about writing some scripts to export things from various online accounts/services/apis and importing them into a personal database mostly for backup purposes. It sounds like you're already doing that, and more.

megous · on Sept 2, 2018

I don't know where to start, so here are some random thoughts.

- Some of the data import is for archival purposes and mostly one-shot. I'm old enough to have seen many services where I put my data in, or where I had interesting conversations, go and delete everyhing in the process.

- Some of it is for ease of access and offline use. For example Jira can be a very slow system, where each action takes 10 seconds to complete. Some services have too bloated an UI to be useful on slow computers. You can easily lose access to some data/technical comments if permissions change, etc. If you have data locally, it's possible to avoid all those problems and present the data in any way you want.

Yes, I automate imports from external services. In the past I used PHP, now I mostly use Node or Electron, because it has a transparent support for binary data and JSON columns in postgresql, and if I want to parallelize HTTP requests it's easier to do in Node.

Import/sync scripts always have the same structure. First there's some way to gather the data from the external service, then I import it into DB with update/upsert/insert helpers, for example:

    upsert(db, 'table_name', {id: 123, name: 'qwe', ...}, ['id']);

I always try to find and use entity IDs from the external data source, which eases future syncing. Usually there's not much data so I fetch everything and update the database (keeping what was deleted by the remote data source). Often times the service supports queries that result in entities that changed since some date, which is ideal for incremental syncing. If not, the service usualy has at least a way to order entities by date, so I sync new entries until I start hitting entries that I already have in the database. It's very useful to use async generators for this in javascript, so that I can separate data fetching and data storage logic in a clean way.

One other method I use, with web services that are too messy or complicated, is to create a userscript for Violentmonkey that gathers data as I browse the website and posts it to the database via a small localhost http service. This is useful for JS heavy websites that don't have nice JSON APIs. The script just runs on the background (or can be triggered by a keyboard shortcut) and uses the current state of the DOM to get the data and send them to the database. This method of import is invisible to the service itself.

Most of the scripts are fairly simple (< 100 lines), because most services are built around ~2 interesting primary entities that are worth storing.

Because I use upsert most of the time, I can and I sometimes do add some extra columns for annotation purposes. For example I mentined Jira. I added columns for ordering issues and for marking them with whether I wait for some feedback or not, and then I have a simple Electron app for ordering and listing my issues that I need to work on where issues that wait for external feedback are not shown until the feedback is provided. It's uncluttered, and as fast as you can imagine, being backed by a localhost database.

In Node I use: https://github.com/request/request-promise-native and https://github.com/request/request and https://github.com/vitaly-t/pg-promise

PostgreSQL replication is well docummented elsewhere. I replicate the entire cluster, which doesn't require any maintenance when creating databases/tables, so it's absolutely painless after the initial setup. All you need to do is check the logs from time to time. All I do is that I run the backup database server on a different UNIX socket and disable the TCP/IP interface. It is very flexible, as you can also put the backup server on a different machine, you can have multiple backup servers, chain them and it just works even if you have intermittent connectivity between the machines.

Most of my system is not universal or systematic and is very specific to my use cases. But that will be true for anyone wishing to preserve some of their data in a world that is fairly hostile to data portability. In fact, lack of universality/configurability reduces the complexity by a lot and makes it all easily manageable for a single person with at most a few hours/data source of time investment.

grusel · on Sept 2, 2018

I am curious about a longer more technical explanation of what you do, too.