Deleting data distributed throughout a microservice architecture

hinkley · on May 5, 2020

> First, you’ll need to find the data that needs to be deleted.

Microservices do not get you out of having to have an information architecture. They add more friction if you don't have one, but it's entirely possible to have an unspoken/undocumented information architecture that mostly works.

If you don't have a System of Record for data, you for sure aren't going to be able to find it. Similar problem with no Source of Truth. For some business models you will have both and they will be separate (especially with 3rd party data).

You still have the problem of logs, but at least the problem is tenable. Without any of this it's just chaos and who knows where the data went or really even where it came from?

gravypod · on May 6, 2020

It's also possible to completely automate this. Flow.io has a great talk where the CTO talks about everything they do from an engineering perspective. Just annotate data in your API spec language as PII. This can allow you to also have policies that can be verified by this. If some advice has PII you can enforce that some other service can never talk to it, for example.

616c · on May 6, 2020

Link? I would love to see this as excuses not to do so come up often at work, sigh.

mcintyre1994 · on May 6, 2020

I'm pretty sure they're referring to this talk: https://www.youtube.com/watch?v=j6ow-UemzBc

ijcd · on May 6, 2020

My favorite approach is encrypting all of a user’s data (everywhere) and just deleting the key from a central store instead of actually erasing.

noident · on May 6, 2020

Can you be certain that nobody has retained a copy of this key?

ijcd · on May 6, 2020

Can’t be certain that no one has copied the unencrypted user’s data either.

scollet · on May 7, 2020

I don't know if this addresses the underlying issue.

hangonhn · on May 6, 2020

I was about to suggest the exact same. It is literally what we do at my company. Of course you now have to them deal with searching and indexing. To mitigate this issue, we use UUIDs to reference users throughout our systems. Any fields that would link that UUID back to a real person or contain personal info is then kept encrypted. That way, we can still gather all the information from our various components but if we needed to, we can essentially make the user data unreabable.

rrobukef · on May 6, 2020

Doesn't this still leak information? "A person known by these 5 people was in contact with the suicide hotline."

Polylactic_acid · on May 6, 2020

Doesn't this leave you with a bunch of junk data taking up space?

FridgeSeal · on May 6, 2020

What's the problem with actually just deleting this data when you're done with it though?

vmateixeira · on May 6, 2020

Because FKs on a relational model.. as an example, deleting a user/account might end up being a task on going through every reference to it, and the references to its references, etc.

This is actually the reason some companies do not delete users/accounts [0].

[0] https://news.ycombinator.com/item?id=23005060

chrisacky · on May 6, 2020

One of our popular accounts has about 70M worth of rows. I can't imagine how we would go about deleting their data. We rotate out old data each month when it doesn't become needed, but maybe 40M records (4M new each month gets added).

Xorlev · on May 6, 2020

Not all customers are cool with this approach.

luckylion · on May 6, 2020

For regulatory reasons or because of "you didn't really delete it"? This is similar to deleting a file but not overwriting the space it occupied on the disk, isn't it? The data is still "there", but it's not accessible by normal means.

evbpcapfxy · on May 6, 2020

Is that okay from a GDPR perspective? What if there's an exploit discovered in the implementation of the encryption? Or what if quantum computers can crack it easily in the future?

rommel917 · on May 6, 2020

It is ok for GDPR, it is axon uses it in their commercial GDPR modul (better then our but same principale).

Broken encryption, 20 years better machines in future and quantum are solved with same trick. We have event sourcing.

Implement best current encryption, delete everything except event store, decrypt events with old encryption publish events and everything in now encrypted with best current encryption events in new store. Delete deprecated old event store. Skip aggregates with deleted old key.

valera_rozuvan · on May 6, 2020

One should really consider DevSecOps when dealing with a large infrastructure involving dozens or hundreds of micro services. Security as Code by design should be implemented as soon as possible. In my current gig - the following 3 points are key:

- 24x7 Proactive Security Monitoring

- Shared Threat Intelligence

- Compliance Operations

I would really advise anyone interested in this subject to check out DevSecOps manifesto [1].

Also - if you are running a k8s cluster for all your micro-services, there are several guides online as to best practices (see [2] for example).

----------

[1] https://www.devsecops.org/

[2] https://dev.to/petermbenjamin/kubernetes-security-best-pract...

BrentOzar · on May 5, 2020

This is relevant for enterprises with legacy systems, too, like shops that have multiple interfaces that extract, transform, and load data across point-of-sale systems, warehouse fulfillment systems, and data warehouses.

golover721 · on May 5, 2020

This gives me nightmares about having to retrofit GDPR requirements into a complicated system with many data stores and applications. Not only the difficulty of ensuring data is deleted but tracking data lineage so you can delete derived data too. Fun times!

Hokusai · on May 6, 2020

> This gives me nightmares about having to retrofit GDPR requirements into a complicated system with many data stores and applications.

This is true when the domain is not well defined, that is the case for many legacy systems. Usually this systems have also problems with de-normalized data where there is several copies of the same entity across the system. Copies get out of sync.

I do not think that nowadays is as common as it used to be. When I was a kid, I saw many systems where you had "customer" data that was replicated in the two or three applications that were using it. Then, maybe at night, some task will "make sure" that all the data was on sync or in real-time with triggers in the database. Applications that were not part of this synchronization will pop up, some fields will be in one application but not in another one, etc. Bad-defined domain objects and identifiers will make sometimes fields too small to fit the original data, being in the wrong format (text vs numeric) or miss unique keys and duplicate rows.

Q: "Can you send a mail (stamp-based-mail) to everybody that works the Christmas shift?" A: "No. That data is in the scheduling system. We can only send mails from the Human Resources system as is the only one that stores the address".

I hope new generations of developers do not find themselves in this situations, but they will need to maintain the many legacy systems that still live way beyond what anyone expected.

echelon · on May 6, 2020

This is what I like to bring up every time people say the GDPR imposes no costs.

On the other hand, like any other regulation, it also becomes a barrier to entry for new players that have to expend engineering and compliance efforts.

That said, I'm wholly for data protection, data portability, and privacy. It's just a nuanced subject that is more complicated than it seems.

Ram_Lakshmanan · on May 6, 2020

This will help us https://bitmovin.com/finding-memory-leaks-java-p2/

crimsonalucard · on May 6, 2020

My company recently decided to switch over to microservices. It was a long an arduous process but we chose not to compromise.

Basically for the greatest amount of modularity, we divided all 400 functions in our monolithic application into 400 individual servers because obviously functions aren't modularizing everything enough. You really need to put more and more wrappers around all of your functions. First put a framework around your function, than put an http api layer around it, then wrap a server app around it, then put an entire container around it and boom! More wrappers == Less technical debt. To illustrate how this works see example below:

   wrapper(
     wrapper(
       wrapper(
         wrapper(
           f(x)
         )
       )
     )
   )

See that? Obviously for every additional wrapper you add around your original function your technical debt becomes less. This is why it makes sense to make to not just use functions to modularize your data but to wrap all your functions in containers and then put those containers in containers.

Now all 400 of our engineers each as an individual manages one entire function within one entire container. It's amazing, they don't have to think about two things anymore, they can just concentrate on one thing.

While I'm not sure what has improved yet, everything feels better. Our company is following industry trends and buzzwords. Technical debt actually went up but that's just our fault for building it wrong.

Some engineer asked me why couldn't we just load balance our original monolith and scale it horizontally. I fired that engineer.

Another engineer came to me and told me that for some function:

   func some_func(a: int, b: int) -> int: {
      return a + b
   }

It's probably better to do 1. instead of 2.

   1. Call the function: some_func(2,3)

   2. Make a request: request.get("http://www.pointlessapi.com/api/morecomplexity/someextratechnicaldebt/randomletters/asdfskeidk/some_func?a=2&a=3") 
   and parse the json:

   {
     "metadata": {
         "date": "1/1/2020"
         "request_id": "123323432",
         "other_pointless_crap": ...
         "more_usless_info": ...
         "function_name (why?)": "some_func"
     }
     "actual_data": 5
   }

I fired that engineer too. Obviously 2. is better than 1. Also why am I using JSON? Not enough wrappers! You have to wrap that http call in additional wrappers like GraphQL or GRPC!!! (See wrapper logic above).

Have you guys heard of a new trend called Sololithic architecture? Basically the new philosophy states that all of 400 of our microservices should be placed in 400 containers and run under a cluster of 399 computers under kubernetes!

I may not understand where technical debt comes from and I also may not understand how all these architectures will fix the problem of technical debt forever... I know that industry trends and buzzwords are more intelligent than me and monoliths are bad bad bad and obviously the source of all technical debt! Just cut everything into little pieces and technical debt becomes ZERO.

Right? The technical definition of Technical debt is not enough cutting of your logic into tiny pieces and not enough wrappers around your modules so all you need to do is cut everything up and put wrappers around it and problem solved! Makes sense!

nell · on May 6, 2020

You used words like "buzzwords" which makes you sound facetious. I know you aren't, but others could get the wrong idea. This is a very useful piece of advice that should be taught in CS 101.

RandyRanderson · on May 6, 2020

The sad thing is, i was 1/3 into your post before it was clear that you were being cheeky. I've read several comments/articles on hn lately that were like this but serious.

Note: you forgot your container orchestration orchestrator. Also you should make clear that your CI/CD pipelines will also use the orchestration orchestrator pattern.

effoffhn · on May 5, 2020

[flagged]

philwelch · on May 5, 2020

Cynical tone aside, there’s a good question here. It’s just that the question has an actual, valid answer: it’s impossible to have a single database that operates at Twitter scale.

If you think it is possible to have a single database that operates at Twitter scale, fine, there’s probably an interesting and enlightening conversation to be had about how and why that is or is not the case.

Continue along this vein and you eventually you get to the point where you’re discussing realistic solutions, and maybe at the end of it you’ve either gained an understanding of how these things work or else you’ve actually come up with a better system design than Twitter. Either way you’ve gained something more valuable than the petty satisfaction of disparaging other people’s motivations.

devonkim · on May 5, 2020

Deletion is a form of cache invalidation if you think about it

flarg · on May 5, 2020

Thank you. And this sort of problem occurs in large organisations with lots of different monoliths all caching each others data.

_y5hn · on May 5, 2020

Records storing transactional facts, are NOT "caching each others data".

flarg · on May 5, 2020

Maybe you're right, but, out of interest, what would call it when half a dozen systems store elements of each others data and then refresh that data on a daily basis?

_y5hn · on May 6, 2020

If tied to an event: facts

If synced and overwritten between systems on daily basis: liability

hinkley · on May 5, 2020

If not the King, at least the Crown Prince of cache invalidation.

d_watt · on May 5, 2020

What if you're twitter? Which this person is.

I agree that if you can keep it simple, it's easier to do it. But sometimes you need distributed services. Saying only Google has that problem is a little reductive.

namanaggarwal · on May 5, 2020

This article is about microservices. If you are not at scale you might not need microservices in the first place.

When data is distributed, one team/service owns user data and other tweets. It becomes not so trivial.

AmericanChopper · on May 5, 2020

This particular article is about microservices, but there's plenty of ordinary business reasons that you may have some sort of asynchronous business process that runs across a distributed set of systems/teams/organisations, that do not relate to scale. I was working on a microservice recently (really it was a service-oriented architecture, but they seem to pretty much mean the same thing now), and it only processed around 10,000 transactions per day. But it almost had to be designed that way, due to the nature of the business processes it was supporting, and the systems it had to interface with.

pfranz · on May 5, 2020

Someone already mentioned cache invalidation. To extend that, I don't think it's all that different from an old paper system. If you want to delete your file it's probably kept in a cabinet in some department, but the billing department or marketing department also has a copy of your name and address in their records. Deleting everything is a multi-step process.

Centralized systems didn't scale in the physical or digital world and it distributed systems complicate things that seem trivial.