More

mathetic · on June 28, 2021

Sorry, I didn't mean to mislead. I was exemplifying ways of subtly leaking data. In the next paragraph, I clarify what we cover with the following:

> Namely, we implement a type system that regulates explicit and implicit dataflows as well as non-termination as a covert channel.

Yes, there is work on preventing timing attacks using static types. One is "A Hardware Design Language for Timing-Sensitive Information-Flow Security" which addresses exactly this problem. The second line of work is resource analysis. There are type systems that can specify the complexity of the program. Check out for example relational cost analysis [0]. This can be used for privacy and security purposes.

Needless to say, this is considerably more sophisticated than what I covered.

[0] https://dl.acm.org/doi/10.1145/2694344.2694372 [1] https://dl.acm.org/doi/10.1145/3009837.3009858

mathetic · on June 28, 2021

I used natural numbers as a simplification. In full generality, you can use a lattice where min and max are replaced by meet and join operations. Then you can have categories that are not comparable to each other.

As for privacy, you definitely have security levels. For example, think of employee record system in a simple company. The employee, payroll, human resources, and general public (perhaps accessing the system through the public-facing portion of their website) will all have different levels of access to information. You probably have the following lattice of access.

          employee
       /           \
      /             \
  payroll        human resources
     \              /
      \            /
          public

These security levels would make sense because public would have the least access (public bio perhaps), payroll would need access to things like your salary, human resources would need access to a record of disciplinary action, and you'd have access to all the information as you probably inputted most of it. This is one of many possible structures a company might have. I'd argue access to employee information is definitely a matter of privacy.

That said you're right that there is a dynamic nature of this discussion. What happens if access depends on my geographic location? That information will be stored in a database. This information is dynamic, but you can still use static checking. It'd look something like the following:

  location := getLocation(usedID);
  match location with
  | UK -> ...
  | US -> ...

Now, you don't know what location you have, but once you pattern match on it, you can check `...`s according to the correct level statically because you know that you can never execute that code if the location was elsewhere. So we use dynamic checks to reveal static information.

The big type-theoretic idea here is "existential types" that packs data and not reveal information until it is unpacked.

Hope that was helpful. That was a great great question.

didibus · on June 28, 2021

Thanks for taking time to answer, if you can have something more akin to permission categories, that makes it more interesting to me.

By dynamic, in your example, I'm meaning that for example as an employee I shouldn't have access to other employees data, only my own. I don't see how that can be handled statically.

Also, in the code for your lattice example, I guess you'd have a different page for each one? So like one page for human-resource and the variables in the code for that page would be of security-level human-resource. So in that way I get it, you can't accidentally for example on the public profile page add a payroll level information, because assigning from the payroll to the public page would error. But how do you now control access to which page the user has access too? Maybe for this you still need a dynamic permission system on top? And the compile checks only prevent programmer accidentally showing the wrong data on the wrong page?

Lastly, I still don't know if I buy into "levels". From my experience in enterprise systems, requirements always end up that there's no real hierarchy, just various overlapping sets. For example, it's very likely that human-resource would need to see data which the employee shouldn't see, like performance reviews, HR violations about the employee, compensation targets, etc. Similarly I can imagine payroll needing to see some company related financial with regards to the employee pay which the employee shouldn't have access too as well.

That's where I feel levels wouldn't meet requirements, since you rarely have scenarios where one category is a strict superset of another.

I can even imagine this causing leaks, as a programmer might get confused and think: "employee is the highest level, so I guess all those fields I added to the HR page should also go on employee and the assignment succeeds since employee is highest level of privacy, perfect I can leak no data". And now you've leaked private internal company data to the employee.

So here I think what be cool is if I could instead say that some function returns a value of security [payroll, employee] and another function returns a value of [payroll], while yet another returns [employee]. And now a [payroll, employee] can be assigned to a [payroll] or a [employee] or a [payroll, employee], but you can't assign a [payroll] to a [payroll, employee].

mathetic · on June 28, 2021

The goal is the same but the approach is different. This is entirely compile time and it covers all executions of the program, whereas taint analysis done at runtime and consider only some executions.

Also taint analysis typically doesn't track implicit flows described in the post, but detects the explicit ones.

a1369209993 · on June 28, 2021

> This is entirely compile time and it covers all executions of the program, whereas taint analysis done at runtime and consider only some executions.

That is literally just the difference between static versus dynamic typing.

mathetic · on June 28, 2021

The language is just assignment to simple variables (like `=` in C), sequencing (like `;` in C), if-then-else statements, and while loops.

The simple version uses algebraic data types (present in Rust, Haskell, OCaml to name a few) to describe the abstract syntax tree of the language.

The inference rules [0] (the bits that have lots of dashes in them) are standard notation for describing logic and type systems.

The rest of the post uses more advanced types, particularly generalised algebraic data types [1] which allow you to have interesting types that reveal information to the typechecker when you pattern match on them.

If you really want to get into type systems and programming language theory, Types and Programming Languages [2] is a classic.

If you just want to understand the privacy ideas, I describe what the rules are supposed to do in detail, so you can gloss over the actual syntax. The only thing that part adds is that I can make Haskell's typechecker prevent construction of privacy violating programs.

If anything is still unclear, feel free to reach out. I'm happy to explain and elaborate.

[0] https://en.wikipedia.org/wiki/Rule_of_inference

[1] https://en.wikipedia.org/wiki/Generalized_algebraic_data_typ...

[2] https://www.cis.upenn.edu/~bcpierce/tapl/

CTmystery · on June 28, 2021

Very helpful, thank you!

mathetic · on June 27, 2021

Let's say we have `a, b, d : 0` and `c : 999`. Then `a := b :>> c := d` according to your rule is well-typed and has a security level `999`.

Now let's say I have a conditional expression `private : 999`.

The following program typechecks according to the rest of the restrictions. In particular, `999` from `private` is less than or equal to the `999` from the sequenced command.

  ITE private $
    a := b :>>
    c := d

But we just leaked data because if the value of `a` (which we can observe) is not `b`, then we know that `private` was 0 (aka. false). This should never happen because `a` is public and `private` is private.

The intuition here is that the lower the security level a command has the more public the conditional has to be. Otherwise, private data would *influence* public data which we can then observe to deduce something about the private data.

> it seems like you could leak info by just assigning two variables, where one is always public.

I didn't quite follow this bit. But if you elaborate, I'm happy to discuss.

nv-vn · on June 28, 2021

I think maybe the key idea I was missing here is that the type of a variable is not always the same as the type of the scope it was defined in? If I'm understanding correctly, there's an implicit rule that's missing:

    a : la   b : lb   a := b
   --------------------------
             a : lb

So for example if `b : lb`:

   a := b :>>
   doSomething()

need not have a type `>= lb`.

Does that sound vaguely correct?

Btw thanks for clarifying

mathetic · on June 27, 2021

My bad, sorry =)

The share from this story is that we can enforce privacy just as we enforce more traditional types at no runtime cost. It scales because it is compositional and compiler can guide us through programs that won't leak data.

A more niche outcome is that Haskell's traditional type system is rich enough to encode this privacy enforcing type system.

I think if you just ignore the syntax, you can still gain a good intuition about how data might leak in subtle ways and what is sufficient to prevent it.

mathetic · on June 27, 2021

That is a different notion of typing. The type systems you are referring to (which is how they are commonly understood) classify data according to the operations they support. In this post, types classify privacy. These two notions are orthogonal and can be combined if needed. There are many many more notions of typing out there achieving wildly different goals.

mathetic · on June 27, 2021

Wouldn't that require the programmer to be adversarial?

Since all type systems have escape hatches, I don't think they would be resilient at all in adversarial settings. This is more geared towards detecting accidental and _subtle_ data leakage.

mathetic · on Nov 25, 2019

The point is often you don't need to. If you use Google docs, the chances are you will be editing it disjointly most of the time and occasionally you'll edit the same bit of the document. Then you don't really care if what comes out is a meaningful sentence, but that all parties involved (regardless in which order they see the operations) see the same version of the document so that they can fix it. That's what CRDTs buy you.

Without them, you have to use either explicit synchronisation which is a bad idea if you want local-first software (doesn't work offline, needs to be done even when editing disjoint bits of the file, etc.) or something like CRDTs. Now there is a great paper by Victor Gomes, Martin Kleppmann, and Dominic Mulligan that proves (in a proof assistant!) that pretty much all other alternatives that do not use strong consistency is flawed in edge cases.

As for triviality of the operations, I know for a fact that Martin Kleppmann has a paper in the works that show (and prove) how to do tree move efficiently as a CRDT. That means you can basically do arbitrary edits on JSON. Hopefully, that should clear up any confusion about for which applications it might be useful.

dunkelheit · on Nov 25, 2019

Thanks for your comment!

> The point is often you don't need to. If you use Google docs, the chances are you will be editing it disjointly most of the time and occasionally you'll edit the same bit of the document. Then you don't really care if what comes out is a meaningful sentence, but that all parties involved (regardless in which order they see the operations) see the same version of the document so that they can fix it. That's what CRDTs buy you.

Ok, so the hope is that semantic conflicts are rare and if they do happen, users can easily see them and promptly correct them. I guess this works well if the invariants are local and synchronization is fairly frequent (i.e. on the scale of seconds, not weeks).

> That means you can basically do arbitrary edits on JSON. Hopefully, that should clear up any confusion about for which applications it might be useful.

Not quite. Of course you can model everything in JSON, but the question remains as to how much additional implicit structure is needed atop the explicit tree-like structure that JSON provides and how easily this implicit structure is disturbed by merges.

dunham · on Nov 25, 2019

I think the subtlety with CRDTs is that how you choose to represent your data model via CRDT data structures encodes the semantics of how conflicts are resolved. As a simple example there are multiple versions of Set (remove once, add only, remove observed, etc.) that have different semantics.

As a more concrete example, Apple Notes edits tables and resolves conflicts via CRDTs. Now for editing tables in a word processor, you can have multiple users moving, adding, and deleting columns, in addition to editing the text within a cell. (One user adds a row, another adds a column. One user deletes a column and another moves it. etc.)

If you just naively represent this as a list of lists, the semantics expected by the user breaks. (One user adds a row, another adds a column, and you have a short row with cells in the wrong place.)

Apple ended up representing the columns as an ordered set of ids (piecing this together from other CRDTs) the rows as a second ordered set of ids, and then the text blobs as a map of column id, to a map of row id to a text crdt object. With the additional assumption that any "missing" text is in the blank, initial state. (I might have this transposed, it's from memory, but that's the basic idea.)

It works, and works well, but a lot of thought and effort was put into modeling the domain as CRDTs to capture the required semantics.

(Here, for ordered set, you can use a list CRDT, drop duplicates, and tack on an add-only set of tombstones to keep move/delete conflicts from resurrecting rows/columns.)

aboodman · on Nov 25, 2019

"often" isn't that useful to me as an application developer. If I'm going to build local-first software and put it out on hundreds of thousands of desktops, once it's deployed, the database is outside my control.

Yes, I need convergance, but I also need to ensure application invariants are maintained.

I agree with parent that a CRDT isn't that useful for local-first software unless you can encode arbitrarily complex programmatic logic in the transitions. That way I as a developer have control over how merges happen globally and all invariants. It's too complex to reason about otherwise.

It's not clear to me if this paper gives me what I'm looking for -- I'm looking forward to digging into it.

There are other approaches to CRDTs that I'm more excited about: Statebox (https://github.com/mochi/statebox) from years ago is a CRDT that encodes arbitrary transitions and converges by rewinding and replaying in a consistent order. This is basically the same trick blockchains typically do to merge forks. Radicle (https://radicle.xyz/) is another related approach.

I think things like this are the better path forward for local-first software, personally.

mathetic · on Oct 29, 2019

So I use QuickCheck often and I'm a true believer, but the point this post is making is absolutely justified. Tying the test generation to the type is often requires some creative ways of writing `Arbitrary` instances (the specification of how test cases should be generated).

However, I don't think what the author is suggesting in its place better. The advantage of tying the test case generation to the type is that it's compositional. Lots of sub Arbitrary instances come together to generate the test case and in the long term I find this to be more maintainable. Whereas the author's approach discourages compositionally.

The real problem is that the types in vanilla Haskell are too loose, so there lots of inhabitants of a type that we are not interested in. The author mentions in passing, dependent-types might be a solution and that is exactly right. What is needed is to make the irrelevant terms illegal altogether. This is pretty achievable and is already somewhat used. For example, the program synthesis (proof search) mechanism of Idris 2[0] uses linear[1] dependent types[2] to heavily constrain which terms are possible candidates for a given type. The specificity of the type reduces the search space, hence increases the effectiveness of the proof search.

[0] https://github.com/edwinb/Idris2

[1] https://en.wikipedia.org/wiki/Substructural_type_system#Line...

[2] https://en.wikipedia.org/wiki/Dependent_type