Introducing Cosette – a SQL solver for checking semantic equivalences of queries

LaPrometheus · on July 20, 2017

Author here. Happy to answer questions.

lukaseder · on July 20, 2017

Nice idea, but it breaks quite quickly (still). Tried to see if Cosette can prove equivalence of these two queries:

    /* define schema s1, 
       here s1 can contain any number of attributes, 
       but it has to at least contain integer attributes 
       x and y */
    schema s1(x:int, ya:int, ??);
    
    schema s2(yb:int, ??);        -- define schema s2
    
    table a(s1);            -- define table a using schema s1
    table b(s2);            -- define table b using schema s1
    
    query q1                -- define query q1 on tables a and b
    `select distinct x.x as ax from a x, b y
     where x.ya = y.yb`;
    
    query q2                -- define query q2 likewise
    `select x.x as ax from a x, a y, b z 
     where x.x = y.x and x.ya = z.yb
     union select 1 as ax from a x where 1 = 0`;
    
    verify q1 q2;           -- does q1 equal to q2?

Got a syntax error:

    Syntax Error. 
    ERROR: (line 19, column 8):
    unexpected "s"
    expecting "--" or "/*"

There are many other bugs up to the point where this is not really usable.

LaPrometheus · on July 20, 2017

Thanks for trying out. The parser error report is not great now. The real error actually, it should be "union all" rather than "union".

I changed your query:

/* define schema s1, here s1 can contain any number of attributes, but it has to at least contain integer attributes x and y */ schema s1(x:int, ya:int, ??);

    schema s2(yb:int, ??);        -- define schema s2
    
    table a(s1);            -- define table a using schema s1
    table b(s2);            -- define table b using schema s1
    
    query q1                -- define query q1 on tables a and b
    `select distinct x.x as ax 
    from a x, b y
     where x.ya = y.yb`;
    
    query q2                -- define query q2 likewise
    `(select x.x as ax 
     from a x, a y, b z 
     where x.x = y.x and x.ya = z.yb)
     union all (select 1 as ax from a x where 1 = 0)`;
    
    verify q1 q2;           -- does q1 equal to q2?

The Rosette execution indeed returns a counterexample (since they are not equal). There is an error in Coq code generation part, we are fixing that now.

lukaseder · on July 21, 2017

Yes, I understood later on that you're not supporting union yet. Would have been nice, though, there are quite some caveats when trying to transform distinct / union into one another... :)

Btw, another parser error happened when I indented the code, e.g. pasting the indented code from Hackernews here into your tool. The parser then complains about \r or something like that.

LaPrometheus · on July 25, 2017

Please try again now :)

lukaseder · on July 20, 2017

Me again - I really like the idea. I wish I had something like that at a previous job, where queries ran against some 10-level nested views, involving tons of unnecessary joins, projections, unions, sorts, distincts, and what not.

I'm pretty sure that Oracle figured it out all correctly, so the excessive SQL wasn't causing performance issues, but from a maintainability perspective, it would have been nice to have a tool that was able to reduce arbitrary SQL to a more minimal, semantically equivalent alternative.

Have you thought about this?

LaPrometheus · on July 20, 2017

Exactly, SQL minimization is a real problem. While there is no canonical form of all SQL in general, it is still possible to reduce Query.

There is some existing work. You might want to look at this one: https://pdfs.semanticscholar.org/c243/25d76c3ba91388e16085c1...

One problem is that bag semantic chase is very complicated to implement. We are actually working on a new chase algorithm on our SQL formalization right now.

lukaseder · on July 21, 2017

I'm sure bags cause some trouble, but I had something much "simpler" in mind. For instance, JOIN elimination is a powerful tool implemented by most modern RDBMS' optimisers.

Take this query:

    SELECT c.*
    FROM customer c
    JOIN address a
    ON c.address_id = a.address_id

With a foreign key on customer(address_id), the above JOIN can easily be proven useless, and thus eliminated. Furthermore, take this query:

    SELECT c.*
    FROM customer c
    LEFT JOIN address a
    ON c.address_id = a.address_id

In this case, we don't even need the foreign key as long as there is a unique constraint on address(address_id).

lukaseder · on July 20, 2017

Another question: How do you plan on handling non-deterministic expressions, such as CURRENT_TIMESTAMP or ROWNUM, et al.?

LaPrometheus · on July 20, 2017

CURRENT_TIMESTAMP is truly random. One thing we can do is to over approximate. For example, let CURRENT_TIMESTAMP be a symbolic variable, which essentially can be any value. If we still can prove the equivalence, then the equivalence is sound. But there might be some false negative.

ROWNUM is interesting. You can actually declare ROWNUM as an extra attribute of the table and then assert it is a key (We will add support of key constraints very soon). Then many cases will be covered.

lukaseder · on July 21, 2017

ROWNUM is actually more tricky than an extra attribute of the table, as it is influenced by ORDER BY. I guess I wouldn't prioritise support for ROWNUM, though. There are more interesting, and more standard things. E.g. comparing a correlated subquery to a window function.