Minimize Code, Maximize Data

olavk · on May 5, 2008

It's a good point, but of course someone is going to say that code is just a kind of data. So to be more specific, I like a principle like TBL's "Principle of Least Power" http://www.w3.org/2001/tag/doc/leastPower-2006-01-23.html : You should express information in the least powerful format possible. If possible, it is better to express logic in a constrained (e.g. declarative) DSL than in a general purpose language, and even better if it can be factored out in configuration files or a database.

kendowns · on May 5, 2008

Yes, the code-is-data is one of those frustratingly true-but-out-of-context things. I created a complete framework that builds databases out of a text file (YAML it so happens) including security and automations (http://www.andromeda-project.org), but was unaware of TBL's essay. I will have to read it carefully and determine what citations may be in order.

daniel-cussen · on May 5, 2008

Reminds me of Norvig's "more data beats better algorithms."

kendowns · on May 5, 2008

A cursory google search gave no obvious explanation of this, do you have a specific link to serve as a starting point?

abstractbill · on May 5, 2008

The grandparent was referring to this talk:

http://www.justin.tv/hackertv/98128/Peter_Norvig_Director_of...

bluishgreen · on May 6, 2008

How about this? http://anand.typepad.com/datawocky/2008/03/more-data-usual.h...

Norvigs point was similar, but Anand has more convincing examples.

dnaquin · on May 5, 2008

Minimize what you don't understand, maximize what you do?

The code-centric solution, which I mentioned we are afraid to touch, is full of conditionals and branches that make it dangerous to mess with for fear of causing unintended side-effects.

Just a case of programming by coincidence.

scroyston · on May 5, 2008

I'm all for data over code, but I'd use a Rete network so my program didn't run in O(DISTRIBUTIONS * RULES) time. Yes data design is important, but you need to be algorithm savvy in order to know the best way to design your data structures.

kendowns · on May 5, 2008

I think that in this case you get a non-trivial performance gain with some basic implementation improvements. You can replace the row-by-row round trip with a single SQL UPDATE with a LIMIT (or a LIMITed subquery if the server does it that way).

Also, the performance of the algorithm in the essay would be linear to the number of rows requiring update, can you elaborate on whether the RETE alogorithm can do any better in this case.

And finally, thanks for the note on RETE, I will have to investigate that.

scroyston · on May 6, 2008

Comment space is a bit limited to do an adequate explanation. For rules that have straightforward (but possibly compound) predicates, Rete will give you O(1) lookup. From wikipedia: "In most cases, the speed increase over naïve implementations is several orders of magnitude (because Rete performance is theoretically independent of the number of rules in the system)."

http://en.wikipedia.org/wiki/Rete_algorithm

My favorite reference on the subject is at the bottom of the wikipedia page: "Production Matching for Large Learning Systems - R Doorenbos"

Sadly (and strangely) many of the "Rete Engines" today do it wrong and will not give you effecient lookups.

wensing · on May 5, 2008

"Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming." --Rob Pike, 'Notes on C Programming'

Hexstream · on May 5, 2008

I very much agree with this article.

Reminds me of the evolution of my lisp HTML compiler. At first I simply made a macro that transforms:

  (:tag :attribute "value" (:p "Paragraph"))

into appropriate code to output the HTML at runtime. But the problem is that I couldn't inspect or transform my HTML; this macro-based scheme was way too brittle.

So instead I wrote another macro with the same syntax but that instead generates code to build a semantic representation of HTML at runtime:

  (make-instance 'html-node :type "tag" :attributes (cons "attribute" "value" :children (list (make-instance 'html-node :type "p" :children (list "Paragraph")))))

I then pass this representation to a compiler that first optimizes ("flattens" the structure and appends contiguous strings together) and then compiles into an efficient tree of closures. I still have a version of the macro that generates efficient code to output the HTML directly that I use in my dynamic HTML generation functions, but I use that only when necessary.

The advantage is that now I can do all sorts of static analysis on my HTML (I use a similar scheme for CSS). Believe it or not, in a "dynamic" site there's TONS of static (as in, known before someone even tries to access the page) information. I even have a scheme to "inject" stuff into pages, for example the (page-sensitive) navigation is automatically "baked" into the appropriate pages.

Eventually I want to make a "site debugger" with all the semantic data I have about my site. I'll have powerful querying capabilities to answer questions like: "What pages in my site have an A tag that links to page X and is nested in a container affected by CSS rule Y?". Maybe I can even make some kind of advanced editor. Owning your data opens lots of interesting possibilities indeed!

So, I get all the semantic data plus the great speed. The best of both worlds. I think this really illustrates the importance of compilers and of languages like lisp that facilitate their implementation. I never made a low-level compiler and it will be some time before I'm knowledgeable enough to make one but lisp allowed me to make those high-level HTML and CSS compilers comparatively easily.

Another important thing is separating policy from mechanism, one obvious problem of the programmer that represented rules directly in code as depicted in the article is that he hard-coded policy into his program, but the thing is that policy is volatile so it really belongs in a config file.

I think this somewhat resonates with the quote: "Programs must be written for people to read, and only incidentally for machines to execute." Except programs themselves often want to read programs for the purpose of analysis or manipulation, hence the advantage of representing the most logic you can in a "dumb" language or data format.