Torvalds' quote about good programmers (programmers.stackexchange.com)
244 points by lx on Sept 23, 2012 | 104 comments


This is one of the few programming quotes that is not just abstract crap, but something you can actually use to improve your programming skills 10x, IMHO.

Something like 10 years ago I was lucky enough that a guy explained this stuff to me (which I was starting to understand myself, btw), and programming suddenly changed for me. If you get the data structures right, the first effect is that the code becomes much simpler to write. It's not unusual to drop 50% of the whole code needed just because you created better data structures (here "better" means: more suitable to describe and operate on the problem). Or something that looked super-hard to do suddenly becomes trivial because the representation is the right one.


+1 ... and it's been known for a long time. Back in the '70s, Fred Brooks said "Show me your [code] and conceal your [data structures], and I shall continue to be mystified. Show me your [data structures], and I won't usually need your [code]; it'll be obvious."


any concrete examples you can point to?


Suppose you want to manage the marital status of some people. You could have one data structure, here a table in a database, where you keep (name, status) tuples. This is bad, for many reasons: what if someone changes name? If you need to allow undoing, how can you know which was the previous status before "married" was entered (widowed? single?)?

A better data structure here is an event table (who, did what, when): then not only do you know the marital status, you also know where it comes from, and can build much more on this data.
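For instance, such an event table might be sketched like this (a hypothetical sqlite3 sketch; the schema and sample rows are made up for illustration):

```python
import sqlite3

# Append-only rows of (who, did what, when) instead of mutable (name, status) tuples.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE events (
    person_id INTEGER,
    event     TEXT,   -- e.g. 'single', 'married', 'widowed'
    at        TEXT    -- ISO-8601 date
)""")

conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    (1, "single",  "2001-01-01"),
    (1, "married", "2005-06-15"),
    (1, "widowed", "2010-03-02"),
])

# Current status = most recent event; the full history (and undo) comes for free.
row = conn.execute("""SELECT event FROM events
                      WHERE person_id = 1
                      ORDER BY at DESC LIMIT 1""").fetchone()
print(row[0])  # -> widowed
```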


You're completely neglecting how the data will be used: by using an RCS-like storage of deltas you penalize the common case of wanting to query the current state efficiently.

A better way would be to store a person table without name or marital status, and use its id as a key into a detail table that has multiple rows per individual, along with dates, so each row represents the person's state: married, name, from, to.

But there are also about a dozen other solutions better than the one you describe.


What you describe is just a denormalization of my solution. You are already optimizing a proposal that was designed as a very short example of a better data structure. That's absurd. If you often need to access the current marital status you can cache it, store it on the client, use a materialized view, write it in a file along with other info, etc.

I would personally advise against the plain denormalization you propose (if I understood it well), because it means your application logic will have to handle it, and your data structure will not produce the very simple, straightforward code that is the hallmark of a good data structure.


If you want to keep marital status history, then marital status is a slowly changing dimension (SCD) [0].

With Rails-like conventions, here's a minimalist Type II SCD definition:

    +--------+    +-----------------+    +----------+
    | people | -> | people_statuses | <- | statuses |
    +--------+    +-----------------+    +----------+
    | - id   |    | - id            |    | - id     |
    | - name |    | - person_id     |    | - label  |
    +--------+    | - status_id     |    +----------+
                  | - until         |                
                  +-----------------+                
where people_statuses.until is the last date on which this relationship is valid.

The logic follows from the data structure definition in a natural manner.

'WHERE people_statuses.until IS NULL' will immediately give you the latest status of someone/everyone. You can trivially update one's status by UPDATEing 'until' and INSERTing a new record, wrapped in a transaction. Additionally, you can use 'until' as a guard WHERE clause for such an update to implement a form of optimistic locking.

You can also add a column relating a person to another. With a slightly more complex query you can easily make the relation symmetric and remove the need for 'duplicate' reciprocal records.

Handling name changes and preserving navigable history in the people table is not much harder.

The wikipedia page about SCDs gives interesting cases.

[0] http://en.wikipedia.org/wiki/Slowly_changing_dimension
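The schema above can be exercised end to end; here is a hypothetical sqlite3 sketch (table contents made up, column names as in the diagram):

```python
import sqlite3

# Minimal Type II SCD: people, statuses, and an association table with 'until'.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE people   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE statuses (id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE people_statuses (
    id INTEGER PRIMARY KEY, person_id INTEGER, status_id INTEGER, until TEXT
);
INSERT INTO people   VALUES (1, 'Alice');
INSERT INTO statuses VALUES (1, 'single'), (2, 'married');
INSERT INTO people_statuses (person_id, status_id, until) VALUES (1, 1, NULL);
""")

# Status change: close the open row and insert a new one, in one transaction.
with db:
    db.execute("""UPDATE people_statuses SET until = '2012-09-23'
                  WHERE person_id = 1 AND until IS NULL""")
    db.execute("""INSERT INTO people_statuses (person_id, status_id, until)
                  VALUES (1, 2, NULL)""")

# 'until IS NULL' gives the current status; older rows preserve the history.
label, = db.execute("""SELECT s.label FROM people_statuses ps
                       JOIN statuses s ON s.id = ps.status_id
                       WHERE ps.person_id = 1 AND ps.until IS NULL""").fetchone()
print(label)  # -> married
```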


That's interesting, thanks. I still wonder if a simpler event table is not a better data structure, mostly because it is append-only: no need for updates. Granted, getting the current status is a bit slower, but keeping a pointer to the latest event, plus chaining events, can fix that.


Try to implement the game Asteroids without a thought about data structures; just start programming it procedurally as it comes to you. See how far you get in, say, four hours. (Find a graphics library first, of course.)

Then, use a very simple object oriented model, where everything on-screen (asteroids, ships, enemy ships, shots) has a draw method, a move method, a create method, and an I'm-hit method, together with logical internal state like position and velocity. See how far you get in the same four hours.

Note how much easier it is to do the second way. That's the power of data structures (and, admittedly, some simple oo ideas, deployed in a lightweight way).
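A skeleton of that second approach might look like this (hypothetical sketch; graphics calls omitted, and the class and method names are my own):

```python
# Every on-screen entity shares the same interface: draw, move, create (__init__),
# and "I'm-hit", plus logical internal state like position and velocity.
class Entity:
    def __init__(self, x, y, vx, vy):
        self.x, self.y, self.vx, self.vy = x, y, vx, vy
        self.alive = True

    def move(self, dt):
        self.x += self.vx * dt
        self.y += self.vy * dt

    def draw(self):
        pass  # would call into the graphics library

    def hit(self):  # the "I'm-hit" method
        self.alive = False

class Asteroid(Entity): ...
class Ship(Entity): ...
class Shot(Entity): ...

# The main loop never cares which kind of entity it is handling:
world = [Asteroid(0, 0, 1, 1), Ship(10, 10, 0, 0), Shot(5, 5, 3, 0)]
for e in world:
    e.move(dt=1.0)
    e.draw()
print([(e.x, e.y) for e in world])  # -> [(1.0, 1.0), (10.0, 10.0), (8.0, 5.0)]
```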


This is the power of designing before you code. Nothing to do with data structures. Data structures are part of the design but you're focusing on the object oriented design aspect.


You know, I put in a caveat anticipating this objection, but since you raised it, let me offer the following counterpoint:

You could implement the approach I described in a language without any OO support, and still come out ahead.

Just do it in plain C using function pointers for methods, and write your own dispatcher. Or, do it in assembly (I implemented this approach in PDP-11 assembly).

If you do it this way, it really is all about the data structure -- the methods are subordinate to the data structure, acting just like other state. You could even change the methods after objects have been created -- for example, swap in the "evasive maneuver" move method once you fire on an alien ship, so that the alien goes from lazy drifting to taking evasive action.
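The same "methods are just data" idea can be sketched even in Python by storing the move function alongside the rest of the state (a hypothetical illustration of the C function-pointer technique, not how the original was written):

```python
# Each object carries its own move function, which can be swapped at runtime,
# just like overwriting a function pointer in a C struct.
def drift(ship, dt):
    ship["x"] += ship["vx"] * dt

def evade(ship, dt):
    ship["x"] += ship["vx"] * dt
    ship["y"] += 5 * dt          # hypothetical evasive jink

alien = {"x": 0.0, "y": 0.0, "vx": 1.0, "move": drift}
alien["move"](alien, 1.0)        # lazy drifting

alien["move"] = evade            # fired upon: swap in the evasive maneuver
alien["move"](alien, 1.0)
print(alien["x"], alien["y"])  # -> 2.0 5.0
```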


Well, let's say I decide to have two copies of an important number in my program. Now in various places in code I need to update that number. In every place I now need to do the update twice. Also, some of my code needs to read the number when it changes. I could go ahead and just call all of the routines that need to fire when I change the number.

Or I can decompose my application so that I maintain only one number, and modify the update mechanism to support firing off routines when it updates.

So, we went from two numbers, to one number and a lookup table (the table that tracks which routines need to fire). And the code will be vastly simpler in the second case.
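A minimal sketch of that decomposition (names are mine, for illustration only):

```python
# One authoritative number plus a "lookup table" of routines to fire on update.
class Observable:
    def __init__(self, value):
        self._value = value
        self._listeners = []

    def subscribe(self, fn):
        self._listeners.append(fn)

    def set(self, value):
        # The single place the number is ever updated.
        self._value = value
        for fn in self._listeners:
            fn(value)

    def get(self):
        return self._value

seen = []
count = Observable(0)
count.subscribe(seen.append)   # this routine fires on every change
count.set(41)
count.set(42)
print(count.get(), seen)  # -> 42 [41, 42]
```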

(This is probably not the best example - there are probably some nice database normalization problems that highlight the benefits of good data structures a lot better).


Absolutely right. I was lucky enough to learn this in college. Although, I did not learn it from the CS professors but rather my physics prof. He was a champion for a language called APL and he actually cut a deal with the CS department to accept credits for taking an APL class he was teaching as a substitute for the FORTRAN class. APL was an amazing mind-opening experience.

Throughout the APL 101 and 102 courses he would repeat this mantra: "Work on your data representation first. Once you have fully understood how to represent your data start thinking about how to manage it with code."

He would throw this at us constantly. At the time it sounded like our physics prof had lost his marbles (he was a very, shall we say, eccentric guy). It would take a few years after college for me to realize the value of that advice.

Put another way, our business is managing data of some sort. Whether you work on embedded systems or web applications, you are always dealing with data. You can make your programs far more complicated than necessary by neglecting to choose the right (or a good) representation of your problem space (data).

I equate it to designing an assembly line. Anyone who's watched a show like "How It's Made" cannot escape the realization that efficient manufacturing requires engineering an efficient assembly process. Sometimes far more engineering work goes into the design of the manufacturing process and equipment than into the part that is actually being made. The end result is that the plant runs efficiently and with fewer defects than alternative methods.

In programming, data representation can make the difference between a quality, stable, bug-free and easy to maintain application and an absolute mess that is hard to program, maintain and extend.


Your prof was onto something that seems to be very in the zeitgeist today. To "understood how to represent your data" you have to understand what it is you're trying to represent. Eric Evans popularized this notion with Domain-Driven Design.

If you follow this line of thinking far enough, you realize that computer programming is just applied analytic philosophy. You have your metamodel (logic/programming language) and then you build your model (ontology/software).

I really like your assembly line metaphor. Knowing that a "customer" has a name and email address is almost of no importance compared to understanding how the "customer" information arrives, the actions around the "customer," and the end result of the actions. That's the assembly line.


I think that, to some degree, becoming an effective APL programmer almost required becoming good at data representation. If you want to do things the "APL way" you have to think about structures that work well with array, matrix and tensor processing ideas. When you could represent your data in the most optimal way you could sometimes write a function with only a few operators that could do what needed to be done to that data.

This means that, to some extent, data representation might have several local minima (in terms of being optimal), with each population of possible representations existing around the language or toolset you are going to use to process the data.

What I mean is that the optimal data representation for an embedded system (as an example) written in assembler might be different than that of the same system written in C or Forth.

I don't use APL very much any more. Back when I did I was also programming in Forth, C and Assembler. I know that the decisions I would make --in terms of optimal data representation-- varied from language to language. This is partly true because each language, in my case, was being used to deal with different problem domains. For example, APL could do high-level computations while Forth was great for low-level, real-time stuff.


Interesting comments. I've been working with an APL derivative, and this sounds similar to how I have been thinking about things. It's a matter of massaging the data to get it into a certain representation, e.g. a matrix; from there it's very, very easy to work with using APL. As you say, a few operators is all you need. The code is very terse. Very powerful. The real work seems to be shaping the data into the right representation first: a matrix. I use non-APL-descendent programs, the usual UNIX utilities, to do this for now.


Right. I remember having fun with this doing an APL application to aid in DNA sequencing. There were a number of ways to represent the data provided by the sequencer. At one point it almost became a game to see how small of an expression one could create to process the data by changing the representation.

With APL one has to be careful not to create monsters that cause geometric expansion in memory needs.

If the data set is large and the expressions processing the data cause frequent expansion into matrices or tensors (n-dimensional data structures where n > 2), one could end up with geometric or exponential memory requirements. This, again, is another case of having to understand and fit data representation to the programming language AND the approach one will use to work with the data.

While languages like APL can be great, they can be disastrous in the hands of a programmer who does not understand what might be going on at a lower level. Sometimes there's nothing better than good-old low-level C.


Thank you, and you have just reminded me why I am liking the explosion of NoSQL stores: we have for a very long time been storing all our data in one factory design, with one really flexible and powerful layout.

Being able to have a red-black factory is rather nice. Although it does mean we now need to think carefully about what factory we shall need before even starting. And accepting occasionally moving the whole factory three blocks over, during production.


You can normally fix bad code - fixing bad data structures is not usually easy or even possible.

It's why I've still not fully bought in to 'release early release often'.

I prefer to defer releasing for production use until really satisfied with the structures - this way you have no barrier to ripping the foundations up.

If not 100% comfortable with the model - prototype a bare metal improved one (schemaless DBs ftw) - if it feels better start pasting what logic/tests you can salvage from the earlier version and move on.


I'm in the position of maintaining a legacy codebase. I feel like I've shown up half-way through a game of Jenga and management still wants me to play the game with the same speed as the guy who played the 1st half.

Meanwhile, he's been promoted to start work on a brand-new Jenga tower since he's demonstrated such remarkable success in the past.

I just want everyone to stop playing Jenga.


I've always, only half-snarkily, said that if you have never had to maintain/modify someone else's code then you probably write code like an asshole. Writing code that is both easy to understand and maintain and correct can be difficult, lots of people just go for the latter.

Unfortunately as you pointed out, especially in large companies, people can get promoted before the deficiencies of their previous work become clear. This leads to people never learning because the feedback loop is too long, or worse, they never deal with their past code so are oblivious to all its shortcomings and just think they are awesome. Also management tends to reward based on accomplishments today without an eye for costs to be born down the road, which gives perverse incentives to "get it done" programmers who leave mountains of technical debt in their wake for others to deal with.


> Writing code that is both easy to understand and maintain and correct can be difficult, lots of people just go for the latter.

They believe they go for the latter, but actually they don't. If their code was easy to understand and correct, it would have fewer defects to begin with.

Your second paragraph I totally agree with. I've dealt with such code. Sometimes, I can halve its volume simply by applying local correctness-preserving transformations. That is, without even knowing what the code is doing. I even spotted some bugs in the process.


'correct' is always only about a given specification, that is right for a limited period of time, assuming needs are well understood. It is very well possible to write satisfactory code one day that becomes inadequate the next.

I won't deny the presence of bugs though, there is endless evidence that bugs always exist.


Correct, I was talking mainly about code that 'works' (i.e. is 'correct', for the given spec) but is highly sub-optimal, confusing, tightly coupled with other code, etc...


If only there was a way to measure programmers on the robustness and potential of their code, and not just on "they wrote a lot of it."

The system seems to be that it is much more advantageous to your career to rapidly produce gobs of spaghetti -- and confound everyone around you -- than to build elegant code that enables everyone around you. You look better when everyone but you is confounded; you look replaceable when everyone around you can extract as much value out of your clean, powerful code as you can.


The more code I see, the lower my estimation of the coder. The mark of a talented engineer is how much code they delete.


The whole point of continuous delivery is that correcting things like data structures is no longer the big deal it once was because it happens frequently. Rather than letting months (or years!) of data migrations pile up, you have a few days worth (or weeks). In my opinion, it is best to ship something that works today and have a system in place that makes correcting it as painless as possible.

That's the real problem with the current model. The data structures will never be perfect and you cannot know how they will change. Yet they do. Then all the FUD from the last migration that scared everyone prevents the team from due diligence and correcting issues when they are discovered. The team waits until the problem comes to a head, management has to be involved, new FUD is created and people dream about perfect data structures to prevent this whole mess.

Remember,

Shipping > Shipping Shit > Not Shipping


As a new programmer I know I should ship a lot faster than I do, but focussing on data structures makes me really slow. I can usually jump a hurdle with a hack on my extant structures, but this introduces code complexity and leaves me at square one when a similar hurdle appears elsewhere. I try to be disciplined about fixing stuff at a data structure level, but changes there set off change propagations throughout the code. Or I introduce hacks into the data structures, which then become convoluted and start acquiring code of their own.

I find unit testing does help with all this. It forces exposure of the data structures, essentially documenting them. And good coverage gives a list of breakages and sometimes helps find elegant repairs. But I also find myself wanting higher-level tests (essentially integration tests) that check not components but overall behavior, and I find writing and maintaining these becomes a real problem.

But I really, really wish I had better tools / procedures for thinking through the problems and designing a proper data model for solving them.


This article was an eye opener for me:

http://www.dodgycoder.net/2012/07/old-school-developers-achi...

"Old school developers - achieving a lot with little"

I'll quote:

> [Ken Thompson] debugs only via printf statements, hardly ever uses unit tests, starts his projects by designing the data structures and then works bottom up, with throwaway test stubs.

Also, Joe Armstrong, the father of Erlang:

> He uses prototypes to solve the hard problems first, and for debugging, just uses print statements. He is a critic of Object Oriented Programming, and favours functional programming languages like Haskell. He never uses an IDE, preferring just Emacs and the command line (no mouse required).

Also, Jamie Zawinski:

> During development he hardly ever uses unit tests, believing it slows things down - he thinks there's a lot to be said for getting the code right first time. In his view, it's a matter of priorities, "do you want this to be good software or do you want it to be done next week - pick one because you can't have both".


That article summarizes parts of the book Coders at Work, which has all sorts of other interesting stuff in it. The book's web site gives brief bios for all the programmers who are interviewed in it: http://codersatwork.com


Agree, fixing bad data structures is much more painful than fixing bad code. The reason is that the deployment of the refactoring has the complexity of a new deployment, or even higher.

However, given that at large organizations updates and deployments can easily become political issues, it's a good habit to deploy often. That makes your life easier when trying to deploy new changes, because those who are watching or performing the deployments get used to it, and to the errors occurring during such deployments.


You're just hung up on your definition of "release", which is "release for production use". "Release early release often" doesn't dictate how you release your application, just that you expose it in some form (private beta, public beta, pre-release) to the real world for vetting. Projects that fail to vet their assumptions are more prone to poor data structures and over-engineering.


Release early, release often gives you chances to fix your mistakes before it's too late. Waterfall works great if you can manage to get your data structures perfect before production. It begins to fail when it becomes prohibitively expensive to fix mistakes after production, where not unexpectedly breaking old code has precedence over deploying new.


There's other options than "release early release often" and "waterfall".


What might those be? Honest question. My inexperience is probably showing, but I have a hard time picturing what the middle ground might look like.


One of the classics is basically the midpoint between them, AKA "throw one away":

Requirements, design, DEMO; then requirements, design, implementation, verification, release, maintenance.

The idea was to use a prototyping language for the demo and then a production-worthy language for the release. The problem was that a lot of demos ended up in production for literally decades, because people tried to create 'over-architected crap' which got scrapped or took decades to release.

Honestly, I think the real problem with most early development strategies is that so few people have a clue how to actually design good software. Great solutions are minimal systems that solve the problem: flexible because they are minimal and programmers can alter code, rather than because someone conceived of every possible change request up front.


One of the big tricks here is not to be religious about any technique. Pick what fits best given the information that you've got for the situation that you find yourself in and don't be afraid to change the mix over time as the situation changes or you find yourself in the possession of new knowledge that is inconsistent with your past views on the state of affairs.

If you blindly adhere to some method or other then you're going to find out exactly what the limitations are so you are going to have to be flexible and you're going to have to mix-and-match as time goes by.

As an example, 'agile' comes up here with some regularity. It's a great principle but it's not a religious thing. Feel free to adopt some but not all of agile to come out ahead. Adopt all of agile in a religious fashion and you'll come out a loser.


Lifecycles are either sequential or iterative. But there's a lot of differences in the lifecycles, and many more models than usually assumed. For example spiral development, throwaway prototyping, evolutionary prototyping, staged delivery, design-to-tools, design-to-schedule, modified waterfall (a sequential, iterative process) and more.

Read McConnell's 'Rapid development' for a walkthrough of the ones above and then some.


This is approximately the same reason as why I start out writing most of my programs by creating a bunch of types, and why I find dynamic programming languages uncomfortable to use.

I'm less and less a fan of the ceremony of object orientation, but I think there's a lot to be said for having a succinct formalized statement of your data structures up front. Once you understand the data structures, the code is usually easy to follow. The hardest times I've had comprehending code in my career, apart from disassembly, have been from undocumented C unions.


It sounds like you'd really like Haskell. It gives you a far more succinct way to represent your data. Since the overhead of creating a new type is very low, you also become far more likely to express more of your logic in the types.

I always start my Haskell projects by laying out the data types. The type declarations are very readable--you can just skim over them to see what's going on. This means that you can get a very good overview of what the project wants to do just by quickly looking over what types it declared, and then looking through functions' type signatures.

Then, after you have the data types defined, the resulting code is not only easier to read but reflects the data operations it is doing. I've found the type signatures above each function--which are optional but highly recommended--really help tie the code back to the data it operates on. Additionally, pattern-matching against the data makes the structure of most functions clearly reflect the exact data it is working on.

The low overhead of creating Haskell types also makes it very easy to add aliases to existing types. So perhaps you use a `Map Int String` in your code; you then give it a domain specific name:

    type IdMap = Map Int String
so your functions refer just to that. Then, when somebody comes along and tells you about `IntMap`, refactoring all your code to use `IntMap String` is far easier!

So if you really like a more type/data directed style of programming but are getting tired of OOP, you should definitely check out Haskell (or something similar like OCaml).


This is a very apropos recommendation. I love the way Haskell makes you think like this. Even better I love how you can continue to apply that kind of thinking to programming in any other language.


Data structures != Types

Dynamic languages are not an argument to make bad use of data structures in any way, it changes nothing.


I think you missed my point. I wasn't arguing against bad use of data structures; just for a formalized, checked description of the data structures used in the program.

I have in the past written preprocessors for this purpose: a simple language just to describe the types used in the program, with the output being the implementation language's definition of the types, as necessary. I've even done it in Javascript, in a very roundabout way: JScript.net producing XML, which was consumed at runtime to create classes dynamically, which in turn were used by a dynamic programming language.

I'll go a long way to get a simple up-front map of the program's data in a checked format.


I totally get what you are saying. I also have the difficulty that when programming in Clojure and starting on a new problem, there is no really good way of writing down the data structure. I still love the flexibility of Clojure, though, so now I am trying to work around this by writing down proper documentation of the data structure instead of representing them exclusively with types.


"This is approximately the same reason as why I start out writing most of my programs by creating a bunch of types, and why I find dynamic programming languages uncomfortable to use."

Classes. There are your types. Python has them. Ruby has them. Javascript has them. . . .


If you can't see the qualitative difference between algebraic data types and Javascript classes, there's no point discussing things.


Precisely. If you think that classes are all that you are ever going to need then you should read up on the Expression Problem

http://c2.com/cgi/wiki?ExpressionProblem


OK.


Javascript only has Objects not Classes.


Thanks, I stand adjusted. And of course you can use prototypes (objects) for the same intent as classes, which was the source of my error.


Next week on Hacker News: Bad Programmers worry about their code. Good programmers ship.

"Bad programmers [technique A on programming KPI metric N1]. Good programmers [technique B on programming KPI metric N1]."

Responses: Someone will ask, "What about metric N2?" And someone will say, "What about technique C?" Someone will post a personal anecdote showing that people really underestimate the value of A. Someone will respond to that by posting a hyperlink to an anecdote that shows technique B really is what matters.


10 people learn about techniques A, B, and C who didn't before. 10 other people start thinking in terms of metrics N1 and N2 who weren't before. We learn and improve collectively. I think that is a pretty amazing thing about the internet and boards like this.

That's not to say that some things don't get passed around a lot, but that's generally because they're worthwhile enough to make sure that everyone gets a look.


I agree 100%.

To me, one of the downsides of many internet discussions is that complicated issues seem to get reduced to a single dimension just to get enough traction to have the conversation. And then most of the conversation is between the camps that have internalized and accepted the simplified mental model of the problem and those that haven't. Maybe that is a good thing, for the reasons you highlight.

In my opinion, it feels like we are fishing with a bigger net, but we aren't fishing any deeper now than we were 3 years ago.


>In my opinion, it feels like we are fishing with a bigger net, but we aren't fishing any deeper now than we were 3 years ago.

Well, whose fault is that? Pick a perspective and follow it deeply, write about it, and fish deeper. All it requires is commitment.

You could certainly pick a worse position than the OP's! Data is far more important than code, although I fear that this point is obscured by the "web service" trend that tends to spread data across many legal/economic/technical 'zones of control'.

Anyway, I hope that rather than just lament the state of things, that you take action to change it, at least for yourself.


Sure, and after six months, 5 of those 10 people learn more stuff and the other 5 become dogmatic mouthpieces for techniques A, B, and C.

The ideal approach is to not just teach someone a new approach but to make it clear that they have to keep learning rather than giving them the impression that they've finally found the holy grail of programming.


To paraphrase, if I may, a novice imagines that the goal of programming is to create code which solves a problem. This is true, but limited. The real goal is to create an abstract model which can be used to solve a problem and then to implement that model in code.


I really hope that people reading this know to apply Ockham's Razor to their abstract models (lest they write a lot of AbstractSingletonProxyFactoryBeans).


I see a program as a theory, a theory of the problem it solves. You can see how well it generalises, if it is needlessly complicated (Occam)... and in some magical moments, you'll find it predicting phenomena you hadn't explicitly anticipated.

So I think a program's conceptualisation of a problem is the most important thing - more important than data structures or code. Though, data structures are usually closer to it, by representing the concepts you model.

However, it's really hard to get these things right. Linus created both his great successes (linux and git) after experience with similar systems (unix/minix and bitkeeper). Being able to play with an implementation, experience its strengths and weaknesses, gives you a concrete base to use, reason with, push against, and come up with new insights - it's enormously helpful in seeing the problem.

But that's a grand vision - I wonder if Linus is also talking about programming in the small, each step of the way, as a low-level pragmatic guide. I don't like git's interface or docs much, but the concepts are great, it is implemented really well, very few bugs, and even on ancient hardware boy is it fast.


This is Normalization rearing its head. A properly normalized database can be extended without needing refactoring and does not have modification anomalies. There is a formal process to normalization. There is no such equivalent in code, but a poorly normalized data model virtually guarantees that any code wrapped around it will be messy. Conversely, mediocre code wrapped around a clean data model (less common in the wild) is much more amenable to incremental improvement.


"good data structures make the code very easy to design and maintain, whereas the best code can't make up for poor data structures."

Quite a nice summary, courtesy of http://programmers.stackexchange.com/a/163195/31774


This brings up a burning question that I have been pondering for a while: I still don't get how to properly design good APIs. I have been programming, but as a scientific researcher, not as a professional developer, and I have found I cannot remember how to use my own code a day after writing it.

Take my current project as an example. I have some samples, each of which is a set of observations along a sequence of non-overlapping segments. My objective is to extract observations over arbitrary intervals for all samples. So I have a segment defined as a pair of start and end positions plus its observation, a sample as a vector of segments, and all samples as a dictionary of samples. There are various utility functions to make segments, collect samples, and scan individual samples. The problem is I have to remember all three levels of data structures to use this code. I wonder whether it is better to define an interface for those data structures as well, so I just need to remember the interface. My objection to formally defining interfaces is that everything is so simple and obvious that formal interfaces smack of overengineering.

I got to this point because in my previous projects I put every identifiable thing as a class and found too much coupling in classes and convoluted interfaces.
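
A minimal sketch of what a lightweight interface could look like here, with hypothetical names (`Segment`, `observations`) standing in for the three levels described above -- the idea being that callers only ever remember the one query function, not the nesting:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: int       # half-open interval [start, end), non-overlapping
    end: int
    value: float     # the observation for this segment

# A sample is a list of Segments; all samples are a dict keyed by sample id.

def observations(samples, lo, hi):
    """The one function callers need: observations overlapping [lo, hi)."""
    return {
        sid: [s.value for s in segs if s.start < hi and s.end > lo]
        for sid, segs in samples.items()
    }

data = {"s1": [Segment(0, 10, 1.5), Segment(10, 20, 2.5)]}
print(observations(data, 5, 15))   # {'s1': [1.5, 2.5]}
```

This sits between the two extremes in the comment: no formal interface declarations, but one small entry point so the inner representation doesn't have to be memorized.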


Hi, I used to have a similar problem: the architectures I designed became difficult to maintain. Or difficult to talk about; to be honest, I don't understand your second paragraph :)

What really helped me is the approach to program API-driven. Don't start with your algorithms but start with what kind of functions you probably need and what would be the easiest way to use them. (In fact this is not so far from this data-centric approach as the most basic functions of APIs are usually function to retrieve or modify data.)

Try to read some good code from one of your favorite open-source projects. At some point some code may catch your attention because it's so simple and elegant. Why is it so elegant? Often because the underlying structures are just simple and built from common sense. Don't over-engineer stuff; the simpler solution is often superior to the full-featured one. And you should often ask yourself: do I really need this feature right now to show some progress? Shouldn't I rather postpone it?


All good advice. Right now I only write out major tasks and outlines of procedures to achieve them. Maybe I first should abstract out core data structures and operations on them. I will try that.


How do you name things? From your text ("I have a segment defined as a pair of start and end position plus its observation, a sample as a vector of segments, and all samples as a dictionary of samples") it seems you have a lot of different names for different things. Decide what names are important and which ones aren't (the "vector" and "dictionary" probably aren't).

Eric Evans wrote a big book called Domain-Driven Design about these things, but his advice basically boils down to this.


I won't have time to read big books in foreseeable future, especially if they are not related to my core research areas. But I will try this and see what I can get out of it.


This is similar to the rule of representation from the Unix philosophy, covered here: http://www.faqs.org/docs/artu/ch01s06.html

"Rule of Representation: Fold knowledge into data so program logic can be stupid and robust."


This seems to apply to all kinds of "writing" (symbol sequence generation), from math to poetry, though the terms differ, e.g.:

  Bad novelists worry about the plot. Good novelists worry
  about the characters and their relationships.


That's more a value judgement. As Samuel Johnson said, "No man but a blockhead ever wrote, except for money." "Bad" novelists who worry about the plot can make a lot of money.


Algorithms + Data Structures = Programs

A 1976 book written by Niklaus Wirth, designer of Pascal

http://en.wikipedia.org/wiki/Algorithms_%2B_Data_Structures_...


... and then there is user interface, debugging, support for various formats, documentation, and other mostly boring stuff.


Heh, exactly my answer on the question at programmers SE.


So you are at some "source" and you want to get to the required "destination" (goal). What do you do? You plan your journey well. Let your plan take into consideration all the possibilities, all pros and cons.

Good programmers plan well: they understand the requirements of the problem; study and analyze the problem space; consider all (or most) of the possible ways to reach their goal (destination); then make a design (create a plan) and decide on the path(s) to be taken, i.e. choose data structures, algorithms, programming language, and other factors. Having understood the pros and cons of their design, they begin to code. This process generally works most of the time, but sometimes you get midway and then change the design or consider an alternative (like a different data structure); this generally happens when you've missed some part of the problem space while planning. Nonetheless, planning well ahead of time, before you begin coding, helps you get a good product and saves a lot of time, money and effort.

Bad programmers, on the other hand, know their destination (goal) but do not know how to get there; they simply jump into coding, hoping that they will someday reach their destination. This too works, but it takes more time, and when you realize you have made a mistake, it becomes very difficult to come up with a new plan to move forward from that point. The product loses its quality. Often you end up starting again from square one.


IMHO: That's why good programmers love Lisp.

Lisp is all about data structures. Data can be expressed easily no matter how complex it is. Lisp coding is merely writing the minimum code needed to handle those data structures. Even code is data, so it's no problem to extend Lisp with new commands. That's precisely coding around data.

In Java or C#, however, you have a lot of libraries to handle data but you don't have the same freedom of data expression. You have to write a lot of code to express and handle complex data.
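
For what it's worth, the "data as literals" freedom isn't unique to Lisp; a rough Python sketch of the same idea (hypothetical config data, invented for illustration):

```python
# Arbitrarily nested data as a plain literal -- no class definitions needed.
config = {
    "server": {"host": "localhost", "port": 8080},
    "routes": [("GET", "/", "index"), ("POST", "/login", "auth")],
}

# "Minimum code to handle data structures": a generic comprehension,
# not per-type accessor code.
methods = [m for m, _path, _handler in config["routes"]]
print(methods)   # ['GET', 'POST']
```

In classic Java or C# the equivalent would typically mean declaring classes for the config, routes, and route entries before writing any logic at all.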


there's quite a similar quote on photography

"Amateurs worry about equipment, professionals worry about money, masters worry about light"


The ending changes the whole thing:

"... I just take pictures."

Programming, motherfucker.


My only problem with this quote is it equates "new" programmers with "bad" programmers. Yes if you still have these problems after 10 years of professional work, then you're a bad programmer, but if you show these symptoms after 6 months it just means you're still learning. There's got to be a better way of stating this.


A new programmer is a bad programmer. There's no shame in that. In fact it's better for a new programmer to realise that; the worst programmers are those that don't even realise they are bad.

A new programmer may still be learning, but that doesn't mean they aren't bad. It just means they are bad now, and hopefully won't be bad later.


Even an experienced programmer can still be a bad programmer.

I realized that myself when I learned Ada. You cannot imagine how humbling an Ada compiler can be :-) This is possibly the reason why Ada is not popular at all: it exposes how bad you really are. Java, C#, and even C++ are much more tolerant and can easily give you the illusion of being a good programmer.

I have experience with a great many programming styles and languages. Ada and Lisp and their programming paradigms advanced my programming skills the most: Ada because of its merciless requirement of discipline, and Lisp because it taught me to focus on data instead of code.


Amen.


I always do it the other way around (when starting software from scratch):

0. write the simplest mock/pseudo-code I can think of for the business logic that needs to be implemented

1. extract from this ideal code the data structure that it needs in order to actually be so simple and write real code that implements these ideal data structures

2. write the real code that actually does the work

I think Linus means the same thing, but he doesn't get it that in order to imagine those "perfect data structures" he has to start with some idea of the code that will be using them, otherwise they will not be "perfect" for his program. I'm sure he's just smart enough to go through my "0" step in his mind without actually writing things down.

It's an obvious case of very smart people omitting the steps that are obvious/implicit to them when expressing their ideas to "lesser minds"...
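
A toy sketch of steps 0 through 2, with hypothetical names (`overdue_invoices`, `Ledger`, `Invoice`) invented purely for illustration:

```python
# Step 0: the "ideal" business logic, written as if the perfect
# data structures already existed.
def overdue_invoices(ledger, today):
    return [inv for inv in ledger.open_invoices() if inv.due < today]

# Step 1: work backwards to the structures that make step 0 that simple.
from dataclasses import dataclass

@dataclass
class Invoice:
    due: int           # day number; kept artificially simple for the sketch
    paid: bool = False

class Ledger:
    def __init__(self, invoices):
        self.invoices = invoices

    def open_invoices(self):
        return [i for i in self.invoices if not i.paid]

# Step 2: the real code already works, because the data fit the logic.
ledger = Ledger([Invoice(due=5), Invoice(due=20), Invoice(due=3, paid=True)])
print(overdue_invoices(ledger, today=10))   # [Invoice(due=5, paid=False)]
```

The point of the ordering: `Ledger.open_invoices` exists only because the ideal code in step 0 demanded it, not because a class diagram did.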


Maybe this is a dumb question, but I don't get how you would write code without thinking about your data structures?

Most of the code I write manipulates a data structure in some way. I have no idea how I would even know where to begin without at least some idea of which structure I should be using.


The data structure you choose will have a profound impact on the code you write, so choose wisely.


I think this can be distilled down to: "bad programmers worry about how, good programmers worry about why"


I've seen this in numerous areas.

Think about business analysis... Let's say you're analyzing the best way to support Marketing at a consumer products company like Colgate. If you start with fancy windows, or the latest technology, you won't help the business nearly as much as if you think through the data needs of the business first, and then worry about presentation later. The data model outlasts the presentation software.

Consider doing a webpage. It's much better to think about what you want to show (the HTML) first, and worry about style (CSS) and scripting (pick your tool) after.

This isn't to say coding isn't important too. You need programming skills to get what you need in the end. It's that data is the foundation. A poor data model reflects either an overly complex or a poorly thought-out business model.


This can be viewed as a corollary to Perlis's "It is better to have 100 functions operate on one data structure than 10 functions on 10 data structures." [http://www.cs.yale.edu/quotes.html]
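
A small sketch of the Perlis point in Python terms (hypothetical data): keep records as one plain structure, and every generic function in the language already operates on them.

```python
# One data structure (a list of plain dicts)...
people = [
    {"name": "ada", "age": 36},
    {"name": "alan", "age": 41},
]

# ...and many generic operations that need no per-type code:
oldest = max(people, key=lambda p: p["age"])
names = sorted(p["name"] for p in people)
adults = [p for p in people if p["age"] >= 18]

print(oldest["name"], names)   # alan ['ada', 'alan']
```

Had each record been its own bespoke class, `max`, `sorted`, and filtering would each have needed dedicated accessor or comparison code instead.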


http://www.amazon.co.uk/gp/aw/d/0521663504

Functional data structures - how to bend your brain in a functional world. Nearly twenty years old, and I still have to take a run-up to read it.


I find myself writing new functions and updating existing ones a lot more often than having to add new columns or tables to my database schema. This is probably true for all well-designed systems. If you need to change your schema every time you need to change your code, it's a clear sign you are a very bad developer.

quality of code indicator = [code changes] / [database schema changes]

Related quote:

"Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious." (Fred Brooks)


It occurs to me that when you're programming functionally, especially in a functional language, you must think about your data's structure and types first (or fight the language a whole lot). If Linus is correct, could a strength of FP be that it naturally herds its users down the path of the 'good programmer'?


This is the first time I've clicked on a StackExchange thread and not seen "closed - not a real question".


give it time...


I believe Rich Hickey believes in keeping data really simple and writing the code around that. Could someone explain this philosophy to me?


Watch "Simple Made Easy", or something like that. I think there is no contradiction; Rich Hickey probably just thinks the least complex efficient data organization is the simple one. I don't think Rich Hickey is saying you can't use AMQs or other "fancy" stuff :)


Posting these stackexchange questions to HN is the best way of getting them closed as "unconstructive", whatever that means.


This question is on the programmers stack exchange, which is supposed to be for questions that cause discussion.


I think he meant that it is better not to worry about the code since it will stink no matter what you do...


The problem with code quality is that there's so much AND-ing that most people give up on understanding this massively difficult problem that is as much social and industrial as it is technical.

One of the first things you learn about systems is that parallel dependencies (OR-gates) are better than serial dependencies (AND-gates). The first has redundancy, the second has multiple single points of failure. That's also true with regard to how people manage their careers. Naive people will "yes, sir" and put their eggs into one basket. More savvy people network across the company so that if things go bad where they are, they have options.

To have code quality, you need people who are good at writing code AND reasonable system designs AND competent understanding of the relevant data structures AND a culture (of the company or project) that values and protects code quality. All of these are relatively uncommon, the result being that quality code is damn rare.


This is a very pessimistic view, because it doesn't allow for any process that could possibly change the state of affairs.

The fact is that one person armed with clear understanding of quality code can be the seed of change. Even under siege from bad data structures and opaque processes, it is possible for a programmer to carve out a small niche, to normalize (at least in his mind) the system he's been given to modify, and apply his knowledge correctly.

If he can execute projects quickly and relatively error-free, this programmer will do well in any organization, and he will probably be given a team of his own, and that team will probably be a good one, and the codebase will continue to change slowly, organically.

If the programmer leaves, then the bad code will grow again. That is the nature of life!

In any event, I just want to emphasize that there is great value in understanding and doing good work, even if (perhaps especially if) the constraints you're working under don't encourage it. YOU are the seed of change.


Completely agree - code quality depends on code review and the belief you should not foist rubbish on others - and that is almost always down to culture and comms


Algorithms, data structures and code quality (the ability to quickly understand and change code) are three dimensions of the same thing; none can be neglected.

Or to put it another way: algorithms are the plot, representations are your characters, and code is your language. One or two of the three aren't enough.

Code is a very good investment; it must be as short and simple as possible, but not too much so. There must be a balance, a compromise between volume, verbosity and meaning, as in good poetry. The heuristic here is 'less is more'.

This must not be confused with languages. Bad, unreadable code can be written in any language, but some over-hyped languages are bloated and messy from birth.


This article was posted here a while ago:

http://www.dodgycoder.net/2012/07/old-school-developers-achi...

It mentions that Ken Thompson "starts his projects by designing the data structures and then works bottom up".

Adapting this approach solved several problems I was having during development.


A similar technique that's among my favorites: Write the inner loop first.

Useful in graphics work, perhaps not so much if writing a Web server or database engine.
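
A toy sketch of the technique (a hypothetical alpha-blending example, not from the parent comment): the per-pixel operation is written and settled first, and the outer structure is then shaped around it.

```python
# Write the innermost per-pixel operation first...
def blend(src, dst, alpha):
    """Alpha-blend two 0-255 channel values; int() truncates toward zero."""
    return int(src * alpha + dst * (1 - alpha))

# ...then let the outer loops fall out of it.
def blend_row(src_row, dst_row, alpha):
    return [blend(s, d, alpha) for s, d in zip(src_row, dst_row)]

print(blend_row([255, 0], [0, 255], 0.5))   # [127, 127]
```

Getting the inner loop right first also means the part that dominates runtime is the part that got the most design attention.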


He's dissing OO there.


I was wondering when someone would point out that a number of OO programmers have no idea what a data structure is.


Why would he do that? Large parts of the kernel are designed using OO concepts :S



