The best analogy I have here is that saying "DP is just caching/memoization" is like saying "investment is just buying stuff and selling it later." Regardless of how technically accurate it might be, it glosses over so much of the complexity and difficulty of the topic that it just makes you look silly rather than insightful.
Would you prefer someone start learning the principles of investing from an initial premise of “investment is just spending money to make money” or “investment is just buying something for less than it will be worth in the future”?
Both are technically correct, and both are vast oversimplifications of a complex subject; however, the second statement leads naturally into more advanced topics, creating a smoother progression of knowledge that builds on itself.
Every complex topic generally starts with a few foundational bits of knowledge that support the whole house of cards. Pick the wrong foundational bits and you won’t get as far.
I certainly don't disagree that there should be a solid foundation here; I just think this is the wrong foundation, for what I believe to be good reasons (some of which I'll get to below) - mainly, it confuses the concept with the implementation mechanics.
Instead... I find it much more productive to start teaching this with the fundamentals of problem-solving techniques, as follows:
- One method for solving complex problems is to divide them into separate subproblems that are simpler to solve. We call this "divide-and-conquer", as you've seen in [blah, e.g. mergesort]. Observe that this technique decomposes your subproblem structure into a tree. (Draw a picture.)
- Another (more advanced) method for solving complex problems is to find subproblems that might potentially overlap, and then find a clever way to combine their solutions into the final answer. An example here is finding the shortest N-hop path from A to B: for each possible first hop out of A, take the shortest (N-1)-hop path from that neighbor to B, then pick the best of the resulting paths. Observe that this technique decomposes your subproblem structure into a DAG. (Draw a picture.)
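To make that second bullet concrete, here is a minimal sketch of the decomposition written as plain recursion, with deliberately no caching anywhere; the `shortest_path` function and the toy `graph` below are my own illustration, not something from the original discussion:

```python
import math

def shortest_path(graph, src, dst, hops):
    """Length of the shortest path from src to dst using at most `hops` edges."""
    if src == dst:
        return 0
    if hops == 0:
        return math.inf  # ran out of hops without reaching dst
    # The subproblems: for each possible first hop, the shortest (hops - 1)-hop
    # path from that neighbor to dst. Different first hops can lead back to the
    # same node, so the subproblems overlap: a DAG, not a tree (contrast with
    # mergesort, whose two halves never overlap).
    return min(
        (w + shortest_path(graph, nbr, dst, hops - 1) for nbr, w in graph[src].items()),
        default=math.inf,
    )

graph = {"A": {"B": 1, "C": 4}, "B": {"C": 2, "D": 6}, "C": {"D": 3}, "D": {}}
print(shortest_path(graph, "A", "D", 3))  # 6, via A -> B -> C -> D
```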
Take a moment to read the above... notice the lack of any mention of caching/memoization/tables, etc.? That's not an accident. It's because those are optimizations to DP, not the core idea! Before you can even start talking about "how do I solve this quickly", the question you need to answer is "what systematic method can I use to solve this at all?" The central idea there is the decomposition of the problem into a tree vs. DAG structure. Once that sinks in, the question of how to make it faster (or whether to use recursion vs. iteration) is secondary, and also far more obvious to figure out on your own.
Moreover, this also lets the student realize that you can (sometimes) speed up divide-and-conquer with memoization, too. Which also hammers home the point that such an optimization is very much distinct from the technique itself!
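If it helps, here is a tiny sketch of that separation (Fibonacci purely because it's short; the names and the use of `functools.lru_cache` are my own choices, not anything from the comments above): the decomposition is written first, and the cache is layered on afterwards without touching it.

```python
from functools import lru_cache

def fib(n):
    # The decomposition itself: fib(n) depends on fib(n - 1) and fib(n - 2),
    # and those subproblems overlap (a DAG), whether or not anything is cached.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

# The optimization, applied as a separate step on top of the decomposition.
# Rebinding the name means the recursive calls also hit the cache.
fib = lru_cache(maxsize=None)(fib)

print(fib(90))  # instant; without the cache this would take ages
```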
I think the point of GP is that saying it's basically “smart caching”, or “filling the cache in the correct order”, helps connect with a concept many people are already very familiar with, which grounds the idea immediately. In contrast, going through each step you have outlined introduces several concepts that are new and counter-intuitive to most people, so many will just give up.
> In contrast, going through each step you have outlined introduces several concepts that are new and counter-intuitive to most people
What's new there? The concept of a DAG? I dare say it's important to learn DAGs before learning DP. Trying to teach them in the opposite order is like trying to teach multiplication before addition.
(I also have no idea what you find counter-intuitive here. How is solving complex problems by solving simpler problems counter-intuitive? Do you normally solve hard problems by solving harder problems?!)
Please, take the perspective of an average computer science student, who might have had some interest in computers but maybe did not look too deeply into the theory. In the past few years, they just had to learn graphs, automata and Turing machines, complexity classes, computer architecture, compilation theory, and possibly a programming language they never used before, maybe two. And they might prefer having a social life to doing all-nighters on computer-assisted proof assignments :-).
In short, I am not saying you cannot learn dynamic programming straight away from the theory, just that you are going to lose many people with this approach.
In fact, it makes me think of “New Math” [1], an effort to teach math from the ground up starting in junior high. I am not familiar with how it worked in the US, but at least in France, it was definitely not a success. I personally would have had a lot of fun, but many more did not, and failed to get a basic mathematical education at all as a result.
Would you mind pinpointing where exactly you feel I would lose students in the above approach? Would I lose people at the mention of "DAG"? In which case... isn't it easy enough to jog their memory in a few seconds ("it's like a tree, except a node can have multiple parents" <drawing>) if they've forgotten what that is? Or is it with the N-hop example? In which case, aren't there plenty of simpler ones (like Fibonacci) you can use instead? Or do you see somewhere else where people would get overwhelmed, and if so, where?
Like, I would understand where you're coming from if I'd used the pumping lemma or something, like you're imagining, but that's not what's happening here? All I did was mention the words "subproblems", "trees", and "DAGs"... that's it. No theorems, no proofs, not even any need to remember what caching is... just descriptions with vivid examples and diagrams. All of which you can re-explain in a few minutes if they've forgotten them. I have a hard time seeing why that should scare away someone who's otherwise in a position to learn DP.
Also, isn't the audience kinda important here? It's not like this was intended for 10-year-olds like New Math. It was intended for college-level CS students and above. They can and should be expected to have a greater understanding (and attention span) than elementary school students.
The point is that dynamic programming builds on various abstract concepts that many college students only understand to a level deep enough to pass exams. Making the connection with caching, which is a mostly unrelated and much more concrete concept, gives them a basis on which to build a mental framework. Sure, they might do without, but why make it harder?
I don't agree that it's making it harder, though. Nothing I built on above required any deep understanding at all, and you could even re-explain all the shallow bits (like what a DAG is) in a few minutes, if not seconds. I simply don't buy that this is too onerous. As to "why explain it this way regardless": because I believe it gives them a better understanding once they hear it.
I am not saying to not explain the theory. I am saying that using a concrete concept makes it easier to build the mental model needed to understand the theory.
The traditional way of teaching dynamic programming worked well in the traditional CS curriculum that had a lot more mathematics than today. It leveraged what the students had already learned at that point.
Take the definition of the problem. Transform it into an equivalent definition that looks like the inductive step in a proof by induction. Look at the partial order defined by the dependencies between the subproblems in the inductive step. Solve the subproblems in an order that is consistent with the partial order. (For bonus points, find an order that is asymptotically faster in easy situations, such as when the edit distance is small. You may have to rule out paths that cannot be part of an optimal solution.)
That's first or second year mathematics in the traditional curriculum.
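For concreteness, here is roughly what that recipe produces for edit distance, as a sketch (unit costs assumed; the function name and table layout are my own choices):

```python
def edit_distance(a: str, b: str) -> int:
    # Inductive definition: d[i][j] is the distance between a[:i] and b[:j].
    # d[i][j] depends on d[i-1][j], d[i][j-1], and d[i-1][j-1], so filling the
    # table row by row is one order consistent with that partial order.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # base case: delete all of a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j  # base case: insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    return d[len(a)][len(b)]

print(edit_distance("kitten", "sitting"))  # 3
```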
Today, and actually since the 90s or maybe even earlier, many CS students have no particular interest in mathematics. Classes that rely on mathematics have to spend less time on the actual content, as they have to cover the prerequisites and find alternate ways of explaining things.
I think my own education followed the “traditional” way of learning computer science, albeit after years of tinkering with computers and programming as a kid. But I often felt that some concepts were much simpler when you made the connection with something concrete.
Of course, you would not use an analogy for formal proofs, but it helped build a mental model incrementally instead of having to conjure various concepts out of thin air and draw connections between them.
The key part was studying enough mathematics. When your homework involves several proofs by induction every week, you get used to a certain way of thinking. If someone then shows you a dynamic programming algorithm, you will probably notice that the same process you are using all the time can also be used for designing algorithms.
You were doing the hard work in the mathematics classes, while dynamic programming was a simple topic you could understand after a single lecture. Definitely simpler than balanced search trees. But once the average CS student was no longer comfortable with mathematical proofs, they had to do the hard work while they were learning dynamic programming.
And how do you find "subproblems that might potentially overlap"? You go through the space and cache results of your calculations.
I don't think there was a DP problem in my university days that couldn't be solved this way. Are there problems that can't? Maybe, I don't know. From what I've seen, you sometimes need more clever caching, but that's about it. The question is why we teach a very simple, intuitive concept in a way that makes a lot of people who grasp all the prerequisites struggle with it.
> And how do you find "subproblems that might potentially overlap"? You go through the space and cache results of your calculations.
I addressed this directly in my comment? At the risk of repeating it: you don't have to even think about caching anything in order to find "subproblems that might potentially overlap". The example I gave directly in my comment illustrated this vividly: the shortest N-hop path from A to B is the best of "one first hop out of A, then the shortest (N-1)-hop path from that neighbor to B", over all the first hops you can possibly take from A. Is it clear what the subproblems are? Yes, they're the (N-1)-hop subproblems. Is it clear why they might overlap? Yes, because after a couple of hops you might end up in the same location. Was caching relevant to understanding the aforementioned? No, it didn't even need to cross your mind - the point was the problem decomposition. Caching is just a generic optimization you can apply later to DP, just like you can to divide-and-conquer (and to other stuff).