I feel so validated by this article. I took two semesters of machine learning electives for my CS masters and feel nearly as ignorant and mystified as when I started. I worked so hard to create something useful and at the end of the day, my work felt like it was 96% example code with modifications hacked in to make it work. And in the end it was still terrible! At least now I know what people are talking about when discussing neural nets and their inner mechanics.
For now, ML research and development is too complicated and frustrating for me to dedicate the time and energy to become skilled in it.
Well that's because ML isn't really software engineering. Unfortunately, it is also software engineering, as I often have to remind my colleagues coming from the algebra / econometrics / statistics side, who are happy to shove in all kinds of horrible code.
What I've found in reality is that machine learning is 99% data cleaning scripts and 1% the part you're talking about. I've also seen the heavy duty statistics people writing data cleaning python scripts which probably leads to a lot of frustrations :)
I think what may be understated here is that while it's true that ML is mostly data cleaning, data cleaning is not easy. There are a million little decisions to make and it's rarely clear which ones are most effective. Experimenting with various techniques is great, but the iteration times and costs are usually too high to try more than a small handful of approaches.
> 96% example code with modifications hacked in to make it work.
This is 96% of how ML is used in practice by companies. Most parameter optimization should be automated by whatever library you’re using, beyond basic sanity checking. The challenging parts are creating high-quality training data and deploying the models efficiently at scale.
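To make "automated by the library" concrete, here's a minimal sketch of hyperparameter search with scikit-learn's GridSearchCV (the toy dataset and parameter grid below are made up for illustration, not a recommendation):

```python
# Hedged sketch: let the library sweep hyperparameters instead of hand-tuning.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy data standing in for the (much harder to build) real training set.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},  # illustrative grid only
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```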
I'd be surprised if it doesn't suffer the same fate as graphics programming. The lower-level stuff is what brings in a lot of talent, but the producers often have little knowledge of how things work and just wire together some libraries in a GUI.
I don't think it's as complicated as it might seem, if you break it down.
I think the real blocker is time; modern software devs are expected to be across the whole stack. You can't have one person write your backend, frontend, infra, db admin, build ml architecture, train model, etc. It's just too much.
It's just specialising really: wide and shallow vs. narrow but deep domains. Forefront-of-ML stuff requires researchers specialised in that domain, same as any other field, I guess.
As a software engineer, I disagree. Caveat, I haven't studied traditional ML and just went straight to DL. There is a lot of jargon and you do have to sit down and learn how things work, but once you do, deep learning is fairly simple. One thing that actually really bothers me is how much libraries (e.g. huggingface) are just config files masquerading as programming. It is just a class with 50 parameters and takes about 5 lines of code to make it do the thing, most of the time is spent figuring out what the parameters do.
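For anyone who hasn't seen it, here's roughly what that "class with 50 parameters" pattern looks like with the huggingface `transformers` library (a hedged sketch: the checkpoint name and tiny in-memory dataset are placeholders, and `TrainingArguments` accepts dozens more options than shown):

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "distilbert-base-uncased"  # placeholder model choice
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Tiny stand-in dataset; the real work is building and cleaning the actual one.
train_ds = Dataset.from_dict({"text": ["great", "awful"], "label": [1, 0]})
train_ds = train_ds.map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length", max_length=16)
)

# The "config file masquerading as code" part: most of the effort is choosing these values.
args = TrainingArguments(
    output_dir="out",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    num_train_epochs=1,
    weight_decay=0.01,
)

Trainer(model=model, args=args, train_dataset=train_ds).train()
```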
I think making lower-level tools (fundamental building blocks) will make things more accessible to software engineers, as opposed to the high-level wrappers being written. We can grab a library, read the docs, and put pieces together in an efficient way. We just need some core workhorse libraries. Like if llama.cpp were a library at the same maturity as sqlite.
But these do exist: plain Jax or Pytorch only give you basic linear algebra, differentiation and some basic layers. And there's a plethora of more or less advanced libraries that add specific functionality, for example torch geometric for graph data and lightning to reduce boilerplate.
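For concreteness, this is about all plain PyTorch hands you out of the box: tensors, a basic layer, and autograd (a minimal sketch; the shapes here are arbitrary).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(32, 10)   # a batch of 32 ten-dimensional inputs
y = torch.randn(32, 1)    # regression targets

layer = nn.Linear(10, 1)           # a basic building block
loss = F.mse_loss(layer(x), y)     # a basic loss
loss.backward()                    # automatic differentiation

print(layer.weight.grad.shape)     # gradients are just tensors you can inspect
```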
Chiming in to say that this is precisely our observation. The existing ML/DL libraries are not bad as far as those types of things go. In fact, Pytorch is an amazing library IMO. Especially compared to TensorFlow, Caffe and the stuff that came before that.
But like George points out in the article, unlike "traditional" software, ML requires iteration, data management, monitoring, specific infra reqs, and so on. So our take was that libraries would never be enough, hence the SaaS offering.
> There is a lot of jargon and you do have to sit down and learn how things work, but once you do, deep learning is fairly simple.
This is my realization too. I think ML for SWE courses should focus on "translation" first. Like by "kernel" they mean this specific thing, not the normal meaning of kernel. This is similar to other fields like finance (which I'm working on). After you learn the language, it's actually not too terrible to understand.
Of course I'm kidding. It's one of those terms that many fields adopt and give it completely different meaning. There's no "normal" meaning of `kernel`.
EDIT: Hacker News won't let me respond, but the answers below all seem to exist because the original meaning has been lost on everyone.
In English, the word 'kernel' means 'core'. An OS kernel is the core of an operating system. In linear algebra, the kernel of a matrix (or a linear transformation, same thing) is the set of vectors it maps to zero, which is also in a sense the 'core' of the mapping (in so far that zero can be seen to be at the 'core' of the vector space / number line).
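In symbols (just restating the above, for a linear map A on a vector space V):

```latex
\ker(A) = \{\, x \in V \mid A x = 0 \,\}
```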
So actually, the definition is the same, it's just that the word kernel is rather rare, despite having a well understood meaning. Nevertheless, it is the kernel of many common English idioms such as a 'kernel of truth'.
Going to ML kernels... the etymology is a bit convoluted. I believe they come from operator theory and support vector machines. But nevertheless, the 'normal' meaning of kernel works out because kernels are typically the core fundamental operations supported by a machine learning framework atop which the other operations are built. In that sense, regardless of the etymology, the name actually fits.
Well, I'm not actually sure which one they're referring to, but there are two meanings that I know of. There's the software developer's kernel, which is the core of an operating system, and there's the linear algebra kernel, which is the null space of a map. Most of us here are familiar with the operating system type, but the linear algebra kernel describes the set of vectors which are sent to the null vector (0) by your transformation/map/operation. It's very useful in understanding what exactly your operator does and how it changes a space.
The bit of code at the center of an operating system that mediates access between userspace code and the hardware.
Whereas in CUDA programming, a kernel is just the code running on the device. When I first heard kernel in relation to CUDA programming, I expected it to be a) the actual OS kernel, and then b) something akin to a GPU driver.
I can see how NVIDIA got there, but it's not immediately obvious if you're coming from a SWE context.
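For illustration, here's roughly what a kernel in the CUDA sense looks like from Python, using Numba's CUDA support (Numba isn't mentioned above, it's just a convenient way to sketch it, and you'd need an NVIDIA GPU to actually run this):

```python
import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(x, y, out):
    i = cuda.grid(1)          # this thread's global index
    if i < x.size:            # guard against threads past the end of the data
        out[i] = x[i] + y[i]

x = np.arange(1024, dtype=np.float32)
y = np.ones_like(x)
out = np.zeros_like(x)

threads_per_block = 128
blocks = (x.size + threads_per_block - 1) // threads_per_block
add_kernel[blocks, threads_per_block](x, y, out)   # "launching" the kernel on the device
```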
> Whereas in CUDA programming, a kernel is just the code running on the device. When I first heard kernel in relation to CUDA programming, I expected it to be a) the actual OS kernel, and then b) something akin to a GPU driver.
I think this is still not the machine learning 'kernel' being talked about. While you'd inevitably need to know about this for tooling in the space, looks like kernel is also a mathematical term.
Kernel is an overloaded term outside of software engineering. In linear algebra it means the null space, but that has no connection I’ve ever found to kernel methods or kernels more generally in functional analysis.
The post is not talking about 'traditional ML' but rather only 'DL' (i.e. neural networks).
One of the problems I see here is that the math education of many CS grads is woefully lacking. Indeed, deep learning math is basically senior-high-school level calculus. Backpropagation is a straightforward application of the chain rule. There is really nothing surprising and no deep insight required. The math for deep learning has been set in place for more than fifty years at this point. The only change was the advent of large enough data and computer chips to process it.
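As a one-line illustration of that chain-rule point (a generic single-neuron example, not taken from the article):

```latex
z = w x + b, \quad \hat{y} = \sigma(z), \quad L = \ell(\hat{y}, y)
\qquad\Rightarrow\qquad
\frac{\partial L}{\partial w}
  = \frac{\partial L}{\partial \hat{y}}
    \,\frac{\partial \hat{y}}{\partial z}
    \,\frac{\partial z}{\partial w}
  = \ell'(\hat{y}, y)\,\sigma'(z)\,x .
```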
That being said, I don't think understanding how the `transformers` library works or composes is the same as understanding how 'deep learning' works. That's like saying knowing how to use GCC gives you a solid understanding of compilers. That's not a ding against those whose main experience with DL is using those libraries. I use many libraries doing things I don't fully understand. I'm not a graphics expert, but I use graphics programs every day, and graphics libraries regularly. That's fine. But using those libraries doesn't make you an expert in computer graphics.
> Indeed, deep learning math is basically senior-high-school level calculus.
Actually I find this claim problematic and incorrect. Yes, many parts of ML only require multivariate calculus[0], and those are the parts that most people are exposed to. BUT that doesn't mean there isn't a lot of math hiding in the background. Even understanding something like activation functions takes much higher-level math, as we have to get into topics such as topology, metric theory, and high-dimensional statistics. You don't need this understanding to build models (as you also conclude) and most researchers don't understand much of this either, tbh. But I think we need to be clear about these distinctions, because model evaluation gets insanely complex. Complexity in evaluation only increases as performance increases, and a major problem we face today is that we are marginalizing out important information. I'm sure that anyone who hacks around with LLMs or diffusion will understand how leaderboards are often noisy and how picking top models does not guarantee top performance (and why datasets keep getting added). It's because evaluation requires more than boiling performance down to a single number. Here, math skills become extremely important.
[0] btw, many will not find this even offered at their high school. In fact, this is often upper level math for many STEM undergraduates and even an elective in my Uni's undergrad CS program. So let's not be so belittling. Shame the education systems instead of people lacking opportunities.
I don’t think I agree because it’s like saying accounting is just elementary level addition, subtraction and multiplication. But elementary school kids can’t do accounting.
DL requires a certain degree of mathematical maturity to grasp.
For me the hardest part of learning ML was getting over imposter syndrome. It felt like I needed a PhD and hardcore math skills. That's what made me so hesitant about learning it. I thought: there are already so many people much smarter and more advanced than me. Why even bother?
It wasn’t until I was “forced” to learn it to solve a problem I was facing, that I realized ML is just like any other engineering topic - whether it’s devops or data engineering. You just need motivation, some patience and ideally a project/problem that you can solve while learning all this stuff.
> For me the hardest part of learning ML was getting over imposter syndrome. It felt like I needed a PhD and hardcore math skills
ABD (all but PhD dissertation) here with strong math skills. I get the imposter syndrome, but let me absolutely assure you that the community at large does not have strong math skills. I routinely talk to people doing diffusion research who don't know what covariance or a pdf is. People from top-ranked schools, with high paper counts and high citation counts. Expertise is often more narrow than it appears. That's okay, as long as we're honest about it.
Don't get me wrong, I wish there was more math involved and efforts were more serious. But they aren't. The space is very noisy and little is being done to clean it up (there are some efforts, and I do appreciate them). I'll add that there's one more thing you need besides motivation and patience: perseverance. ML systems are hard to debug and difficult to evaluate (maybe not for papers, but absolutely for systems that work in the real world). It's okay to not get things perfect, and it is totally okay to not have a model with decent generalization, but context is always important, and part of the debugging process is trying to trace these issues down (which is difficult because you need more abstract versions of what is analogous to passing silly or random inputs to code). Detecting overfitting is often quite hard, and honestly sometimes it is even desirable (GPT being overfit makes it great for information retrieval!).
Also, something I tell my students when I teach ML: you don't need math to train good models, but you do need math to know why your models are wrong. So I highly encourage math, but don't let that stop you from getting started. You can also just have a math heavy person on your team and get many benefits that way.
Also, like most topics, there are multiple levels of understanding, and you generally don't have to reach the deepest layers to be somewhat productive.
One difference from other engineering topics is ML brings a hope that it will solve itself. You make a plan, you implement it, you train the model for days, the problem is not yet solved, but if you let it train a little longer maybe it will be?
> It felt like I needed a PhD and hardcore math skills
I don't know if this helps or makes it worse. But I have both and get the same feelings all the time. Basically, you just need good statistics and linear algebra knowledge, and you will be fine (on the math side).
> there are already so many people much smarter and more advanced than me. Why even bother?
That's the very definition of imposter syndrome put in a very good way. But you can ask the same for everything, not only ML. kudos on getting through it, though.
99% of the time people say linear algebra is required for something, they mean knowledge of basic operations and properties of tensors more than actual "algebra". I found this when doing computer graphics. Is that true here as well?
> they mean knowledge of basic operations and properties of tensors more than actual "algebra".
You're actually describing linear algebra. A core topic is systems of equations. You might see a 2-tensor (matrix) expression like Ax := [[a,b],[c,d]].[x,y], and you could write it as f = ax + by; g = cx + dy. Often dimensions are implicit so it may not look like this, but it is. That's a big part of what linear algebra is about (there's a whole lot more, btw). You're absolutely using linear algebra frequently in graphics. Euler angles are a good example; you're just probably not writing them in matrix/tensor form. You will even get a tiny bit of exposure to {field, group} theory / abstract algebra via quaternions.
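Written out (the same 2x2 example as above, just in matrix notation):

```latex
\begin{pmatrix} f \\ g \end{pmatrix}
=
\begin{pmatrix} a & b \\ c & d \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}
=
\begin{pmatrix} a x + b y \\ c x + d y \end{pmatrix}
```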
In ML I'd say it is very similar. The typical researcher is going to have about the same math skills as the typical person studying graphics (I actually started my PhD in HPC graphics). But, and this holds for both domains, having a deeper math understanding only helps. It makes things easier to debug, gives you a better understanding of what the systems are doing, and gives you a lot of tools to solve many problems. I wouldn't ever use math as a strong barrier to entry, but I feel many get complacent with their skill level, and the myth that math doesn't help needs discouraging. Without a doubt, it does.
A tiny bit more: once in a while someone does something elegant with eigenvalues, and Transformers represent their "thoughts" in a very LinAlg way (a linear combination of orthonormal basis vectors in a very high dimension), which was very hard for the only person with no LinAlg background in my study group to grok.
Basic operations and properties of tensors are exactly what is taught in linear algebra, in addition to (in my experience) more accessible proofs. More or less through singular value decomposition and/or least squares.
The core algorithms all build on top of each other. The `algebra` part of linear algebra refers to a `field`, but it might as well also be called arithmetic of tensors.
I agree but at the same time if you look at how laughable many CVPR reviews are, I kinda wish the community had more math and statistics knowledge. But that might also be a different issue...
FWIW, there are a lot of works that do get in deep to the mathematics of ML and I find these absolutely helpful. Anyone that says theory doesn't help practice hasn't read theory or is operating in bad faith. ML uses A LOT of math, but you just don't need it to create good and/or working models. I think the distinction is important.
It helps to differentiate between the foundational research and the application. Are you expecting to advance SOTA and get published? Then yes, you need a PhD. Are you building stuff in the enterprise world? Then no, it is like any other kind of engineering. Not everyone has to be James Clerk Maxwell.
I was a software engineer for ten years before going back to school for a statistics degree. In my experience, I thought very deterministically, and that really got in the way of interpreting the mathematical concepts that are, by definition, stochastic. Engineers think in IF statements. ML thinks in probabilities. This is a nontrivial mental barrier to overcome.
It doesn’t help that a lot of engineers want to find shortcuts that involve not learning the math. That’s just more engineering thinking. Not all disciplines throw exceptions when the output is bad. Maybe there will be tools that negate this need some day. I have yet to see them.
There's a big gap between training an algorithm on a toy problem, vs building a useful product.
Software engineers are often missing key skills. They can learn them, but they won't automatically get them in their traditional training.
First, measuring success. Actually telling how well a production system is doing is tricky. There's an art to developing metrics that tell you if an ML system is delivering value, and a lot of engineers don't have the metric-design skills. Often, to productionize an ML system, you need a bunch of proxy metrics and a pretty good backtesting setup. This will often depend on the specific problem, and it's a skill you won't pick up in a standard software setting.
Engineers - and especially designers - also struggle with edge cases when things go off the happy path. It's often easy to make an ML prototype that works in 90% of cases, and get a project started - but a nightmare to solve enough of the edge cases for a production-grade system. Finding, papering over, and designing around all those edge cases effectively can require a deep bag of tricks a pure software engineer won't have.
Finally there's a struggle with tactics and culture.
A lot of the bread and butter tactics of high performing software delivery are the opposite of what you need for ML projects. E.g. In high velocity frontend work you want to lock a design early, and your designer can probably do a lot of iteration before engineering starts. In ML projects you want to keep the design floating and low fidelity, as you prototype, and lock it late in the project.
Many of the development tactics and cultural patterns that lead to high-performing software teams (in a SaaS setting, say) are anathema to ML projects.
> Engineers - and especially designers - also struggle with edge cases when things go off the happy path. It's often easy to make an ML prototype that works in 90% of cases, and get a project started - but a nightmare to solve enough of the edge cases for a production-grade system. Finding, papering over, and designing around all those edge cases effectively can require a deep bag of tricks a pure software engineer won't have.
YMMV. Finding and papering over the things that prevent a model from being deployable can also require a deep bag of engineering tricks that an average ML research scientist does not have. In my personal experience I've seen teams burned by this more often than the other way around.
In my opinion, current SOTA machine learning / deep learning is actually quite simple on a conceptual level. In January last year, I decided to give it one year to understand transformers at an "I could totally code that" level. There were so many good video tutorials and texts that after two months there wasn't much left to learn. In particular, the math did not feel much more difficult than what I learned in high school (and much easier than what we did in university). I think this is a particularity of those DL approaches that don't need much more than some (higher-dimensional) linear algebra and calculus. There are other AI or ML approaches that I cannot even grasp on a superficial level, but fortunately (from a learner's perspective) DL has completely eaten machine learning, for the time being.
Nevertheless, that entire field depends so much on complex heuristics and subtle optimizations that I completely understand that learning and getting an intuition for those details takes a long time (and is more akin to black magic). The development experience is absolutely horrible. Debugging a model takes such a long time, is rather expensive, and observability is absolutely dismal (at least it was a year ago). It really felt like debugging a program into existence by staring at graphs and retrying infinitely many times.
ML is a broad topic, and it keeps getting wider, and deeper. Even ML specialists don't try to keep up with it all. Be comfortable with not knowing everything.
Machine learning engineers are software engineers, and they exist, so the title is wrong. I suppose it is in Nyckel's interest to claim otherwise.
Transformers are in many state-of-art models but they don't solve all machine learning problems. Even within the world of transformers, there are many variations depending on the application: generating embeddings, translation, next-token prediction, recommendations.
I am a mathematician by training. The algorithms relied mostly on undergrad-level mathematics when I took some courses six years ago; not easy, but I think there are harder algorithms.
I think it is qualitatively different from programming, because you try to find a reasonably good fit for the data you know in order to guess new, future data. Classical programming relies on rules and decisions. ML is closer to numerics, statistics, and simulations. Guess the function from the data vs. define a function and program it.
You can easily get into territory that is harder than undergrad. It's completely possible you'd need something like rejection sampling, functional time series, Jeffreys priors, etc.
The bottleneck is that the Python libraries have a poor user experience.
“Get familiar with one or more ML libraries like PyTorch, Tensorflow, FastAI, or scikit-learn. This is harder than getting familiar with a normal programming library because the concepts and paradigms are very different from what programmers are used to.”
Is this just a matter of there not being nice packaged up products yet?
I’d like to see this sort of logic applied to, say, doing an Autocad simulation. I think lots of people use that kind of stuff to do finite element analysis without being experts in the underlying math packages…
There was a talk at PyCon Sweden by a Huggingface dev who showcased some of the tools they have for making ML (specifically transformers) feel like "standard" software dev: https://www.youtube.com/watch?v=fckyXntHy1s
In particular treating datasets as "repos" analogous to how code can be versioned seems like a really good way forward.
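A minimal sketch of that dataset-as-repo idea with the `datasets` library (the dataset name and revision below are placeholders, not a real dataset):

```python
from datasets import load_dataset

# Datasets on the Hugging Face Hub are backed by git-style repos, so a specific
# version can be pinned much like a code dependency. Name/revision are placeholders.
ds = load_dataset("some-org/some-dataset", revision="v1.0")
print(ds)
```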
Understandably, but at the same time, an electrical engineer who specializes in power systems isn't going to know the ins and outs of microprocessor design, and there's nothing wrong with that.
Yeah, but ML is extremely math-heavy, and the fact that it is done in software doesn't mean that any software engineer should be able to pick it up readily. I think it is much easier to teach a person with a deep understanding of math the basics of Python and have them do ML than to teach a software person graduate-level math and then have them do ML.
I can relate to the article. Knowing JavaScript, I've been fiddling about with Tensorflow.js, learning DL concepts through the lens of the library and working my way backwards. For anyone interested, here are a few books in this vein:
Learning TensorFlow.js: Powerful Machine Learning in JavaScript
Deep Learning with JavaScript: Neural networks in TensorFlow.js
When I first learned programming, I was surprised how you hardly need any mathematics for anything. There are some exceptions, like video game engines. But those exceptions were not what 99% of software engineers were doing.
But now machine learning is another such exception where non-trivial mathematics is important.
What are some good online courses to break into the field for a competent, generalist software engineer? Ideally I want to end up focusing on the platform / MLOps space.
(For someone say, who has a CS degree, took a Linear Algebra class a decade ago and doesn't remember much.)
Andrew Ng's course for theoretical underpinnings, and Jeremy Howard's fast.ai course for the practical. Or just the latter course, if you want to get right to it.
It's not that people can't do it, but that it consumes a huge quantity of time. You're learning a bunch of new tools and building infrastructure to feed them your data.
And often the result is failure, or something close to it, as the output isn't very good.
My entire post-college career has been successful because ML is too hard for software engineers (and scientists). Bear with me here. Long ago, during the AI winter (late 80s to 2000s), as a high school student I read about neural nets and, being interested in both biology and CS, thought that was an exciting system to learn about.
When I got to college, nobody talked about neural networks. Machine learning as a whole was considered a scurrilous science, wasting people and computer time. "There's not enough data. And the algorithms we have don't work! And even if we solved those, computers are too slow".
Fortunately, I managed to fail to get a job in the bio dept and as a consolation, was pointed at a nascent computational biology group in the CS department. I met David Haussler, then one of the few people in CS doing ML. Absolutely genius, he showed me a few papers and I tried to read them/understand them. The math was all over my head. It involved finding analytic derivatives of complicated functions. Fortunately, I was paired up with a grad student and given a reasonable project, where I downloaded all the gene sequence data for E.Coli (which wasn't even finished at the time) and managed to build a simple model of E.Coli genes and write an undergrad thesis that I only partly understood. To me, the magical part was watching gradient descent take those derivatives and update the weights.
When getting ready for my next phase of life, I was terrified that I wouldn't get a job as a programmer in Silicon Valley, because those folks all had CS, not bio, degrees, and they knew how hash tables worked, and other complicated stuff that I couldn't wrap my head around. I decided there was no chance I could afford to live in the valley in 1995, so I applied to grad school and got in; the goal was to understand gradient descent.
My PhD was the most exhilarating and exhausting time of my life. Suddenly I was surrounded by people who could solve hard physics questions, understood quantum mechanics, and could come up with interesting experiments that got published in top journals. I felt like an imposter the entire time. But I fell in with a good group that encouraged me to explore things at my own pace, and I spent the next 7 years learning a ton of things, the capstone of which was understanding/converting a molecular dynamics loss function (including those painfully learned analytic derivatives) from FORTRAN to C++, and writing a gradient descent routine straight out of Numerical Recipes. I was thrilled that I could understand the magic of gradient descent but also depressed because that method doesn't solve the hard problems of biology (such as predicting protein structures de novo), and people were still saying that ML didn't work, there wasn't enough data, the algorithms sucked, and computers weren't fast enough.
That didn't sound right to me, because I knew that genetic data was exploding, and computers (especially cheap linux clusters) were changing access to computation quickly. The algorithms (circa 2001) were still mostly garbage, especially in biology. Neural networks to predict protein secondary structure had hit a wall at about 80% accuracy and nobody was doing ML to predict protein tertiary structure.
So I went back to school for a few more years because I still didn't know how to get a job in Silicon Valley. I did a 3 year postdoc with little to no machine learning, just domain-specific biology stuff that I didn't find interesting, and finally managed to get a job as a computer scientist at a national lab. It was a good pivot- I was a principal investigator, meaning I could apply for my own grants, write papers, etc, but didn't have to teach classes. I was MISERABLE! I loved the engineering, but the papers/grants/conference parts were just terrible.
But I still didn't "get" machine learning and wanted to work somewhere that did ML. I tried to get a job as a SWE at google- went through the wringer of all the hard questions, and ultimately got turned down at the last step (thanks, Larry Page) and went to work for a biotech for a year before I finally managed to get hired at Google during the "post-IPO, Google-classic" era, around 2007. My pay started rising faster than the average for the Bay Area which was a nice detail.
When I got to Google I quickly looked through all the projects doing ML and found that other than ads, there really wasn't a lot. There was rephil, and SETI, and SmartASS, none of which seemed even remotely like the ML I was interested in (deep neural networks). So I went and focused on other stuff- learning the distributed technology beneath Borg and Colossus, and mastering the google3 stack and production environments, mainly from an SRE perspective. But my job wasn't very demanding and I spent all my time writing proposals for Google to get involved in biology, because Google had distributed tech that was perfect for doing biology research.
Eventually, some senior engineer found my proposals and introduced me to the right people and I spent the next few years writing and running a large-scale distributed idle-cycle harvester that ran protein folding, protein design, drug discovery, and telescope design codes at large scale, while also learning large-scale data processing, because in speeding up the simulations, we were inundated with data to process. I got a few great publications out of this, and ended up being part of the cool kids club (coffee with Jeff Dean and Sanjay Ghemawat, etc) and parlayed this into a job building a new biology-specific platform vertical in Google Cloud so Google could make money off of, and improve the process of, biology research.
At some point I managed to tick off some senior person so I couldn't work on Research at Google, but finally- for the first time in my career- managed to land a job working full-time on machine learning- a system at google that almost nobody knows of called Sibyl. Sibyl was an innovative system that used an obscure ML concept- boosting- combining it with mapreduce- to run large-scale ML experiments that were directly part of the serving loop for Youtube, Google Play Ads, and other rapidly growing parts of the company. The profits from sibyl were enough to pay for all of google's research for several years and helped google grow tremendously. All that time I'd spent on machine learning and computer infrastructure... went to writing systems that loaded 80GB hash tables into memory just so a mapper could compute a tiny part of some gradient for some variable.
Unfortunately sibyl was actually a terrible system and I got kicked off the team for telling the leader the right way to do DL was deep neural networks on high performance computing hardware, not mapreduce on cheap linux cluster machines. I hid in a side team for years, playing around with 3d printers and other stuff, not really moving my career forward, but enjoying my job for the first time! At the same time I watched Jeff Dean finally realize that machine learning was an HPC problem (I think vincent vanhoucke managed to speed up voice recognition with 8 GPUs stuffed into a desktop) and he created TensorFlow, which still stumbled around for years before it realized it was an HPC system (see the slow transition to making more and more of the training process be parallel).
Finally, neural networks were vindicated! They solved a wide range of problems and my skills were applicable. We had the data, the algorithms, and the compute, all at once. And even better, you didn't need to be inside google to take advantage of it (except the big data, and that was changing quickly). I understand enough of the math, and the infra to finally be an ML Engineer.
But around that time I also came to a conclusion: most people working in ML are miserable. They are under intense pressure to get results a few percent better than their collaborators, and then once published, pivot to the next-next thing. That's when I came up with one of my laws: "The very best ML models are distilled from postdoc tears". I saw a few people break down and leave the industry for good just from working on super-stressful projects where they did great work, but only reached parity with a competitor. And so I concluded: I was going to be ML-adjacent. This has been a successful pivot for me.
What is the moral of this long story? Imposter syndrome drove me to overcome my imposter syndrome, and in doing so, along the way, I learned what I was chasing was not actually what made me happy. I'm far more satisfied puttering about using 5-year-old ML tech like object detectors to improve my microscope's ability to track tardigrades, than I am trying to become a famous researcher who unblocked the hard problems of biology. I guess that's part of the aging process and the stability that comes from having a salary so I don't have to worry if I can make rent.
ML has its own guild-like quality. There is some subgroup of ML people who will always try to move the goalposts, making the math harder, more esoteric, and less practical, while often publishing garbage until you peek under the covers and realize they just got lucky, and scared away all the competitors with their Big Math. I wish people would stop doing this and instead focus on building relatively simple systems and not trying to chase 1% improvement in performance by making the system 3X more complicated.
No shit. The majority of students I met in Computer Science weren't skilled in math but took it as a requirement to code. Why would you expect them to just latch onto Machine Learning?
ML is one of the easiest fields out there. When I learned it I was actually turned off by how simplistic the concept was. Of course, let me preface this by saying that it's hard to develop the intuition and skill, in the same way learning to skateboard is hard. But conceptually it's easy and very possible for almost anyone.
The whole thing is just curve fitting. Literally finding some best fit curve across a series of points. This is very very easy for any software engineer to understand. I literally lost interest when I found out that the entire field was just all about messing with the data and the curve to try to get things to fit.
Literally it's just about eyeballing the data and qualitatively picking and training the thing that looks like it's the best fit. But because the data is N-dimensional and in the millions, it's impossible to eyeball it with your physical eyes, so you have to come up with other techniques equivalent to eyeballing it.
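To make the "curve fitting" framing concrete, here is a hedged toy sketch: fit a line to noisy points by gradient descent. A real model does the same thing with vastly more parameters and dimensions, which is exactly why you can no longer eyeball it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)
y = 3.0 * x + 1.0 + 0.1 * rng.standard_normal(100)   # noisy samples of an unknown "true" curve

w, b = 0.0, 0.0          # the "curve" we are fitting: y ~ w*x + b
lr = 0.5
for _ in range(500):
    pred = w * x + b
    grad_w = 2.0 * np.mean((pred - y) * x)   # d(MSE)/dw
    grad_b = 2.0 * np.mean(pred - y)         # d(MSE)/db
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)   # ends up close to the 3 and 1 the data was generated with
```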
Douglas Hofstadter had this whole theory of consciousness and when he found out that an LLM was a simple feed forward network with no feedback loops he went into a crisis. Basically his whole theory in GEB was wrong, according to him.
This stuff is NOT quantum physics. It's startling how simple it is and that's one of the big mysteries about it.
We only understand and build these things at a high level. At the very low level we don't actually understand what's going on. As I stated earlier, we understand ML the same way a person understands data from an "eye-ball" perspective, so it's impossible to explain what exactly went on inside ChatGPT when it answered a specific question correctly.
Moravec's paradox is the observation in artificial intelligence and robotics that, contrary to traditional assumptions, reasoning requires very little computation, but sensorimotor and perception skills require enormous computational resources. The principle was articulated by Hans Moravec, Rodney Brooks, Marvin Minsky and others in the 1980s. Moravec wrote in 1988, "it is comparatively easy to make computers exhibit adult level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility".
> Moravec wrote in 1988, "it is comparatively easy to make computers exhibit adult level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility".
The resolution to the paradox is so simple I must be missing something. The amount of data in datasets for 'mobility' is basically zero. You would have to manually construct such a dataset. Whereas, humans have for thousands of years been trained to symbolically encode their reasoning processes in a way that has been incredibly accessible to computers (prose).
If I understand correctly, the scaling laws for mobility are the same as for language and reasoning. We need more data.
The training data for chess is very easy to come by in comparison to walking. If you had the same amount of data for both, and the ability to get it, understand it and use it, you wouldn't have a problem.
Basically it's hard to make a machine use and understand how to use its physical form, and in 1988 it was even harder. For chess it was easy. It's easy to get, understand and use chess data.
I probably shouldn’t take the bait here, but this reads like someone took intro to ML and thinks that’s all there is to know. “Just” fitting a curve couldn’t be more reductionist and discount the work of a ton of incredibly intelligent people.
I tend to agree that I don’t really find ML work all that interesting (much more interested in making it go fast :)), but simple it is not.
Put it this way: it's extremely challenging and not simple at all to walk and balance on a wire. Tightrope walking is not simple at all, because very few people can do it.
But tightrope walking is different from something like quantum physics. I may not be able to tightrope walk, but I can understand the concept in its entirety. For quantum physics, many people will never truly understand it.
What I'm saying is this, ML is tightrope walking. Challenging, not simple, but NOT quantum physics. It only seems like quantum physics.
What are ML jobs about? I have this vague notion that you spend a lot of time gathering/cleaning data and throwing things at the wall, but maybe that's not accurate.
I've always been stronger at discrete type math/programming, which is why I tend to shy away from statistics-based stuff like ML.
One thing to note is that LLMs are indeed feed-forward; however, the generation of text (from my understanding) is recursive, in that you feed each output token into another forward pass of the neural network.
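A hedged sketch of that loop using the huggingface API (gpt2 is just a small, convenient checkpoint here, and greedy decoding is chosen for brevity):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("Machine learning is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):                                # one forward pass per new token
        logits = model(ids).logits[:, -1, :]           # distribution over the next token
        next_id = logits.argmax(dim=-1, keepdim=True)  # greedy pick; sampling also works
        ids = torch.cat([ids, next_id], dim=-1)        # feed the output back in as input

print(tok.decode(ids[0]))
```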
> I've always been stronger at discrete type math/programming, which is why I tend to shy away from statistics-based stuff like ML.
I think there's a major misconception that ML in the form of deep learning is about statistics. There's no statistics in deep learning models. There are some statistical measurements made of final models, much in the same way a good computer science paper covering implementations of discrete data structures might make statistical statements about the performance of the author's implementation, but transformers, traditional neural nets, and backprop themselves have nothing to do with statistics.
Curve fitting a set of data is essentially the same thing as understanding something at a high level. We have data points but no way to extract the low level exact equation that generated that data...
So we create a curve and estimate it. We will never know the true equation. Additionally the curve has hundreds of dimensions and is essentially something that can't be visualized or understood cohesively. We have this neural network that represents the curve but the neural network is a black box.
> Douglas Hofstadter had this whole theory of consciousness and when he found out that an LLM was a simple feed forward network with no feedback loops he went into a crisis. Basically his whole theory in GEB was wrong, according to him.
LLMs are not conscious. The training process for an LLM is not a feed forward network. If we were going to try to fit the idea of consciousness a la humanity (which is really the only fully 'conscious' creature we know of) into LLMs, then 'running' an LLM is identical to cloning a frozen human, thawing it, firing some neurons, reading the result and then destroying the clone.
A better argument for actual consciousness would come from the training process, but that itself is also dubious. It's unlikely consciousness is an emergent phenomenon. Or rather, such a claim is extraordinary and would require an extraordinary amount of proof, which GEB does not provide, sorry.
LLMs are all about "let's think this through step by step". That's literally a loop in the programming/software engineering sense; it's just expressed via natural language.