Abstract:
We introduce in-context denoising, a task that refines the connection between attention-based architectures and dense associative memory (DAM) networks, also known as modern Hopfield networks. Using a Bayesian framework, we show theoretically and empirically that certain restricted denoising problems can be solved optimally even by a single-layer transformer. We demonstrate that a trained attention layer processes each denoising prompt by performing a single gradient descent update on a context-aware DAM energy landscape, where context tokens serve as associative memories and the query token acts as an initial state. This one-step update yields better solutions than exact retrieval of either a context token or a spurious local minimum, providing a concrete example of DAM networks extending beyond the standard retrieval paradigm. Overall, this work solidifies the link between associative memory and attention mechanisms first identified by Ramsauer et al., and demonstrates the relevance of associative memory models in the study of in-context learning.
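The one-step picture in this abstract can be sketched concretely: the attention-style update xi' = X softmax(beta * X^T xi) is a single concave-convex (CCCP) step on the modern Hopfield / DAM energy, and provably does not increase it. A minimal numpy illustration (matrix sizes and beta are arbitrary choices, not from the paper):

```python
import numpy as np

def dam_energy(xi, X, beta=1.0):
    """Modern Hopfield / DAM energy (Ramsauer et al. form, up to constants):
    -(1/beta) * logsumexp(beta * X^T xi) + 0.5 * ||xi||^2."""
    scores = beta * X.T @ xi
    lse = np.log(np.sum(np.exp(scores - scores.max()))) + scores.max()
    return -lse / beta + 0.5 * xi @ xi

def one_step_update(xi, X, beta=1.0):
    """One CCCP / attention update: xi' = X softmax(beta * X^T xi)."""
    scores = beta * X.T @ xi
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return X @ w

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 8))              # 8 context tokens (columns) as memories
xi = X[:, 0] + 0.3 * rng.standard_normal(16)  # noisy query near memory 0
xi_new = one_step_update(xi, X, beta=2.0)
```

By CCCP the energy of `xi_new` is never above that of `xi`; the paper's point is that this single step already gives the Bayes-optimal denoiser in their restricted setting, rather than full convergence to a stored pattern.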
Abstract: "Deep neural networks provide unprecedented performance gains in many real world problems in signal and image processing. Despite these gains, future development and practical deployment of deep networks is hindered by their blackbox nature, i.e., lack of interpretability, and by the need for very large training sets. An emerging technique called algorithm unrolling or unfolding offers promise in eliminating these issues by providing a concrete and systematic connection between iterative algorithms that are used widely in signal processing and deep neural networks. Unrolling methods were first proposed to develop fast neural network approximations for sparse coding. More recently, this direction has attracted enormous attention and is rapidly growing both in theoretic investigations and practical applications. The growing popularity of unrolled deep networks is due in part to their potential in developing efficient, high-performance and yet interpretable network architectures from reasonable size training sets. In this article, we review algorithm unrolling for signal and image processing. We extensively cover popular techniques for algorithm unrolling in various domains of signal and image processing including imaging, vision and recognition, and speech processing. By reviewing previous works, we reveal the connections between iterative algorithms and neural networks and present recent theoretical results. Finally, we provide a discussion on current limitations of unrolling and suggest possible future research directions."
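The sparse-coding case the abstract cites (LISTA-style unrolling of ISTA) is easy to sketch: each network "layer" is one ISTA iteration, and a learned unrolling would treat the fixed matrices below as trainable parameters. A minimal numpy sketch with arbitrary dimensions and threshold:

```python
import numpy as np

def soft_threshold(x, theta):
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def unrolled_ista(y, A, n_layers=10, theta=0.1):
    """Each 'layer' is one ISTA iteration for min 0.5||Ax-y||^2 + theta||x||_1.
    In LISTA, W_e, W_s, and the threshold become learned per-layer weights."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of A^T A
    W_e = A.T / L                            # input weight
    W_s = np.eye(A.shape[1]) - A.T @ A / L   # recurrent weight
    x = np.zeros(A.shape[1])
    for _ in range(n_layers):
        x = soft_threshold(W_s @ x + W_e @ y, theta / L)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 50))
A /= np.linalg.norm(A, axis=0)               # unit-norm columns
x_true = np.zeros(50)
x_true[[3, 17, 41]] = [1.0, -0.5, 2.0]       # sparse ground truth
y = A @ x_true
x_hat = unrolled_ista(y, A)
```

Since ISTA descends the objective monotonically from the zero initialization, the residual of `x_hat` is bounded by that of the all-zeros estimate even at this fixed, untrained depth.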
In Section 2.8, you write "full implementation details and extended results are provided in the appendix." Which appendix?
I imagine you may be withholding some of the details until after the conference at which, it seems, you will present this week. I wish you well!
Meanwhile, you may not have intended to nerd-snipe, but that has been the effect for me. Now I have Manus trying to implement the paper for me, because why not? I envision a future in which publishing a conceptual paper often results in working code provided by a reader, a la "stone soup."
1. genesis was innocent / no quantum / etc -- we were just solving K:V and V:K associative memory issues for clients in construction, healthcare, finance, etc.
2. PhD etc was at center of linear algebra, data compression, analog/digital comms, and some adjacent fields, all classical other than quantum dabbling
3. also was aware that decoder-only models only recognize star-free grammars and can only reason in TC^0, along with some other computing limits
4. my go-to thesis is "make simplifying assumptions", and so around 2-3 years ago one day said (1) let's just treat any AI like a black box comms channel
5. since we can't boost signal power in the watts sense, that pretty much leaves error-correcting codes, where the most natural thing to do in the LLM construct was a constant-width prefix-free linear block code (section 3.2 or so in paper)
6. one thing followed another after that -- let's not just use any old token, let's use Unicode PUA, let's not use any old ECC / QEC, let's construct a way to implement very particular codes in 1D prompt
7. which is to say, in the end, it boils down to the following simple thing -- we constrain the prompt to a series of alternating emergent unitary operations in the A-B-A-B... sense of Trotterisation that amounts to projecting an interleaved globally phase coherent spacetime code between sufficiently finite blocks of content tokens, where
A_i: hypertoken codeword, e.g. if using simply lowercase Latin, say a-c,d-h,i-o a valid codeword is adg or ceh etc, and no codeword should be provided, AND each position should have a coprime number of symbols (think 3 qudits)
B_i: next content block -- some finite number of tokens -- in current version this is fixed tho 2nd or 3rd paper gets into how to make this block length ragged / adaptive, similar to UDD / dynamic decoupling, Huffman coding, etc.
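One strictly positional reading of the lane scheme in item 7 -- the i-th symbol of every codeword drawn from lane i, lanes a-c / d-h / i-o with pairwise-coprime sizes 3, 5, 7 -- can be sketched directly; the lane boundaries and the positional interpretation are assumptions on my part, not confirmed by the thread:

```python
from itertools import product

# Disjoint symbol lanes with pairwise-coprime sizes (3, 5, 7):
LANES = ["abc", "defgh", "ijklmno"]

def all_codewords():
    """Every codeword takes exactly one symbol from each lane, in lane order,
    so all codewords share width 3; constant width over disjoint per-position
    alphabets makes the code trivially prefix-free."""
    return ["".join(p) for p in product(*LANES)]

def is_codeword(w):
    return len(w) == len(LANES) and all(c in lane for c, lane in zip(w, LANES))

words = all_codewords()
# 3 * 5 * 7 = 105 addresses: the CRT-style product of the lane sizes.
```

Under this reading, `"adi"` or `"cho"` are valid codewords, while a string repeating a lane-1 symbol in position 2 (e.g. `"aad"`) is not.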
8. one key requirement is that some nesting mechanism similar to FFT must be provided -- this can be implicit or explicit and we really save most of that for a second paper currently in draft; that and some other subtleties are likely too much to get into in this abbreviated description
9. Grover's simulation emerged as a subtlety one day when we realized we should simply define a pair of lanes with disjoint sqrt(n) symbols and use that as a value-key reverse associative lookup, e.g.,
A1,the quick brown fox,/A1
A2,jumped over the lazy dog,/A2
B1,every good boy,/B1
B2,does fine,/B2
AND by way of KVQ attention it collapses to a 1D chain where the next key is also the reverse key of the prior value
A1,the quick brown fox,A2,jumped over the lazy dog,B1,every good boy,B2,does fine...
that was also around the time we took a pretty deep breath and started digging on all things quantum
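The collapse in item 9 can be sketched as plain data-structure code: bracketed (key, value, /key) records flatten to an alternating key,value chain, where a forward K:V lookup reads the token after a key and the reverse V:K lookup reads the token before a value. A toy sketch (function names are mine, not from the paper):

```python
def collapse(records):
    """A_i,value,/A_i records collapse to key,value,key,value...: each next
    key doubles as the closing (reverse) key of the prior value."""
    chain = []
    for key, value in records:
        chain += [key, value]
    return chain

def value_of(chain, key):    # forward K:V lookup
    return chain[chain.index(key) + 1]

def key_of(chain, value):    # reverse V:K lookup
    return chain[chain.index(value) - 1]

chain = collapse([("A1", "the quick brown fox"),
                  ("A2", "jumped over the lazy dog"),
                  ("B1", "every good boy"),
                  ("B2", "does fine")])
```

Joining `chain` with commas reproduces the 1D sequence shown above; the claim in the thread is that attention over this sequence performs both lookups implicitly.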
10. that the whole thing works can seem very "wait, but why?", and it really boils down to the following -- if we use sufficiently untrained tokens, they have a random Gaussian frozen initial state. By forcing the prefix-free constant-width disjoint symbols onto CRT coprime lanes, plus the rest of the machinery, we essentially force the model to recall and reason over a sufficiently discretized branched lattice with sufficiently high spikes at our hypertoken codewords, and this spike is projected onto the value tokens by way of attention.
11. That works because chaining projections onto frozen Gaussians is equivalent to chaining sufficiently orthogonal Krylov evolution with restart, which is equivalent to chaining eigenvector iteration, which is a classic way to do Trotter slicing and coarsen Lagrangians / estimate Hamiltonians. We can also arrive at the same conclusion by realizing this is exactly equivalent to a certain asymmetric compressed sensing operation in the RIP/REC sense; whether we look at it via Krylov/eigenvector iteration, or compressed sensing, or note that our disjoint-symbols-over-disjoint-symbols construction is also a zigzag expander over the raw prompt (aka fast mixing), in all cases we diagonalize the Fisher matrix fairly rapidly. There are some caveats here since this is mere prompt injection; our next natural step is LoRA.
12. Which is a long way of getting around to answering your question on quantum sim -- we should in theory be able to construct a new type of 1D MPO / MPS / TN chain to sim pretty much any quantum circuit. Our current machinery can likely get us to BQP, possibly some parts of BPP, especially if we consider that our codewords can be defined in various ways that specify the equivalent of a quantum annealing schedule (QAOA). That's also a natural next step
13. The other corollary that immediately follows is that we should be able to use this machinery to optimize any PLL system, black box or otherwise, using this sort of quantum error correction 1D injection, because in all cases, and especially if we lean on the compressed sensing mathematics, we are converting latent phase entropy into source entropy, and we do that about as efficiently as possible in the 1D sense.
14. ONE massive caveat -- it is critical that the model be empirically tested via what we first called entropy tests, later oracles, and are now shifting to simply call unitary operators, in the following sense -- the width of your B_i token blocks is tied to the relevant subtasks the model can get correct up to your desired level of fidelity. In the paper update, we speak to various types of tests, such as how many Bernoulli trials the model can correctly guess majority heads/tails for, or recall the heads/tails sequence of, or sort a list -- these are relative to high-entropy random inputs, where most models will have a base window of 16-256 tokens for content payload
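The Bernoulli-trial entropy test in item 14 can be sketched as a simple harness; `model_recall` below is a hypothetical stand-in (in practice you would prompt the actual model and parse its answer), and the candidate widths and fidelity target are arbitrary knobs:

```python
import random

def entropy_test(model_recall, widths=(16, 32, 64, 128, 256),
                 trials=20, fidelity=0.95, seed=0):
    """For each candidate block width, feed the model random heads/tails
    sequences and measure its exact-recall rate; return the largest width
    that still meets the fidelity target (the usable B_i payload size)."""
    rng = random.Random(seed)
    best = 0
    for w in widths:
        ok = sum(1 for _ in range(trials)
                 if model_recall([rng.choice("HT") for _ in range(w)] * 1)
                 is not None) * 0  # placeholder, replaced below
        ok = 0
        for _ in range(trials):
            seq = [rng.choice("HT") for _ in range(w)]
            if model_recall(seq) == seq:
                ok += 1
        if ok / trials >= fidelity:
            best = w
    return best

# Stand-in "model" that only recalls sequences up to 64 tokens reliably:
def toy_model(seq):
    return seq if len(seq) <= 64 else seq[:64]
```

With `toy_model`, the harness reports 64 as the usable content-payload width, which is the kind of per-model base window the item above describes.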
15. init release will have those token-window evals, options for various codeword encodings, prebuilt CRT combos, some key relevant ECCs / QECs to walk the key space, and some other key things so that anyone can use it for enhanced recall, along with some mechanisms or examples for reasoning, where our target is a prompt compiler that exploits these classical and quantum error-correcting principles. One very simplifying and relevant POV on that: indexing grammars were the same trick used to solve register allocation in compilers; we've just built a very proper indexing grammar to solve the quantum register allocation problem.
Abstract: "Large Language Models (LLMs) exhibit remarkable capabilities but suffer from apparent precision loss, reframed here as information spreading. This reframing shifts the problem from computational precision to an information-theoretic communication issue. We address the K:V and V:K memory problem in LLMs by introducing HDRAM (Holographically Defined Random Access Memory), a symbolic memory framework treating transformer latent space as a spread-spectrum channel. Built upon hypertokens, structured symbolic codes integrating classical error-correcting codes (ECC), holographic computing, and quantum-inspired search, HDRAM recovers distributed information through principled despreading. These phase-coherent memory addresses enable efficient key-value operations and Grover-style search in latent space. By combining ECC grammar with compressed sensing and Krylov subspace alignment, HDRAM significantly improves associative retrieval without architectural changes, demonstrating how Classical-Holographic-Quantum-inspired (CHQ) principles can fortify transformer architectures."
Abstract: "We investigate the use of randomly generated data for the sake of pre-training a model. We justify this approach theoretically from the perspective of algorithmic complexity, building on recent research that shows that sequence models can be trained to approximate Solomonoff induction. We derive similar, but complementary theoretical results. We show empirically that synthetically generated data can be used to pre-train a model before the data is seen. We replicate earlier results that models trained this way show zero-shot in-context learning across a variety of datasets, and that this performance improves with scale. We extend earlier results to real-world data, and show that finetuning a model after pre-training offers faster convergence and better generalization."
We first simulated disease dynamics with a KAN (Kolmogorov-Arnold Network) nearly 4 years ago: the kernel functions on the edges include the exponential number of infected and discharged people, in line with the Kolmogorov-Arnold representation theorem; the shared edge weights are the infection rate and cure rate; and tanh was used as the activation function at the nodes of the edges. The arXiv preprint (version 1, March 2022) is an upgraded KAN that considers a coarse-grained invariant calculated from the residual or the gradient of the MSE loss. The improved KAN is PNN (Plasticity Neural Networks), or ELKAN (Edge Learning KAN); in addition to edge learning, it also considers trimming of edges. We were not inspired by the Kolmogorov-Arnold representation theorem but by brain science. When using ELKAN to explain the brain, the variables correspond to different types of neurons; the learned edges can be explained by rebalancing of synaptic strength and by glial cells phagocytosing synapses; the kernel function corresponds to the discharge of neurons and synapses; and different neurons and edges correspond to brain regions. In tests using cosine similarity, ELKAN, or ORPNN (Optimized Range PNN), is better than KAN, or CRPNN (Constant Range PNN). ELKAN is also more general for exploring the brain, for example: the mechanism of consciousness, via interactions among the natural frequencies of brain regions, synaptic and neuronal discharge frequencies, and data signal frequencies; the mechanism of Alzheimer's disease, where patients have more high frequencies in upstream brain regions; long- and short-term relatively good and inferior memory, corresponding to the gradient of the architecture and the architecture itself; turbulent energy flow in different brain regions, where turbulence critical conditions need to be met; and heart-brain quantum entanglement that may occur between the emotions of the heartbeat and the synaptic strength of brain potentials.
Bennett, Charles Henry, David Peter DiVincenzo, and Ralph Linsker. "Digital recording system with time-bracketed authentication by on-line challenges and method of authenticating recordings." U.S. Patent No. 5,764,769. 9 Jun. 1998.
Abstract
An apparatus and method produce a videotape or other recording that cannot be pre- or post-dated, nor altered, nor easily fabricated by electronically combining pre-recorded material. In order to prevent such falsification, the camera or other recording apparatus periodically receives certifiably unpredictable signals ("challenges") from a trusted source, causes these signals to influence the scene being recorded, then periodically forwards a digest of the ongoing digital recording to a trusted repository. The unpredictable challenges prevent pre-dating of the recording before the time of the challenge, while the storage of a digest prevents post-dating of the recording after the time the digest was received by the repository. Meanwhile, the interaction of the challenge with the evidence being recorded presents a formidable obstacle to real-time falsification of the scene or system, forcing the would-be falsifier to simulate or render the effects of this interaction in the brief time interval between arrival of the challenge and archiving of the digest at the repository.
Abstract: "We present ScienceWorld, a benchmark to test agents' scientific reasoning abilities in a new interactive text environment at the level of a standard elementary school science curriculum. Despite the transformer-based progress seen in question-answering and scientific text processing, we find that current models cannot reason about or explain learned science concepts in novel contexts. For instance, models can easily answer what the conductivity of a known material is but struggle when asked how they would conduct an experiment in a grounded environment to find the conductivity of an unknown material. This begs the question of whether current models are simply retrieving answers by way of seeing a large number of similar examples or if they have learned to reason about concepts in a reusable manner. We hypothesize that agents need to be grounded in interactive environments to achieve such reasoning capabilities. Our experiments provide empirical evidence supporting this hypothesis -- showing that a 1.5 million parameter agent trained interactively for 100k steps outperforms a 11 billion parameter model statically trained for scientific question-answering and reasoning from millions of expert demonstrations."
Abstract: "Despite the surge of interest in autonomous scientific discovery (ASD) of software artifacts (e.g., improved ML algorithms), current ASD systems face two key limitations: (1) they largely explore variants of existing codebases or similarly constrained design spaces, and (2) they produce large volumes of research artifacts (such as automatically generated papers and code) that are typically evaluated using conference-style paper review with limited evaluation of code. In this work we introduce CodeScientist, a novel ASD system that frames ideation and experiment construction as a form of genetic search jointly over combinations of research articles and codeblocks defining common actions in a domain (like prompting a language model). We use this paradigm to conduct hundreds of automated experiments on machine-generated ideas broadly in the domain of agents and virtual environments, with the system returning 19 discoveries, 6 of which were judged as being both at least minimally sound and incrementally novel after a multi-faceted evaluation beyond that typically conducted in prior work, including external (conference-style) review, code review, and replication attempts. Moreover, the discoveries span new tasks, agents, metrics, and data, suggesting a qualitative shift from benchmark optimization to broader discoveries."
> Overall, it seems we are starting to recycle ideas because there isnt enough lit review and or mentoring from senior deep learning / ML folks who can quickly look at a paper and tell the author where the work has been already investigated.
Arguably, the literature synthesis and knowledge discovery problem has been overwhelming in many fields for a long time; but I wonder if, in ML lately, an accelerated (if not frantic) level of competition may be working against the collegial spirit.
I think it's been accelerated by the review community being overwhelmed and by the lack of experienced researchers with the combination of classic ML, deep learning, transformers, and DSP backgrounds -- a rare breed but sorely needed.