Something that always bothers me about these explanations is that they usually forget about numerical errors. You can't just abstract away multiplying coefficients as "constant time". You may as well abstract away the entire multiplication to begin with! If you take numerical precision into account, it's closer to O(n (log n)^3) [1].
This is why I keep coming back to HN. You read an interesting article, a little proud you understand half of it, read a question that already makes you feel like the stupidest person in the room, then read a clarifying answer by someone who probably got a Knuth reward check for correcting an erratum in The Art of Computer Programming.
For FFT with floating-point numbers, another paper by Arnold Schönhage from 1982 [1] already gives a bound of Psi(n l) operations, where n is the number of coefficients and l is the desired precision (typically 53 bits for double precision). Psi(m) is the time to multiply two integers with m digits, which has been known since 2021 to be O(m log m) [2]. So the current bound is O(nl log(nl)).
It would be great if we could minimize the multiplication errors, and perhaps do away with the errors altogether, by utilizing the quaternion-based operations described in the articles [1], [2], [3].
[1] One-Dimensional Quaternion Discrete Fourier Transform and an Approach to Its Fast Computation:
But if the coefficients are integers, you can use NTT with a big enough modulus and get exact results and a boost (esp. in hardware) in multiplication time.
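For the curious, a rough Python sketch of the idea (not how an optimized or hardware NTT is laid out): the same divide-and-conquer as the FFT, but over the integers modulo an NTT-friendly prime, so the convolution is exact. The prime 998244353 with primitive root 3 is a standard choice; the function names are just illustrative.

    MOD = 998244353          # 119 * 2^23 + 1, a common NTT-friendly prime
    ROOT = 3                 # primitive root mod MOD

    def ntt(a, invert=False):
        """Cooley-Tukey recursion over the integers mod MOD instead of the complex numbers."""
        n = len(a)           # n must be a power of two
        if n == 1:
            return list(a)
        w_n = pow(ROOT, (MOD - 1) // n, MOD)      # principal n-th root of unity mod MOD
        if invert:
            w_n = pow(w_n, MOD - 2, MOD)          # its inverse, for the inverse transform
        even, odd = ntt(a[0::2], invert), ntt(a[1::2], invert)
        out, w = [0] * n, 1
        for k in range(n // 2):
            t = w * odd[k] % MOD
            out[k] = (even[k] + t) % MOD
            out[k + n // 2] = (even[k] - t) % MOD
            w = w * w_n % MOD
        return out

    def multiply(p, q):
        """Exact convolution of integer coefficient lists p and q (mod MOD)."""
        n = 1
        while n < len(p) + len(q) - 1:
            n *= 2
        fp = ntt(p + [0] * (n - len(p)))
        fq = ntt(q + [0] * (n - len(q)))
        prod = ntt([x * y % MOD for x, y in zip(fp, fq)], invert=True)
        inv_n = pow(n, MOD - 2, MOD)              # undo the factor of n from the inverse transform
        return [x * inv_n % MOD for x in prod][: len(p) + len(q) - 1]

    print(multiply([2, 1], [3, 1]))               # [6, 5, 1], i.e. (x+2)(x+3) = x^2 + 5x + 6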
Is there a way to find groups with easy generators/primitive roots? I imagine you'd want a small root of unity, but also be able to choose a bigger modulus for extra big multiplications. Also, afaik it's discrete-logarithm level of difficulty to even find a generator if you choose a random modulus, though I don't know if it's easier to find a modulus after you choose the generator.
since the groups you're looking at are of size log(n), you can do a lot of work here without issue. As long as that setup is asymptotically less work, it doesn't affect the runtime.
> though I don't know if it's easier to find a modulus after you choose the generator.
Sure, pick a large prime p. Double it and add 1 (and call it n). If n is still prime, then you know the prime factorization of n-1. Pick your candidate generator, and check whether raising it to the power p gives 1, or whether squaring it gives 1. If neither does, it's a generator of the multiplicative group mod n.
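A small sketch of that recipe in Python (sympy's primality helpers are used purely for convenience; the function name is mine):

    from sympy import isprime, randprime

    def safe_prime_and_generator(bits=32):
        """Find n = 2p + 1 with both p and n prime, then a generator of (Z/nZ)*."""
        while True:
            p = randprime(2**(bits - 1), 2**bits)
            n = 2 * p + 1
            if not isprime(n):
                continue
            # n - 1 = 2p, so g is a generator iff g^2 != 1 and g^p != 1 (mod n)
            for g in range(2, n):
                if pow(g, 2, n) != 1 and pow(g, p, n) != 1:
                    return n, g

    n, g = safe_prime_and_generator()
    print(f"modulus {n}, generator {g} of order {n - 1}")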
You can use this method to multiply long numbers together. The key insight you need is that polynomial multiplication is the same as normal long-number multiplication without doing the carries.
So suppose you have a 1000-digit number. You take each digit and use it as a coefficient of a 1000-element polynomial. You can then multiply these polynomials together using the FFT method described in the article. To convert the result back to a number you need to do the carries: if a coefficient is 10 or more, you carry the excess into the next digit. You then read off the digits to get the final number.
That's the basic idea. There is some subtlety I've glossed over about the precision needed for the carries and about making sure that rounding the FFT result to the nearest integer is correct. That is the way big-number multiplication is done in GMP, which is the leading library for this sort of thing.
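A rough sketch of that pipeline in Python/NumPy, ignoring the precision caveats just mentioned (the function name is only illustrative):

    import numpy as np

    def multiply(a: int, b: int) -> int:
        """Multiply two non-negative integers via FFT on their digit lists."""
        da = [int(d) for d in str(a)][::-1]   # least significant digit first
        db = [int(d) for d in str(b)][::-1]
        size = len(da) + len(db)              # the product has at most this many digits
        fa, fb = np.fft.rfft(da, size), np.fft.rfft(db, size)
        coeffs = np.rint(np.fft.irfft(fa * fb, size))  # convolution of the digit lists
        carry, result = 0, 0
        for i, c in enumerate(coeffs):        # do the carries and rebuild the integer
            carry, digit = divmod(int(c) + carry, 10)
            result += digit * 10**i
        return result

    assert multiply(983, 124) == 983 * 124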
Thanks for the comment. That makes sense, since, as you say, a base-10 number can be expressed as a polynomial evaluated at x=10. E.g. 983 = 9x^2 + 8x + 3, aka [9, 8, 3]. Wondering how big the number has to be for this to make sense, and where this is used in practice.
FFT is used extensively in curve-based zero-knowledge cryptography, not for the purpose of splitting a single large number into smaller ones, but to interpolate and evaluate very large polynomials (for example of degree 2^27).
All of this happens in the field of an elliptic curve, so the complexity reduction is greatly appreciated.
GMP has lots of other methods in between schoolbook multiplication and FFT multiplication. A nice one is Karatsuba multiplication, which is very easy to understand and delivers O(n^1.58) rather than O(n^2) performance. Python uses this method for multiplying large numbers together.
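If you want to see how little code Karatsuba takes, here is a sketch in plain Python (splitting in base 10 for readability rather than speed):

    def karatsuba(x: int, y: int) -> int:
        """Three recursive multiplications on half-size numbers instead of four."""
        if x < 10 or y < 10:                  # base case: single digits
            return x * y
        m = max(len(str(x)), len(str(y))) // 2
        high_x, low_x = divmod(x, 10**m)
        high_y, low_y = divmod(y, 10**m)
        z0 = karatsuba(low_x, low_y)
        z2 = karatsuba(high_x, high_y)
        z1 = karatsuba(low_x + high_x, low_y + high_y) - z0 - z2
        return z2 * 10**(2 * m) + z1 * 10**m + z0

    assert karatsuba(12345678, 87654321) == 12345678 * 87654321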
PS: if you're interested in multiplying "ludicrously large numbers", Harvey and van der Hoeven had a nice breakthrough and got multiplication down to "FFT speed" (n*log(n)), see
Who was the first person to propose FFTs for faster polynomial multiplication?
Got curious about this recently. I’m not great at citation tracing, but did make it back to this 1995 paper by David Eppstein [0] where he uses it to efficiently solve Subset Sum after an incremental update. Surely Knuth’s TAOCP had it even earlier?
The fact that FFT polynomial multiplication also lets you solve Exact Subset Sum with Repetition in sub-exponential time came as a real shock to me. [1] Crucially, this algo is O(N log N) where N = the maximum element, not N = the set size, so it isn’t a P ≠ NP counterexample or anything.
Pollard [1], Nicholson [2], and Schonhage-Strassen [3] seem to have come up with it independently around the same time, using different approaches. Strassen is said to have discovered the Pollard approach in 1968 but there is no (written) record of it.
It should also be noted that, while it was not exactly the birth of the FFT, Cooley and Tukey's 1965 paper [4] was what kickstarted research on the FFT and its applications. The multiplication papers above came just a few years after that.
Interesting. I've just implemented an algorithm (matrix profile) that makes use of the FFT to compute a big set of dot products of time-series subsequences, where the length n of the time series can be in the hundreds of millions. The fast convolution computation using the FFT reduces the computation time from O(n^2) to O(n log n), with awesome speed gains at this scale. Throw in a GPU and it gets even faster, like processing 10 million data points in 0.1 seconds on a laptop.
The key "trick" of the operation seems to be this revelation:
> In other words, performing the convolution of two signals in time domain is equivalent to multiplying them in the frequency domain.
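To make that concrete, here is a rough NumPy sketch of the sliding-dot-product trick (the function name is mine, not from any particular library): convolving the series with the reversed query, via pointwise multiplication in the frequency domain, gives the dot product of the query against every window at once.

    import numpy as np

    def sliding_dot_products(series, query):
        """Dot product of `query` against every length-len(query) window of `series`."""
        n, m = len(series), len(query)
        # convolution in the time domain = pointwise product of the transforms
        conv = np.fft.irfft(np.fft.rfft(series, n + m) *
                            np.fft.rfft(query[::-1], n + m), n + m)
        return conv[m - 1 : n]

    series, query = np.random.rand(1000), np.random.rand(50)
    naive = np.array([series[i:i + 50] @ query for i in range(1000 - 50 + 1)])
    assert np.allclose(sliding_dot_products(series, query), naive)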
Great article, as it breaks down a complex idea into much smaller steps that even my math-challenged mind can somehow grasp, but did I miss a step? Or is it left as an exercise for the reader to look up? I was already stretching my math ability up to that point, but it felt a little bit like "and then, draw the rest of the F-ing owl" to me. Is it just me?
Thanks! In case it helps:
* Multiplying two polynomials as taught in school is in reality a convolution.
* "performing the convolution of two signals in the time domain is equivalent to multiplying them in the frequency domain"
* FFT allows us to convert from time domain to frequency domain
* We use FFT to convert our polynomial to frequency domain.
* Once we are in the frequency domain, we just need to multiply point by point, which is faster than a convolution. An inverse FFT then converts the product back to coefficients.
Does it clarify the missing step? Happy to update the post with what's missing.
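If it helps, those bullets are only a few lines of NumPy; a rough sketch (rounding at the end because the FFT works in floating point):

    import numpy as np

    def poly_mult(p, q):
        """Multiply coefficient lists (lowest degree first) via FFT."""
        n = len(p) + len(q) - 1                          # number of coefficients in the product
        fp, fq = np.fft.rfft(p, n), np.fft.rfft(q, n)    # to the frequency domain
        return np.rint(np.fft.irfft(fp * fq, n)).astype(int)  # multiply, transform back, round

    print(poly_mult([2, 1], [3, 1]))                     # [6 5 1]  ->  (x+2)(x+3) = x^2 + 5x + 6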
Yep, it definitely helps! What I'm struggling to understand is why "performing the convolution of two signals in the time domain is equivalent to multiplying them in the frequency domain" (I'm going to google/gpt it, this is more of an exercise left for me to dive into, your post is perfect as it is)
So integer factoring is a discrete deconvolution? I wonder if juxtaposing the FFT representation (inverse pointwise multiplication) and the tableaux (regular long multiplication / carry add) could break the symmetry and get enough information for a fast algorithm?
Sure, the naive method of multiplying polynomials is slow in the degree of the polynomial. But when does one have to deal with two degree 100 polynomials?
My impression is that this sort of method isn't used by computer algebra systems for this reason.
Computer algebra systems (such as Chebfun in MATLAB) convert arbitrary functions to 100-degree+ polynomials to make it easier to find roots, optima, etc.
When I wanted to reverse-engineer some CRC checksum parameters for larger files, I made a program [1] that converts the files into million-degree GF(2) polynomials and calculates their GCD, which is only possible in reasonable time with FFT-based multiplication.
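For anyone curious what the GF(2) arithmetic looks like, here is a toy Python sketch with polynomials stored as integer bit masks. (The speed at million-degree sizes comes from FFT-based multiplication and half-GCD style algorithms, which this plain Euclidean version doesn't show.)

    def gf2_mul(a: int, b: int) -> int:
        """Carry-less multiplication: polynomials over GF(2) as bit masks."""
        result = 0
        while b:
            if b & 1:
                result ^= a
            a, b = a << 1, b >> 1
        return result

    def gf2_mod(a: int, b: int) -> int:
        """Remainder of a divided by b in GF(2)[x]: subtraction is XOR."""
        while a.bit_length() >= b.bit_length():
            a ^= b << (a.bit_length() - b.bit_length())
        return a

    def gf2_gcd(a: int, b: int) -> int:
        while b:
            a, b = b, gf2_mod(a, b)
        return a

    p = 0b1011                                   # x^3 + x + 1, irreducible over GF(2)
    a, b = gf2_mul(p, 0b11), gf2_mul(p, 0b111)   # p*(x+1) and p*(x^2+x+1)
    print(bin(gf2_gcd(a, b)))                    # 0b1011: the common factor recovered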
This convolutional view and fast GPU kernels for FFT were used in some pre-Mamba state space models for long sequence modeling where the polynomial is the input sequence. The Hazy Research blog posts from 2020-2023 have a lot of information on this approach.
Great feedback. I have updated the post to use convolve instead. There is a huge difference between convolve and the naive method. On the other hand, convolve is slower than the FFT, as expected, but only for polynomials of degree greater than 3000 or so.
See diff: https://github.com/alrevuelta/myblog/commit/9fcc3dc84b1d9b66...
Another FFT trick applies well here: the middle pointwise multiplication can be folded into the last stage of the FFT and the first stage of the IFFT. This saves writing those two stages out to memory.
[1]: http://numbers.computation.free.fr/Constants/Algorithms/fft....