Something that always bothers me about these explanations is that they usually forget about numerical errors. You can't just abstract away multiplying coefficients as "constant time". You may as well abstract away the entire multiplication to begin with! If you take numerical precision into account, it's closer to O(n (log n)^3) [1].
This is why I keep coming back to HN. You read an interesting article, a little proud you understand half of it, read a question that already makes you feel like the stupidest person in the room, then read a clarifying answer by someone who probably got a Knuth reward check for correcting an erratum in The Art of Computer Programming.
For FFT with floating-point numbers, another paper by Arnold Schönhage from 1982 [1] already gives a bound of Psi(n l) operations, where n is the number of coefficients and l is the desired precision (typically 53 bits for double precision). Psi(m) is the time to multiply two integers with m digits, which has been known since 2021 to be O(m log m) [2]. So the current bound is O(nl log(nl)).
It would be great if we could minimize the multiplication errors, and perhaps do away with the errors altogether, by utilizing the quaternion-based operations described in the articles [1], [2], [3].
[1] One-Dimensional Quaternion Discrete Fourier Transform and an Approach to Its Fast Computation:
But if the coefficients are integers, you can use NTT with a big enough modulus and get exact results and a boost (esp. in hardware) in multiplication time.
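For the curious, a rough Python sketch of the idea (not how an optimized or hardware NTT is laid out): the same divide-and-conquer as the FFT, but over the integers modulo an NTT-friendly prime, so the convolution is exact. The prime 998244353 with primitive root 3 is a standard choice; the function names are just illustrative.

    MOD = 998244353          # 119 * 2^23 + 1, a common NTT-friendly prime
    ROOT = 3                 # primitive root mod MOD

    def ntt(a, invert=False):
        """Cooley-Tukey recursion over the integers mod MOD instead of the complex numbers."""
        n = len(a)           # n must be a power of two
        if n == 1:
            return list(a)
        w_n = pow(ROOT, (MOD - 1) // n, MOD)      # principal n-th root of unity mod MOD
        if invert:
            w_n = pow(w_n, MOD - 2, MOD)          # its inverse, for the inverse transform
        even, odd = ntt(a[0::2], invert), ntt(a[1::2], invert)
        out, w = [0] * n, 1
        for k in range(n // 2):
            t = w * odd[k] % MOD
            out[k] = (even[k] + t) % MOD
            out[k + n // 2] = (even[k] - t) % MOD
            w = w * w_n % MOD
        return out

    def multiply(p, q):
        """Exact convolution of integer coefficient lists p and q (mod MOD)."""
        n = 1
        while n < len(p) + len(q) - 1:
            n *= 2
        fp = ntt(p + [0] * (n - len(p)))
        fq = ntt(q + [0] * (n - len(q)))
        prod = ntt([x * y % MOD for x, y in zip(fp, fq)], invert=True)
        inv_n = pow(n, MOD - 2, MOD)              # undo the factor of n from the inverse transform
        return [x * inv_n % MOD for x in prod][: len(p) + len(q) - 1]

    print(multiply([2, 1], [3, 1]))               # [6, 5, 1], i.e. (x+2)(x+3) = x^2 + 5x + 6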
Is there a way to find groups with easy generators/primitive roots? I imagine you'd want a small root of unity, but also be able to choose a bigger modulus for extra big multiplications. Also, afaik it's discrete-logarithm level of difficulty to even find a generator if you choose a random modulus, though I don't know if it's easier to find a modulus after you choose the generator.
since the groups you're looking at are of size log(n), you can do a lot of work here without issue. As long as that setup is asymptotically less work, it doesn't affect the runtime.
> though I don't know if it's easier to find a modulus after you choose the generator.
Sure, pick a large prime p. Double it and add 1 (and call it n). If n is still prime, then you know the prime factorization of n-1. Pick your candidate generator, and check whether raising it to the power p gives 1, or whether squaring it gives 1. If neither does, it's a generator of the multiplicative group mod n.
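A small sketch of that recipe in Python (sympy's primality helpers are used purely for convenience; the function name is mine):

    from sympy import isprime, randprime

    def safe_prime_and_generator(bits=32):
        """Find n = 2p + 1 with both p and n prime, then a generator of (Z/nZ)*."""
        while True:
            p = randprime(2**(bits - 1), 2**bits)
            n = 2 * p + 1
            if not isprime(n):
                continue
            # n - 1 = 2p, so g is a generator iff g^2 != 1 and g^p != 1 (mod n)
            for g in range(2, n):
                if pow(g, 2, n) != 1 and pow(g, p, n) != 1:
                    return n, g

    n, g = safe_prime_and_generator()
    print(f"modulus {n}, generator {g} of order {n - 1}")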
You can use this method to multiply long numbers together. The key insight you need is that polynomial multiplication is the same as normal long-number multiplication without doing the carries.
So suppose you have a 1000-digit number. You take each digit and use it as a coefficient of a 1000-element polynomial. You can then multiply these polynomials together using the FFT method described in the article. To convert the result back to a number you need to do the carries: if a coefficient is 10 or more, you carry the excess into the next digit. You then read off the digits to get the final number.
That's the basic idea. There is some subtlety I've glossed over about the precision needed for the carries and about making sure that rounding the FFT result to the nearest integer is correct. That is the way big-number multiplication is done in GMP, which is the leading library for this sort of thing.
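A rough sketch of that pipeline in Python/NumPy, ignoring the precision caveats just mentioned (the function name is only illustrative):

    import numpy as np

    def multiply(a: int, b: int) -> int:
        """Multiply two non-negative integers via FFT on their digit lists."""
        da = [int(d) for d in str(a)][::-1]   # least significant digit first
        db = [int(d) for d in str(b)][::-1]
        size = len(da) + len(db)              # the product has at most this many digits
        fa, fb = np.fft.rfft(da, size), np.fft.rfft(db, size)
        coeffs = np.rint(np.fft.irfft(fa * fb, size))  # convolution of the digit lists
        carry, result = 0, 0
        for i, c in enumerate(coeffs):        # do the carries and rebuild the integer
            carry, digit = divmod(int(c) + carry, 10)
            result += digit * 10**i
        return result

    assert multiply(983, 124) == 983 * 124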
Thanks for the comment. That makes sense, since, as you say, a base-10 number can be expressed as a polynomial evaluated at x=10. E.g. 983 = 9x^2 + 8x + 3, aka [9, 8, 3]. Wondering how big the number has to be for this to make sense, and where this is used in practice.
FFT is used extensively in curve-based zero-knowledge cryptography, not for the purpose of splitting a single large number into smaller ones, but to interpolate and evaluate very large polynomials (for example of degree 2^27).
All of this happens in the field of an elliptic curve, so the complexity reduction is greatly appreciated.
GMP has lots of other methods in between schoolbook multiplication and FFT multiplication. A nice one is Karatsuba multiplication, which is very easy to understand and delivers O(n^1.58) rather than O(n^2) performance. Python uses this method for multiplying large numbers together.
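If you want to see how little code Karatsuba takes, here is a sketch in plain Python (splitting in base 10 for readability rather than speed):

    def karatsuba(x: int, y: int) -> int:
        """Three recursive multiplications on half-size numbers instead of four."""
        if x < 10 or y < 10:                  # base case: single digits
            return x * y
        m = max(len(str(x)), len(str(y))) // 2
        high_x, low_x = divmod(x, 10**m)
        high_y, low_y = divmod(y, 10**m)
        z0 = karatsuba(low_x, low_y)
        z2 = karatsuba(high_x, high_y)
        z1 = karatsuba(low_x + high_x, low_y + high_y) - z0 - z2
        return z2 * 10**(2 * m) + z1 * 10**m + z0

    assert karatsuba(12345678, 87654321) == 12345678 * 87654321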
PS: if you're interested in multiplying "ludicrously large numbers", Harvey and van der Hoeven had a nice breakthrough and got multiplication down to "FFT speed" (n*log(n)), see
Who was the first person to propose FFTs for faster polynomial multiplication?
Got curious about this recently. I’m not great at citation tracing, but did make it back to this 1995 paper by David Eppstein [0] where he uses it to efficiently solve Subset Sum after an incremental update. Surely Knuth’s TAOCP had it even earlier?
The fact that FFT polynomial multiplication also lets you solve Exact Subset Sum with Repetition in sub-exponential time came as a real shock to me. [1] Crucially, this algo is O(N log N) where N = the maximum element, not N = the set size, so it isn’t a P ≠ NP counterexample or anything.
Pollard [1], Nicholson [2], and Schonhage-Strassen [3] seem to have come up with it independently around the same time, using different approaches. Strassen is said to have discovered the Pollard approach in 1968 but there is no (written) record of it.
It should also be noted that, while it was not exactly the birth of the FFT, Cooley and Tukey's 1965 paper [4] was what kickstarted research on the FFT and its applications. The multiplication papers above came just a few years after that.
Interesting. I've just implemented an algorithm (matrix profile) that makes use of the FFT to compute a big set of dot products of time-series subsequences, where the length n of the time series can be in the hundreds of millions. The fast convolution computation using the FFT reduces the computation time from O(n^2) to O(n log n), with awesome speed gains at this scale. Throw in a GPU and it gets even faster, like processing 10 million data points in 0.1 seconds on a laptop.
The key "trick" of the operation seems to be this revelation:
> In other words, performing the convolution of two signals in time domain is equivalent to multiplying them in the frequency domain.
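To make that concrete, here is a rough NumPy sketch of the sliding-dot-product trick (the function name is mine, not from any particular library): convolving the series with the reversed query, via pointwise multiplication in the frequency domain, gives the dot product of the query against every window at once.

    import numpy as np

    def sliding_dot_products(series, query):
        """Dot product of `query` against every length-len(query) window of `series`."""
        n, m = len(series), len(query)
        # convolution in the time domain = pointwise product of the transforms
        conv = np.fft.irfft(np.fft.rfft(series, n + m) *
                            np.fft.rfft(query[::-1], n + m), n + m)
        return conv[m - 1 : n]

    series, query = np.random.rand(1000), np.random.rand(50)
    naive = np.array([series[i:i + 50] @ query for i in range(1000 - 50 + 1)])
    assert np.allclose(sliding_dot_products(series, query), naive)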
Great article, as it breaks down a complex idea into much smaller steps that even my math-challenged mind can somehow grasp, but did I miss a step? Or is it left as an exercise for the reader to look up? I was already stretching my math ability up to that point, but it felt a little bit like "and then, draw the rest of the F-ing owl" to me. Is it just me?
Thanks! In case it helps:
* Multiplying two polynomials as taught in school is in reality a convolution.
* "performing the convolution of two signals in the time domain is equivalent to multiplying them in the frequency domain"
* FFT allows us to convert from time domain to frequency domain
* We use FFT to convert our polynomial to frequency domain.
* Once we are in the frequency domain, we just need to multiply point by point, which is faster than a convolution. An inverse FFT then converts the product back to coefficients.
Does it clarify the missing step? Happy to update the post with what's missing.
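If it helps, those bullets are only a few lines of NumPy; a rough sketch (rounding at the end because the FFT works in floating point):

    import numpy as np

    def poly_mult(p, q):
        """Multiply coefficient lists (lowest degree first) via FFT."""
        n = len(p) + len(q) - 1                          # number of coefficients in the product
        fp, fq = np.fft.rfft(p, n), np.fft.rfft(q, n)    # to the frequency domain
        return np.rint(np.fft.irfft(fp * fq, n)).astype(int)  # multiply, transform back, round

    print(poly_mult([2, 1], [3, 1]))                     # [6 5 1]  ->  (x+2)(x+3) = x^2 + 5x + 6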
Yep, it definitely helps! What I'm struggling to understand is why "performing the convolution of two signals in the time domain is equivalent to multiplying them in the frequency domain" (I'm going to google/gpt it, this is more of an exercise left for me to dive into, your post is perfect as it is)
So integer factoring is a discrete deconvolution? I wonder if juxtaposing the FFT representation (inverse pointwise multiplication) and the tableaux (regular long multiplication / carry add) could break the symmetry and get enough information for a fast algorithm?
Sure, the naive method of multiplying polynomials is slow in the degree of the polynomial. But when does one have to deal with two degree 100 polynomials?
My impression is that this sort of method isn't used by computer algebra systems for this reason.
Computer algebra systems (such as Chebfun in MATLAB) convert arbitrary functions to 100-degree+ polynomials to make it easier to find roots, optima, etc.
When I wanted to reverse-engineer some CRC checksum parameters for larger files, I made a program [1] that converts the files into million-degree GF(2) polynomials and calculates their GCD, which is only possible in reasonable time with FFT-based multiplication.
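For anyone curious what the GF(2) arithmetic looks like, here is a toy Python sketch with polynomials stored as integer bit masks. (The speed at million-degree sizes comes from FFT-based multiplication and half-GCD style algorithms, which this plain Euclidean version doesn't show.)

    def gf2_mul(a: int, b: int) -> int:
        """Carry-less multiplication: polynomials over GF(2) as bit masks."""
        result = 0
        while b:
            if b & 1:
                result ^= a
            a, b = a << 1, b >> 1
        return result

    def gf2_mod(a: int, b: int) -> int:
        """Remainder of a divided by b in GF(2)[x]: subtraction is XOR."""
        while a.bit_length() >= b.bit_length():
            a ^= b << (a.bit_length() - b.bit_length())
        return a

    def gf2_gcd(a: int, b: int) -> int:
        while b:
            a, b = b, gf2_mod(a, b)
        return a

    p = 0b1011                                   # x^3 + x + 1, irreducible over GF(2)
    a, b = gf2_mul(p, 0b11), gf2_mul(p, 0b111)   # p*(x+1) and p*(x^2+x+1)
    print(bin(gf2_gcd(a, b)))                    # 0b1011: the common factor recovered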
This convolutional view and fast GPU kernels for FFT were used in some pre-Mamba state space models for long sequence modeling where the polynomial is the input sequence. The Hazy Research blog posts from 2020-2023 have a lot of information on this approach.
Great feedback. I have updated the post to use convolve instead. There is a huge difference between convolve and the naive method. On the other hand, convolve is slower than the FFT, as expected, but only for polynomials of degree greater than 3000 or so.
See diff: https://github.com/alrevuelta/myblog/commit/9fcc3dc84b1d9b66...
Another FFT trick applies well here: the middle pointwise multiplication can be folded into the last stage of the FFT and the first stage of the IFFT. This saves writing those two stages out to memory.
[1]: http://numbers.computation.free.fr/Constants/Algorithms/fft....