Protip here: do it by hand. When you start getting tired of typing LaTeX, you can switch to an IDE and let GitHub Copilot complete it. The completions will mostly be incorrect, but going in and fixing its mistakes still saves a bunch of time. For example:

```
The important thing is that each derivative has to be summed over every logit that the element feeds into.

logits = h@W+b

h = [[h11 h12 h13],
     [h21 h22 h23]]

W = [[w11 w12],
     [w21 w22],
     [w31 w32]]

b = [b1 b2], and

Logit11 = h11*w11 + h12*w21 + h13*w31 + b1    (Eq. 1)

Logit12 = h11*w12 + h12*w22 + h13*w32 + b2    (Eq. 2)

Logit21 = h21*w11 + h22*w21 + h23*w31 + b1    (Eq. 3)

Logit22 = h21*w12 + h22*w22 + h23*w32 + b2    (Eq. 4)

DL/Dh11 = DL/DLogit11 * DLogit11/Dh11 + DL/DLogit12 * DLogit12/Dh11 = DL/DLogit11 * w11 + DL/DLogit12 * w12

DL/Dh12 = DL/DLogit11 * DLogit11/Dh12 + DL/DLogit12 * DLogit12/Dh12 = DL/DLogit11 * w21 + DL/DLogit12 * w22

DL/Dh13 = DL/DLogit11 * DLogit11/Dh13 + DL/DLogit12 * DLogit12/Dh13 = DL/DLogit11 * w31 + DL/DLogit12 * w32

DL/Dh21 = DL/DLogit21 * DLogit21/Dh21 + DL/DLogit22 * DLogit22/Dh21 = DL/DLogit21 * w11 + DL/DLogit22 * w12

DL/Dh22 = DL/DLogit21 * DLogit21/Dh22 + DL/DLogit22 * DLogit22/Dh22 = DL/DLogit21 * w21 + DL/DLogit22 * w22

DL/Dh23 = DL/DLogit21 * DLogit21/Dh23 + DL/DLogit22 * DLogit22/Dh23 = DL/DLogit21 * w31 + DL/DLogit22 * w32

DL/Dh = [[DL/Dh11 DL/Dh12 DL/Dh13],
         [DL/Dh21 DL/Dh22 DL/Dh23]]

      = [[DL/DLogit11 DL/DLogit12],  @  [[w11 w21 w31],
         [DL/DLogit21 DL/DLogit22]]      [w12 w22 w32]]

      = DL/DLogit @ W^T

This is the final gradient. Note that it is a single matrix multiplication, i.e. a projection of the logit gradient onto the weight layer.

Now let's compute DL/DW.

DL/DW = DL/DLogit * DLogit/DW (summed over every logit that uses the weight)

DL/DW11 = DL/DLogit11 * DLogit11/DW11 + DL/DLogit21 * DLogit21/DW11 = DL/DLogit11 * h11 + DL/DLogit21 * h21

DL/DW12 = DL/DLogit12 * DLogit12/DW12 + DL/DLogit22 * DLogit22/DW12 = DL/DLogit12 * h11 + DL/DLogit22 * h21

DL/DW21 = DL/DLogit11 * DLogit11/DW21 + DL/DLogit21 * DLogit21/DW21 = DL/DLogit11 * h12 + DL/DLogit21 * h22

DL/DW22 = DL/DLogit12 * DLogit12/DW22 + DL/DLogit22 * DLogit22/DW22 = DL/DLogit12 * h12 + DL/DLogit22 * h22

DL/DW31 = DL/DLogit11 * DLogit11/DW31 + DL/DLogit21 * DLogit21/DW31 = DL/DLogit11 * h13 + DL/DLogit21 * h23

DL/DW32 = DL/DLogit12 * DLogit12/DW32 + DL/DLogit22 * DLogit22/DW32 = DL/DLogit12 * h13 + DL/DLogit22 * h23

DL/DW = [[h11 h21],      [[DL/DLogit11 DL/DLogit12],
         [h12 h22],  @    [DL/DLogit21 DL/DLogit22]]
         [h13 h23]]

      = h^T @ DL/DLogit

This is the final gradient. Note that it is again a single matrix multiplication, i.e. a projection of the logit gradient onto the hidden layer.

```
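A quick way to sanity-check the two results above (DL/Dh = DL/DLogit @ W^T and DL/DW = h^T @ DL/DLogit) is to compare them with finite differences. The sketch below is not part of the original derivation. It uses plain Python lists, made-up values for h, W and b, and a hypothetical loss L = 0.5 * sum(logits^2), picked only because its gradient with respect to the logits is the logits themselves. The helper names (`matmul`, `transpose`, `numeric_grad`, `max_abs_diff`) are assumptions for illustration.

```python
# Check dL/dh = dL/dlogits @ W^T and dL/dW = h^T @ dL/dlogits numerically.
# Plain Python lists only; all values and helper names are illustrative.

def matmul(A, B):
    """Matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Swap rows and columns of a nested list."""
    return [list(col) for col in zip(*A)]

def forward(h, W, b):
    """logits = h @ W + b, with b added to every row."""
    return [[v + bj for v, bj in zip(row, b)] for row in matmul(h, W)]

def loss(h, W, b):
    """Hypothetical scalar loss: 0.5 * sum of squared logits."""
    return 0.5 * sum(v * v for row in forward(h, W, b) for v in row)

def numeric_grad(f, M, eps=1e-6):
    """Central finite differences of f() with respect to each entry of M.

    M is perturbed in place and restored, so f must read M when called.
    """
    G = [[0.0] * len(M[0]) for _ in M]
    for i in range(len(M)):
        for j in range(len(M[0])):
            old = M[i][j]
            M[i][j] = old + eps
            up = f()
            M[i][j] = old - eps
            down = f()
            M[i][j] = old
            G[i][j] = (up - down) / (2 * eps)
    return G

def max_abs_diff(A, B):
    """Largest elementwise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(A, B) for x, y in zip(ra, rb))

h = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]   # 2x3, same shape as in the derivation
W = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]   # 3x2
b = [0.05, -0.1]

# For L = 0.5 * sum(logits**2), dL/dlogits equals logits.
dlogits = forward(h, W, b)

dh = matmul(dlogits, transpose(W))   # (2x2) @ (2x3) -> 2x3
dW = matmul(transpose(h), dlogits)   # (3x2) @ (2x2) -> 3x2

dh_num = numeric_grad(lambda: loss(h, W, b), h)
dW_num = numeric_grad(lambda: loss(h, W, b), W)

print("max |dh - dh_num| =", max_abs_diff(dh, dh_num))
print("max |dW - dW_num| =", max_abs_diff(dW, dW_num))
```

Both printed differences should be tiny (limited only by floating-point rounding). If either one is large, the usual culprit is a missing transpose or the two factors being multiplied in the wrong order.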