Suppose we have samples and the input dimension is as before, and consider . We can formulate that
where is a function such that . Then, we estimate as
so in this case, we can generally assume (though this is technically only an upper bound; linear independence of the functions should give equality handily). Contrast this with the Vandermonde case, where , which gives a matrix of rank . While these two numbers seem different, they are in fact comparable: score matching attempts to match , which gives the illusion of an extra dimension of freedom coming from the gradient. Therefore, to account for this extra freedom, the number of parameters should be scaled by the reciprocal of the dimension when comparing against .
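For the Vandermonde comparison, here is a quick illustrative check of the rank behavior. It assumes the standard monomial Vandermonde matrix built at n distinct sample locations (the exact construction above is not spelled out), and only verifies that with more columns than samples the rank saturates at the number of samples.

```python
# Illustrative only: rank of a (wide) monomial Vandermonde matrix at distinct nodes.
import numpy as np

n, p = 10, 25
x = np.linspace(-1.0, 1.0, n)              # n distinct sample locations
V = np.vander(x, N=p, increasing=True)     # columns 1, x, x^2, ..., x^{p-1}
print(np.linalg.matrix_rank(V))            # prints 10, i.e. min(n, p)
```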
We can also form , where is applied elementwise. Then, we wish to solve . Obviously, this can be done uniquely when . However, in the "overparameterized" case, there are clearly an infinite number of solutions by simple linear algebra. A traditional technique would be to solve
i.e., find the minimum-norm solution. This can be done by letting , where and give the orthogonal eigenvectors and their nonzero eigenvalues, respectively (we know that is symmetric positive semi-definite since it is the sum of SPSD matrices). Then we know that will be the optimal solution (Trefethen and Bau?).
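As a concrete illustration, here is a minimal numerical sketch of this minimum-norm construction. The feature matrix A, the targets y, and all other names are generic stand-ins rather than the notation above; the point is only that truncating the eigendecomposition of the SPSD Gram matrix to its nonzero eigenvalues reproduces the pseudoinverse (minimum-norm) solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                       # n samples, p parameters, with p > n
A = rng.standard_normal((n, p))      # stand-in for the feature/design matrix
y = rng.standard_normal(n)           # stand-in for the targets

# Eigendecomposition of the symmetric positive semi-definite Gram matrix A^T A.
G = A.T @ A
evals, evecs = np.linalg.eigh(G)
keep = evals > 1e-10 * evals.max()   # retain only the nonzero eigenvalues
V, lam = evecs[:, keep], evals[keep]

# Minimum-norm interpolant: w = V diag(1/lam) V^T A^T y, i.e. the pseudoinverse applied to y.
w_min_norm = V @ ((V.T @ (A.T @ y)) / lam)

# Sanity checks: it interpolates the data and agrees with numpy's SVD-based
# least-squares routine, which also returns the minimum-norm solution for
# underdetermined systems.
w_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(A @ w_min_norm, y), np.allclose(w_min_norm, w_lstsq))
```

In practice one would simply call np.linalg.lstsq or np.linalg.pinv, which compute the same minimum-norm solution via the SVD.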
However, while this minimum-norm problem has a closed-form optimal solution, the objective is not well motivated in this context beyond ensuring that does not have entries diverging to infinity. Perhaps a better choice would be to minimize the Kullback-Leibler divergence, . By definition,
While I think this is a convex function (the rightmost term is reminiscent of a log-sum-exp), it's not really tractable as an objective even when approximating the expectation with an average over the empirical distribution of . The integral over is not tractable in closed form. However, examining the role of this term suggests some regularization against higher-frequency . This intuition comes from the fact that, if is highly oscillatory and is nonzero, will often hold and thus we will be far from the minimum. In this sense, we approximate the KL as
For simplicity, imagine as a diagonal matrix with scaling with in some way. While will clearly have a minimizer at , this is physically meaningless due to the relaxation of the log-sum-exp term, i.e., it is not guaranteed to minimize any statistical divergence. However, it seems reasonable to return to the Fisher divergence score-matching setting to say
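As one hypothetical instance of the diagonal penalty described above (names and the exact penalty are assumptions, not the formulation in the notes), one can penalize coefficients of higher-frequency basis functions more heavily and still retain a closed-form minimizer:

```python
# Sketch: min_w ||A w - y||^2 + w^T D w, with D diagonal and growing with frequency.
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 100
A = rng.standard_normal((n, p))            # stand-in for the feature matrix
y = rng.standard_normal(n)                 # stand-in for the targets

freqs = np.arange(1, p + 1)                # assumed frequency index of each basis function
D = np.diag(freqs.astype(float) ** 2)      # penalty entries scale with (squared) frequency

# Closed-form minimizer of the regularized quadratic objective.
w_reg = np.linalg.solve(A.T @ A + D, A.T @ y)
```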
A spectral method for diffusion models
Generally, the neural networks for diffusion models are minimizing functions of the form
which approximates the loss
where is some density that we choose to weight the "importance" of different time points and is the measure of following the SDE
with . Suppose now that we consider with . Then,
where we abuse notation to say and . Finally, suppose we separate the random variables from time in our functions via Then,
However, consider that we know . Then, we should have .
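To make the pieces above concrete, here is a minimal sketch in which every modeling choice is an assumption rather than the exact setup of these notes: a VP-style forward process x_t = alpha(t) x_0 + sigma(t) eps (so the conditional score of x_t given x_0 is Gaussian and known in closed form), a monomial time basis psi_k(t) = t^k, and random Fourier spatial features phi(x). The score model is separated as s(x, t) = sum_k psi_k(t) W_k phi(x), and the loss is a standard time-weighted denoising score-matching objective, one common instantiation of the loss described above.

```python
import numpy as np

rng = np.random.default_rng(2)
d, K, F = 2, 4, 16                         # data dim, number of time basis fns, spatial features

# Assumed VP-style forward process: x_t = alpha(t) x_0 + sigma(t) eps, eps ~ N(0, I).
def alpha(t):
    return np.exp(-0.5 * t)

def sigma(t):
    return np.sqrt(1.0 - np.exp(-t))

# Assumed separation of time and space: s(x, t) = sum_k psi_k(t) * (W_k @ phi(x)).
proj = rng.standard_normal((d, F // 2))    # fixed random Fourier projection

def phi(x):                                # spatial features, shape (batch, F)
    return np.concatenate([np.cos(x @ proj), np.sin(x @ proj)], axis=-1)

def psi(t, k):                             # time basis, here simple monomials t^k
    return t ** k

W = 0.01 * rng.standard_normal((K, d, F))  # learnable coefficients of the separated model

def score_model(x, t):
    basis = np.stack([psi(t, k) for k in range(K)])        # (K, batch)
    return np.einsum("kb,kdf,bf->bd", basis, W, phi(x))    # (batch, d)

def dsm_loss(x0, lam=lambda t: np.ones_like(t)):
    """Monte Carlo estimate of E[ lam(t) || s(x_t, t) - grad log p(x_t | x_0) ||^2 ]."""
    b = x0.shape[0]
    t = rng.uniform(1e-3, 1.0, size=b)
    eps = rng.standard_normal((b, d))
    xt = alpha(t)[:, None] * x0 + sigma(t)[:, None] * eps
    target = -eps / sigma(t)[:, None]      # closed-form Gaussian conditional score
    resid = score_model(xt, t) - target
    return np.mean(lam(t) * np.sum(resid ** 2, axis=-1))

x0 = rng.standard_normal((128, d))         # stand-in for data samples
print(dsm_loss(x0))
```

Because this model is linear in W, the loss above is quadratic in W, so for a fixed time basis and fixed spatial features it could in principle be minimized by linear algebra alone, which is what makes a spectral treatment of the diffusion objective plausible.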