Polynomial Score Matching

Suppose we have data $x_1, \dots, x_n \sim p$ and you want to estimate $p$ from the data. One way to do this is via score matching, i.e., estimate a function representing the gradient of the log PDF, $\nabla_x \log p(x)$ (obligatory note that "score" in statistics usually refers to something of the form $\nabla_\theta \log p_\theta(x)$, which is not the same). Unless otherwise specified, consider $\nabla$ to indicate the gradient of a function in $x$ and $\Delta$ to be the Laplacian. To formulate this, we use the Fisher divergence as in Hyvärinen 05 and formulate it as a loss function of the parameters $\theta$ of some $q_\theta$, an approximation to the PDF $p$.
$$ J(\theta) = \tfrac{1}{2}\,\mathbb{E}_{x\sim p}\big[\|\nabla \log q_\theta(x) - \nabla \log p(x)\|^2\big] = \mathbb{E}_{x\sim p}\big[\tfrac{1}{2}\|\nabla \log q_\theta(x)\|^2 + \Delta \log q_\theta(x)\big] + C, $$
where $C$ does not depend on $\theta$ and the last step comes from integration by parts, assuming the products $p(x)\,\partial_k \log q_\theta(x)$ are zero at the boundaries of the domain (i.e., they vanish as $x \to \partial\Omega$). We take $\ell(\theta) = \mathbb{E}_{x\sim p}\big[\tfrac{1}{2}\|\nabla \log q_\theta(x)\|^2 + \Delta \log q_\theta(x)\big]$ as a slightly "nicer" loss function, which has the same minimizer as $J$. Notably, this loss function doesn't require that $q_\theta$ is even a proper density; it just requires the ability to integrate with respect to $p$ (e.g., from samples) and evaluate derivatives of $\log q_\theta$, which is what we try to model.
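
Spelling out the integration-by-parts step coordinate-wise (a standard computation, assuming enough smoothness and the decay condition above):
$$ \mathbb{E}_{x\sim p}\big[\partial_k \log q_\theta(x)\,\partial_k \log p(x)\big] = \int_\Omega \partial_k \log q_\theta(x)\,\partial_k p(x)\,\mathrm{d}x = -\int_\Omega \partial_k^2 \log q_\theta(x)\,p(x)\,\mathrm{d}x = -\,\mathbb{E}_{x\sim p}\big[\partial_k^2 \log q_\theta(x)\big], $$
and summing over $k$ turns the cross term of $J$ into the Laplacian term of $\ell$.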

It should be observed that, often in the score-matching literature, researchers will parameterize the gradient of $\log q$, not the function itself; this can produce a nonphysical solution, since the learned function won't necessarily even be of the form $\nabla \log q$ for any particular $q$. Now, we parameterize $\log q_\theta(x) = \sum_{i=1}^m \theta_i \phi_i(x)$, where the $\phi_i$ are scalar-valued functions (assume multivariate polynomial terms); without loss of generality, assume that $p(x)\,\nabla\phi_i(x) \to 0$ as $x \to \partial\Omega$, the boundary of our domain. Then, we know that
$$ \nabla \log q_\theta(x) = \sum_{i=1}^m \theta_i\,\nabla\phi_i(x), \qquad \Delta \log q_\theta(x) = \sum_{i=1}^m \theta_i\,\Delta\phi_i(x). $$
Therefore, we get that
$$ \ell(\theta) = \mathbb{E}_{x\sim p}\Big[\tfrac{1}{2}\sum_{i,j}\theta_i\theta_j\,\nabla\phi_i(x)^\top\nabla\phi_j(x) + \sum_i \theta_i\,\Delta\phi_i(x)\Big]. $$
If we then set $A_{ij} = \mathbb{E}_{x\sim p}\big[\nabla\phi_i(x)^\top\nabla\phi_j(x)\big]$ and $b_i = \mathbb{E}_{x\sim p}\big[\Delta\phi_i(x)\big]$, then our loss ends up becoming quadratic in $\theta$ as $\ell(\theta) = \tfrac{1}{2}\theta^\top A\theta + b^\top\theta$, which is known to have minimizer $\theta^* = -A^{-1}b$ (when $A$ is invertible). It's particularly interesting that $A \in \mathbb{R}^{m\times m}$, i.e., it is square and finite-dimensional, even though our expectation requires an "infinite" number of evaluations. Contrast this with a Vandermonde-like matrix, which will grow with the number of samples. The next interesting point is to consider the elements of $A$ and $b$.
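
As a concrete illustration (the 1-D monomial basis $\phi_j(x) = x^j$, the mixture data, and the basis size $m = 6$ are assumptions made only for this sketch), the quadratic structure means the entire fit is: estimate $A$ and $b$ by Monte Carlo, then solve one $m \times m$ linear system:

```python
# Sketch: polynomial score matching in 1-D with a monomial basis phi_j(x) = x^j.
# The loss is l(theta) = 0.5 * theta^T A theta + b^T theta, so theta* = -A^{-1} b,
# with A_ij = E[phi_i'(x) phi_j'(x)] and b_i = E[phi_i''(x)].
import numpy as np

rng = np.random.default_rng(0)

# toy data: a two-component Gaussian mixture (illustrative only)
n = 50_000
x = np.where(rng.random(n) < 0.5, rng.normal(-2, 1, n), rng.normal(2, 1, n))

m = 6
j = np.arange(1, m + 1)

dphi = j * x[:, None] ** (j - 1)                           # phi_j'(x)  = j x^(j-1)
ddphi = j * (j - 1) * x[:, None] ** np.maximum(j - 2, 0)   # phi_j''(x) = j(j-1) x^(j-2)

A_hat = dphi.T @ dphi / n     # Monte Carlo estimate of A
b_hat = ddphi.mean(axis=0)    # Monte Carlo estimate of b
theta = -np.linalg.solve(A_hat, b_hat)

# learned score: d/dx log q_theta(x) = sum_j theta_j * j * x^(j-1)
score = lambda xs: (j * np.asarray(xs)[:, None] ** (j - 1)) @ theta
print(theta)
```

Note that $\hat{A}$ stays $m \times m$ no matter how many samples are used; only the quality of the expectations improves.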

An example in the infinite-sample limit

Now suppose $p = \mathcal{N}(0, I_d)$ and $\phi_\alpha(x) = \prod_{k=1}^d He_{\alpha_k}(x_k)$ for some multi-index set $\alpha \in \mathcal{A} \subset \mathbb{N}^d$, i.e., we want to estimate the coefficients $\theta_\alpha$ for $\log q_\theta(x) = \sum_{\alpha\in\mathcal{A}} \theta_\alpha \phi_\alpha(x)$ to approximate $\log p$. We know that $He_n''(x) = n(n-1)\,He_{n-2}(x)$ and that $\mathbb{E}_{z\sim\mathcal{N}(0,1)}[He_n(z)] = 0$ for $n \ge 1$, since all probabilists' Hermite polynomials of positive degree are orthogonal to $He_0 \equiv 1$, so
$$ b_\alpha = \mathbb{E}_{x\sim p}\big[\Delta\phi_\alpha(x)\big] = \sum_{k=1}^d \alpha_k(\alpha_k - 1)\,\mathbb{E}_{x\sim p}\big[\phi_{\alpha - 2e_k}(x)\big] = \begin{cases} 2, & \alpha = 2e_k \\ 0, & \text{otherwise} \end{cases} $$
for some $k$, where $e_k$ is the elementary vector with all zeros except for a $1$ in entry $k$, and by convention $\phi_{\alpha - 2e_k} \equiv 0$ for $\alpha_k < 2$. Similarly,
$$ A_{\alpha\beta} = \mathbb{E}_{x\sim p}\big[\nabla\phi_\alpha(x)^\top\nabla\phi_\beta(x)\big] = \sum_{k=1}^d \alpha_k\beta_k\,\mathbb{E}\big[He_{\alpha_k-1}(x_k)He_{\beta_k-1}(x_k)\big]\prod_{j\neq k}\mathbb{E}\big[He_{\alpha_j}(x_j)He_{\beta_j}(x_j)\big] = \delta_{\alpha\beta}\,\Big(\sum_{k=1}^d \alpha_k\Big)\prod_{j=1}^d \alpha_j!, $$
implying that $A$ is precisely diagonal. Then, since we have $\theta^* = -A^{-1}b$, we get that $\theta^*_\alpha \neq 0$ only when $\alpha = 2e_k$ for some $k$. Now suppose $\alpha$ is of this form. We must have that
$$ \theta^*_{2e_k} = -\frac{b_{2e_k}}{A_{2e_k,2e_k}} = -\frac{2}{2 \cdot 2!} = -\frac{1}{2}, $$
which implies that $\log q_{\theta^*}(x) = -\tfrac{1}{2}\sum_{k=1}^d He_2(x_k) = -\tfrac{1}{2}\|x\|^2 + \tfrac{d}{2}$ (notably, there is some constant that normalizes this, which we can't find via score matching). This is precisely giving that $\nabla \log q_{\theta^*}(x) = -x = \nabla \log p(x)$, i.e., the estimated score will converge exactly.
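
As a quick numerical sanity check of this (a sketch assuming a 1-D standard normal and a truncated probabilists' Hermite basis, via `numpy.polynomial.hermite_e`):

```python
# Check: with p = N(0,1) and phi_k = He_k (probabilists' Hermite), the score-matching
# solution should put weight -1/2 on He_2 and approximately zero on the other terms.
import numpy as np
from numpy.polynomial import hermite_e as He

rng = np.random.default_rng(1)
x = rng.standard_normal(200_000)

degrees = range(1, 6)            # He_1, ..., He_5 (He_0 is constant, so it is omitted)
dphi, ddphi = [], []
for k in degrees:
    c = np.zeros(k + 1); c[k] = 1.0                   # coefficients of He_k
    dphi.append(He.hermeval(x, He.hermeder(c, 1)))    # He_k'
    ddphi.append(He.hermeval(x, He.hermeder(c, 2)))   # He_k''
dphi, ddphi = np.stack(dphi, 1), np.stack(ddphi, 1)

A_hat = dphi.T @ dphi / len(x)
b_hat = ddphi.mean(axis=0)
theta = -np.linalg.solve(A_hat, b_hat)
print(dict(zip(degrees, np.round(theta, 3))))         # expect roughly -0.5 on degree 2
```

With enough samples the printed coefficients are approximately $-0.5$ on $He_2$ and near zero elsewhere, matching the closed-form $\theta^*$.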

Finite sample case and overparameterization

Suppose we have samples $x_1, \dots, x_n \in \mathbb{R}^d$ and the input dimension is $d$ as before, and consider $m$ basis functions $\phi_1, \dots, \phi_m$. We can formulate that
$$ \nabla \log q_\theta(x_i) = G(x_i)\,\theta, $$
where $G : \mathbb{R}^d \to \mathbb{R}^{d\times m}$ is a function such that $G(x)_{kj} = \partial_k \phi_j(x)$. Then, we estimate $A$ as
$$ \hat{A} = \frac{1}{n}\sum_{i=1}^n G(x_i)^\top G(x_i), $$
so in this case, we can generally assume $\operatorname{rank}(\hat{A}) = \min(m, nd)$ (though this is technically only an upper bound, linear independence of the functions should give this handily). Contrast this to the Vandermonde case, where $V_{ij} = \phi_j(x_i)$, which gives a matrix of rank $\min(m, n)$. While these two numbers seem different, there is actually some parity to them; score matching attempts to match the $d$-dimensional gradient $\nabla \log p(x)$, which gives the illusion of an extra dimension of freedom from the gradient. Therefore, to adjust for this extra freedom, the number of parameters should be scaled by the reciprocal of the dimension when comparing against $n$.
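
A small sketch of the rank comparison (the $d = 2$ monomial basis, the degree cutoff, and the generic Gaussian samples are all illustrative assumptions); since $\hat{A} = \frac{1}{n} G_{\mathrm{stack}}^\top G_{\mathrm{stack}}$ for the $nd \times m$ stacked gradient matrix, $\operatorname{rank}(\hat{A}) = \operatorname{rank}(G_{\mathrm{stack}})$:

```python
# Sketch: rank of the stacked-gradient matrix (n*d rows) vs. a Vandermonde-style
# matrix (n rows) for a multivariate monomial basis. rank(A_hat) = rank(G_stack).
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, d, max_deg = 4, 2, 4
x = rng.standard_normal((n, d))

# multi-indices of total degree 1..max_deg (illustrative basis choice)
alphas = [a for deg in range(1, max_deg + 1)
          for a in itertools.product(range(deg + 1), repeat=d) if sum(a) == deg]
m = len(alphas)

def grad_phi(xi):
    """Gradient of each monomial prod_k x_k^a_k at one point xi; shape (d, m)."""
    G = np.zeros((d, m))
    for col, a in enumerate(alphas):
        for k in range(d):
            if a[k] > 0:
                ak = np.array(a); ak[k] -= 1
                G[k, col] = a[k] * np.prod(xi ** ak)
    return G

G_stack = np.vstack([grad_phi(xi) for xi in x])                            # (n*d, m)
V = np.array([[np.prod(xi ** np.array(a)) for a in alphas] for xi in x])   # (n, m)

# typically prints m = 14, rank(G_stack) = min(m, n*d) = 8, rank(V) = min(m, n) = 4
print(m, np.linalg.matrix_rank(G_stack), np.linalg.matrix_rank(V))
```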

We can also form $\hat{b} = \frac{1}{n}\sum_{i=1}^n \Delta\phi(x_i)$, where $\Delta$ is applied elementwise to $\phi = (\phi_1, \dots, \phi_m)^\top$. Then, we wish to solve $\hat{A}\theta = -\hat{b}$. Obviously, this can be done uniquely when $\operatorname{rank}(\hat{A}) = m$. However, in the "overparameterized" case $m > nd$, there are clearly an infinite number of solutions by simple linear algebra. A traditional technique would be to solve
$$ \min_\theta \; \|\theta\|_2 \quad \text{subject to} \quad \hat{A}\theta = -\hat{b}, $$
i.e., find the minimum-norm solution. This can be done via the pseudoinverse, allowing $\hat{A}^+ = V\Lambda^{-1}V^\top$, where $V$ and $\Lambda$ give the orthogonal eigenvectors and their nonzero eigenvalues, respectively (we know that $\hat{A}$ is symmetric positive semi-definite since it is the sum of the SPSD matrices $G(x_i)^\top G(x_i)$). Then we know that $\theta = -\hat{A}^+\hat{b}$ will be the optimal solution (Trefethen and Bau?).
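
A minimal sketch of that minimum-norm solve, assuming $\hat{A}$ and $\hat{b}$ have already been assembled as above (the tolerance for discarding zero eigenvalues is an arbitrary choice):

```python
# Sketch: minimum-norm theta solving A_hat @ theta = -b_hat when A_hat is rank-deficient,
# via the eigendecomposition of the SPSD matrix A_hat (equivalently np.linalg.pinv / lstsq).
import numpy as np

def min_norm_solution(A_hat, b_hat, tol=1e-10):
    eigvals, eigvecs = np.linalg.eigh(A_hat)       # A_hat is symmetric PSD
    keep = eigvals > tol * eigvals.max()           # keep the numerically nonzero eigenvalues
    V, lam = eigvecs[:, keep], eigvals[keep]
    return -V @ ((V.T @ b_hat) / lam)              # theta = -V diag(1/lam) V^T b_hat

# usage: theta = min_norm_solution(A_hat, b_hat)
```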

However, while this has an optimal solution in closed form, the objective is not at all well motivated in this context, besides that it will ensure $\theta$ doesn't have entries diverging to infinity. Perhaps a better solution would be to minimize the Kullback-Leibler divergence, $D_{\mathrm{KL}}(p \,\|\, q_\theta)$, normalizing $q_\theta \propto \exp(\theta^\top\phi)$ inside the divergence. By definition,
$$ D_{\mathrm{KL}}(p \,\|\, q_\theta) = \mathbb{E}_{x\sim p}\big[\log p(x)\big] - \mathbb{E}_{x\sim p}\big[\theta^\top\phi(x)\big] + \log \int_\Omega \exp\big(\theta^\top\phi(x)\big)\,\mathrm{d}x. $$
While I think this is a convex function (the rightmost term is reminiscent of a log-sum-exp), it's not really tractable as an objective, even when approximating the expectation with an average over the empirical distribution of $x_1, \dots, x_n$. The integral $\int_\Omega \exp(\theta^\top\phi(x))\,\mathrm{d}x$ is not tractable in closed form. However, examining the role of this term suggests some regularization against higher-frequency $\phi_j$. This intuition comes from the fact that, if $\phi_j$ is highly oscillatory and $\theta_j$ is nonzero, $|\theta^\top\phi(x)|$ will often be large and thus we will be far from the minimum. In this sense, we approximate the KL as
$$ D_{\mathrm{KL}}(p \,\|\, q_\theta) \approx \mathbb{E}_{x\sim p}\big[\log p(x)\big] - \frac{1}{n}\sum_{i=1}^n \theta^\top\phi(x_i) + \theta^\top R\,\theta. $$
For simplicity, imagine $R$ as a diagonal matrix with $R_{jj}$ scaling with the degree (frequency) of $\phi_j$ in some way. While this surrogate will clearly have a minimizer at $\theta = \tfrac{1}{2}R^{-1}\big(\tfrac{1}{n}\sum_i \phi(x_i)\big)$, this is physically meaningless due to the relaxation of the log-sum-exp term, i.e., this will not be guaranteed to minimize any statistical divergence in any manner. However, it seems reasonable to return to the Fisher divergence score-matching setting to say
$$ \min_\theta \; \theta^\top R\,\theta \quad \text{subject to} \quad \hat{A}\theta = -\hat{b}. $$
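
If one adopts a diagonal, degree-weighted $R$, the weighted problem reduces to an ordinary minimum-norm problem under the change of variables $\eta = R^{1/2}\theta$. A sketch, where the specific scaling $R_{jj} = \deg(\phi_j)^{\text{power}}$ is purely an illustrative assumption:

```python
# Sketch: solve  min theta^T R theta  s.t.  A_hat @ theta = -b_hat,  with R diagonal.
# Substituting eta = R^(1/2) theta gives an ordinary min-norm problem in eta.
import numpy as np

def weighted_min_norm(A_hat, b_hat, degrees, power=2.0):
    r = np.asarray(degrees, dtype=float) ** power   # illustrative choice of R_jj
    R_inv_sqrt = 1.0 / np.sqrt(r)
    M = A_hat * R_inv_sqrt[None, :]                 # A_hat @ R^(-1/2)
    eta = np.linalg.pinv(M) @ (-b_hat)              # min-norm (least-squares) eta
    return R_inv_sqrt * eta                         # theta = R^(-1/2) eta
```
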
A spectral method for diffusion models

Generally, the neural networks for diffusion models are trained by minimizing functions of the form
$$ \frac{1}{N}\sum_{i=1}^N \Big\| s_\theta\big(x^{(i)}_{t_i}, t_i\big) - \nabla_x \log p_{t_i \mid 0}\big(x^{(i)}_{t_i} \,\big|\, x^{(i)}_0\big) \Big\|^2, \qquad t_i \sim \pi,\quad x^{(i)}_0 \sim p,\quad x^{(i)}_{t_i} \sim p_{t_i \mid 0}\big(\cdot \mid x^{(i)}_0\big), $$
which approximates the loss
$$ \mathbb{E}_{t\sim\pi}\,\mathbb{E}_{x_t\sim p_t}\Big[\big\| s_\theta(x_t, t) - \nabla \log p_t(x_t) \big\|^2\Big] \quad \text{(up to a constant independent of } \theta\text{)}, $$
where $\pi$ is some density that we prescribe to weight the "importance" of different time points and $p_t$ is the measure of $x_t$ following the SDE
$$ \mathrm{d}x_t = f(x_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}W_t $$
with $x_0 \sim p$. Suppose now that we consider $f(x_t, t) = -x_t$ with $g(t) = \sqrt{2}$ (an Ornstein-Uhlenbeck process). Then,
$$ x_t \overset{d}{=} e^{-t}x_0 + \sqrt{1 - e^{-2t}}\,z, $$
where we abuse notation to say $x_0 \sim p$ and $z \sim \mathcal{N}(0, I_d)$ independently. Finally, suppose we separate the random variables from time in our functions via $\log q_\theta(x, t) = \sum_{i=1}^m \theta_i(t)\,\phi_i(x)$. Then,
$$ \theta^*(t) = -A(t)^{-1}b(t), \qquad A(t)_{ij} = \mathbb{E}_{x\sim p_t}\big[\nabla\phi_i(x)^\top\nabla\phi_j(x)\big], \quad b(t)_i = \mathbb{E}_{x\sim p_t}\big[\Delta\phi_i(x)\big]. $$
However, consider that we know $p_t \to \mathcal{N}(0, I_d)$ as $t \to \infty$. Then, we should have $\theta^*(t) \to \theta^\infty$, the Gaussian solution from the Hermite example above (e.g., $\theta^\infty_{2e_k} = -\tfrac{1}{2}$ in that basis).
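
Putting the pieces together, here is a sketch of the per-time-point solve in 1-D, assuming the Ornstein-Uhlenbeck forward process above, a truncated Hermite basis, and purely illustrative mixture data; for large $t$ the $He_2$ coefficient should approach $-1/2$:

```python
# Sketch: "spectral" coefficients theta*(t) along the (assumed) OU forward process
# x_t = e^{-t} x_0 + sqrt(1 - e^{-2t}) z, fit independently at each time point
# in a 1-D probabilists' Hermite basis.
import numpy as np
from numpy.polynomial import hermite_e as He

rng = np.random.default_rng(3)
n = 100_000
x0 = np.where(rng.random(n) < 0.5, rng.normal(-2, 0.5, n), rng.normal(2, 0.5, n))
degrees = range(1, 6)

def score_matching_coeffs(x):
    dphi, ddphi = [], []
    for k in degrees:
        c = np.zeros(k + 1); c[k] = 1.0                   # coefficients of He_k
        dphi.append(He.hermeval(x, He.hermeder(c, 1)))    # He_k'
        ddphi.append(He.hermeval(x, He.hermeder(c, 2)))   # He_k''
    dphi, ddphi = np.stack(dphi, 1), np.stack(ddphi, 1)
    return -np.linalg.solve(dphi.T @ dphi / len(x), ddphi.mean(0))

for t in [0.1, 0.5, 1.0, 2.0, 4.0]:
    xt = np.exp(-t) * x0 + np.sqrt(1 - np.exp(-2 * t)) * rng.standard_normal(n)
    print(t, np.round(score_matching_coeffs(xt), 3))  # He_2 entry drifts toward -0.5
```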