Polynomial Score Matching

Suppose we have data $x_1, \dots, x_N \sim p$ and we want to estimate $p$ from the data. One way to do this is via score matching, i.e., estimate a function representing the gradient of the log PDF, $\nabla \log p(x)$ (obligatory note that "score" in statistics usually refers to something of the form $\nabla_\theta \log p(x; \theta)$, a gradient with respect to parameters, which is not the same). Unless otherwise specified, consider $\nabla$ to indicate the gradient of a function in $x$ and $\Delta$ to be the Laplacian. To formulate this, we use the Fisher divergence as in Hyvärinen (2005) and formulate it as a loss function of the parameters $\theta$ of some $q_\theta$, an approximation to the PDF $p$:

$$
J(\theta) = \frac{1}{2}\,\mathbb{E}_{x\sim p}\!\left[\big\|\nabla \log q_\theta(x) - \nabla \log p(x)\big\|^2\right]
= \mathbb{E}_{x\sim p}\!\left[\tfrac{1}{2}\big\|\nabla \log q_\theta(x)\big\|^2 + \Delta \log q_\theta(x)\right] + C,
$$

where the last step comes from integration by parts, assuming the boundary terms (e.g., $p\,\nabla \log q_\theta$) are zero at the boundaries of the domain (i.e., as $\|x\| \to \infty$), and $C$ is a constant independent of $\theta$. We take $\tilde{J}(\theta) := \mathbb{E}_{x\sim p}\!\left[\tfrac{1}{2}\|\nabla \log q_\theta(x)\|^2 + \Delta \log q_\theta(x)\right]$ as a slightly "nicer" loss function, which has the same minimizer as $J$. Notably, this loss function doesn't require that $q_\theta$ is even a proper density; it just requires the ability to integrate with respect to $p$ (e.g., from samples) and to evaluate derivatives of $\log q_\theta$, which is what we try to model.
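
For completeness, the integration-by-parts step (as I read it, under the boundary assumption just stated) handles the cross term:

$$
\mathbb{E}_{x\sim p}\!\left[\nabla \log q_\theta(x)\cdot\nabla \log p(x)\right]
= \int \nabla \log q_\theta(x)\cdot\nabla p(x)\,dx
= -\int \Delta \log q_\theta(x)\,p(x)\,dx
= -\mathbb{E}_{x\sim p}\!\left[\Delta \log q_\theta(x)\right],
$$

using $\nabla p = p\,\nabla\log p$ in the first equality and integration by parts (with vanishing boundary terms) in the second; the remaining term $\tfrac{1}{2}\mathbb{E}_{x\sim p}[\|\nabla\log p(x)\|^2]$ does not depend on $\theta$ and is the constant $C$ above.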

It should be observed that, often in the score-matching literature, researchers will parameterize the gradient of $\log q_\theta$, not the function itself; this can produce a nonphysical solution, since the learned vector field won't necessarily even be of the form $\nabla \log q$ for any particular $q$. Now, we parameterize $\log q_\theta(x) = \sum_{k=1}^{K} \theta_k \varphi_k(x)$, where the $\varphi_k$ are scalar-valued functions (assume multivariate polynomial terms); without loss of generality, assume that $q_\theta(x) \to 0$ as $x \to \partial\Omega$, the boundary of our domain. Then, we know that

$$
\nabla \log q_\theta(x) = \sum_{k=1}^{K} \theta_k \nabla\varphi_k(x), \qquad
\Delta \log q_\theta(x) = \sum_{k=1}^{K} \theta_k \Delta\varphi_k(x).
$$

Therefore, we get that

$$
\tilde{J}(\theta) = \mathbb{E}_{x\sim p}\!\left[\frac{1}{2}\sum_{i,j}\theta_i\theta_j\,\nabla\varphi_i(x)\cdot\nabla\varphi_j(x) + \sum_{i}\theta_i\,\Delta\varphi_i(x)\right].
$$

If we then set $A_{ij} = \mathbb{E}_{x\sim p}\!\left[\nabla\varphi_i(x)\cdot\nabla\varphi_j(x)\right]$ and $b_i = \mathbb{E}_{x\sim p}\!\left[\Delta\varphi_i(x)\right]$, then our loss ends up becoming quadratic in $\theta$ as $\tilde{J}(\theta) = \tfrac{1}{2}\theta^\top A\theta + b^\top\theta$, which is known to have minimizer $\theta^\star = -A^{-1}b$ (when $A$ is invertible). It's particularly interesting that $A \in \mathbb{R}^{K\times K}$, i.e., it is square and finite-dimensional, even though our expectation requires an "infinite" number of evaluations. Contrast this with a Vandermonde-like matrix, which will grow with the number of samples. The next interesting point is to consider the elements of $A$ and $b$.
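
As a minimal sketch of the resulting finite-dimensional problem, here is a one-dimensional example with a monomial basis; the basis choice, sample size, and helper names are arbitrary choices of mine, not fixed by the text.

```python
# Minimal sketch: log q_theta(x) = sum_k theta_k phi_k(x) with phi_k(x) = x^k,
# so the score-matching loss is (1/2) theta^T A theta + b^T theta with
#   A[i, j] = E[ phi_i'(x) phi_j'(x) ],   b[i] = E[ phi_i''(x) ],
# and minimizer theta* = -A^{-1} b.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50_000)          # samples from the (here: standard normal) density p
K = 4                                # basis x, x^2, x^3, x^4 (no constant term)

def grad_phi(x):                     # phi_k'(x) = k x^(k-1), shape (N, K)
    return np.stack([k * x**(k - 1) for k in range(1, K + 1)], axis=1)

def lap_phi(x):                      # phi_k''(x) = k (k-1) x^(k-2), shape (N, K)
    cols = [np.zeros_like(x)]        # phi_1'' = 0
    cols += [k * (k - 1) * x**(k - 2) for k in range(2, K + 1)]
    return np.stack(cols, axis=1)

G, L = grad_phi(x), lap_phi(x)
A = G.T @ G / len(x)                 # empirical E[phi_i' phi_j']
b = L.mean(axis=0)                   # empirical E[phi_i'']
theta = -np.linalg.solve(A, b)
print(np.round(theta, 2))            # roughly (0, -0.5, 0, 0): log q(x) ≈ -x^2/2 + const
```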

An example in the infinite-sample limit

Now suppose $p = \mathcal{N}(0, I_d)$ and $\log q_\theta(x) = \sum_{\alpha \in \mathcal{A}} \theta_\alpha \mathrm{He}_\alpha(x)$ for some multi-index set $\mathcal{A}$ (assume it contains the indices $2e_j$), i.e., we want to estimate the coefficients $\theta_\alpha$ of the tensor-product probabilist's Hermite polynomials $\mathrm{He}_\alpha(x) = \prod_j \mathrm{He}_{\alpha_j}(x_j)$ to approximate $\log p$. We know that $\partial_{x_j}\mathrm{He}_\alpha = \alpha_j \mathrm{He}_{\alpha - e_j}$ and that $\mathbb{E}_p[\mathrm{He}_\alpha] = 0$ for $\alpha \neq 0$, since all probabilist's Hermite polynomials are orthogonal to $\mathrm{He}_0 \equiv 1$ under the standard normal, so

$$
b_\alpha = \mathbb{E}_p\!\left[\Delta \mathrm{He}_\alpha(x)\right]
= \sum_j \alpha_j(\alpha_j - 1)\,\mathbb{E}_p\!\left[\mathrm{He}_{\alpha - 2e_j}(x)\right]
= 2\cdot\mathbf{1}\!\left\{\alpha = 2e_j\right\}
$$

for some $j$, where $e_j$ is the elementary vector with all zeros except for a one in entry $j$, and by convention $\mathrm{He}_{\alpha - 2e_j} := 0$ for $\alpha_j < 2$. Similarly,

$$
A_{\alpha\beta} = \mathbb{E}_p\!\left[\nabla \mathrm{He}_\alpha(x)\cdot\nabla \mathrm{He}_\beta(x)\right]
= \sum_j \alpha_j\beta_j\,\mathbb{E}_p\!\left[\mathrm{He}_{\alpha-e_j}(x)\,\mathrm{He}_{\beta-e_j}(x)\right]
= |\alpha|\,\alpha!\;\delta_{\alpha\beta},
$$

implying that $A$ is precisely diagonal. Then, since we have $\theta^\star = -A^{-1}b$, we get that $\theta^\star_\alpha \neq 0$ only when $\alpha = 2e_j$ for some $j$. Now suppose $\alpha$ is of this form. We must have that

$$
\theta^\star_{2e_j} = -\frac{b_{2e_j}}{A_{2e_j,\,2e_j}} = -\frac{2}{2\cdot 2!} = -\frac{1}{2},
$$

which implies that $\log q_{\theta^\star}(x) = -\tfrac{1}{2}\sum_{j=1}^{d}\mathrm{He}_2(x_j) = -\tfrac{1}{2}\|x\|^2 + \tfrac{d}{2}$ (notably we must have that there is some constant that normalizes this, which we can't find via score matching). This is precisely giving that $\nabla\log q_{\theta^\star} = \nabla\log p$, i.e., the estimated score will converge exactly.
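
A quick Monte Carlo sanity check of this calculation in one dimension, using NumPy's probabilist's Hermite module; the truncation order and sample size are arbitrary choices of mine.

```python
# Monte Carlo check of the diagonal structure (d = 1): with basis He_1..He_4,
# A[m, n] = E[He_m'(x) He_n'(x)] and b[m] = E[He_m''(x)] under x ~ N(0, 1).
# Claim: A is diagonal (entries m * m!), b is supported on He_2 only, and the
# recovered coefficient on He_2 is -1/2, i.e. log q = -x^2/2 + const.
import numpy as np
from numpy.polynomial import hermite_e as H

rng = np.random.default_rng(0)
x = rng.normal(size=200_000)
K = 4                                    # He_1..He_4 (He_0 excluded: it makes A singular)

def he_coeffs(k):                        # coefficient vector selecting He_k
    c = np.zeros(k + 1)
    c[k] = 1.0
    return c

dphi  = np.stack([H.hermeval(x, H.hermeder(he_coeffs(k)))    for k in range(1, K + 1)], axis=1)
ddphi = np.stack([H.hermeval(x, H.hermeder(he_coeffs(k), 2)) for k in range(1, K + 1)], axis=1)

A = dphi.T @ dphi / len(x)               # ≈ diag(1, 4, 18, 96) = diag(m * m!)
b = ddphi.mean(axis=0)                   # ≈ (0, 2, 0, 0)
print(np.round(-np.linalg.solve(A, b), 3))   # ≈ (0, -0.5, 0, 0)
```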

Finite-sample case and overparameterization

Suppose we have samples $x_1, \dots, x_N \sim p$ and the input dimension is $d$ as before, and consider estimating $A$ empirically. We can formulate that

$$
A = \mathbb{E}_{x\sim p}\!\left[G(x)\,G(x)^\top\right],
$$

where $G : \mathbb{R}^d \to \mathbb{R}^{K\times d}$ is a function such that $G(x)_{kj} = \partial_{x_j}\varphi_k(x)$. Then, we estimate $A$ as

$$
\hat{A} = \frac{1}{N}\sum_{n=1}^{N} G(x_n)\,G(x_n)^\top,
$$

so in this case, we can generally assume $\operatorname{rank}(\hat{A}) = \min(K, Nd)$ (though this is technically only an upper bound, linear independence of the basis gradients should give this handily). Contrast this to the Vandermonde case, where $V_{nk} = \varphi_k(x_n)$, which gives a matrix of rank $\min(K, N)$. While these two numbers seem different, there is actually some parity to them; score matching attempts to match the $d$-dimensional gradient $\nabla\log p$, which gives the illusion of an extra dimension of freedom from the gradient. Therefore, to adjust for this extra freedom, the number of parameters should be scaled by the reciprocal of the dimension when comparing against $N$.
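
A small numerical check of this rank behavior; the bivariate Hermite basis, sample sizes, and helper names are illustrative choices of mine.

```python
# Check rank(A_hat) = min(K, N*d) with an illustrative basis: bivariate
# products He_a(x1) He_b(x2) with total degree 1..4, so K = 14.
import numpy as np
from numpy.polynomial import hermite_e as H

def he(x, k):                            # He_k(x)
    c = np.zeros(k + 1)
    c[k] = 1.0
    return H.hermeval(x, c)

def dhe(x, k):                           # He_k'(x) = k He_{k-1}(x)
    return k * he(x, k - 1) if k > 0 else np.zeros_like(x)

idx = [(a, b) for a in range(5) for b in range(5) if 1 <= a + b <= 4]   # K = 14

def A_hat(X):                            # X has shape (N, 2)
    # G[n, k, j] = d/dx_j of basis function k evaluated at sample n
    G = np.stack([np.stack([dhe(X[:, 0], a) * he(X[:, 1], b),
                            he(X[:, 0], a) * dhe(X[:, 1], b)], axis=1)
                  for (a, b) in idx], axis=1)
    return np.einsum('nkj,nlj->kl', G, G) / len(X)

rng = np.random.default_rng(0)
for N in (3, 50):                        # N*d < K, then N*d > K
    X = rng.normal(size=(N, 2))
    print(N, np.linalg.matrix_rank(A_hat(X)), min(len(idx), 2 * N))  # expect 6, 6 then 14, 14
```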

We can also form $\hat{b} = \frac{1}{N}\sum_{n=1}^{N}\Delta\varphi(x_n)$, where $\Delta$ is applied elementwise to the vector of basis functions $\varphi = (\varphi_1,\dots,\varphi_K)$. Then, we wish to solve $\hat{A}\theta = -\hat{b}$. Obviously, this can be done uniquely when $\hat{A}$ has full rank. However, in the "overparameterized" case (e.g., $K > Nd$), there are clearly an infinite number of solutions by simple linear algebra. A traditional technique would be to solve

$$
\min_\theta \|\theta\|_2 \quad \text{subject to} \quad \hat{A}\theta = -\hat{b},
$$

i.e., find the minimum-norm solution. This can be done via allowing $\hat{A} = V\Lambda V^\top$, where $V$ and $\Lambda$ give the orthogonal eigenvectors and their nonzero eigenvalues, respectively (we know that $\hat{A}$ is symmetric positive semi-definite since it is the sum of SPSD matrices). Then we know that $\theta = -V\Lambda^{-1}V^\top\hat{b}$ will be the optimal solution (Trefethen and Bau?).
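
A sketch of the minimum-norm solve; the eigendecomposition route mirrors the construction above, and NumPy's pseudoinverse gives the same answer.

```python
# Minimum-norm coefficients when A_hat is rank deficient: keep only eigenpairs
# with nonzero eigenvalue (A_hat is symmetric PSD) and apply the pseudoinverse to -b_hat.
import numpy as np

def min_norm_theta(A_hat, b_hat, tol=1e-10):
    evals, evecs = np.linalg.eigh(A_hat)           # A_hat = V diag(evals) V^T
    keep = evals > tol * evals.max()               # drop numerically zero modes
    V, lam = evecs[:, keep], evals[keep]
    return -(V / lam) @ (V.T @ b_hat)              # theta = -V diag(1/lam) V^T b_hat

# equivalently: theta = -np.linalg.pinv(A_hat, hermitian=True) @ b_hat
```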

However, while this has an optimal solution in closed form, the objective is not at all well motivated in this context, besides that it will ensure $\theta$ doesn't have entries diverging to infinity. Perhaps a better solution would be to minimize the Kullback-Leibler divergence, $D_{\mathrm{KL}}(p\,\|\,q_\theta)$. By definition,

$$
D_{\mathrm{KL}}(p\,\|\,q_\theta) = \mathbb{E}_{x\sim p}\!\left[\log p(x)\right] - \mathbb{E}_{x\sim p}\!\left[\sum_k \theta_k\varphi_k(x)\right] + \log\!\int \exp\!\Big(\sum_k\theta_k\varphi_k(x)\Big)\,dx.
$$

While I think this is a convex function (the rightmost term is reminiscent of a log-sum-exp), it's not really tractable as an objective even when approximating the expectation with an average over the empirical distribution of $x$. The integral over the domain is not tractable in closed form. However, examining the role of this term suggests some regularization against higher-frequency $\varphi_k$. This intuition comes from the fact that, if $\varphi_k$ is highly oscillatory and $\theta_k$ is nonzero, the log-partition term will often be large and thus we will be far from the minimum. In this sense, we approximate the KL as

For simplicity, imagine as a diagonal matrix with scaling with in some way. While will clearly have a minimizer at , this is physically meaningless due to the relaxation of the log-sum-exp term, i.e., this will not be guaranteed to minimize any statistical divergence in any manner. However, it seems reasonable to return to the Fisher divergence score-matching setting to say

A spectral method for diffusion models

Generally, the neural networks for diffusion models are minimizing functions of the form

which approximates the loss

where is some density that we prescribe to the "importance" of different time-points and is the measure of following the SDE

with . Suppose now that we consider with . Then,

where we abuse notation to say and . Finally, suppose we separate the random variables from time in our functions via Then,

However, consider that we know . Then, we should have . Therefore, we construct a basis where for and, if we have multi-indices , we create a set such that, when , we have for some dimension and some (i.e., if we look at the basis functions invariant in time which do not go to zero, we only approximate on functions that can represent ). Now we partition the set of multi-indices into , where for , , and . Without loss of generality on the ordering (addition is commutative), this allows us to partition the coefficients as as well as and . We see that the exact minimizer is given by
For ease of programming, we note that
If we discretize this integral in time, we get

Multi-indexing

Suppose we have a basis for the data dimension parameterized by degree $k$. For the unbounded domain, we transport the data to the standard normal, so we assume that the basis is suitable across all dimensions. Further, we assume that the target distribution is sub-Gaussian, so $p(x) \lesssim e^{-c\|x\|^2}$ for some constant $c > 0$. Therefore, we form a univariate basis as follows:

$$
\varphi_k(x) = \mathrm{He}_k(x)\, m(x),
$$

where $m$ is a "mollifier" function. A simple example would be $m(x) = e^{-x^2/4}$, which produces the Hermite functions. A more complicated choice would be to take $m$ as the Gaspari-Cohn localization function, which imitates the Gaussian bump for some choice of radius, but it becomes 0 outside of that radius with two continuous derivatives.
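
A minimal sketch of such a mollified univariate basis; the Gaussian mollifier follows the simple example above, while the compactly supported bump is only a stand-in of mine for Gaspari-Cohn (not its actual formula).

```python
# Mollified univariate data basis: phi_k(x) = He_k(x) * m(x).
# With m(x) = exp(-x^2/4) these are (up to scaling) the Hermite functions; the
# compactly supported C^2 bump below is a crude stand-in for Gaspari-Cohn.
import numpy as np
from numpy.polynomial import hermite_e as H

def mollifier(x, kind="gaussian", radius=6.0):
    if kind == "gaussian":
        return np.exp(-x**2 / 4.0)
    u = np.clip(np.abs(x) / radius, 0.0, 1.0)      # compact support on [-radius, radius]
    return (1.0 - u**2) ** 3                       # value and first two derivatives vanish at the edge

def phi(x, k, **kw):
    c = np.zeros(k + 1)
    c[k] = 1.0                                     # coefficient vector selecting He_k
    return H.hermeval(x, c) * mollifier(x, **kw)
```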

Similarly, we create a basis to parameterize the pseudo-spectral behavior in time. Since time exists in $[0, \infty)$, we believe that Laguerre polynomials work well. However, since we don't actually run the Ornstein-Uhlenbeck process to $t = \infty$ and we don't want the approximation to be too sensitive at the first steps starting from $t = 0$, we also adopt a mollification scheme in time. Since we use the Laguerre polynomials, it seems natural to choose the square root of the Gaspari-Cohn function as the temporal mollifier. To ensure we allow for the asymptotic Gaussian behavior, we choose
Then, for by construction, which imposes some interesting choices of "admissible" multivariate basis functions. First, assume $d = 1$, i.e., we have a univariate function, and consider a set as our set describing admitted basis functions, where membership means we represent
If we assume , then can only be in if ; this is equivalent to enforcing that for some , which is analytically true. This comes from the fact that we use mollifiers for all data basis functions with degree $\geq 1$ and all nonconstant temporal basis functions.

In general dimension $d$, we consider an admissible multi-index set of basis functions. Given a maximal complexity in time, $K_t$, we can make a simple set of admissible multivariate basis functions as
where $e_i$ is understood as a vector with all zeros except a one at element $i$. Some remarks (a rough construction of such a set in code follows the remarks):

  • One can replace the first tensor-product term with something more complicated (e.g. a total-order set), but it cannot have any terms constant in time.
  • Additionally, the terms constant in time shouldn't be just any quadratic terms; they must be cardinal directions! For example, a cross term like $x_1 x_2$ (multi-index $e_1 + e_2$) is not admissible, so it can't just be a generic tensor or even total-order-2 set.
  • Another "softer" restriction is that no admitted term should be constant in the data variables, since this corresponds to the constant of proportionality, which we will not find using this objective function (nor do we need). One will find that including these terms will create a singular matrix $\hat{A}$.
  • For more experienced practitioners of multi-index sets, an immediately notable feature of this set is that it is not "downward closed"; to that I say ¯\_(ツ)_/¯. I will point out that the first set is downward closed when ignoring constant terms, and the unioned set has known coefficients (and no admissible forward neighbors constant in time).
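
As referenced above, here is a rough construction of such an admissible set; the per-dimension degree cap P and the exact index ranges are illustrative assumptions of mine, not fixed by the text.

```python
# Hedged construction of the admissible multi-index set described above.
# A member is (alpha, k): alpha is the data multi-index, k the temporal degree.
# Tensor block: alpha over per-dimension degrees 0..P (excluding the all-zeros
# alpha), paired with nonconstant temporal degrees k = 1..K_t.
# Time-constant block: only the cardinal quadratics (2 e_i, 0).
from itertools import product

def admissible_set(d, P, K_t):
    zero = (0,) * d
    tensor = {(alpha, k)
              for alpha in product(range(P + 1), repeat=d) if alpha != zero
              for k in range(1, K_t + 1)}
    cardinal = {(tuple(2 if j == i else 0 for j in range(d)), 0) for i in range(d)}
    return tensor | cardinal

print(sorted(admissible_set(2, 2, 1)))   # 8 time-varying terms plus ((2, 0), 0) and ((0, 2), 0)
```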