Suppose we have samples and the input dimension is as before, and consider . We can formulate that
where is a function such that . Then, we estimate as
so in this case, we can generally assume (though this is technically only an upper bound; linear independence of the functions should give equality handily). Contrast this to the Vandermonde case, where , which gives a matrix of rank . While these two numbers seem different, there is actually some parity between them; score matching attempts to match , which gives the illusion of an extra dimension of freedom from the gradient. Therefore, to adjust for this extra freedom, the number of parameters should be scaled by the reciprocal of the dimension when comparing against .
We can also form , where is applied elementwise. Then, we wish to solve . Obviously, this can be done uniquely when . However, in the "overparameterized" case, there are clearly infinitely many solutions by simple linear algebra. A traditional technique would be to solve
i.e., find the minimum-norm solution. This can be done by writing , where and give the orthogonal eigenvectors and their nonzero eigenvalues, respectively (we know that is symmetric positive semi-definite since it is a sum of SPSD matrices). Then will be the optimal solution (Trefethen and Bau?).
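As a concrete reference, here is a minimal numerical sketch of that minimum-norm solve, assuming the system matrix (call it `A`, symmetric positive semi-definite as argued above) and right-hand side `b` are already assembled; the function name and interface are hypothetical:

```python
import numpy as np

def min_norm_solve(A, b, tol=1e-10):
    """Minimum-norm solution of A c = b for symmetric PSD A (hypothetical
    helper): eigendecompose, drop numerically-zero eigenvalues, and invert
    A on its range only."""
    lam, V = np.linalg.eigh(A)            # eigenvalues ascending, A = V diag(lam) V^T
    keep = lam > tol * np.abs(lam).max()  # indices of nonzero eigenvalues
    V, lam = V[:, keep], lam[keep]
    # c = V diag(1/lam) V^T b is the least-squares solution of minimum norm
    return V @ ((V.T @ b) / lam)
```

In practice this returns the same answer as `np.linalg.pinv(A) @ b` or `np.linalg.lstsq(A, b, rcond=None)[0]`; the explicit eigendecomposition just mirrors the derivation above.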
However, while this has an optimal solution in closed form, the objective is not at all well motivated in this context, beyond ensuring that does not have entries diverging to infinity. Perhaps a better solution would be to minimize the Kullback-Leibler divergence, . By definition,
While I think this is a convex function (the rightmost term is reminiscent of a log-sum-exp), it's not really tractable as an objective, even when approximating the expectation with an average over the empirical distribution of . The integral over is not tractable in closed form. However, examining the role of this term suggests some regularization against higher-frequency . The intuition is that, if is highly oscillatory and is nonzero, then will often hold, and thus we will be far from the minimum. In this sense, we approximate the KL as
For simplicity, imagine as a diagonal matrix with scaling with in some way. While will clearly have a minimizer at , this is physically meaningless due to the relaxation of the log-sum-exp term; i.e., it is no longer guaranteed to minimize any statistical divergence. However, it seems reasonable to return to the Fisher-divergence score-matching setting to say
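In code, this relaxed objective amounts to a generalized ridge regression. A minimal sketch, assuming (hypothetically) that the normal equations take the form (A + Γ)c = b with a diagonal Γ whose entries grow with basis degree; the quadratic growth law below is an illustrative stand-in:

```python
import numpy as np

def ridge_solve(A, b, degrees, strength=1e-3):
    """Generalized ridge solve (hypothetical interface): a diagonal penalty
    whose k-th entry grows like (1 + degree_k)**2, an illustrative stand-in
    for Gamma 'scaling with frequency in some way'."""
    Gamma = strength * np.diag((1.0 + np.asarray(degrees, dtype=float)) ** 2)
    return np.linalg.solve(A + Gamma, b)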
A spectral method for diffusion models
Generally, the neural networks in diffusion models minimize functions of the form
which approximates the loss
where is some density prescribing the "importance" of different time points and is the law of following the SDE
with . Suppose now that we consider with . Then,
where we abuse notation to write and . Finally, suppose we separate the random variables from time in our functions via . Then,
However, consider that we know . Then, we should have . Therefore, we construct a basis where for and, if we have multi-indices , we create a set such that, when , we have for some dimension and some (i.e., among the basis functions invariant in time that do not vanish, we only approximate with functions that can represent ). Now we partition the set of multi-indices into , where for , , and . Without loss of generality on the ordering (addition is commutative), this allows us to partition the coefficients as , as well as and . We see that the exact minimizer is given by
For ease of programming, we note that
If we discretize this integral in time, we get
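A minimal sketch of that assembly, assuming (hypothetically) that the exact minimizer solves a linear system whose matrix and right-hand side are integrals over time; `gram_at`, `rhs_at`, and the trapezoidal rule are placeholder choices for the per-time-point quantities and the quadrature:

```python
import numpy as np

def trapezoid_weights(t):
    """Trapezoidal quadrature weights on a (possibly nonuniform) grid."""
    w = np.zeros_like(t)
    dt = np.diff(t)
    w[:-1] += 0.5 * dt
    w[1:] += 0.5 * dt
    return w

def assemble_in_time(gram_at, rhs_at, weight_fn, t_grid):
    """Approximate the time integrals of w(t) A(t) and w(t) b(t) by
    quadrature over t_grid (hypothetical interface: gram_at and rhs_at
    return the per-time-point matrix and vector)."""
    w = weight_fn(t_grid) * trapezoid_weights(t_grid)
    A = sum(wi * gram_at(ti) for wi, ti in zip(w, t_grid))
    b = sum(wi * rhs_at(ti) for wi, ti in zip(w, t_grid))
    return A, b
```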
Multi-indexing
Suppose we have a basis for the data dimension parameterized by . For the unbounded domain, we transport to the standard Normal, so we assume that the basis is suitable across all dimensions . Further, we assume that the target distribution is sub-Gaussian, so for some constant . Therefore, we form a univariate basis as follows:
where is a "mollifier" function. A simple example would be , which produces the Hermite functions. A more complicated choice would be with as the Gaspari-Cohn localization function, which imitates for some choice of but vanishes outside of some radius while retaining two continuous derivatives.
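A sketch of both mollifiers, assuming the probabilists' Hermite polynomials from `numpy.polynomial.hermite_e` (which convention the text intends is not specified); the Gaspari-Cohn piecewise rational function follows Gaspari and Cohn (1999):

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def gaspari_cohn(x, c):
    """Gaspari-Cohn 5th-order piecewise rational localization (Gaspari and
    Cohn, 1999): equals 1 at x = 0, vanishes identically for |x| >= 2c,
    and has two continuous derivatives."""
    r = np.atleast_1d(np.abs(np.asarray(x, dtype=float))) / c
    out = np.zeros_like(r)
    near, mid = r <= 1.0, (r > 1.0) & (r < 2.0)
    rn, rm = r[near], r[mid]
    out[near] = -0.25 * rn**5 + 0.5 * rn**4 + 0.625 * rn**3 - 5 / 3 * rn**2 + 1
    out[mid] = (rm**5 / 12 - 0.5 * rm**4 + 0.625 * rm**3 + 5 / 3 * rm**2
                - 5 * rm + 4 - 2 / (3 * rm))
    return out

def basis_1d(x, k, mollifier):
    """k-th mollified basis function: He_k(x) * m(x), with He_k the
    probabilists' Hermite polynomial (convention assumed here)."""
    coeffs = np.zeros(k + 1)
    coeffs[k] = 1.0
    return hermeval(x, coeffs) * mollifier(x)

# mollifier = lambda x: np.exp(-x**2 / 2) recovers (unnormalized) Hermite
# functions; a compactly supported alternative built from gaspari_cohn
# mimics the Gaussian decay but is exactly zero outside radius 2c.
```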
Similarly, we create a basis to parameterize the pseudo-spectral behavior in time. Since time lives in , we believe that Laguerre polynomials work well. However, since we don't actually run the Ornstein-Uhlenbeck process to and we don't want the approximation to be too sensitive at the first steps starting from , we also adopt a mollification scheme . Since we use the Laguerre polynomials, it seems natural to choose , the square root of the Gaspari-Cohn function. To ensure we allow for the asymptotic Gaussian behavior, we choose
Then, for by construction, which imposes some interesting restrictions on the "admissible" multivariate basis functions. First, assume , i.e., we have a univariate function, and consider a set describing the admitted basis functions, where means we represent
If we assume , then can only be in if ; this is equivalent to enforcing that for some , which is analytically true. This comes from the fact that we use mollifiers for all data basis functions with degree and all nonconstant temporal basis functions.
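Before moving to general dimension, here is a matching sketch for the temporal basis, reusing `gaspari_cohn` from the earlier block; the extra term ensuring the asymptotic Gaussian behavior is omitted, since its exact form is fixed by the displayed choice above:

```python
import numpy as np
from numpy.polynomial.laguerre import lagval

def time_basis(t, j, c):
    """j-th temporal basis function: Laguerre polynomial L_j(t) times the
    square root of the Gaspari-Cohn mollifier (the choice in the text;
    gaspari_cohn is defined in the earlier sketch)."""
    coeffs = np.zeros(j + 1)
    coeffs[j] = 1.0
    return lagval(t, coeffs) * np.sqrt(gaspari_cohn(t, c))
```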
In general dimension , we consider an admissible multi-index set of basis functions . Given a maximal complexity in time , we can make a simple set of admissible multivariate basis functions as
where is understood as a vector with all zeros except a one at element . Some remarks (a construction sketch in code follows the list):
- One can replace the first tensor-product term with something more complicated (e.g. a total-order set), but it cannot have any terms constant in time.
- Additionally, the terms constant in time shouldn't be just any quadratic terms; they must be cardinal directions! For example, is not admissible, so the constant-in-time block can't just be a generic tensor-product or even total-order-2 set.
- Another "softer" restriction is that , since this corresponds to the constant of proportionality, which we will not find using this objective function (nor do we need to). One will find that including these terms creates a singular matrix .
- For more experienced practitioners of multi-index sets, an immediately notable feature of is that it is not "downward closed"---to that I say ¯\_(ツ)_/¯. I will point out that the first set is downward closed when ignoring constant terms, and the unioned set has known coefficients (and no admissible forward neighbors constant in time).
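Since this set is easy to get wrong in code, here is a hypothetical construction matching the prose and remarks above (tensor-product data indices paired only with nonconstant time degrees, unioned with the cardinal quadratic terms constant in time, and the zero index excluded); the degree bounds `K_x` and `K_t` are illustrative parameters:

```python
from itertools import product

def admissible_set(d, K_x, K_t):
    """Hypothetical construction of the admissible multi-index set: pairs
    (data multi-index, time degree). Tensor-product data indices up to
    degree K_x pair only with nonconstant time degrees 1..K_t; the only
    constant-in-time terms are the cardinal quadratics 2*e_i."""
    A = set()
    for kx in product(range(K_x + 1), repeat=d):   # tensor-product block,
        for kt in range(1, K_t + 1):               # never constant in time
            A.add((kx, kt))
    for i in range(d):                             # constant-in-time block:
        e2 = [0] * d
        e2[i] = 2                                  # cardinal direction 2*e_i
        A.add((tuple(e2), 0))
    # per the remark above, the all-zero index (constant of proportionality)
    # never enters the set
    assert (tuple([0] * d), 0) not in A
    return sorted(A)
```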