Input: multi-index set mset_full, which is assumed downward-closed
Output: quad_rules: set of tensor-product quadratures and their (possibly negative) associated cardinality `Vector{Tuple{Midx, Int}}`
keep_looping = true
quad_rules = []

while keep_looping
	# Form the mset for this loop
	look at indexes of occurrences that are not 1.
	form them into mset_loop
	get reduced frontier frontier of mset_loop
	# Adjust each frontier member and reindex their ancestors' number of occurrences
	for idx_midx_loop in frontier:
		# Access the adjustment for this frontier member
		get index idx_midx_full corresponding to the multi-index idx_midx_loop in mset_full
		set j = occurrences[idx_midx_full] - 1, e.g., if idx_midx_full occurs $j+1=-1$ times, then $j=-2$.
		occurrences[idx_midx_full] = 1

		# Reindex the number of occurrences of the backward ancestors
		get backward ancestors of idx_midx_loop
		for each ancestor idx_back_loop of idx_midx_loop
			get original index idx_back_full of idx_back_loop (i.e. index of idx_back_loop w.r.t. mset_full, not mset_loop)
			occurrences[idx] -= j
		end for
		push (idx_midx_full, j) onto quad_rules
	end for
	keep_looping = occurrences is not identically 1
end while

return quad_rules

Suppose we have data and you want to estimate from the data. One way to do this is via score matching, i.e., estimate a function representing the gradient of the log PDF (obligatory note that "score" in statistics usually refers to something of the form , which is not the same). Unless otherwise specified, consider to indicate the gradient of a function in and to be the Laplacian. To formulate this, we use the Fisher divergence as in Hyvaerinen 05 and formulate it as a loss function of parameters of some , an approximation to the PDF .

where the last step comes from integration by parts assuming are zero at the boundaries of the domain (i.e., ). We let as a slightly "nicer" loss function, which has the same minimizer as . Notably, this loss function doesn't require that is even a proper density, it just requires the ability to integrate with respect to (e.g., from samples), and evaluate derivatives of , which is what we try to model.

It should be observed that, often in score-matching literature, researchers will parameterize the gradient of , not the function itself; this can produce a nonphysical solution since the learned function won't necessarily even be of the form for any particular . Now, we parameterize where are scalar-valued functions (assume multivariate polynomial terms); without loss of generality, assume that as , the boundary of our domain. Then, we know that

Therefore, we get that

If we then set and , then our loss ends up becoming quadratic in as , which is known to have minimizer . It's particularly interesting that , i.e., it is square and finite dimensional, even though our expectation requires an "infinite" number of evaluations. Contrast this with a Vandermonde-like matrix, which will grow with the number of samples. The next interesting point is to consider the elements of and .

An example in infinite sample limit

Now suppose and for some multi-index set , i.e., we want to find estimate the coefficients for to approximate . We know that and that since all probabilist Hermite polynomials are orthogonal to , so

for some , where is the elementary vector with all zeros except for in entry and by convention for . Similarly,

implying that is precisely diagonal. Then, since we have , we get that when for some . Now suppose is of this form. We must have that

which implies that (notably we must have that there is some constant that normalizes this which we can't find via score-matching). This is precisely giving that , i.e., will converge exactly.

Finite sample case and Overparameterization

Suppose we have samples and the input dimension is as before, and consider . We can formulate that

where is a function such that . Then, we estimate as

so in this case, we can generally assume (though this is technically only an upper bound, linear independence of the functions should give this handily). Contrast this to the Vandermonde case, where , which gives a matrix of rank . While these two numbers seem different, there is actually some parity to them; score matching attempts to match , which gives the illusion of an extra dimension of freedom from the gradient. Therefore, to adjust for this extra freedom, the number of parameters should be scaled by the reciprocal of the dimension when comparing against .

We can also form , where is applied elementwise. Then, we wish to solve . Obviously, this can be done uniquely when . However, in the "overparameterized" case, there are clearly an infinite number of solutions by simple linear algebra. A traditional technique would be to solve

i.e., find the minimum norm solution. This can be done via allowing , where and give the orthogonal eigenvectors and their nonzero eigenvalues, respectively (we know that is symmetric positive semi-definite since it is the sum of SPSD matrices). Then we know that will be the optimal solution (Trefethen and Bau?).

However, while this has an optimal solution in closed form, the objective is not at all well motivated in this context, besides that it will ensure doesn't have entries diverging to infinity. Perhaps a better solution would be to minimize the Kullback-Leibler divergence, . By definition,

While I think this is a convex function (the rightmost term is reminiscient of a log-sum-exp), it's not really tractable as an objective even when approximating the expectation with an average over the empirical distribution of . The integral over is not tractable in closed form. However, examining the role of this term suggests some regularization against higher frequency . This intuition comes from that, if is highly oscillatory and is nonzero, will often be true and thus we will be far from the minimum. In this sense, we approximate the KL as

For simplicity, imagine as a diagonal matrix with scaling with in some way. While will clearly have a minimizer at , this is physically meaningless due to the relaxation of the log-sum-exp term, i.e., this will not be guaranteed to minimize any statistical divergence in any manner. However, it seems reasonable to return to the Fisher divergence score-matching setting to say

A spectral method for diffusion models

Generally, the neural networks for diffusion models are minimizing functions of the form

which approximates the loss

where is some density that we prescribe to the "importance" of different time-points and is the measure of following the SDE

with . Suppose now that we consider with . Then,

where we abuse notation to say and . Finally, suppose we separate the random variables from time in our functions via Then,

However, consider that we know . Then, we should have .

Polynomial Score Matching

Function Approximation


Suppose we have two probability measures, , and we would like to find the distance between them. While one naturally might consider the Wasserstein metric, which (by Brenier-Benamou) is given by

this is well-studied and has well-known restrictions. An alternative here would be the Fisher-Rao distance,

Note that this is distance is constrained by a reaction PDE instead of a continuity equation. We can see that

where . Since we know , then the total mass can't possibly change (consider discretizing this via Euler and noting that it won't change for any step-size , so it converges as ). One can see that the (adjusted) Fisher-Rao induces a geodesic of the form

but this is irrelevant to the topic at hand.


The PDE constraining the Wasserstein-Fisher-Rao metric will intuitively be

i.e., we use velocity field and "reaction function" to govern the dynamics of . For a given functional , we would like to find a "Wasserstein-Fisher-Rao" gradient flow. In particular, we would like to find a velocity field and reaction function which determine a s.t. minimizes using constant-speed geodesics in the WFR geometry. To do this, we first must define a "Wasserstein-Fisher-Rao gradient". Generally one defines such gradients by writing as for some governing the PDE (TODO: elaborate?). Observe that

Then, if I have a measure , I would end up with something like

See, e.g., YWR23

KL functional

At first, we may be interested in the KL-divergence function often used in gradient flows, which (for a given distribution ) has pertinent information

but if we keep as a discrete measure, we will not be able to ever estimate . It is possible to use a kernelized version of in the vein of SVGD, but I'm not sold. Alternatively, I could use score matching. For the record, this gives

I don't think I need to center the weight derivative? Anyway, if is truly a discrete distribution, then this is undefined. If you kernelize it.... I'm not entirely sure, but it'd give something I guess? On the other hand, perhaps you could match the score (or score ratio). See Polynomial Score Matching, which would need to be modified for accounting for , which weights the true .

MMD functional?

Suppose we now have

where is the norm in the RKHS induced by the kernel . One can find that

where refers to the gradient in the second argument. Suppose that . Then,
. More generally, if for positive and symmetric distance, then you get which is generally pretty easy to compute. This gives the update equation as

Why is this nice?

What's nice here is that this is "forgetful" of our initial statue , assuming the convexity and all that (probably not the case, but still). The choice of is one of preconditioning, it's not an end-all-be-all choice in theory. So far in our transport quadrature, our choice of reference basically entirely constrains the points to be the pushforward of the reference quadrature under the map with no change to the weights, where this isn't necessarily the case.

Alternative "static" formulation

Take our Hilbert space to be polynomials up to degree with inner product

so our kernel becomes the Christoffel-Darboux kernel of

with the norm. I.e., this kernel projects onto the polynomials , which are orthogonal w.r.t. . Without loss of generality, assume . Obviously, for any function , we see

which means the embedding of an arbitrary function into our RKHS is exactly the orthogonal projection of . Our maximum mean discrepancy becomes

Assume we want to match points and weights , i.e., discrete measure to some target . We see

with . Then, we get that

and somewhat less obviously, suppose we abuse notation to consider a diagonal matrix with diagonal entry equal to ; matrix calculus dictates

which indicates the trace is operating on a matrix with only one nonzero column ; therefore, is the th entry in the th column for and zero otherwise. This corresponds to

Here we maintain control of three things: . We know that induces the orthogonal polynomial family , and this gets normalized (with respect to ), so we don't get to choose which scaling for the family. Notably, for fixed , this becomes a quadratic loss in , so we can use a linear solve to find the solution. For the case of finding as well, it seems straightforward enough to attempt to just choose and optimize over these values. Unfortunately, even in one dimension, the Gaussian quadrature being unique does not mean that it will be easy to find in this setting. In higher dimensions, this will be very difficult without something of a greedy procedure.

Returning to dynamic

We now assume data from and want to estimate a quadrature via discrete as and we have reference distribution . Taking as our "preconditioner", we estimate for from data (perhaps), then we have (for quadrature points)

One can compare and contrast this with the minimization problem given in ^021c59. While they are very similar, note that the geometry of the static problem by on the points, and the WFR places the scaling of on the weights. This is a slight change, but can make remarkable differences due to imposing different geometries.

Wasserstein-Fisher-Rao Gradient Flow

Dynamic transport of functions

I'm particularly interested if this can be extended to work for flows of functions and, in particular, orthogonal polynomials. Recall that

Transport of coefficients

It's fairly clear that, if you take the inner product of both sides with , we get that

We can simplify the numerator for as due to wikipedia. The second form of the norm does not seem very helpful.

Assuming this is correct, we further constrict our polynomials to have unit magnitude, which automatically indicates that and by construction. Now suppose we construct as the th orthonormal polynomial w.r.t. measure . We should get that is a function representing the th recurrence term at time .

One could show that (assuming unit variance of the polynomials)

I have no idea what should be due to the representation of . Additionally, note that and depend on and . At this point, this is a disgusting equation whic h no one would ever want to solve. At least I don't want to.

Another addition is that this becomes somewhat intractable in higher than one dimension due to the finicky nature of recurrence relations on multivariate polynomials.

Finally, I'm not entirely sure what one would linearize here, or how this could possibly be made more efficient.

Transport of function values? Roots?

The subsequent question is what else you would transport. One idea I like is transporting roots so that, at the end of the transport, you approximate the exact roots of the orthonormal polynomials (then one could maybe approximate the value of the Christoffel function or something? I'm not sure)

Alternatively, another idea would be somehow transporting the weights so that you keep the points and the weights change according to the transport (in the vein of tempered importance weighting?)

Monic Polynomials

Generally, we know , where and . Because I'm lazy, I'm going to drop the hat (even though technically doesn't agree with the above definition of ). Then, we see the more complicated time derivative of as

Via our abuse of notation, however, stays the same as ^c0ec95. On the other hand, we get

Forcing a Homotopy

Suppose we know up to a constant. Then, set . Then, , so we observe


Additionally, on the corner cases, assuming we know

Now we can formulate an algorithm:

Dynamic Quadrature Flow (under particular )

Assume: ,
Input: (possibly unnormalized?) density , for , step size
Output: for , which integrate , respectively

  • for
    - Calculate from .
    - Set and renormalize
    - Note that , so calculate
    - for ,
    - for each quadrature point
    - Calculate
    - Calculate
    - Calculate
    - Estimate $$
    \dot{a}{k}(t) &\approx \sum\limits{i=1}^{N}w{t}^{(i)}x^{(i)}{t}p{k}^{(i)}\left[p{k}^{(i)}v^{(i)} + 2\dot{p}{k}^{(i)}\right]\
    {k}(t) &\approx \sum\limits{i=1}^{N}w{t}^{(i)}p{k}^{(i)}\left[p{k}^{(i)}v^{(i)} + 2\dot{p}_{k}^{(i)}\right]
  • Calculate from .

Note that this approximates the Euler discretization of the ODE of . However, the inner for loop approximates the derivative of , given the values of , so one could use any other scheme (e.g., Runge-Kutta).

Polynomial Recurrence Flow

On integration in a measure transport framework

Setting the stage

Notation Meaning
, Densities of mapping and to , respectively
Approximation space when considering
Approximation of when confined to
Quadrature rule on
function we want expectation of on , respectively
The "transported" quadrature rule,
Orthogonal polynomials with respect to

We are interested in, for , approximating the expectation by a quadrature rule of the form . There are a few central questions:

  • What can we prove about when integrating ?
  • What can we prove about when integrating ?
  • How does the error between and affect the difference between and ?
  • How do we consider all of these to bound the error on
  • How do we choose to minimize in some appropriate norm? Does this change how we choose the loss?

What's the potential issue?

What can we say about classical quadrature?

Quadrature generally, to some degree, approximates inner products via polynomials of some kind. Gauss quadrature gets phenomenal convergence via the fact that is exact on polynomials up to degree . Clenshaw-Curtis is exact on polynomials of degree , and many other quadrature rules share this property. Needless to say, this hinges precisely on the fact that , when smooth, can be approximated very well by polynomials; further, if "largely made of" low-degree polynomials (i.e. any polynomial approximation, e.g. PCE, has rapidly decaying coefficients), we know that this quadrature rule works very well. Suppose we have , orthogonal under ; let be the degree PCE of . We know that, for sufficiently smooth ,

which comes from Parseval's identity (see Sagiv 22). What we want is that, given a function and a set , we can create an approximation

or something like that. Note here that is a complete degree of freedom-- we can choose it however we wish. Note here a few things:

  • is probably not a polynomial, and is almost certainly not a polynomial
  • Because of this, is almost definitely not a polynomial
  • Regardless, assuming orthonormality of , we get that
    Similar to this, we get
    Therefore, a spectral projection of onto is equivalent to a spectral projection of onto , and thus we inherit any convergence properties of spectral approximation on of the function .

Example 1: Compact to Infinite domain

Suppose we map from the reference to via the inverse CDF of the Normal distribution as . Then, are naturally the Legendre polynomials and is a Legendre polynomial composed with the CDF of which, while smooth, is entirely bounded for any given . Therefore, a -point quadrature rule will only capture behavior on the domain for , which grows rather slowly (i.e., the effective domain of this quadrature rule is rather small). However, the basis functions are fairly smooth so we can get convergence faster than the rate cited above.

Example 2: Mapping between tails

In the above example, the problem is ostensibly that the transport maps a distribution with "lighter tails" (in fact, tails that are exactly zero) to a distribution with "heavier tails". It then stands to reason that we might do better if we do the opposite way. Consider now (the reference is Gaussian now) and , i.e., . We are initially pleased that then are orthogonal polynomials! However, there are a few problems that crop up.

  • ; so, if we use a quadrature rule with odd, the middle point index (which receives the most weight!) will stay at zero, where Therefore, our "practitioner's method" places a lot of weight where the density places zero weight.
  • The polynomials are indeed orthogonal and clearly span ; however, they aren't a "traditional family" since , so the th term is a degree polynomial. Therefore, if , we will never be able to integrate exactly!
  • Indeed, the rate of convergence for is abyssmal. The rate is precisely the rate of decay of the coefficients when projecting onto (consider Parseval).

What I take away from these examples is that having "lighter" tails isn't enough; you want the same tails, i.e., the transport should be exactly linear in the tails. In this sense, we might consider regularizing by the log of the Sobolev norm or something.

Other topics to consider:

Suppose is a RKHS with kernel . It would be nice to get something of the following form:

where is the discrete distribution using the pushforward of on a quadrature w.r.t. .

Why is that interesting?

This "moves the onus onto ", so-to-speak. For a given kernel, will smooth out the "high moments" of the distributions and only care about the broad strokes of similarity. Then, as long as broadly captures the features of , we get that the bias of the quadrature is largely dependent on how bad the function is itself. In some sense, this is the "easier" problem to solve, because most convergence of quadrature rules depend on how large the norm of is in some sobolev sense.

Some other notes

  • Use "heat" to diffuse into , and then perform transport on for small (or decreasing) .
  • If is our integral and is our -point approximation of the integral , do something like
  • Note that if , we have that . Similar if in (sobolev).
  • Sagiv, Baptista paper is a very loose bound by strong engine for bounding problems.
  • Suppose we know Gaussian quadrature points for and consider a map that draws a voronoi tesselation on the domain and maps deterministically to these quadrature points. This is an atrocious sampling map, but has wonderful expectation approximation properties if it maps gaussian points to gaussian points (for a specific construction).
  • Consider distributions with compact support?
  • Sobolev norm is introduced in statistics to ensure uniqueness of the map
  • What is the geometry induced by the MMD? Even less formally, more pedestrian, how do simple functions (e.g. piecewise linear) functions morph under ?
    • In this sense, can we say anything about, e.g., the trapezoidal rule?
  • Instead of choosing to minimize the error, maybe just minimize an upper bound on the error.
    • Something to do with the Sagiv paper bound.
    • How to tune ? I personally want to use sobolev regularization, but Amir is a little skeptical
  • His process:
  1. Create approximation theory for transformed space
    • From parseval and spectral approximation, etc (This is where I think sobolev norm kicks in!!)
  2. Look at the error of the expectations of between target and pushforward to get interesting bound
  3. Use bounds on error of pushforward to encourage construction in the right place
Practitioner's Quadrature