Quadrature

Software

Suppose we have data and we want to estimate the underlying density from the data. One way to do this is via score matching, i.e., estimating a function representing the gradient of the log PDF (obligatory note that "score" in statistics usually refers to something of the form , which is not the same). Unless otherwise specified, consider to indicate the gradient of a function in and to be the Laplacian. To formulate this, we use the Fisher divergence as in Hyvärinen 05, written as a loss function of the parameters of some , an approximation to the PDF .

where the last step comes from integration by parts assuming are zero at the boundaries of the domain (i.e., ). We let as a slightly "nicer" loss function, which has the same minimizer as . Notably, this loss function doesn't require that is even a proper density; it just requires the ability to integrate with respect to (e.g., from samples) and to evaluate derivatives of , which is what we try to model.

It should be observed that, often in score-matching literature, researchers will parameterize the gradient of , not the function itself; this can produce a nonphysical solution since the learned function won't necessarily even be of the form for any particular . Now, we parameterize where are scalar-valued functions (assume multivariate polynomial terms); without loss of generality, assume that as , the boundary of our domain. Then, we know that

Therefore, we get that

If we then set and , then our loss ends up becoming quadratic in as , which is known to have minimizer . It's particularly interesting that , i.e., it is square and finite dimensional, even though our expectation requires an "infinite" number of evaluations. Contrast this with a Vandermonde-like matrix, which will grow with the number of samples. The next interesting point is to consider the elements of and .
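To make the quadratic structure concrete, here is a minimal Julia sketch under the assumption that the log-density model is linear in its coefficients, so the loss reduces to a quadratic with a Gram-type matrix of basis gradients and a vector of basis Laplacians; the interface names `grad_phi` and `lap_phi` and the sign conventions are assumptions, not anything fixed above.

```julia
using LinearAlgebra

# Minimal sketch: estimate A = E[∇Φ(X) ∇Φ(X)'] and b = E[ΔΦ(X)] by Monte Carlo, where
# grad_phi(x) returns the K×d Jacobian of the basis at x and lap_phi(x) the K-vector of
# Laplacians. The quadratic loss is then minimized by solving A c = -b (sign convention assumed).
function score_matching_coeffs(xs, grad_phi, lap_phi)
    N = length(xs)
    K = length(lap_phi(xs[1]))
    A = zeros(K, K)
    b = zeros(K)
    for x in xs
        G = grad_phi(x)          # K × d Jacobian of the basis at x
        A .+= G * G'             # accumulate E[∇Φ ∇Φ']
        b .+= lap_phi(x)         # accumulate E[ΔΦ]
    end
    A ./= N
    b ./= N
    return A \ (-b), A, b        # minimizer of the quadratic loss, plus the assembled pieces
end
```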

An example in the infinite sample limit

Now suppose and for some multi-index set , i.e., we want to estimate the coefficients for to approximate . We know that and that since all probabilists' Hermite polynomials are orthogonal to , so

for some , where is the elementary vector with all zeros except for in entry and by convention for . Similarly,

implying that is precisely diagonal. Then, since we have , we get that when for some . Now suppose is of this form. We must have that

which implies that (notably we must have that there is some constant that normalizes this which we can't find via score-matching). This is precisely giving that , i.e., will converge exactly.
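As a numerical sanity check of this Gaussian example, a standalone 1D script (mirroring the sketch above, specialized to the probabilists' Hermite basis He_1(x) = x, He_2(x) = x^2 - 1, a hypothetical two-term basis) should recover a degree-2 coefficient near -1/2, i.e., a log-density proportional to -x^2/2 up to the unrecoverable normalizing constant.

```julia
using Random
Random.seed!(0)
xs = randn(50_000)
# Basis He1(x) = x, He2(x) = x^2 - 1: gradients (1, 2x), Laplacians (0, 2).
A = zeros(2, 2); b = zeros(2)
for x in xs
    g = [1.0, 2.0x]
    A .+= g * g'
    b .+= [0.0, 2.0]
end
A ./= length(xs); b ./= length(xs)
@show A \ (-b)   # expect roughly [0.0, -0.5]
```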

Finite sample case and overparameterization

Suppose we have samples and the input dimension is as before, and consider . We can formulate that

where is a function such that . Then, we estimate as

so in this case, we can generally assume (while this is technically only an upper bound, linear independence of the functions should give equality handily). Contrast this with the Vandermonde case, where , which gives a matrix of rank . While these two numbers seem different, there is actually some parity between them; score matching attempts to match , which gives the illusion of an extra dimension of freedom from the gradient. Therefore, to adjust for this extra freedom, the number of parameters should be scaled by the reciprocal of the dimension when comparing against .

We can also form , where is applied elementwise. Then, we wish to solve . Obviously, this can be done uniquely when . However, in the "overparameterized" case, there are clearly infinitely many solutions by simple linear algebra. A traditional technique would be to solve

i.e., find the minimum norm solution. This can be done via allowing , where and give the orthogonal eigenvectors and their nonzero eigenvalues, respectively (we know that is symmetric positive semi-definite since it is the sum of SPSD matrices). Then we know that will be the optimal solution (Trefethen and Bau?).
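A minimal sketch of that minimum-norm solve, assuming A is symmetric positive semi-definite as argued above; the eigenvalue tolerance `tol` is a hypothetical choice, and `pinv(A) * b` would give the same answer.

```julia
using LinearAlgebra

# Minimum-norm solution of A c = b for symmetric PSD A: keep only the eigenpairs with
# (numerically) nonzero eigenvalues and invert A on that subspace.
function min_norm_solve(A::AbstractMatrix, b::AbstractVector; tol = 1e-10)
    F = eigen(Symmetric(Matrix(A)))
    keep = F.values .> tol * maximum(abs, F.values)
    V = F.vectors[:, keep]      # orthogonal eigenvectors with nonzero eigenvalues
    λ = F.values[keep]
    return V * ((V' * b) ./ λ)
end
```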

However, while this has an optimal solution in closed form, the objective is not at all well motivated in this context beyond ensuring that doesn't have entries diverging to infinity. Perhaps a better solution would be to minimize the Kullback-Leibler divergence, . By definition,

While I think this is a convex function (the rightmost term is reminiscent of a log-sum-exp), it's not really tractable as an objective, even when approximating the expectation with an average over the empirical distribution of . The integral over is not tractable in closed form. However, examining the role of this term suggests some regularization against higher-frequency . This intuition comes from the fact that, if is highly oscillatory and is nonzero, then will often hold, and thus we will be far from the minimum. In this sense, we approximate the KL as

For simplicity, imagine as a diagonal matrix with scaling with in some way. While will clearly have a minimizer at , this is physically meaningless due to the relaxation of the log-sum-exp term, i.e., this will not be guaranteed to minimize any statistical divergence in any manner. However, it seems reasonable to return to the Fisher divergence score-matching setting to say

A spectral method for diffusion models

Generally, the neural networks for diffusion models are minimizing functions of the form

which approximates the loss

where is some density prescribing the "importance" of different time-points and is the measure of following the SDE

with . Suppose now that we consider with . Then,

where we abuse notation to say and . Finally, suppose we separate the random variables from time in our functions via Then,

However, consider that we know . Then, we should have . Therefore, we construct a basis where for and, if we have multi-indices , we create a set such that, when , we have for some dimension and some (i.e., if we look at the basis functions invariant in time which do not go to zero, we only approximate on functions that can represent ). Now we partition the set of multi-indices into , where for , , and . Without loss of generality on the ordering (addition is commutative), this allows us to partition the coefficients as as well as and . We see that the exact minimizer is given by
For ease of programming, we note that
If we discretize this integral in time, we get

Multi-indexing

Suppose we have a basis for the data dimension parameterized by . For the unbounded domain, we transport to the standard Normal, so we assume that the basis is suitable across all dimensions . Further, we assume that the target distribution is sub-Gaussian, so for some constant . Therefore, we form a univariate basis as follows:
where is a "mollifier" function. A simple example would be , which produces the Hermite functions. A more complicated function would be with as the Gaspari-Cohn localization function, which imitates for some choice of , but it becomes 0 outside of some radius with two continuous derivatives.
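A small sketch of such a mollified univariate basis, using the probabilists' Hermite three-term recurrence and, as an assumed instance of the simple mollifier, m(x) = exp(-x^2/4), which turns the polynomials into unnormalized Hermite functions; a Gaspari-Cohn-style mollifier would simply be swapped in for `m`.

```julia
# Minimal sketch of a mollified univariate basis ψ_k(x) = p_k(x) * m(x), k = 0,...,K.
# Here p_k are probabilists' Hermite polynomials from the three-term recurrence
# He_{k+1}(x) = x*He_k(x) - k*He_{k-1}(x); the default mollifier is an assumption.
function mollified_hermite_basis(x::Real, K::Integer; m = t -> exp(-t^2 / 4))
    He = zeros(K + 1)
    He[1] = 1.0                                   # He_0
    if K >= 1
        He[2] = x                                 # He_1
    end
    for k in 2:K
        He[k + 1] = x * He[k] - (k - 1) * He[k - 1]
    end
    return m(x) .* He
end
```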

Similarly, we create a basis to parameterize the pseudo-spectral behavior in time. Since time exists in , we believe that Laguerre polynomials work well. However, since we don't actually run the Ornstein-Uhlenbeck process to and we don't want the approximation to be too sensitive at the first steps starting from , we also adapt a mollification scheme . Since we use the Laguerre polynomials, it seems natural to choose , the square root of the Gaspari-Cohn function. To ensure we allow for the asymptotic Gaussian behavior, we choose
Then, for by construction, which imposes some interesting choices of "admissible" multivariate basis functions. First, assume , i.e. we have a univariate function and consider a set as our set describing admitted basis functions, where means we represent
If we assume , then can only be in if ; this is equivalent to enforcing that for some , which is analytically true. This comes from the fact that we use mollifiers for all data basis functions with degree and all nonconstant temporal basis functions.

In general dimension , we consider an admissible multi-index set of basis functions . Given a maximal complexity in time , we can make a simple set of admissible multivariate basis functions as
where is understood as a vector with all zeros except a one at element . Some remarks:

  • One can replace the first tensor-product term with something more complicated (e.g. a total-order set), but it cannot have any terms constant in time.
  • Additionally, the terms constant in time shouldn't be just any quadratic terms; they must be cardinal directions! For example, is not admissible, so it can't just be a generic tensor-product or even total-order-2 set.
  • Another "softer" restriction is that , since this corresponds to the constant of proportionality, which we will not find using this objective function (nor do we need to). One will find that including these terms will create a singular matrix .
  • For more experienced practitioners of multi-index sets, an immediately notable feature of is that it is not "downward closed"---to that I say ¯\_(ツ)_/¯. I will point out that the first set is downward closed when ignoring constant terms, and the union'ed set has known coefficients (and no admissible forward neighbors constant in time).
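To make the construction above concrete, here is a hedged Julia sketch of one set with the shape the remarks describe: every spatial multi-index is paired with each nonconstant temporal degree up to the maximal complexity, the only time-constant members are the cardinal quadratic directions 2e_j, and the pure constant is excluded. The interface (`spatial_mset`, `K_t`) is hypothetical.

```julia
# Hedged sketch of an admissible multi-index set. Each member is (alpha, k): alpha indexes
# the spatial basis, k the temporal degree. Spatial indices are paired with every
# nonconstant temporal degree 1:K_t, and the only time-constant members are the cardinal
# quadratic directions 2*e_j; the pure constant (0, 0) is excluded.
function admissible_mset(spatial_mset, d::Int, K_t::Int)
    mset = Vector{Tuple{Vector{Int}, Int}}()
    for alpha in spatial_mset, k in 1:K_t
        push!(mset, (alpha, k))
    end
    for j in 1:d
        e2 = zeros(Int, d)
        e2[j] = 2
        push!(mset, (e2, 0))        # quadratic in x_j, constant in time
    end
    return mset
end

# Hypothetical usage: a total-order-2 spatial set in d = 2 with temporal degree up to 3.
spatial = [[i, j] for i in 0:2 for j in 0:2 if i + j <= 2]
mset = admissible_mset(spatial, 2, 3)
```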
Polynomial Score Matching

Function Approximation

Fisher-Rao

Suppose we have two probability measures, , and we would like to find the distance between them. While one might naturally consider the Wasserstein metric, which (by Benamou-Brenier) is given by

this is well-studied and has well-known restrictions. An alternative here would be the Fisher-Rao distance,

Note that this distance is constrained by a reaction PDE instead of a continuity equation. We can see that

where . Since we know , the total mass can't possibly change (consider discretizing this via Euler and noting that it won't change for any step-size , so it converges as ). One can see that the (adjusted) Fisher-Rao induces a geodesic of the form

but this is irrelevant to the topic at hand.
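As a quick numerical illustration of the mass-conservation remark, assume the reaction term is centered by the current mean of the growth rate (an assumption about the form above); then a discrete-measure Euler step preserves total mass exactly, for any step size.

```julia
# Euler steps of a discrete Fisher-Rao-type reaction dw_i/dt = w_i * (g_i - Σ_j w_j g_j),
# assuming (hedged) that the reaction term is centered by the current weighted mean of g.
# The total mass stays at 1 for any step size, matching the remark above.
function euler_mass_check(; N = 10, dt = 0.3, steps = 100)
    w = fill(1 / N, N)
    g = randn(N)
    for _ in 1:steps
        gbar = sum(w .* g)
        w = w .+ dt .* w .* (g .- gbar)
    end
    return sum(w)
end

@show euler_mass_check()   # ≈ 1 up to floating-point error, for any dt
```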

Wasserstein-Fisher-Rao

The PDE constraining the Wasserstein-Fisher-Rao metric will intuitively be

i.e., we use velocity field and "reaction function" to govern the dynamics of . For a given functional , we would like to find a "Wasserstein-Fisher-Rao" gradient flow. In particular, we would like to find a velocity field and reaction function which determine a s.t. minimizes using constant-speed geodesics in the WFR geometry. To do this, we first must define a "Wasserstein-Fisher-Rao gradient". Generally one defines such gradients by writing as for some governing the PDE (TODO: elaborate?). Observe that

Then, if I have a measure , I would end up with something like

See, e.g., YWR23

KL functional

At first, we may be interested in the KL-divergence functional often used in gradient flows, which (for a given distribution ) has pertinent information

but if we keep as a discrete measure, we will not be able to ever estimate . It is possible to use a kernelized version of in the vein of SVGD, but I'm not sold. Alternatively, I could use score matching. For the record, this gives

I don't think I need to center the weight derivative? Anyway, if is truly a discrete distribution, then this is undefined. If you kernelize it.... I'm not entirely sure, but it'd give something I guess? On the other hand, perhaps you could match the score (or score ratio). See Polynomial Score Matching, which would need to be modified to account for , which weights the true .

MMD functional?

Suppose we now have

where is the norm in the RKHS induced by the kernel . One can find that

where refers to the gradient in the second argument. Suppose that . Then,
. More generally, if for positive and symmetric distance, then you get which is generally pretty easy to compute. This gives the update equation as

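A hedged sketch of one explicit Euler step of such a flow with a squared-exponential kernel: particles are transported by the negative gradient of the first variation and weights react to its centered value. The exact constants (e.g., factors from differentiating the squared MMD) are assumptions, and `wfr_mmd_step!` is a hypothetical name, not anything fixed above.

```julia
# Hedged sketch: squared-exponential kernel, its gradient in the second argument, and one
# explicit Euler step of a WFR-type flow of the MMD. Constant factors are assumptions.
sqexp(x, y; σ = 1.0) = exp(-sum(abs2, x - y) / (2σ^2))
grad2_sqexp(x, y; σ = 1.0) = sqexp(x, y; σ = σ) .* (x - y) ./ σ^2   # ∇ in the second argument

function wfr_mmd_step!(X, w, Y; dt = 0.1, σ = 1.0)
    M = length(Y)
    # first variation of the (assumed) MMD objective at a point x, and its spatial gradient
    V(x)  = sum(w[i] * sqexp(x, X[i]; σ = σ) for i in eachindex(X)) -
            sum(sqexp(x, y; σ = σ) for y in Y) / M
    ∇V(x) = sum(w[i] .* grad2_sqexp(X[i], x; σ = σ) for i in eachindex(X)) .-
            sum(grad2_sqexp(y, x; σ = σ) for y in Y) ./ M
    vals  = [V(x) for x in X]
    grads = [∇V(x) for x in X]
    vbar  = sum(w .* vals)
    for i in eachindex(X)
        X[i] = X[i] .- dt .* grads[i]            # transport: Wasserstein part
    end
    w .= w .* (1 .- dt .* (vals .- vbar))        # reaction: Fisher-Rao part
    return X, w
end

# e.g.: X = [randn(2) for _ in 1:20]; w = fill(1/20, 20); Y = [randn(2) .+ 1 for _ in 1:200]
# for _ in 1:100; wfr_mmd_step!(X, w, Y; dt = 0.05); end
```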
Why is this nice?

What's nice here is that this is "forgetful" of our initial state , assuming the convexity and all that (probably not the case, but still). The choice of is one of preconditioning; it's not an end-all-be-all choice in theory. So far in our transport quadrature, our choice of reference basically entirely constrains the points to be the pushforward of the reference quadrature under the map with no change to the weights, whereas here this isn't necessarily the case.

Alternative "static" formulation

Take our Hilbert space to be polynomials up to degree with inner product

so our kernel becomes the Christoffel-Darboux kernel of

with the norm. I.e., this kernel projects onto the polynomials , which are orthogonal w.r.t. . Without loss of generality, assume . Obviously, for any function , we see

which means the embedding of an arbitrary function into our RKHS is exactly the orthogonal projection of . Our maximum mean discrepancy becomes

Assume we want to match points and weights , i.e., discrete measure to some target . We see

^3be7e2
with . Then, we get that

and somewhat less obviously, suppose we abuse notation to consider a diagonal matrix with diagonal entry equal to ; matrix calculus dictates

which indicates the trace is operating on a matrix with only one nonzero column ; therefore, is the th entry in the th column for and zero otherwise. This corresponds to

Here we maintain control of three things: . We know that induces the orthogonal polynomial family , and this gets normalized (with respect to ), so we don't get to choose the scaling of the family. Notably, for fixed , this becomes a quadratic loss in , so we can use a linear solve to find the solution. For the case of finding as well, it seems straightforward enough to attempt to just choose and optimize over these values. Unfortunately, even in one dimension, the Gaussian quadrature being unique does not mean that it will be easy to find in this setting. In higher dimensions, this will be very difficult without something of a greedy procedure.
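A minimal sketch of that linear solve for the weights at fixed points, assuming the loss is the squared norm of the difference between the discrete measure's embedding and the target's, with the target moments estimated from samples; the basis-evaluation interface `phi` is hypothetical.

```julia
using LinearAlgebra

# For fixed nodes, the Christoffel-Darboux-kernel MMD is ‖Σ_i w_i φ(x_i) - E_ν[φ(Y)]‖²
# with φ(x) = (p_0(x), …, p_K(x)), so the optimal weights come from a least-squares solve.
# `phi` evaluates the orthonormal basis at a point; ys are samples of the target ν.
function cd_mmd_weights(phi, xs, ys)
    Φ = reduce(hcat, phi.(xs))            # (K+1) × N matrix of basis values at the nodes
    m = sum(phi.(ys)) / length(ys)        # Monte Carlo estimate of E_ν[φ(Y)]
    return Φ \ m                          # least-squares weights (pinv(Φ) * m for minimum-norm)
end
```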

Returning to dynamic

We now assume data from and want to estimate a quadrature via discrete as and we have reference distribution . Taking as our "preconditioner", we estimate for from data (perhaps), then we have (for quadrature points)

One can compare and contrast this with the minimization problem given in ^021c59. While they are very similar, note that the static problem places the scaling of on the points, while the WFR places the scaling of on the weights. This is a slight change, but it can make remarkable differences due to imposing different geometries.

One node case

Suppose we have node and target measure , and we use kernel . Then, recall the form of WFR-MMD-Quad

In this squared-exponential kernel case, we know that and for any . Therefore, we reduce the equations to

We know that, if we set the RHS to zero, we might get a steady-state solution (i.e., a fixed-point of the ODE) such that and for all . Therefore, assuming strictly, we must have

The expressions on the right are necessitated by the form of the squared-exponential kernel.

Dirac case

If for some , then consider and . Therefore, we know that is a valid steady-state solution.

Empirical distribution case

Suppose for Lebesgue measure . Then, we consider the empirical distribution for given samples . We note that , which is creating a KDE using the samples and evaluating it at . Similarly, the RHS of the expression for is estimating the moment of from its KDE. If , then we get that independent of , and therefore I'm pretty sure that, as , regardless of . Similarly, I think that approaches the expectation of .

I'm not sure of either of these things, though.

Time-dependent Dirac case

We return to the original ODE when Then,

Particle Mirror Descent

Wasserstein-Fisher-Rao Gradient Flow

Dynamic transport of functions

I'm particularly interested in whether this can be extended to work for flows of functions and, in particular, orthogonal polynomials. Recall that

Transport of coefficients

It's fairly clear that, if we take the inner product of both sides with , we get that

Similarly,
We can simplify the numerator for as due to Wikipedia. The second form of the norm does not seem very helpful.

Assuming this is correct, we further constrain our polynomials to have unit magnitude, which automatically indicates that and by construction. Now suppose we construct as the th orthonormal polynomial w.r.t. measure . We should get that is a function representing the th recurrence term at time .

One could show that (assuming unit variance of the polynomials)

I have no idea what should be due to the representation of . Additionally, note that and depend on and . At this point, this is a disgusting equation which no one would ever want to solve. At least I don't want to.

Another issue is that this becomes somewhat intractable in more than one dimension due to the finicky nature of recurrence relations on multivariate polynomials.

Finally, I'm not entirely sure what one would linearize here, or how this could possibly be made more efficient.

Transport of function values? Roots?

The subsequent question is what else you would transport. One idea I like is transporting roots so that, at the end of the transport, you approximate the exact roots of the orthonormal polynomials (then one could maybe approximate the value of the Christoffel function or something? I'm not sure)

Alternatively, another idea would be somehow transporting the weights so that you keep the points and the weights change according to the transport (in the vein of tempered importance weighting?)

Monic Polynomials

Generally, we know , where and . Because I'm lazy, I'm going to drop the hat (even though technically doesn't agree with the above definition of ). Then, we see the more complicated time derivative of as

Via our abuse of notation, however, stays the same as ^c0ec95. On the other hand, we get

Forcing a Homotopy

Suppose we know up to a constant. Then, set . Then, , so we observe

and

Additionally, on the corner cases, assuming we know

Now we can formulate an algorithm:

Dynamic Quadrature Flow (under particular )

Assume: ,
Input: (possibly unnormalized?) density , for , step size
Output: for , which integrate , respectively

  • for
    - Calculate from .
    - Set and renormalize
    - Note that , so calculate
    - for ,
    - for each quadrature point
    - Calculate
    - Calculate
    - Calculate
    - Estimate $$
    \begin{align}
    \dot{a}_{k}(t) &\approx \sum\limits_{i=1}^{N} w_{t}^{(i)} x^{(i)}_{t} p_{k}^{(i)}\left[p_{k}^{(i)} v^{(i)} + 2\dot{p}_{k}^{(i)}\right] \\
    \dot{\ell}_{k}(t) &\approx \sum\limits_{i=1}^{N} w_{t}^{(i)} p_{k}^{(i)}\left[p_{k}^{(i)} v^{(i)} + 2\dot{p}_{k}^{(i)}\right]
    \end{align}
    $$
  • Calculate from .

Note that this approximates the Euler discretization of the ODE of . However, the inner for loop approximates the derivative of , given the values of , so one could use any other scheme (e.g., Runge-Kutta).

Polynomial Recurrence Flow

On integration in a measure transport framework

Setting the stage

| Notation | Meaning |
| --- | --- |
|  | target |
|  | reference |
| ,  | Densities of mapping and to , respectively |
|  | Transport |
|  | Approximation space when considering |
|  | Approximation of when confined to |
|  | Quadrature rule on |
|  | function we want expectation of on , respectively |
|  | The "transported" quadrature rule, |
|  | Orthogonal polynomials with respect to |

We are interested in, for , approximating the expectation by a quadrature rule of the form . There are a few central questions:

  • What can we prove about when integrating ?
  • What can we prove about when integrating ?
  • How does the error between and affect the difference between and ?
  • How do we consider all of these to bound the error on
  • How do we choose to minimize in some appropriate norm? Does this change how we choose the loss?

What's the potential issue?

What can we say about classical quadrature?

Quadrature generally, to some degree, approximates inner products via polynomials of some kind. Gauss quadrature gets phenomenal convergence via the fact that is exact on polynomials up to degree . Clenshaw-Curtis is exact on polynomials of degree , and many other quadrature rules share this property. Needless to say, this hinges precisely on the fact that , when smooth, can be approximated very well by polynomials; further, if "largely made of" low-degree polynomials (i.e. any polynomial approximation, e.g. PCE, has rapidly decaying coefficients), we know that this quadrature rule works very well. Suppose we have , orthogonal under ; let be the degree PCE of . We know that, for sufficiently smooth ,

which comes from Parseval's identity (see Sagiv 22). What we want is that, given a function and a set , we can create an approximation

or something like that. Note here that is a complete degree of freedom-- we can choose it however we wish. A few things to note:

  • is probably not a polynomial, and is almost certainly not a polynomial
  • Because of this, is almost definitely not a polynomial
  • Regardless, assuming orthonormality of , we get that
    Similar to this, we get
    Therefore, a spectral projection of onto is equivalent to a spectral projection of onto , and thus we inherit any convergence properties of spectral approximation on of the function .
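To ground the exactness claim above, here is a small Golub-Welsch construction of Gauss-Legendre quadrature (nodes and weights from the symmetric tridiagonal Jacobi matrix), checked on a polynomial of degree 2n - 1; the recurrence coefficients are the standard ones for the Legendre weight on [-1, 1].

```julia
using LinearAlgebra

# Golub-Welsch: Gauss-Legendre nodes/weights on [-1, 1] from the eigen-decomposition of the
# symmetric tridiagonal Jacobi matrix with off-diagonals β_k = k / sqrt(4k^2 - 1).
# The resulting rule is exact on polynomials up to degree 2n - 1, as cited above.
function gauss_legendre(n::Int)
    β = [k / sqrt(4k^2 - 1) for k in 1:n-1]
    J = SymTridiagonal(zeros(n), β)
    E = eigen(J)
    nodes = E.values
    weights = 2 .* E.vectors[1, :].^2     # total mass of the Legendre weight is 2
    return nodes, weights
end

n = 5
x, w = gauss_legendre(n)
f(t) = t^(2n - 1) + 3t^2                   # degree 2n - 1 = 9
@show sum(w .* f.(x))                      # exact value of ∫_{-1}^{1} f is 0 + 2 = 2
```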

Example 1: Compact to Infinite domain

Suppose we map from the reference to via the inverse CDF of the Normal distribution as . Then, are naturally the Legendre polynomials and is a Legendre polynomial composed with the CDF of which, while smooth, is entirely bounded for any given . Therefore, a -point quadrature rule will only capture behavior on the domain for , which grows rather slowly (i.e., the effective domain of this quadrature rule is rather small). However, the basis functions are fairly smooth so we can get convergence faster than the rate cited above.

Example 2: Mapping between tails

In the above example, the problem is ostensibly that the transport maps a distribution with "lighter tails" (in fact, tails that are exactly zero) to a distribution with "heavier tails". It then stands to reason that we might do better if we go the opposite way. Consider now (the reference is Gaussian now) and , i.e., . We are initially pleased that then are orthogonal polynomials! However, there are a few problems that crop up.

  • ; so, if we use a quadrature rule with odd, the middle point index (which receives the most weight!) will stay at zero, where the target density vanishes. Therefore, our "practitioner's method" places a lot of weight where the density places zero weight.
  • The polynomials are indeed orthogonal and clearly span ; however, they aren't a "traditional family" since , so the th term is a degree polynomial. Therefore, if , we will never be able to integrate exactly!
  • Indeed, the rate of convergence for is abysmal. The rate is precisely the rate of decay of the coefficients when projecting onto (consider Parseval).

What I take away from these examples is that having "lighter" tails isn't enough; you want the same tails, i.e., the transport should be exactly linear in the tails. In this sense, we might consider regularizing by the log of the Sobolev norm or something.

Other topics to consider:

Suppose is an RKHS with kernel . It would be nice to get something of the following form:

where is the discrete distribution using the pushforward of on a quadrature w.r.t. .

Why is that interesting?

This "moves the onus onto ", so to speak. For a given kernel, will smooth out the "high moments" of the distributions and only care about the broad strokes of similarity. Then, as long as broadly captures the features of , we get that the bias of the quadrature is largely dependent on how bad the function is itself. In some sense, this is the "easier" problem to solve, because the convergence of most quadrature rules depends on how large the norm of is in some Sobolev sense.

Some other notes

  • Use "heat" to diffuse into , and then perform transport on for small (or decreasing) .
  • If is our integral and is our -point approximation of the integral , do something like
  • Note that if , we have that . Similarly if is in (Sobolev).
  • The Sagiv and Baptista paper gives a very loose bound but a strong engine for bounding problems.
  • Suppose we know Gaussian quadrature points for and consider a map that draws a Voronoi tessellation on the domain and maps deterministically to these quadrature points. This is an atrocious sampling map, but it has wonderful expectation-approximation properties if it maps Gaussian points to Gaussian points (for a specific construction).
  • Consider distributions with compact support?
  • Sobolev norm is introduced in statistics to ensure uniqueness of the map
  • What is the geometry induced by the MMD? Even less formally, more pedestrian: how do simple (e.g. piecewise linear) functions morph under ?
    • In this sense, can we say anything about, e.g., the trapezoidal rule?
  • Instead of choosing to minimize the error, maybe just minimize an upper bound on the error.
    • Something to do with the Sagiv paper bound.
    • How to tune ? I personally want to use Sobolev regularization, but Amir is a little skeptical.
  • His process:
  1. Create approximation theory for the transformed space
    • From Parseval and spectral approximation, etc. (this is where I think the Sobolev norm kicks in!!)
  2. Look at the error of the expectations of between target and pushforward to get an interesting bound
  3. Use bounds on error of pushforward to encourage construction in the right place
Practitioner's Quadrature

MrHyDE Related work: Binding, Julia Lib

Suppose we have that, for nodes and kernel function ,

and, for those fixed nodes , we define the optimal simplex weights

The only difference between and is that the latter lives on the simplex; both are defined for the same set of nodes . We know that

For , we can see that

where are defined appropriately. Assuming matrix is full rank (which is a reasonable enough assumption for our purposes), we can see that

which ensures some kind of "interpolatory behavior". This gives the loss as

Now we set , so by definition, . We see

which comes from the fact that will be a positive-definite norm for a full rank kernel matrix . Of course, if , then and we get that .
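A hedged numerical sketch comparing the two weight vectors: the unconstrained kernel weights solving the linear system, and simplex weights obtained by projected gradient descent on the assumed quadratic objective. The Euclidean simplex projection used here is a generic solver choice, not the orthogonal projection discussed in the next subsection.

```julia
using LinearAlgebra

# Euclidean projection onto the probability simplex (sort-based algorithm).
function project_simplex(v::AbstractVector)
    u = sort(v; rev = true)
    css = cumsum(u)
    ρ = findlast(j -> u[j] + (1 - css[j]) / j > 0, eachindex(u))
    θ = (1 - css[ρ]) / ρ
    return max.(v .+ θ, 0.0)
end

# Hedged sketch: unconstrained kernel weights solve K w = z with z_i ≈ E_ν[k(x_i, Y)] from
# samples ys; simplex weights minimize the assumed objective w'Kw - 2w'z by projected gradient.
function kernel_weights(k, xs, ys; iters = 2_000)
    K = [k(xi, xj) for xi in xs, xj in xs]
    z = [sum(k(xi, y) for y in ys) / length(ys) for xi in xs]
    w_uncon = K \ z                                   # unconstrained optimal weights
    w = fill(1 / length(xs), length(xs))              # start at uniform simplex weights
    η = 1 / (2 * opnorm(K))                           # step size from the gradient Lipschitz constant
    for _ in 1:iters
        w = project_simplex(w .- η .* 2 .* (K * w .- z))
    end
    return w_uncon, w
end
```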

Orthogonal projection?

Suppose we choose to be such that (i.e., the -orthogonal projection onto the simplex). We of course will get that , so we can simplify the above to be

where the third term comes from the definition of . I should note that we may not be able to project onto the simplex if is not in the positive orthant; we could project onto the plane and then project that onto the simplex, but it would no longer be orthogonal? idk.

Proportionality?

Suppose for some . We will see that

however by construction, we can also just plug the simplex solution into the definition of to see

Mean Shift
Input: multi-index set mset_full, which is assumed downward-closed
Output: quad_rules: set of tensor-product quadratures and their (possibly negative) associated cardinality `Vector{Tuple{Midx, Int}}`
keep_looping = true
quad_rules = []
occurrences = zeros over mset_full   # coverage count per multi-index (assumed to start at 0)

while keep_looping
	# Form the mset for this loop
	collect the indices whose occurrences are not 1
	form them into mset_loop
	get the reduced frontier `frontier` of mset_loop
	
	# Adjust each frontier member and reindex their ancestors' number of occurrences
	for idx_midx_loop in frontier:
		# Access the adjustment for this frontier member
		get index idx_midx_full corresponding to the multi-index idx_midx_loop in mset_full
		set j = occurrences[idx_midx_full] - 1, e.g., if idx_midx_full occurs $j+1=-1$ times, then $j=-2$.
		occurrences[idx_midx_full] = 1

		# Reindex the number of occurrences of the backward ancestors
		get backward ancestors of idx_midx_loop
		for each ancestor idx_back_loop of idx_midx_loop
			get original index idx_back_full of idx_back_loop (i.e. index of idx_back_loop w.r.t. mset_full, not mset_loop)
			occurrences[idx_back_full] -= j
		end for
		push (idx_midx_full, -j) onto quad_rules  # add 1 - occurrences copies so this index ends up covered exactly once
	end for
	keep_looping = occurrences is not identically 1
end while

return quad_rules
Smolyak