Optimizing Linear Maps (new draft)

As established in (prev section), choosing a linear map parameterization permits not just more efficient map evaluation, but a more efficient map optimization routine. Recall that for a general (not necessarily separable) map, each map component function can be defined as

$$
S_k(\mathbf{x}_{1:k}; \mathbf{a}, \mathbf{b}) = \sum_{j=1}^{M_\psi} a_j \psi_j(\mathbf{x}_{1:k-1}) + g(\mathbf{x}_{1:k}; \mathbf{b}) = \boldsymbol{\psi}(\mathbf{x}_{1:k-1})^\top \mathbf{a} + g(\mathbf{x}_{1:k}; \mathbf{b}),
$$
^9e08c8

where $\psi_j$ and $\phi_j$ are the $j$th basis functions of map component $k$, associated with the nonmonotone and monotone terms, respectively. Further, the $\psi_j$ are the multivariate basis functions for the nonmonotone part (i.e., they depend only on $\mathbf{x}_{1:k-1}$ and not on $x_k$). The function $g(\cdot\,; \mathbf{b})$ is the "monotone function"; in the rectified integrated basis, this is the rectified integral, and in the separable linear basis this is a linear combination of, e.g., iRBF functions in $x_k$, i.e., $g(\mathbf{x}_{1:k}; \mathbf{b}) = \sum_j b_j \phi_j(x_k)$. The vectors $\mathbf{a}$ and $\mathbf{b}$ are the coefficients of each of these respective collections of basis functions (technically, $\mathbf{b}$ can be generic nonlinear parameters). For maps from samples $\{\mathbf{x}^{(i)}\}_{i=1}^N$, we recall the following optimization objective for $S_k$:

$$
\mathcal{J}_k(S_k) = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{1}{2} S_k\left(\mathbf{x}^{(i)}_{1:k}\right)^2 - \log \partial_{x_k} S_k\left(\mathbf{x}^{(i)}_{1:k}\right)\right].
$$
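To make the pieces above concrete, here is a minimal numerical sketch of a single separable component and the sample objective. The basis choices (polynomial features for the nonmonotone part, tanh sigmoids for the monotone part) and all names (`psi`, `phi`, `objective`, etc.) are illustrative placeholders, not the parameterization fixed above.

```python
import numpy as np

# Sketch of a single separable map component S_k(x) = psi(x_{1:k-1})^T a + g(x_k; b)
# and the sample-based objective J = mean( S_k^2 / 2 - log d/dx_k S_k ).

def psi(x_prev, degree=2):
    """Nonmonotone multivariate basis: simple polynomial features of x_{1:k-1}."""
    feats = [np.ones(len(x_prev))]
    for d in range(1, degree + 1):
        feats.append(x_prev[:, 0] ** d)          # keep it univariate in x_1 for brevity
    return np.stack(feats, axis=1)               # shape (N, M_psi)

def phi(xk, centers=np.linspace(-2, 2, 5)):
    """Monotone univariate basis: tanh sigmoids (each increasing in x_k)."""
    return np.tanh(xk[:, None] - centers[None, :])        # shape (N, M_phi)

def dphi(xk, centers=np.linspace(-2, 2, 5)):
    """Derivatives of the monotone basis w.r.t. x_k (all positive)."""
    return 1.0 / np.cosh(xk[:, None] - centers[None, :]) ** 2

def objective(a, b, x_prev, xk):
    Sk = psi(x_prev) @ a + phi(xk) @ b            # map component at each sample
    dSk = dphi(xk) @ b                            # d S_k / d x_k (the psi term drops out)
    return np.mean(0.5 * Sk**2 - np.log(dSk))

# Tiny smoke test on synthetic samples (b > 0 keeps dS_k/dx_k positive here).
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))                     # columns: x_1 and x_2 = x_k
a = rng.normal(size=3)
b = np.abs(rng.normal(size=5))
print(objective(a, b, x[:, :1], x[:, 1]))
```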
Substituting Equation ^9e08c8 into the objective above, we obtain

$$
\mathcal{J}_k(\mathbf{a}, \mathbf{b}) = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{1}{2}\left(\boldsymbol{\psi}\left(\mathbf{x}^{(i)}_{1:k-1}\right)^\top \mathbf{a} + g\left(\mathbf{x}^{(i)}_{1:k}; \mathbf{b}\right)\right)^2 - \log \partial_{x_k}\left(\boldsymbol{\psi}\left(\mathbf{x}^{(i)}_{1:k-1}\right)^\top \mathbf{a} + g\left(\mathbf{x}^{(i)}_{1:k}; \mathbf{b}\right)\right)\right].
$$

We note that $\boldsymbol{\psi}$ maps vectors to vectors, so the dot product with its coefficient vector maps these to scalars. We now perform a few simplifications. First, the partial derivative $\partial_{x_k}$ eliminates the $\boldsymbol{\psi}^\top\mathbf{a}$ term by construction of the "nonmonotone" part of our function. Second, a sum of squares of a linear combination is indeed a squared two-norm of a vector. Therefore, abusing notation, we assemble the matrix $F$, whose component $F_{ij}$ is the evaluation of the $j$th nonmonotone basis function $\psi_j$ at the $i$th sample, and the vector $\mathbf{g}(\mathbf{b})$, whose $i$th component is the evaluation of $g(\cdot\,;\mathbf{b})$ at point $\mathbf{x}^{(i)}$. This simplifies the notation to

$$
\mathcal{J}_k(\mathbf{a}, \mathbf{b}) = \frac{1}{2N}\left\|F\mathbf{a} + \mathbf{g}(\mathbf{b})\right\|_2^2 - \frac{1}{N}\sum_{i=1}^{N}\log \partial_{x_k} g\left(\mathbf{x}^{(i)}_{1:k}; \mathbf{b}\right).
$$
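A quick sanity check of this vectorization, using synthetic stand-ins for $F$, $\mathbf{g}(\mathbf{b})$, and the derivative values: the looped sum of squares and the matrix form agree.

```python
import numpy as np

# Vectorized form of the objective: assemble F (nonmonotone basis evaluations)
# and the vector g(b) (monotone term at each sample), then
#   J(a, b) = ||F a + g(b)||^2 / (2N) - mean(log dg/dx_k).
# F, g, dg below are random placeholders for whatever basis is actually used.

rng = np.random.default_rng(1)
N, M_psi = 200, 4
F = rng.normal(size=(N, M_psi))          # F[i, j] = psi_j(x^(i)_{1:k-1})
g = rng.normal(size=N)                   # g[i]    = g(x^(i); b)
dg = rng.uniform(0.5, 2.0, size=N)       # dg[i]   = d/dx_k g(x^(i); b) > 0
a = rng.normal(size=M_psi)

J_vec = 0.5 * np.sum((F @ a + g) ** 2) / N - np.mean(np.log(dg))
J_loop = np.mean([0.5 * (F[i] @ a + g[i]) ** 2 - np.log(dg[i]) for i in range(N)])
assert np.isclose(J_vec, J_loop)
```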
This simplified form gives way to the astounding realization that, when optimizing over the coefficients $\mathbf{a}$ and $\mathbf{b}$, the nonquadratic (i.e., log) term is entirely independent of $\mathbf{a}$. As such, any optimal choice of coefficients $\hat{\mathbf{a}}$ for the nonmonotone basis (where the hat indicates that these coefficients are optimal) must satisfy

$$
\hat{\mathbf{a}}(\mathbf{b}) = \arg\min_{\mathbf{a}}\; \frac{1}{2N}\left\|F\mathbf{a} + \mathbf{g}(\mathbf{b})\right\|_2^2.
$$

This is, of course, a typical ordinary least squares problem, seen often in numerical linear algebra and linear regression. Therefore, given a choice of monotone coefficients $\mathbf{b}$, we know the optimal choice takes the form

$$
\hat{\mathbf{a}}(\mathbf{b}) = -\left(F^\top F\right)^{-1} F^\top \mathbf{g}(\mathbf{b}) = -F^\dagger \mathbf{g}(\mathbf{b}),
$$

which comes from any standard text on numerical linear algebra (e.g., Trefethen and Bau).
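A sketch of this closed-form update, assuming $F$ has full column rank; `np.linalg.lstsq` applies the pseudoinverse, which matches the normal-equation form here. All arrays are synthetic placeholders.

```python
import numpy as np

# Given fixed monotone coefficients b (hence a fixed vector g = g(b)), the optimal
# nonmonotone coefficients solve an ordinary least squares problem.

rng = np.random.default_rng(2)
N, M_psi = 200, 4
F = rng.normal(size=(N, M_psi))
g = rng.normal(size=N)

a_hat = -np.linalg.lstsq(F, g, rcond=None)[0]        # a_hat = -F^+ g
a_normal = -np.linalg.solve(F.T @ F, F.T @ g)        # -(F^T F)^{-1} F^T g (full column rank)
assert np.allclose(a_hat, a_normal)
```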

We can investigate some alternatives and a more general case. Consider the more general loss function

$$
\mathcal{L}(\mathbf{a}, \mathbf{b}) = \frac{1}{2N}\left\|F\mathbf{a} + \mathbf{g}(\mathbf{b})\right\|_2^2 + h(\mathbf{b}) + \frac{\lambda}{2}\left(\|\mathbf{a}\|_2^2 + \|\mathbf{b}\|_2^2\right), \qquad h(\mathbf{b}) := -\frac{1}{N}\sum_{i=1}^{N}\log \partial_{x_k} g\left(\mathbf{x}^{(i)}_{1:k}; \mathbf{b}\right).
$$

We define the function $h$ separately because, since it is independent of $\mathbf{a}$, it can only simplify in the linear case. This gives us the gradient of the objective with respect to the nonmonotone parameters via

$$
\nabla_{\mathbf{a}}\mathcal{L}(\mathbf{a}, \mathbf{b}) = \frac{1}{N} F^\top\left(F\mathbf{a} + \mathbf{g}(\mathbf{b})\right) + \lambda\mathbf{a}.
$$
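A finite-difference check of this gradient expression on synthetic $F$ and $\mathbf{g}$ (the log term drops out since it does not depend on $\mathbf{a}$); all array names are placeholders.

```python
import numpy as np

# Check: grad_a L = F^T (F a + g) / N + lam * a  against central finite differences.

rng = np.random.default_rng(3)
N, M_psi, lam = 200, 4, 1e-2
F = rng.normal(size=(N, M_psi))
g = rng.normal(size=N)
a = rng.normal(size=M_psi)

def loss_in_a(a):
    # Only the a-dependent part of L; h(b) and the ||b||^2 term are constants here.
    return 0.5 * np.sum((F @ a + g) ** 2) / N + 0.5 * lam * np.sum(a ** 2)

grad = F.T @ (F @ a + g) / N + lam * a
eps = 1e-6
fd = np.array([(loss_in_a(a + eps * e) - loss_in_a(a - eps * e)) / (2 * eps)
               for e in np.eye(M_psi)])
assert np.allclose(grad, fd, atol=1e-5)
```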
SVD solution

While there is only one scaling parameter here, $\lambda$, this can easily be extended to the case where $\mathbf{a}$ and $\mathbf{b}$ have different regularization terms (indeed, it could even extend to weighting different elements of each coefficient vector differently). Suppose $F$ has the SVD $F = U\Sigma V^\top$. Then, setting the gradient of $\mathcal{L}$ w.r.t. $\mathbf{a}$ to zero, we get

$$
\frac{1}{N} F^\top\left(F\hat{\mathbf{a}} + \mathbf{g}(\mathbf{b})\right) + \lambda\hat{\mathbf{a}} = 0 \quad\iff\quad \left(F^\top F + N\lambda I\right)\hat{\mathbf{a}} = -F^\top \mathbf{g}(\mathbf{b}).
$$
^d8eb8d
We know that $F^\top F + N\lambda I = V\left(\Sigma^\top\Sigma + N\lambda I\right)V^\top$, where $\Sigma^\top\Sigma + N\lambda I$ is diagonal. If $\lambda > 0$, then we know this matrix is positive definite, and the solution becomes clear. If $\lambda = 0$, then we may have an underdetermined set of equations. However, it is clear our goal is to compute $\hat{\mathbf{a}}$ with a minimal norm, and so a "least norm" solution becomes

$$
\hat{\mathbf{a}} = \arg\min_{\mathbf{a}\in\mathcal{A}} \|\mathbf{a}\|_2,
$$

where $\mathcal{A}$ is the set of all $\mathbf{a}$ satisfying ^930584. Whether or not $F$ has linearly independent columns, the Moore–Penrose pseudoinverse gives exactly this least-norm solution, so that

$$
\hat{\mathbf{a}} = -F^\dagger \mathbf{g}(\mathbf{b}) = -V\Sigma^\dagger U^\top \mathbf{g}(\mathbf{b}),
$$
which agrees with the usual least-squares solution $-(F^\top F)^{-1}F^\top\mathbf{g}(\mathbf{b})$ in the case when $F$ has full column rank. Then, the optimization objective becomes

$$
\mathcal{L}(\hat{\mathbf{a}}, \mathbf{b}) = \frac{1}{2N}\left\|F\hat{\mathbf{a}} + \mathbf{g}(\mathbf{b})\right\|_2^2 + h(\mathbf{b}) + \frac{\lambda}{2}\left(\left\|V\Sigma^\dagger U^\top \mathbf{g}(\mathbf{b})\right\|_2^2 + \|\mathbf{b}\|_2^2\right).
$$

The regularization term simplifies marginally due to the fact that we know $\|V\mathbf{z}\|_2 = \|\mathbf{z}\|_2$ for any $\mathbf{z}$ due to properties of orthogonal matrices. At this point, for general $g$, the logarithmic and regularization terms cannot simplify substantially, so we focus on the first term, i.e., the squared-norm loss. We can deduce via straightforward linear algebra that this ends up becoming

$$
\frac{1}{2N}\left\|F\hat{\mathbf{a}} + \mathbf{g}(\mathbf{b})\right\|_2^2 = \frac{1}{2N}\left\|\left(I - FF^\dagger\right)\mathbf{g}(\mathbf{b})\right\|_2^2 = \frac{1}{2N}\left\|\left(I - UU^\top\right)\mathbf{g}(\mathbf{b})\right\|_2^2.
$$

Note that $FF^\dagger = UU^\top$ is the orthogonal projector onto the range of $F$, which looks eerily similar to the QR case: with a reduced QR factorization $F = QR$, the projector is $QQ^\top$, which gives an equivalent answer and recovers the original least-squares/least-norm solution for $\hat{\mathbf{a}}$.
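A sketch of the SVD route on synthetic data, assuming a thin SVD of a full-column-rank $F$: the least-norm solution and the projected residual $(I - UU^\top)\mathbf{g}$ match the generic least-squares answer.

```python
import numpy as np

# With thin SVD F = U S V^T, the (lambda = 0) least-norm solution is
# a_hat = -V S^+ U^T g, and the residual is the projection of g onto the
# orthogonal complement of range(F): F a_hat + g = (I - U U^T) g.

rng = np.random.default_rng(4)
N, M_psi = 200, 4
F = rng.normal(size=(N, M_psi))
g = rng.normal(size=N)

U, s, Vt = np.linalg.svd(F, full_matrices=False)
a_hat = -Vt.T @ ((U.T @ g) / s)                      # -V S^+ U^T g
resid = F @ a_hat + g
proj_resid = g - U @ (U.T @ g)                       # (I - U U^T) g
assert np.allclose(resid, proj_resid)
assert np.allclose(a_hat, -np.linalg.lstsq(F, g, rcond=None)[0])
```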

In the linear case, we know that $\mathbf{g}(\mathbf{b}) = G\mathbf{b}$, where $G_{ij} = \phi_j\left(x_k^{(i)}\right)$, which simplifies our work substantially to find that the loss function becomes

$$
\mathcal{L}(\hat{\mathbf{a}}, \mathbf{b}) = \frac{1}{2N}\left\|\left(I - UU^\top\right)G\mathbf{b}\right\|_2^2 + h(\mathbf{b}) + \frac{\lambda}{2}\left(\left\|\Sigma^\dagger U^\top G\mathbf{b}\right\|_2^2 + \|\mathbf{b}\|_2^2\right).
$$
Normal solution

I think at this point the factorizations are pointless; since I'm only using these solutions once, it ends up being significantly more expensive than just using a symmetric solver. Suppose a linear basis and define

$$
A := F^\top F + N\lambda I,
$$

and let $\hat{\mathbf{a}}(\mathbf{b}) = -A^{-1}F^\top G\mathbf{b}$. Then,

$$
\mathcal{L}\left(\hat{\mathbf{a}}(\mathbf{b}), \mathbf{b}\right) = \frac{1}{2N}\left\|\left(I - FA^{-1}F^\top\right)G\mathbf{b}\right\|_2^2 + h(\mathbf{b}) + \frac{\lambda}{2}\left(\left\|A^{-1}F^\top G\mathbf{b}\right\|_2^2 + \|\mathbf{b}\|_2^2\right),
$$

which agrees with all of the above formulations. This loss will clearly have a gradient (with respect to $\mathbf{b}$) of

$$
\nabla_{\mathbf{b}}\mathcal{L}\left(\hat{\mathbf{a}}(\mathbf{b}), \mathbf{b}\right) = \frac{1}{N}G^\top\left(F\hat{\mathbf{a}}(\mathbf{b}) + G\mathbf{b}\right) + \nabla_{\mathbf{b}}h(\mathbf{b}) + \lambda\mathbf{b} = \frac{1}{N}G^\top\left(I - FA^{-1}F^\top\right)G\mathbf{b} + \nabla_{\mathbf{b}}h(\mathbf{b}) + \lambda\mathbf{b},
$$

where the cross terms from differentiating through $\hat{\mathbf{a}}(\mathbf{b})$ vanish because $\hat{\mathbf{a}}(\mathbf{b})$ is a stationary point of $\mathcal{L}$ in $\mathbf{a}$.
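A sketch of this "just use a symmetric solver" route in the linear case, using a Cholesky factorization of $A$ via SciPy's `cho_factor`/`cho_solve`. The matrices `F`, `G`, and `dG` (the $x_k$-derivatives of the monotone basis) are synthetic placeholders, and $h(\mathbf{b})$ is taken to be $-\tfrac{1}{N}\sum_i \log\left((\partial G\,\mathbf{b})_i\right)$ as in the separable linear case.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Form A = F^T F + N*lam*I once, solve the symmetric positive definite system for
# a_hat, then evaluate the profiled gradient with respect to b.

rng = np.random.default_rng(5)
N, M_psi, M_phi, lam = 200, 4, 3, 1e-2
F = rng.normal(size=(N, M_psi))
G = rng.uniform(0.1, 1.0, size=(N, M_phi))     # monotone basis evaluations
dG = rng.uniform(0.1, 1.0, size=(N, M_phi))    # their x_k-derivatives (positive)
b = np.abs(rng.normal(size=M_phi))             # keeps dG @ b > 0 in this toy setup

A = F.T @ F + N * lam * np.eye(M_psi)
c, low = cho_factor(A)                         # A is SPD for lam > 0
a_hat = -cho_solve((c, low), F.T @ (G @ b))    # a_hat = -A^{-1} F^T G b

residual = F @ a_hat + G @ b                   # equals (I - F A^{-1} F^T) G b
grad_h = -dG.T @ (1.0 / (dG @ b)) / N          # gradient of h(b) = -mean(log(dG @ b))
grad_b = G.T @ residual / N + grad_h + lam * b
print(grad_b)
```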
In the adaptive transport map algorithm, we care about the gradient of $\mathcal{L}$ with respect to coefficients we have yet to introduce into our model; suppose we have candidate basis functions represented by coefficients $\mathbf{a}'$ and $\mathbf{b}'$, which correspond to candidate terms $\boldsymbol{\psi}'(\mathbf{x}_{1:k-1})^\top\mathbf{a}'$ and $\boldsymbol{\phi}'(x_k)^\top\mathbf{b}'$ for some sets of candidate basis functions $\{\psi'_j\}$ and $\{\phi'_j\}$. Vitally, these sets do not necessarily have the same cardinality, i.e., $\mathbf{a}' \in \mathbb{R}^{M_{\psi'}}$ and $\mathbf{b}' \in \mathbb{R}^{M_{\phi'}}$ with $M_{\psi'} \neq M_{\phi'}$ in general. Then suppose we have matrices and vectors such that

$$
F'_{ij} = \psi'_j\left(\mathbf{x}^{(i)}_{1:k-1}\right), \qquad G'_{ij} = \phi'_j\left(x_k^{(i)}\right),
$$

so that the augmented squared term involves $F\mathbf{a} + F'\mathbf{a}' + G\mathbf{b} + G'\mathbf{b}'$.
The loss function, for general $\mathbf{a}'$ and $\mathbf{b}'$ and the optimal choice $\hat{\mathbf{a}}$, becomes

$$
\mathcal{L}\left(\hat{\mathbf{a}}, \mathbf{a}', \mathbf{b}, \mathbf{b}'\right) = \frac{1}{2N}\left\|F\hat{\mathbf{a}} + F'\mathbf{a}' + G\mathbf{b} + G'\mathbf{b}'\right\|_2^2 + h(\mathbf{b}, \mathbf{b}') + \frac{\lambda}{2}\left(\|\hat{\mathbf{a}}\|_2^2 + \|\mathbf{a}'\|_2^2 + \|\mathbf{b}\|_2^2 + \|\mathbf{b}'\|_2^2\right).
$$
Now we differentiate with respect to $\mathbf{a}'$ and $\mathbf{b}'$ and evaluate when these coefficients are zero. Starting with the easier case of $\mathbf{a}'$,

$$
\nabla_{\mathbf{a}'}\mathcal{L}\Big|_{\mathbf{a}'=0,\;\mathbf{b}'=0} = \frac{1}{N}F'^\top\left(F\hat{\mathbf{a}} + G\mathbf{b}\right).
$$

Therefore, all we need are the matrices $F'$ (and, for the $\mathbf{b}'$ gradient, $G'$ and its derivative counterpart) which are evaluations of the candidate functions at the same points on which $F$ and $G$ were evaluated, together with the already-computed residual $F\hat{\mathbf{a}} + G\mathbf{b}$ (i.e., we do not necessarily need new evaluations of the already "accepted" basis functions).
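A sketch of how this candidate screening might look in code: only the candidate evaluation matrix (here a synthetic `F_cand`) and a cached residual are needed; the names are illustrative, not an ATM implementation's API.

```python
import numpy as np

# For candidate nonmonotone basis functions, the gradient at zero candidate
# coefficients is   grad_{a'} L |_{a'=0, b'=0} = F_cand^T r / N,
# where r = F a_hat + G b is the residual already computed for the accepted terms.

rng = np.random.default_rng(6)
N, M_cand = 200, 6
F_cand = rng.normal(size=(N, M_cand))   # candidate basis evaluated at the same samples
r = rng.normal(size=N)                  # cached residual F @ a_hat + G @ b

grad_a_cand = F_cand.T @ r / N
# A large |gradient| component suggests that candidate basis function is worth adding.
print(np.argsort(-np.abs(grad_a_cand)))
```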