Optimizing Linear Maps (new draft)
As established in (prev section), choosing a linear map parameterization permits not only more efficient map evaluation, but also a more efficient map optimization routine. Recall that, for a general (not necessarily separable) map, each map component function  can be defined as
where  are the th basis functions of map component , associated with the nonmonotone and monotone terms, respectively. Further,  are the  multivariate basis functions for the nonmonotone part (i.e., ). The function  is the "monotone function"; in the rectified integrated basis, this is the rectified integral, and in the separable linear basis it is a linear combination of, e.g., iRBF functions in . The vectors  and  are the coefficients of these respective collections of basis functions (technically,  can be generic nonlinear parameters). For maps from samples, we recall the following optimization objective for :
Substituting Equation ^9e08c8 into this expression, we obtain
We note that  maps vectors to vectors, so the dot products with its coefficient vectors map these to scalars. We now perform a few simplifications. First, we recognize that the partial derivative  eliminates the term  by construction of the "nonmonotone" part of the function . Second, a sum of squares of a linear combination is simply the squared two-norm of a vector. Therefore, abusing notation, we assemble the matrix , whose  entry is the evaluation of the th basis function of the respective function at , and the vector , whose th component is the evaluation of  at the point . This simplifies the objective to
This leads to the striking realization that, when optimizing over the coefficients  and , the nonquadratic (i.e., log) term is entirely independent of . As such, any optimal choice of coefficients  for the nonmonotone basis  (where the hat indicates that these coefficients are optimal) must satisfy
This is, of course, an ordinary least-squares problem, encountered frequently in numerical linear algebra and linear regression. Therefore, given a choice of monotone coefficients , the optimal choice  takes the form
which comes from any standard text on numerical linear algebra (e.g., Trefethen and Bau).
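To make this concrete, here is a minimal numerical sketch of that least-squares subproblem. The names `F` (nonmonotone basis evaluations at the samples) and `g` (the monotone part evaluated at the samples with its coefficients held fixed) are illustrative placeholders, and the quadratic term is assumed to take the form of a residual `F w + g`.

```python
import numpy as np

# Illustrative sizes: N samples, p nonmonotone basis functions (placeholder data).
rng = np.random.default_rng(0)
N, p = 200, 6
F = rng.standard_normal((N, p))   # F[i, j] = j-th nonmonotone basis function at sample i
g = rng.standard_normal(N)        # monotone part evaluated at the samples (coefficients fixed)

# Assumed quadratic term: ||F w + g||^2 / (2 N). The log term does not involve w,
# so the optimal nonmonotone coefficients solve the least-squares problem F w ≈ -g.
w_hat, *_ = np.linalg.lstsq(F, -g, rcond=None)

# Equivalent normal-equations form, (F^T F) w = -F^T g, valid when F has full column rank.
w_normal = np.linalg.solve(F.T @ F, -F.T @ g)
assert np.allclose(w_hat, w_normal)
```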
We can investigate some alternatives and a more general case. Consider the more general loss function
We define the function  because it is independent of  and can only simplify further in the linear case. This gives the gradient of the objective with respect to the nonmonotone parameters via
SVD solution
While there is only one scaling parameter here, , this can easily be extended to the case where  and  have different regularization terms (indeed, it could even be extended to weighting different elements of each coefficient vector differently). Suppose  has the SVD . Then, setting the gradient of  with respect to  to zero, we get
^d8eb8d
We know that , where . If , then , and the solution is immediate. If , then we have an underdetermined set of equations. However, our goal is clearly to compute  with minimal norm, and so the "least norm" solution becomes
where  is the set of all  satisfying ^930584. Since  has linearly independent columns, we use the Moore-Penrose pseudoinverse to get that
which agrees with the case where . Then, the optimization objective becomes
The regularization term simplifies marginally because  for any , by the properties of orthogonal matrices. At this point, for general , the logarithmic and regularization terms cannot be simplified substantially, so we focus on the first term, i.e., the squared-norm loss. Straightforward linear algebra shows that this becomes
Note that, in the case of ,  is the orthogonal projector that looks strikingly similar to the QR case, and , which gives an answer equivalent to using , i.e., the original least-squares/least-norm solution for .
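The following sketch illustrates these two regimes numerically. All names are hypothetical, and the regularized quadratic term is assumed to have the form ½‖F w + g‖² + (λ/2)‖w‖²; as the regularization vanishes, the SVD-based solution recovers the ordinary least-squares solution in the overdetermined case and the minimum-norm (Moore–Penrose) solution in the underdetermined case, with the fitted values involving only the orthogonal projector U Uᵀ.

```python
import numpy as np

rng = np.random.default_rng(1)

def ridge_via_svd(F, g, lam):
    """Minimizer of 1/2 ||F w + g||^2 + lam/2 ||w||^2 (assumed form of the regularized loss)."""
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    return Vt.T @ ((s / (s**2 + lam)) * (U.T @ -g))

# Overdetermined case (more samples than basis functions): lam -> 0 recovers least squares,
# and the fitted values involve only the orthogonal projector U U^T onto range(F).
F, g = rng.standard_normal((200, 6)), rng.standard_normal(200)
U, _, _ = np.linalg.svd(F, full_matrices=False)
w0 = ridge_via_svd(F, g, 0.0)
assert np.allclose(w0, np.linalg.lstsq(F, -g, rcond=None)[0])
assert np.allclose(F @ w0, (U @ U.T) @ -g)

# Underdetermined case (fewer samples than basis functions): lam -> 0 recovers the
# minimum-norm solution, i.e. the Moore-Penrose pseudoinverse applied to -g.
F, g = rng.standard_normal((5, 8)), rng.standard_normal(5)
assert np.allclose(ridge_via_svd(F, g, 0.0), np.linalg.pinv(F) @ -g)
```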
In the linear case, we know that , which simplifies matters substantially: the loss function becomes
Normal solution
At this point, the factorizations add little value; since each of these solutions is used only once, computing a factorization is significantly more expensive than simply using a symmetric solver. Suppose a linear basis and define
which agrees with all of the above formulations. The gradient of this loss (with respect to ) is clearly
In the adaptive transport map algorithm, we care about the gradient of  with respect to coefficients that we have yet to introduce into our model; suppose we have candidate basis functions represented by coefficients  and , corresponding to candidate terms  and  for some sets of basis functions  and . Importantly, these sets need not have the same cardinality, i.e.,  and . Now suppose we have matrices  and vectors  such that
The loss function, for general  and  and the optimal choice , becomes
Now we differentiate with respect to  and  and evaluate when these coefficients are zero. Starting with the easier case of ,
Therefore, all we need are the matrices , which are evaluations of the candidate functions at the same points at which  and  were evaluated (i.e., we do not necessarily need new evaluations of the already "accepted" basis functions).
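As a rough illustration of that point, the sketch below assumes a separable linear parameterization and uses hypothetical placeholder arrays: it evaluates the gradient of the sample objective with respect to the candidate coefficients at zero, using only evaluations of the candidate basis functions (and, for the monotone candidates, their partial derivatives) at the existing sample points, together with quantities already available from the fitted map.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200

# Quantities already available from the fitted map component (hypothetical placeholders):
S = rng.standard_normal(N)              # map component evaluated at the samples
dS = np.exp(rng.standard_normal(N))     # its partial derivative in the last variable (positive)

# Candidate basis evaluations at the *same* sample points; the two candidate
# sets need not have the same cardinality.
F_cand = rng.standard_normal((N, 3))    # candidate nonmonotone basis functions
G_cand = rng.standard_normal((N, 5))    # candidate monotone basis functions
dG_cand = rng.standard_normal((N, 5))   # their partial derivatives in the last variable

# Assumed sample objective: J = (1/N) * sum( S^2 / 2 - log dS ).
# Gradient evaluated at zero candidate coefficients:
#  - nonmonotone candidates only enter S, so only the quadratic term contributes;
#  - monotone candidates also enter dS, so the log term contributes as well.
grad_nonmonotone = F_cand.T @ S / N
grad_monotone = (G_cand.T @ S - dG_cand.T @ (1.0 / dS)) / N

print(grad_nonmonotone.shape, grad_monotone.shape)   # -> (3,) (5,)
```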