Autodecoder
The auto-encoder is widely used in machine learning and has become an important tool for embedding and generative tasks. However, in most AE-based generative models, the encoder is of no use once training is done. Is it possible to get rid of the encoder in the first place and just train a decoder? With this question in mind, I summarize some explorations from the related literature.
An auto-decoder decodes embeddings into useful targets such as images, videos, and texts. Unlike an auto-encoder, an auto-decoder has no encoder part. A comparison of the two architectures is shown in the figure.
The goal of this model is to decode an embedding \(z_i\in\mathrm{R}^d\) (the latent space) into meaningful information \(x_i\in\mathrm{R}^n\) (the “physical space”). The model decodes as \(x_i\approx f_\theta(z_i)\), where \(\theta\) denotes the learnable model parameters.
Two parts are trainable: the latent codes \(\{z_i\}\) and the decoder parameters \(\theta\).
Both the \(z_i\)’s and \(\theta\) are randomly initialized and trained according to the following loss function: \[\begin{equation}\label{eq:auto_decode_loss} \text{argmin}_{\theta,\{z_i\}}\sum_{i=1}^N\left(\mathcal{L}(f_\theta(z_i), x_i)+\frac{1}{\sigma^2}\|z_i\|^2\right). \end{equation}\]
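Below is a minimal PyTorch sketch of this joint optimization. The MLP decoder, the MSE reconstruction loss, and all sizes and hyperparameters (`latent_dim`, `sigma`, the learning rate) are illustrative assumptions, not choices from any particular paper.

```python
import torch
import torch.nn as nn

latent_dim, data_dim, num_examples, sigma = 64, 784, 1000, 1.0

# Decoder f_theta: maps a latent code z in R^d to a reconstruction in R^n.
decoder = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, data_dim),
)

# One trainable latent code z_i per training example, randomly initialized.
latents = nn.Parameter(torch.randn(num_examples, latent_dim) * 0.01)

x = torch.randn(num_examples, data_dim)        # placeholder data set {x_i}

# A single optimizer updates both theta and the z_i's.
optimizer = torch.optim.Adam(list(decoder.parameters()) + [latents], lr=1e-3)

for epoch in range(200):
    optimizer.zero_grad()
    x_hat = decoder(latents)                   # f_theta(z_i) for every i
    recon = ((x_hat - x) ** 2).mean()          # L(f_theta(z_i), x_i), here MSE
    reg = latents.pow(2).sum(dim=1).mean() / sigma**2  # (1/sigma^2) ||z_i||^2
    (recon + reg).backward()
    optimizer.step()
```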
Yes, but not in the same way as in an auto-encoder. Since there is no encoder in the auto-decoder architecture, computing an embedding is less straightforward. Once an auto-decoder is trained, for a given target \(x\), you can randomly initialize an embedding \(z\) and then optimize \(z\) through Eq. (\(\ref{eq:auto_decode_loss}\)), with the model parameters \(\theta\) fixed.
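As a sketch (continuing the assumptions of the previous snippet: a trained `decoder`, an MSE loss, and the same `latent_dim` and `sigma`), the embedding step amounts to a small optimization loop over \(z\) alone:

```python
import torch
import torch.nn.functional as F

def embed(decoder, x, latent_dim=64, sigma=1.0, steps=500, lr=1e-2):
    """Optimize a latent code z for a single target x, keeping theta fixed."""
    for p in decoder.parameters():
        p.requires_grad_(False)                # freeze the decoder weights theta

    z = torch.randn(latent_dim, requires_grad=True)   # random initial embedding
    optimizer = torch.optim.Adam([z], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.mse_loss(decoder(z), x) + z.pow(2).sum() / sigma**2
        loss.backward()
        optimizer.step()
    return z.detach()

# Hypothetical usage with the decoder trained above:
# z_new = embed(decoder, x_new)
# x_recon = decoder(z_new)
```

Since this inner optimization starts from a random \(z\), it may need more steps or a few random restarts to reach a good embedding.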
In this example, let \(\{x_i(t)\}\) be a set of functions and \(\{z_i\}\) their embeddings. The auto-decoder then maps an embedding \(z_i\) to a function \(x_i(t)\).
For each function \(x_i(t)\), we randomly sample \(K\) points \(\{(t_k, x_i(t_k))\}_{k=1}^K\). The loss function is
\[\begin{equation} \text{argmin}_{\theta,\{z_i\}}\sum_{i=1}^N\sum_{k=1}^K\left(\mathcal{L}(f_\theta(z_i, t_k), x_i(t_k))+\frac{1}{\sigma^2}\|z_i\|^2\right). \end{equation}\]
As we are optimizing two sets of variables (the network weights and the latent vectors), we need to choose an optimization strategy. The original paper does not detail how the objective was optimized. In this implementation, we alternate: we optimize the latent vectors for 20 epochs, then the model parameters for 20 epochs, and repeat this 20 times; we call this an alternating optimization strategy. Alternatively, one can optimize both sets of parameters in a single process with a single optimizer. In both cases, each update draws a random minibatch of \(B\) examples from the \(N\times K\) data points. A sketch of the alternating strategy is given below.
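The following sketch runs the alternating strategy on toy data. The way the decoder consumes \((z_i, t_k)\) (simple concatenation), the sinusoidal toy functions, and all layer sizes and learning rates are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

N, K, latent_dim, sigma, B = 100, 50, 16, 1.0, 256

# Conditioned decoder f_theta(z, t): concatenate the latent code with t.
decoder = nn.Sequential(
    nn.Linear(latent_dim + 1, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 1),
)
latents = nn.Parameter(torch.randn(N, latent_dim) * 0.01)

# Toy data: N sinusoids with random phases, each sampled at K points t_k.
t = torch.rand(N, K)
phase = 6.28 * torch.rand(N, 1)
x = torch.sin(6.28 * t + phase)

opt_z = torch.optim.Adam([latents], lr=1e-2)                  # updates only the z_i's
opt_theta = torch.optim.Adam(decoder.parameters(), lr=1e-3)   # updates only theta

def run_epochs(optimizer, epochs):
    """Run `epochs` passes over the N*K points, updating only this optimizer's variables."""
    for _ in range(epochs):
        for _ in range(max(1, N * K // B)):
            idx = torch.randint(0, N * K, (B,))       # minibatch of B points
            i, k = idx // K, idx % K
            inp = torch.cat([latents[i], t[i, k].unsqueeze(1)], dim=1)
            pred = decoder(inp).squeeze(1)            # f_theta(z_i, t_k)
            loss = ((pred - x[i, k]) ** 2).mean() \
                 + latents[i].pow(2).sum(dim=1).mean() / sigma**2
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

for _ in range(20):            # alternate 20 times ...
    run_epochs(opt_z, 20)      # ... latent vectors first for 20 epochs,
    run_epochs(opt_theta, 20)  # ... then the model parameters for 20 epochs.
```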