Robustness via Uncertainty-aware Cycle Consistency
Uddeshya Upadhyay, Yanbei Chen, Zeynep Akata
Neural Information Processing Systems, NeurIPS

Unpaired image-to-image translation refers to learning an inter-image-domain mapping without corresponding image pairs. Existing methods learn deterministic mappings without explicitly modelling the robustness to outliers or predictive uncertainty, leading to performance degradation when encountering unseen perturbations at test time. To address this, we propose a novel probabilistic method based on Uncertainty-aware Generalized Adaptive Cycle Consistency (UGAC), which models the per-pixel residual with a generalized Gaussian distribution capable of modelling heavy-tailed distributions. We compare our model with a wide variety of state-of-the-art methods on various challenging tasks, including unpaired image translation of natural images using standard datasets spanning autonomous driving, maps, and facades, and in the medical imaging domain with MRI. Experimental results demonstrate that our method exhibits stronger robustness towards unseen perturbations in test data. Code is released here:

Long Summary

Translating an image from one distribution, i.e. the source domain, to an image in another distribution, i.e. the target domain, under a distribution shift is an ill-posed problem, as a unique deterministic one-to-one mapping may not exist between the two domains. Furthermore, since the correspondence between inter-domain samples may be missing, their joint distribution needs to be inferred from a set of marginal distributions. However, as infinitely many joint distributions can be decomposed into a fixed set of marginal distributions, the problem is ill-posed in the absence of additional constraints. Image translation approaches often learn a deterministic mapping between the domains where every pixel in the input domain is mapped to a fixed pixel value in the output domain. However, such a deterministic formulation can lead to mode collapse and cannot quantify the model's predictive uncertainty, which is important for critical applications, e.g., medical image analysis. We propose an unpaired probabilistic image-to-image translation method trained without inter-domain correspondence in an end-to-end manner. The probabilistic nature of this method provides uncertainty estimates for the predictions. Moreover, modelling the residuals between the predictions and the ground-truth with heavy-tailed distributions makes our model robust to outliers and various unseen data.


Let there be two image domains $A$ and $B$. Let the sets of images from domains $A$ and $B$ be defined by (i) $S_{A} := \{a_1, a_2, \dots, a_n\}$, where $a_i \sim \mathcal{P}_A \ \forall i$, and (ii) $S_{B} := \{b_1, b_2, \dots, b_m\}$, where $b_i \sim \mathcal{P}_B \ \forall i$, respectively. The elements $a_i$ and $b_i$ represent the $i^{th}$ image from domains $A$ and $B$ respectively, and are drawn from the underlying unknown probability distributions $\mathcal{P}_{A}$ and $\mathcal{P}_{B}$ respectively.

Let each image have $K$ pixels, and let $u_{ik}$ represent the $k^{th}$ pixel of a particular image $u_i$. We are interested in learning a mapping from domain $A$ to $B$ ($A \rightarrow B$) and from $B$ to $A$ ($B \rightarrow A$) in an unpaired manner, so that the correspondence between the samples from $\mathcal{P}_A$ and $\mathcal{P}_B$ is not required at the learning stage. In other words, we want to learn the underlying joint distribution $\mathcal{P}_{AB}$ from the given marginal distributions $\mathcal{P}_A$ and $\mathcal{P}_B$. This work utilizes CycleGANs, which leverage cycle consistency to learn mappings in both directions ($A \rightarrow B$ and $B \rightarrow A$).

Cycle Consistency and its interpretation as Maximum Likelihood Estimation (MLE)

CycleGAN enforces an additional structure on the joint distribution using a set of primary networks (forming a GAN) and a set of auxiliary networks. The primary networks are represented by $\{\mathcal{G}_A(\cdot; \theta^\mathcal{G}_A), \mathcal{D}_A(\cdot; \theta^\mathcal{D}_A)\}$, where $\mathcal{G}_A$ represents a generator and $\mathcal{D}_A$ represents a discriminator. The auxiliary networks are represented by $\{\mathcal{G}_B(\cdot; \theta^\mathcal{G}_B), \mathcal{D}_B(\cdot; \theta^\mathcal{D}_B)\}$. While the primary networks learn the mapping $A \rightarrow B$, the auxiliary networks learn $B \rightarrow A$. Let the output of the generator $\mathcal{G}_A$ translating samples from domain $A$ (say $a_i$) to domain $B$ be called $\hat{b}_i$. Similarly, let the output of the generator $\mathcal{G}_B$ translating samples from domain $B$ (say $b_i$) to domain $A$ be called $\hat{a}_i$, i.e., $\hat{b}_i = \mathcal{G}_A(a_i; \theta^{\mathcal{G}}_A)$ and $\hat{a}_i = \mathcal{G}_B(b_i; \theta^{\mathcal{G}}_B)$. To simplify the notation, we omit the parameters of the networks in the equations. The cycle consistency constraint re-translates the above predictions ($\hat{b}_i, \hat{a}_i$) to get back the reconstructions in the original domains ($\bar{a}_i, \bar{b}_i$), where $\bar{a}_i = \mathcal{G}_B(\hat{b}_i)$ and $\bar{b}_i = \mathcal{G}_A(\hat{a}_i)$, and attempts to make the reconstructed images ($\bar{a}_i, \bar{b}_i$) similar to the original inputs ($a_i, b_i$) by penalizing the residuals between the reconstructions and the original input images with the $\mathcal{L}_1$ norm, giving the cycle consistency loss,

$$\mathcal{L}_{\text{cyc}}(\bar{a}_i, \bar{b}_i, a_i, b_i) = \mathcal{L}_1(\bar{a}_i, a_i) + \mathcal{L}_1(\bar{b}_i, b_i).$$
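The round trip behind this loss can be sketched in a few lines of plain Python. This is only an illustrative sketch: the generators `g_a`, `g_b`, and `g_b_off` are hypothetical toy functions standing in for trained networks, images are flat lists of pixel values, and averaging over pixels is an assumed normalization rather than something the text specifies.

```python
def l1(x, y):
    """Mean absolute difference between two equally sized flat images."""
    return sum(abs(p - q) for p, q in zip(x, y)) / len(x)


def cycle_loss(a, b, g_a, g_b):
    """L1 cycle consistency for one image from each domain."""
    b_hat = g_a(a)      # translate A -> B
    a_hat = g_b(b)      # translate B -> A
    a_bar = g_b(b_hat)  # re-translate back to A
    b_bar = g_a(a_hat)  # re-translate back to B
    return l1(a_bar, a) + l1(b_bar, b)


# Hypothetical toy "generators" (not trained networks).
g_a = lambda img: [2.0 * p for p in img]            # A -> B
g_b = lambda img: [0.5 * p for p in img]            # B -> A, exact inverse of g_a
g_b_off = lambda img: [0.5 * p + 0.1 for p in img]  # B -> A with a constant bias

a = [0.1, 0.4, 0.9]
b = [0.2, 0.8, 0.6]

print(cycle_loss(a, b, g_a, g_b))      # inverse pair: the cycle closes
print(cycle_loss(a, b, g_a, g_b_off))  # biased pair: residuals are penalized
```

When the two generators invert each other the loss vanishes; any mismatch shows up as a non-zero residual in both round trips.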

The underlying assumption when penalizing with the $\mathcal{L}_1$ norm is that the residual at every pixel between the reconstruction and the input follows a zero-mean and fixed-variance Laplace distribution, i.e., $\bar{a}_{ij} = a_{ij} + \epsilon^a_{ij}$ and $\bar{b}_{ij} = b_{ij} + \epsilon^b_{ij}$ with,

$$\epsilon^a_{ij}, \epsilon^b_{ij} \sim \text{Laplace}\left(\epsilon; 0, \frac{\sigma}{\sqrt{2}}\right) \equiv \frac{1}{\sqrt{2\sigma^2}} e^{-\sqrt{2}\frac{|\epsilon-0|}{\sigma}},$$

where $\sigma^2$ represents the fixed variance of the distribution, $a_{ij}$ represents the $j^{th}$ pixel in image $a_i$, and $\epsilon^{a}_{ij}$ represents the noise in the $j^{th}$ pixel for the estimated image $\bar{a}_{ij}$. This assumption on the residuals between the reconstruction and the input enforces the likelihood (i.e., $\mathscr{L}(\Theta | \mathcal{X}) = \mathcal{P}(\mathcal{X}|\Theta)$, where $\Theta := \theta^{\mathcal{G}}_A \cup \theta^{\mathcal{G}}_B \cup \theta^{\mathcal{D}}_A \cup \theta^{\mathcal{D}}_B$ and $\mathcal{X} := S_A \cup S_B$) to follow a factored Laplace distribution:

$$\mathscr{L}(\Theta | \mathcal{X}) \propto \prod_{ijpq} e^{-\frac{\sqrt{2}|\bar{a}_{ij}-a_{ij}|}{\sigma}} e^{-\frac{\sqrt{2}|\bar{b}_{pq}-b_{pq}|}{\sigma}},$$

where minimizing the negative-log-likelihood yields $\mathcal{L}_{\text{cyc}}$, which has the following limitations. In the presence of outliers, the residuals may not follow the Laplace distribution but instead a heavy-tailed distribution, while the i.i.d. assumption leads to fixed-variance distributions for the residuals that do not allow modelling of heteroscedasticity to aid in uncertainty estimation.
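The step from the factored Laplace likelihood to the $\mathcal{L}_1$ penalty can be spelled out as follows. This is a sketch that treats the product as running over the pixels of both domains separately and collects every term that does not depend on the reconstructions into a constant:

```latex
\begin{align}
-\ln \mathscr{L}(\Theta | \mathcal{X})
  &= -\sum_{ij} \ln e^{-\frac{\sqrt{2}|\bar{a}_{ij}-a_{ij}|}{\sigma}}
     -\sum_{pq} \ln e^{-\frac{\sqrt{2}|\bar{b}_{pq}-b_{pq}|}{\sigma}}
     + \text{const} \\
  &= \frac{\sqrt{2}}{\sigma} \left[ \sum_{ij} |\bar{a}_{ij}-a_{ij}|
     + \sum_{pq} |\bar{b}_{pq}-b_{pq}| \right] + \text{const}.
\end{align}
```

Because $\sigma$ is fixed, the prefactor $\sqrt{2}/\sigma$ only rescales the objective, so the minimizer is the same as that of the summed absolute residuals, i.e. the $\mathcal{L}_1$ cycle consistency.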

Building Uncertainty-aware Cycle Consistency

We propose to alleviate the mentioned issues by modelling the underlying per-pixel residual distribution as an independent but non-identically distributed zero-mean generalized Gaussian distribution (GGD), i.e., with no fixed shape ($\beta > 0$) and scale ($\alpha > 0$) parameters. Instead, all the shape and scale parameters of the distributions are predicted from the networks and formulated as follows:

$$\epsilon^a_{ij}, \epsilon^b_{ij} \sim GGD(\epsilon; 0, \bar{\alpha}_{ij}, \bar{\beta}_{ij}) \equiv \frac{\bar{\beta}_{ij}}{2\bar{\alpha}_{ij}\Gamma(\frac{1}{\bar{\beta}_{ij}})} e^{-\left(\frac{|\epsilon-0|}{\bar{\alpha}_{ij}}\right)^{\bar{\beta}_{ij}}}.$$

For each $\epsilon_{ij}$, the parameters of the distribution $\{\bar{\alpha}_{ij}, \bar{\beta}_{ij}\}$ may not be the same as the parameters for other $\epsilon_{ik}$s; therefore, the residuals are non-identically distributed, allowing modelling with heavier-tailed distributions. The likelihood for our proposed model is,

$$\mathscr{L}(\Theta | \mathcal{X}) = \prod_{ijpq} \mathscr{G}(\bar{\beta}^a_{ij},\bar{\alpha}^a_{ij},\bar{a}_{ij},a_{ij}) \, \mathscr{G}(\bar{\beta}^b_{pq},\bar{\alpha}^b_{pq},\bar{b}_{pq},b_{pq}),$$

where $\bar{\beta}^a_{ij}$ represents the $j^{th}$ pixel of domain $A$'s shape parameter $\beta^a_i$ (similarly for the others). $\mathscr{G}(\bar{\beta}^u_{ij},\bar{\alpha}^u_{ij},\bar{u}_{ij},u_{ij})$ is the pixel-likelihood at the $j^{th}$ pixel of image $u_i$ (which can represent images of both domains $A$ and $B$), formulated as $\mathscr{G}(\bar{\beta}^u_{ij},\bar{\alpha}^u_{ij},\bar{u}_{ij},u_{ij}) = GGD(u_{ij}; \bar{u}_{ij}, \bar{\alpha}^u_{ij}, \bar{\beta}^u_{ij})$.
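To get a feel for the GGD density used as the pixel-likelihood, the sketch below evaluates it with Python's standard `math` module and checks two familiar special cases: $\beta = 1$ gives a Laplace density with scale $\alpha$, and $\beta = 2$ with $\alpha = \sqrt{2}\sigma$ gives a Gaussian with standard deviation $\sigma$. The function names are illustrative only and the evaluation points are arbitrary.

```python
import math


def ggd_pdf(x, mu, alpha, beta):
    """Generalized Gaussian density with location mu, scale alpha, shape beta."""
    coeff = beta / (2.0 * alpha * math.gamma(1.0 / beta))
    return coeff * math.exp(-((abs(x - mu) / alpha) ** beta))


def laplace_pdf(x, b):
    """Zero-mean Laplace density with scale b."""
    return math.exp(-abs(x) / b) / (2.0 * b)


def gaussian_pdf(x, sigma):
    """Zero-mean Gaussian density with standard deviation sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


# beta = 1 matches Laplace; beta = 2 with alpha = sqrt(2) * sigma matches Gaussian.
print(ggd_pdf(0.7, 0.0, 1.3, 1.0), laplace_pdf(0.7, 1.3))
print(ggd_pdf(0.7, 0.0, math.sqrt(2.0) * 1.3, 2.0), gaussian_pdf(0.7, 1.3))

# A smaller shape parameter puts more mass far from the mean (heavier tail).
print(ggd_pdf(5.0, 0.0, 1.0, 0.5), ggd_pdf(5.0, 0.0, 1.0, 2.0))
```

The last line illustrates why letting $\beta$ vary per pixel helps with outliers: for the same scale, a large residual is far less improbable under a small $\beta$ than under a Gaussian-like $\beta = 2$.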

The negative-log-likelihood is given by,

$$-\ln{\mathscr{L}(\Theta | \mathcal{X})} = -\sum_{ijpq} \left[ \ln\left( \frac{\bar{\beta}^a_{ij}}{2\bar{\alpha}^a_{ij}\Gamma(\frac{1}{\bar{\beta}^a_{ij}})} e^{-\left(\frac{|\bar{a}_{ij}-a_{ij}|}{\bar{\alpha}^a_{ij}}\right)^{\bar{\beta}^a_{ij}}} \right) + \ln\left( \frac{\bar{\beta}^b_{pq}}{2\bar{\alpha}^b_{pq}\Gamma(\frac{1}{\bar{\beta}^b_{pq}})} e^{-\left(\frac{|\bar{b}_{pq}-b_{pq}|}{\bar{\alpha}^b_{pq}}\right)^{\bar{\beta}^b_{pq}}} \right) \right]$$

Minimizing the negative-log-likelihood yields a new cycle consistency loss, which we call the uncertainty-aware generalized adaptive cycle consistency loss $\mathcal{L}_{\text{ucyc}}$, given $\mathscr{A}=\{\bar{a}_i, \bar{\alpha}^{a}_i, \bar{\beta}^{a}_i, a_i\}$ and $\mathscr{B}=\{\bar{b}_i, \bar{\alpha}^{b}_i, \bar{\beta}^{b}_i, b_i\}$,

$$\mathcal{L}_{\text{ucyc}}(\mathscr{A}, \mathscr{B}) = \mathcal{L}_{\alpha\beta}(\mathscr{A}) + \mathcal{L}_{\alpha\beta}(\mathscr{B}),$$

where $\mathcal{L}_{\alpha\beta}(\mathscr{A}) = \mathcal{L}_{\alpha\beta}(\bar{a}_i, \bar{\alpha}^{a}_i, \bar{\beta}^{a}_i, a_i)$ is the new objective function corresponding to domain $A$,

$$\mathcal{L}_{\alpha\beta}(\bar{a}_i, \bar{\alpha}^{a}_i, \bar{\beta}^{a}_i, a_i) = \frac{1}{K}\sum_{j} \left(\frac{|\bar{a}_{ij}-a_{ij}|}{\bar{\alpha}^{a}_{ij}} \right)^{\bar{\beta}^{a}_{ij}} - \log\frac{\bar{\beta}^{a}_{ij}}{\bar{\alpha}^{a}_{ij}} + \log\Gamma\left(\frac{1}{\bar{\beta}^{a}_{ij}}\right),$$

where $(\bar{a}_i, \bar{b}_i)$ are the reconstructions for $(a_i, b_i)$ and $(\bar{\alpha}^{a}_i, \bar{\beta}^{a}_i), (\bar{\alpha}^{b}_i, \bar{\beta}^{b}_i)$ are the scale and shape parameters for the reconstructions $(\bar{a}_i, \bar{b}_i)$, respectively. The $\mathcal{L}_1$ norm-based cycle consistency $\mathcal{L}_{\text{cyc}}$ is a special case of $\mathcal{L}_{\text{ucyc}}$ with $(\bar{\alpha}^{a}_{ij}, \bar{\beta}^{a}_{ij}, \bar{\alpha}^{b}_{ij}, \bar{\beta}^{b}_{ij}) = (1,1,1,1) \ \forall i,j$. To utilize $\mathcal{L}_{\text{ucyc}}$, one must have the $\alpha$ maps and the $\beta$ maps for the reconstructions of the inputs. To obtain the reconstructed image, $\alpha$ (scale map), and $\beta$ (shape map), we modify the head of the generators (the last few convolutional layers) and split them into three heads, connected to a common backbone.
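A minimal per-image sketch of $\mathcal{L}_{\alpha\beta}$ in plain Python is given below. It assumes that all three terms of the objective are accumulated per pixel before averaging over the $K$ pixels, and it uses flat lists in place of image tensors; the function name and the toy inputs are illustrative, not taken from the released code.

```python
import math


def ggd_cycle_loss(recon, target, alpha, beta):
    """Per-image L_alpha_beta, averaged over the K pixels (assumed per-pixel terms)."""
    k = len(target)
    total = 0.0
    for r, t, a, b in zip(recon, target, alpha, beta):
        residual_term = (abs(r - t) / a) ** b
        total += residual_term - math.log(b / a) + math.lgamma(1.0 / b)
    return total / k


recon = [0.2, 0.5]
target = [0.0, 1.0]
ones = [1.0, 1.0]

# With alpha = beta = 1 everywhere, the loss reduces to the mean absolute residual.
print(ggd_cycle_loss(recon, target, ones, ones))

# One outlier pixel: a larger predicted scale at that pixel lowers the loss.
outlier_recon = [0.0, 4.0]
clean_target = [0.0, 0.0]
loss_fixed_scale = ggd_cycle_loss(outlier_recon, clean_target, [1.0, 1.0], ones)
loss_adapted_scale = ggd_cycle_loss(outlier_recon, clean_target, [1.0, 4.0], ones)
print(loss_fixed_scale, loss_adapted_scale)
```

The second comparison shows the mechanism in miniature: the $-\log(\beta/\alpha)$ term charges a price for inflating the scale, but at a pixel with a large residual that price is smaller than the reduction in the residual term.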

Once we train the model, for every input image, the model will provide the scale ($\alpha$) and the shape ($\beta$) maps that can be used to obtain the aleatoric uncertainty given by,

$$\sigma^2_{\text{aleatoric}} = \frac{\alpha^2\Gamma(\frac{3}{\beta})}{\Gamma(\frac{1}{\beta})}$$
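As a quick numerical check of this expression, the following sketch evaluates it for a single pixel with Python's standard `math` module (the function name is illustrative). The two printed cases correspond to known variances: a Laplace distribution with scale $\alpha$ ($\beta = 1$) has variance $2\alpha^2$, and a Gaussian written as a GGD with $\alpha = \sqrt{2}\sigma$ ($\beta = 2$) has variance $\sigma^2$.

```python
import math


def aleatoric_variance(alpha, beta):
    """Variance of a zero-mean GGD with scale alpha and shape beta."""
    return alpha ** 2 * math.gamma(3.0 / beta) / math.gamma(1.0 / beta)


# Laplace case (beta = 1): variance is 2 * alpha^2.
print(aleatoric_variance(1.0, 1.0))

# Gaussian case (beta = 2, alpha = sqrt(2) * sigma with sigma = 1): variance is sigma^2.
print(aleatoric_variance(math.sqrt(2.0), 2.0))
```

Applying the same function pixel by pixel to the predicted $\alpha$ and $\beta$ maps would give an uncertainty map of the same size as the image.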

For the resulting uncertainty maps and our perturbation analysis of the trained model, please see Section 4 of the paper.

(c) 2021 Explainable Machine Learning Tübingen