Unpaired Image-to-Image Translation using Generative Adversarial Networks

Updated: Jan 2, 2019


In paired image translation, the goal is to learn a mapping between two image domains using a convolutional neural network trained on a set of paired images. In real-world problems we don’t always have paired datasets, which is why unpaired translation is a solution to this data deficiency.


How would Florida look under heavy snow? How would London look in a sand tornado? These questions don’t usually come to mind, but what if we want to simulate Elizabeth Tower under heavy snow while its current construction work is going on? We would need someone with good Photoshop experience to help us out. Such solutions need a huge amount of labour and are usually time-consuming. We can imagine Trafalgar Square at Christmas, with its amazing festive decorations, but how would it look if 10,000 tourists burst into the square and started taking photos? That’s hard to imagine, but this kind of data and simulation can be useful for security and control organizations. Using cycle-consistent networks, we can simulate and generate these sorts of situations. GANs, and CycleGANs in particular, have broader use cases in solving real-world problems.

They have also been proposed for health care, for example to help generate candidate medicines for diseases that currently have no cure.

GANs have also been applied to continuous laboratory time series data, together with an unsupervised evaluation method that measures the predictive power of the synthetic laboratory test time series.

NVIDIA showed an amazing example of this approach in action: they used GANs to augment a dataset of medical brain CT images with different diseases. Classification using only classic data augmentation yielded 78.6% sensitivity and 88.4% specificity, which is the baseline that the GAN-based augmentation was measured against.

Generative Networks

Generative networks are a deep learning approach in which the network learns to mimic its training data. The aim is to model the distribution that a given set of data (e.g. images, audio) came from. Normally this is an unsupervised problem, in the sense that the models are trained on a large collection of unlabelled data.

Generative model vs Discriminative model

In generative models, we learn the joint probability P(x, y), whereas in discriminative models, we learn the conditional probability P(y | x).
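To make the difference between P(x, y) and P(y | x) concrete, here is a minimal sketch that estimates both from counts. The toy dataset and the function names are hypothetical and only serve as an illustration; x plays the role of an observed feature and y the role of a label.

```python
from collections import Counter

# Hypothetical toy dataset of (x, y) pairs: x is a feature, y is a label.
data = [
    ("dark", "night"), ("dark", "night"), ("dark", "day"),
    ("bright", "day"), ("bright", "day"), ("bright", "day"),
    ("bright", "night"), ("dark", "night"),
]


def joint_probability(pairs):
    """Estimate P(x, y) as the relative frequency of each (x, y) pair."""
    counts = Counter(pairs)
    total = len(pairs)
    return {pair: n / total for pair, n in counts.items()}


def conditional_probability(pairs):
    """Estimate P(y | x) by normalising the pair counts within each x."""
    pair_counts = Counter(pairs)
    x_counts = Counter(x for x, _ in pairs)
    return {(x, y): n / x_counts[x] for (x, y), n in pair_counts.items()}


joint = joint_probability(data)              # what a generative model captures
conditional = conditional_probability(data)  # what a discriminative model captures
```

The joint table sums to 1 over all (x, y) pairs, so it describes how the whole dataset is distributed. The conditional table only sums to 1 within each x, so it can classify a given x but says nothing about how likely that x is in the first place.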

In generative models, the aim is to generate new data; here this is done by training a convolutional neural network on a given dataset.

The advantage of generative networks is that, once we know the data distribution (for example a Gaussian or a multinomial), we can select a point in that distribution. That point doesn’t have to be in the dataset, but it is part of the range of possibilities. It can therefore be a new data point that has all the properties of our dataset. So by learning the distribution, we can generate new samples that fit the same probability distribution.
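As a small illustration of this idea, the sketch below fits a one-dimensional Gaussian to a handful of values and then draws new points from it. The data values and function names are made up for the example, and a single Gaussian is of course far simpler than the distribution a neural network would learn.

```python
import random
import statistics


def fit_gaussian(samples):
    """Estimate the mean and standard deviation of a 1-D dataset."""
    return statistics.mean(samples), statistics.stdev(samples)


def sample_new_points(mean, std, n, seed=0):
    """Draw n new points from the fitted Gaussian (seeded for repeatability)."""
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in range(n)]


# Hypothetical measurements standing in for a real dataset.
heights_cm = [158.0, 162.5, 165.0, 170.0, 171.5, 176.0, 180.0, 183.5]

mean, std = fit_gaussian(heights_cm)
new_points = sample_new_points(mean, std, n=5)
```

The values in new_points were never observed, yet they are plausible under the fitted distribution, which is exactly the property described above.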

Generative models are usually costly to train compared to discriminative models. Discriminative models are much easier to train, and their objective is to decide whether an input belongs to the training data, i.e. to model P(y | x), where x is the input image and y is the label saying whether it matches the dataset.

Generative models are not limited to convolutional neural networks; there are other popular models as well, such as Gaussians, Naive Bayes, mixtures of multinomials, mixtures of Gaussians, Hidden Markov Models (HMMs), sigmoidal belief networks, Bayesian networks and Markov random fields.

We can also use different models as the discriminator. Some of the more popular ones are logistic regression, SVMs, CNNs, nearest neighbour and conditional random fields.

Generative Adversarial Networks

Generative adversarial networks use an elegant training criterion that doesn’t require computing the likelihood. In particular, if the generator is doing a good job of modelling the data distribution, then the generated samples should be indistinguishable from the true data. So the idea behind GANs is to train a discriminator network whose job is to classify whether an observation is from the training set or whether it was produced by the generator. The generator is evaluated based on the discriminator’s inability to tell its samples from data.

In other words, we simultaneously train two different networks, a generator network G and a discriminator network D. The two networks are trained competitively. In the example from Ian Goodfellow’s paper, the generative model can be thought of as analogous to a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police trying to detect the counterfeit currency. Competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles.

The discriminator is trained just like a logistic regression classifier. Its cost function is the cross-entropy for classifying real vs. fake:

$$J_D = \mathbb{E}_{x \sim \mathcal{D}}[-\log D(x)] + \mathbb{E}_{z}[-\log(1 - D(G(z)))] \qquad (1)$$

Here, $x \sim \mathcal{D}$ denotes sampling from the training set $\mathcal{D}$ (not to be confused with the discriminator $D$). If the discriminator has low cross-entropy, that means it can easily distinguish real from fake. If it has high cross-entropy, that means it can’t. Therefore, the most straightforward criterion for the generator is to maximize the discriminator’s cross-entropy.

This is equivalent to making the generator’s cost function the negative cross-entropy:

$$J_G = -J_D = \text{const} + \mathbb{E}_{z}[\log(1 - D(G(z)))]$$
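The two cost functions can be checked numerically with a short sketch. It assumes we already have the discriminator’s outputs as probabilities: d_real holds D(x) for a batch of real samples and d_fake holds D(G(z)) for a batch of generated samples. These names, and the use of batch means in place of expectations, are choices made for this illustration.

```python
import math


def discriminator_loss(d_real, d_fake):
    """J_D: mean of -log D(x) on real data plus mean of -log(1 - D(G(z))) on fakes."""
    real_term = sum(-math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(-math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """J_G without the constant term: mean of log(1 - D(G(z)))."""
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)


# A discriminator that separates real from fake well has a low cross-entropy.
confident = discriminator_loss(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])

# A discriminator that outputs 0.5 everywhere cannot tell them apart.
confused = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

Here confused equals 2 log 2, the value reached when the discriminator is reduced to guessing, while confident is much smaller. The generator’s cost goes down as D(G(z)) goes up, so minimizing it pushes the discriminator towards the confused case.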

This is shown schematically as follows:

Different GANs

Over the years, generative networks have evolved a lot. Early versions were costly to train, had major issues with their cost functions, and could only generate small-sized data.


One of the first improvements towards an optimal GAN was the creation of Deep Convolutional GANs (DCGANs).

They are more stable and optimized in both the generator and the discriminator network.

They introduced batch normalization in both networks and, by cutting down on fully connected hidden layers, achieved better performance and results on high-dimensional data.

Some of the techniques that have been used to improve on DCGAN training:

Feature matching: instead of having the generator trying to fool the discriminator as much as possible, they propose a new objective function. This objective requires the generator to generate data that matches the statistics of the real data. In this case, the discriminator is only used to specify which are the statistics worth matching.

Historical averaging: when updating the parameters, also take into account their past values.

One-sided label smoothing: simply change your discriminator’s target output from [0=fake image, 1=real image] to [0=fake image, 0.9=real image]. This improves the training.

Virtual batch normalization: avoid dependency of data on the same batch by using statistics collected on a reference batch. It is computationally expensive, so it’s only used on the generator.
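Two of the techniques above are simple enough to sketch in a few lines. The code below is an illustration under stated assumptions rather than a reference implementation: feature vectors are plain lists of floats taken from some intermediate discriminator layer, the statistic being matched is the batch mean, and all function names are hypothetical.

```python
import math


def bce(prediction, target):
    """Binary cross-entropy for one prediction in (0, 1) and a target in [0, 1]."""
    return -(target * math.log(prediction)
             + (1.0 - target) * math.log(1.0 - prediction))


def discriminator_targets(is_real, real_target=0.9):
    """One-sided label smoothing: real images get 0.9, fake images stay at 0."""
    return [real_target if real else 0.0 for real in is_real]


def feature_matching_loss(real_features, fake_features):
    """Squared distance between the mean feature vectors of the two batches."""
    dim = len(real_features[0])
    real_mean = [sum(f[i] for f in real_features) / len(real_features)
                 for i in range(dim)]
    fake_mean = [sum(f[i] for f in fake_features) / len(fake_features)
                 for i in range(dim)]
    return sum((r - f) ** 2 for r, f in zip(real_mean, fake_mean))


targets = discriminator_targets([True, False])
matched = feature_matching_loss([[1.0, 2.0], [3.0, 4.0]], [[2.0, 3.0]])
mismatched = feature_matching_loss([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0]])
```

With the smoothed target of 0.9, the cross-entropy on a real image is lowest when the discriminator predicts 0.9 rather than something close to 1, which discourages over-confident outputs. The feature matching loss is zero when the generated batch has the same mean features as the real batch, even if the individual samples differ.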


Cycle consistency, the idea of using transitivity as a way to regularize structured data, has a long history. In visual tracking, enforcing simple forward-backward consistency has been a standard trick for decades. In the language domain, verifying and improving translations via “back translation and reconciliation” is a technique used by human translators, as well as by machines. More recently, higher-order cycle consistency has been used in structure from motion, 3D shape matching, co-segmentation, dense semantic alignment, and depth estimation. We use a cycle consistency loss as a way of using transitivity to supervise CNN training.

Our goal here is to find mappings between two datasets such that, after translating to the other domain and back again, we arrive at the data we started with. Mathematically, if we have a network G: X -> Y and another network F: Y -> X, then G and F should be inverses of each other, and both mappings must be bijections.

We apply this structural assumption by training both mappings G and F simultaneously and adding a cycle consistency loss that encourages F(G(x)) = x and G(F(y)) = y. Combining this loss with adversarial losses on domains X and Y yields our full objective for unpaired translation.

$$\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda \mathcal{L}_{cyc}(G, F),$$

where λ controls the relative importance of the two objectives. We aim to solve:

$$G^*, F^* = \arg\min_{G, F} \max_{D_X, D_Y} \mathcal{L}(G, F, D_X, D_Y).$$
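The cycle consistency term and the way it enters the full objective can be sketched as follows. Several things here are assumptions made for the illustration: images are flat lists of pixel intensities, the distance is a mean absolute (L1) difference, λ defaults to 10, and the mappings G, F_exact and F_poor are hand-written toy functions standing in for trained networks.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized 'images'."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def cycle_consistency_loss(G, F, xs, ys):
    """Penalise F(G(x)) differing from x and G(F(y)) differing from y."""
    forward = sum(l1_distance(F(G(x)), x) for x in xs) / len(xs)
    backward = sum(l1_distance(G(F(y)), y) for y in ys) / len(ys)
    return forward + backward


def full_objective(gan_loss_G, gan_loss_F, cycle_loss, lam=10.0):
    """L = L_GAN(G, D_Y, X, Y) + L_GAN(F, D_X, Y, X) + lambda * L_cyc(G, F)."""
    return gan_loss_G + gan_loss_F + lam * cycle_loss


def G(x):
    """Toy mapping X -> Y."""
    return [2.0 * v + 1.0 for v in x]


def F_exact(y):
    """Exact inverse of G, so every cycle returns to its starting point."""
    return [(v - 1.0) / 2.0 for v in y]


def F_poor(y):
    """Imperfect inverse of G, so cycles drift away from their starting point."""
    return [v / 2.0 for v in y]


xs = [[0.0, 1.0], [2.0, 3.0]]
ys = [[1.0, 3.0]]

exact_cycle = cycle_consistency_loss(G, F_exact, xs, ys)
poor_cycle = cycle_consistency_loss(G, F_poor, xs, ys)
```

When F is the exact inverse of G the cycle loss is zero, while the imperfect inverse is penalized. Because the cycle term is multiplied by λ, a pair of mappings that fools both discriminators but does not round-trip correctly still receives a large total loss.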

Wasserstein GAN

Training two networks against each other means we have two cost functions for which we need to find an optimal point. But one of the main problems with generative adversarial networks is that this process can go on forever, because the two cost functions may never converge. This means we never know when our learning has finished.

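Wasserstein distance, also called the Earth Mover’s (EM) distance, is a measure of the distance between two probability distributions: intuitively, the minimum cost of moving the mass of one distribution so that it matches the other.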

This is useful in generative adversarial networks, because training is about bringing two probability distributions together: the distribution of the generated samples and the distribution of the real data.
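For intuition, the Wasserstein distance is easy to compute in one dimension when both samples have the same size: sort both and average the absolute differences between matching positions. The sketch below covers only this special case; the function name is hypothetical, and a WGAN estimates the distance with a trained network instead of sorting.

```python
def wasserstein_1d(samples_p, samples_q):
    """Wasserstein-1 distance between two equally sized 1-D samples."""
    if len(samples_p) != len(samples_q):
        raise ValueError("this sketch assumes equally sized samples")
    # In 1-D the cheapest transport plan pairs the sorted values in order.
    pairs = zip(sorted(samples_p), sorted(samples_q))
    return sum(abs(p - q) for p, q in pairs) / len(samples_p)


real = [0.0, 1.0, 2.0]
near = wasserstein_1d(real, [0.5, 1.5, 2.5])  # generated samples shifted by 0.5
far = wasserstein_1d(real, [5.0, 6.0, 7.0])   # generated samples shifted by 5
```

The distance grows in proportion to the shift between the two samples, even when they do not overlap at all. That smooth behaviour is what makes it attractive as a training signal.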

As we can see from the figure below, a normal cost function for a generative network does not converge, and there is no clear optimal point.

By using Wasserstein distance to optimise our cost function we get the figure below.

In particular, training WGANs does not require maintaining a careful balance in training of the discriminator and the generator, and does not require a careful design of the network architecture either. The mode dropping phenomenon that is typical in GANs is also drastically reduced. One of the most compelling practical benefits of WGANs is the ability to continuously estimate the EM distance by training the discriminator to optimality. Plotting these learning curves is not only useful for debugging and hyperparameter searches; the curves also correlate remarkably well with the observed sample quality.

Future Work

Optimization on Discriminator

Normalizing my datasets

Resizing input images

Training on smaller feature sets

Improving cost function

Testing two or more discriminator steps per generator step

More optimized implementation on Amazon SageMaker

Element Distribution

This is one of the concepts that I’m currently researching.

If we have a discriminator that is trained only on a specific element in an image, we can have our generator randomly add or remove that element from the image, and have our discriminator make sure the result fits our dataset. If we assume our dataset contains only logical distributions of that element, we can say the newly generated image is logical. This can be helpful for simulation in health care, gaming and security.