Synthetic Image Generation — Mona Lisa’s Sister

Christian Schitton
Apr 21, 2022

Creation by Noise

Working with original data is often quite restricted. Legal, ethical or business constraints frequently prevent the use of original datasets, which as an inconvenient consequence also limits the use of high-tech artificial intelligence tools.

Imagine the potential use of Deep Learning architectures for medical image classification. Deep Learning models take over the classification of diseases based on medical scans, relieving medical staff of this duty and giving doctors the necessary time to focus solely on the patient.

The point is that those artificial intelligence applications have to be trained for their task, and training requires a lot of data. However, original data are locked away in most cases for the reasons mentioned above, which puts tight constraints on the applicability of machine learning tools and deep learning architectures.

But — there is a way out: the creation of synthetic datasets.

Synthetic datasets keep the properties of the original data. Nevertheless, tracing them back to the original data is extremely improbable, because synthetic data are built up from “noise” in a probabilistic way, without the generating model ever seeing the original data.

How is that even possible?

In short, when you are in possession of precious original data:

photo credits: WikiImages via pixabay.com

The original data are then replaced by synthetic data which are built up from scratch, or in statistical parlance: from “noise”:

photo credits: Bru-nO via pixabay.com

However, the synthetic data should keep the properties of the original data while not being retraceable. In particular, one does not want to end up with a useless, out-of-range copy of the original dataset:

photo credits: OpenClipart-Vectors via pixabay.com

There are several ways to achieve this. Generative Adversarial Networks, Markov Chain Generators or Gaussian Mixture Models are some of them.

In this article, we use Generative Adversarial Networks to explain how synthetic data is created from scratch, i.e. from noise.

Generative Adversarial Networks

A Generative Adversarial Network (GAN) comprises two Deep Learning structures. The first structure is called the Generator. The second structure is called the Discriminator.

In short, the task of the Generator is to create synthetic images (i.e. fake images) which resemble the original images very closely. The Discriminator is an image classification tool which tries to prevent the Generator from getting its fake images accepted.

The basic structure of a GAN:

image by author

The basic idea of the Discriminator is that it should be able to distinguish between real and fake images. It is initially trained on the original dataset. But once the Generator produces fake images, those fake images are also used to train the Discriminator together with the original dataset. The goal is to constantly improve the classification accuracy of the Discriminator and to make it harder for the Generator to get its (fake) images accepted.

The Generator starts to produce fake images out of a “rubble” of colours. This rubble is called White Noise in statistical terms. After generating an image, the Generator sends it to the Discriminator, which decides whether the provided image is a fake or an original one. Based on this feedback loop, the Generator “improves” its fake images (i.e. the feedback comes in the form of a loss function, and the Generator improves its images by changing its parameter weights by means of an optimiser function).
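Formally, this competition corresponds to the classic GAN minimax objective introduced by Goodfellow et al.; this is the textbook formulation, not necessarily the exact loss implemented later:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_\mathrm{data}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]

Here D(x) is the Discriminator’s probability that image x is real, and G(z) is the fake image the Generator decodes from the white-noise vector z. The Discriminator tries to maximise this value, the Generator tries to minimise it.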

So, in fact, both parts improve constantly and compete with each other.

A very important point here is that the Generator never gets anywhere close to the original dataset. It improves solely through the feedback provided by the Discriminator, which trains itself on the original dataset and the generated fake images. There is no direct connection between generation and checking, and therefore no direct connection with the original images.

From White Noise to Images

So, let’s see how this works in practice, i.e. how the generation of fake images from pure rubble/noise takes place.

This example is executed in R with Keras (and TensorFlow as backend). Please see the references below for my sources and for a much deeper insight.

The Dataset

As a showcase, we take a dataset consisting of handwritten digits “7”. Here is an excerpt of the dataset:

image by author

Here are the handwritten digits in data-object terms:
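Since the original console output is shown as an image, here is a minimal sketch of what inspecting the data object could look like, assuming the images are stored in a 4-dimensional array called train_x (the variable name is illustrative):

```r
# Inspect the dimensions of the training array (name is illustrative)
dim(train_x)
#> [1] 6265   28   28    1
# i.e. 6265 images, 28 x 28 pixels, 1 (greyscale) channel
```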

In other words, the original dataset we are training our GAN with has 6,265 images of handwritten digits, provided at a size of 28 x 28 pixels and with 1 channel (as these are greyscale images). One image in data format looks as follows:

image by author

This represents the original dataset. Let’s assume now, the task is to generate fake images of the digit “7” which resemble the original ones very closely.

The Generator

The Generator is built up as follows:

code snippet by author
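Since the original snippet is shown as an image, here is a minimal sketch of what such a Generator could look like in R with Keras. It is modelled on the GAN example in the Chollet/Allaire reference below; the latent dimension and the filter sizes are assumptions, not necessarily the author’s exact values:

```r
library(keras)

latent_dim <- 28  # assumed size of the white-noise input vector

generator_input <- layer_input(shape = c(latent_dim))

generator_output <- generator_input %>%
  # Project the noise vector onto a small feature map
  layer_dense(units = 128 * 14 * 14) %>%
  layer_activation_leaky_relu() %>%
  layer_reshape(target_shape = c(14, 14, 128)) %>%
  # Upsample from 14 x 14 to the target resolution of 28 x 28
  layer_conv_2d_transpose(filters = 128, kernel_size = 4,
                          strides = 2, padding = "same") %>%
  layer_activation_leaky_relu() %>%
  layer_conv_2d(filters = 64, kernel_size = 5, padding = "same") %>%
  layer_activation_leaky_relu() %>%
  # One output channel, as the images are greyscale
  layer_conv_2d(filters = 1, kernel_size = 7,
                activation = "tanh", padding = "same")

generator <- keras_model(generator_input, generator_output)
```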

And we have the following layer/ parameter structure:

table by author

The Discriminator

The Discriminator is built up in the following way:

code snippet by author
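Again, the original snippet is an image; here is a minimal sketch of a matching Discriminator, a small convolutional classifier ending in a single real/fake probability. Filter counts, the dropout rate and the learning rate are assumptions:

```r
discriminator_input <- layer_input(shape = c(28, 28, 1))

discriminator_output <- discriminator_input %>%
  layer_conv_2d(filters = 64, kernel_size = 3) %>%
  layer_activation_leaky_relu() %>%
  layer_conv_2d(filters = 64, kernel_size = 4, strides = 2) %>%
  layer_activation_leaky_relu() %>%
  layer_flatten() %>%
  # Dropout as a regulariser against overfitting
  layer_dropout(rate = 0.4) %>%
  # Single probability: fake vs. original
  layer_dense(units = 1, activation = "sigmoid")

discriminator <- keras_model(discriminator_input, discriminator_output)

discriminator %>% compile(
  optimizer = optimizer_rmsprop(learning_rate = 0.0008),
  loss = "binary_crossentropy"
)
```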

Resulting in the following layer/ parameter structure:

table by author

The GAN Net

Now, the Generator and the Discriminator have to be put together within the Generative Adversarial Network:

code snippet by author
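A sketch of how the two parts could be chained together, with the Discriminator’s weights frozen inside the combined model (the learning rate is again an assumption):

```r
# Freeze the Discriminator so only the Generator learns in the GAN step
freeze_weights(discriminator)

gan_input <- layer_input(shape = c(latent_dim))
gan_output <- gan_input %>% generator() %>% discriminator()
gan <- keras_model(gan_input, gan_output)

gan %>% compile(
  optimizer = optimizer_rmsprop(learning_rate = 0.0004),
  loss = "binary_crossentropy"
)
```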

In this context, it is important to freeze the weights of the Discriminator.

The reason is that while training the combined GAN, the Generator weights are updated in such a way that the probability increases that the Discriminator classifies fake images as original ones. To achieve this, the fake images are deliberately labelled as “original” in this training step. If the Discriminator’s weights were not frozen, this step would also train the Discriminator itself to classify generated fake images as original, degrading it into a classifier that accepts everything. In any case, a result which is not intended.

Training the Model

After some preparation (e.g. setting the batch size, fixing the number of iterations, normalising the data values and similar), the GAN model is ready to be trained.

code snippet by author
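As the original loop is shown as an image, here is a minimal sketch of an alternating training loop of this kind. It assumes the original images sit in the train_x array from above, normalised to [0, 1], and uses the abind package to stack real and fake batches; all names and hyper-parameters are illustrative:

```r
library(abind)

iterations <- 100
batch_size <- 20
start <- 1

for (step in 1:iterations) {
  # 1. Sample random points in the latent space: the "white noise"
  random_latent <- matrix(rnorm(batch_size * latent_dim),
                          nrow = batch_size, ncol = latent_dim)

  # 2. Decode them into fake images
  generated_images <- generator %>% predict(random_latent)

  # 3. Mix the fakes with a batch of real images and train the
  #    Discriminator to tell them apart (1 = fake, 0 = original)
  stop <- start + batch_size - 1
  real_images <- train_x[start:stop, , , , drop = FALSE]
  combined_images <- abind(generated_images, real_images, along = 1)
  labels <- rbind(matrix(1, nrow = batch_size, ncol = 1),
                  matrix(0, nrow = batch_size, ncol = 1))
  d_loss <- discriminator %>% train_on_batch(combined_images, labels)

  # 4. Train the Generator through the frozen-Discriminator GAN with
  #    misleading "these are original" labels
  random_latent <- matrix(rnorm(batch_size * latent_dim),
                          nrow = batch_size, ncol = latent_dim)
  misleading_targets <- matrix(0, nrow = batch_size, ncol = 1)
  g_loss <- gan %>% train_on_batch(random_latent, misleading_targets)

  # Move to the next batch of real images, wrapping around at the end
  start <- start + batch_size
  if (start + batch_size - 1 > nrow(train_x)) start <- 1
}
```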

As you can see, the whole procedure starts with randomly generated points in the latent space. This is the colour rubble (in this case, the greyscale rubble) we were talking about in the beginning. Statistically, it is called “White Noise”; it is drawn from a Normal Distribution and could be seen in the graph above.

The Results

The task was to generate digits “7” which should be similar to the original dataset. The model was trained for 100 iterations. So, let’s see how this worked out:

image by author

As you can see, the training process literally starts with white noise but improves quite fast. After just 100 iterations, we already get quite usable results for the digit “7”.
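For completeness, a grid of generated digits like the one above could be rendered along these lines, using the trained Generator from the sketches (purely illustrative):

```r
# Sample nine latent points and decode them into fake "7"s
noise <- matrix(rnorm(9 * latent_dim), nrow = 9, ncol = latent_dim)
fake_images <- generator %>% predict(noise)

op <- par(mfrow = c(3, 3), mar = c(0, 0, 0, 0))
for (i in 1:9) {
  img <- fake_images[i, , , 1]
  # Rescale to [0, 1] so as.raster() accepts the pixel values
  img <- (img - min(img)) / (max(img) - min(img))
  plot(as.raster(img))
}
par(op)
```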

Conclusion

The great thing about synthetic data is that it is possible to generate data quite similar to the original data by probabilistically approximating the properties of the original data. At the same time, it is highly improbable to retrace the original data via the synthetic data.

Why? Because we have techniques that start out with a bunch of colours, i.e. pure noise, without ever getting close to the original data directly. This is what we showed in this short example.

Of course, the model shown here is a rather simple one. And the task of generating synthetic greyscale images resembling handwritten digits is not a particularly difficult one either.

Still, a lot of progress has been made in this area. The topic of synthetic data impacts many high-tech industries, e.g. synthetic data generation for autonomous driving, synthetic data helping to detect credit card fraud, or synthetic data helping to train image classifiers in the healthcare industry.

photo credits: nature biomedical engineering

In this respect, the available models have also kept up with the pace of development. Techniques such as transfer learning have pushed the state of the art further and are of immense support when generating synthetic data.

But this is another story.

References

François Chollet and J.J. Allaire: Deep Learning mit R und Keras, 2018

Dr. Bharatendra Rai: Generative Adversarial Networks (GANs) with R, YouTube series, February 23, 2020

Richard J. Chen et al.: Synthetic data in machine learning for medicine and healthcare, Nature Biomedical Engineering, June 15, 2021

Kiran Jabeen et al.: Breast Cancer Classification from Ultrasound Images Using Probability-Based Optimal Deep Learning Feature Fusion, Sensors (MDPI), January 21, 2022


Christian Schitton

Combining Real Estate Investment & Finance expertise with advanced predictive analytics modelling. Created risk algorithms introducing data-driven investing.