A machine-learning model for image classification trained entirely on synthetic data can perform as well as one trained on real data, according to a new study.

It takes huge amounts of data to train machine-learning models to perform image classification tasks, such as recognizing damage in satellite images after a natural disaster. Those data are not always easy to come by, they can be expensive to collect and label, and even the most carefully curated datasets often contain biases that hurt a model’s performance.

Researchers at MIT have devised a way to sidestep the dataset entirely. Instead, they use a special kind of machine-learning model, a generative model, to produce highly realistic synthetic data that can then train another model for downstream vision tasks.

Their results show that a contrastive representation learning model trained only on synthetic data can learn visual representations that rival, and sometimes exceed, those learned from real data.

A generative model also requires far less memory to store or share than a dataset does. Its synthetic data can sidestep the privacy and usage-rights concerns that limit how real data are distributed, and the model can be edited to remove certain attributes, such as race or gender, which could help address biases found in traditional datasets.

“We knew this method would work. We just had to wait for the generative models to improve. We were particularly pleased to show that this method sometimes does better than the real thing,” says Ali Jahanian, a researcher in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and lead author of the paper.

Jahanian co-authored the paper with CSAIL graduate students Xavier Puig and Yonglong Tian and senior author Phillip Isola, an assistant professor of electrical engineering and computer science. The research will be presented at the International Conference on Learning Representations.

Generating synthetic data

Once a generative model has been trained on real data, it can produce synthetic data that are nearly indistinguishable from the real thing. Training means showing the model millions upon millions of images containing objects of a particular class, such as cats or cars, until it learns what a cat or car looks like well enough to generate similar objects itself.

With a pretrained generative model in hand, Jahanian explains, researchers can essentially flip a switch and generate a stream of realistic images that reflect the model’s training dataset.
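As a rough sketch of what that sampling step looks like in practice, the PyTorch snippet below draws random latent codes and decodes them into a batch of images. The `ToyGenerator` is a tiny stand-in invented purely for illustration; the study used a much larger generative model pretrained on real images.

```python
import torch
import torch.nn as nn

# Toy stand-in for a large pretrained generator (e.g. a GAN). A real generator
# maps a random latent vector z to a photorealistic image; this tiny network
# only mimics the interface so the example stays self-contained.
class ToyGenerator(nn.Module):
    def __init__(self, latent_dim=128, img_size=64):
        super().__init__()
        self.latent_dim = latent_dim
        self.img_size = img_size
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 3 * img_size * img_size),
            nn.Tanh(),  # keep pixel values in [-1, 1], as many GANs do
        )

    def forward(self, z):
        return self.net(z).view(-1, 3, self.img_size, self.img_size)

generator = ToyGenerator()
generator.eval()

# "Flipping the switch": draw random latent codes and decode them into images.
with torch.no_grad():
    z = torch.randn(16, generator.latent_dim)
    synthetic_images = generator(z)  # tensor of shape (16, 3, 64, 64)
```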

Generative models are even more useful, he says, because they learn to transform the underlying data they were trained on. A model trained on images of cars can “imagine how a car might look in different situations,” situations it never saw during training, and then output images showing the car in unique poses and colors.
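One way to picture this is as decoding nearby points in the generator’s latent space: small steps around a latent code yield different “views” of the same underlying sample. The sketch below uses a simple Gaussian perturbation, which is an assumption made for illustration rather than the exact steering strategy the researchers used, and it reuses the toy generator from the previous snippet.

```python
import torch

def latent_views(generator, z, num_views=2, sigma=0.5):
    """Decode several nearby latent codes so each batch element appears in
    slightly different 'situations' (pose, color, and so on). The Gaussian
    perturbation is a simplification, not the paper's exact steering method."""
    views = []
    with torch.no_grad():
        for _ in range(num_views):
            z_near = z + sigma * torch.randn_like(z)  # step to a nearby latent code
            views.append(generator(z_near))
    return views  # list of image batches, one batch per view

# Example: two views of the same 16 underlying samples.
z = torch.randn(16, generator.latent_dim)
view1, view2 = latent_views(generator, z, num_views=2)
```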

This matters because of contrastive learning, a technique in which a machine-learning model is shown many unlabeled images and learns which pairs are views of the same thing and which are different.
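A common way to implement contrastive learning is an InfoNCE-style loss, in which embeddings of two views of the same image are pulled together while all other pairings in the batch are pushed apart. The snippet below shows a generic version of that loss; it is not necessarily the exact objective used in the paper.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """Generic InfoNCE contrastive loss. z1[i] and z2[i] are embeddings of two
    views of image i; matching rows are positives, every other row in the
    batch serves as a negative."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                      # cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)    # positives on the diagonal
    return F.cross_entropy(logits, targets)
```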

The researchers connected a pretrained generative model and a contrastive learning model so the two could work together seamlessly. The contrastive learner, Jahanian says, can ask the generative model for different views of an object and learn to identify that object from multiple angles.

“It was like connecting two blocks,” he says. “The generative model can provide us with different views of the same thing, and that can help the contrastive approach learn better representations.”
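Concretely, connecting the two blocks might look like the minimal training loop below, which samples fresh latent codes at each step, asks the generator for two views of every sample, and trains an encoder with the contrastive loss. It reuses the toy generator, `latent_views`, and `info_nce_loss` from the earlier sketches, and the linear encoder is only a stand-in for a real vision backbone such as a ResNet.

```python
import torch
import torch.nn as nn

# Toy encoder standing in for a real vision backbone (e.g. a ResNet).
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

for step in range(100):
    # Fresh latent codes every step: the generator acts like an endless dataset.
    z = torch.randn(16, generator.latent_dim)
    view1, view2 = latent_views(generator, z, num_views=2)

    # Pull embeddings of the two views together, push other pairings apart.
    loss = info_nce_loss(encoder(view1), encoder(view2))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because new latent codes are drawn at every step, the generator behaves like a dataset of effectively unlimited size.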

It’s even better than the real deal

The researchers compared their method with several image classification models trained on real data and found that it performed just as well as, and in some cases better than, those models.

One advantage of a generative model is that it can, in theory, produce an unlimited number of samples. The researchers therefore studied how the number of samples affected performance and found that, in some cases, generating more unique samples yielded additional gains.

These generative models are often trained by someone else and shared in online repositories where anyone can access them. Jahanian notes that the model does not need to be modified to obtain good representations.