Generative AI Models Explained

Reading time: 14 minutes

Take a look at the featured image above. Beautiful, isn’t it? The interesting thing is, it isn’t a painting drawn by some famous artist, nor is it a photo taken by a satellite. The image you see has been generated with the help of Midjourney — a proprietary artificial intelligence program that creates pictures from textual descriptions.

Neural nets can create images, video, and audio content that not every person can

We just typed a few word prompts and the program generated the pic representing those words. This is something known as text-to-image translation and it’s one of many examples of what generative AI models do.

The hype about generative AI is huge and it continues to grow. Gartner included generative AI in its Emerging Technologies and Trends Impact Radar for 2022 report as one of the most impactful and rapidly evolving technologies, one that promises a productivity revolution.

Here are some of the key Gartner predictions concerning generative AI.

  • By 2025, generative AI will be producing 10 percent of all data (now it’s less than 1 percent) with 20 percent of all test data for consumer-facing use cases.
  • By 2025, generative AI will be used by 50 percent of drug discovery and development initiatives.
  • By 2027, 30 percent of manufacturers will use generative AI to enhance their product development effectiveness.

It would be a big oversight on our part not to pay due attention to the topic. So, this post explains what generative AI models are, how they work, and what practical applications they have in different areas.

What is generative AI and why should you care?

Generative AI refers to unsupervised and semi-supervised machine learning algorithms that enable computers to use existing content like text, audio and video files, images, and even code to create new possible content. The main idea is to generate completely original artifacts that would look like the real deal.

Generative AI that draws pictures from word prompts be like…

Jokes aside, generative AI allows computers to abstract the underlying patterns related to the input data so that the model can generate or output new content.

Currently, there are two widely used types of generative AI models, and we’re going to scrutinize both.

  • Generative Adversarial Networks, or GANs: technologies that can create visual and multimedia artifacts from both imagery and textual input data.
  • Transformer-based models: technologies such as Generative Pre-trained Transformer (GPT) language models that can use information gathered on the Internet to create textual content, from website articles to press releases to whitepapers.

In the intro, we gave a few cool insights that show the bright future of generative AI. The potential of generative AI and GANs in particular is huge because this technology can learn to mimic any distribution of data. That means it can be taught to create worlds that are eerily similar to our own and in any domain.

In logistics and transportation, which highly rely on location services, generative AI may be used to accurately convert satellite images to map views, enabling the exploration of yet uninvestigated locations.

In the travel industry, generative AI can provide a big help for face identification and verification systems at airports by creating a full-face picture of a passenger from photos previously taken from different angles and vice versa.

In healthcare, X-rays or CT scans can be converted to photo-realistic images with the help of sketches-to-photo translation using GANs. In this way, dangerous diseases like cancer may be diagnosed at an earlier stage thanks to better image quality.

In marketing, generative AI can help with client segmentation by learning from the available data to predict the response of a target group to advertisements and marketing campaigns. It can also synthetically generate outbound marketing messages to enhance upselling and cross-selling strategies.

Although it may look that way, generative AI doesn’t do all these fantastic things by magic: it must be modeled to make it capable of creating artifacts from real-world content. And here’s how.

Discriminative vs generative modeling

To understand the idea behind generative AI, we need to take a look at the distinctions between discriminative and generative modeling.

Discriminative modeling is used to classify existing data points (e.g., images of cats and guinea pigs into respective categories). It mostly belongs to supervised machine learning tasks.

Generative modeling tries to understand the dataset structure and generate similar examples (e.g., creating a realistic image of a guinea pig or a cat). It mostly belongs to unsupervised and semi-supervised machine learning tasks.

Supervised and unsupervised learning in a nutshell

The more neural networks intrude on our lives, the more the areas of discriminative and generative modeling grow. Let’s discuss each in more detail.

Discriminative modeling

Most machine learning models are used to make predictions. Discriminative algorithms try to classify input data given some set of features and predict a label or a class to which a certain data example belongs.

Say we have training data that contains multiple images of cats and guinea pigs. Each image is called a sample. Each sample has input features (X) and an output class label (Y). We also have a neural net that looks at an image and tells whether it’s a guinea pig or a cat, paying attention to the features that distinguish them.

Discriminative modeling

Let’s limit the difference between cats and guinea pigs to just two features x (for example, “the presence of the tail” and “the size of the ears”). Since each feature is a dimension, it’s easy to present them in a 2-dimensional data space. In the viz above, the blue dots are guinea pigs and the red dots are cats. The line depicts the decision boundary that the discriminative model learned in order to separate cats from guinea pigs based on those features.

When this model is already trained and used to tell the difference between cats and guinea pigs, it, in some sense, just “recalls” what the object looks like from what it has already seen.

So, if you show the model an image from a completely different class, for example, a flower, it may still say that it’s a cat with some level of probability. During training, the predicted output (ŷ) is compared to the expected output (y) from the training dataset. Based on the comparison, we can figure out how and what in an ML pipeline should be updated to create more accurate outputs for given classes.

To recap, the discriminative model kind of compresses information about the differences between cats and guinea pigs, without trying to understand what a cat is and what a guinea pig is.
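To make the idea of a learned decision boundary more concrete, here is a minimal sketch in plain Python. It is only an illustration under our own assumptions: the two features (a normalized tail length and ear size), the sample values, and the function names are all hypothetical, and a simple logistic regression stands in for the neural net.

```python
import math
import random

def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def make_samples(n_per_class, seed=0):
    """Hypothetical samples: ((tail_length, ear_size), label), 1 = cat, 0 = guinea pig."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_per_class):
        samples.append(((rng.gauss(0.8, 0.1), rng.gauss(0.6, 0.1)), 1))
        samples.append(((rng.gauss(0.1, 0.05), rng.gauss(0.3, 0.1)), 0))
    return samples

def train(samples, lr=0.5, epochs=200):
    """Fit the boundary w1*x1 + w2*x2 + b = 0 by stochastic gradient descent."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in samples:
            y_hat = sigmoid(w1 * x1 + w2 * x2 + b)  # predicted output
            error = y_hat - y                        # compared to the expected output
            w1 -= lr * error * x1
            w2 -= lr * error * x2
            b -= lr * error
    return w1, w2, b

def cat_probability(params, x1, x2):
    """Probability that the point (x1, x2) is a cat, according to the model."""
    w1, w2, b = params
    return sigmoid(w1 * x1 + w2 * x2 + b)

params = train(make_samples(50))
print("cat-like point:       ", cat_probability(params, 0.8, 0.6))
print("guinea-pig-like point:", cat_probability(params, 0.1, 0.3))
print("unrelated point:      ", cat_probability(params, 3.0, -2.0))
```

Note that the last call still returns a probability even though the point resembles neither animal: the model only knows which side of the boundary a point falls on, not what a cat or a guinea pig actually is.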

Generative modeling

Generative algorithms do the complete opposite: instead of predicting a label given some features, they try to predict features given a certain label. Discriminative algorithms care about the relations between x and y; generative models care about how you get x.

Generative modeling

Mathematically, generative modeling allows us to capture the probability of x and y occurring together. It learns the distribution of individual classes and features, not the boundary.

Getting back to our example, generative models help answer the question of what is the “cat itself” or “guinea pig itself.” The viz shows that a generative model can predict not only all the tail and ear features of both species but also other features from a class. This means it learns features and their relations to get an idea of what those animals look like in general.

And if the model knows what kinds of cats and guinea pigs there are in general, then their differences are also known. Such algorithms can learn to recreate images of cats and guinea pigs, even those that were not in the training set.
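As a rough illustration of learning the distribution rather than the boundary, the sketch below fits a very simple generative model: a class prior p(y) plus one Gaussian per feature for p(x | y). The feature values, the independence assumption between features, and all function names are our own assumptions for the example, not a description of how real image generators work.

```python
import math
import random
import statistics

def fit_generative(samples):
    """Estimate p(y) and a per-feature Gaussian p(x | y) for each class."""
    model = {}
    labels = [y for _, y in samples]
    for label in set(labels):
        rows = [x for x, y in samples if y == label]
        model[label] = {
            "prior": len(rows) / len(samples),
            "mean": [statistics.mean(col) for col in zip(*rows)],
            "std": [statistics.stdev(col) for col in zip(*rows)],
        }
    return model

def joint_density(model, x, label):
    """p(x, y) = p(y) * p(x | y), with features treated as independent."""
    params = model[label]
    density = params["prior"]
    for value, mean, std in zip(x, params["mean"], params["std"]):
        z = (value - mean) / std
        density *= math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))
    return density

def sample_features(model, label, rng):
    """Generate a new feature vector for the given class."""
    params = model[label]
    return [rng.gauss(m, s) for m, s in zip(params["mean"], params["std"])]

# Hypothetical training data: (tail_length, ear_size) per animal.
rng = random.Random(0)
training = (
    [((rng.gauss(0.8, 0.1), rng.gauss(0.6, 0.1)), "cat") for _ in range(50)]
    + [((rng.gauss(0.1, 0.05), rng.gauss(0.3, 0.1)), "guinea pig") for _ in range(50)]
)
model = fit_generative(training)
new_cat = sample_features(model, "cat", rng)
print("generated cat features:", new_cat)
```

Because the model captures the joint probability of features and labels, it can both generate a new sample for a class and, by comparing `joint_density` across classes, tell the classes apart.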

A generative algorithm aims to model the process holistically, without discarding any information. You may wonder, “Why do we need discriminative algorithms at all?” The fact is that often a more specific discriminative algorithm solves the problem better than a more general generative one.

But still, there is a wide class of problems where generative modeling allows you to get impressive results. Examples include such breakthrough technologies as GANs and transformer-based algorithms.

Generative Adversarial Networks

A generative adversarial network or GAN is a machine learning algorithm that pits two neural networks, a generator and a discriminator, against each other, hence the “adversarial” part. The contest between the two neural networks takes the form of a zero-sum game, where one agent’s gain is another agent’s loss.

GANs were invented by Ian Goodfellow and his colleagues at the University of Montreal in 2014. They described the GAN architecture in the paper titled “Generative Adversarial Networks.” Since then, there has been a lot of research and practical applications, making GANs one of the most popular generative AI models.

GAN architecture

In their architecture, GANs have two sub-models:

  • generator: a neural net whose job is to create fake samples from a random input vector (a list of randomly drawn numbers, often called noise); and
  • discriminator: a neural net whose job is to take a given sample and decide if it’s a fake sample from the generator or a real sample from the domain.

The discriminator is basically a binary classifier that returns a probability, a number between 0 and 1. The closer the result is to 0, the more likely the sample is to be fake. Conversely, numbers closer to 1 indicate a higher likelihood that the sample is real.
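To see how those probabilities turn into a contest, here is a small sketch of the losses from the original minimax formulation of GANs. The probability values passed in are hypothetical discriminator outputs chosen for illustration, and the function names are ours.

```python
import math

def discriminator_loss(p_real, p_fake):
    """Binary cross-entropy: real samples should score near 1, fakes near 0."""
    return -(math.log(p_real) + math.log(1.0 - p_fake))

def generator_loss(p_fake):
    """Minimax form: the loss falls as the discriminator scores fakes closer to 1."""
    return math.log(1.0 - p_fake)

# p_real and p_fake are the discriminator's outputs for one real and one fake sample.
for p_fake in (0.1, 0.5, 0.9):
    print(p_fake, discriminator_loss(0.9, p_fake), generator_loss(p_fake))
```

As the discriminator's score for the fake sample rises, the generator's loss goes down by exactly the amount the discriminator's loss goes up, which is the zero-sum part of the game.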

Both a generator and a discriminator are often implemented as CNNs (Convolutional Neural Networks), especially when working with images.

So, the adversarial nature of GANs lies in a game theoretic scenario in which the generator network must compete against the adversary. The generator network directly produces fake samples. Its adversary, the discriminator network, makes attempts to distinguish between samples drawn from the training data and samples drawn from the generator. In this scenario, there’s always a winner and a loser. Whichever network failed is updated while its rival remains unchanged.

A GAN is considered successful when the generator creates a fake sample so convincing that it fools both the discriminator and humans. But the game doesn’t stop there, as it’s then time for the discriminator to be updated and get better. Repeat.
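The back-and-forth described above can be sketched as a toy training loop. Everything here is an assumption made for illustration: the “real” data is a one-dimensional Gaussian centered at 4, the generator is the linear function G(z) = a*z + b, the discriminator is a single logistic unit, and the gradients are written out by hand. The sketch also uses the common simplification of updating each network in turn on every step, rather than updating only whichever one lost, and trains the generator with the widely used non-saturating loss -log D(G(z)).

```python
import math
import random

def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def train_toy_gan(steps=2000, lr=0.05, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0   # generator G(z) = a * z + b
    w, c = 0.0, 0.0   # discriminator D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        x_real = rng.gauss(4.0, 1.0)   # sample from the (hypothetical) real domain
        z = rng.gauss(0.0, 1.0)        # random input vector (noise)
        x_fake = a * z + b             # fake sample from the generator

        # Discriminator step: push D(x_real) toward 1 and D(x_fake) toward 0.
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        w -= lr * ((d_real - 1.0) * x_real + d_fake * x_fake)
        c -= lr * ((d_real - 1.0) + d_fake)

        # Generator step: push D(G(z)) toward 1 by descending -log D(G(z)).
        d_fake = sigmoid(w * x_fake + c)
        a -= lr * (d_fake - 1.0) * w * z
        b -= lr * (d_fake - 1.0) * w
    return (a, b), (w, c)

(a, b), (w, c) = train_toy_gan()
print("generator parameters a, b:", a, b)
print("discriminator score for x = 4:", sigmoid(w * 4.0 + c))
```

In a setup like this, the generator's offset b tends to be pulled toward the region where real samples live, although toy GANs are known to oscillate rather than settle neatly, so the exact numbers should not be read as a benchmark.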

Transformer-based models

First described in a 2017 paper from Google, transformers are powerful deep neural networks that learn context and therefore meaning by tracking relationships in sequential data like the words in this sentence. That’s why this technology is often used in NLP (Natural Language Processing) tasks.

Some of the most well-known examples of transformers are GPT-3 and LaMDA.

GPT-3 is a series of deep learning language models built by OpenAI, a San Francisco-based artificial intelligence research laboratory. GPT stands for generative pre-trained transformer, and the 3 means that this is the third generation of those models. The model can produce text that looks like it was written by a human: it can write poetry, craft emails, and even crack jokes.

LaMDA (Language Model for Dialogue Applications) is a family of conversational neural language models built on the Transformer, a neural network architecture for natural language understanding that Google open-sourced.

A transformer is something that transforms one sequence into another. Transformers are typically trained in a semi-supervised way, meaning they are pre-trained in an unsupervised manner using a large unlabeled dataset and then fine-tuned through supervised training to perform better.

Transformer model with encoders and decoders

A typical transformer consists of two parts.

The encoder works on the input sequence. It extracts all features from a sequence, converts them into vectors (e.g., vectors representing the semantics and position of a word in a sentence), and then passes them to the decoder.

The decoder works on the target output sequence. Each decoder receives the encoder layer outputs, derives context from them, and generates the output sequence.

Both the encoder and the decoder in the transformer consist of multiple blocks (encoder blocks and decoder blocks, respectively) stacked on top of one another. The output of one block becomes the input of the next.

Transformers work through sequence-to-sequence learning, where the transformer takes a sequence of tokens, for example, words in a sentence, and predicts the next word in the output sequence. It does this by iterating through the encoder layers.

Transformer models use mechanisms called attention, or self-attention, to detect the subtle ways in which even distant data elements in a series influence and depend on each other.

These techniques provide context around items in the input sequence. So, instead of paying attention to each word separately, the transformer attempts to identify the context that brings meaning to each word of the sequence.
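For readers who want to see the mechanics, below is a minimal sketch of scaled dot-product self-attention in plain Python. It makes a big simplification that we want to flag: the queries, keys, and values are all just the raw token embeddings, whereas real transformers first pass the embeddings through learned projection matrices. The three tokens and their 4-dimensional embeddings are made up for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(embeddings):
    """Scaled dot-product self-attention with queries = keys = values = embeddings."""
    d = len(embeddings[0])
    outputs, all_weights = [], []
    for query in embeddings:
        # Score every token in the sequence against the current one.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in embeddings
        ]
        weights = softmax(scores)
        all_weights.append(weights)
        # The output is a weighted mix of all token vectors: the "context".
        outputs.append([
            sum(wt * value[i] for wt, value in zip(weights, embeddings))
            for i in range(d)
        ])
    return outputs, all_weights

# Hypothetical 4-dimensional embeddings for a three-token sequence.
tokens = ["the", "cat", "sat"]
embeddings = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
]
outputs, weights = self_attention(embeddings)
for token, row in zip(tokens, weights):
    print(token, [round(wt, 3) for wt in row])
```

Each printed row shows how much one token “attends” to every token in the sequence, including distant ones, and each output vector blends the whole sequence according to those weights.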

On top of that, transformers can run multiple sequences in parallel, which speeds up the training phase.

Types of generative AI applications with examples

Generative AI has a plethora of practical applications in different domains such as computer vision where it can enhance the data augmentation technique. The potential of generative model use is truly limitless. Below you will find a few prominent use cases that already present mind-blowing results.

Image generation

The most prominent use case of generative AI is creating fake images that look like real ones. For example, in 2017, Tero Karras — a Distinguished Research Scientist at NVIDIA Research — published a paper titled “Progressive Growing of GANs for Improved Quality, Stability, and Variation.”

Generated realistic images of people that don’t exist. Source: Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017

In this paper, he demonstrated the generation of realistic photographs of human faces. The model was trained on the input data containing real pictures of celebrities and then it produced new realistic photos of people’s faces that had some features of celebrities, making them seem familiar. Say, the girl in the second top right picture looks a bit like Beyoncé but, at the same time, we can see that it’s not the pop singer.

Image-to-image translation

As the name suggests, here generative AI transforms one type of image into another. There’s an array of image-to-image translation variations.

Style transfer. This task involves extracting the style from a famous painting and applying it to another image. For example, we can take a photo we took in Cologne, Germany, and convert it into the Van Gogh painting style.

A photo in the Van Gogh painting style using GoArt from Fotor

Sketches-to-realistic images. Here, a user starts with a sparse sketch and the desired object category, and the network then recommends its plausible completion(s) and shows a corresponding synthesized image.

Sketch-to-image example. Source: DeepFaceDrawing: Deep Generation of Face Images from Sketches

One of the papers discussing this technology is “DeepFaceDrawing: Deep Generation of Face Images from Sketches.” It was published in 2020 by a team of researchers from China. It describes how simple portrait sketches can be transformed into realistic photos of people.

MRI into CT scans. In healthcare, one example is the transformation of an MRI image into a CT scan, because some therapies require images of both modalities. But CT, especially when high resolution is needed, exposes the patient to a fairly high dose of radiation. Therefore, you can perform only an MRI and synthesize a CT image from it.

Text-to-image translation

This approach implies producing various images (realistic, painting-like, etc.) from textual descriptions of simple objects. Remember our featured image? That’s an example of text-to-image translation. The most popular programs based on generative AI models are the aforementioned Midjourney, DALL-E from OpenAI, and Stable Diffusion.

To make the picture you see below we provided Stable Diffusion with the following word prompts: a dream of time gone by, oil painting, red blue white, canvas, watercolor, koi fish, and animals. The result isn’t perfect yet quite impressive, taking into account that we didn’t have access to the original beta version with a wider set of features but used a third-party tool.

The result of using Stable Diffusion on Dezgo

The results of all these programs are pretty similar, although some users note that, on average, Midjourney draws a little more expressively while Stable Diffusion follows the request more closely at default settings.


Text-to-speech

Researchers have also used GANs to produce synthesized speech from text input. Advanced deep learning technologies, such as Amazon Polly and speech models from DeepMind, synthesize natural-sounding human speech. Such models operate directly on character or phoneme input sequences and produce raw speech audio outputs.

Audio generation

Audio data can also be processed by generative AI. To do this, you first need to convert audio signals to image-like 2-dimensional representations called spectrograms. This allows for using algorithms specifically designed to work with images like CNNs for our audio-related task.
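Here is a minimal sketch of how an audio signal can be turned into such a 2-dimensional representation. It slices the signal into overlapping frames, applies a Hann window, and takes the magnitude of a discrete Fourier transform for each frame. The test tone, sample rate, frame size, and hop length are all assumptions picked for the example, and production code would normally rely on an FFT library instead of this slow direct transform.

```python
import cmath
import math

def spectrogram(signal, frame_size=64, hop=32):
    """Return a list of frames, each a list of magnitudes per frequency bin."""
    window = [
        0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
        for n in range(frame_size)
    ]
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_size)]
        magnitudes = []
        for k in range(frame_size // 2 + 1):  # non-negative frequencies only
            coeff = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n in range(frame_size)
            )
            magnitudes.append(abs(coeff))
        frames.append(magnitudes)
    return frames

sample_rate = 8000  # Hz, assumed
# A pure 1000 Hz tone, 512 samples long, standing in for real audio.
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(512)]
spec = spectrogram(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print("frames:", len(spec), "bins per frame:", len(spec[0]))
print("strongest frequency in first frame (Hz):", peak_bin * sample_rate / 64)
```

The result is a grid of time frames by frequency bins, which is why it can be handed to image-oriented models in much the same way as a grayscale picture.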

A spectrogram example. Source: Towards Data Science

Using this approach, you can transform people’s voices or change the style/genre of a piece of music. For example, you can “transfer” a piece of music from a classical to a jazz style.

In 2022, Apple acquired the British startup AI Music to enhance Apple’s audio capabilities. The technology developed by the startup allows for creating soundtracks using free public music processed by the AI algorithms of the system. The main task is to perform audio analysis and create “dynamic” soundtracks that can change depending on how users interact with them. For example, the music may change according to the atmosphere of a game scene or the intensity of the user’s workout in the gym.

Video generation

Video is a sequence of moving visual images, so logically, videos can be generated and converted in much the same way as images. One of the most prominent use cases is video frame prediction. If we take a particular video frame from a video game, GANs can be used to predict what the next frame in the sequence will look like and generate it.

NVIDIA, a pioneer of generative AI advances, presented DLSS (Deep Learning Super Sampling), a neural graphics technology for reconstructing images. The 3rd generation of DLSS increases performance for all GeForce RTX GPUs, using AI to create entirely new frames and display higher resolution through image reconstruction.

Basically, it outputs higher resolution frames from a lower resolution input. DLSS samples multiple lower-resolution images and uses motion data and feedback from prior frames to reconstruct native-quality images.

But that’s not all.

The icing on the cake? There are artifacts like PAC-MAN and GTA that resemble real gameplay and are completely generated by artificial intelligence.

In this video, you can see how a person is playing a neural network’s version of GTA 5. The game environment was created using a GameGAN fork based on NVIDIA’s GameGAN research.

Image and video resolution enhancement

If we have a low-resolution image, we can use a GAN to create a much higher-resolution version of it: the model infers what each individual pixel most likely represents and then generates the image at a higher resolution.

It’s totally fine if you feel like this right now (BTW, the meme resolution has been also upscaled using Generative AI)

We can enhance images from old movies, upscaling them to 4K and beyond, generating more frames per second (e.g., 60 fps instead of 23), and adding color to black-and-white movies.

Synthetic data generation

While we live in a world overflowing with data that is continuously generated in great amounts, the problem of getting enough data to train ML models remains. When we say “enough data,” we mean enough high-quality data. Acquiring enough samples for training is a time-consuming, costly, and often impossible task. The solution to this problem can be synthetic data, which generative AI can produce.

As we already mentioned, NVIDIA is making many breakthroughs in generative AI technologies. One of them is a neural network trained on videos of cities to render urban environments.

NVIDIA’s Interactive AI Rendered Virtual World

Such synthetically created data can help in developing self-driving cars as they can use generated virtual world training datasets for pedestrian detection, for example.

The dark side of generative AI: Is it that dark?

Whatever the technology, it can be used for both good and bad. Of course, generative AI is no exception. There are a couple of challenges that exist at the moment.

Pseudo-images and deep fakes. Initially created for entertainment purposes, the deep fake technology has already gotten a bad reputation. Being available publicly to all users via such software as FakeApp, Reface, and DeepFaceLab, deep fakes have been employed by people not only for fun but for malicious activities too.

For example, in March 2022, a deep fake video of Ukrainian President Volodymyr Zelensky telling his people to surrender was broadcast on a hacked Ukrainian news outlet. Though it could be seen with the naked eye that the video was fake, it reached social media and was used for manipulation.

Hard to control. When we say this, we do not mean that tomorrow machines will rise up against humanity and destroy the world. Let’s be honest, we’re pretty good at it ourselves. But due to the fact that generative AI can self-learn, its behavior is difficult to control. The outputs provided can often be far from what you expect.

But as we know, without challenges, technology would be incapable of developing and growing. Besides, such things as responsible AI make it possible to avoid or substantially reduce the drawbacks of innovations like generative AI.

By the way, don’t worry: The post you have just read wasn’t generated by AI.

Or was it?
