Lensa AI. The Good, the Bad and the Ugly.

It's 2023. The digital age we kept hearing about is slowly taking shape. Apps like Instagram have peeked into the minds of the younger generation, me included. You may have recently noticed your friends or your favorite celebrities flexing pictures that look facially retouched or algorithmically altered, and we can all agree the results are eye-catching.

This feat of innovation is achieved by Lensa, the AI photo app. Owned and operated by Prisma Labs, it is one of the apps making a lot of buzz these days. It's fun, and it marks the first time a vast majority of people have interacted with a generative AI tool. So, as a tech geek, I was intrigued to understand how this was achieved.

Lensa works on a concept called stable diffusion. Stable Diffusion is a free, open-source model used for image generation; Lensa acts as the middleman. You send Lensa 10-20 images, and in turn it returns a set of stylized portraits of you as a Jedi, an anime villain, or any other avatar you want to picture yourself as. The results are marvelous, and so is the process used to achieve them.

Stable Diffusion is used to create striking images from text descriptions. It is a real shift in the way humans create art. Looking under the hood, Stable Diffusion is made of several components. At a high level, it has a Text Encoder and an Image Generator. The Text Encoder is a special transformer language model (the CLIP model) that takes input text and outputs an array of numbers for each word or token in the text.

[Image: Stable Diffusion at a high level]
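To make the Text Encoder concrete, here is a minimal sketch using the openly available CLIP text encoder from Hugging Face's transformers library. The checkpoint name is the public CLIP release that open-source Stable Diffusion v1 builds on; whether Lensa uses this exact checkpoint is an assumption.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Public CLIP checkpoint; assumed here for illustration, not confirmed as Lensa's.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "Cat on a chair"
tokens = tokenizer(prompt, padding="max_length", max_length=77, return_tensors="pt")
with torch.no_grad():
    embeddings = text_encoder(tokens.input_ids).last_hidden_state

print(embeddings.shape)  # torch.Size([1, 77, 768]) -- one vector per token position
```

Each of the 77 token positions gets a 768-dimensional vector, and this array of numbers is what the Image Generator consumes.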


The Image Generator has two stages. The first is the Image Information Creator stage. This is the secret ingredient of Stable Diffusion, and it is where most of the optimization of the past few years has landed. It is made up of a UNet neural network and a scheduling algorithm, and it works entirely in the image information space (the latent space). The second stage is the Image Decoder stage, where the model paints a picture from the information it receives from the previous stage to produce the final pixel image. So we see three main components in the whole process, sketched in code after the figure below:

1. ClipText for text encoding.

2. UNet + Scheduler to gradually process the information in the latent space.

3. Autoencoder Decoder that paints the final image using the processed information array.

[Image: Stable Diffusion with Image Generator components]
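For the curious, here is a hedged sketch of how these three components map onto the open-source Stable Diffusion weights published on the Hugging Face Hub, via the diffusers library. The checkpoint is one public release, not necessarily the one Lensa runs.

```python
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler

repo = "runwayml/stable-diffusion-v1-5"  # one public release, assumed for illustration

tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")        # 1. ClipText
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")  # 1. ClipText
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")           # 2. UNet
scheduler = PNDMScheduler.from_pretrained(repo, subfolder="scheduler")        # 2. Scheduler
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")                    # 3. Autoencoder decoder
```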

So we can come to the understanding that most of the actual diffusion takes place inside the Image Information Creator. To understand this, let's look at how the model is trained. The major ingredient here is noise. We take an image from the training set, generate some random noise, choose an intensity (noise level) for it, and add it to the image. This process is called forward diffusion.

[Image: The Image Information Creator stage]
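Here is a minimal sketch of forward diffusion, using the diffusers DDPMScheduler as one convenient implementation; the random tensor stands in for a real training image.

```python
import torch
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)

image = torch.randn(1, 3, 64, 64)        # stand-in for a real training image
noise = torch.randn_like(image)          # randomly generated noise
timestep = torch.randint(0, 1000, (1,))  # randomly chosen intensity (noise level)

# Forward diffusion: mix the noise into the image at the chosen level.
noisy_image = scheduler.add_noise(image, noise, timestep)
```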

The noisy image generated this way is the training example. We give the noisy image and the intensity of the noise (the noise level) as inputs to the model, and train the model to predict the noise we added. The network used for this training is a UNet.

[Image: Training the model]

So, this model essentially becomes a noise predictor. If you are familiar with ML training, the steps go like this (a sketch of one training step follows the list):

1. Pick a training example.

2. Predict the noise with the model.

3. Compare the prediction with the actual noise.

4. Update the model's weights to reduce the difference.
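Here is a hedged sketch of one such training step, with the four steps above marked in comments. The tiny UNet and random "image" are stand-ins; the real model trains in latent space on enormous datasets.

```python
import torch
import torch.nn.functional as F
from diffusers import UNet2DModel, DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)
model = UNet2DModel(sample_size=64, in_channels=3, out_channels=3)  # toy stand-in
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

image = torch.randn(1, 3, 64, 64)                  # 1. pick a training example
noise = torch.randn_like(image)
timestep = torch.randint(0, 1000, (1,))
noisy_image = scheduler.add_noise(image, noise, timestep)

noise_pred = model(noisy_image, timestep).sample   # 2. predict the noise
loss = F.mse_loss(noise_pred, noise)               # 3. compare with the actual noise
loss.backward()                                    # 4. update the weights
optimizer.step()
optimizer.zero_grad()
```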


So, while diffusion is working, an image with a certain amount of noise is supplied to the model. The model detects the noise and removes some of it. This does not eliminate the noise completely, but yields a slightly de-noised version of the image. The process carries on until a clean image emerges; and because the process is not perfect, it ends up creating slightly altered images rather than recovering an original. This is the basic diffusion process.

[Image: Stable Diffusion at work]
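Here is a sketch of that reverse loop: start from pure noise and repeatedly remove a little of the predicted noise, one scheduler step at a time. The model here is an untrained stand-in; a real run would load trained weights.

```python
import torch
from diffusers import UNet2DModel, DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)
model = UNet2DModel(sample_size=64, in_channels=3, out_channels=3)  # untrained stand-in

scheduler.set_timesteps(50)          # de-noise over 50 steps
sample = torch.randn(1, 3, 64, 64)   # start from pure noise

for t in scheduler.timesteps:
    with torch.no_grad():
        noise_pred = model(sample, t).sample                    # detect the noise
    sample = scheduler.step(noise_pred, t, sample).prev_sample  # remove a little of it
```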

Now let's add the text encoder values to this. Stable Diffusion uses the CLIPText model released by OpenAI. The values generated by the Text Encoder are supplied as an additional input to the noise predictor during training, which lets the predictor condition on the input the user gives. So if the user types "Cat on a chair", the text encoder creates an array of the appropriate values, which is fed to the model alongside the noisy image; the model then steers the de-noising towards an image where a cat sits on a chair. In layman's terms, the image is being created from noise; it's all about the way in which the de-noising takes place.

[Image: Training the model with text]
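Put together, the whole text-to-image pipeline is packaged by the diffusers library. A hedged end-to-end example, assuming a machine with a CUDA GPU and one public Stable Diffusion checkpoint:

```python
import torch
from diffusers import StableDiffusionPipeline

# One public checkpoint, assumed for illustration; Lensa's own weights are not public.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("Cat on a chair").images[0]  # the text guides every de-noising step
image.save("cat_on_a_chair.png")
```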

Now that we know the basics of Stable Diffusion, let's concentrate on what Lensa is doing with it. Lensa is not driving the model with text prompts, so the model has been tweaked a little. The 10-20 images the user provides act as the input for this model, along with a set of pre-existing factors Lensa designed based on the details provided by users. Stable Diffusion acts on this and produces those wonderful avatars we are all in awe of. The process looks something like this:

[Image: Stable Diffusion as used by Lensa]
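Lensa has not published its pipeline, so the following is an assumed sketch of DreamBooth-style fine-tuning: the open Stable Diffusion UNet is briefly re-trained on the user's 10-20 photos so it learns their likeness, after which the tuned model can render them in different styles. `load_user_photos` and the zeroed conditioning tensor are hypothetical stand-ins.

```python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler

repo = "runwayml/stable-diffusion-v1-5"  # public base model, assumed for illustration
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
scheduler = DDPMScheduler.from_pretrained(repo, subfolder="scheduler")
optimizer = torch.optim.AdamW(unet.parameters(), lr=5e-6)

photos = load_user_photos()      # hypothetical helper: 10-20 tensors, 3x512x512 in [-1, 1]
cond = torch.zeros(1, 77, 768)   # stand-in conditioning; Lensa's actual factors are not public

for image in photos:
    with torch.no_grad():
        latents = vae.encode(image.unsqueeze(0)).latent_dist.sample() * 0.18215
    noise = torch.randn_like(latents)
    t = torch.randint(0, 1000, (1,))
    noisy = scheduler.add_noise(latents, noise, t)

    noise_pred = unet(noisy, t, encoder_hidden_states=cond).sample
    loss = F.mse_loss(noise_pred, noise)  # same noise-prediction objective as before
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```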

As with every leap in artificial intelligence, there is some backlash Lensa is facing. Billions of images were used to train the model Lensa builds on, and the catch is that these images were not all used with permission. Many artists believe their works were used to train the model without acknowledgement or compensation. The Lensa team responded that the outputs "can't be described as exact replicas of any particular artwork", which in their view exempts them from copyright law.

Then comes the issue of data privacy. Many users are alerting the internet, asking whether their images are being used for something other than creating these avatars. Lensa responded that the images are used responsibly and deleted after 24 hours, but there is always a catch with private entities. Some critics point out that the company could be paid by another entity for access to the data; their point is that the terms and conditions are not strong enough on data privacy.

The other issue faced by the app is the sexualization of female users. Some users reported that the app made their photos skinnier or depicted them in an uncomfortable way. These issues certainly put the app under a red light. Sheldon Fernandez, CEO of Darwin AI, summarizes it as: "One of the challenges with this is the technology just moves so quick and quicker than often legislative bodies can".

Artificial intelligence sure has come a long way, and the design of some of these models is truly fascinating, but society also needs to consider these issues and develop a responsible approach to the technology. The age where technology starts playing with ethics might be upon us; let's all hope we find the right balance.

Sumukh Ananth

Audit Executive | ACCA Aspirant

1y

Well written Achyutha Rao Sathvick. Have always been a fan of AI and it has made my workflow easier, but I never knew how these models actually worked. TIL the process of generating AI images. The point made here about data privacy invasion is very clear. Taking a step back helps us realize the amount of data we have handed to these tech giants; in today's world data can be used to make anything possible, and it is something to think about as AI starts taking over some entry-level jobs.
