Interpreting Deep Generative Models for Interactive AI Content Creation by Bolei Zhou (CUHK)

Machine-generated transcript…

Okay, great, so let's get started. Welcome to the tutorial on interpretable machine learning for computer vision. I'm Bolei Zhou, an assistant professor at the Chinese University of Hong Kong. Today I will give you a tutorial on interpreting deep generative models for interactive AI content creation. We have been working on interpretable machine learning for quite a while, and recently we have done many works in the area of interpreting deep generative models,

with applications to content creation. After we interpret the concepts inside a generative model, we can control those factors to customize the outputs, and therefore we can facilitate interactive AI content creation, putting humans in the loop of creating new content with AI models. There are many different kinds of deep generative models, but in this talk we will focus on GANs, the generative adversarial networks. So let's take a look at the progress

of image generation. The GAN was first proposed in 2014, and over the years many different GAN models have been proposed. In terms of image quality, you can see that quality has improved greatly; in particular, the recent StyleGAN2 and StyleGAN-ADA can synthesize realistic images from different viewpoints and for different data types, even with limited amounts of data. In the last year or so we have also seen another kind of generative model, the neural radiance field (NeRF), and there is a surge of work in that area.

NeRF takes dense images with camera parameters as inputs, learns a model, and can then synthesize new views of the scene instance. Also, early this year we had the DALL-E work from OpenAI that can translate a natural language sentence into an image. Here we have the example of

an armchair in the shape of an avocado. This model is quite amazing, because before this, if you Googled such a concept you would get nothing, but now this model has learned how to compose new concepts from the existing concepts in the dataset, which means the generative model already knows how to be creative. In this lecture we will focus on the

generative adversarial network, the GAN. Here is the training pipeline for GANs: we have the generator that takes a random vector as input and generates a fake image, and then there is the discriminator that tries to distinguish whether its input was produced by the generator or is a real image. Through this adversarial training, the generator and the discriminator get better and better, and after convergence the generator can synthesize realistic images. We will focus on the generator part, because this is the core of neural image generation.
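As a rough illustration of the training pipeline just described, here is a minimal sketch of the adversarial training loop in PyTorch. The toy fully-connected generator and discriminator and the 784-dimensional images are placeholders, not the architectures used in the works discussed in the talk.

```python
import torch
import torch.nn as nn

# Toy fully-connected generator and discriminator (placeholders, not StyleGAN).
G = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_images):                      # real_images: (batch, 784)
    batch = real_images.size(0)
    z = torch.randn(batch, 128)                   # random vector fed to the generator
    fake = G(z)

    # Discriminator step: push real images toward 1 and generated images toward 0.
    opt_d.zero_grad()
    loss_d = bce(D(real_images), torch.ones(batch, 1)) + \
             bce(D(fake.detach()), torch.zeros(batch, 1))
    loss_d.backward()
    opt_d.step()

    # Generator step: try to fool the discriminator into predicting 1 on fakes.
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(batch, 1))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```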

The generator is basically a convolutional neural network: it takes a random vector sampled from some distribution as input and generates a realistic image. If we change the input random vector, the output also changes; another random vector at the input gives another output. This neural image generation is very simple and very easy to use: input in, output out, you get a realistic image without much hassle. However, for this image generation pipeline there is no

way for people to steer the generation. Sometimes users want different outputs: for a given image, one user may want an image with a different lighting condition, or they may want the same image but from a different view. The current pipeline doesn't support such customization or human-in-the-loop content creation. Therefore, we want to better understand

the generative model and interpret it with human-understandable concepts. After we identify human-understandable concepts inside the internal representations, we can treat those concepts as switches that we can turn on and off to control the image generation process, and then we can edit the final output. In that way we can put humans in the loop of AI content creation, so that humans can collaborate with the AI model to create the new concepts or new content we want. That is the motivation behind interpreting deep generative models. So let's look at the deep

generative representations. Let's first take a look at what these representations are. There are two kinds. The first is the convolutional network of the generator: it is a convolutional neural network with a lot of convolutional filters, so we can interpret those individual filters and figure out what each one is doing. We can also look into the latent space. The latent

space is the beginning of the generation: with different random vectors, the output images will be different, so the latent space actually controls what kinds of concepts appear in the output image, and we can study that latent space as well. Here I summarize the interpretation approaches into three categories based on their supervision. The first category is the supervised approach, which uses labels and trained classifiers to probe the representation of the generator. The second is the unsupervised approach: for those models we don't

have any labels or classifiers, so how can we still identify the controllable dimensions of the generator? The third approach is more recent work that tries to align language embeddings with the generative representations, so that zero-shot interpretation is possible without using any labels: once the language embedding is aligned with the generative representation, given any word or sentence it can generate the

corresponding image. Let's get into each approach. First, the supervised approach, which uses labels and trained classifiers to probe the representations of the generator. One of the earliest works in the supervised approach is GAN Dissection, derived from our previous work on Network Dissection. The basic idea of GAN Dissection is to align semantic masks with the GAN feature maps. For the output image we can

apply a semantic segmentation network and get a semantic label for each pixel. Then we can go back into the generator, take the spatial feature map of some unit, and measure the agreement between that feature map and the semantic annotation, so that we can associate a label with each individual unit. It turns out about one third of the units have a very clear meaning with high agreement.
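As an illustration of the agreement measurement described above, here is a minimal sketch that scores one generator unit against one semantic class by IoU. The top-quantile thresholding rule is a simplification of the procedure used in GAN Dissection; the feature map and segmentation mask are assumed to come from the generator and an off-the-shelf segmenter.

```python
import torch
import torch.nn.functional as F

def unit_class_iou(feature_map, class_mask, quantile=0.99):
    """feature_map: (H', W') activation of one generator unit for one image.
    class_mask: (H, W) boolean mask of one semantic class from the segmenter."""
    up = F.interpolate(feature_map[None, None], size=class_mask.shape,
                       mode="bilinear", align_corners=False)[0, 0]
    unit_mask = up > torch.quantile(up.flatten(), quantile)   # keep only the top activations
    inter = (unit_mask & class_mask).sum().float()
    union = (unit_mask | class_mask).sum().float()
    return (inter / union.clamp(min=1)).item()                # agreement score in [0, 1]
```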

After we identify the semantic filters inside the generator, we can simply treat each filter as a switch: the user can interact with the output image, for example brush some regions, and we just turn the units on or off at those regions, so that we can add and remove content. Here is a demo showing this interactive content creation: you can brush some regions and add or remove trees, sky, clouds, and other objects.
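A rough sketch of this "switch" idea: zeroing out selected units of an intermediate generator layer inside a brushed region, using a standard PyTorch forward hook. The generator, layer name, unit indices, and brush mask below are all hypothetical placeholders.

```python
import torch

def make_ablation_hook(unit_ids, region_mask):
    """region_mask: (H, W) boolean tensor at the layer's spatial resolution."""
    def hook(module, inputs, output):              # output: (B, C, H, W) feature map
        edited = output.clone()
        for u in unit_ids:
            edited[:, u][:, region_mask] = 0.0     # switch the unit off inside the region
        return edited
    return hook

# Usage sketch (generator, layer name, tree_units, and brush mask are hypothetical):
# handle = generator.layer4.register_forward_hook(make_ablation_hook(tree_units, brush))
# edited_image = generator(z)   # trees removed where the user brushed
# handle.remove()
```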

With this interpretation of the individual units in hand, let's now take a look at the other part: the latent space. The latent space is really the driving force behind the image generation, so if we can disentangle and identify the meaningful concepts inside the latent space, we can have more controllable image generation. Here is one work where we interpret a scene generation network: we have a network that can generate indoor scenes, and we want to interpret its internal

representations with some linear classifiers. What we can do is apply some off-the-shelf predictors to the output image. Computer vision and computer graphics researchers have developed and released a lot of pretrained models that can predict attributes or semantic labels for an image, so we just take those off-the-shelf predictors and get the labels. Then we can go back into the latent space and identify the causal relations between the latent activations and the output attributes, and after that we can control the

generation. How does this work? Let's dive a little deeper. Starting from the latent space, we sample a lot of random vectors; each blue dot is a random vector sampled from some distribution. Each random vector is fed into the generator to produce an image, which gives us a synthesized image space. Now we take the off-the-shelf predictors and predict the attribute intensity for each output image; in this example we are looking at the indoor lighting prediction. Now we can

treat the predicted label as a pseudo-label for each latent vector, and train a linear classifier in the latent space to distinguish whether a latent vector corresponds to natural lighting or indoor lighting. It turns out a linear classifier can achieve more than 90 percent binary classification accuracy for this lighting prediction, which means that in this latent space many of the factors are already quite disentangled from each other.
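Here is a minimal sketch of the probing step just described: sample latent codes, pseudo-label the synthesized images with an off-the-shelf attribute predictor, and fit a linear classifier in the latent space. The `generator` and `lighting_predictor` callables are assumed placeholders for the pretrained models.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_attribute_boundary(generator, lighting_predictor, n_samples=10000, z_dim=512):
    z = np.random.randn(n_samples, z_dim)               # sampled latent vectors
    images = generator(z)                                # synthesized image space
    scores = lighting_predictor(images)                  # attribute intensity per image
    labels = (scores > np.median(scores)).astype(int)    # pseudo-labels from the predictor
    clf = LinearSVC(C=1.0).fit(z, labels)                # linear classifier in latent space
    normal = clf.coef_[0] / np.linalg.norm(clf.coef_[0]) # unit normal of the boundary
    return normal
```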

After this we have the last, very important step, called counterfactual verification, because the previous steps only establish correlation: we have just learned some correlation between the latent activations and the output attributes. Now we want to elevate that correlation into causality, so we have this verification step. What we do is take the normal vector of the trained linear classifier, add it on top of the original latent vector, generate a new image,

and observe how much the attribute changes compared to the original synthesized image. We only keep those attributes that can be reliably manipulated in this way. After this step we have identified the reliably manipulable attributes, and we can start doing the manipulation. Our manipulation model is pretty simple, it is just a linear model: once we know the boundary, we have its normal vector,

and we just push the latent code across the boundary by adding this boundary normal vector. Gradually you can see the indoor lighting attribute is intensified for the input image: comparing the two images, the lamp is turned on and also the global lighting condition is greatly improved.
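And a small sketch of the linear edit plus the counterfactual check, under the same assumptions as the previous sketch (`generator` and the attribute `predictor` are placeholder pretrained models; `n` is the unit normal of the learned boundary):

```python
import numpy as np

def edit(z, n, alpha):
    return z + alpha * n                          # push the latent code across the boundary

def verify_attribute(generator, predictor, z_batch, n, alpha=3.0):
    before = predictor(generator(z_batch))        # attribute scores of the original images
    after = predictor(generator(edit(z_batch, n, alpha)))
    return float(np.mean(after - before))         # large positive shift -> reliable to manipulate
```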

Here is another demo video showing that for a synthesized image we can push the latent code across the boundary, and the lamp is turned on and the global lighting condition is greatly improved. In the latent space we have not only indoor lighting but also many other attributes. Here is another example where we push the latent codes across the cloudless boundary. Right now we have four

images where the sky is very boring, and we want to include some clouds, so we just push the latent codes across the cloudless boundary and the clouds automatically grow in the sky. This kind of linear manipulation is very simple and generalizable. Here we apply the same technique to interpret face generative models: we have a face generation model, and we apply the linear classifiers

to identify the facial attributes inside the latent space. Then we can manipulate different kinds of facial attributes such as age, gender, pose, and artifacts. Later on, people started improving this linear manipulation model. We have a very strong assumption that the manipulation model is just linear: we have the latent vector w and we simply add the attribute vector on top of it, which obviously cannot capture all kinds of transformations. Then we have

a recent SIGGRAPH work, StyleFlow, that combines StyleGAN with a flow-based conditional model. Their idea is to replace the MLP with an invertible flow model conditioned on attributes, so the latent space Z and the latent space W have a non-linear relation conditioned on the attributes. With this feed-forward process, the sampled w already contains the attributes it wants, and the model can generate outputs corresponding to those

attributes. Here is a demo video showing their facial manipulation results: they can choose the attributes they want, generate a new w, and this new w brings those attributes onto the output face. From the previous works you can see that the latent space captures all kinds of attributes. One very interesting observation is that there are some 3D structure attributes

inside this latent space. Here I show you two demo videos from our previous works: one, from our IJCV work on scene synthesis, changes the scene viewpoint, and on the right is the InterFaceGAN work where we change the face pose. This result is very interesting, because when we train these deep generative models, the training data are just individual samples downloaded from the web; we do not have multiple views of the same instance. All the

data are just independent samples, yet somehow, in order to synthesize 2D images, the generative model figures out the 3D structure of the data. This raises a very interesting question: does 3D structure emerge from 2D image generation? To better extract the 3D information, here is a very recent work from the University of Toronto and NVIDIA,

published at ICLR this year. This work tries to parse 3D information from a 2D image generator, using differentiable rendering for inverse graphics and interpretable 3D rendering. Their pipeline starts with samples from StyleGAN: you can sample from StyleGAN and generate a lot of instances, like multi-view data, and then they use very quick annotations to annotate the views of the different synthesized images. After they collect a bunch of multi-view

data, they use it to train an inverse graphics network; the diagram on the right shows this process. For this inverse graphics network the input is an image, and the network predicts the mesh, lighting, and texture. These predictions are passed through a differentiable graphics renderer, which renders those components back into an image, and then they can construct a loss function. Because this

renderer is differentiable, they can back-propagate and train the inverse graphics network. After training this inverse graphics network, they further use it to train a disentangling mapping network. This mapping network is very interesting: it is more like a parameterized model for the latent vectors. Its inputs are the controllable dimensions such as camera, mesh, texture, and background, and through some learned layers it outputs latent codes that contain all the controllable dimensions. These latent codes are fed into StyleGAN, and StyleGAN works like

a 3D neural renderer that renders that information into a realistic image. Here I show some rendered, controllable outputs from StyleGAN. This work is very interesting: it combines a differentiable renderer with neural rendering and tries to distill the controllable information using the inverse graphics network. There are still many challenges for the supervised approach; here I list some of the challenges I think are very important.

The first is how to further expand the annotated dictionary size. The number of attributes we can interpret depends heavily on the annotated dictionary, and some concepts may not be in our dictionary, so how to scale up the dictionary is an unsolved problem. The second is how to further disentangle correlated attributes. One observation we have is that some

attributes are actually entangled with each other: when we change a person's age, they tend to start wearing glasses, so age is entangled with wearing glasses. How can we further disentangle these? That is also related to the bias of the data. The third question is how to align the latent space with image region attributes, because the latent vector is just a

vector, so there is no spatial structure in it, but in a lot of cases we want to control specific image regions. How to build this spatial relation between the latent vector and the image regions is an unsolved problem. Okay, then let's take a look at the second approach, the unsupervised approach,

which tries to identify the controllable dimensions of the generator without labels and classifiers. As generative models become more and more popular, people adopt them for different use cases and train them on different kinds of data. Here, for example, people train models to generate cats, and we also have generative models

for cartoons. For those kinds of data it is very difficult to find off-the-shelf predictors, and it is also very difficult to annotate labels; for every new model we would have to annotate data and train in-house classifiers. The problem is then: how can we still identify the steerable dimensions without access to predictors and without annotating the synthesized data? We have a CVPR work this year that looks into this problem, where we develop a closed-form factorization

of the latent space in GANs. Basically, for the image generation pipeline we look at the first step, which converts the latent vector into the first layer's activations. Between the two there is an affine transform: for the latent code z there are learned parameters A and b that perform the affine transform into the first layer's latent activations. We also have the observation that the feature difference

after editing is actually independent of the original latent vector, because of this linear affine transform: when we take the difference on the intermediate feature, it is independent of the original latent code, which means that the feature difference encodes the attribute difference. Therefore we develop an objective to maximize the variation of this difference; ideally we want the feature differences to cover all kinds of attributes. It turns out that optimizing this objective is much like doing PCA on the weight matrix of that transform.
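A minimal sketch of this closed-form idea, under the assumption (as in SeFa) that the first step is an affine transform y = Az + b, so that the variation-maximizing edit directions are the top eigenvectors of AᵀA:

```python
import numpy as np

def closed_form_directions(A, k=5):
    """A: (feature_dim, z_dim) weight of the first affine transform."""
    eigvals, eigvecs = np.linalg.eigh(A.T @ A)    # A^T A is symmetric, so eigh applies
    order = np.argsort(eigvals)[::-1]             # sort by decreasing eigenvalue
    return eigvecs[:, order[:k]].T                # (k, z_dim) unit-norm edit directions

# Usage sketch: z_edit = z + alpha * directions[i], where alpha is the slider value.
```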

We obtain different controllable components that align with different attributes. Based on this work we built an interactive content creation demo. On the left you can see different sliders; the user can move a slider to change the intensity of a controllable dimension, and the output image changes accordingly. You can see the sliders control different attributes such as the pose,

texture, and colors of the cats. With this method we can put humans in the loop of AI content creation, which I think is a very promising future direction. You can imagine that this is not limited to image generation; it can also be extended to other data domains such as music generation or industrial design, so there is a lot of room to facilitate human-in-the-loop AI content creation. Here is another work, from Adobe I think, where they developed a method called GAN-

space. Basically this method applies PCA, principal component analysis, to the latent space of StyleGAN. Starting from the latent space, they first generate a lot of images and collect the feature activations, then apply PCA to those feature activations to identify the principal components. After identifying the principal components, they go back into the latent space and treat it as a regression problem, to

regress the latent codes onto the principal component directions of the feature activations. After this they have latent directions in the latent space and can control the image generation. Here is a demo video showing their manipulation results; these are their discovered components or dimensions, and you can see that different components control different output attributes, such as the pose of the cars, zooming in and out, some geometric changes, and also the type of the cars. You can go back and watch the whole video; we don't have time to go through it here.
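Here is a rough sketch of the GANSpace-style procedure described above, assuming hypothetical `sample_w` and `first_layer_features` interfaces to the generator: PCA on intermediate feature activations, then linear regression back to the latent space to obtain edit directions there.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def ganspace_directions(sample_w, first_layer_features, n_samples=10000, k=10):
    W = np.stack([sample_w() for _ in range(n_samples)])      # (N, w_dim) latent codes
    feats = np.stack([first_layer_features(w) for w in W])    # (N, feat_dim) activations
    pca = PCA(n_components=k).fit(feats)
    coords = pca.transform(feats)                             # (N, k) principal coordinates
    reg = LinearRegression().fit(coords, W)                   # regress latents on PCA coords
    return reg.coef_.T                                        # (k, w_dim) edit directions
```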

Here is another work, from Berkeley and MIT, where they develop a method called the Hessian penalty as a weak prior for unsupervised disentanglement. The first equation is the generation equation: the image equals the generator applied to the latent variable z, I = G(z). They then look at the Hessian matrix of the generator with respect to z, which contains

the mixed second derivatives: the derivative with respect to z_i of the derivative with respect to z_j. Ideally, if two attributes are disentangled from each other, the off-diagonal elements of this Hessian matrix should be zero. They therefore turn this condition into a Hessian penalty used during training: they want to

minimize the Hessian penalty, so as to make these off-diagonal elements as close to zero as possible. Here is their demo video: "What does it mean to disentangle a neural network? Let's take a look at this generative model G. Intuitively, we'd like the change in the output image caused by perturbing one input z component to be independent of the other z components. Mathematically, this is equivalent to saying that G has a

diagonal Hessian with respect to its input. We propose the Hessian penalty, a regularization term that penalizes the off-diagonal terms of G's Hessian matrix. The Hessian penalty can be computed efficiently during training, and it takes just a few lines of code to implement; we apply it to popular datasets and to image-to-image translation."
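Since the video mentions that the penalty takes only a few lines of code, here is a naive, purely illustrative sketch of the quantity being penalized: the mixed second derivative of G with respect to one pair of latent components, estimated by finite differences. The actual Hessian penalty uses a more efficient stochastic estimator over random directions; `G` is a placeholder generator mapping z to an image.

```python
import torch

def offdiag_hessian_term(G, z, i, j, eps=0.1):
    """Finite-difference estimate of d^2 G / (dz_i dz_j), which should be ~0
    if latent components i and j control independent factors."""
    ei = torch.zeros_like(z); ei[i] = eps
    ej = torch.zeros_like(z); ej[j] = eps
    mixed = (G(z + ei + ej) - G(z + ei - ej)
             - G(z - ei + ej) + G(z - ei - ej)) / (4 * eps ** 2)
    return (mixed ** 2).mean()        # penalize its magnitude during training
```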

Okay, so this work is a training regularizer that encourages better disentanglement during the training itself. Here is another work, from Shiguang Shan's group, where they try to build an inductive bias for disentanglement into the generator. Rather than using the usual MLP in the latent space, they design a more structured, parameterized latent space; essentially they design this inductive bias inside the generator and then train the model. After training, the latent space is already quite disentangled, and the different disentangled

dimension already can be found through this so here i show you some results that they have this after trainings they can identify the distant angle dimensions such as age arrays and the gaze so its a very interesting result that introduces inductive bios of this entanglement directly in the generator so similar work like this block gain or hologens also try to design some inductive bios in the generator then they can this generator can nurse some 3d structure or controllable switch out of the box after the training so that is also very interesting.

direction: to design novel inductive biases inside the generator so that we do not need to do post-training interpretation. For the unsupervised approach there are still many challenges, and here I list some of them. The first is how to evaluate the results: because we are doing unsupervised discovery, we may get some dimensions, but how can we quantify the disentanglement? How to evaluate the degree of disentanglement is still an unsolved problem. The second question is how to

annotate each disentangled dimension: we can obtain many disentangled dimensions, but how do we name them? Right now this is done through manual annotation; we just plot them and ask a human to come up with a name for each dimension. The third one is how we can really improve the disentanglement during the training process. A lot of the work happens after the training, as post-hoc

interpretation, but ideally we want to encourage and improve the disentanglement during training, so that we get a more disentangled model out of the box. Lastly, let's take a look at the very recent zero-shot approach that tries to align language embeddings with the generative representations. Here is a very recent work from Adobe and the Hebrew University

where they develop a method called StyleCLIP, which combines the CLIP embedding with StyleGAN. Their model works with inputs like this: given a source image and some text input, we want to manipulate the source image according to the text. The outputs are quite amazing: you can see it create this person without makeup, or cute cats, or change this tiger into a lion, or

give different building styles to the source image. How does this method work? It links a pretrained model with the StyleGAN generation. The pretrained model is from OpenAI and is called Contrastive Language-Image Pre-training, CLIP for short. It is pretrained on 400 million image-text pairs: they download a lot of images with text from the web and train encoders so that the

model can embed an image or any given text into a joint embedding space. Once they have this CLIP joint embedding space, they develop an optimization objective that optimizes the latent code to generate an image satisfying the conditions. First, for the source image, we can do GAN inversion to obtain the source latent code w_s; then they optimize w. The objective function has three

components. The first component lives in the joint embedding space: it optimizes the latent code to minimize the distance in that space, where t is the feature of the text prompt and G(w) is the generated image; in the joint space they want to generate

an image that minimizes the distance to the text embedding. The second and third terms are regularizers: the second one keeps the latent code close to the original, because they don't want to change the image content too much, and the third regularizer is on identity, so they can keep the identity of the person

the same. After the optimization they have a latent code w* that synthesizes an image satisfying the conditions. With this kind of iterative optimization they can generate all kinds of very interesting edits, such as makeup, curly hair, straight hair, and a bob cut, for any given input image.
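Here is a rough sketch of the latent optimization just described, assuming the `clip` package from OpenAI, a hypothetical StyleGAN `generator`, and a starting code `w_s` obtained by GAN inversion; the identity regularizer is omitted for brevity, and `lambda_l2` is an assumed weight.

```python
import torch
import clip   # OpenAI's CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()       # keep everything in fp32 for the optimization
with torch.no_grad():
    text_feat = clip_model.encode_text(clip.tokenize(["a person with curly hair"]).to(device))

def styleclip_edit(generator, w_s, lambda_l2=0.008, steps=300, lr=0.1):
    """w_s: source latent code from GAN inversion; generator(w) is assumed to
    return a 224x224, CLIP-normalized image tensor on the same device."""
    w = w_s.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        img_feat = clip_model.encode_image(generator(w))
        clip_loss = 1 - torch.cosine_similarity(img_feat, text_feat).mean()
        latent_loss = ((w - w_s) ** 2).sum()          # stay close to the source code
        (clip_loss + lambda_l2 * latent_loss).backward()
        opt.step()
    return w.detach()
```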

In this work they also develop a feed-forward network, trained with the same objective as its loss function, so that it can take an image as input and do the manipulation directly, without the very slow iterative optimization. If you are interested, you can take a look at their paper. Here is another, concurrent work from MIT, from my previous group, done by David Bau and Antonio Torralba. They also combine CLIP with region-based StyleGAN inversion: the user can brush some regions and change the style of that region while keeping the other parts

of the image the same. They also develop an objective function: they manipulate the latent code and optimize it to minimize a loss function. On the right you can see the loss functions; they take the pretrained CLIP model, minimize the distance in the joint embedding space, and optimize the code to generate the image. Here is another work, from OpenAI; it is a very amazing work

that explores the importance of data: you just train a massive model on a massive amount of data. This model has 12 billion parameters and is trained on 250 million text-image pairs downloaded from the internet. Their image model is an autoencoder-like model, an improved version of the VQ-VAE, so the latent space of this autoencoder becomes discrete. Then they can put a stronger, language-like

prior, in the form of a transformer, on this latent space: the second step trains an autoregressive transformer to model the joint distribution of the text and image tokens. After training on this massive amount of data, the results are quite impressive: given any sentence, it can generate a very fun and creative image. In the middle one you can see the input sentence is "an illustration of

a baby hedgehog in a Christmas sweater walking a dog", and it can generate a lot of variations. The last one is "a neon sign that reads backprop", and it generates such amazing results. Here is a summary of the interpretation approaches for deep generative models. We have the supervised approach that uses labels and trained classifiers to probe the representations; the second is the unsupervised approach that identifies the controllable dimensions of the generators without using labels or classifiers; and we also have the recent zero-shot

approach that aligns language embeddings with the generative representations to gain more control over the generation process. Let's now take a look at the latent space of the GAN generator. At the beginning we only had the original GAN: starting from the Z space, which is the source of the stochasticity, we sample a random vector, feed it into the generator, and generate the image. But later on we have more advanced

models such as StyleGAN, which is a style-based model. The input at the beginning is just a constant vector, and the stochasticity actually comes from each layer. There is the Z space and the W space: a z vector is sampled from a random distribution and passed through an MLP, some non-linear layers, giving the W space. After this non-linear transformation, the W space has much better encoding capabilities than the original

Z space. This w is then fed into each layer, and each layer has an AdaIN operation: the AdaIN operation takes the latent code w, applies some transformation, and feeds it into that layer, so each layer has its own latent vector. For these per-layer latent vectors people came up with a new name, the style space or S space, and the S space is layer-wise

because each layer has its own parameters. After this, each layer has its own latent vector, which goes into the generator to reconstruct or generate the image. Later, people came up with the W+ space in order to reconstruct images, i.e., to do GAN inversion. After that, some people applied non-linear operations to the W or W+ space and came up with the P space and the P+ space, which have some unique properties. Different latent spaces have different functionalities. For example, the W space is known

for disentangled manipulation, so most of the linear manipulation work happens in the W space: we just add the attribute vector on top of the original w vector to do the manipulation. For GAN inversion, on the other hand, we are given a real image and want to optimize the latent code to reconstruct that given image. The W+ space is one of the most popular latent spaces for GAN inversion: we just optimize the layer-wise codes in each dimension so that we can reconstruct the real image.
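A minimal sketch of optimization-based GAN inversion in a W+-style space, assuming a hypothetical `generator(w_plus)` that maps an (n_layers, w_dim) code to an image tensor; a real implementation would initialize from the average w and add a perceptual loss.

```python
import torch

def invert(generator, target_image, n_layers=18, w_dim=512, steps=500, lr=0.05):
    w_plus = torch.zeros(n_layers, w_dim, requires_grad=True)   # layer-wise latent codes
    opt = torch.optim.Adam([w_plus], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = generator(w_plus)
        loss = torch.mean((recon - target_image) ** 2)          # pixel reconstruction error
        loss.backward()
        opt.step()
    return w_plus.detach()
```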

This raises an interesting question: which latent space is more disentangled? We have the Z space, the W space, and the S space. To measure this disentanglement there is a lot of recent work; here are two works from CVPR, one from my group and another from the Hebrew University and Adobe, looking into similar problems and evaluating the

disentanglement of the latent codes. In our CVPR work we use a reconstruction error to measure the disentanglement, and we found that the S space, the style space, is more disentangled than the original W space and Z space. In the Wu et al. CVPR work they also show that the S space has better disentanglement than the W space and the Z space. So in essence, if you do manipulation or other operations in the S space, it will be more effective than in the W space. That is also

relevant to another inversion approach, where people train an encoder for the StyleGAN generator. One weakness of StyleGAN, and of GAN models in general, is that they don't have inference capability: there is no encoder for the generator, so people have to train an encoder afterwards, and they need to think about which latent space to use as the interface between the encoder and the

generator, since there are many latent spaces to choose from. Here we have an evaluation of which latent space to use for encoder training, and we found that the S space is more effective for reconstructing the input image than the previous W+-based work. We show some qualitative results compared with the previous works ALAE and IDInvert, and we show that if

you train the encoder in the style space, you can get better reconstruction results, and the features can also be used in many different applications; you can take a look at our CVPR work if you are interested in knowing more. This kind of GAN inversion can be generalized to many image processing tasks. We have some recent work exploring applications of GAN inversion: GAN inversion treats the generator as a generative image prior, and we can optimize the

latent codes to do the reconstruction. By modifying this objective a little bit, we can realize a lot of image processing applications such as colorization, super-resolution, or masked optimization, opening the door to many image applications.
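A rough sketch of using the generator as an image prior for restoration, as described above: the same inversion loop, but the loss compares a degraded version of the reconstruction to the degraded observation. The `generator` and `degrade` callables (e.g. downsampling for super-resolution or channel averaging for colorization) are placeholders.

```python
import torch

def restore(generator, observation, degrade, z_dim=512, steps=500, lr=0.05):
    z = torch.randn(1, z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = generator(z)                                    # candidate clean image
        loss = torch.mean((degrade(recon) - observation) ** 2)  # match only the degraded view
        loss.backward()
        opt.step()
    return generator(z).detach()                                # restored full image
```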

Okay, so here is a summary. You can see that we can apply a lot of different approaches to identify the interpretable dimensions or factors hidden inside a generative model, and interpreting generative models can greatly facilitate interactive content creation and human-AI collaboration. If you are interested in knowing more, you can refer to the recent survey paper on GAN inversion. That's the end of my lecture. Thank you very much, and I

hope you will also check out the other speakers in this tutorial. Thank you, bye bye.

