[Paper Review] LDM: High-Resolution Image Synthesis with Latent Diffusion Models

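As I wrote this up, it turned into something close to a translation of the paper, but there is a lot of interesting material in it. One example is the stated intent: by cutting the huge computing demand of DMs, the authors want to democratize diffusion model development and, at the same time, limit environmental damage. I suspect that releasing Stable-Diffusion as open source is connected to this intent.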


paper: https://arxiv.org/abs/2112.10752
github: https://github.com/CompVis/latent-diffusion

Contents
0. Abstract
1. Introduction
   Democratizing High-Resolution Image Synthesis
   Departure to Latent Space
3. Method
   3.1. Perceptual Image Compression (Pixel Space ↔ Latent Space)
   3.2. Latent Diffusion Models
   3.3. Conditioning Mechanisms
4. Experiments
   4.1. On Perceptual Compression Tradeoffs
   4.2. Image Generation with Latent Diffusion
   4.3. Conditional Latent Diffusion
   4.4. Super-Resolution with Latent Diffusion
   4.5. Inpainting with Latent Diffusion
5. Limitation & Societal Impact
   Limitation
   Societal Impact
6. Conclusion
Review
Reference

0. Abstract

Diffusion models have achieved SOTA results in generating images and other kinds of data. However, because DMs (diffusion models) operate directly in pixel space, optimizing them has required large numbers of GPUs and a lot of time. To make it possible to train DMs on limited computing resources while keeping their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. Unlike previous work, training diffusion models on such a representation reaches, for the first time, a near-optimal point between complexity reduction and detail preservation.

By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs (such as text or bounding boxes), and high-resolution synthesis becomes possible in a convolutional manner.

Our Latent Diffusion Models (LDMs) achieve SOTA on (1) image inpainting and (2) class-conditional image synthesis, and show highly competitive performance on (3) text-to-image synthesis, (4) unconditional image generation, and (5) super-resolution, while significantly reducing computing requirements compared to pixel-based DMs.

1. Introduction

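Image synthesis is one of the fields with the most dramatic recent progress, and also one with the greatest computational demands. In particular, high-resolution synthesis of complex, natural scenes has until recently been dominated by scaling up autoregressive (AR) transformers with billions of parameters. GANs, in contrast, are mostly limited to data with comparatively little variability, because adversarial training does not scale easily to complex, multi-modal distributions.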

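Recently, DMs (diffusion models), which are built from a hierarchy of denoising autoencoders, have shown impressive results in image synthesis and have reached SOTA in class-conditional image synthesis and super-resolution. Being likelihood-based models, DMs do not suffer from mode collapse or training instability the way GANs do. And by heavily exploiting parameter sharing, they can model highly complex distributions of natural images without needing billions of parameters as AR models do.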

Democratizing High-Resolution Image Synthesis

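DMs still demand a lot of computing resources, because the model is still trained and evaluated in the high-dimensional space of RGB images. For example, training the most powerful DMs takes 150 to 1000 V100 days. This has two consequences for the research community and for users.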

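(1) Training a DM requires massive computing resources that only a small part of the field has access to. It also leaves a large carbon footprint. (Massive computing harms the environment.)

(2) Evaluating an already trained model is also expensive in terms of time and memory.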

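To reduce this resource consumption while making powerful models more accessible, the computational complexity of both training and sampling has to come down. Reducing the computing demand of DMs without hurting their performance is the key to democratizing high-resolution image synthesis.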

Departure to Latent Space

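The paper's approach starts with an analysis of diffusion models that were already trained in pixel space. Figure 2 shows the rate-distortion trade-off of a trained model (LDM stays efficient while removing only imperceptible details). We first aim to find a space that is perceptually equivalent but computationally more suitable. Training is split into two phases: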

First, we train an autoencoder that provides a lower-dimensional representational space that is perceptually equivalent to the data space. Second, we train the DM in this learned latent space, which shows better scaling properties with respect to spatial dimensionality than pixel space does.

The reduced complexity also makes it possible to generate images efficiently from the latent space with a single network pass. We call this model class Latent Diffusion Models (LDMs).

3. Method

3.1. Perceptual Image Compression (Pixel Space ↔ Latent Space)

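The paper's perceptual compression model is an autoencoder. Given an image $x \in \mathbb{R}^{H \times W \times 3}$ in RGB space, the encoder $\mathcal{E}$ encodes $x$ into a latent representation $z = \mathcal{E}(x)$ with $z \in \mathbb{R}^{h \times w \times c}$, and the decoder $\mathcal{D}$ reconstructs the image from the latent, $\tilde{x} = \mathcal{D}(z) = \mathcal{D}(\mathcal{E}(x))$. The encoder downsamples the image by a factor $f = H/h = W/w$, and the paper experiments with different downsampling factors $f = 2^m$, $m \in \mathbb{N}$.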
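To make the effect of the downsampling factor concrete, here is a small arithmetic sketch of how the latent size shrinks as f grows. The function names and the latent channel count of 4 are my own assumptions for illustration, not values taken from the text above.

```python
def latent_shape(height, width, f, channels):
    """Shape (h, w, c) of the latent for an H x W image and downsampling factor f."""
    if height % f != 0 or width % f != 0:
        raise ValueError("image height and width must be divisible by f")
    return (height // f, width // f, channels)


def spatial_reduction(f):
    """How many times fewer spatial positions the latent has than the image."""
    return f * f


# channels=4 is a hypothetical latent channel count used only for illustration.
for m in range(6):
    f = 2 ** m
    print(f, latent_shape(256, 256, f, channels=4), spatial_reduction(f))
```

With f = 8, for example, a 256 x 256 image maps to a 32 x 32 latent grid, so the diffusion model sees 64 times fewer spatial positions.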

To avoid arbitrarily high-variance latent spaces in the autoencoder, the paper experiments with two kinds of regularization.

KL-reg: A small KL penalty towards a standard normal distribution over the learned latent, similar to VAE.

VQ-reg: Uses a vector quantization layer within the decoder. It can be read as a VQGAN (a VQ-VAE-style model) in which the quantization layer is absorbed by the decoder.
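As a rough illustration of what the KL-reg variant penalizes, the sketch below computes the closed-form KL divergence between a diagonal Gaussian and a standard normal. Parameterizing the encoder output as a mean and a log-variance is the usual VAE convention and is an assumption here; `regularized_loss` and the `kl_weight` default are hypothetical, chosen only to show that the penalty is kept small.

```python
import numpy as np


def kl_to_standard_normal(mean, logvar):
    """KL(N(mean, exp(logvar)) || N(0, I)), summed over all latent elements."""
    return 0.5 * np.sum(mean ** 2 + np.exp(logvar) - 1.0 - logvar)


def regularized_loss(reconstruction_loss, mean, logvar, kl_weight=1e-6):
    """Reconstruction loss plus a small KL penalty (kl_weight is illustrative)."""
    return reconstruction_loss + kl_weight * kl_to_standard_normal(mean, logvar)


mean = np.array([1.0, 0.0])
logvar = np.zeros(2)
print(kl_to_standard_normal(mean, logvar))  # 0.5
```

The penalty is zero exactly when the latent distribution already is a standard normal, so a small weight only nudges the latent toward unit scale instead of forcing it.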

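Because the subsequent DM is designed to work with the two-dimensional structure of the learned latent space $z = \mathcal{E}(x)$, relatively mild compression rates are enough to get very good reconstructions. Earlier works instead relied on an arbitrary 1D ordering of $z$ to model its distribution autoregressively, which ignored much of its inherent structure. As a result, the LDM compression model preserves the details of $x$ better.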

3.2. Latent Diffusion Models

Diffusion Model

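Diffusion models are designed to learn a data distribution $p(x)$. They can be viewed as an equally weighted sequence of denoising autoencoders $\epsilon_\theta(x_t, t)$, $t = 1 \dots T$, each trained to predict a denoised variant of its input $x_t$, where $x_t$ is a noisy version of the original image $x$. The simplified objective is:

$$L_{DM} = \mathbb{E}_{x,\ \epsilon \sim \mathcal{N}(0, 1),\ t}\left[\, \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert_2^2 \,\right]$$

where $t$ is sampled uniformly from $\{1, \dots, T\}$.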
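The noise-prediction objective can be sketched in a few lines of NumPy. This is a minimal illustration under my own assumptions: a DDPM-style linear beta schedule, a closed-form forward step, and a mean (instead of sum) over elements. The names `make_alpha_bar`, `q_sample` and `denoising_loss` are hypothetical, and `eps_model` stands in for the real network.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal level per step for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def q_sample(x0, t, eps, alpha_bar):
    """Fixed forward process: mix the clean input with Gaussian noise at step t."""
    a = alpha_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps


def denoising_loss(eps_model, x0, t, eps, alpha_bar):
    """Squared error between the true noise and the predicted noise, averaged."""
    x_t = q_sample(x0, t, eps, alpha_bar)
    return np.mean((eps - eps_model(x_t, t)) ** 2)


rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((3, 8, 8))   # toy "image"
eps = rng.standard_normal(x0.shape)   # noise the model should recover
t = 500

# A placeholder model that always predicts zero noise.
zero_model = lambda x_t, t: np.zeros_like(x_t)
print(denoising_loss(zero_model, x0, t, eps, alpha_bar))
```

A real training loop would sample `t` and `eps` afresh for every batch and minimize this loss with respect to the network parameters.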

Generative Modeling of Latent Representations

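Through the perceptual compression model made up of $\mathcal{E}$ and $\mathcal{D}$, we get access to a low-dimensional latent space, which is better suited to likelihood-based generative models than pixel space: the model can (i) focus on the important, semantic bits of the data and (ii) train in a lower-dimensional space, which is computationally more efficient.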

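The neural backbone $\epsilon_\theta(\circ, t)$ of the model is a time-conditional UNet. Because the forward process is fixed, $z_t$ can be obtained efficiently from $\mathcal{E}$ during training, and samples from $p(z)$ can be decoded to image space with a single pass through $\mathcal{D}$.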
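For reference, the latent-space objective as I read it from the paper is the pixel-space noise-prediction loss with the image replaced by its encoding. The closed-form expression for the noised latent in the second line uses the common DDPM notation with a cumulative schedule term, which is my assumption and not something spelled out in this review.

```latex
% Latent diffusion objective: the network predicts the noise added to z = E(x)
L_{LDM} := \mathbb{E}_{\mathcal{E}(x),\ \epsilon \sim \mathcal{N}(0, 1),\ t}
  \left[\, \lVert \epsilon - \epsilon_\theta(z_t, t) \rVert_2^2 \,\right]

% Assumed DDPM-style forward step applied to the latent instead of the image
z_t = \sqrt{\bar{\alpha}_t}\, z + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
  \qquad z = \mathcal{E}(x)
```

The only structural change from the pixel-space case is where the noise is added: the encoder runs once per training image, and the UNet never touches full-resolution pixels.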

3.3. Conditioning Mechanisms

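Like other types of generative models, diffusion models can in principle model conditional distributions of the form $p(z \mid y)$. This can be implemented with a conditional denoising autoencoder $\epsilon_\theta(z_t, t, y)$, which makes it possible to control image generation through an input $y$ (text, semantic maps) or to perform image-to-image translation tasks. The paper augments the UNet backbone with a cross-attention mechanism so that it can be conditioned on various modalities.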

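To pre-process $y$ from various modalities (such as text prompts or semantic maps), the paper introduces a domain-specific encoder $\tau_\theta$ that projects $y$ to an intermediate representation $\tau_\theta(y) \in \mathbb{R}^{M \times d_\tau}$ (an encoding suited to the UNet).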

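The conditioning enters the UNet through $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) \cdot V$, with $Q = W_Q^{(i)} \cdot \varphi_i(z_t)$, $K = W_K^{(i)} \cdot \tau_\theta(y)$, and $V = W_V^{(i)} \cdot \tau_\theta(y)$. $Q$ is the query and $K$ is the key (the embedding we want to feed in as the condition). Their dot product is passed through a softmax to turn it into weights, and those weights are then multiplied with $V$ (the value, which comes from the same input embedding as $K$). This is a typical cross-attention mechanism.

$\varphi_i(z_t)$ denotes a (flattened) intermediate representation of the UNet that implements $\epsilon_\theta$.

$W_Q^{(i)}$, $W_K^{(i)}$, and $W_V^{(i)}$ are learnable projection matrices.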
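The cross-attention step can be sketched with NumPy as follows. This is a single-head toy version with random weights and made-up sizes, using the row-vector convention (features times matrix) instead of the matrix-times-feature notation; the function and variable names are my own, and a real UNet layer would add multiple heads and an output projection.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def cross_attention(phi, tau_y, W_q, W_k, W_v):
    """phi: (N, d_phi) flattened UNet features; tau_y: (M, d_tau) condition tokens."""
    Q = phi @ W_q        # queries come from the image latent features, (N, d)
    K = tau_y @ W_k      # keys come from the conditioning embedding, (M, d)
    V = tau_y @ W_v      # values come from the same conditioning embedding, (M, d)
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (N, M), each row sums to 1
    return weights @ V, weights


rng = np.random.default_rng(0)
N, M, d_phi, d_tau, d = 16, 5, 32, 24, 8   # hypothetical sizes
phi = rng.standard_normal((N, d_phi))
tau_y = rng.standard_normal((M, d_tau))
W_q = rng.standard_normal((d_phi, d))
W_k = rng.standard_normal((d_tau, d))
W_v = rng.standard_normal((d_tau, d))

out, weights = cross_attention(phi, tau_y, W_q, W_k, W_v)
print(out.shape, weights.shape)  # (16, 8) (16, 5)
```

Each of the N spatial positions ends up with its own weighting over the M conditioning tokens, which is what lets different image regions attend to different parts of a prompt.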

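The simplified objective of the conditional LDM is:

$$L_{LDM} := \mathbb{E}_{\mathcal{E}(x),\ y,\ \epsilon \sim \mathcal{N}(0, 1),\ t}\left[\, \lVert \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y)) \rVert_2^2 \,\right]$$

Both $\tau_\theta$ and $\epsilon_\theta$ are optimized jointly through this objective.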

4. Experiments

4.1. On Perceptual Compression Tradeoffs

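Experimental results for different downsampling factors $f \in \{1, 2, 4, 8, 16, 32\}$. The models are called LDM-$f$, and LDM-1 corresponds to a pixel-based DM. LDM-{4-16} strike a good balance between efficiency and quality, and LDM-4 and LDM-8 offer the best conditions for high-quality results.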

4.2. Image Generation with Latent Diffusion

Achieves SOTA on CelebA-HQ in terms of FID.

Keeps the FID score low while using relatively few parameters, on the order of 1B.

4.3. Conditional Latent Diffusion

<Text-To-Image>

<Layout-To-Image>

<Class Conditional Image Generation>

4.4. Super-Resolution with Latent Diffusion

4.5. Inpainting with Latent Diffusion

5. Limitation & Societal Impact

Limitation

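LDMs greatly reduce computational demands compared to pixel-based approaches, but their sequential sampling process is still slower than that of GANs. The use of LDMs can also be questionable when high precision is required. With our f=4 autoencoding models (LDM-4), the loss of image quality is very small (see Figure 1), but the reconstruction capability can become a bottleneck for tasks that need fine-grained accuracy in pixel space. We assume that our super-resolution models are already somewhat limited in this respect.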

Societal Impact

Generative models for media such as images are a double-edged sword.

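On the one hand, they enable a wide range of creative applications, and approaches like this one that reduce the cost of training and inference have the potential to ease access to the technology and democratize research. On the other hand, it also becomes easier to create and spread manipulated data, misinformation, and spam. Deliberate image manipulation ("deep fakes") is a common problem in this context, and women in particular are disproportionately affected by it.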

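Generative models can also reveal their training data, which is a serious concern when the data contains sensitive or personal information and was collected without explicit consent. The extent to which this applies to DMs of images, however, is not yet fully understood. Finally, deep learning modules tend to reproduce or exacerbate biases that are already present in the data. Diffusion models achieve better coverage of the data distribution than, for example, GAN-based approaches, but the extent to which the LDM approach misrepresents the data remains an important research question.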

For a discussion of the ethics of deep generative models, see: Emily Denton, "Ethical considerations of generative AI", CVPR, 2021.

6. Conclusion

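The paper presents Latent Diffusion Models (LDMs), a simple and efficient way to significantly improve the training and sampling efficiency of denoising diffusion models without degrading their quality. Based on the cross-attention conditioning mechanism, the approach holds up well against SOTA models across a wide range of conditional image synthesis tasks without task-specific architectures.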

Review

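Overall, the paper keeps the equations in the background and focuses on explaining the concept. Naturally, reaching SOTA on several tasks stands out, and the sample quality is truly impressive.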

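One point still confuses me a little because of a figure: I have not yet understood how DDIM is used as the sampler in this paper. I know it serves as a scheduler when running Stable-Diffusion, but I am now curious exactly how the scheduler algorithm is applied to the model architecture.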

Reference

https://arxiv.org/abs/2112.10752

https://velog.io/@hewas1230/StableDiffusion
