[Paper Review] TEXTure: Text-Guided Texturing of 3D Shapes

TEXTure: Text-Guided Texturing of 3D Shapes

paper: https://arxiv.org/abs/2302.01721
project page: https://texturepaper.github.io/TEXTurePaper/

TEXTure: Text-Guided Texturing of 3D Shapes
0. Abstract
1. Introduction
2. Related Work
3. Method
3.1. Text-Guided Texture Synthesis
3.2. Texture Transfer
3.3. Texture-Editing
4. Experiments
4.1. Text-Guided Texturing
4.2. Texture Capturing
4.3. Editing
5. Limitation
Reference

0. Abstract

TEXTure leverages a pretrained depth-to-image diffusion model to paint 3D objects. The problem with applying such models directly is that, although a depth-to-image model produces a plausible texture from a single viewpoint, the stochastic nature of the generation process leads to inconsistencies when an entire 3D object is textured.

To address this, the paper defines a trimap that partitions the rendered image into three regions (generate, refine, keep) and proposes a new diffusion sampling process that uses this trimap representation. Finally, the authors show that TEXTure can not only generate new textures but also edit and refine existing ones, using either a text prompt or user-provided scribbles.

1. Introduction

The paper focuses on texturing 3D objects and introduces TEXTure, a technique that seamlessly paints a given 3D input mesh. Unlike previous approaches, the authors choose to apply a full denoising process directly to rendered images using a depth-conditioned diffusion model. The method repeatedly renders the object from different viewpoints, applies a depth-based painting scheme, and projects the result back onto the mesh vertices or atlas. The authors found that this approach gives a significant boost in both running time and generation quality, but that applying the process naively leads to highly inconsistent texturing.

To reduce these inconsistencies, the paper introduces a dynamic partitioning of the rendered view into a trimap. The trimap consists of "keep", "refine", and "generate" regions and is estimated before each diffusion process. Freezing the "keep" regions during the diffusion process gives more consistent outputs, but the newly generated regions still lack global consistency (Figure 2 (B)). To improve global consistency in the "generate" regions, the paper combines a depth-guided and a mask-guided diffusion model (Figure 2 (C)). Finally, for the "refine" regions, the authors design a new process that repaints these regions while taking their existing texture into account. Together, these techniques produce highly realistic results in just a few minutes.

The method can be guided not only by a text prompt but also by an existing texture, without requiring a surface-to-surface mapping or an intermediate reconstruction step. Instead, it learns semantic tokens that represent a specific texture by building on An Image is Worth One Word (Textual Inversion) and DreamBooth, extending them to a depth-conditioned model and introducing learned viewpoint tokens. Text-based editing and editing of an existing texture map are also possible.

2. Related Work

Text-to-Image Diffusion Models

This part covers work ranging from Stable Diffusion (LDM) to techniques such as DreamBooth.

Texture and Content Transfer

Generating a texture for a 3D surface is harder than in 2D because both color and geometry must be taken into account. Prior work exists on geometric texture synthesis and on 3D color texture synthesis.

3D Shape and Texture Generation

3D shape and texture generation has recently attracted attention. Text2Mesh, Tango, and CLIP-Mesh use CLIP-space similarities as an objective to generate novel shapes and textures. More recently, DreamFusion introduced the use of a pretrained image diffusion model (Imagen) to generate text-conditioned 3D NeRF models. Its key component is the Score Distillation loss, which allows a pretrained 2D diffusion model to be used for optimizing a 3D NeRF scene. Latent-NeRF further showed that the Score Distillation loss can be used within the latent space of Stable Diffusion to generate 3D NeRF models.

For texture-only generation, Latent-Paint uses score distillation to paint latent texture maps, which are then decoded to RGB for the final colorization. Similarly, Magic3D uses score distillation to refine the texture and the initial shape. According to this paper, both methods suffer from slow convergence and less defined textures.

3. Method

3.1. Text-Guided Texture Synthesis

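The texture generation method relies on a depth-to-image diffusion model $\mathcal{M}_{depth}$ and an inpainting diffusion model $\mathcal{M}_{paint}$. Both are pretrained Stable Diffusion models and share the same latent space. During generation, the texture is represented as an atlas (an unwrapped texture image) through a UV mapping computed with XAtlas.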

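The method starts from an arbitrary initial viewpoint $v_0 = (r_0, \phi_0, \theta_0)$, where $r$ is the camera radius, $\phi$ is the azimuth angle, and $\theta$ is the elevation. $\mathcal{M}_{depth}$ is used to generate an initial colored image $I_0$ of the mesh as seen from $v_0$, conditioned on the rendered depth map $D_0$. $I_0$ is then projected onto the texture atlas $\mathcal{T}_0$. After this initialization step, the incremental colorization shown in Figure 3 begins, iterating through a fixed set of viewpoints.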

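For each viewpoint $v_t$, the mesh is rendered with a renderer $\mathcal{R}$ to obtain $D_t$ and $Q_t$, where $Q_t$ is the rendering of the mesh as seen from $v_t$ that takes all previous colorization steps into account. Finally, the next image $I_t$ is generated and projected back to produce the updated texture atlas $\mathcal{T}_t$, while taking $Q_t$ into account.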
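The order of operations in this incremental loop can be sketched in a few lines of Python. This is only an illustration of the control flow: `render`, `generate_image`, and `project` are hypothetical placeholder callables (not functions from the authors' code), and the toy "atlas" is just a list.

```python
from typing import Callable, List, Tuple

Viewpoint = Tuple[float, float, float]  # (r, phi, theta): radius, azimuth, elevation


def paint_incrementally(
    viewpoints: List[Viewpoint],
    render: Callable,          # hypothetical: (atlas, view) -> (depth D_t, render Q_t)
    generate_image: Callable,  # hypothetical: (depth, render, prompt) -> image I_t
    project: Callable,         # hypothetical: (atlas, image, view) -> updated atlas
    atlas,
    prompt: str,
):
    """Visit each viewpoint once: render, generate a view, project it back."""
    for view in viewpoints:
        depth, current_render = render(atlas, view)
        image = generate_image(depth, current_render, prompt)
        atlas = project(atlas, image, view)
    return atlas


# Toy usage with stand-in callables: the "atlas" is a list of painted views.
views = [(1.5, 0.0, 60.0), (1.5, 90.0, 60.0), (1.5, 180.0, 60.0)]
result = paint_incrementally(
    views,
    render=lambda atlas, v: (f"depth@{v[1]}", list(atlas)),
    generate_image=lambda d, q, p: f"{p} | {d} | seen={len(q)}",
    project=lambda atlas, img, v: atlas + [img],
    atlas=[],
    prompt="a wooden chair",
)
print(result[-1])  # a wooden chair | depth@180.0 | seen=2
```

Each generated view sees everything painted so far through the render of the current atlas, which is what makes the process incremental.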

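Once a single view has been painted, the generation task becomes harder, because both local and global consistency must be maintained during texture generation. The following parts look at a single iteration $t$ of the incremental painting process and at how the paper handles these issues.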

Trimap Creation

For a given viewpoint $v_t$, the rendered image is partitioned into three regions: "generate", "keep", and "refine". The "generate" regions are rendered areas that are seen for the first time and must be painted so that they match the previously painted regions.

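The distinction between "keep" and "refine" regions is more subtle and stems from the fact that coloring a mesh from an oblique angle can introduce large distortions. The distortion arises because the triangle has a small cross-section with the screen, which results in a low-resolution update of the mesh texture image $\mathcal{T}_t$. Specifically, the authors measure the cross-section of a triangle as the $z$ component $n_z$ of the face normal in the camera coordinate system.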
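As a small worked example of this measure, the sketch below computes the $z$ component of a triangle's unit normal. It assumes the vertices are already expressed in camera coordinates with the camera looking along the $z$ axis, and it takes the absolute value so that the result does not depend on the winding order; both choices are assumptions made for illustration, not details taken from the paper.

```python
import math


def face_normal_z(v0, v1, v2):
    """Absolute z component of the unit normal of triangle (v0, v1, v2).

    Vertices are (x, y, z) tuples assumed to be in camera coordinates.
    Returns 1.0 for a triangle facing the camera and 0.0 for one seen edge-on.
    """
    ax, ay, az = (v1[i] - v0[i] for i in range(3))
    bx, by, bz = (v2[i] - v0[i] for i in range(3))
    # Cross product of the two edge vectors gives the (unnormalized) normal.
    nx = ay * bz - az * by
    ny = az * bx - ax * bz
    nz = ax * by - ay * bx
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    if length == 0.0:  # degenerate triangle
        return 0.0
    return abs(nz) / length


facing = face_normal_z((0, 0, 0), (1, 0, 0), (0, 1, 0))
tilted = face_normal_z((0, 0, 0), (1, 0, 1), (0, 1, 0))
edge_on = face_normal_z((0, 0, 0), (0, 0, 1), (0, 1, 0))
print(facing, round(tilted, 4), edge_on)  # 1.0 0.7071 0.0
```

A triangle tilted by 45 degrees gets about 0.71, so a later view that sees it head-on (value 1.0) offers a better angle for painting it.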

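If the current view offers a better colorization angle for a region that was painted from a previous view, we would like to "refine" the existing texture. Otherwise, we would like to "keep" the texture unchanged to preserve consistency with previous views. To keep track of the regions that have already been seen and the cross-section at which they were painted, the method uses an additional meta-texture map $\mathcal{N}$ that is updated at every iteration. This additional map can be rendered efficiently together with the texture map at each iteration and is used to define the trimap partitioning.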
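The partitioning rule can be illustrated with a minimal per-pixel sketch. It assumes the meta-texture has already been rendered into an image of previously stored cross-section values (`None` where nothing was painted yet); the `margin` parameter is a hypothetical threshold added here so that tiny improvements do not trigger a repaint.

```python
GENERATE, REFINE, KEEP = "generate", "refine", "keep"


def make_trimap(prev_nz, curr_nz, margin=0.1):
    """Classify each pixel of the current view.

    prev_nz: rows of stored cross-section values (None = never painted).
    curr_nz: rows of cross-section values for the current view.
    margin:  hypothetical minimum improvement required to repaint.
    """
    trimap = []
    for prev_row, curr_row in zip(prev_nz, curr_nz):
        row = []
        for prev, curr in zip(prev_row, curr_row):
            if prev is None:
                row.append(GENERATE)   # seen for the first time
            elif curr > prev + margin:
                row.append(REFINE)     # current view has a better angle
            else:
                row.append(KEEP)       # previous paint is at least as good
        trimap.append(row)
    return trimap


prev = [[None, 0.2], [0.9, 0.5]]
curr = [[0.8, 0.9], [0.4, 0.55]]
trimap = make_trimap(prev, curr)
print(trimap)  # [['generate', 'refine'], ['keep', 'keep']]
```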

Masked Generation

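Because the depth-to-image diffusion process was trained to generate an entire image, the sampling process must be modified so that the "keep" parts of the image stay fixed. Following Blended Diffusion, at each denoising step a noised version of $Q_t$, i.e. $z_{Q_t}$, is injected into the diffusion sampling process at the "keep" regions, so that it blends seamlessly with the generated result. Specifically, the latent at sampling timestep $i$ is computed as

$$z_i \leftarrow z_i \odot m_{blended} + z_{Q_t} \odot (1 - m_{blended}) \tag{1}$$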

The mask $m_{blended}$ is defined in Eq. (2). In other words, for "keep" regions, $z_i$ is simply fixed to the original values.
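The blending step is an element-wise mix: where the mask is 1 the freshly sampled latent is used, and where it is 0 the noised latent of the existing render is injected. The sketch below shows this on tiny nested lists; in a real pipeline the inputs would be latent tensors, and the existing render would first be encoded and noised to the current timestep (not shown here).

```python
def blend_latents(z_i, z_q, mask):
    """Element-wise z_i * m + z_q * (1 - m) on nested lists of floats.

    z_i:  current sampled latent.
    z_q:  noised latent of the existing render (assumed already at step i).
    mask: 1.0 where the region is painted, 0.0 where it is kept.
    """
    return [
        [zi * m + zq * (1.0 - m) for zi, zq, m in zip(zi_row, zq_row, m_row)]
        for zi_row, zq_row, m_row in zip(z_i, z_q, mask)
    ]


z_i = [[1.0, 2.0], [3.0, 4.0]]
z_q = [[10.0, 20.0], [30.0, 40.0]]
mask = [[1.0, 0.0], [0.0, 1.0]]
blended = blend_latents(z_i, z_q, mask)
print(blended)  # [[1.0, 20.0], [30.0, 4.0]]
```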

Consistent Texture Generation

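Injecting the "keep" regions into the diffusion process makes the result blend better with the "generate" regions. Still, the further one moves away from the "keep" boundary and deeper into the "generate" regions, the more the output is governed by the sampled noise, and the less consistent it is with the previously painted regions.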

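The authors found that applying an inpainting diffusion model $\mathcal{M}_{paint}$, which was trained directly to complete masked regions, yields more consistent generations. Used alone, however, it deviates from the conditioning depth $D_t$ and may create new geometry. To benefit from both models, the paper introduces an interleaved process that alternates between the two models during the initial sampling steps. During sampling, the next noised latent $z_{i-1}$ is computed as

$$z_{i-1} = \begin{cases} \mathcal{M}_{depth}(z_i, D_t) & 0 \le i < 10 \\ \mathcal{M}_{paint}(z_i, \text{generate}) & 10 \le i < 20 \\ \mathcal{M}_{depth}(z_i, D_t) & 20 \le i < 50 \end{cases} \tag{3}$$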

When $\mathcal{M}_{depth}$ is applied, the noised latent is guided by the current depth $D_t$, whereas when $\mathcal{M}_{paint}$ is applied, the sampling process completes the "generate" regions in a globally consistent manner.
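A schedule of this kind can be written as a small lookup from the sampling step to the model that handles it. The sketch below is illustrative only: the window boundaries (steps 10 to 19 out of 50 for the inpainting model) are exposed as parameters and should be treated as assumed defaults, and the returned strings are placeholders for the two denoisers.

```python
def model_for_step(i, paint_start=10, paint_end=20):
    """Return which model handles sampling step i.

    "paint" stands for the inpainting model, "depth" for the depth-to-image
    model. The default window [paint_start, paint_end) is an assumption.
    """
    if paint_start <= i < paint_end:
        return "paint"
    return "depth"


schedule = [model_for_step(i) for i in range(50)]
print(schedule.count("depth"), schedule.count("paint"))  # 40 10
```

Keeping the inpainting window short and early reflects the trade-off in the text: the inpainting model sets up a globally consistent layout, while the depth model stays in charge of respecting the geometry for most of the steps.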

Refining Regions

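To handle "refine" regions, the diffusion process should generate new texture while still taking the previous values into account. The key observation is that using a checkerboard-like mask in the first steps of the sampling process guides the noise toward the previous values. The checkerboard is applied during the first 25 sampling steps, and the resulting mask is

$$m_{blended} = \begin{cases} 0 & \text{keep} \\ \text{checkerboard} & \text{refine} \wedge i \le 25 \\ 1 & \text{refine} \wedge i > 25 \\ 1 & \text{generate} \end{cases} \tag{2}$$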

A value of 1 means that the region should be painted; otherwise it is kept. The blending mask is illustrated in Figure 3.
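The mask logic can be sketched per pixel as follows. "keep" pixels are always 0, "generate" pixels are always 1, and "refine" pixels follow a checkerboard for the first 25 steps and become 1 afterwards. The checkerboard cell size `cell` and the convention that `step` counts sampling steps from the start are assumptions made for this illustration.

```python
def blended_mask(trimap, step, cell=2, checker_steps=25):
    """Build the blending mask for one sampling step.

    trimap: rows of "keep" / "refine" / "generate" labels.
    step:   index of the current sampling step (assumed to count from the start).
    cell:   hypothetical checkerboard cell size in pixels.
    """
    mask = []
    for y, row in enumerate(trimap):
        out = []
        for x, region in enumerate(row):
            if region == "keep":
                out.append(0)
            elif region == "refine" and step <= checker_steps:
                out.append((x // cell + y // cell) % 2)  # alternate paint/keep
            else:
                out.append(1)  # "generate", or "refine" after the early steps
        mask.append(out)
    return mask


demo_trimap = [
    ["keep", "refine", "refine", "generate"],
    ["refine", "refine", "keep", "generate"],
]
early = blended_mask(demo_trimap, step=5, cell=1)
late = blended_mask(demo_trimap, step=30, cell=1)
print(early)  # [[0, 1, 0, 1], [1, 0, 0, 1]]
print(late)   # [[0, 1, 1, 1], [1, 1, 0, 1]]
```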

Texture Projection

To project $I_t$ back onto the texture atlas $\mathcal{T}_t$, the paper applies gradient-based optimization of a loss $\mathcal{L}_t$ over the values of $\mathcal{T}_t$.

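To make the texture seams between projections from different views smoother, a soft mask $m_s$ is applied at the boundaries of the "refine" and "generate" regions:

$$m_s = m_h * g, \qquad m_h = \begin{cases} 0 & \text{keep} \\ 1 & \text{refine} \cup \text{generate} \end{cases}$$

where $g$ is a 2D Gaussian blur kernel.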
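The soft mask can be illustrated by blurring a hard mask (0 on "keep", 1 on "refine" and "generate") with a small Gaussian kernel. The sketch below uses a separable blur on nested lists with clamped borders; the kernel radius, sigma, and border handling are assumptions chosen for brevity, not values from the paper.

```python
import math


def gaussian_kernel(radius, sigma):
    """Normalized 1D Gaussian weights of length 2 * radius + 1."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(grid, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(grid[0])
    return [
        [sum(w * row[min(max(x + k - radius, 0), width - 1)]
             for k, w in enumerate(kernel))
         for x in range(width)]
        for row in grid
    ]


def transpose(grid):
    return [list(col) for col in zip(*grid)]


def soft_mask(trimap, radius=1, sigma=1.0):
    """Hard mask (0 on "keep", 1 elsewhere) blurred with a 2D Gaussian."""
    hard = [[0.0 if region == "keep" else 1.0 for region in row] for row in trimap]
    kernel = gaussian_kernel(radius, sigma)
    horizontal = blur_rows(hard, kernel)
    return transpose(blur_rows(transpose(horizontal), kernel))


demo = [["keep"] * 3 + ["generate"] * 3 for _ in range(3)]
m_s = soft_mask(demo)
print([round(v, 3) for v in m_s[0]])
```

The values ramp from 0 inside the "keep" area to 1 inside the painted area, so pixels near the seam contribute only partially when the new view is projected onto the atlas.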

Additional Details

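The texture is represented as a 1024x1024 atlas.

The rendering resolution is 1200x1200.

For the diffusion process, the inner region is resized to 512x512 and matted onto a realistic background.

Every shape is rendered from 8 viewpoints around the object plus 2 additional views (top and bottom).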
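A fixed viewpoint set of this shape (8 views around the object plus top and bottom) could be generated as below. The radius, the elevation of the side views, and the angle convention (elevation measured from the up axis, so 0 is the top view and 180 the bottom view) are all assumptions made for this sketch.

```python
def default_viewpoints(radius=1.5, side_elevation=60.0, n_around=8):
    """Return (radius, azimuth_deg, elevation_deg) tuples.

    n_around evenly spaced azimuths, followed by a top and a bottom view.
    All default values are hypothetical.
    """
    views = [(radius, k * 360.0 / n_around, side_elevation) for k in range(n_around)]
    views.append((radius, 0.0, 0.0))    # top view
    views.append((radius, 0.0, 180.0))  # bottom view
    return views


viewpoints = default_viewpoints()
print(len(viewpoints))                   # 10
print([v[1] for v in viewpoints[:8]])    # [0.0, 45.0, 90.0, 135.0, 180.0, 225.0, 270.0, 315.0]
```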

3.2. Texture Transfer

Having successfully generated a new texture for a given 3D mesh, the next question is how to transfer a given texture onto a new, untextured mesh. The texture can be captured in two ways: from a painted mesh, or from a small set of images. The texture transfer method builds on An Image is Worth One Word (Textual Inversion), which learns a concept token, and on DreamBooth, which fine-tunes the model.

Spectral Augmentations

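Since the goal is a token that represents the input texture and not the original geometry, the method should learn a common token that covers a range of geometries carrying the input texture. This requires disentangling the texture from its geometry, which also improves the generalization of the fine-tuned diffusion model. To this end, the paper proposes spectral augmentations, which randomly deform the mesh by inflating or shrinking it, as illustrated in the paper's figure.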

Texture Learning

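Applying spectral augmentations yields a large number of image and depth-map pairs for the input shape. The augmented shapes are rendered from several viewpoints (left, right, top, bottom, front, back) and pasted onto colored backgrounds. Given the rendered images, and following An Image is Worth One Word, the method optimizes tokens using prompts of the form "a $\langle D_v \rangle$ photo of a $\langle S_{texture} \rangle$". Here $\langle D_v \rangle$ is a learned token representing the view direction of the rendered image and $\langle S_{texture} \rangle$ is the token representing the texture. There are six directional tokens, each shared among images rendered from the same direction, and a single texture token shared among all images. In addition, the model is fine-tuned with DreamBooth to capture the input texture better (see Figure 4). After training, the Stable Diffusion model inside TEXTure is replaced with the fine-tuned model, which is then used to color the target shape.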
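The prompt construction can be sketched as a simple template with one placeholder per view direction and one shared texture placeholder. The token strings below (`<view_front>`, `<texture>`, and so on) are hypothetical names chosen for illustration; in practice each would be registered as a new token whose embedding is learned.

```python
DIRECTIONS = ["front", "back", "left", "right", "top", "bottom"]
TEXTURE_TOKEN = "<texture>"  # hypothetical placeholder, shared by all images


def direction_token(direction):
    """Hypothetical placeholder string for the learned view-direction token."""
    return f"<view_{direction}>"


def training_prompt(direction):
    """Prompt for an image rendered from the given direction."""
    if direction not in DIRECTIONS:
        raise ValueError(f"unknown direction: {direction}")
    return f"a {direction_token(direction)} photo of a {TEXTURE_TOKEN}"


prompts = {d: training_prompt(d) for d in DIRECTIONS}
print(prompts["front"])  # a <view_front> photo of a <texture>
```

Sharing the texture token across all prompts while varying only the direction token is what lets the direction tokens absorb view-dependent appearance, leaving the texture token to describe the texture itself.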

Texture from Images

Unlike standard inversion techniques (An Image is Worth One Word, DreamBooth), the concepts learned with this method represent mostly texture rather than structure, because they are trained with a depth-conditioned model. This potentially makes them better suited for texturing 3D shapes.

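For this setting, a pretrained U2Net is used to segment the salient object in each image; scale and crop augmentations are applied, and the result is pasted onto a randomly colored background. As a result, the method successfully learns a semantic concept from images and applies it to a 3D shape without any explicit reconstruction stage. (The authors see this as a new opportunity for generating textures inspired by real objects.)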
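The final compositing step (pasting the segmented object onto a randomly colored background) can be sketched as plain alpha blending. The saliency mask is assumed to be given, for example as the output of a segmentation network, and images are represented as nested lists of RGB tuples to keep the example dependency-free.

```python
import random


def paste_on_random_background(image, alpha, rng):
    """Composite an object over a uniformly colored random background.

    image: rows of (r, g, b) tuples with values in 0..255.
    alpha: rows of floats in [0, 1], 1.0 where the salient object is
           (assumed to come from a segmentation model).
    rng:   random.Random instance used to draw the background color.
    """
    background = tuple(rng.randint(0, 255) for _ in range(3))
    composite = [
        [
            tuple(round(a * c + (1.0 - a) * b) for c, b in zip(pixel, background))
            for pixel, a in zip(img_row, a_row)
        ]
        for img_row, a_row in zip(image, alpha)
    ]
    return composite, background


image = [[(255, 0, 0), (0, 255, 0)]]
alpha = [[1.0, 0.0]]
composite, background = paste_on_random_background(image, alpha, random.Random(0))
print(composite[0][0])                 # (255, 0, 0): object pixel is preserved
print(composite[0][1] == background)   # True: non-object pixel becomes background
```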

3.3. Texture-Editing

Text-based Editing

With trimap-based TEXTuring, 2D editing techniques can be extended to a full mesh. For text-based editing, the entire texture map is marked as a "refine" region, and the TEXTuring process is applied so that the given texture becomes aligned with the new text prompt.

Scribble-based Editing

In addition, scribble-based editing lets users modify the texture map directly. To do so, the regions to be changed are marked as "refine" and the regions to be preserved are marked as "keep".
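Both editing modes amount to choosing the initial region labels before running the painting process. A minimal sketch, assuming the labels are stored as a grid over the texture map and the scribble is given as a boolean mask:

```python
def text_edit_labels(height, width):
    """Text-based editing: the whole texture map is marked "refine"."""
    return [["refine"] * width for _ in range(height)]


def scribble_edit_labels(scribble_mask):
    """Scribble-based editing: scribbled texels are "refine", the rest "keep"."""
    return [["refine" if touched else "keep" for touched in row]
            for row in scribble_mask]


scribble = [
    [False, True, True],
    [False, False, True],
]
text_labels = text_edit_labels(2, 3)
scribble_labels = scribble_edit_labels(scribble)
print(scribble_labels)  # [['keep', 'refine', 'refine'], ['keep', 'keep', 'refine']]
```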