[Paper Review] Attention-Based Lip Audio-Visual Synthesis for Talking Face Generation in the Wild (2022)

Paper Review/Face 2022. 10. 14. 14:47

1. Introduction

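In the field of talking face generation, this paper proposes AttnWav2Lip, a model that adds attention modules to achieve more accurate lip-sync. More specifically, it integrates a spatial attention module and a channel attention module so that the network pays more attention to generating the lip region, the most important part of the face image. According to the authors, this is the first paper to introduce an attention mechanism into talking face generation.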

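The original Wav2Lip model uses temporal context from consecutive frames to achieve lip-sync inside the network. It also uses a pre-trained lip-sync discriminator so that the generated video can be trained adversarially. (Evaluation metrics were also developed using this pre-trained model.) However, because the lip region accounts for only 4% of the whole video, the authors note that this pre-training focuses more on visual information than on the relationship between audio and lips. An attention mechanism, which can concentrate on the important parts, is therefore a good fit.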

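In this paper, attention modules are integrated into the Wav2Lip model so that the feature maps learn, along the channel and spatial axes, which positions to emphasize and which to suppress. The approach is validated with the LSE metrics reviewed in an earlier post.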

2. Attention-based lip audio-visual synthesis

2.1 Wav2Lip

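First, the losses of the original Wav2Lip are as follows.

L1 : reconstruction loss between the generated frames and the ground-truth (gt) frames -> minimized

L_sync : the probability that audio and video are in sync, computed from the video and audio embeddings produced by the pre-trained model

L_gen : how the discriminator judges the authenticity of the generator's output (same as the usual GAN loss)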
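The three terms above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the sync probability is assumed to be the cosine similarity of non-negative embeddings as in the Wav2Lip paper, and the weights s_w and s_g are placeholder values chosen for the example.

```python
import numpy as np

def l1_reconstruction(generated, target):
    """L1: mean absolute error between generated and ground-truth frames."""
    return float(np.mean(np.abs(generated - target)))

def sync_probability(video_emb, audio_emb, eps=1e-8):
    """Cosine similarity of video/audio embeddings, read as P(in sync).
    Assumes non-negative embeddings so the value lies in [0, 1]."""
    dot = np.sum(video_emb * audio_emb, axis=-1)
    norm = np.linalg.norm(video_emb, axis=-1) * np.linalg.norm(audio_emb, axis=-1)
    return dot / np.maximum(norm, eps)

def sync_loss(video_emb, audio_emb, eps=1e-8):
    """L_sync: negative log of the sync probability, averaged over the batch."""
    p = np.clip(sync_probability(video_emb, audio_emb), eps, 1.0)
    return float(np.mean(-np.log(p)))

def generator_adv_loss(d_fake, eps=1e-8):
    """L_gen: generator side of the usual GAN loss, log(1 - D(G(x)))."""
    return float(np.mean(np.log(np.clip(1.0 - d_fake, eps, 1.0))))

def total_loss(l_recon, l_sync, l_gen, s_w=0.03, s_g=0.07):
    """Weighted sum of the three terms (weights are assumed placeholders)."""
    return (1.0 - s_w - s_g) * l_recon + s_w * l_sync + s_g * l_gen
```

Perfectly aligned embeddings give a sync probability of 1 and therefore a sync loss of 0, which is the behavior the generator is pushed toward.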

2.2 Attention Module

The attention module added in this paper, attn, computes an attention map W for a given feature map F inside Wav2Lip, then multiplies the two to obtain the final refined feature map F'.
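A small sketch of this multiplication, assuming it is an element-wise product in which W is broadcast over F (the convention used by CBAM; the shapes and the helper name `apply_attention` are made up for illustration):

```python
import numpy as np

def apply_attention(feature_map, attention_map):
    """F' = W * F, element-wise; W is broadcast to the shape of F."""
    return feature_map * attention_map

# Feature map laid out as (channels, height, width).
F = np.ones((4, 8, 8))

# A per-channel map has shape (C, 1, 1); a per-position map has shape (1, H, W).
W_channel = np.full((4, 1, 1), 0.5)
W_spatial = np.zeros((1, 8, 8))
W_spatial[0, 4:, :] = 1.0  # keep only the lower half, e.g. where the mouth is

F_channel = apply_attention(F, W_channel)
F_spatial = apply_attention(F, W_spatial)
```

Either way the output keeps the shape of F, so the module can be dropped between existing layers without changing the rest of the network.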

2.2.1 Spatial Attention Module (SAM)

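SAM infers a spatial attention map that decides which positions to emphasize and which to suppress. To do so, it applies pooling operations along the channel axis of the feature map F (according to Woo et al. (2018), this is effective at highlighting informative regions) and concatenates the results. From this concatenated descriptor, the spatial attention map M_s is obtained.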
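A rough NumPy sketch of SAM is given below. The pooling and concatenation follow the description above; the 7x7 convolution and the sigmoid that turn the pooled descriptor into M_s are taken from CBAM (Woo et al., 2018) and are an assumption here, as are the random weights and all function names.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, kernel):
    """Naive zero-padded convolution: x (C_in, H, W), kernel (C_in, k, k) -> (H, W)."""
    _, h, w = x.shape
    k = kernel.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(xp[:, i:i + k, j:j + k] * kernel)
    return out

def spatial_attention(feature_map, kernel):
    """Return M_s with shape (1, H, W) for a feature map of shape (C, H, W)."""
    avg_pool = feature_map.mean(axis=0, keepdims=True)  # pool along channels
    max_pool = feature_map.max(axis=0, keepdims=True)
    pooled = np.concatenate([avg_pool, max_pool], axis=0)  # (2, H, W)
    return sigmoid(conv2d_same(pooled, kernel))[None]

rng = np.random.default_rng(0)
F = rng.standard_normal((16, 8, 8))
kernel = rng.standard_normal((2, 7, 7)) * 0.1  # stand-in for learned weights
M_s = spatial_attention(F, kernel)
F_refined = F * M_s  # every channel is scaled by the same per-position weight
```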

2.2.2 Channel Attention Module (CAM)

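CAM exploits the relationships between channels. It tells the network what is meaningful in the input feature; in other words, CAM infers a weight for each channel of the feature map. First, average pooling and max pooling are used to produce two spatial context descriptors. Both are then fed into a shared conv layer and passed through a sigmoid layer to produce the channel attention map M_c.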
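The same idea for CAM, again only as a sketch. The shared layer is modeled as two 1x1 convolutions (equivalent to a small MLP on the pooled vectors) with a reduction ratio r, and the two branches are summed before the sigmoid; these details come from CBAM and are assumptions, not something this review confirms for AttnWav2Lip.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feature_map, w1, w2):
    """Return M_c with shape (C, 1, 1) for a feature map of shape (C, H, W)."""
    avg_desc = feature_map.mean(axis=(1, 2))  # (C,) average over all positions
    max_desc = feature_map.max(axis=(1, 2))   # (C,) maximum over all positions

    def shared(desc):
        # The same weights are applied to both descriptors.
        hidden = np.maximum(w1 @ desc, 0.0)   # ReLU
        return w2 @ hidden

    logits = shared(avg_desc) + shared(max_desc)
    return sigmoid(logits)[:, None, None]

rng = np.random.default_rng(0)
C, r = 16, 4  # channel count and reduction ratio (both hypothetical)
F = rng.standard_normal((C, 8, 8))
w1 = rng.standard_normal((C // r, C)) * 0.1  # stand-ins for learned weights
w2 = rng.standard_normal((C, C // r)) * 0.1
M_c = channel_attention(F, w1, w2)
F_refined = F * M_c  # every position is scaled by the same per-channel weight
```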

2.3 Attention based Wav2Lip

Through the attention modules we obtain two attention maps, M_c and M_s, which focus on 'what' and 'where' respectively. The features refined by the attention maps from these two steps are as follows.
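One common way to write this refinement, assuming the sequential channel-then-spatial ordering of CBAM (Woo et al., 2018) rather than an ordering confirmed for AttnWav2Lip, is shown below. Here the symbol \otimes denotes element-wise multiplication with broadcasting, and F'' is a name introduced only for this sketch.

```latex
F'  = M_c(F) \otimes F, \qquad
F'' = M_s(F') \otimes F'
```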

The resulting overall architecture is as follows.

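One caveat: the pixels of a high-resolution input image carry very weak semantic information, which can conflict with the region properties of the attention mechanism. For this reason, the attention module is not applied directly to the image; instead, the values processed by a CNN are fed into the attention module.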

paper : https://arxiv.org/abs/2203.03984
