Motivation

Figure 1 Video contents possess diverse motion cues and various restoration difficulties. Panels: a reference frame and four supporting frames (two preceding and two following).

• As a critical component of video super-resolution (VSR), inter-frame alignment significantly impacts overall performance. However, accurate pixel-level alignment is challenging.

Contributions

• We pioneer a novel representation of video modeled as instances, events, and scenes, providing both global and instance-specific semantics to boost the performance of video super-resolution.

• We propose a Semantics-Powered Attention Cross-Embedding (SPACE) block to bridge semantic priors and pixel-level features, making the network aware of the content it restores.

• We further design an Instance-Specific Semantic Embedding Encoder (ISEE) to perform inter-frame alignment in the instance-centric semantic space via an attention mechanism.

1. Features of the supporting frames are first coarsely warped to the reference frame via IMAGE.

• Implicit Masked Attention Guided Pre-Alignment (IMAGE)

• Semantics-Powered Attention Cross-Embedding (SPACE)

Figure 3 IMAGE Module. Diagram labels: Video Instance Masks, Mask Fill, Attention Masks, Embed, Masked Attention, Reference Frame Features, Supporting Frame Features, Warped Supporting Frame Features. Example instances: Person/#1, Surfboard/#2.
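The coarse warping step in IMAGE can be illustrated with a minimal sketch of masked attention. This is not the released implementation: the function name `masked_attention`, the per-token instance IDs standing in for the video instance masks, the identity projections, and the zero output for instances absent from the supporting frame are all assumptions made for illustration. Each reference-frame token attends only to supporting-frame tokens of the same instance; all other attention logits are filled with negative infinity before the softmax.

```python
import math


def masked_attention(ref_feats, sup_feats, ref_ids, sup_ids):
    """Coarsely warp supporting-frame features toward the reference frame.

    ref_feats, sup_feats: lists of token feature vectors (lists of floats).
    ref_ids, sup_ids: per-token instance IDs, a hypothetical stand-in for
    the video instance masks (tokens with equal IDs may attend to each other).
    """
    dim = len(ref_feats[0])
    warped = []
    for query, query_id in zip(ref_feats, ref_ids):
        # Scaled dot-product logits; mask-fill with -inf across instances.
        logits = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            if key_id == query_id
            else -math.inf
            for key, key_id in zip(sup_feats, sup_ids)
        ]
        if all(logit == -math.inf for logit in logits):
            # No token of this instance in the supporting frame:
            # this sketch simply outputs zeros (an assumption).
            warped.append([0.0] * dim)
            continue
        # Numerically stable softmax; exp(-inf) evaluates to 0.0.
        peak = max(logits)
        weights = [math.exp(logit - peak) for logit in logits]
        total = sum(weights)
        warped.append([
            sum(w * value[d] for w, value in zip(weights, sup_feats)) / total
            for d in range(dim)
        ])
    return warped


# Two reference tokens (instances 1 and 2), three supporting tokens.
warped = masked_attention(
    ref_feats=[[1.0, 0.0], [0.0, 1.0]],
    sup_feats=[[0.0, 2.0], [3.0, 0.0], [0.0, 4.0]],
    ref_ids=[1, 2],
    sup_ids=[2, 1, 2],
)
print(warped)
```

The first reference token can only see the single supporting token of instance 1, so it copies that token's feature; the second blends the two supporting tokens of instance 2.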

Figure 4 SPACE Block.

• Semantic Lens consists of a Semantic Extractor and a Pixel Enhancer. The Semantic Extractor decouples pixel-based video into instances, events, and scenes, and the Pixel Enhancer refines the original pixels and generates absent information under the guidance of the semantic priors.

Methodology

Figure 2 Overall pipeline for Semantic Lens. Diagram labels: Frame Encoder, Frame-Level Detector, Instance Encoder, Instance Decoder, X-Decoder, Text Encoder, Object Category, Semantic Extractor, Pixel Enhancer, Semantic-Guided Instance-Centric Alignment, IMAGE, MFSAB, SPACE, GPS, ISEE, Embed, Cross-Attention, Pixel, Semantic, Semantic Abundance, Feature Propagation, Reconstruction. Token legend: Frame-Wise Instance Token, Frame-Wise Global Token, Video-Wise Instance Token, Classificatory Token. Flow legend: Semantic Prior Flow, Pixel Feature Flow, Semantic to Pixel Flow.

• To improve inter-frame alignment, we formulate a semantic-guided instance-centric alignment schema.

Diagram legend: element-wise multiplication; element-wise addition; Q: query projected from the pixel-level feature; K/V: key/value projected from the semantic-level feature; scale/bias: parameters for feature modulation.

Block diagram labels: Multi-Frame Self-Attention Block (MFSAB), built from Layer Norm, Multi-Frame Self-Attention, and MLP; Semantics-Powered Attention Cross-Embedding Block.

2. SPACE is composed of the Global Perspective Shifter (GPS) and the Instance-Specific Semantic Embedding Encoder (ISEE). GPS modulates the features based on global semantics, and ISEE further aligns the features under the guidance of instance-specific semantics.
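The two SPACE components can be sketched in a few lines of plain Python. This is an illustrative approximation, not the authors' code: the function names `gps_modulate` and `isee_cross_attention` are hypothetical, the scale and bias are passed in directly rather than predicted from a global semantic token, and identity projections replace the learned query/key/value projections. GPS is shown as an element-wise multiplication followed by an element-wise addition; ISEE is shown as cross-attention with queries from pixel-level features and keys/values from semantic-level (instance) tokens.

```python
import math


def gps_modulate(pixel_feats, scale, bias):
    """GPS-style modulation: element-wise multiply by scale, then add bias."""
    return [
        [x * s + b for x, s, b in zip(feat, scale, bias)]
        for feat in pixel_feats
    ]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def isee_cross_attention(pixel_feats, instance_tokens):
    """ISEE-style cross-attention.

    Queries come from pixel-level features; keys and values come from
    semantic-level instance tokens. Identity projections are assumed.
    """
    dim = len(pixel_feats[0])
    output = []
    for query in pixel_feats:
        logits = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            for key in instance_tokens
        ]
        weights = softmax(logits)
        output.append([
            sum(w * value[d] for w, value in zip(weights, instance_tokens))
            for d in range(dim)
        ])
    return output


# Modulate with hypothetical global scale/bias, then attend to instance tokens.
modulated = gps_modulate([[1.0, 2.0]], scale=[2.0, 0.5], bias=[1.0, 0.0])
aligned = isee_cross_attention(modulated, [[10.0, 0.0], [0.0, 10.0]])
print(modulated, aligned)
```

In a trained model the scale, bias, and projections would be learned; fixing them here keeps the data flow between the two stages easy to follow.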


Experiments

Table 1 Quantitative comparison (PSNR↑ and SSIM↑) on the YTVIS (2019/2021/2022) datasets for the 4× VSR task.

Figure 5 Visual comparison of 4× VSR on the YTVIS-22 dataset (Frame 01, Clip 422). Panels: Bicubic, EDVR, BasicVSR, IconVSR, BasicVSR++, PSRT, Semantic Lens (Ours), GT.

Figure 6 The attribution results of adjacent frames. Columns: HR Reference Frame, Supporting Frame #1, Supporting Frame #2. Rows: LAM Attribution, Area of Contribution, LAM Results.

Table 2 Results of ablation studies on the YTVIS-19 dataset.
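For readers unfamiliar with the PSNR metric reported in the tables, the following is a minimal sketch of the standard definition, PSNR = 10 log10(MAX² / MSE). The function name `psnr`, the flat list of pixel intensities, and the default peak value of 255 are assumptions for illustration; the evaluation protocol used for the poster's numbers (for example, which color channels are compared) is not specified here.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized signals.

    reference, restored: flat sequences of pixel intensities.
    max_value: largest possible intensity (255 assumed for 8-bit images).
    """
    if len(reference) != len(restored):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        # Identical signals: PSNR is unbounded.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


print(psnr([100, 100], [110, 90]))
```

Higher values indicate a restored frame closer to the ground truth, which is why the tables mark PSNR with an upward arrow.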

Code available at: https://github.com/Tang1705/Semantic-Lens-AAAI24

The 38th Annual AAAI Conference on Artificial Intelligence

Semantic Lens: Instance-Centric Semantic Alignment for Video Super-Resolution

Qi Tang¹,², Yao Zhao¹,², Meiqin Liu¹,²*, Jian Jin, and Chao Yao

Institute of Information Science, Beijing Jiaotong University, Beijing, China

Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing, China

Alibaba-NTU Singapore Joint Research Institute, Nanyang Technological University, Singapore

School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, China