Motivation
Figure 1 Video contents possess diverse motion cues and various restoration difficulties.
Ø We pioneer a novel video representation that models video as
instances, events, and scenes, providing both global and
instance-specific semantics to boost video super-resolution performance.
Ø We propose a Semantics-Powered Attention Cross-Embedding (SPACE)
block that bridges semantic priors and pixel-level features, making
the network aware of the contents being restored.
Ø We further design an Instance-Specific Semantic Embedding Encoder
(ISEE) that performs inter-frame alignment in the instance-centric
semantic space via an attention mechanism.
Ø Inter-frame alignment is a critical clue for video super-resolution
(VSR) and significantly impacts overall performance. However,
accurate pixel-level alignment is a challenging task.
[Figure 1 panel labels: Reference Frame; four Supporting Frames.]
Contributions
1. Features of the supporting frames are first coarsely
warped to the reference frame via IMAGE.
Ø Implicit Masked Attention Guided Pre-Alignment (IMAGE)
Ø Semantics-Powered Attention Cross-Embedding (SPACE)
[Figure 3 diagram labels: Video Instance Masks (e.g. Person/#1, Surfboard/#2); Mask Fill; Attention Masks; Embed; Masked Attention; Reference Frame Features; Supporting Frame Features; Warped Supporting Frame Features.]
Figure 3 IMAGE Module.
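The pre-alignment idea behind IMAGE can be illustrated with a small masked-attention sketch. This is not the authors' implementation: it assumes scaled dot-product attention, encodes the instance masks as one hypothetical integer id per location (`q_ids`, `k_ids`), and returns a zero vector when an instance is missing from the supporting frame. All function and parameter names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats (may contain -inf)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(queries, keys, values, q_ids, k_ids):
    """Coarsely warp supporting-frame features toward the reference frame.

    queries: reference-frame feature vectors, one per location
    keys, values: supporting-frame feature vectors, one per location
    q_ids, k_ids: instance id per location (assumed mask encoding)
    Each query attends only to supporting locations with the same id.
    """
    dim = len(queries[0])
    out_dim = len(values[0])
    warped = []
    for q, q_id in zip(queries, q_ids):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            if k_id == q_id else float("-inf")
            for k, k_id in zip(keys, k_ids)
        ]
        if all(s == float("-inf") for s in scores):
            # Assumption: no matching instance, so emit zeros.
            warped.append([0.0] * out_dim)
            continue
        weights = softmax(scores)
        warped.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(out_dim)
        ])
    return warped


# Two reference locations (instances 1 and 2), three supporting locations.
reference = [[1.0, 0.0], [0.0, 1.0]]
supporting_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
supporting_values = [[10.0, 0.0], [0.0, 20.0], [30.0, 0.0]]
warped = masked_attention(reference, supporting_keys, supporting_values,
                          q_ids=[1, 2], k_ids=[1, 2, 1])
```

In this toy example the first reference location only sees the two supporting locations labeled as instance 1, so features from other instances cannot leak into its warped result.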
Figure 4 SPACE Block.
Ø Semantic Lens consists of a Semantic Extractor and a Pixel Enhancer.
The Semantic Extractor decouples pixel-based video into instances,
events, and scenes; the Pixel Enhancer refines the original pixels and
generates missing information under the guidance of these semantic priors.
Methodology
[Figure 2 diagram labels. Semantic Extractor: X-Decoder; Text Encoder; Object Category; Frame-Level Detector; Instance Encoder; Instance Decoder. Tokens: Frame-Wise Instance Token; Frame-Wise Global Token; Video-Wise Instance Token; Classificatory Token. Pixel Enhancer: Frame Encoder; Feature Propagation; Semantic-Guided Instance-Centric Alignment (IMAGE, MFSAB, SPACE with GPS and ISEE); Reconstruction. Legend: Semantic Prior Flow; Pixel Feature Flow; Semantic to Pixel Flow.]
Figure 2 Overall pipeline for Semantic Lens.
Ø To improve inter-frame alignment, we formulate a semantic-guided,
instance-centric alignment scheme.
[Figure 4 diagram labels. Panels: Semantics-Powered Attention Cross-Embedding Block; Multi-Frame Self-Attention Block (MFSAB), built from Layer Norm, Multi-Frame Self-Attention, and MLP layers applied to the features of adjacent frames. Legend: Element-Wise Multiplication; Element-Wise Addition; query: projected from the pixel-level feature; key/value: projected from the semantic-level feature; scale/bias: used for feature modulation.]
2. SPACE is composed of the Global Perspective Shifter (GPS) and the
Instance-Specific Semantic Embedding Encoder (ISEE). GPS
modulates the features based on global semantics, and ISEE further
aligns the features under the guidance of instance-specific semantics.
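The two SPACE components can be sketched in a few lines. This is a hedged illustration rather than the authors' code: GPS is reduced to an element-wise scale-and-bias modulation, with the scale and bias assumed to be already derived from the global semantics, and ISEE is reduced to scaled dot-product cross-attention with queries from pixel-level features and keys/values from semantic-level features. The learned projections are omitted and all names are hypothetical.

```python
import math


def gps_modulate(feature, scale, bias):
    """GPS-style modulation: element-wise multiply by scale, then add bias."""
    return [f * s + b for f, s, b in zip(feature, scale, bias)]


def isee_cross_attention(pixel_queries, semantic_keys, semantic_values):
    """ISEE-style cross-attention.

    pixel_queries: vectors projected from pixel-level features
    semantic_keys, semantic_values: vectors projected from
        instance-specific semantic features (projections assumed done)
    """
    dim = len(pixel_queries[0])
    out_dim = len(semantic_values[0])
    outputs = []
    for q in pixel_queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in semantic_keys
        ]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, semantic_values))
            for c in range(out_dim)
        ])
    return outputs


# Step 1: shift a pixel feature with global semantics (GPS).
modulated = gps_modulate([1.0, 2.0], scale=[2.0, 0.5], bias=[0.5, -1.0])

# Step 2: let the modulated feature query two instance tokens (ISEE).
instance_keys = [[1.0, 0.0], [0.0, 1.0]]
instance_values = [[2.0, 4.0], [6.0, 8.0]]
aligned = isee_cross_attention([modulated], instance_keys, instance_values)
```

Keeping the two steps as separate functions mirrors the description above: the global modulation is applied first, and the instance-specific attention then operates on the modulated features.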
[Diagram panel labels: GPS; ISEE.]
Experiments
[Figure 5 panel labels: Frame 01, Clip 422; Bicubic; EDVR; BasicVSR; IconVSR; BasicVSR++; PSRT; Semantic Lens (Ours); GT.]
[Figure 6 panel labels: HR Reference Frame; LAM Attribution; Area of Contribution; LAM Results; Supporting Frame #1; Reference Frame; Supporting Frame #2.]
Table 1 Quantitative comparison (PSNR↑ and SSIM↑) on the YTVIS (2019/2021/2022) datasets for the 4× VSR task.
Figure 5 Visual comparison of 4× VSR on the YTVIS-22 dataset.
Figure 6 The attribution results of adjacent frames.
Table 2 Results of ablation studies on the YTVIS-19 dataset.
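For readers unfamiliar with the PSNR metric reported in the tables, the following is a minimal sketch of the standard definition, 10·log10(MAX² / MSE). It assumes 8-bit pixel values flattened into plain lists; the poster does not state which color channels or crops were used for evaluation, so this is not the authors' evaluation script.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Every pixel off by one intensity level gives an MSE of 1.
score = psnr([10.0, 20.0, 30.0], [11.0, 21.0, 31.0])
```

Higher values indicate a restored frame closer to the ground truth, which is why the table marks PSNR with an upward arrow.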
Code Available at:
https://github.com/Tang1705/Semantic-Lens-AAAI24
The 38th Annual AAAI Conference on Artificial Intelligence
Semantic Lens: Instance-Centric Semantic Alignment for Video Super-Resolution
Qi Tang 1,2, Yao Zhao 1,2, Meiqin Liu 1,2*, Jian Jin 3 and Chao Yao 4*
1 Institute of Information Science, Beijing Jiaotong University, Beijing, China
2 Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing, China
3 Alibaba-NTU Singapore Joint Research Institute, Nanyang Technological University, Singapore
4 School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, China