
Have you ever wondered how those mesmerizing AI-generated art videos maintain their consistency frame by frame? Let me share our journey implementing various video style transfer algorithms during our project at NTNU.
The Mathematics Behind Video Style Transfer
Video style transfer extends beyond simply applying style transfer to individual frames. The key challenge lies in maintaining temporal coherence while preserving artistic style.
Content and Style Representation
The core idea involves separating content and style representations using convolutional neural networks. For a given frame $I_t$, we extract content features $F_c$ and style features $F_s$. The optimization objective combines content loss $L_c$ and style loss $L_s$:
$$L_{total} = \alpha L_c + \beta L_s + \gamma L_{temporal}$$
where $\alpha$, $\beta$, and $\gamma$ are weighting factors that balance the content, style, and temporal terms.
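To make the combination concrete, here is a minimal PyTorch sketch of how these terms can be assembled, assuming all features come from a single CNN layer, mean-squared feature distances for the content term, and Gram matrices for the style term; the helper names and default weights are illustrative, not our exact configuration:

```python
import torch
import torch.nn.functional as F

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    # features: (batch, channels, height, width) activations from a CNN layer
    b, c, h, w = features.shape
    flat = features.view(b, c, h * w)
    return flat @ flat.transpose(1, 2) / (c * h * w)

def combined_loss(gen_feats, content_feats, style_feats, temporal_term,
                  alpha=1.0, beta=1e3, gamma=1e1):
    # Content loss: feature-space distance between generated and content frame
    l_c = F.mse_loss(gen_feats, content_feats)
    # Style loss: distance between Gram matrices (feature correlations)
    l_s = F.mse_loss(gram_matrix(gen_feats), gram_matrix(style_feats))
    # The temporal term is computed separately (see the next section)
    return alpha * l_c + beta * l_s + gamma * temporal_term
```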
Temporal Coherence
The temporal loss $L_{temporal}$ ensures consistency between consecutive frames. Given frames $I_{t-1}$ and $I_t$, we compute:
$$L_{temporal}=\sum_{i,j}\left\|S(I_t)_{i,j}-S(I_{t-1})_{i,j}\right\|$$
where $S(I)$ denotes the stylized output and the sum runs over pixel positions $(i, j)$.
Here’s a simplified PyTorch sketch of the temporal loss function:

```python
from typing import Tuple

import torch

def temporal_loss(stylized: Tuple[torch.Tensor, torch.Tensor]) -> torch.Tensor:
    """Sum of per-pixel differences between consecutive stylized frames."""
    prev_frame, curr_frame = stylized  # S(I_{t-1}), S(I_t), each (B, C, H, W)
    # L2 norm over the channel dimension gives one value per pixel (i, j)
    per_pixel = torch.norm(curr_frame - prev_frame, dim=1)
    return per_pixel.sum()
```
Our Implementation Approaches
In our project, we explored multiple algorithms:
Gatys Method
The naive approach applies style transfer frame by frame. While simple, it produces noticeable flickering.
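As a sketch, the naive pipeline looks like the loop below, where `stylize_frame` stands in for a single-image Gatys-style optimization (a hypothetical helper, not a specific function from our code); because every frame is optimized from scratch with no shared state, nearly identical inputs can converge to visibly different results, which is the flicker we observed:

```python
def stylize_video_naive(frames, style_image, stylize_frame):
    """Apply single-image style transfer to each frame independently."""
    stylized = []
    for frame in frames:
        # No information is shared across frames, so consecutive frames can
        # land on different local optima and flicker.
        stylized.append(stylize_frame(frame, style_image))
    return stylized
```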
Ruder’s Temporal Constraint
We implemented a temporal constraint using DeepFlow for motion estimation. This significantly improved temporal coherence by penalizing large deviations between adjacent frames.
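One way to apply that constraint, sketched below under the assumption that the dense flow field has already been estimated (we used DeepFlow for this), is to warp the previous stylized frame along the flow and penalize the current stylized frame for deviating from it; the `warp` helper and mean-squared penalty are an illustration rather than our exact code:

```python
import torch
import torch.nn.functional as F

def warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a (B, C, H, W) frame with a dense (B, 2, H, W) flow field."""
    b, _, h, w = frame.shape
    # Pixel grid shifted by the flow vectors, normalized to [-1, 1] for grid_sample
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().to(frame.device)   # (H, W, 2)
    grid = grid.unsqueeze(0) + flow.permute(0, 2, 3, 1)             # (B, H, W, 2)
    grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0
    grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0
    return F.grid_sample(frame, grid, align_corners=True)

def flow_temporal_loss(prev_stylized, curr_stylized, flow):
    # Penalize deviation from the motion-compensated previous stylized frame
    return F.mse_loss(curr_stylized, warp(prev_stylized, flow))
```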
Instance Normalization for Speed
To accelerate the optimization process, we incorporated instance normalization:
```python
import torch
import torch.nn as nn

class InstanceNorm(nn.Module):
    """Normalize each feature map of each sample to zero mean and unit variance."""

    def __init__(self, num_features: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        # Learnable per-channel scale and shift
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width); statistics are per sample, per channel
        mean = x.mean(dim=(2, 3), keepdim=True)
        var = x.var(dim=(2, 3), keepdim=True, unbiased=False)
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return x_hat * self.weight.view(1, -1, 1, 1) + self.bias.view(1, -1, 1, 1)
```
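PyTorch also ships an equivalent built-in layer, `nn.InstanceNorm2d`, which appears in the network sketch further below.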
Feature Distribution Matching
For style representation, we adopted the Wasserstein metric to measure feature distribution differences. The Wasserstein distance between distributions $P$ and $Q$ is defined as:
$$W(P,Q) = \inf_{\gamma\in\Pi(P,Q)}\mathbb{E}_{(x,y)\sim\gamma}\left[\|x-y\|\right]$$
where $\Pi(P,Q)$ is the set of all joint distributions whose marginals are $P$ and $Q$.
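Evaluating this infimum exactly over high-dimensional feature distributions is intractable, so in practice it is approximated. One common approximation, shown here as an illustrative sketch rather than the exact variant in our repository, is the sliced Wasserstein distance: project the features onto random 1-D directions, where optimal transport reduces to comparing sorted projections:

```python
import torch

def sliced_wasserstein(feat_p: torch.Tensor, feat_q: torch.Tensor,
                       num_projections: int = 64) -> torch.Tensor:
    """Approximate W(P, Q) between two feature sets of shape (N, C)."""
    c = feat_p.shape[1]
    # Random unit directions in feature space
    directions = torch.randn(c, num_projections, device=feat_p.device)
    directions = directions / directions.norm(dim=0, keepdim=True)
    # Project both feature sets onto every direction and sort per projection
    proj_p, _ = torch.sort(feat_p @ directions, dim=0)  # (N, num_projections)
    proj_q, _ = torch.sort(feat_q @ directions, dim=0)  # (N, num_projections)
    # In 1-D, optimal transport matches sorted samples (assumes equal sample counts)
    return (proj_p - proj_q).abs().mean()
```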
Results and Performance
Our experiments showed that combining Johnson’s feed-forward network with instance normalization provided the best balance between speed and quality. The temporal constraint from Ruder’s method effectively eliminated flickering artifacts while preserving artistic style.
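For reference, a Johnson-style transform network interleaves convolutions with instance normalization in its downsampling and residual blocks; the sketch below uses PyTorch's built-in `nn.InstanceNorm2d`, and the kernel sizes and channel widths are illustrative defaults rather than our tuned configuration:

```python
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int, kernel: int, stride: int) -> nn.Sequential:
    # Convolution + instance normalization + ReLU, the basic unit of the network
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel, stride, padding=kernel // 2),
        nn.InstanceNorm2d(out_ch, affine=True),
        nn.ReLU(inplace=True),
    )

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.block = nn.Sequential(
            conv_block(channels, channels, 3, 1),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels, affine=True),
        )

    def forward(self, x):
        # Residual connection keeps content structure intact while layers restyle it
        return x + self.block(x)
```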
The implementation is available in our repository, with different approaches organized in separate modules for experimentation.
Through this project, we’ve demonstrated that effective video style transfer requires careful consideration of both artistic quality and temporal consistency. The combination of proper feature representation, temporal constraints, and efficient optimization techniques makes real-time video style transfer possible.