What is Wan 2.2? Performance, Free Usage, and Use Cases of Alibaba's Open-Source Video Generation AI

Adbrand Team

Even when video content is planned, shooting and editing costs often become a barrier. Wan 2.2 is an open-source AI model that generates cinematic videos from text or image inputs. It is a viable option for companies looking to bring video production in-house, lowering both development and operational hurdles.

This article summarizes everything from Wan 2.2’s overview to pricing, deployment methods, and use cases based on official information.


Overview of Wan 2.2

Wan 2.2 is a video generation foundation model released by Alibaba Group's Tongyi Lab. It offers three generation modes, text-to-video, image-to-video, and hybrid text+image-to-video, under the Apache 2.0 license. Its key feature is the adoption of a Mixture-of-Experts (MoE) architecture in the video diffusion model: 27 billion total parameters, of which only 14 billion are active during inference. This reduces computational load while expanding expressiveness, producing higher-quality and more consistent videos than the previous version (Wan 2.1).

Source: https://wan.video/


Key Features and Strengths

Wan 2.2 achieves a high balance between video generation quality and execution efficiency through innovations in model architecture and training methods. Here are the representative features:

High Quality & Speed with Mixture-of-Experts

The MoE (Mixture of Experts) architecture limits active parameters per inference step while maintaining a large-scale model structure overall. By switching between expert networks optimized for high-noise and low-noise conditions, computation costs are reduced while enabling diverse expressions. This maintains naturalness in video details and motion while keeping generation speed practical.
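The official description frames this as a high-noise expert handling the early denoising steps and a low-noise expert handling the later ones. The minimal Python sketch below illustrates that routing idea only; all names and the boundary value are hypothetical, not Wan 2.2's actual API.

def denoise(latent, timesteps, high_noise_expert, low_noise_expert, boundary=0.9):
    # Run the reverse-diffusion loop, evaluating only one expert per step.
    # Both experts together hold ~27B parameters, but each step touches
    # only the ~14B parameters of the expert that is selected.
    for t in timesteps:                    # t runs from 1.0 (pure noise) to 0.0
        if t >= boundary:                  # early, high-noise steps
            expert = high_noise_expert     # expert tuned for coarse layout and motion
        else:                              # later, low-noise steps
            expert = low_noise_expert      # expert tuned for fine detail
        latent = expert(latent, t)         # a single expert is active per step
    return latent

Because only one expert runs at each step, inference cost stays close to that of a 14B dense model even though total capacity is 27B.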

Cinematic Aesthetic Control

Lighting conditions, time of day, color tone, composition, camera angles: all of these cinematic production elements can be specified precisely via prompts. For example, a "backlit sunset scene" or a "wide-angle lens with bird's-eye view" can be requested in plain text, making it easier to achieve the intended look. Training on datasets labeled for artistic and cinematic attributes strengthens the model's aesthetic understanding and yields natural, film-like results, as in the example below.
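As an illustration, cinematic vocabulary can simply be folded into the prompt of the text-to-video command shown later in this article (the prompt wording here is our own example, not taken from the official documentation):

python generate.py --task t2v-A14B --size 1280*720 \
  --ckpt_dir ./Wan2.2-T2V-A14B \
  --prompt "A lone surfer at dawn, backlit golden-hour sunlight, wide-angle lens, low-angle tracking shot, warm teal-and-orange color grading"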

Complex Motion and Instruction Following

The model handles scenes that require precise motion, such as facial expressions, finger movements, and athletic action. This is achieved by expanding the training data over the previous version (+65% more images and +83% more videos) and by a design focused on accurately understanding and reflecting the semantics of the prompt. Even in scenes where multiple people or objects move simultaneously, consistency is maintained and the motion stays smooth.

High-Efficiency Model Runs on Consumer GPUs

A 5B model (TI2V-5B) incorporating a proprietary 3D VAE enables efficient high-resolution video generation. The VAE compresses the temporal and spatial axes aggressively without losing expressiveness, so the model can generate a 5-second 720p/24fps video in minutes even on a consumer GPU with 24GB of VRAM (e.g., an RTX 4090); see the sample command after the table below. It is well suited to operations that prioritize lower costs and environmental impact.

Model           | Architecture         | Total Parameters | 720p Generation Time (5-sec clip) | Required VRAM*
Wan 2.2 A14B    | MoE (2 experts)      | 27B (14B active) | ≈ 9 min                           | 80GB GPU
Wan 2.2 TI2V-5B | High-compression VAE | 5B               | ≈ 9 min                           | 24GB GPU
Wan 2.1 T2V-14B | Dense                | 14B              | ? (720p capable)                  | 80GB GPU

*VRAM: Minimum recommended memory for inference.
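For reference, a single-GPU invocation of the TI2V-5B model would look like the following. The flags mirror the official text-to-video example later in this article; note that the 5B model's 720p output size is commonly given as 1280*704 rather than 1280*720, so verify the exact arguments against the repository README.

python generate.py --task ti2v-5B --size 1280*704 \
  --ckpt_dir ./Wan2.2-TI2V-5B --offload_model True \
  --convert_model_dtype --t5_cpu \
  --prompt "A hummingbird hovering over a flower in slow motion."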

These technologies enable video quality competitive with commercial models despite being open-source.


Delivery Methods and Pricing

While Wan 2.2 itself is free, the cloud version “WAN AI” adopts a credit-based billing system as shown below:

Plan    | Monthly Fee (USD) | Monthly Credits | Estimated Generation     | Concurrent Limit**
Free    | 0                 | 0               | Unlimited (Relaxed Mode) | 1 job
Pro     | ~10               | 300             | 30–60 videos (5 sec)     | 2 jobs
Premium | ~40               | 1,200           | 120–240 videos (5 sec)   | Undisclosed (more than Pro)

**Concurrent Limit: the number of jobs that run immediately; additional jobs enter a separate wait queue.

Even without owning GPUs, fast generation is available simply by purchasing credits; the tiers above imply a rate of roughly 5–10 credits per 5-second video.


Setup and Usage Instructions

Clone the repository from the official GitHub and install dependencies with pip install -r requirements.txt. Model weights can be obtained from Hugging Face or ModelScope, specifying the local path with --ckpt_dir.
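Concretely, the initial setup might look like this (repository and model identifiers are as published on GitHub and Hugging Face at the time of writing; verify them before use):

git clone https://github.com/Wan-Video/Wan2.2.git
cd Wan2.2
pip install -r requirements.txt
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.2-T2V-A14B --local-dir ./Wan2.2-T2V-A14B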

Example of text-to-video generation on a single GPU:

python generate.py --task t2v-A14B --size 1280*720 \
  --ckpt_dir ./Wan2.2-T2V-A14B --offload_model True \
  --convert_model_dtype \
  --prompt "Two anthropomorphic cats in boxing gear fight on a spotlighted stage."

For environments with less than 80GB of VRAM, memory can be saved with options such as --offload_model True or --t5_cpu. The same script also handles image-to-video and text+image-to-video generation by switching the --task argument, as in the example below.
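For instance, an image-to-video run follows the same pattern (the --image flag and input path here follow the official repository's published examples; confirm against the README):

python generate.py --task i2v-A14B --size 1280*720 \
  --ckpt_dir ./Wan2.2-I2V-A14B --offload_model True \
  --convert_model_dtype \
  --image examples/i2v_input.JPG \
  --prompt "The cat slowly turns its head and blinks at the camera."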


Usage Scenarios and Examples

Wan 2.2 has been integrated into multiple platforms immediately after release. Additionally, developers and creators on X (formerly Twitter) have shared numerous videos generated with the model, providing valuable references for evaluating output quality.

  • ComfyUI has added official support for Wan 2.2, Alibaba's latest video generation model, including the FLF2V (first/last-frame-to-video) feature

  • OnLoRA.ai offers both T2V and I2V: fast, high quality, low cost, and easy to access

  • A demo testing I2V functionality: natural motion generated from non-AI (photographic) images

The proliferation of these third-party services makes video production using Wan 2.2 realistic even without deep machine learning knowledge in-house.


Deployment Checklist and Considerations

When deploying Wan 2.2, it is crucial to confirm GPU memory requirements and generation times. The A14B models require GPUs with 80GB of VRAM or more, while the lighter TI2V-5B model runs at practical speeds on 24GB-class GPUs.

The license is Apache 2.0, allowing commercial use, but rights management for generated content (portrait rights, copyright, etc.) is the user’s responsibility. Particularly given the ability to generate high-definition human videos, caution is needed when handling content resembling copyrighted works or real people.

Furthermore, running costs tend to be high when operating large models, so memory optimizations such as inference offloading and quantization are worth considering.


Summary

Wan 2.2 is a video generation AI built on an MoE architecture, combining cinematic visuals with faithful motion reproduction. Being open source makes on-premise deployment straightforward, while the cloud version and integrated SaaS offerings allow for rapid rollout. Start with the lightweight hybrid TI2V-5B model for technical validation and small projects, then consider the full-size A14B models or a paid plan once the approach proves effective.