Unlike classic approaches that slice video into a fixed grid of tokens and burn through compute, Apple has developed VideoFlexTok, an architecture that represents video as a variable-length token sequence ordered from coarse to fine. This can make video compression and generation substantially more efficient, because simple content needs only a few tokens and detail is added only where it is needed. For the market, this means multimodal agents could soon run directly on edge devices, generating complex video responses without renting cloud servers. The algorithmic elegance of VideoFlexTok is a direct response to competitors’ heavyweight solutions, such as OpenAI’s Sora.
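To make the coarse-to-fine idea concrete, here is a toy sketch. It is not Apple's method: VideoFlexTok is a learned video tokenizer, while this example swaps in a plain Haar wavelet transform on a short 1-D signal (think of one brightness value per frame). The function names `haar_encode`, `haar_decode`, and `squared_error` are hypothetical and exist only for this illustration. The point it shows is the general property the item describes: tokens are ordered from coarse to fine, so any prefix of the sequence decodes to a valid but rougher reconstruction.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_encode(signal):
    """Orthonormal Haar transform of a power-of-two-length signal.

    Returns coefficients ordered coarse to fine: the first value carries
    the overall level, later values add progressively finer detail.
    Toy stand-in for a learned tokenizer, NOT VideoFlexTok itself.
    """
    n = len(signal)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    approx = list(signal)
    details = []  # kept coarsest-first
    while len(approx) > 1:
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        detail = [(a - b) / SQRT2 for a, b in pairs]
        approx = [(a + b) / SQRT2 for a, b in pairs]
        details.insert(0, detail)  # coarser levels go in front of finer ones
    tokens = list(approx)  # single coarsest coefficient
    for level in details:
        tokens.extend(level)
    return tokens


def haar_decode(tokens, n):
    """Reconstruct a length-n signal from a prefix of the token sequence.

    Missing (finer) coefficients are treated as zero, so a short prefix
    gives a coarse reconstruction and the full sequence gives the original.
    """
    coeffs = list(tokens) + [0.0] * (n - len(tokens))
    approx = coeffs[:1]
    pos = 1
    while len(approx) < n:
        detail = coeffs[pos:pos + len(approx)]
        pos += len(approx)
        refined = []
        for a, d in zip(approx, detail):
            refined.append((a + d) / SQRT2)
            refined.append((a - d) / SQRT2)
        approx = refined
    return approx


def squared_error(x, y):
    """Sum of squared differences between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


if __name__ == "__main__":
    signal = [3.0, 5.0, 4.0, 8.0, 1.0, 1.0, 2.0, 6.0]
    tokens = haar_encode(signal)
    for k in range(1, len(tokens) + 1):
        approx = haar_decode(tokens[:k], len(signal))
        print(k, "tokens, squared error:", round(squared_error(signal, approx), 6))
```

Running the script prints the reconstruction error for every prefix length. Because the transform is orthonormal, the error never increases as tokens are added and reaches zero with the full sequence, which is the sense in which a decoder can trade token count for fidelity.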
Source: Apple ML Research / arXiv
Generative AI, Video, Apple, Tokenization, R&D