TwelveLabs, a leader in advanced video search and understanding technology, announced at AWS re:Invent the general availability of Marengo 3.0, its most capable and sophisticated video foundation model to date. This new release marks a significant leap forward in video intelligence—going far beyond simply watching clips. Marengo 3.0 interprets video like a human: it listens, reads, senses motion, and understands the flow and context of scenes. It can link a spoken phrase to a gesture minutes later, track objects and emotions over time, and process events with unprecedented depth. Starting today, customers can access Marengo 3.0 via Amazon Bedrock and the TwelveLabs platform.
Built on TwelveLabs’ multimodal architecture, Marengo 3.0 treats video as a living, dynamic information source. The model unifies audio, text, movement, visual cues, and contextual signals into a compact, searchable representation—making it possible for enterprises to explore and operationalize video data at scale. Designed for production from day one, Marengo 3.0 delivers immediate business impact. Internal testing shows a 50% reduction in storage costs and 2x faster indexing, enabling companies with large video libraries to extract more value while optimizing costs.
“Video represents 90% of digitized data, but that data has been largely unusable because it takes too long for humans to break down, and machines have been incapable of grasping and accounting for everything that happens in video,” said Jae Lee, CEO and co-founder of TwelveLabs. “Solving this problem has been our singular obsession. Now, Marengo 3.0 shatters the limits of what is possible. It is an incomparable solution for enterprises and developers.”
Smarter, Faster, Leaner: Setting a New Standard for Video Intelligence
The arrival of Marengo 3.0 positions TwelveLabs at the forefront of video intelligence infrastructure. Unlike competing systems that depend on stitched-together image and audio models or isolated frame-by-frame processing, Marengo 3.0 interprets video holistically—with a deep understanding of temporal and spatial relationships across complex scenes.
The model introduces major improvements across industries such as sports, media, entertainment, advertising, public safety, and government. Key capabilities include:
- Native Video Understanding: Purpose-built for video—not adapted from image models—providing foundation-level comprehension.
- Temporal + Spatial Reasoning: A unique ability to understand how objects, context, and events evolve across time and space.
- Sports Intelligence: A first-of-its-kind feature set including team, player, jersey number, and action tracking for rapid highlight and moment identification.
- Composed Multimodal Queries: Users can now combine text and image inputs in a single query to retrieve more precise results (see the invocation sketch after this list).
- Production-Ready Economics: With 50% lower storage needs and 2x faster indexing, organizations can reduce costs while unlocking new revenue possibilities.
- Enterprise Deployment: Available through Amazon Bedrock for seamless, secure implementation within existing AWS environments, or directly via TwelveLabs as a monthly service.
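For developers exploring the Bedrock path, the sketch below is a hypothetical illustration of a composed text-plus-image query against Marengo 3.0. The boto3 `bedrock-runtime` client and its `invoke_model` call are standard AWS SDK APIs; the model ID, request fields, and response shape are assumptions for illustration only and should be checked against the Bedrock model catalog and TwelveLabs' documentation.

```python
# Hypothetical sketch: a composed text + image query to Marengo 3.0 via Amazon
# Bedrock. The boto3 bedrock-runtime client and invoke_model call are real AWS
# SDK APIs; the model ID and request/response fields are illustrative
# assumptions, not the documented schema.
import base64
import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Assumed model identifier, for illustration only.
MODEL_ID = "twelvelabs.marengo-3-0-v1:0"

# Encode a local reference frame to include alongside the text query.
with open("reference_frame.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Assumed payload: one composed query mixing free text with a reference image.
request_body = {
    "inputType": "text_image",
    "text": "goal celebration by the player wearing jersey number 10",
    "image": {"base64": image_b64},
}

response = bedrock.invoke_model(
    modelId=MODEL_ID,
    body=json.dumps(request_body),
    contentType="application/json",
    accept="application/json",
)

result = json.loads(response["body"].read())
print(result)  # e.g. a query embedding to match against indexed video segments
```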
Marengo 3.0 supports an API-first developer workflow, offering compact embeddings and support for videos up to four hours long, double the limit of Marengo 2.7. The model is also multilingual, supporting 36 languages for global-scale video applications.
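To make the "compact, searchable representation" concrete, here is a minimal, self-contained sketch of how such embeddings are typically used: segment vectors from a video library are ranked against a query vector by cosine similarity. The vectors and dimensionality below are random stand-ins, not output from Marengo 3.0.

```python
# Minimal sketch of searching a library of compact video-segment embeddings:
# rank stored segment vectors against a query vector by cosine similarity.
# The vectors here are placeholders; in practice they would come from the model.
import numpy as np

rng = np.random.default_rng(0)
segment_embeddings = rng.normal(size=(10_000, 1024))  # placeholder library
query_embedding = rng.normal(size=1024)               # placeholder query vector

# Normalize once so cosine similarity reduces to a dot product.
library = segment_embeddings / np.linalg.norm(segment_embeddings, axis=1, keepdims=True)
query = query_embedding / np.linalg.norm(query_embedding)

scores = library @ query
top_k = np.argsort(scores)[::-1][:5]
print("best-matching segment indices:", top_k)
print("similarity scores:", scores[top_k])
```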
“TwelveLabs’ work in video understanding is transforming how entire industries manage their video capabilities, bringing unprecedented speed and efficiency to what has largely been a manual process,” said Nishant Mehta, VP of AI Infrastructure at AWS. “We are excited to be the first cloud provider to offer Marengo 3.0 to our customers through Amazon Bedrock, following great adoption from TwelveLabs’ previous Marengo and Pegasus models.”