Technical, Opinion

El NUEVO Gemini va a DESTRUIR muchas startups (The NEW Gemini Is Going to DESTROY Many Startups)

Xavier Mitjana

Google's new Gemini 2.5 Flash TTS model offers voice generation with detailed emotional direction. According to the presenter, it surpasses tools like ElevenLabs and is free to use. He demonstrates how to use it via Google AI Studio and how to build a custom voice app on top of it. He closes with a broader warning about AI models absorbing specialized software markets and concentrating power in a few large companies.

Summary

The video opens with a dramatic demonstration of two voice samples generated from the same text but with different emotional tones, accents, and interpretations — all produced for free using Google's new Gemini 2.5 Flash Text-to-Speech model. The presenter argues this capability alone threatens the business model of paid voice tools like ElevenLabs, which previously justified their cost through emotional expressiveness and multilingual support.

The presenter explains what makes this model technically distinct: unlike most TTS systems that support only simple emotion tags (e.g., 'whisper' or 'angry'), Gemini 2.5 Flash TTS accepts full natural-language prompt instructions that describe the scene, context, accent, rhythm, and emotional arc of a performance. This allows users to 'direct' the voice like an actor, with results far richer than tag-based alternatives. He also notes that the model ranks above ElevenLabs on the most important synthetic voice benchmarks.
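The difference between a tag and a full direction is easier to see when the prompt is assembled from its parts. The following is a minimal sketch, not the presenter's method: the `VoiceDirection` class, the `build_directed_prompt` function, and the section labels are all hypothetical names chosen for illustration, and no particular layout is known to be required by the model.

```python
from dataclasses import dataclass


@dataclass
class VoiceDirection:
    """Hypothetical container for the parts of a directed TTS prompt."""
    scene: str
    accent: str
    rhythm: str
    emotional_arc: str


def build_directed_prompt(direction: VoiceDirection, transcript: str) -> str:
    """Join the direction and the transcript into one plain-text prompt.

    The section labels are illustrative assumptions, not a required format.
    """
    sections = [
        ("Scene", direction.scene),
        ("Accent", direction.accent),
        ("Rhythm", direction.rhythm),
        ("Emotional arc", direction.emotional_arc),
        ("Transcript", transcript),
    ]
    # Skip empty sections so a partial direction still yields a clean prompt.
    return "\n\n".join(
        f"{label}: {text.strip()}" for label, text in sections if text.strip()
    )


direction = VoiceDirection(
    scene="A narrator alone in a lighthouse during a storm.",
    accent="",  # left empty on purpose: the section is omitted
    rhythm="Slow at first, then faster as the storm builds.",
    emotional_arc="Calm, then uneasy, then urgent.",
)
prompt = build_directed_prompt(direction, "The light went out at midnight.")
print(prompt)
```

The resulting string would be pasted into the Playground or sent as the text input of a TTS request; a tag-based system would only have room for the equivalent of a single word from the `Emotional arc` line.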

Three practical examples are walked through in Google AI Studio's Playground. The first uses simple inline emotion tags within text. The second uses the Composer mode to generate a dialogue between two different voices (Zephyr and Puck). The third — and most impressive — uses a comprehensive prompt with scene description, director's notes, rhythm guidance, and tagged transcript, producing a cinematic-quality voiceover. The same prompt is then rendered in Spanish, English, and Catalan, demonstrating strong multilingual performance.
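The two-voice dialogue can be pictured as a script of speaker-labelled lines plus a mapping from each speaker to a voice. The sketch below is an assumption about how one might prepare such a script before pasting it into a tool; `format_dialogue` and the speaker names are hypothetical, and only the voice names Zephyr and Puck come from the video.

```python
def format_dialogue(turns: list[tuple[str, str]], voices: dict[str, str]) -> str:
    """Render (speaker, line) turns as 'Speaker: line' text.

    Raises ValueError if a speaker has no assigned voice, so the mismatch is
    caught before any audio is requested.
    """
    missing = {speaker for speaker, _ in turns} - voices.keys()
    if missing:
        raise ValueError(f"No voice assigned for: {sorted(missing)}")
    return "\n".join(f"{speaker}: {line}" for speaker, line in turns)


# Hypothetical speaker names mapped to the two voices named in the video.
voices = {"Host": "Zephyr", "Guest": "Puck"}
script = format_dialogue(
    [("Host", "Welcome back to the show."), ("Guest", "Glad to be here.")],
    voices,
)
print(script)
```

Keeping the voice mapping separate from the script means the same dialogue can be re-rendered with different voices without touching the text.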

To simplify the prompt-writing process, the presenter introduces a custom AI assistant (a Gem) he built based on Google's documentation and best practices. This assistant asks clarifying questions and auto-generates optimized prompts for the TTS model. A vampire character example is used to illustrate how the assistant refines a basic request into a production-ready prompt.

The video then shifts to building a custom voice generation application using Google AI Studio's new 'Build' feature, which now generates five alternative UI designs before coding the app. The resulting app allows users to paste text, auto-enhance it with expressiveness tags, select language and voice, and generate downloadable audio files — essentially a free ElevenLabs clone.
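The "downloadable audio file" step of such an app usually comes down to wrapping raw audio samples in a file container. The following is a minimal sketch using only Python's standard `wave` module. It assumes the model returns raw 16-bit mono PCM at 24 kHz; that format is an assumption to verify against the actual response, and `pcm_to_wav_bytes` is a hypothetical helper, not part of the app built in the video.

```python
import io
import wave


def pcm_to_wav_bytes(
    pcm: bytes,
    sample_rate: int = 24000,  # assumed output rate; verify for your model
    channels: int = 1,         # assumed mono
    sample_width: int = 2,     # assumed 16-bit samples (2 bytes each)
) -> bytes:
    """Wrap raw PCM samples in a WAV container and return the file bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


# Stand-in for model output: 0.1 seconds of silence at the assumed format.
silence = b"\x00\x00" * 2400
wav_bytes = pcm_to_wav_bytes(silence)

# Read the result back to confirm the header matches what was written.
with wave.open(io.BytesIO(wav_bytes), "rb") as reader:
    frame_count = reader.getnframes()
    frame_rate = reader.getframerate()
```

An app would serve `wav_bytes` as the download; if the real audio uses a different sample rate or width, only the keyword arguments need to change.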

The presenter outlines concrete use cases: YouTubers translating voiceovers into five languages while preserving emotional tone, course creators entering new markets, and corporations replacing agency-level localization projects with a single prompt.

The video closes with a broader analytical point: this is not an isolated event. The presenter observes a recurring pattern where AI model updates absorb the functionality of entire specialized software categories — previously Figma features, Adobe tools, Blender functions, and now ElevenLabs. He warns that the application layer of software is being systematically consumed by foundation models, with ownership of those models concentrated in four or five companies in the US and a similar number in China. He frames this not as good or bad, but as a structural shift most people have not yet internalized.

Key Insights

  • The presenter argues that Gemini 2.5 Flash TTS has already surpassed ElevenLabs on the most important synthetic voice ranking, making the core value proposition of paid voice tools — directed emotional performance — available for free.
  • Unlike competing models that support only basic emotion tags like 'whisper' or 'angry,' Gemini 2.5 Flash TTS accepts full natural-language prompt instructions describing scene, context, accent, and emotional arc — making tag-based systems seem 'very, very limited' by comparison.
  • The presenter demonstrates that a single detailed prompt can be used to generate the same voiceover in multiple languages (Spanish, English, Catalan) while preserving the exact emotional interpretation, eliminating the need to re-record or re-direct for each language.
  • Google AI Studio's new Build feature now generates five alternative UI designs before writing any code, making it trivial to create a custom ElevenLabs-style voice application entirely for free on top of the Gemini TTS model.
  • The presenter identifies a structural pattern: AI model updates are systematically absorbing entire categories of specialized software (Figma features, Adobe tools, ElevenLabs), with ownership of those models concentrated in only four or five companies in the US and a similar number in China — framing each celebrated update as 'an application layer that disappears and a little more power concentrated at the top.'

Topics

  • Gemini 2.5 Flash TTS capabilities and emotional direction
  • Comparison with ElevenLabs and free vs. paid voice tools
  • Building a custom voice app in Google AI Studio
  • Multilingual voice generation with consistent emotional tone
  • AI models absorbing specialized software and concentrating market power
