Master Nano Banana 2 Instantly – The SECRET JSON Prompt Engineering Formula
This tutorial explains how to use JSON-structured prompts with Nano Banana 2 (an AI image generation model) to achieve precise, reproducible control over individual image elements. The presenter demonstrates how separating prompt elements into labeled JSON fields prevents unwanted changes during image editing. The video covers use cases including style transfer, character consistency, lighting control, object swapping, camera perspective transfer, and video generation.
Summary
The video opens by demonstrating the core problem with traditional text prompts for AI image generation: when all elements are mashed into a single text blob, changing one detail (like a chair color) causes unintended changes elsewhere. The presenter argues that JSON (JavaScript Object Notation) solves this by assigning each scene element its own labeled field, allowing the AI to modify one slot without disturbing others.
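To make the contrast concrete, here is a minimal sketch of the kind of structured prompt described (the field names are illustrative, not an official schema):

```json
{
  "scene": "modern living room, late afternoon",
  "elements": {
    "chair": { "type": "armchair", "color": "forest green", "material": "velvet" },
    "sofa": { "type": "three-seater", "color": "cream", "material": "linen" },
    "rug": { "pattern": "geometric", "color": "charcoal" }
  },
  "lighting": "soft window light from camera left"
}
```

Recoloring the chair now means changing only `elements.chair.color`; every other slot stays identical, which is what signals the model to leave those regions alone.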
The foundational workflow begins with dropping an existing image into Gemini and running the prompt 'Extract all the information from this image and convert it into structured JSON.' This reverse-engineers an image into its component parts — objects, materials, colors, positions — each with a stable ID that can be directly edited. The presenter notes this approach works well with Nano Banana 2 because it runs on Gemini's reasoning architecture and understands relationships between elements, unlike older diffusion models. Midjourney is explicitly called out as a poor fit for JSON prompting, since it is built for aesthetic exploration rather than structured control.
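The extraction step might return a structure along these lines (a hypothetical output shape; the actual fields vary from run to run):

```json
{
  "objects": [
    { "id": "obj_01", "name": "armchair", "material": "leather", "color": "tan", "position": "left foreground" },
    { "id": "obj_02", "name": "floor lamp", "material": "brass", "color": "gold", "position": "behind armchair" }
  ],
  "palette": ["tan", "gold", "off-white"],
  "composition": "eye-level, rule of thirds"
}
```

Because each object carries a stable ID, later edits can target `obj_01` directly instead of re-describing the entire scene.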
Practical access and pricing details are provided: Nano Banana 2 (NB2) is free in the Gemini app (up to 20 images/day at 1K resolution), with 2K available to paid subscribers and 4K via the API. API costs run approximately $0.08 per image for NB2 and $0.15 for Nano Banana Pro.
The presenter then walks through several advanced use cases. For style transfer, a 'photography technique JSON' is extracted from a reference image — capturing lighting setup, color grading, lens character, film stock, and post-processing style. Specific hardware names like 'Hasselblad' and 'Kodak Portra 400' are emphasized as meaningful, since NB2 was trained on real photographs and recognizes these references as distinct visual priors.
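A photography-technique block of this kind might look like the following sketch (the hardware and film names come from the video; the surrounding structure is illustrative):

```json
{
  "photography_technique": {
    "camera": "Hasselblad medium format",
    "film_stock": "Kodak Portra 400",
    "lighting_setup": "single large softbox camera right, low key",
    "color_grading": "warm highlights, lifted blacks",
    "lens_character": "80mm, shallow depth of field, gentle vignetting",
    "post_processing": "subtle grain, minimal sharpening"
  }
}
```

Pasting this block into another image's JSON transfers the look while leaving that image's subject and composition fields untouched.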
For character consistency, the presenter introduces a 'character bible': a locked JSON block describing a character's fixed attributes. This block is pasted verbatim into multiple scene prompts, with only the scene, lighting, and outfit fields changing. NB2 natively supports up to five characters and fourteen objects across as many as fourteen reference images, though the presenter advises capping reference photo uploads at six, since anything beyond that degrades structural accuracy due to conflicting signals. Keeping the accompanying text description minimal is also recommended, so the text and the reference images don't send the model contradictory cues.
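A character bible per this description might look like the sketch below: one locked block pasted verbatim into every prompt, with only the variable fields swapped out (all attribute values here are hypothetical):

```json
{
  "character_bible": {
    "name": "Mara",
    "age": "early 30s",
    "face": "oval, light freckles, green eyes",
    "hair": "shoulder-length auburn, center part",
    "build": "slim, average height"
  },
  "scene": "rain-soaked city alley at night",
  "lighting": "neon spill, cyan and magenta",
  "outfit": "charcoal trench coat"
}
```

Only `scene`, `lighting`, and `outfit` change between generations; the `character_bible` block is never edited.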
For environmental changes, the presenter explains how a lighting-focused JSON differs from an object JSON by targeting light sources, color temperature, shadow direction, and weather — leaving furniture untouched. A cautionary example is given: a field labeled 'exterior weather visible: true' caused the AI to remove curtains to expose the window. Deleting that single field resolved the issue.
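A lighting-only block might look like this (field names other than the quoted one are illustrative); the last field reproduces the problematic entry from the video, which the presenter fixed by deleting it outright:

```json
{
  "lighting": {
    "sources": ["window, camera left", "warm table lamp"],
    "color_temperature": "4500K, mixed daylight and tungsten",
    "shadow_direction": "long shadows toward camera right",
    "weather": "overcast",
    "exterior_weather_visible": true
  }
}
```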
Object swapping is demonstrated by extracting separate JSONs for the original room and a replacement chair, then merging them in Studio Assistant. The result correctly inherits the room's lighting, shadows, and pillow placement — something the presenter says would require 20 wasted text-prompt generations to achieve otherwise.
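Conceptually, the merge keeps the room's extracted JSON intact and replaces only the chair's entry with attributes from the replacement chair's JSON (a simplified sketch with hypothetical values):

```json
{
  "lighting": "golden hour through west-facing window",
  "objects": [
    { "id": "obj_01", "name": "chair", "style": "mid-century lounge", "material": "walnut and black leather" },
    { "id": "obj_02", "name": "sofa", "color": "cream", "pillows": "two, mustard yellow, on seat" }
  ]
}
```

Since only `obj_01` differs from the original extraction, the lighting, shadow, and pillow fields carry over unchanged, which is what produces the consistent result.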
Camera perspective transfer is described as the most advanced technique: isolating only the camera schema (focal length, aperture, perspective distortion, focal point) from one image and applying it to a different scene. A fisheye lens transfer example is shown, in which the AI invented plausible content beyond the original frame's edges while maintaining accurate perspective distortion.
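Isolating the camera schema means carrying over only a block like this and discarding the rest of the source image's JSON (field names are illustrative):

```json
{
  "camera": {
    "lens": "8mm fisheye",
    "focal_length_mm": 8,
    "aperture": "f/5.6",
    "perspective_distortion": "strong barrel distortion, curved horizon",
    "focal_point": "center of frame"
  }
}
```

Applied to a different scene's JSON, this block transfers the optics without importing any of the source image's content.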
The video closes with an extension of the JSON method to video generation via VO3.1, adding fields for motion, camera movement, duration, and audio. Timestamp-based scene prompting is mentioned as a community convention rather than an official API feature. A safety note is included: VO3.1 mutes audio when very young characters or baby animals are present, and aging subjects to young adults restores sound.
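Extending the same image JSON for video, per the video's description, adds motion and audio fields roughly like this (a community-style convention, not a documented VO3.1 schema):

```json
{
  "scene": "harbor at dawn, fishing boats at anchor",
  "camera_movement": "slow dolly forward",
  "motion": "gulls crossing frame left to right, gentle swell on the water",
  "duration_seconds": 8,
  "audio": "ambient waves, distant gull cries, no music"
}
```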
Throughout, the presenter promotes two paid platforms — Venngage for creating JSON cheat sheets and AI Master (the presenter's own platform) for watermark-free 4K generation and template storage.
Key Insights
- The presenter argues that Nano Banana 2 works well with JSON because it runs on Gemini's reasoning architecture and understands relationships between elements, meaning JSON is 'the model's native language' — unlike older diffusion models that merely match keywords to pixels.
- The presenter claims that uploading more than six reference images for character consistency actually degrades structural accuracy because the model receives conflicting signals, making six clean references more effective than fourteen mediocre ones.
- The presenter explains that specific camera and film hardware names like 'Hasselblad' and 'Kodak Portra 400' are not decorative — NB2 was trained on millions of real photographs, so these names unlock exact visual priors the model learned from that data, producing fundamentally different results than generic terms.
- The presenter warns that JSON fields containing words like 'visible,' 'dramatic,' or 'exposed' can force the AI to alter composition to prove the change — in their example, 'exterior weather visible: true' caused the model to remove curtains to expose the window.
- The presenter states that VO3.1 has a safety protocol that mutes audio whenever very young children or baby animals are present in a scene, and that aging subjects up to young adults immediately restores the audio.