AI Video Quality Framework

Creating structure where none existed — a multi-axis quality assessment system for AI-generated advertising

Role: Lead Product Designer (Strategy, Design Guidelines, Framework Development)
Team: Led junior-to-mid IC designers; XFN with Ads Engineering, AI/ML, Creative Tools, Policy
Scope: AI-generated short-form video ads on Instagram and Facebook
Context: Competitive pressure from Google and TikTok AI creative tools

Executive Summary

Meta's AI-powered creative tools were generating video ads for advertisers — but without a shared design framework for what makes a composed video effective. The system could produce assets, but had no principled way to evaluate whether those assets would capture attention, communicate a message, or feel professionally crafted. Existing guidelines, built for static images, didn't account for the temporal, sequential, and multi-sensory demands of video.

I led the design strategy to solve this: defining what "good" looks like for AI-generated video at the system level. The core contribution was reframing the problem from a list of compositional elements to a sequential dependency chain of five cognitive judgments — Capture → Comprehension → Coherence → Retention → Polish — where each judgment is prerequisite to the next. I structured prioritization around diagnosability rather than severity, and mapped every compositional component to the specific judgment it serves, giving engineering teams a principled build sequence and giving the organization a shared language for quality that scaled beyond any single ad format.

The Problem

Static guidelines in a dynamic medium

Meta's advertiser creative tools were expanding from static image generation into short-form video. The existing design guidelines (covering text overlays, visual hierarchy, and brand safety) were built for a single-frame context. They did not address what video demands: temporal sequencing, pacing, audio-visual coordination, hook architecture, scene transitions, motion coherence, and the critical relationship between what a viewer sees and when they see it.

Three tensions defined the landscape

The problem wasn't simply "write better guidelines." Three structural tensions shaped every decision:

Comprehensiveness vs. speed to ship. The competitive window was closing. Google and TikTok were shipping AI creative tools. We needed guidelines robust enough to prevent quality failures but scoped tightly enough to ship a credible v1.

SMB simplicity vs. agency control. Small business advertisers needed the system to make good decisions for them. Agencies needed the system to respect their creative intent. These are fundamentally different design problems sharing the same surface.

Creative freedom vs. brand safety. Generative AI introduces compositional risks that manual creative tools don't — hallucinated text, unnatural motion, visual incoherence between scenes. The guidelines had to create a quality floor without imposing a creative ceiling.

Strategic Reframe

From elements to judgments

The initial approach — cataloguing every compositional element (text, audio, animation, CTA, transitions, color grading, motion graphics) and writing rules for each — was necessary but insufficient. It produced a taxonomy without a hierarchy. Engineering teams couldn't prioritize. Designers couldn't sequence. The question "what matters most?" had no answer.

The deeper problem: elements don't have a single priority level. Pacing serves comprehension at one level (scenes held long enough to process), retention at another (tempo variation sustains engagement), and polish at a third (micro-pacing creates organic motion). Prioritizing "pacing" is meaningless without knowing which cognitive job it's serving in a given build phase.

I reframed the problem around the distinct cognitive judgments the system must make, organizing the viewer-facing axis of the framework (Axis 1, Attention Architecture) by the cognitive job each element performs for the viewer rather than by the element itself. Because elements serve multiple jobs and jobs require multiple elements, prioritization only works at the level of the job: the question becomes which job matters most, not which element.

Five judgments, sequentially dependent

The framework defines five cognitive judgments, each prerequisite to the next:

Capture — Did they stop? A binary gate at 0–1.5 seconds. The viewer's visual system decides before conscious processing begins. Every percentage point of failed capture is a percentage point of the total audience permanently lost; for those viewers, no downstream judgment ever operates.

Comprehension — Do they understand? Attention without comprehension is wasted. The viewer watches but retains nothing and takes no action. Operates across the full duration with highest intensity at 1.5–5 seconds. This is a gradient — the viewer can understand category but miss value prop, grasp message but miss CTA.

Coherence — Does it all belong together? The viewer's assessment that this is a unified piece, not a collection of individually competent fragments. Operates as a threshold — below it, the ad feels broken. Above it, the viewer accepts it as unified. This is the signature failure mode of AI-generated content.

Retention — Do they stay through the end? Sustained engagement that carries the viewer to the CTA. Visible in metrics — view-through rates provide a clear signal.

Polish — Does it feel crafted? The amplifier. Polish enhances all other judgments but nothing depends on it absolutely. Context-dependent — matters more for luxury brands, less for utility services.

The dependency chain is strict:

Capture → Comprehension → Coherence → Retention → Polish

Each judgment is prerequisite to the next. Failure at any point means downstream judgments cannot fully operate. This isn't a priority list — it's a dependency graph that dictates build sequencing.
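The chain can be read as an ordered gate check. The following is a minimal sketch of that reading, not the production evaluation logic: the names `Judgment`, `first_failure`, and `blocked_by` are hypothetical, and pass/fail booleans are a simplifying assumption (the text describes Comprehension as a gradient and Coherence as a threshold).

```python
from enum import IntEnum
from typing import Dict, List, Optional


class Judgment(IntEnum):
    """The five judgments; the integer value encodes position in the chain."""
    CAPTURE = 1
    COMPREHENSION = 2
    COHERENCE = 3
    RETENTION = 4
    POLISH = 5


def first_failure(results: Dict[Judgment, bool]) -> Optional[Judgment]:
    """Return the earliest failed judgment, or None if every judgment passes.

    A judgment missing from `results` is treated as failed (an assumption
    made here so that an unevaluated gate never counts as a pass).
    """
    for judgment in Judgment:  # enums iterate in definition order
        if not results.get(judgment, False):
            return judgment
    return None


def blocked_by(failed: Judgment) -> List[Judgment]:
    """Judgments downstream of a failure, which cannot fully operate."""
    return [j for j in Judgment if j > failed]
```

For example, an ad that captures attention but fails comprehension reports `Judgment.COMPREHENSION` as its first failure, and `blocked_by` lists Coherence, Retention, and Polish as the work that should wait.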

Prioritization through diagnosability

Beyond the dependency chain, I introduced a second lens for prioritization: diagnosability — how easy is it to detect that something went wrong, and identify what to fix?

Comprehension failures are the most dangerous because they're the hardest to detect. An ad can look polished, have decent view-through rates, and still completely fail to communicate its core message. The advertiser blames targeting before suspecting composition. These are silent killers that erode trust without producing a clear signal.

Retention failures are visible. View-through rates drop. The signal is clear, the fix is identifiable. The advertiser can say "people aren't watching" and iterate.

Coherence failures trigger one-trial rejection. An advertiser who receives an incoherent ad concludes the system doesn't understand their business. Trust recovery is extremely difficult. Early adopters churn and become vocal detractors. Critically, coherence failures can't be fixed by the advertiser without rebuilding the ad — the tool becomes "generates a starting point I have to redo" rather than "generates something I can use."

Polish failures are the most obvious and least damaging. Easy to see, easy to fix, rarely cause an advertiser to reject the system entirely.

Failures you can't see are failures you can't iterate on. Low-diagnosability failures erode advertiser trust quietly — the worst kind of churn driver.

Framework Architecture

Three-axis evaluation model

Beyond the cognitive judgment hierarchy, I developed a three-axis model that held the full complexity of the problem space:

Axis 1 — Attention Architecture (Viewer): What the viewer experiences. The five cognitive judgments — capture, comprehension, coherence, retention, and polish — evaluated across distinct time windows within the video.

Axis 2 — Platform Discoverability (System): What the platform needs. Text serves multiple masters simultaneously — viewer communication, Instagram's discovery algorithms, Meta's ad delivery optimization, and external search engines. These create optimization tensions that needed explicit prioritization decisions.

Axis 3 — Advertiser Experience (Business): What the advertiser needs. Speed to usable output, message fidelity, format adaptability across placements, and confidence that the system won't damage their brand. This axis is load-bearing for adoption — without it, the other two don't matter.

Time windows as engineering constraints

A key architectural insight: the model cannot treat all frames equally. First-second decisions are existentially important in a way seventh-second decisions are not. If there is a quality budget — some frames more carefully composed than others — it should be spent at judgment transition zones, not distributed uniformly across duration.

The hook window (0–1.5s) is governed by capture. The transition to body (1–3s) is the highest-risk handoff — capture to comprehension. The middle section (3–5s) is where coherence crystallizes and retention begins. The closing is governed by comprehension (CTA) and polish.

Transition points between judgments are the highest-risk moments in any video. Structuring guidelines around these windows — rather than around elements — gave engineering teams actionable implementation targets and made performance data directly mappable to specific judgment failures.
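One way to make the windows mappable to performance data is a simple timestamp lookup. This is an illustrative sketch only: `Window`, `WINDOWS`, and `judgments_at` are hypothetical names, the boundaries are the approximate ones given above, and the closing window's 5-second start is an assumption, since the text does not state where the closing begins.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Window:
    name: str
    start: float            # seconds, inclusive
    end: Optional[float]    # seconds, exclusive; None means "to end of video"
    judgments: Tuple[str, ...]


WINDOWS = (
    Window("hook", 0.0, 1.5, ("capture",)),
    # Overlaps the hook on purpose: this is the capture-to-comprehension handoff.
    Window("transition_to_body", 1.0, 3.0, ("capture", "comprehension")),
    Window("middle", 3.0, 5.0, ("coherence", "retention")),
    # Assumption: the closing is taken to start at 5.0s.
    Window("closing", 5.0, None, ("comprehension", "polish")),
)


def judgments_at(t: float) -> List[str]:
    """Judgments governing timestamp `t`, deduplicated, in window order."""
    seen: List[str] = []
    for w in WINDOWS:
        if w.start <= t and (w.end is None or t < w.end):
            for j in w.judgments:
                if j not in seen:
                    seen.append(j)
    return seen
```

A drop-off observed at 1.2 seconds falls in both the hook and the transition window, so `judgments_at(1.2)` returns both capture and comprehension as candidates for the failure.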

Component-to-judgment mapping

I built a comprehensive matrix mapping every compositional component to its primary and secondary judgments, with build-phase designations for each. Components like pacing, narrative arc, and audio-visual sync appear multiple times across the matrix because they serve different cognitive jobs at different levels of sophistication:

Pacing serves comprehension at MVP (scene duration logic — scenes held long enough to process), retention at mid-term (tempo variation — dynamic rhythm sustains engagement), and polish at long-term (micro-pacing — natural easing and organic motion).

Narrative arc serves comprehension at MVP (basic logical structure — a followable sequence) and retention at mid-term (sophisticated forward pull — unresolved tension sustains viewing).

Audio-visual sync serves comprehension at MVP (sound-off completeness — visual carries full message), retention at MVP (emotional valence matching — audio mood matches visual), and polish at long-term (rhythmic alignment — frame-accurate beat-sync).

This mapping proved the thesis of the reframe: element-based prioritization was fundamentally incomplete because the same element has different priority levels depending on which cognitive job it's serving.
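The three examples above can be written down as data to show why a component has no single priority. This sketch covers only those three components, not the full matrix; `Role`, `MATRIX`, `jobs_in_phase`, and `phases_for` are hypothetical names chosen for illustration.

```python
from typing import Dict, List, NamedTuple, Tuple


class Role(NamedTuple):
    judgment: str   # the cognitive job being served
    phase: str      # "mvp", "mid_term", or "long_term"
    job: str        # what the component does for that judgment


# Only the three components described in the text; the full matrix is larger.
MATRIX: Dict[str, List[Role]] = {
    "pacing": [
        Role("comprehension", "mvp", "scene duration logic"),
        Role("retention", "mid_term", "tempo variation"),
        Role("polish", "long_term", "micro-pacing"),
    ],
    "narrative_arc": [
        Role("comprehension", "mvp", "basic logical structure"),
        Role("retention", "mid_term", "sophisticated forward pull"),
    ],
    "audio_visual_sync": [
        Role("comprehension", "mvp", "sound-off completeness"),
        Role("retention", "mvp", "emotional valence matching"),
        Role("polish", "long_term", "rhythmic alignment"),
    ],
}


def jobs_in_phase(phase: str) -> List[Tuple[str, Role]]:
    """All (component, role) pairs scheduled for a given build phase."""
    return [(c, r) for c, roles in MATRIX.items() for r in roles if r.phase == phase]


def phases_for(component: str) -> List[str]:
    """Every phase in which a component appears, in matrix order."""
    return [r.phase for r in MATRIX[component]]
```

Asking for `phases_for("pacing")` returns all three phases, which is the point of the reframe: "prioritize pacing" is not answerable until the judgment is named.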

Design Execution

Build sequencing by judgment

The dependency chain directly determined build sequencing:

MVP — Capture + Comprehension + Coherence. The model produces ads that stop scrolls, communicate clearly, and feel like a unified piece from a specific brand. Output is viable and trustworthy even if not sophisticated. Dropping any of these three creates a trust-breaking failure mode. Eighteen core components mapped to this phase.

Mid-term — Retention. The model produces ads that sustain engagement through the full duration. Viewers reach the CTA. Pacing feels dynamic. Cuts have editorial motivation. Eleven components mapped to this phase.

Long-term — Polish. The model produces ads indistinguishable from work by skilled human editors. Every detail feels intentional and crafted. Seven components mapped to this phase.
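The phase plan can be summarized as a small lookup. This is a sketch under stated assumptions: `Phase`, `ROADMAP`, `phase_for`, and `scope_through` are hypothetical names, and the running total counts component mappings per phase, which may not equal the number of unique components because a component can be mapped to more than one phase.

```python
from typing import List, NamedTuple, Tuple


class Phase(NamedTuple):
    name: str
    judgments: Tuple[str, ...]
    mapped_components: int  # mappings assigned to this phase, per the text


ROADMAP = (
    Phase("MVP", ("capture", "comprehension", "coherence"), 18),
    Phase("Mid-term", ("retention",), 11),
    Phase("Long-term", ("polish",), 7),
)


def phase_for(judgment: str) -> str:
    """Name of the build phase that delivers a given judgment."""
    for phase in ROADMAP:
        if judgment in phase.judgments:
            return phase.name
    raise ValueError(f"unknown judgment: {judgment}")


def scope_through(phase_name: str) -> Tuple[List[str], int]:
    """Cumulative judgments and component mappings up to and including a phase."""
    judgments: List[str] = []
    total = 0
    for phase in ROADMAP:
        judgments.extend(phase.judgments)
        total += phase.mapped_components
        if phase.name == phase_name:
            return judgments, total
    raise ValueError(f"unknown phase: {phase_name}")
```

Because the roadmap lists phases in dependency order, the cumulative scope of any phase always contains every judgment its own judgments depend on.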

V1 scope decisions

SMB-first. Agency features were architecturally planned but deferred. The rationale: raising the quality floor matters more than raising the ceiling in a nascent AI creative tool. SMBs have the most to gain and the least tolerance for configuration complexity.

Quality floor, not creative ceiling. V1 guidelines focused on preventing the failures that cause advertisers to reject the system on first generation — not on enabling the most sophisticated creative outputs.

Six priorities for v1: quality floor, message fidelity, attention architecture basics, policy compliance by design, format adaptability for core placements, and speed to usable output.

Text overlay decision tree

I identified text overlays as the highest-risk compositional element — the area where existing image guidelines most severely underserved video. I developed a decision tree that categorized text by function (primary message delivery, reinforcing audio, call-to-action, data/proof points, branding) and mapped each to platform-specific constraints, timing strategies, and accessibility requirements. This replaced a flat set of rules with a contextual decision model.

Hook mechanics evaluation

The team's initial prioritization focused on text hooks. I challenged this by evaluating the full breadth of hook mechanics — visual pattern interrupts, motion techniques, sound design, face-in-frame timing, thumbnail-to-hook continuity — to determine whether depth on text hooks or breadth across hook types would produce better outcomes for v1.

Multi-scene quality assessment

I contributed to the evaluation framework for multi-scene video quality, specifically identifying composition risks that the existing assessment missed: the distinction between technical correctness (text legibility, segmentation) and production quality concerns that actually drive advertiser rejection (weak hooks, visual-copy mismatches, brand disconnect, unnatural motion, poor scene pacing).

Key Design Decisions

1. Guidelines as decision records, not reference manuals

Given a team that skews junior-to-mid, I structured guidelines as decision records — documenting the judgment, the tradeoff, and the rationale — rather than as comprehensive reference documentation. This made the frameworks usable by designers who hadn't been in the room when the decisions were made.

2. Portability by default

A potentially controversial recommendation: I advocated for creative portability — reducing platform lock-in rather than enforcing it. The rationale: advertisers manage creative across multiple platforms. Increasing trust through portability drives adoption more effectively than attempting to lock creative into Meta's ecosystem. In a competitive market, the platform that earns trust wins.

3. Coherence as an MVP judgment

Coherence was not in the original framework and initially wasn't scoped for MVP. I argued for promoting it to MVP for three reasons: incoherence is the most recognizable marker of AI-generated video (the signature failure mode); advertiser trust is one-trial, so a single incoherent ad leads them to conclude the system doesn't understand their business; and unlike legibility or pacing issues, coherence failures can't be fixed by the advertiser without rebuilding the ad. The system either produces coherent output or it doesn't earn trust.

4. Quality budget at transitions, not uniform

Rather than distributing compositional intelligence uniformly across a video's duration, I recommended concentrating quality budget at judgment transition zones — the handoff from capture to comprehension (~1–2s), comprehension to coherence (~2–4s), and coherence to retention (~3–5s). These are the highest-risk moments where failure cascades downstream.

Impact

Shared quality language established. The five-judgment hierarchy and dependency chain gave engineering, design, and product teams a common framework for evaluating AI-generated video quality — replacing ad hoc quality conversations with structured evaluation.

Prioritization unblocked. The diagnosability lens and judgment-level build sequencing resolved months of debate about what to build first by introducing decision criteria that aligned engineering feasibility with advertiser trust.

Component mapping operationalized the framework. The component-to-judgment matrix gave engineering teams specific, phase-appropriate implementation targets rather than an undifferentiated list of requirements.

V1 strategy shipped. The SMB-first, quality-floor approach — scoped to three MVP judgments (Capture, Comprehension, Coherence) — provided a credible competitive response while preserving architectural space for Retention and Polish in subsequent releases.

Guidelines adopted as team infrastructure. The decision-record format enabled junior and mid-level designers to apply frameworks consistently without requiring senior oversight on every call.

Reflection

The hardest problem on this project wasn't compositional — it was epistemological. The question "what makes a good AI-generated video ad?" doesn't have a stable answer. It depends on who's watching, what they're watching for, what the advertiser intended, and what the platform needs. My job was to make that complexity navigable without flattening it.

The key insight was that elements don't have fixed priorities — they have different priorities depending on which cognitive job they're serving. That's why element-based guidelines failed: they couldn't answer "what matters most?" because the answer changes depending on which judgment you're building for. The five-judgment dependency chain solved this by giving every element a home and every build phase a clear scope.

The insight I carry forward: when designing for AI systems, the most valuable contribution a designer can make isn't defining what the output should look like — it's defining the judgment hierarchy the system should use to evaluate its own outputs. That's where design leadership meets AI product strategy.