AI Companion in the Art Basel App
How I built a quality evaluation framework for Art Basel's AI Companion grounded in failure patterns.
AI Evaluation & Quality

The Problem — The tool was live. But was it working?
Art Basel fairs are large, dense, and time-bound experiences. Visitors arrive with strong interests but limited time, and must navigate hundreds of galleries, events, talks, and on-site activities. The AI Companion was introduced in the Art Basel app as a conversational GenAI experience to help visitors plan, navigate, and make the most of their visit.
But being live is not the same as being trusted. A tool that fails to answer well enough, consistently enough, loses users before adoption has a chance to take hold.
The challenge was to define what "good" looked like for this tool, based on evidence of where and how it was failing real users.
The Approach — Learn from failure, not just success
I reviewed app content and worked in close collaboration with the Digital Team (Product and Engineering). My specific responsibility was to define what good GenAI content looked like for this context.
Operating constraints:
I had direct input but not direct control: the Digital Team owned the product, and any change required their sign-off, which naturally shaped how decisions were framed and prioritised. The analysis also had to be actionable for two different audiences: content editors and product engineers.
The fair itself is time-bound, which meant evaluation had to happen fast and recommendations had to be implementable quickly.
01 — Define "good" from user behavior, not editorial instinct
Rather than starting from a content quality checklist, I grounded the evaluation in real user behavior — specifically in the questions users asked and the answers that failed them. I requested access to logs of user questions and corresponding answers after the fair, then conducted a gap and fallback analysis using one guiding question: "Are users getting actionable answers? If not, where are we failing?" This reframed quality as a user trust problem.
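Conceptually, that first pass over the logs was simple: count how often answers fell back, and surface the questions that triggered it most. The sketch below shows the idea, assuming a CSV export with hypothetical "question" and "answer" columns and placeholder fallback phrasing; the real log format and wording differed.

```python
# Minimal sketch of the gap / fallback pass over exported conversation logs.
# Column names and fallback marker phrases are illustrative assumptions.
import csv
from collections import Counter

FALLBACK_MARKERS = (
    "i don't have that information",
    "i'm not able to help with",
    "i can't answer",
)

def is_fallback(answer: str) -> bool:
    """Treat an answer as a fallback if it contains a known 'can't help' phrase."""
    text = answer.lower()
    return any(marker in text for marker in FALLBACK_MARKERS)

def fallback_report(log_path: str, top_n: int = 20):
    """Return the overall fallback rate and the most common unanswered questions."""
    total = 0
    unanswered = Counter()
    with open(log_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            total += 1
            if is_fallback(row["answer"]):
                unanswered[row["question"].strip().lower()] += 1
    rate = sum(unanswered.values()) / total if total else 0.0
    return rate, unanswered.most_common(top_n)

# Example usage (hypothetical file name):
# rate, gaps = fallback_report("ai_companion_logs.csv")
```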
02 — Identify failure patterns, not isolated errors
I analyzed failures by identifying common unanswered questions and recurring fallback patterns — cases where the AI acknowledged its limitations but stopped user momentum rather than redirecting it. The key distinction was between an AI that doesn't know the answer and an AI that doesn't know the answer but still moves the user forward. A good fallback response is helpful; a poor one is just an error message. This taxonomy became the foundation of the evaluation framework.
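The distinction at the heart of that taxonomy can be expressed roughly as below. The labels and cue phrases are illustrative stand-ins; in practice the classification came from reading transcripts, not from pattern matching.

```python
# Illustrative sketch of the failure taxonomy, not the production logic.
# The key split: a fallback that dead-ends the user vs. one that redirects them.
from enum import Enum

class FallbackType(Enum):
    ANSWERED = "answered"          # the AI gave a usable answer
    REDIRECTING = "redirecting"    # couldn't answer, but offered a next step
    DEAD_END = "dead_end"          # couldn't answer and stopped the user cold

# Hypothetical cues for "moves the user forward"; real review was manual.
REDIRECT_CUES = ("you could try", "instead", "information desk", "see the programme")

def classify(answer: str, is_fallback: bool) -> FallbackType:
    """Label an answer by whether it keeps the user moving forward."""
    if not is_fallback:
        return FallbackType.ANSWERED
    text = answer.lower()
    if any(cue in text for cue in REDIRECT_CUES):
        return FallbackType.REDIRECTING
    return FallbackType.DEAD_END
```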
03 — Translate analysis into two separate recommendation tracks
Based on the fallback analysis, I made recommendations across two tracks: content and product. Content recommendations included filling knowledge base gaps for the highest-fallback query types, enriching existing content where responses were too thin or generic, and rewording content to match how users actually phrased their questions. Product recommendations included introducing dynamic event listings for timely, actionable information, and defining alternative path patterns to guide users forward when the AI couldn't answer directly. Both tracks were presented to the Digital Team for prioritisation.
04 — Turn the framework into a repeatable process
The fallback analysis was formalized into a standard evaluation process for app content. This gave the Content Marketing team a method for assessing AI output after each fair, identifying what improved, and surfacing new failure patterns as user behavior evolved.
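As a repeatable check, the core of that process reduces to comparing per-cluster fallback rates fair over fair. A minimal sketch, with hypothetical cluster labels and no claim about the real thresholds or data:

```python
# Sketch of the per-fair evaluation step: fallback rate per query cluster,
# compared against the previous fair's baseline. Inputs are illustrative.

def fallback_rate_by_cluster(rows):
    """rows: iterable of (cluster, is_fallback) pairs -> {cluster: fallback rate}."""
    totals, misses = {}, {}
    for cluster, is_fb in rows:
        totals[cluster] = totals.get(cluster, 0) + 1
        if is_fb:
            misses[cluster] = misses.get(cluster, 0) + 1
    return {c: misses.get(c, 0) / totals[c] for c in totals}

def compare(previous: dict, current: dict) -> dict:
    """Percentage-point change per cluster; negative means fewer fallbacks."""
    return {
        c: round((current.get(c, 0.0) - previous.get(c, 0.0)) * 100, 1)
        for c in previous
    }

# Example with made-up clusters:
# before = fallback_rate_by_cluster([("events", True), ("events", False)])
# after = fallback_rate_by_cluster([("events", False), ("events", False)])
# compare(before, after)  # {"events": -50.0}
```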
Results — Better answers. More trust. Measurable improvement.
25% → 20%: Average fallback rate across key query clusters, before and after
61: Content and product recommendations made and presented to the Digital Team
4: Query clusters analysed, each with measurable fallback rate improvement
Fallback rates improved across all four query clusters analysed. Every unanswered question represents a user the tool failed to retain, so a five-point drop in average fallback rate is a meaningful gain in trust.
Content improvements were implemented directly. Some product recommendations were actioned; others remain in the pipeline, subject to the Digital Team's prioritisation cycle.
The Bigger Insight
"Trust in an AI tool builds answer by answer. Looking at where it fails is how you drive adoption."
What I'd Do Differently
The Digital Team built their own analysis framework in parallel. We each presented our approach, but the two frameworks were never merged; they ran side by side, driving separate improvements.
In hindsight, that was a missed opportunity. Their framework leaned product-first; mine took a broader view of the same data. Merging them would have given us a shared definition of what success looks like, which is the foundation of any AI adoption effort.

