New research from SparkToro reveals a striking inconsistency in how AI tools generate brand recommendations. The study found that generative AI platforms like ChatGPT and Google's AI in Search rarely produce the same list of brands, even when given identical prompts, challenging traditional notions of AI visibility and tracking.
The study, led by SparkToro co-founder Rand Fishkin and Patrick O'Donnell of Gumshoe.ai, involved hundreds of volunteers running 2,961 prompts across major AI platforms, including ChatGPT, Claude, and Google Search's AI Overviews and AI Mode. Conducted over November and December, the research aimed to measure how repeatable AI-generated recommendations are.
Key Findings on AI Recommendation Variability
The researchers tested 12 distinct prompts, seeking brand recommendations across diverse categories such as chef's knives, headphones, cancer care hospitals, digital marketing consultants, and science fiction novels. Each prompt was executed 60 to 100 times per platform. The results consistently showed that nearly every response was unique, differing in the specific brands recommended, their order, and even the total number of items provided.
Rand Fishkin succinctly summarized this core finding:
"If you ask an AI tool for brand/product recommendations a hundred times nearly every response will be unique."
Claude was slightly more consistent than the other tools at producing identical lists of brands, but it still struggled to keep the recommendations in the same order. Ultimately, none of the platforms met the authors' criteria for reliable repeatability.
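The distinction between "same list" and "same order" can be made concrete with a small calculation. The following is a minimal Python sketch, not the study's own code: the function name `repeatability` and the sample runs are hypothetical, and it assumes each response has already been reduced to an ordered list of brand names. It compares every pair of runs and reports how often two runs share the same set of brands and how often they match item for item.

```python
from itertools import combinations

def repeatability(runs):
    """Return (share of run pairs with the same brand set,
    share of run pairs with the identical ordered list)."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to compare")
    same_set = sum(set(a) == set(b) for a, b in pairs)
    same_order = sum(list(a) == list(b) for a, b in pairs)
    return same_set / len(pairs), same_order / len(pairs)

# Hypothetical responses to one prompt (illustrative, not data from the study).
runs = [
    ["Bose", "Sony", "Apple"],
    ["Sony", "Bose", "Apple"],
    ["Bose", "Sony", "Apple"],
    ["Bose", "Sennheiser"],
]

set_rate, order_rate = repeatability(runs)
print(f"same set: {set_rate:.2f}, same order: {order_rate:.2f}")
# prints "same set: 0.50, same order: 0.17"
```

In this toy data, half of the run pairs recommend the same brands, but only one pair in six presents them in the same order, which mirrors the gap the researchers describe.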
The Challenge of User Prompt Variability
The study also examined how real users write prompts. When 142 participants were asked to write a prompt seeking headphone recommendations for a traveling family member, the results varied widely. The semantic similarity score across these human-written prompts was just 0.081, which Fishkin likened to the relationship between "Kung Pao Chicken and Peanut Butter." In other words, the participants shared a core intent but phrased it in very different ways.
Interestingly, despite this wide prompt diversity, the AI tools tended to draw from a relatively consistent "consideration set" of brands. For instance, top brands like Bose, Sony, Sennheiser, and Apple appeared in 55-77% of the 994 responses to the varied headphone queries.
Redefining AI Visibility Tracking for Marketers
These findings have significant implications for digital marketing and SEO professionals attempting to track "AI visibility." Fishkin strongly asserted that any metric claiming to provide a definitive "ranking position in AI" is unreliable. Instead, the research suggests that a more consistent and valuable metric would be the frequency with which a brand appears across numerous runs of similar prompts.
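As a rough illustration of the frequency metric described above, the sketch below counts the share of responses in which each brand appears at least once. It is an assumption-laden example rather than the researchers' method: the function name `appearance_rate` and the sample responses are hypothetical, and it assumes brand names have already been extracted from each response and normalized.

```python
from collections import Counter

def appearance_rate(responses):
    """Map each brand to the share of responses that mention it at least once."""
    if not responses:
        raise ValueError("need at least one response")
    counts = Counter()
    for brands in responses:
        counts.update(set(brands))  # a brand counts once per response
    total = len(responses)
    return {brand: n / total for brand, n in counts.most_common()}

# Hypothetical brand lists extracted from four runs of similar prompts.
responses = [
    ["Bose", "Sony"],
    ["Sony", "Apple", "Bose"],
    ["Sennheiser", "Sony"],
    ["Bose", "Sony", "Sony"],  # a repeated mention still counts once
]

rates = appearance_rate(responses)
for brand, rate in rates.items():
    print(f"{brand}: {rate:.0%}")
```

Unlike a ranking position, this figure ignores where in the list a brand appears, which is the property the study found to be least stable.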
The study observed that in narrowly defined categories, such as cloud computing providers, the top brands appeared in most responses. Broader categories, like science fiction novels, yielded more scattered results. This aligns with previous research, including a December report from Ahrefs, which found that Google's AI Mode and AI Overviews cited different sources 87% of the time for identical queries. The pattern across these studies is consistent: AI recommendations vary significantly, whether the comparison is across platforms, across features within a platform, or across repeated queries to the same feature.
Methodology and Future Considerations
The research was conducted in partnership with Gumshoe.ai, an AI tracking startup. Fishkin disclosed the partnership and noted that his initial hypothesis was that AI tracking might prove "pointless." The full methodology and raw data have been published on a public mini-site. The researchers intentionally let survey respondents use their normal AI tool settings without standardization, aiming to capture genuine real-world variation.
It's important to note that this report is not peer-reviewed academic research, and Fishkin acknowledged certain methodological limitations, advocating for more extensive, larger-scale follow-up studies.
What This Means for AI Tracking Tools
The study raises critical questions for the future of AI visibility tracking, including how many prompt runs are truly necessary for reliable data and whether API calls exhibit similar variability to manual prompts. For businesses considering AI tracking solutions, the findings underscore the importance of due diligence. Fishkin advises:
"Before you spend a dime tracking AI visibility, make sure your provider answers the questions we’ve surfaced here and shows their math."
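One way to reason about how many prompt runs are enough is to put a confidence interval around an observed appearance rate and see how it narrows as runs are added. The sketch below uses the Wilson score interval, a standard statistical technique that is not drawn from the study itself. The function name `wilson_interval` and the example counts are hypothetical, and the calculation assumes runs are independent, which may not hold for real AI tools.

```python
import math

def wilson_interval(appearances, runs, z=1.96):
    """Wilson score interval for an appearance rate (z=1.96 is roughly 95%)."""
    if runs <= 0 or not 0 <= appearances <= runs:
        raise ValueError("need runs > 0 and 0 <= appearances <= runs")
    p = appearances / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half_width = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return center - half_width, center + half_width

# The same observed 70% appearance rate, measured with 10 runs and with 100 runs
# (hypothetical counts, not data from the study).
low_10, high_10 = wilson_interval(7, 10)
low_100, high_100 = wilson_interval(70, 100)
print(f"10 runs:  {low_10:.2f} to {high_10:.2f}")
print(f"100 runs: {low_100:.2f} to {high_100:.2f}")
```

The interval from 100 runs is much narrower than the one from 10, which is the kind of math a tracking provider could be asked to show.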