IL-2023-000007 - [2025] EWHC 2863 (Ch)

Date: 04-Nov-2025

(iii)

The Getty Watermark Experiments and Annex 8H

168.

The Getty Watermark Experiments were undertaken by Professor Farid in conjunction with Fieldfisher. They involved text prompts being inputted into v1.2, 1.3, 1.4, 2.0, 2.1, XL 1.0 and v1.6 of the Model. The text prompts used were (i) verbatim prompts (i.e. broadly, prompts which were taken verbatim from captions on the Getty Images Websites which were chosen randomly by Ms Varty from captions used in a complaint filed by the First Claimant against Stability’s parent company in the United States); (ii) re-worded prompts (prompts generated by Professor Farid asking ChatGPT to re-word the original prompt); (iii) invented prompts (created by Ms Varty using the words “news photo” and “vector art” which she thought might produce watermarked* images); and (iv) prompts loosely inspired by other prompts (invented by Ms Varty and based on events that she was aware of, or were imagined by her).

169.

The results of the Getty Watermark Experiments relied upon at Annex 8H, show outputs containing watermarks* being generated by:

i)

V1.2: five images using verbatim prompts; nine images using re-worded prompts; and five images using “other prompts”. In respect of the latter, three of the images were generated using a prompt which included the words “news photo”, while one was generated using a prompt which included the words “vector art”. The image generated using the words “vector art” is the only image which includes an iStock watermark* as opposed to a Getty Images watermark*.

ii)

V1.3: five images using verbatim prompts; six images using re-worded prompts and five images using “other prompts”. In respect of the latter, three of the images were generated using a prompt which included the words “news photo”, while one was generated using a prompt which included the words “vector art”. The image generated using the words “vector art” is the only image which includes an iStock watermark* as opposed to a Getty Images watermark*.

iii)

V1.4: seven images using verbatim prompts; nine images using re-worded prompts; and six images using “other prompts”. In respect of the latter, three of the images were generated using a prompt which included the words “news photo” and one was generated using a prompt which included the words “vector art”. The image generated using the words “vector art” is the only image which includes an iStock watermark* as opposed to a Getty Images watermark*.

iv)

V2.0: eight images using verbatim prompts and eight images using “other prompts”. In respect of the latter, two of the images were generated using a prompt that included the words “news photo”. All of the marks generated were Getty Images watermarks*. None of the images shows an iStock watermark*.

v)

V2.1: nine images using verbatim prompts; nine images using re-worded prompts and four images using “other prompts”. In respect of the latter, three of the images were generated using the words “news photo”, including an image responding to a prompt including reference to Miley Cyrus which is marked “Not Safe For Work”. All of the marks generated were Getty Images watermarks*. None of the images shows an iStock watermark*.

170.

Although Professor Farid accepted in cross examination that attempts had been made to generate images bearing watermarks* using SD XL and v1.6, the Getty Watermark Experiments do not identify any synthetic images bearing watermarks* for these Models (notwithstanding that 2,600 efforts were made to do so using prompts designed for the purpose). Professor Farid also accepted that as one progressed through the Models “the frequency with which we saw watermarks was diminished over time”.

171.

Nevertheless, a very small number of synthetic images bearing watermarks* were generated from v1.6 and SD XL by Getty Images during additional experiments that it undertook (as recorded in its NoE) in support of the Outputs Claim. Specifically:

i)

SD XL 1.0: one image of Donald Glover (included in Annex 8H) (“the Donald Glover Image”) bearing a distorted Getty watermark* (produced using a verbatim prompt);

ii)

V1.6: three images of The Gabba in Brisbane, Australia bearing Getty watermarks* (produced using a verbatim prompt, a re-worded prompt and a prompt which ends with the words “Getty Images stock photo”. This latter image does not appear to fall within the scope of Getty Images’ confirmation in the Order of 1 November 2023 that it would not rely upon prompts that contain signs corresponding to the Marks) (“the Gabba Images”). By way of example, the image produced using the verbatim prompt (“the First Gabba Image”) appears below:

172.

Professor Farid explained in his oral evidence that the purpose of the Getty Watermark Experiments was to understand when and if Stable Diffusion would generate images which included watermarks*. As he explains in his report, these experiments “are not capable of determining the precise probability of a user generating an output image similar to a training image or an output image with a visible watermark from the universe of possible output images”. He goes on to say that to determine precise probability he would need to generate a much larger number of images per prompt and run experiments across an enormous range of user prompts.

173.

I did not understand there to be anything between the Experts on this point. Professor Brox described the Getty Watermark Experiments in his report as an “adversarial attack” and likened them to academic research (including in the Carlini study) in which the researchers had tried to identify whether a model is capable of being manipulated into generating specific outputs. In cross examination he explained that his use of the phrase “adversarial attack” was not a criticism and that the Getty Watermark Experiments were valid as “a proof of existence” – in other words an experiment to determine whether it is possible to find prompts that generate watermarks*. He rejected the suggestion that the Getty Watermark Experiments say anything about the probability of the Models generating images bearing watermarks*.

174.

In their Second Joint Statement, the Experts agreed that “the Getty prompts likely over-estimate the general prevalence of watermarks”. When asked about this in cross-examination, Professor Farid accepted that “it is perfectly reasonable to say the reason why we saw so many watermarks is that we were prompting the model with Getty captions and because we know that the Stable Diffusion models were trained on Getty assets, it was more likely to produce a watermark”. He accepted that he had insufficient data or information to estimate the extent of the influence of using Getty Images captions on the likelihood of generating a synthetic watermark*.

175.

Professor Brox explained in his report that “[t]he very small number of synthetic images generated using Stable Diffusion 1.6 and XL which have been provided suggest that, even under the (likely) adversarial conditions of the [Getty] Watermark Experiment, these models rarely generate synthetic images bearing watermarks”. He expressed the view that he would expect that few watermarked images were contained in the training data, either “because a more effective watermark filter was used or because some other technique was used to identify and remove images containing Getty watermarks”. This view appears to be borne out by the available contemporaneous evidence in the form of the internal Stability Chat of 4 March 2023 to which I have already referred during which there is discussion of a “de-watermarking” process prior to the release of SD XL.

176.

Turning to the arguments of the parties on the significance of the Getty Watermark Experiments, I understand it to be common ground that the outputs generated by the Getty Watermark Experiments included in Annex 8H are only of assistance in the context of the Trade Mark Infringement Claim if the types of prompts used to generate those outputs can be shown to be probative, or representative of real world use by real world users. That they establish a “possibility” of watermarks* being generated using particular types of prompts is not enough.

177.

Stability contends that the Getty Watermark Experiments should be disregarded in their entirety on the threshold question, essentially because there is no evidence that Getty Images’ hand-picked, “contrived”, prompts have ever been employed by real-world users in the UK, that Getty Images has no case on the likelihood of watermarks* appearing (even when hand-picked prompts are used), that there are no experiments going to likelihood and no statistical analysis to support a case on probability and that most of the prompts represent an eccentric, out of scope use of the Models (Footnote: ⁹) which is incompatible with the objectives of real world users. Stability’s pleaded case is that in normal use, users will seek to avoid generating images bearing watermarks* and that captions which correspond wholly or substantially to captions or alt-text for images from a Getty Images Website are inherently unlikely to be input by such a user by chance or during normal usage.

178.

In support of this case, Stability relies primarily upon evidence contained in:

i)

a random prompt sample of 10,000 prompts submitted by real world users of Stable Diffusion XL 1.0 and v1.6 through the Stability Developer Platform API or DreamStudio on three dates: 20 March, 2 and 5 April 2025 (“the Stability Prompt Sample”). The protocol for this exercise was agreed by Getty Images which also selected the specific dates;

ii)

the first Stability Watermark Experiments addressed in its first NoE involving a selection of 1,000 text prompts from what is known as the “Diffusion DB 2M dataset”; namely a dataset of 2 million synthetic images and associated prompts from use of one of versions 1.1-1.4 of the Model. The randomly selected text prompts were input into v1.4, v2.1, XL 1.0 and v1.6 with 4000 inference requests being made in each case;

iii)

the second Stability Watermark Experiments addressed in its second NoE which used 525 verbatim text prompts taken from the prompts identified in AJG-10, an exhibit to Ms Gagliano’s evidence (“AJG-10”), to generate 2100 synthetic images for each of v1.4, v2.1 and XL 1.0 and 2096 images for v1.6; and

iv)

its analysis of the (lack of) evidential value of the Annex 8H images.

179.

Getty Images accept that they have no case on likelihood and that they have not attempted to conduct a statistical analysis, but as I have already indicated, they say there is no need to prove their case by reference to probabilities. They reject the suggestion that the prompts used for the Getty Watermark Experiments are contrived and assert that those prompts are representative of the type of prompts that a real world user might use, as well as being illustrative of the fact that any kind of prompt (of varying lengths) can generate a synthetic image with a watermark*. Getty Images assert that using verbatim (or substantially verbatim) prompts is something that a reasonable user of Stable Diffusion would do, because, for example, that user may wish to generate an image which is the same or similar to Getty Images’ Content without paying a licence fee. They also contend that many prompts created by users without reference to Getty Images’ Content are likely to correspond substantially to captions for such content “since these captions describe the Content, much of which is based on real places, people and events”.

180.

In addition to the expert evidence to which I have already referred and their own analysis of the Annex 8H images, Getty Images rely upon:

i)

a review carried out by Professor Farid of the Midjourney Discord Channel (a public online forum where users create and share the results of their image synthesis using an AI image generator called “Midjourney”, a competitor of Stable Diffusion);

ii)

evidence of verbatim prompts input by users of GAI since the beginning of 2024 produced by Ms Gagliano at AJG-10 from an analysis of 470,000 prompts; and

iii)

their own analysis of the Stability Prompt Sample.

181.

I begin by observing that (beyond Professor Brox’s acknowledgement that any kind of prompt may generate a synthetic image with a watermark* and Professor Farid’s evidence as to his review of the Midjourney Discord Channel, to which I shall come in a moment), the expert evidence to which I have already referred is of little real assistance to Getty Images on the question of the significance of the Getty Watermark Experiments. The Experts agree that these experiments show that it is possible for users of Stable Diffusion Models to produce watermarks*, but they also agree that the experiments say nothing about the likelihood of this happening. It would have been possible (as Professor Farid acknowledged) to run experiments (based on a large number of user prompts and images) designed to address the question of probability, but that has not been done.

182.

The Experts’ view that the Getty Watermark Experiments were designed only to establish “proof of existence” is supported by Ms Varty’s evidence as to the means she employed in creating the Annex 8H prompts. In cross examination she confirmed that her only interest was in getting the Models to generate watermarks*. She was not concerned with whether the synthetic images looked good or aesthetically interesting and if a particular prompt looked like it was generating absurd images that were ugly or unrepresentative or ridiculous, that would not have deterred her from continuing to prompt the Model with it if it was producing watermarks*. Ms Varty accepted that she had generated images for the purposes of the proceedings only and that (with the exception of the “vector art” prompts) all of her prompts were “out of scope” for use on the Models in the sense that the Model Cards explain that they were “not trained to be factual or true representations of people or events” and that accordingly using the Model to generate such images (as she had done for most of the images in Annex 8H) is out of scope for its abilities.

183.

On the subject of the use made of Stable Diffusion by users, Ms Varty accepted that people may want images for a presentation or a project or an artwork and that they “want images which look good” and which are attractive and interesting. She also accepted that outputs which get human features comically wrong (as is the case with various of the outputs generated by Ms Varty) were likely to be a source of frustration and ridicule for people using the Models.

184.

Notwithstanding that it is clear from Ms Varty’s evidence that there was no intention to reflect real world use in the Annex 8H prompts, the question remains whether that evidence is probative of the way in which real world users generate images using the Models such that it assists on the threshold question. To consider that question, I need to consider each of the different types of prompt used by Ms Varty with a view to determining whether, on balance, these prompts are probative or representative of real world use.

Verbatim Prompts:

185.

Ms Varty chose these prompts from captions in the US proceedings as “an easily accessible public source of captions” which were believed to be in the LAION datasets used to train Stable Diffusion. They are all extremely lengthy and complicated, although, as Ms Cameron confirmed in her evidence, it is not unusual for Getty Images captions to be lengthy when they relate to editorial content (as all of the verbatim prompts do). One example which generated a Getty Images watermark* when using v1.2 is:

“U.S. President Barack Obama looks on during a review of military troops at a welcoming ceremony for French President Francois Hollande on the South Lawn at the White House on February 11, 2014 in Washington, DC. Hollande who arrived yesterday for a three day state visit, visited Thomas Jefferson's Monticello estate and will be the guest of honor for a state dinner tonight”

186.

The image generated is set out below:

187.

Would a real world user laboriously copy (or more realistically, perhaps, cut and paste) such a prompt into the Model in the hope of generating a similar (free, or much cheaper) image?

188.

Professor Farid gave some evidence in his report that “[t]he experiments show that this reproduction [of watermarks*] happens without overly contrived or unexpected prompting by the user” and he went on to say that in his experience of working with AI models, and in the AI industry more broadly, “using captions from a stock image website is not an anomalous way to interact with an image generator”. It was by way of illustration of this proposition that he referred to his review of the Midjourney Discord Channel, observing that he had found “some examples of Getty Images captions used as prompts by users” and then providing an example in the form of an image of Ratan Tata, chairman of Tata Steel Ltd (“the Tata Image”).

189.

Although Getty Images contended in opening that this evidence establishes users in the real world adopting verbatim prompts, Professor Farid’s candid responses to cross-examination on this point undermined that submission. Professor Farid very fairly accepted that the Tata Image was not an example of an image produced using a verbatim caption and that he thought he had not in fact found any examples of verbatim captions in his review of the Midjourney Discord Channel and that he could not say whether the caption had simply been re-written using ChatGPT. He also accepted that Midjourney is not a “particularly useful data point” and he explained that he could give no evidence on the frequency with which prompting of this type might be encountered.

190.

I also consider that Getty Images’ reliance upon AJG-10 was seriously undermined during cross examination. AJG-10 runs to 15 pages of verbatim prompts entered by users of GAI (i.e. subscribers to the Getty Images service) – a total of 690 prompts (approximately 0.15% of the total prompts analysed). The exhibit lists the English word prompts entered by users since the beginning of 2024 “where five or more words match the words in an image caption associated with Getty Images’ content on Getty Images’ Websites”. Ms Gagliano confirmed in her evidence that the purpose of this exhibit was to demonstrate that there is a practice amongst users of GAI of entering Getty captions as prompts. She explained that a five word lower limit had been imposed because anything shorter than five words would not fairly have indicated that the text had been derived from a caption on the Getty Images Websites.

191.

Stability accepts that AJG-10 supports the proposition that users of GAI do, on occasion, input Getty Images captions as prompts, and thereby also supports the general observation made by Professor Farid that using captions from a stock image website is not an anomalous way to interact with an image generator. However, it contends that AJG-10 provides no basis for an inference that users of Stable Diffusion behave in a similar way. On the contrary, Stability says that a comparison between AJG-10 and the Stability Prompt Sample illustrates that Stable Diffusion users behave differently from GAI users.

192.

Stability’s case was effectively accepted by Ms Gagliano in cross examination. I need not address this in any great detail as it was not addressed by Getty Images in closing (beyond a passing reference to Ms Gagliano’s witness statement in Ms Lane’s reply) and certainly Getty Images did not seek to gainsay the analysis that was put to Ms Gagliano during cross examination. Accordingly I can only assume that it is not in dispute.

193.

For present purposes I record that Ms Gagliano accepted a scaling exercise that was put to her to the effect that (i) AJG-10 shows (for 10,000 prompts): 15 verbatim prompts of 5 words or more, 8 of 10 words or more, 5 of 15 words or more and 3 of 20 words or more; and (ii) by comparison, the Stability Prompt Sample data shows (per 10,000 prompts): 0 verbatim prompts with 10, 15 or 20 words or more and only 1 verbatim prompt with 5 words or more (“a flower in the middle of the desert”). Ms Gagliano also accepted that because the 470,000 prompts used to conduct the analysis for the purposes of AJG-10 covered all languages and that 690 prompts in AJG-10 are all in the English language, the figure of 0.15% referred to above underestimates the prevalence of English Getty Images captions amongst English GAI prompts such that on a like for like comparison the number of GAI verbatim prompts per 10,000 prompts would likely be higher. Ms Gagliano said nothing in re-examination to undermine these conclusions.
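By way of illustration only, the headline figures in the scaling exercise can be reproduced with simple proportional arithmetic. The sketch below assumes that the per-10,000 figure is the raw match count divided by the total prompts analysed and multiplied by 10,000; the helper name `per_10k` is hypothetical, and only the counts (690, 470,000, 1 and 10,000) come from the text above.

```python
# Illustrative sketch: proportional scaling is an assumption, not a
# description of how the figures put to the witness were calculated.

def per_10k(matches: int, total_prompts: int) -> float:
    """Scale a raw match count to a rate per 10,000 prompts."""
    if total_prompts <= 0:
        raise ValueError("total_prompts must be positive")
    return matches / total_prompts * 10_000

# Counts stated in the text: AJG-10 and the Stability Prompt Sample.
gai_rate = per_10k(690, 470_000)
gai_share_percent = 690 / 470_000 * 100
stability_rate = per_10k(1, 10_000)

print(f"GAI (AJG-10): {gai_rate:.1f} per 10,000 ({gai_share_percent:.2f}%)")
print(f"Stability Prompt Sample: {stability_rate:.1f} per 10,000")
```

On these assumptions the GAI figure rounds to 15 per 10,000 (approximately 0.15%), against 1 per 10,000 for the Stability Prompt Sample. The sketch does not attempt the like for like English-only adjustment, because the number of English prompts within the 470,000 is not stated.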

194.

As for the one verbatim caption ‘matched’ prompt in the Stability Prompt Sample of 5 words or more, I accept Stability’s submission that “A flower in the middle of the desert” is a trite phrase which is in no way unique or original to Getty Images. I also accept that it provides no proper basis for any inference of Getty Images as its source and thus does not support the proposition that real world users of Stable Diffusion use verbatim prompts copied from Getty Images Websites. I also did not understand this to be disputed by Getty Images.

195.

Accordingly, as I was invited to do by Stability, I accept Ms Gagliano’s evidence that these results support the proposition that Stable Diffusion users behave differently from GAI users and that, on the basis of the analysis that had been put to her, the proposition that her evidence was intended to support was “demonstrably wrong”. It fails to assist Getty Images in establishing that verbatim prompts are in fact used “in the wild” by real world users of Stable Diffusion.

196.

In her reply to Stability’s closing submissions, Ms Lane submitted that the Stability Prompt Sample was too small to be of any assistance and that this point had been made by Fieldfisher before any testing was carried out in a letter dated 21 March 2025. I note however, that although the letter says that there will be a “limit to the probative value” of the Stability Prompt Sample and expresses Getty Images’ concerns over whether it can properly be said to be representative of the “normal use” of Stable Diffusion, the letter nevertheless agrees to it “in the interests of pragmatism and proportionality”. This is perhaps unsurprising given that Fieldfisher itself had proposed 5 sets of 2,000 prompts split into different word lengths in an earlier letter of 7 March 2025. Getty Images made no attempt to propose a larger sample size prior to the protocol being approved by the Court. In the circumstances, I accept Stability’s submission that Getty Images cannot now sensibly maintain that the criteria approved by the Court are not probative.

197.

The Stability Watermark Experiments in respect of the Diffusion DB 2M dataset take matters little further on this issue. Stability asserts in its NoE that none of the 1,000 randomly selected prompts was a verbatim caption or a re-worded prompt. Getty Images declined to admit either allegation on the basis that “it is not reasonable, proportionate and/or practical” for them to do so. I am not prepared to find (as Stability invited me to do in opening) that on balance none of the 1,000 selected prompts was a verbatim caption or re-worded prompt. There is no evidence on which I could safely arrive at that conclusion and, as Getty Images correctly point out in their Reply NoE, it is unclear on what basis Stability has asserted that none of the selected prompts was a verbatim or re-worded prompt. No explanation is provided by Stability in its NoE as to any steps that it may have taken to establish the accuracy of this assertion and the mere fact that Getty Images have chosen not to conduct any reply experiments or any survey of the Diffusion DB database does not seem to me to be determinative. I note that Stability chose to say nothing about the Stability Watermark Experiments in connection with verbatim prompts in its closing submissions.

198.

Insofar as the Stability Watermark Experiments illustrate that using a sample of 1,000 prompts generated from the Diffusion DB 2M dataset in respect of each of the Models v1.4, v2.1, SD XL 1.0 and v1.6 produced no images with watermarks*, I note and accept Professor Brox’s evidence that it was an attempt to estimate probability albeit that “the estimate is not super-precise because the number of samples was not too large. I agree it would be better if you used a million samples to do this”. This is entirely consistent with Professor Farid’s evidence in his report that determining a precise probability of reproducing a Getty Images watermark would require “much larger experiments” than those carried out by Stability. Professor Brox agreed that this experiment does not prove that real-world prompts do not produce watermarks* and that the images generated by the 1,000 prompts “likely under-estimate the general prevalence of watermarks”. He explained this in evidence on the grounds that “this Diffusion DB database may be a little biased towards artists”, a view taken also by Professor Farid in light of the content of the Model Card for the Diffusion DB 14M dataset on Github.
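The Experts' point about sample size can be illustrated with a standard textbook calculation, which is not drawn from the evidence. If n trials are treated as independent and none produces a watermark*, the largest per-trial probability p still consistent with that result at a given confidence level solves (1 - p)^n = 1 - confidence. The sketch below is hedged accordingly: the function name is hypothetical, the 95% level is an assumption, and the independence assumption is doubtful here because several images were generated from each prompt.

```python
# Illustrative sketch: assumes independent trials and a 95% confidence
# level, neither of which is established by the evidence.

def zero_hit_upper_bound(n_trials: int, confidence: float = 0.95) -> float:
    """Upper bound on the per-trial probability after zero hits in n_trials.

    Solves (1 - p) ** n_trials == 1 - confidence for p.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_trials)

# 1,000 prompts and 4,000 inference requests per Model are stated in the
# text; one million is the figure Professor Brox mentioned as preferable.
for n in (1_000, 4_000, 1_000_000):
    print(f"n = {n:>9,}: p could still be as high as {zero_hit_upper_bound(n):.6%}")
```

The bound shrinks roughly in proportion to 1/n (close to the familiar "rule of three", 3/n), which is why a blank result from a few thousand images cannot exclude a small but non-zero rate of watermarks*, whereas a sample of a million would narrow the range considerably.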

199.

I accept the evidence of both Experts that the Stability Watermark Experiments are of extremely limited value in determining likely probabilities and certainly do not indicate that watermarks* will not be generated by real world prompts. They are of little assistance on the question of whether real world users in the UK have in fact generated watermarks* from any of the Models in issue in this case, just as they are of little assistance on the question of whether real world users in fact use verbatim prompts.

200.

I can arrive at no determination as to the statistical probability of watermarks* being generated in any given situation based on any of the experiments undertaken by the parties. In closing, Stability sought to rely upon the second Stability Watermark Experiments whose intention appears to have been to establish that the AJG-10 verbatim prompts did not generate watermarks*. With the possible exception of one image discussed by Professor Farid in his report, the experiment did indeed draw a blank on watermarks* in respect of the Models covered, as Getty Images admitted in its Reply to Stability’s NoE. However, the experiment establishes no more than that a small sample of verbatim caption prompts did not generate watermarks* for any of these Models. Professor Brox confirmed in his evidence that this has value as an independent sample from the distribution, while at the same time accepting that it would have been better to try a much higher number of prompts. Given that Stability accepts that the Models can generate watermarks* from various types of prompts, the only real value of the experiment is perhaps to highlight the absence of any more statistically significant experiment.

201.

Tying the strands of this evidence together, beyond Professor Farid’s general assertion in his report that the use of verbatim prompts is not “anomalous” in the context of interactions with an image generator (which finds no support in the example he then used of the Tata Image), I consider there to be no real evidence to support Getty Images’ case that users of Stable Diffusion will copy Getty Images captions and paste these into the Model or, therefore, that users will have done this in the real world and generated watermarks*. Professor Farid’s evidence finds support in AJG-10 in connection with the use of GAI, but the Stability Prompt Sample evidence shows that users of Stable Diffusion interact with it in a different way. I accept Stability’s submissions that while there is evidence that the Stable Diffusion Models can be manipulated to produce watermarks* using verbatim prompts, there is no evidence whatever that this has happened in real life anywhere, including in the United Kingdom. None of the press articles, or third party materials, to which my attention has been drawn by Getty Images (and to which I shall come later) refers to users generating images using verbatim prompts and there is no evidence from even a single user in the UK that he or she has done so, much less that having used a verbatim prompt to conjure a synthetic image, a watermark* has in fact been generated.