Think of any topic vaguely related to raising kids imaginable, and thereâs probably a post about it on Mumsnet, the long-running, enormously popular, controversy-spurring UK-based parenting forum for mothers. Over its more than two decade-long history, Mumsnet has amassed an archive of more than six billion words written by its highly engaged user base, on topics such as dirty diapers and lazy husbands. (Not to mention a bonkers rant about dolphins.)
This spring, after Mumsnet discovered that AI companies were scraping its data, the company says it decided to try to strike licensing deals with some of the major players in the space, including OpenAI, which initially expressed willingness to explore an arrangement after Mumsnet first reached out. After talks with OpenAI fell apart, Mumsnet in July announced its intention to pursue legal action.
According to Mumsnet, during those early conversations, an OpenAI strategic partnership lead told the company that datasets over 1 billion words were of interest to the AI giant. Mumsnetâs leadership was excited. âWe spent quite some time in a back-and-forth with them,â Mumsnet founder and CEO Justine Roberts tells WIRED. âWe had to sign some NDAs, and they wanted a lot of information from us.â
However, over a month later, OpenAI told Mumsnet that the company was no longer interested in partnering at that time, according to an email exchange reviewed by WIRED. When asked why, the OpenAI staffer characterized Mumsnetâs 6 billion word dataset as too small to warrant a licensing arrangement, Roberts says. They also noted that OpenAI is primarily interested in large datasets that the public cannot already access online, and that it wanted datasets that captured broad human experience.
This sentiment was echoed by the company when asked for comment from WIRED. âWe pursue partnerships for large-scale datasets that reflect human society and do not pursue partnerships solely for publicly available information,â says OpenAI spokesperson Kayla Wood. âWe support publisher and creator choice, offering them ways to express their preferences about how their sites and content work with AI in search results and training generative AI foundation models.â
Roberts says she was âirritatedâ by this development. She recalls that OpenAI at first had seemed especially interested in Mumsnet because of the platformâs heavily female-written content. âItâs very high-quality conversational data,â she says. âItâs 90 percent female conversation, which is quite unusual.â
OpenAI has struck a variety of data-licensing deals with media outlets and platforms in the past year, entering into agreements with Vox Media, the Atlantic, Axel Springer, Time, and WIRED parent company CondĂ© Nast, as well as platforms filled with user-generated content like Reddit. (Automattic, the owner of WordPress.com and Tumblr, was also said to be in licensing talks earlier this year.) As the particulars of those deals havenât been revealed, itâs not clear what the size of their respective corpuses are.
When WIRED asked about the size of datasets it will consider for commercial licensing, OpenAI declined to share that information. But spokesperson Kayla Wood emphasizes that the companyâs partnerships with publishers are âfocused on displaying their content in our products and driving traffic to them.â