Why is the Reddit licensing deal important for Google’s AI plan?

This deal couldn’t have come at a better time for the two companies. Reddit wants money and investor love ahead of its planned initial public offering (IPO). And Google wants to save face from its AI misadventures.

- John Xavier, Nabeel Ahmed

Social media platform Reddit on Thursday struck a licensing deal with Google, allowing the search giant to access Reddit users’ posts to train the company’s artificial intelligence (AI) engine. As part of the deal, Google will pay the social news aggregation site $60 million annually to access user-generated content from the platform. This deal couldn’t have come at a better time for the two companies. Reddit wants money and investor love ahead of its planned initial public offering (IPO). And Google is looking to save face from its AI misadventures.

While Reddit generates revenue, the company is not profitable. Its IPO document, filed with the U.S. stock market regulator, reveals revenue of $804 million in 2023, most of it coming from advertisers. But the platform suffered a net loss of $90.8 million. Google’s annual payment to Reddit will give the platform money that could help move the company towards profitability. Plus, a data partnership with one of the biggest players in AI can boost Reddit’s stature before its IPO, helping investors see value in the platform. For its part, the licensing deal hands the Mountain View, California-based company a data mine with which to salvage itself from the AI wreck it’s in now.

What ails Google?

Google’s sporadic attempts to break OpenAI’s dominance in AI have left the search giant badly bruised. The company’s maiden AI chatbot Bard, launched as a rival to OpenAI’s ChatGPT, was faulty. It made factual errors in its first demo video, and subsequent iterations weren’t up to par either.

Most recently, the company’s Gemini chatbot overcompensated for the lack of diversity by throwing up irrelevant images in response to queries. The company’s AI-based image generator showed a picture of a Black woman when queried ‘Who is the United States’ founding father?’ In another instance, it showed Asian persons as Nazi-era German soldiers. Such unintelligent responses have caused quite a stir. These blunders made Prabhakar Raghavan, the senior executive overseeing the company’s search business, apologise and note that the product “missed the mark”.

While these issues are tied to its large language model (LLM) and the weights attached to tokens, the other challenge Google faces is a lack of raw data. LLMs are data-hungry algorithms, and the quality of the information flowing into them matters a lot. To be good at typing out accurate text, generative AI (GenAI) models first need to read copious amounts of text.

Till now, tech firms had a free ride, scraping the web for text and using open-source crawling tools to sneak into websites and take data from them. This modus operandi is being challenged as users and publishers push back against AI companies scraping data from the web indiscriminately. In a proposed class action lawsuit filed in July 2023, Google was accused of misusing a large amount of web users’ personal information to train its AI models. Separately, in December, news publisher The New York Times sued OpenAI and Microsoft for copyright infringement. The lawsuit claims that the AI firms used millions of its news articles to train ChatGPT. Such complaints from individuals and corporations are making lawmakers sit up and formulate policies on the ethical use of information available on the web.

Lawmakers in the U.S. have introduced a Bill, the AI Foundational Model Transparency Act, that would require the Federal Trade Commission (FTC) and the National Institute of Standards and Technology (NIST) to frame rules on reporting data transparency in AI models. This would require builders of foundational AI models to disclose their sources of training data. If such a law is passed, AI companies will have to compensate content owners for the data used to train their models. Consequently, the cost of building AI models will go up. To pre-empt such a law, large tech firms are sealing licensing deals with news publishers and other content sources. OpenAI’s deal with the news agency Associated Press is a case in point.

Other news organisations, including Gannett (the largest U.S. newspaper company) and News Corp (the owner of The Wall Street Journal), have been in talks with OpenAI, according to media reports. Publications that have cut a deal with AI companies will get a fee based on how frequently their content is used.

How different is this deal?

It is against this backdrop that Google is making a deal with Reddit. But, unlike other platforms, Reddit works as a social news website, where content is socially curated and promoted. The platform is composed of hundreds of subcommunities, known as subreddits, where members submit content, which is then upvoted or downvoted by other members.

Under this deal, Google will have access to Reddit’s Data API, which will provide the search giant with real-time, unique content from a large and dynamic platform. This will help the company’s AI model access behavioural and trending data. Apart from this, Google will continue to access information from the web using crawlers.

However, there is one catch. In July 2023, when Reddit decided to introduce a new policy that charged some third-party apps for accessing data on its platform, concerns over content moderation and accessibility arose. Several groups protested the changes proposed by Reddit. Over 8,000 subreddits went dark. The subreddit groups, at the time, said the changes threatened to end what had historically been a key way of customising the platform. To avoid such a conflict this time around, Reddit is giving an unspecified number of its top users, including moderators and those with high karma scores (a score that shows how much a user contributes to the Reddit community), the chance to buy shares in its IPO, according to a report by The Verge.

Reddit plans to do this through an allocation system based on tiers. Tier one will comprise certain users and moderators identified as having meaningfully contributed to Reddit community programmes. The second tier will be made up of people with a karma score of at least 2,000 and those who have performed at least 5,000 moderator actions. This is an unusual move, as the privilege is usually reserved for professional investors who want to buy stock at a theoretically lower price before it is listed on an exchange. Reddit currently has some 267.5 million weekly active users, more than 100,000 active communities, and one billion total posts, according to its SEC filing.

Unlike Reddit, few platforms have been forthcoming on whether the public information of users is used to train AI models. X, formerly Twitter, in September, said it would use users’ posts to train AI models for the purposes outlined in its policy. The policy did not specify the AI model it referred to.

Meta said user data from its applications, including Facebook, Instagram, and Threads, would be used to train its AI chatbot. While TikTok and Snapchat have both launched AI chatbots, neither has mentioned using users’ posts to train AI models.

The practice of using user data to train algorithms is not new in the world of tech. Most platforms’ recommender engines use a person’s usage data to suggest videos, articles and movies. But using that information to train AI models is new, and it calls for caution given these chatbots’ propensity to regurgitate personal information when they respond to prompts. A case in point is Samsung banning the use of AI chatbots in its offices after it found that company secrets were exposed when employees used the application.


