I think this whole kerfuffle started a couple of decades ago. Where is it now? I'm predicting (guessing) that large databases will soon be locked away from common free access.
Google Books (previously known as Google Book Search, Google Print, and by its code name, Project Ocean)
Google's original vision (circa 2002) was to create a digital library of all books.
This initiative was later expanded (in 2004) with the announcement of the Library Project, which involved partnerships with major university and public libraries...
In 2010, Google publicly stated its intention to scan all 129,864,880 known books within a decade, reaffirming its commitment to digitizing as much printed material as possible.
Then the project got ugly, and quickly!
A 2017 media article, "What Happened to Google's Effort to Scan Millions of University Library Books", summarized it like this:
TL;DR: Hidden in the bowels of Google's data centers, there is a database containing >25 million books, but nobody is allowed to read them... Google helped create this database and uses it as a dataset it can query, even if users can't consume full texts... It's a pillar of the humanities' growing engagement with Big Data...
... yet the promised library of everything hasn’t come into being...
...An epic legal battle ended when the case was ultimately dismissed (in 2013), handing Google a victory that allowed it to keep on scanning. Yet the dream of easy, full access to all those works remains just that...
I had suspected [not being a history fan] that this Google database (i.e., Google Books) may have become the universe's gold-standard text corpus for Google's LLM training.
Gemini answered my suspicion thusly:
...is highly likely to have been used in training Google's large language models (LLMs), though the company has not confirmed this explicitly.