There are many voices bemoaning the use of copyrighted materials in the training of large language models, the energy and water consumed in their creation, and the way they enclose a commons to create a new generation of robber barons.
Congress could, if it were a functioning part of human civilization, enact a law ordering the Library of Congress to create corpora specifically for the purpose of training large language models.
They could furthermore declare that including copyrighted materials in those corpora promotes the “Progress of Science and useful Arts” more than excluding them would. Then they could order that when these “LoC corpora” are used to train LLMs, both the resulting model and the state of the neural network at the end of training enter the public domain and may not be capitalized.
Then they could order, and fund, the National Institute of Standards and Technology to serve as the lead agency in creating trained LLMs for both the government and the public good, assigning personnel, patents, and materiel to consolidate power, water, silicon, and expertise into a new series of national computing centers, minimizing the environmental impact and industrial waste of running multiple parallel training efforts.
And finally, they could raise the penalty for each act of copyright infringement committed by a corporation to 10% of its top-line revenue, making the public-domain LoC-trained models the obvious choice for industry.
All of this is unambiguously within the power of Congress, and doing this would immediately resolve multiple issues both in law and in reality. It’s an easy win.