Once a backwater filled with speculation, artificial intelligence is now a burning, “hair on fire” conflagration of hopes and fears about a revolutionary technological transformation. Profound uncertainty surrounds these intelligent systems—which already surpass human capabilities in some domains—and their regulation. Only by making the right choices about how to protect or control the technology will hopes about the benefits of AI—for science, medicine and better lives overall—win out over persistent apocalyptic fears.

The public introduction of AI chatbots such as OpenAI’s ChatGPT over the past year has prompted outsize warnings. They range from one by Senate Majority Leader Chuck Schumer of New York State, who said AI will “usher in dramatic changes to the workplace, the classroom, our living rooms—to virtually every corner of life,” to one from Russian president Vladimir Putin, who said, “Whoever becomes the leader in this sphere will become the ruler of the world.” Industry leaders have added their own warnings about the dire consequences of unconstrained AI.

Legislative efforts to address these issues have already begun. On June 14 the European Parliament voted to approve a new Artificial Intelligence Act, after adopting 771 amendments to a 69-page proposal by the European Commission. The act requires “generative” AI systems like ChatGPT to implement a number of safeguards and disclosures, including restrictions on any system that “deploys subliminal techniques beyond a person’s consciousness” or “exploits any of the vulnerabilities of a specific group of persons due to their age, physical or mental disability,” as well as a requirement to avoid “foreseeable risks to health, safety, fundamental rights, the environment and democracy and the rule of law.”

A pressing question worldwide is whether the data used to train AI systems requires consent from authors or performers, who are also seeking attribution and compensation for the use of their works.

Several governments have created special text and data mining exceptions to copyright law to make it easier to collect and use information for training AI. These allow some systems to train on online texts, images and other work owned by other people. The exceptions have recently met with opposition, particularly from copyright owners and from critics with more general objections who want to slow down or degrade the services. They add to controversies raised in recent months by an explosion of reporting on AI risks: bias, social manipulation, losses of income and employment, disinformation and fraud, along with catastrophic predictions about “the end of the human race.”

Recent U.S. copyright hearings echoed a common refrain from authors, artists and performers—that AI training data should be subject to the “three C’s” of consent, credit and compensation. Each C has its own practical challenges that run counter to the most favorable text and data mining exceptions embraced by some nations.

The national approaches to the intellectual property associated with training data are diverse and evolving. The U.S. is dealing with multiple lawsuits to determine to what extent the fair use exception to copyright applies. A 2019 European Union (E.U.) Directive on copyright in the digital single market included exceptions for text and data mining, including a mandatory exception for research and cultural heritage organizations, while giving copyright owners the right to prevent the use of their works for commercial services. In 2022 the U.K. proposed a broad exception that would apply to commercial uses, though it was put on hold earlier this year.

In 2021 Singapore created an exception in its copyright law for computational data analysis, which applies to text and data mining, data analytics and machine learning. Singapore’s exception requires lawful access to the data but cannot be overridden by contracts. China has issued statements suggesting it will exclude from training data “content infringing intellectual property rights.” In an April article from Stanford University’s DigiChina project, Helen Toner of Georgetown University’s Center for Security and Emerging Technology described this as “somewhat opaque, given that the copyright status of much of the data in question—typically scraped at massive scale from a wide range of online sources—is murky.”

Many countries have no specific exception for text and data mining and have not yet staked out a position. Indian officials have indicated they are not prepared to regulate AI at this time, but like many other countries, India is keen to support a domestic industry.

As laws and regulations emerge, care should be exercised to avoid a one-size-fits-all approach, in which the rules that apply to recorded music or art also carry over to the scientific papers and data used for medical research and development.

Previous legislative efforts on databases illustrate the need for caution. In the 1990s proposals circulated to automatically confer rights to information extracted from databases, including statistics and other noncopyrighted elements. One example was a treaty proposed by the World Intellectual Property Organization (WIPO) in 1996. In the U.S., a diverse coalition of academics, libraries, amateur genealogists and public interest groups opposed the treaty proposal. But probably more consequential was the opposition of U.S. companies such as Bloomberg, Dun & Bradstreet and STATS, which came to see the database treaty as both unnecessary and onerous because it would increase the burden of licensing the data they needed to acquire and provide to customers and, in some cases, would create unwanted monopolies. The WIPO database treaty failed at a 1996 diplomatic conference, as did subsequent efforts to adopt a law in the U.S., but the E.U. proceeded to implement a directive on the legal protection of databases. In the decades since, the U.S. has seen a proliferation of investments in databases, and the E.U. has seen its directive narrowed through court decisions. In 2005 its internal evaluations found that this “instrument has had no proven impact on the production of databases.”

Sheer practicality points to another caveat. The scale of data in large language models can be difficult to comprehend. The first release of Stable Diffusion, which generates images from text, required training on 2.3 billion images. GPT-2, an earlier version of the model that powers ChatGPT, was trained on 40 gigabytes of data. The subsequent version, GPT-3, was trained on 45 terabytes of data, more than 1,000 times as much. OpenAI, faced with litigation over its use of data, has not publicly disclosed the size of the dataset used to train the latest version, GPT-4. Clearing rights to copyrighted work can be difficult even for simple projects, and for very large projects or platforms, even knowing who owns the rights is nearly impossible, given the practical requirements of locating metadata and evaluating contracts between authors or performers and publishers. In science, requirements for getting consent to use copyrighted work could give publishers of scientific articles considerable leverage over which companies could use the data, even though most authors are not paid.

Differences in who owns what matter. It’s one thing to have the copyright holder of a popular music recording opt out of a database; it’s another if an important scientific paper is left out over licensing disputes. When AI is used in hospitals and in gene therapy, do you really want relevant information excluded from the training database?

Beyond consent, the other two C’s, credit and compensation, have their own challenges, as illustrated even now by the high cost of litigating copyright or patent infringements. But one can also imagine datasets and uses in the arts or biomedical research where a well-managed AI program could help implement benefit sharing, such as the proposed open-source dividend for seeding successful biomedical products.

In some cases, data used to train AI can be decentralized, with safeguards that include protecting privacy, avoiding unwanted monopoly control and using the “dataspaces” approaches now being built for some scientific data.

All of this raises an obvious challenge for any intellectual property rights assigned to training data: the rights are essentially national, while the race to develop AI services is global. AI programs can be run anywhere there is electricity and access to the Internet. You don’t need a large staff or specialized laboratories. Companies operating in countries that impose expensive or impractical obligations on the acquisition and use of training data will compete against entities that operate in freer environments.

If anyone else thinks like Vladimir Putin about the future of AI, this is food for thought.

This is an opinion and analysis article, and the views expressed by the author or authors are not necessarily those of Scientific American.