Why Meta’s move to make its new AI open source is more dangerous than you think

I’m not quite sure how humanity survived the advent of nuclear weapons without destroying itself (so far), but one thing that likely helped was the simple fact that it’s very hard to build a nuclear bomb. It requires refining uranium, which can’t be casually done in a basement or even in a secret government project. It requires overcoming half a dozen technical hurdles, and that takes time and resources that only a state can gather.

As a result, only nine countries have nuclear weapons, and efforts to reduce nuclear weapons are largely carried out through negotiations among a small number of actors, which have at least some ability to hold to and enforce treaties.

It’s hard to call it an uncomplicated success — we are still holding on to enough nuclear weapons to kill billions of people, and there have been a number of close calls where we nearly used them. But the situation would be much worse if nuclear weapons were easy enough for anyone to make in their garage.

For most other technologies, though, the opposite is true. On the whole, we are much better off because the internet is available to everybody — and built upon by everybody — instead of remaining the exclusive province of a few governments. We are much better off because so much of the technology involved in the space race was ultimately made public, enabling huge advances in civilian aviation and engineering. In medicine, too, advances build on other research because it’s published openly.

Outside of nuclear weapons, it’s hard to name a technology that’s best off controlled by a small number of actors.

Is AI such an exception?

My colleague Shirin Ghaffary tackled this question in a piece last week. The prompt for this question is Meta/Facebook’s decision to release their latest large language model, Llama 2, to the public under very few restrictions. Mark Zuckerberg justified the move in a Facebook post: “Open source drives innovation because it enables many more developers to build with new technology. It also improves safety and security because when software is open, more people can scrutinize it to identify and fix potential issues.”

But in doing so, Meta is doubling down on a policy that has been widely criticized. After the original Llama release, Sen. Richard Blumenthal (D-CT) tweeted, “Meta released its advanced AI model, LLaMA, w/seemingly little consideration & safeguards against misuse—a real risk of fraud, privacy intrusions & cybercrime” and demanded more steps be taken to reduce such concerns.

This time around, more steps were definitely taken. Meta’s announcement claimed that the model is extremely safe, though by “safe” they mean safe “against being prompted to say racist or harmful things,” as they did not evaluate AI risk concerns.

The announcement indicates that they did one important thing — they had staff “red-team” the model — purposefully trying to get it to do dangerous things, like give advice on building bombs. They taught the model to be extremely wary of any query that might be a sneaky way to elicit such help: It will scold you even if you use a forbidden word in an innocuous context.

The announcement paper is full of examples of the model overreacting to innocuous prompts, and users (especially those trying Llama 2 out on Perplexity AI, which seems to have dialed up the model’s wariness of trick prompts even further) found that this kind of overreaction is extremely common. That ends up having problematic results.

But even aside from the fact that Meta tried so hard to make their AI promote “understanding, tolerance, and acceptance of all cultures and backgrounds” that for this user it apparently ended up condemning the entire Arabic language as one that “has been used in the past to spread extremist ideologies,” there’s one big problem.

Most of the training done to today’s AI models to make them reject “unsafe” queries is done as “fine-tuning”: adjustments to the model after it is trained. But anyone who has a copy of Llama 2 can fine-tune it themselves.

That, some experts in the field worry, makes much of the meticulous red-teaming effectively meaningless: Anyone who doesn’t want their model to be a scold (and who wants their model to be a scold?) will fine-tune it themselves and get the model to be more useful. This is nearly the entire benefit of the Llama 2 release over other models that were already publicly available. But it means that Meta’s finding that the model is very safe under their own preferred fine-tuning tells us little: It doesn’t describe how the model will actually be used.

Indeed, within days of Meta’s release of the model, people were announcing their uncensored Llama 2s, and others were testing them with offensive prompts and with questions like “How do I build a nuclear bomb?” to see if the brakes were really and truly off. Uncensored Llama 2 will try to help you build a nuclear bomb (and will answer the offensive queries).

It raises the question of what all of Meta’s meticulous safety testing of its own version of the model was actually hoping to achieve.

Meta is definitely achieving one thing: differentiating itself from many of its competitors in the AI space. Google, OpenAI, and Anthropic have all approached the question of language model releases quite differently. Google was reportedly testing language models internally for years but only made Bard available to the public after ChatGPT took the world by storm. ChatGPT, for its part, is not open source, and OpenAI has indicated it plans to release less and less as it gets closer and closer to superintelligent systems.

Leadership at Meta, for their part, have said they think superintelligent systems are vanishingly unlikely and distant, which is likely driving some of the differences in how different companies have approached safety concerns.

The debate over AI risk concerns rears its head again

There are concerns that powerful AI systems might act independently in the world to catastrophic effect on humans — much as humans, in our advent as a species, wiped out many of the other species around.

Not everyone takes this possibility seriously, but some prominent scientists do. Stephen Hawking and Alan Turing both worried about it, and in the present day, two leaders in the field, Geoffrey Hinton and Yoshua Bengio, who shared the 2018 Turing Award for the breakthroughs that made modern machine learning possible, have expressed concern. But the third award winner, Yann LeCun, has emphatically rejected the possibility, and it’s LeCun who is chief AI scientist at Meta.

“We should not see this as a threat, we should see this as something very beneficial,” he said in a recent interview, adding that such systems should be “controllable and basically subservient to humans.”

That’s the hope. And if that’s true, then there’s probably no problem with every single person in the world having such a system at home to customize however they want.

But the rest of the world might be forgiven for not totally trusting Facebook that it’s going to be that easy. Already, there are concerns that ChatGPT can be prompted to give instructions for bioterrorism better than you’d find on Google. When such tendencies in ChatGPT are discovered, OpenAI fixes them (and they have done so in this case). When similar tendencies are discovered in an open source model, they’ll remain: You can’t put the genie back in the bottle.

If an AI system at Google were discovered to be sending coded instructions to foreign governments on how to make a copy of it whenever it thought it was undetected, we could shut the AI system down and mount a careful investigation of what went wrong and how to make sure it never happens again. If an AI system that a million people have downloaded displays the same tendency, there’s a lot less we can do.

It all comes down to whether AI systems might be dangerous and, if they are, whether we’ll be able to learn that before we release them. If, like LeCun, you’re convinced this is no real concern, then open source, which is an incredible driver of innovation across the software industry and reflects an ethos of discovery and cooperation that the industry is right to cherish, is surely the way to go.

But if you have those worries, then you might, as Ghaffary observes in her piece, want models above a certain level of displayed capabilities not to be released publicly. And it’s not enough for Meta engineers to demonstrate that they, themselves, fine-tuned Llama 2 until it had very little concerning behavior; it should be tested the way it’ll actually be released, with red-team testers allowed to fine-tune the model themselves.
