When we use social media or search engines, we're often paying for the privilege by divulging personal information. The business model of big technology companies has, in effect, given personal data an economic value that can be bought, sold, and traded.
Now, the exploding generative artificial intelligence market is further complicating the landscape.
Generative AI refers to deep-learning models that can generate high-quality text, images, and other content. The models 'learn' by analysing vast amounts of data.
This data might be biomedical abstracts, in the case of a small model designed to answer medical questions. Or, in the case of OpenAI's ChatGPT, every reservoir of reputable English-language text on the internet, followed by transcriptions of more than one million hours of YouTube videos.
The race to lead AI has become a desperate hunt for the digital data needed to advance it. To obtain that data, companies such as OpenAI, Google, and Meta have cut corners, ignored corporate policies, and debated bending the law, according to the New York Times.
Meta, the owner of Facebook, Instagram, and WhatsApp, has confirmed it's been using customers' social media posts and interactions to train its AI systems. It's likely New Zealanders' posts have been used for this purpose since at least September 2023.
Meanwhile, European customers were given a heads-up about the relevant privacy policy change, and the opportunity to push back against it, thanks to their data protection regulations.
What's happening?
Meta's open-source AI model is called Llama. It was trained on publicly available online text, as well as some public social media information, explained Dr Kenneth Johnson, a senior lecturer in Auckland University of Technology's computer science and software engineering department.
"Although Meta said in 2023 it's only using public posts to train its AI, its privacy policy says it can use any of the contents on its platforms."
He listed some of the information we consciously - and unconsciously - share with the social media giant: posts, photos, messages, apps, purchases, interactions, connections, devices, internet service provider, language, location information, and more.
"Any of this information can be used to train the AI, but whether it is or not is hard to say. It's challenging to pin down exactly what's happening."
Users may argue they didn't agree to this. That doesn't really matter, Johnson said.
"If you're using the platform, you've agreed to Meta's terms and conditions. As time goes on, [policies are updated]. You've implicitly agreed to them, without having to click a button."
And it's not just Meta: "All digital services will collect some sort of data about you. I think it's important for the average New Zealander to realise their data is valuable."
Even deleting social media accounts "doesn't fully protect you from having your data harvested in future", Allyn Robins, AI lead at public interest think tank Brainbox, told RNZ.
"More and more, every single online interaction is being mined for data to train AI."
While individuals can push back by minimising what they post online, avoiding businesses and platforms that engage in unethical behaviour, and using tools such as Nightshade, which render data unsuitable for model training, ultimately "these are big, systemic problems that require big, systemic solutions".
"The most effective way to have an impact is probably to do what you can to make pursuing those solutions more appealing to governments and international bodies."
The EU pushes back
Europe's General Data Protection Regulation (GDPR), possibly the world's toughest data protection rules, has created obstacles for Meta and other companies looking to improve their models with user-generated material.
In June, Meta confirmed it would delay training its large language models on Facebook and Instagram content from its users in the European Union and United Kingdom.
Ireland's Data Protection Commission, which oversees Meta's compliance with the GDPR, said it welcomed the decision and would continue to engage with the company.
"The GDPR is kind of a gift in this aspect, because the EU is a large and powerful enough economic force that large companies tend to comply with their regulations rather than abandoning the European market - which means they already have compliance processes for the GDPR built into their systems," Robins said.
But legal solutions are imperfect, he added, noting ongoing lawsuits arguing that much of the early data gathering for big AI models like OpenAI's ChatGPT breached not only the GDPR but also copyright law.
"There's a strong incentive to just grab whatever you can and try to move quickly enough that by the time people start paying attention, you're too big and influential to meaningfully punish for any indiscretions you may have committed on your way to the top."
NZ laws and regulations
In its briefing to the incoming Justice Minister in December 2023, the Office of the Privacy Commissioner argued for "further modernising the Privacy Act and better resourcing the privacy regulator", saying it was "losing alignment with like-minded countries".
"The Privacy Act is based on policies agreed in 2013, and this past decade has witnessed the development and widespread adoption of significant new technologies such as biometrics and AI, and does not account for new risks to children's privacy."
In an email to RNZ, a spokesperson for the Office of the Privacy Commissioner said: "AI tools work based on gathering huge amounts of training data and that does create some new privacy challenges. There are risks of people's information being leaked when these tools answer questions or generate images, and they can also be used in bad ways to facilitate scams and impersonation."
While many widely used AI models have blocks designed to prevent them from sharing identifying information about individuals, some models have been tricked into leaking data.
Researchers have found popular image generation models, for example, could be prompted to produce identifiable photos of real people. They could also regurgitate exact copies of medical images and copyrighted work by artists.
"One step we could take is updating our privacy law to offer better protections and support safe and trusted use of these tools," the spokesperson said. "It is not a case of regulating these technologies out of existence. Taking simple steps to think about privacy before rolling out of AI and biometric technologies gives better results for everyone."
Minister of Science, Innovation and Technology Judith Collins met with Privacy Commissioner Michael Webster on Wednesday.
Afterwards, in an emailed response to RNZ, she said: "New Zealand has existing regulatory frameworks, including for privacy, consumer protection and human rights, that can provide applicable rights and remedies for harms arising from AI."
If regulatory intervention is needed, "we need to take a flexible approach, using our existing regulatory frameworks and learning from our peers".