In the race to make each new AI model better than the last, companies are resorting to unethical data collection techniques to stay ahead of the competition. Our private data, including medical records, photographs, and social media content, is making its way into the data sets used to train AI models.
Your data is being stolen
Are you willing to give up your privacy for convenience?
Reading all the privacy policies from big tech companies that you encounter in a year would take 30 full working days of your life.
Source: The Cost of Reading Privacy Policies
Home is our safe space, but what happens when our household appliances start to leak our data? Investigative data journalists Kashmir Hill and Surya Mattu revealed how the smart devices in our homes are doing precisely this. At first, it might sound mundane that your electric toothbrush routinely sends data back to its manufacturer. However, in their 2018 TED talk, they show how some of this collected data can come back to haunt us. For example, your dental insurance provider could buy your data from the toothbrush company and charge you a higher premium if you skip brushing your teeth at night.
Data sets used to train image synthesis AIs are built by scraping images from the internet, often without the permission of copyright holders or of the people depicted. Even patients' private medical records end up as training data for AI models. Lapine, an artist from California, discovered that medical record photos taken by her doctor in 2013 were included in LAION-5B, an image set used to train Stable Diffusion and Google Imagen. She found out through Have I Been Trained, a tool by artist Holly Herndon that lets anyone check whether their pictures have been used to train AI models.
The LAION-5B dataset, which contains more than 5 billion images, includes photoshopped celebrity porn, hacked and stolen nonconsensual porn, and graphic images of ISIS beheadings. More mundanely, it includes living artists’ artwork, photographers’ photos, medical imagery, and photos of people who presumably never imagined their images would end up being used to train an AI.
Source: AI Is Probably Using Your Images and It’s Not Easy to Opt Out, Vice
Compromised identities
Research shows that AI-generated faces can be reverse-engineered to expose the actual humans who inspired them
AI-generated faces are mainstream now; designers use them as models for product shoots or as fake personas. The idea was that since these were not real people, no consent was needed. However, these generated faces are not as unique as they seem. In 2021, researchers were able to take a GAN-generated face and trace it back to the real human faces in the training dataset that inspired it. The generated faces resemble the originals with only minor changes, exposing the actual identities of the people in these datasets.
Unlike GANs, which generate images that closely resemble their training samples as described above, diffusion models (such as DALL-E or Midjourney) are thought to produce more realistic images that differ significantly from those in the training set. By generating novel images, they seemed to offer a way to preserve the privacy of people in the dataset. However, a paper titled Extracting Training Data from Diffusion Models shows that diffusion models memorize individual images from their training data and can regenerate them at runtime. The popular notion that AI models are “black boxes” that reveal nothing about what is inside them is being revisited through these experiments.
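To make the idea of “memorization” a little more concrete, here is a minimal sketch of the kind of near-duplicate check that can flag a generated image as a likely copy of a training image: compare perceptual hashes and treat a small Hamming distance as a match. This is only an illustration, not the method used in the paper (which relies on much stronger similarity measures and a full extraction pipeline); the file paths and threshold below are hypothetical, and it assumes the open-source imagehash library.

```python
# Illustrative sketch only: a crude near-duplicate check between a generated
# image and candidate training images using perceptual hashing.
# Assumes `pip install pillow imagehash`; paths and threshold are made up.
from pathlib import Path

from PIL import Image
import imagehash

HAMMING_THRESHOLD = 5  # small distance => images are near-duplicates


def find_near_duplicates(generated_path: str, training_dir: str) -> list[str]:
    """Return training images whose perceptual hash is close to the generated image's."""
    generated_hash = imagehash.phash(Image.open(generated_path))
    matches = []
    for candidate in Path(training_dir).glob("*.jpg"):
        candidate_hash = imagehash.phash(Image.open(candidate))
        # Subtracting two ImageHash objects gives their Hamming distance.
        if generated_hash - candidate_hash <= HAMMING_THRESHOLD:
            matches.append(str(candidate))
    return matches


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    hits = find_near_duplicates("generated_face.png", "training_images/")
    print("Possible memorized sources:", hits)
```

If a supposedly “novel” output keeps matching a specific training image under checks like this, that is a strong hint the model has memorized it rather than generalized from it.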
AI surveillance
Supercharging the surveillance society with AI
One of the most controversial cases of AI-powered surveillance occurred during the Hong Kong protests in 2019. The police used facial recognition technology to identify protestors and penalize them individually. When protestors realized this, they aimed modified laser pointers at the cameras to burn out their image sensors.