Wallington, New Jersey
AI is currently learning from free, publicly available data. In this new era of artificial intelligence, what happens if AI-generated data contaminates the very data these models learn from? Could this spell the end for AI models?
Currently, we teach AI software using free public internet data. We feed this data into the AI, which then learns the patterns in it. For instance, when the AI receives the handwriting of multiple individuals, it learns the strokes and letter sizes and attempts to replicate them. The output is not always perfect; it will contain mistakes, such as making a couple of the letters too big. If this AI-generated handwriting is later fed back into the software, the AI will replicate its earlier mistakes on top of new ones, ultimately producing handwriting that is completely unreadable. Ilia Shumailov, a machine-learning researcher at the University of Oxford, documented this effect in experiments with the language model OPT-125m. This dilemma is why AI models need new human-produced data to work accurately. As more AI-generated content is published on the internet, AI has begun to train on AI-produced data that already carries these distortions, leading to numerous issues.
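To see how these mistakes compound, here is a minimal toy sketch in Python (not from Shumailov's experiments or the article): the "model" simply fits a normal distribution to its training data, then generates the next generation's training data by sampling from itself. The means, spreads, and sample sizes are made up purely for illustration.

```python
# Toy model-collapse loop: each generation "trains" by fitting a Gaussian
# to its data, then produces the data the next generation will train on.
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: human-produced data (think of it as measured letter sizes).
data = rng.normal(loc=10.0, scale=2.0, size=50)

for generation in range(15):
    mu, sigma = data.mean(), data.std()               # "train" on the current data
    print(f"generation {generation:2d}: mean={mu:5.2f}  std={sigma:4.2f}")
    data = rng.normal(loc=mu, scale=sigma, size=50)   # next generation trains on model output
```

Each generation inherits the previous generation's estimation errors, so the fitted numbers wander away from the original values, and over many generations the spread of the data tends to shrink. Real language models are far more complex than this toy, but the same feedback loop is at work.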
Chief among them is the risk of model collapse, which grows as the AI becomes increasingly detached from the original human-produced data it learned from. As the AI continues to train on its own output, its vocabulary starts to diminish, and the data becomes increasingly inaccurate and distant from its original goal. Preventing this data “contamination” is difficult as well, because it is becoming harder and harder to distinguish AI-written text, such as essays, from human-written text. Differentiating the two is a large and daunting task that would require cooperation from many in the industry. The stakes are especially high for AI in medicine, specifically medical chatbots: a chatbot that keeps processing AI-generated material will lose vocabulary, reducing the range of potential medical treatments and diagnoses it can generate.
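The shrinking vocabulary can be illustrated with the same kind of toy loop. In the sketch below (again a made-up example, using an invented Zipf-style word-frequency distribution rather than any real corpus), each generation re-estimates word frequencies from text generated by the previous generation; any word that happens not to be produced disappears for good, so the number of distinct words can only fall.

```python
# Toy vocabulary loss: retraining on generated text drops rare words.
import numpy as np

rng = np.random.default_rng(1)

vocab_size = 5000     # distinct words in the original "human" text
corpus_size = 20000   # words generated per generation

# Zipf-like initial word frequencies, as if estimated from human writing.
probs = 1.0 / np.arange(1, vocab_size + 1)
probs /= probs.sum()

for generation in range(6):
    alive = np.count_nonzero(probs)
    print(f"generation {generation}: {alive} distinct words remain")
    corpus = rng.choice(vocab_size, size=corpus_size, p=probs)   # generate text
    counts = np.bincount(corpus, minlength=vocab_size)           # retrain on that text
    probs = counts / counts.sum()
```

In a real model the loss is softer, since probabilities shrink rather than hit zero exactly, but the direction is the same: rare terms, including uncommon diagnoses, get generated less and less often.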
Researchers have suggested building a publicly available data set that is free and verified to be human-produced, so that AI models can keep functioning. However, AI's demand for data is constantly increasing, and the supply of free public data may not be able to keep up. Another proposal is to rely on pre-AI data, created before generative models flooded the internet. The worry is that a model trained only on older data would produce results that no longer align with the current state of the world. Again, the ideal scenario would be to reliably distinguish current human-produced data from AI data.
At the current rate of innovation, experts predict that AI will exhaust the supply of human-produced data within the next decade. Researchers are rushing to find solutions to these problems before the collapse of AI models begins to manifest.
Sources
https://www.theatlantic.com/technology/archive/2024/02/artificial-intelligence-self-learning/677484/