No Data to Train AI?

AI companies are fast running out of the data they need to train their AI models.

How is AI trained using data?

When an [[artificial intelligence (AI) model::A computer program or system trained to perform some human-like tasks such as answering questions, creating videos, etc. One such example is ChatGPT.]] is created, it has no knowledge. AI companies like OpenAI teach it using online resources such as images, information from Wikipedia, books, etc. They do this to ensure that the AI works properly.

Suppose you show the AI images of cats and dogs and point out their differences. The AI will identify patterns and learn to distinguish them.
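Here is a tiny Python sketch of that idea. It is only a toy, not how companies like OpenAI actually train their models. It assumes each animal can be described by two made-up numbers (weight and ear length), and the names `training_data`, `train`, and `predict` are invented for this example. The program finds the average cat and the average dog, then guesses which one a new animal is closer to.

```python
# Toy example: learn to tell cats from dogs using two made-up features.
# Each animal is (weight in kg, ear length in cm). All numbers are invented.
training_data = [
    ((4.0, 6.0), "cat"),
    ((5.0, 7.0), "cat"),
    ((3.5, 6.5), "cat"),
    ((20.0, 10.0), "dog"),
    ((30.0, 12.0), "dog"),
    ((25.0, 11.0), "dog"),
]


def train(examples):
    """Find the average (weight, ear length) for each label."""
    sums = {}
    counts = {}
    for (weight, ear), label in examples:
        total_w, total_e = sums.get(label, (0.0, 0.0))
        sums[label] = (total_w + weight, total_e + ear)
        counts[label] = counts.get(label, 0) + 1
    return {
        label: (w / counts[label], e / counts[label])
        for label, (w, e) in sums.items()
    }


def predict(model, animal):
    """Pick the label whose average is closest to the new animal."""
    weight, ear = animal

    def distance(label):
        avg_w, avg_e = model[label]
        return (weight - avg_w) ** 2 + (ear - avg_e) ** 2

    return min(model, key=distance)


model = train(training_data)
print(predict(model, (4.5, 6.2)))    # prints: cat
print(predict(model, (28.0, 11.5)))  # prints: dog
```

With only six examples the program can already sort these two new animals. Real AI models work with far more examples and far more detail, which is why they need so much data.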

But recent news shows that AI companies are running out of data to train their models.

Oh no! What’s the issue?

AI companies need high-quality data to train their AI models. Think of the data as a fantastic textbook full of useful information that helps you learn.

However, these companies say that they will run out of useful data in the next two years.

Imagine you've finished all your school books and have nothing new to read. What would you do? Would you stop learning or find other ways to learn? In the same way, AI companies are looking for new ways to train their models.

What are the companies doing?

A person using ChatGPT. Photo by Solen Feyissa on Pexels.

OpenAI is using YouTube videos and [[podcasts::A podcast is a digital discussion (audio or video) series that is episodic and may be played or downloaded from the internet. These podcasts can cover various topics, such as news, entertainment, sports, etc.]] to train its latest AI model. Google also uses them and plans to use data from Google Docs and Sheets. Meta plans to purchase a [[publishing business::A company that produces and distributes books, magazines, and other literary texts.]] and wants to use its books to train its AI model.

Some are even thinking of using [[synthetic data::It refers to data that has been generated or created by AI.]]. But these solutions face certain challenges.

AI companies are getting [[criticised::To show disapproval of something or someone.]] for using public data to train their AIs. People want AI companies to ask for their permission or pay them before using their data.

Also, using synthetic data to train AI models can cause them to repeat the same mistakes. This can cause the models to malfunction, as those mistakes might not get corrected.
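The short Python sketch below shows how a mistake can be passed along. It is a toy under big assumptions: the "models" are just lookup tables that remember answers, and the wrong answer is made up for illustration. The names `real_data`, `first_model`, `synthetic_data`, and `second_model` are invented for this example.

```python
# Toy example: a mistake gets copied when a model learns only from
# another model's output. Real AI models are far more complex than this.

real_data = {"2 + 2": "4", "3 + 3": "6", "5 + 5": "10"}


def train(examples):
    """'Training' here simply means remembering every example."""
    return dict(examples)


# The first model learns from real data but picks up one mistake
# (added by hand here, purely for illustration).
first_model = train(real_data)
first_model["3 + 3"] = "7"

# Synthetic data: answers written by the first model instead of by people.
synthetic_data = {question: first_model[question] for question in real_data}

# The second model sees only synthetic data, so nothing corrects the mistake.
second_model = train(synthetic_data)

wrong_answers = [q for q in real_data if second_model[q] != real_data[q]]
print(wrong_answers)  # prints: ['3 + 3']
```

The second model never sees the correct answer, so it repeats the first model's mistake.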

Quick Revision

  • AI learns from lots of data like images, books, videos, and websites.

  • Companies say they may run out of good data soon.

  • So, they are using YouTube, podcasts, documents, and books to train AI.

  • People are upset because companies use data without permission, and synthetic data can make AI repeat mistakes.
