Artificial Intelligence

A guide to resources for learning basics of artificial intelligence

Data in AI

Except for GOFAI (good old-fashioned AI, which relies on hand-written rules rather than learning from examples), all artificial intelligence algorithms need data in order to work. Most of them need a lot of data—hundreds of thousands or millions of data points. The field of research that works with such large datasets is often called "big data." This page explores what AI algorithms do with all that data, where it comes from, and some of the problems data presents in AI.


How Does AI Use Data?

There are many different types of AI algorithms, but at their core, they all work by taking a dataset and finding patterns within it. Computer vision algorithms find patterns in which arrangements of pixels represent a picture of a dog, a car, or a human. Recommender system algorithms find patterns in which user behaviors are associated with liking romantic comedies or horror movies. Generative AI algorithms find patterns in which word is most likely to follow the phrase, "as per my last."


AI algorithms are built to look for patterns, and sometimes they go overboard. An algorithm can become so good at finding patterns in the dataset that it was trained on that it gets worse at making predictions about new data.

Tweet from user afraidofwasps. Text: Guy who has only seen The Boss Baby, watching his second movie: Getting a lot of Boss Baby vibes from this...

A comical example of overfitting in the wild.

This problem is called overfitting: the algorithm pays too much attention to small, random trends in its training dataset that are not representative of the patterns in the real world. There are methods of preventing this error, although they often come at the cost of making the algorithm less sensitive to real patterns.
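Overfitting is easy to reproduce. Below is a minimal sketch using NumPy (the data and polynomial degrees are made up for illustration): a straight-line model and a very flexible high-degree polynomial are both fit to the same noisy linear data. The flexible model hugs the training points more closely, but it chases the noise rather than the underlying trend.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a simple linear trend: y = 2x + noise
x = rng.uniform(0, 1, 20)
y = 2 * x + rng.normal(0, 0.2, 20)

# A degree-1 fit matches the real pattern; a degree-15 fit has
# enough flexibility to chase the noise in these 20 points.
simple = np.polynomial.Polynomial.fit(x, y, deg=1)
wiggly = np.polynomial.Polynomial.fit(x, y, deg=15)

def mse(model, xs, ys):
    """Mean squared error of a model's predictions."""
    return np.mean((model(xs) - ys) ** 2)

# On the training points, the wiggly model always looks at least as
# good, because extra flexibility can only reduce training error...
train_gap = mse(simple, x, y) - mse(wiggly, x, y)

# ...but on fresh data from the same process, that flexibility
# typically hurts rather than helps.
x_new = rng.uniform(0, 1, 200)
y_new = 2 * x_new + rng.normal(0, 0.2, 200)
```

Comparing `mse(simple, x_new, y_new)` against `mse(wiggly, x_new, y_new)` shows the gap between training performance and real-world performance that defines overfitting.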

A scatter plot divided into slightly overlapping clusters of red and blue dots. A very wavy green line through the two clusters divides them completely. A smooth black line through the clusters divides them significantly, but imperfectly.

The green line has overfit the classification; it won't predict new data well. The black line will generalize better to new data. (Image credit: Chabacano, CC BY-SA 4.0)

To make sure that an algorithm isn't overfitting, AI researchers usually split their data into a training set and a testing set. The algorithm is trained—learns patterns—on the training set. It is then run on the testing set to check how its predictions perform on new data whose ground truth the researchers already know.
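A random split like this can be sketched in a few lines of Python with NumPy. The dataset here is placeholder random data, and the 80/20 ratio is just a common convention, not a rule:

```python
import numpy as np

rng = np.random.default_rng(42)

# Placeholder dataset: 1,000 examples, 4 features, 1 label each
X = rng.normal(size=(1000, 4))
y = rng.integers(0, 2, size=1000)

# Shuffle the example indices so the split is random, then
# hold out the last 20% as the testing set.
indices = rng.permutation(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = indices[:split], indices[split:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
```

The shuffle matters: if the data is ordered in some way (by date, by category), taking the first 80% directly would give the algorithm a training set that isn't representative of the testing set.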


Where Does AI Data Come From?

As mentioned above, most AI algorithms require a lot of training data to work, and assembling such large datasets usually takes a lot of work. Some of the most common sources from which AI researchers get their data are:

  • User-generated data: Social media, ecommerce, and streaming media sites generate millions of data points by capturing data about how people use their sites. For instance, when a user likes a post on social media, that like is added to a data set of user interactions. Websites use these data sets for their own internal algorithms.
  • Web scraping: The internet is full of data that has been freely and publicly posted. Social media websites like Twitter publish vast swaths of text data; YouTube is home to millions of hours of video; DeviantArt and Flickr host user-created artwork and images. These data are sometimes made available to the public via an official application programming interface (API). However, even when they are not, they can often be "scraped" using various tools. Many well-known AI applications like ChatGPT and DALL-E were trained on public data from the internet.
  • Government data: The U.S. (and other) governments routinely collect large amounts of data about their citizens, and some of that data is publicly available. The U.S. Census is the most famous of those datasets, but many others are used by AI researchers. Criminologists use the Bureau of Justice Statistics' National Crime Victimization Survey to find patterns in crime. The MNIST dataset, derived from handwriting samples collected by the National Institute of Standards and Technology, is commonly used for training optical character recognition algorithms that work on handwritten digits.
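At its core, web scraping means downloading pages and extracting the useful text or media from their markup. The sketch below shows only the extraction step, using Python's standard-library HTML parser on an inline snippet; a real scraper would fetch pages over HTTP and should respect each site's robots.txt and terms of service.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects the visible text from an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; keep non-empty pieces.
        text = data.strip()
        if text:
            self.chunks.append(text)

# A stand-in for a downloaded page (no real site is fetched here).
page = "<html><body><p>Cats are great.</p><p>Dogs too.</p></body></html>"

parser = TextCollector()
parser.feed(page)
# parser.chunks == ["Cats are great.", "Dogs too."]
```

Scraped text like this, collected at the scale of millions of pages, is the kind of raw material that large language models are trained on.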

Because of ethical concerns about privacy and bias, and legal concerns about copyright, it's important to understand where the data used for an AI algorithm comes from.


Commonly Used AI Datasets

Because AI datasets can be so difficult and time-consuming to create, a number of large datasets are commonly used and shared among researchers working in the same subfield. Some of the most frequently used AI datasets are listed below, along with each dataset's size, source, and most common applications.

  • Iris data
    Description: A small dataset containing the dimensions of various irises, each row labeled with the type of iris.
    Size: 150 rows, four variables.
    Source: Compiled by biologist Ronald Fisher in 1936 for an early paper on classification.
    Applications: Commonly packaged in statistical software for use on toy classification problems.
  • MNIST
    Description: A large dataset of handwritten digits (0-9) from many different writers, labeled with the correct digit.
    Size: 60,000 digits (training set); 10,000 digits (testing set).
    Source: Handwriting samples collected on forms by the National Institute of Standards and Technology.
    Applications: Training optical character recognition algorithms.
  • ImageNet
    Description: A large dataset of URLs of images available on the internet, hand-annotated to label the objects contained in the images.
    Size: 14 million images.
    Source: Public image links compiled by the ImageNet research team; annotations are crowdsourced.
    Applications: Training object detection and other computer vision algorithms.
  • FERET
    Description: A large dataset of photographs of human faces in a semi-controlled set of positions and lighting.
    Size: 14,126 images of 1,199 faces.
    Source: Photographs of volunteers taken in a studio by Harry Wechsler on behalf of the Department of Defense's Facial Recognition Technology (FERET) program.
    Applications: Training facial recognition algorithms.
  • THUMOS
    Description: A large dataset of videos labeled with the human actions they portray.
    Size: 254 hours of video; 25 million frames.
    Source: Videos collected from YouTube.
    Applications: Training action recognition algorithms.
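To see how a shared dataset like these is typically used, here is a sketch that loads the Iris data (which ships bundled with the scikit-learn library, assuming it is installed), splits it into training and testing sets, and fits a simple nearest-neighbors classifier. The choices of a 20% test set and 3 neighbors are illustrative, not canonical:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# The Iris data: 150 rows of 4 flower measurements each,
# labeled with one of three iris species.
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows as a testing set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A k-nearest-neighbors classifier predicts a flower's species from
# the species of the most similar flowers in the training set.
model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```

Because Iris is tiny and its classes are well separated, almost any classifier scores highly on it; that is exactly what makes it useful as a toy problem for learning the workflow.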