Except for GOFAI (good old-fashioned AI, which runs on hand-written rules rather than learned patterns), all artificial intelligence algorithms need data in order to work. Most of them need a lot of data—hundreds of thousands or millions of data points. The field of research that works with such large datasets is often called "big data." This page explores what AI algorithms do with all that data, where it comes from, and some of the problems presented by data in AI.
There are many different types of AI algorithm, but at their core, they all work by taking a dataset and finding patterns within it. Computer vision algorithms find patterns in which arrangements of pixels represent a picture of a dog, a car, or a human. Recommender system algorithms find patterns in which user behaviors are associated with liking romantic comedies or horror movies. Generative AI algorithms find patterns in which word is most likely to follow the phrase "as per my last."
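That last, "next word" kind of pattern can be illustrated in a few lines of Python. This is a toy bigram model over a made-up corpus; real generative models learn from billions of words with far more sophisticated statistics, but the core idea of counting which word follows which is the same:

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; a real generative model trains on billions of words.
corpus = "as per my last email as per my last message as per my request".split()

# Count which word follows each word (a bigram model).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def most_likely_next(word):
    """Predict the continuation seen most often in training."""
    return following[word].most_common(1)[0][0]

print(most_likely_next("my"))  # prints "last": it followed "my" twice, "request" once
```

The model has no understanding of language; it has simply found a statistical pattern in its (very small) dataset.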
AI algorithms are built to look for patterns, and sometimes they go overboard. An algorithm can become so good at finding patterns in the dataset that it was trained on that it gets worse at making predictions about new data.
A comical example of overfitting in the wild.
This problem is called overfitting, and it happens when an algorithm pays too much attention to small, random trends in its sample dataset that are not representative of patterns in the real world. There are techniques for preventing this error, although they often come with the tradeoff of making the algorithm less sensitive to real patterns.
The green line has overfit the classification; it won't predict new data well. The black line will generalize better to new data. (Image credit: Chabacano, CC BY-SA 4.0)
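Overfitting can be demonstrated numerically. The sketch below (using NumPy; the data and polynomial degrees are invented for illustration) fits both a simple model and an overly flexible one to noisy samples of a linear trend, then compares their errors on training data versus held-out data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of an underlying linear trend, y = 2x + noise.
x_train = np.linspace(0, 1, 15)
y_train = 2 * x_train + rng.normal(0, 0.3, size=x_train.shape)
x_test = np.linspace(0.02, 0.98, 15)
y_test = 2 * x_test + rng.normal(0, 0.3, size=x_test.shape)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on data (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

simple = np.polyfit(x_train, y_train, deg=1)    # matches the real trend
overfit = np.polyfit(x_train, y_train, deg=12)  # flexible enough to chase the noise

# The overfit model looks better on the data it was trained on...
print("train error:", mse(simple, x_train, y_train), mse(overfit, x_train, y_train))
# ...but its error jumps on data it has never seen.
print("test error: ", mse(simple, x_test, y_test), mse(overfit, x_test, y_test))
```

The degree-12 polynomial threads almost exactly through the noisy training points, so its training error is nearly zero, but the wiggles it learned are noise, not signal, and they hurt it on new data.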
In order to make sure that an algorithm isn't overfitting, AI researchers usually split their data into a training set and a testing set. The algorithm is trained—learns patterns—from the training set. It is then run on the testing set to check how its predictions perform on new data whose ground truth the researchers already know.
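A minimal train/test split can be written in a few lines of plain Python. The toy dataset and the 80/20 ratio here are illustrative (the ratio varies in practice, though 80/20 is a common convention):

```python
import random

# Toy labeled dataset: (feature, label) pairs. The task is illustrative.
data = [(x, "even" if x % 2 == 0 else "odd") for x in range(100)]

random.seed(42)       # fixed seed so the split is reproducible
random.shuffle(data)  # shuffle first, so the split isn't biased by ordering

split = int(0.8 * len(data))  # 80% for training, 20% held out for testing
train_set, test_set = data[:split], data[split:]

# The model only ever sees train_set while learning patterns;
# test_set is held back to measure performance on unseen data.
print(len(train_set), len(test_set))  # prints: 80 20
```

The essential property is that the two sets do not overlap: any example the model is evaluated on must be one it never saw during training.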
As mentioned above, most AI algorithms require a lot of training data to work, and creating or finding such large datasets usually takes significant effort. Some of the most common sources from which AI researchers get their data are:
Because of ethical concerns about privacy and bias, and legal concerns about copyright, it's important to understand where the data used for an AI algorithm comes from.
Because AI datasets can be so difficult and time-consuming to create, there are a number of large datasets that are commonly used by and shared between researchers working in the same subfield. A list of some of the most frequently used AI datasets is included below, along with the size and source of the data, and the applications it is most commonly used for.
Dataset | Description | Size | Source | Applications |
---|---|---|---|---|
Iris data | A small data set containing the dimensions of various irises, each row labeled with the type of iris. | 150 rows, four variables | Compiled by biologist Ronald Fisher in 1936 for an early paper on classification. | Commonly packaged in statistical software for use on toy classification problems. |
MNIST | A large dataset of handwritten digits (0-9) from many different writers, labeled with the correct digit. | 60,000 digits (training set), 10,000 digits (testing set) | Collected via mailed questionnaire by the National Institute of Standards and Technology. | Training optical character recognition algorithms. |
ImageNet | A large dataset of URLs of images available on the internet, hand-annotated to label objects contained in the images. | 14 million images | Public image links are compiled by the ImageNet research team; annotations are crowdsourced. | Training object detection and other computer vision algorithms. |
FERET | A large dataset of photographs of human faces in a semi-controlled set of positions and lighting. | 14,126 images of 1,199 faces | The photographs were taken by Harry Wechsler on behalf of the Department of Defense's Facial Recognition Technology (FERET) program. Photographs were taken of volunteers in a studio. | Training facial recognition algorithms. |
THUMOS | A large dataset of videos labeled with the human actions they portray. | 254 hours of video, 25 million frames | Videos were collected from YouTube. | Training action recognition algorithms. |