These datasets are used for machine-learning research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less intuitively, the availability of high-quality training datasets.[1] High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality datasets for unsupervised learning can also be difficult and costly to produce.[2][3][4][5]
Datasets consisting primarily of images or videos for tasks such as object detection, facial recognition, and multi-label classification.
In computer vision, face images have been used extensively to develop facial recognition systems, face detection, and many other projects that use images of faces.
Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
FERET (facial recognition technology) | 11338 images of 1199 individuals in different positions and at different times. | None. | 11,338 | Images | Classification, face recognition | 2003 | [6][7] | United States Department of Defense |
CMU Pose, Illumination, and Expression (PIE) | 41,368 color images of 68 people in 13 different poses. | Images labeled with expressions. | 41,368 | Images, text | Classification, face recognition | 2000 | [8][9] | R. Gross et al. |
Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) | 7,356 video and audio recordings of 24 professional actors. 8 emotions each at two intensities. | Files labelled with expression. Perceptual validation ratings provided by 319 raters. | 7,356 | Video, sound files | Classification, face recognition, voice recognition | 2018 | [10][11] | S.R. Livingstone and F.A. Russo |
SCFace | Color images of faces at various angles. | Location of facial features extracted. Coordinates of features given. | 4,160 | Images, text | Classification, face recognition | 2011 | [12][13] | M. Grgic et al. |
YouTube Faces DB | Videos of 1,595 different people gathered from YouTube. Each clip is between 48 and 6,070 frames. | Identity of those appearing in videos and descriptors. | 3,425 videos | Video, text | Video classification, face recognition | 2011 | [14][15] | L. Wolf et al. |
300 videos in-the-Wild | 114 videos annotated for facial landmark tracking. The 68 landmark mark-up is applied to every frame. | None | 114 videos, 218,000 frames. | Video, annotation file. | Facial landmark tracking. | 2015 | [16] | Shen, Jie et al. |
Grammatical Facial Expressions Dataset | Grammatical Facial Expressions from Brazilian Sign Language. | Microsoft Kinect features extracted. | 27,965 | Text | Facial gesture recognition | 2014 | [17] | F. Freitas et al. |
CMU Face Images Dataset | Images of faces. Each person is photographed multiple times to capture different expressions. | Labels and features. | 640 | Images, Text | Face recognition | 1999 | [18][19] | T. Mitchell |
Yale Face Database | Faces of 15 individuals in 11 different expressions. | Labels of expressions. | 165 | Images | Face recognition | 1997 | [20][21] | J. Yang et al. |
Cohn-Kanade AU-Coded Expression Database | Large database of images with labels for expressions. | Tracking of certain facial features. | 500+ sequences | Images, text | Facial expression analysis | 2000 | [22][23] | T. Kanade et al. |
FaceScrub | Images of public figures scrubbed from image searching. | Name and m/f annotation. | 107,818 | Images, text | Face recognition | 2014 | [24][25] | H. Ng et al. |
BioID Face Database | Images of faces with eye positions marked. | Manually set eye positions. | 1521 | Images, text | Face recognition | 2001 | [26][27] | BioID |
Skin Segmentation Dataset | Randomly sampled color values from face images. | B, G, R, values extracted. | 245,057 | Text | Segmentation, classification | 2012 | [28][29] | R. Bhatt. |
Bosphorus | 3D Face image database. | 34 action units and 6 expressions labeled; 24 facial landmarks labeled. | 4652 | Images, text | Face recognition, classification | 2008 | [30][31] | A Savran et al. |
UOY 3D-Face | neutral face, 5 expressions: anger, happiness, sadness, eyes closed, eyebrows raised. | labeling. | 5250 | Images, text | Face recognition, classification | 2004 | [32][33] | University of York |
CASIA | Expressions: Anger, smile, laugh, surprise, closed eyes. | None. | 4624 | Images, text | Face recognition, classification | 2007 | [34][35] | Institute of Automation, Chinese Academy of Sciences |
CASIA | Expressions: Anger Disgust Fear Happiness Sadness Surprise | None. | 480 | Annotated Visible Spectrum and Near Infrared Video captures at 25 frames per second | Face recognition, classification | 2011 | [36] | Zhao, G. et al. |
BU-3DFE | neutral face, and 6 expressions: anger, happiness, sadness, surprise, disgust, fear (4 levels). 3D images extracted. | None. | 2500 | Images, text | Facial expression recognition, classification | 2006 | [37] | Binghamton University |
Face Recognition Grand Challenge Dataset | Up to 22 samples for each subject. Expressions: anger, happiness, sadness, surprise, disgust, puffy. 3D Data. | None. | 4007 | Images, text | Face recognition, classification | 2004 | [38][39] | National Institute of Standards and Technology |
Gavabdb | Up to 61 samples for each subject. Expressions neutral face, smile, frontal accentuated laugh, frontal random gesture. 3D images. | None. | 549 | Images, text | Face recognition, classification | 2008 | [40][41] | King Juan Carlos University |
3D-RMA | Up to 100 subjects, expressions mostly neutral. Several poses as well. | None. | 9971 | Images, text | Face recognition, classification | 2004 | [42][43] | Royal Military Academy (Belgium) |
SoF | 112 persons (66 males and 46 females) wearing glasses under different illumination conditions. | A set of synthetic filters (blur, occlusions, noise, and posterization) with different levels of difficulty. | 42,592 (2,662 original images × 16 synthetic images) | Images, Mat file | Gender classification, face detection, face recognition, age estimation, and glasses detection | 2017 | [44][45] | Afifi, M. et al. |
IMDB-WIKI | IMDB and Wikipedia face images with gender and age labels. | None | 523,051 | Images | Gender classification, face detection, face recognition, age estimation | 2015 | [46] | R. Rothe, R. Timofte, L. V. Gool |
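
The SoF instance count in the table above is the product of two figures given in the same row: 2,662 original images, each with 16 synthetic variants. A minimal Python check of that arithmetic follows; the variable names are illustrative and not part of the dataset.

```python
# Figures taken from the SoF row of the table above.
original_images = 2_662
synthetic_variants_per_image = 16

# Each original image contributes 16 synthetic images to the total.
total_instances = original_images * synthetic_variants_per_image
print(total_instances)  # 42592
```
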
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Human Motion DataBase (HMDB51) | 51 action categories, each containing at least 101 clips, extracted from a range of sources. | None. | 6,766 video clips | video clips | Action classification | 2011 | [47] | H. Kuehne et al. |
TV Human Interaction Dataset | Videos from 20 different TV shows for predicting social actions: handshake, high five, hug, kiss and none. | None. | 6,766 video clips | video clips | Action prediction | 2013 | [48] | Patron-Perez, A. et al. |
UT Interaction | People acting out one of 6 actions (shake-hands, point, hug, push, kick, and punch) sometimes with multiple groups in the same video clip. | None. | 120 video clips | video clips | Action prediction | 2009 | [49] | Ryoo, M. S. et al. |
UT Kinect | 10 different people performing one of 10 actions (walk, sit down, stand up, pick up, carry, throw, push, pull, wave hands and clap hands) in an office setting. | None. | 200 video clips with depth information at 15 frames per second | video clips with depth information | Action classification | 2012 | [50] | Xia, L. et al. |
SBU Interact | Seven participants performing one of 8 actions together (approaching, departing, pushing, kicking, punching, exchanging objects, hugging, and shaking hands) in an office setting. | None. | Around 300 interactions | video clips with depth information | Action classification | 2012 | [51] | Yun, K. et al. |
Berkeley Multimodal Human Action Database (MHAD) | Recordings of a single person performing 12 actions | MoCap pre-processing | 660 action samples | 8 PhaseSpace Motion Capture, 2 Stereo Cameras, 4 Quad Cameras, 6 accelerometers, 4 microphones | Action classification | 2013 | [52] | Ofli, F. et al. |
UCF 101 Dataset | Self described as "a dataset of 101 human actions classes from videos in the wild." Dataset is large with over 27 hours of video. | Actions classified and labeled. | 13,000 | Video, images, text | Classification, action detection | 2012 | [53][54] | K. Soomro et al. |
THUMOS Dataset | Large video dataset for action classification. | Actions classified and labeled. | 45M frames of video | Video, images, text | Classification, action detection | 2013 | [55][56] | Y. Jiang et al. |
Activitynet | Large video dataset for activity recognition and detection. | Actions classified and labeled. | 10,024 | Video, images, text | Classification, action detection | 2015 | [57] | Heilbron et al. |
MSP-AVATAR | Improvised scenarios annotated for discourse functions: contrast, confirmation/negation, question, uncertainty, suggest, giving orders, warn, inform, size description, using pronouns. | Actions classified and labeled. | 74 sessions | Motion-captured video, audio | Classification, action detection | 2015 | [58] | Sadoughi, N. et al. |
LILiR Twotalk Corpus | Video datasets for non-verbal communication activity recognition: agreement, thinking, asking and understanding. | Actions classified and labeled. | 527 | Video | Action detection | 2011 | [59] | Sheerman-Chase et al. |
MEXAction2 | Video dataset for action localization and spotting | Actions classified and labeled. | 1000 | Video | Action detection | 2014 | [60] | Stoian et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Visual Genome | Images and their description | | 108,000 | images, text | Image captioning | 2016 | [61] | R. Krishna et al. |
DAVIS: Densely Annotated VIdeo Segmentation 2017 | 150 video sequences containing 10459 frames with a total of 376 objects annotated. | Dataset released for the 2017 DAVIS Challenge with a dedicated workshop co-located with CVPR 2017. The videos contain several types of objects and humans with a high quality segmentation annotation. In each video sequence multiple instances are annotated. | 10,459 | Frames annotated | Video object segmentation | 2017 | [62] | Pont-Tuset, J. et al. |
DAVIS: Densely Annotated VIdeo Segmentation 2016 | 50 video sequences containing 3455 frames with a total of 50 objects annotated. | Dataset released with the CVPR 2016 paper. The videos contain several types of objects and humans with a high quality segmentation annotation. In each video sequence a single instance is annotated. | 3,455 | Frames annotated | Video object segmentation | 2016 | [63] | Perazzi, F. et al. |
T-LESS: An RGB-D Dataset for 6D Pose Estimation of Texture-less Objects | 30 industry-relevant objects. 39K training and 10K test images from each of three sensors. Two types of 3D models for each object. | 6D poses for all modeled objects in all images. Per-pixel labelling can be obtained by rendering of the object models at the ground truth poses. | 49,000 | RGB-D images, 3D object models | 6D object pose estimation, object detection | 2017 | [64] | T. Hodan et al. |
Berkeley 3-D Object Dataset | 849 images taken in 75 different scenes. About 50 different object classes are labeled. | Object bounding boxes and labeling. | 849 | labeled images, text | Object recognition | 2014 | [65][66] | A. Janoch et al. |
Berkeley Segmentation Data Set and Benchmarks 500 (BSDS500) | 500 natural images, explicitly separated into disjoint train, validation and test subsets + benchmarking code. Based on BSDS300. | Each image segmented by five different subjects on average. | 500 | Segmented images | Contour detection and hierarchical image segmentation | 2011 | [67] | University of California, Berkeley |
Microsoft Common Objects in Context (COCO) | complex everyday scenes of common objects in their natural context. | Object highlighting, labeling, and classification into 91 object types. | 2,500,000 | Labeled images, text | Object recognition | 2015 | [68][69] | T. Lin et al. |
SUN Database | Very large scene and object recognition database. | Places and objects are labeled. Objects are segmented. | 131,067 | Images, text | Object recognition, scene recognition | 2014 | [70][71] | J. Xiao et al. |
ImageNet | Labeled object image database, used in the ImageNet Large Scale Visual Recognition Challenge | Labeled objects, bounding boxes, descriptive words, SIFT features | 14,197,122 | Images, text | Object recognition, scene recognition | 2009 (2014) | [72][73][74] | J. Deng et al. |
Open Images | A large set of images listed as having a CC BY 2.0 license, with image-level labels and bounding boxes spanning thousands of classes. | Image-level labels, Bounding boxes | 9,178,275 | Images, text | Classification, Object recognition | 2017 | [75] | |
TV News Channel Commercial Detection Dataset | TV commercials and news broadcasts. | Audio and video features extracted from still images. | 129,685 | Text | Clustering, classification | 2015 | [76][77] | P. Guha et al. |
Statlog (Image Segmentation) Dataset | The instances were drawn randomly from a database of 7 outdoor images and hand-segmented to create a classification for every pixel. | Many features calculated. | 2310 | Text | Classification | 1990 | [78] | University of Massachusetts |
Caltech 101 | Pictures of objects. | Detailed object outlines marked. | 9146 | Images | Classification, object recognition. | 2003 | [79][80] | F. Li et al. |
Caltech-256 | Large dataset of images for object classification. | Images categorized and hand-sorted. | 30,607 | Images, Text | Classification, object detection | 2007 | [81][82] | G. Griffin et al. |
SIFT10M Dataset | SIFT features of Caltech-256 dataset. | Extensive SIFT feature extraction. | 11,164,866 | Text | Classification, object detection | 2016 | [83] | X. Fu et al. |
LabelMe | Annotated pictures of scenes. | Objects outlined. | 187,240 | Images, text | Classification, object detection | 2005 | [84] | MIT Computer Science and Artificial Intelligence Laboratory |
Cityscapes Dataset | Stereo video sequences recorded in street scenes, with pixel-level annotations. Metadata also included. | Pixel-level segmentation and labeling | 25,000 | Images, text | Classification, object detection | 2016 | [85] | Daimler AG et al. |
PASCAL VOC Dataset | Large number of images for classification tasks. | Labeling, bounding box included | 500,000 | Images, text | Classification, object detection | 2010 | [86][87] | M. Everingham et al. |
CIFAR-10 Dataset | Many small, low-resolution, images of 10 classes of objects. | Classes labelled, training set splits created. | 60,000 | Images | Classification | 2009 | [73][88] | A. Krizhevsky et al. |
CIFAR-100 Dataset | Like CIFAR-10, above, but 100 classes of objects are given. | Classes labelled, training set splits created. | 60,000 | Images | Classification | 2009 | [73][88] | A. Krizhevsky et al. |
Fashion-MNIST | A MNIST-like fashion product database | Classes labelled, training set splits created. | 60,000 | Images | Classification | 2017 | [89] | Zalando SE |
notMNIST | Some publicly available fonts and extracted glyphs from them to make a dataset similar to MNIST. There are 10 classes, with letters A-J taken from different fonts. | Classes labelled, training set splits created. | 500,000 | Images | Classification | 2011 | [90] | Yaroslav Bulatov |
German Traffic Sign Detection Benchmark Dataset | Images from vehicles of traffic signs on German roads. These signs comply with UN standards and therefore are the same as in other countries. | Signs manually labeled | 900 | Images | Classification | 2013 | [91][92] | S Houben et al. |
KITTI Vision Benchmark Dataset | Autonomous vehicles driving through a mid-size city captured images of various areas using cameras and laser scanners. | Many benchmarks extracted from data. | >100 GB of data | Images, text | Classification, object detection | 2012 | [93][94] | A Geiger et al. |
Linnaeus 5 dataset | Images of 5 classes of objects. | Classes labelled, training set splits created. | 8000 | Images | Classification | 2017 | [95] | Chaladze & Kalatozishvili |
FieldSAFE | Multi-modal dataset for obstacle detection in agriculture including stereo camera, thermal camera, web camera, 360-degree camera, lidar, radar, and precise localization. | Classes labelled geographically. | >400 GB of data | Images and 3D point clouds | Classification, object detection, object localization | 2017 | [96] | M. Kragh et al. |
11K Hands | 11,076 hand images (1600 x 1200 pixels) of 190 subjects, of varying ages between 18 – 75 years old, for gender recognition and biometric identification. | None | 11,076 hand images | Images and (.mat, .txt, and .csv) label files | Gender recognition and biometric identification | 2017 | [97] | M Afifi |
CORe50 | Specifically designed for continuous/lifelong learning and object recognition; a collection of more than 500 videos (30 fps) of 50 domestic objects belonging to 10 different categories. | Classes labelled, training set splits created based on a 3-way, multi-runs benchmark. | 164,866 RGB-D images | images (.png or .pkl) and (.pkl, .txt, .tsv) label files | Classification, Object recognition | 2017 | [98] | V. Lomonaco and D. Maltoni |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Artificial Characters Dataset | Artificially generated data describing the structure of 10 capital English letters. | Coordinates of lines drawn given as integers. Various other features. | 6000 | Text | Handwriting recognition, classification | 1992 | [99] | H. Guvenir et al. |
Letter Dataset | Upper case printed letters. | 17 features are extracted from all images. | 20,000 | Text | OCR, classification | 1991 | [100][101] | D. Slate et al. |
Character Trajectories Dataset | Labeled samples of pen tip trajectories for people writing simple characters. | 3-dimensional pen tip velocity trajectory matrix for each sample | 2858 | Text | Handwriting recognition, classification | 2008 | [102][103] | B. Williams |
Chars74K Dataset | Character recognition in natural images of symbols used in both English and Kannada | | 74,107 | | Character recognition, handwriting recognition, OCR, classification | 2009 | [104] | T. de Campos |
UJI Pen Characters Dataset | Isolated handwritten characters | Coordinates of pen position as characters were written given. | 11,640 | Text | Handwriting recognition, classification | 2009 | [105][106] | F. Prat et al. |
Gisette Dataset | Handwriting samples from the often-confused 4 and 9 characters. | Features extracted from images, split into train/test, handwriting images size-normalized. | 13,500 | Images, text | Handwriting recognition, classification | 2003 | [107] | Yann LeCun et al. |
MNIST database | Database of handwritten digits. | Hand-labeled. | 60,000 | Images, text | Classification | 1998 | [108][109] | National Institute of Standards and Technology |
Optical Recognition of Handwritten Digits Dataset | Normalized bitmaps of handwritten data. | Size normalized and mapped to bitmaps. | 5620 | Images, text | Handwriting recognition, classification | 1998 | [110] | E. Alpaydin et al. |
Pen-Based Recognition of Handwritten Digits Dataset | Handwritten digits on electronic pen-tablet. | Feature vectors extracted to be uniformly spaced. | 10,992 | Images, text | Handwriting recognition, classification | 1998 | [111][112] | E. Alpaydin et al. |
Semeion Handwritten Digit Dataset | Handwritten digits from 80 people. | All handwritten digits have been normalized for size and mapped to the same grid. | 1593 | Images, text | Handwriting recognition, classification | 2008 | [113] | T. Srl |
HASYv2 | Handwritten mathematical symbols | All symbols are centered and of size 32px x 32px. | 168233 | Images, text | Classification | 2017 | [114] | Martin Thoma |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Aerial Image Segmentation Dataset | 80 high-resolution aerial images with spatial resolution ranging from 0.3 to 1.0. | Images manually segmented. | 80 | Images | Aerial Classification, object detection | 2013 | [115][116] | J. Yuan et al. |
KIT AIS Data Set | Multiple labeled training and evaluation datasets of aerial images of crowds. | Images manually labeled to show paths of individuals through crowds. | ~ 150 | Images with paths | People tracking, aerial tracking | 2012 | [117][118] | M. Butenuth et al. |
Wilt Dataset | Remote sensing data of diseased trees and other land cover. | Various features extracted. | 4899 | Images | Classification, aerial object detection | 2014 | [119][120] | B. Johnson |
Forest Type Mapping Dataset | Satellite imagery of forests in Japan. | Image wavelength bands extracted. | 326 | Text | Classification | 2015 | [121][122] | B. Johnson |
Overhead Imagery Research Data Set | Annotated overhead imagery. Images with multiple objects. | Over 30 annotations and over 60 statistics that describe the target within the context of the image. | 1000 | Images, text | Classification | 2009 | [123][124] | F. Tanner et al. |
SpaceNet | SpaceNet is a corpus of commercial satellite imagery and labeled training data. | GeoTiff and GeoJSON files containing building footprints. | >17533 | Images | Classification, Object Identification | 2017 | [125][126][127] | DigitalGlobe, Inc. |
UC Merced Land Use Dataset | These images were manually extracted from large images from the USGS National Map Urban Area Imagery collection for various urban areas around the US. | This is a 21 class land use image dataset meant for research purposes. There are 100 images for each class. | 2,100 | Image chips of 256x256, 30 cm (1 foot) GSD | Land cover classification | 2010 | [128] | Yi Yang and Shawn Newsam |
SAT-4 Airborne Dataset | Images were extracted from the National Agriculture Imagery Program (NAIP) dataset. | SAT-4 has four broad land cover classes, includes barren land, trees, grassland and a class that consists of all land cover classes other than the above three. | 500,000 | Images | Classification | 2015 | [129] | S. Basu et al. |
SAT-6 Airborne Dataset | Images were extracted from the National Agriculture Imagery Program (NAIP) dataset. | SAT-6 has six broad land cover classes, includes barren land, trees, grassland, roads, buildings and water bodies. | 405,000 | Images | Classification | 2015 | [129] | S. Basu et al. |
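
The UC Merced row above gives enough detail to work out both the instance count and the ground footprint of each image chip. The sketch below uses only figures from that row (21 classes, 100 images per class, 256x256 chips, 30 cm ground sample distance); the variable names are illustrative.

```python
# Figures taken from the UC Merced Land Use row of the table above.
num_classes = 21
images_per_class = 100
chip_size_px = 256      # each chip is 256x256 pixels
gsd_cm_per_px = 30      # ground sample distance: 30 cm per pixel

# Total number of image chips in the dataset.
total_images = num_classes * images_per_class

# Ground distance covered by one side of a chip, in metres.
chip_side_m = chip_size_px * gsd_cm_per_px / 100

print(total_images)          # 2100
print(f"{chip_side_m:.1f}")  # 76.8
```

Each chip therefore covers a square of roughly 76.8 m on a side.
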
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Quantum simulations of an electron in a two dimensional potential well | Labelled images of raw input to a simulation of 2D quantum mechanics | Raw data (in HDF5 format) and output labels from quantum simulation | 1.3 million images | Labeled images | Regression | 2017 | [130] | K. Mills et al. |
MPII Cooking Activities Dataset | Videos and images of various cooking activities. | Activity paths and directions, labels, fine-grained motion labeling, activity class, still image extraction and labeling. | 881,755 frames | Labeled video, images, text | Classification | 2012 | [131][132] | M. Rohrbach et al. |
FAMOS Dataset | 5,000 unique microstructures, all samples have been acquired 3 times with two different cameras. | Original PNG files, sorted per camera and then per acquisition. MATLAB datafiles with one 16384 times 5000 matrix per camera per acquisition. | 30,000 | Images and .mat files | Authentication | 2012 | [133] | S. Voloshynovskiy, et al. |
PharmaPack Dataset | 1,000 unique classes with 54 images per class. | Class labeling, many local descriptors, like SIFT and aKaZE, and local feature aggregators, like Fisher Vector (FV). | 54,000 | Images and .mat files | Fine-grain classification | 2017 | [134] | O. Taran and S. Rezaeifar, et al. |
Stanford Dogs Dataset | Images of 120 breeds of dogs from around the world. | Train/test splits and ImageNet annotations provided. | 20,580 | Images, text | Fine-grain classification | 2011 | [135][136] | A. Khosla et al. |
The Oxford-IIIT Pet Dataset | 37 categories of pets with roughly 200 images of each. | Breed labeled, tight bounding box, foreground-background segmentation. | ~ 7,400 | Images, text | Classification, object detection | 2012 | [136][137] | O. Parkhi et al. |
Corel Image Features Data Set | Database of images with features extracted. | Many features including color histogram, co-occurrence texture, and color moments. | 68,040 | Text | Classification, object detection | 1999 | [138][139] | M. Ortega-Bindenberger et al. |
Online Video Characteristics and Transcoding Time Dataset | Transcoding times for various videos and video properties. | Video features given. | 168,286 | Text | Regression | 2015 | [140] | T. Deneke et al. |
Microsoft Sequential Image Narrative Dataset (SIND) | Dataset for sequential vision-to-language | Descriptive caption and storytelling given for each photo, and photos are arranged in sequences | 81,743 | Images, text | Visual storytelling | 2016 | [141] | Microsoft Research |
Caltech-UCSD Birds-200-2011 Dataset | Large dataset of images of birds. | Part locations for birds, bounding boxes, 312 binary attributes given | 11,788 | Images, text | Classification | 2011 | [142][143] | C. Wah et al. |
YouTube-8M | Large and diverse labeled video dataset | YouTube video IDs and associated labels from a diverse vocabulary of 4800 visual entities | 8 million | Video, text | Video classification | 2016 | [144][145] | S. Abu-El-Haija et al. |
YFCC100M | Large and diverse labeled image and video dataset | Flickr Videos and Images and associated description, titles, tags, and other metadata (such as EXIF and geotags) | 100 million | Video, Image, Text | Video and Image classification | 2016 | [146][147] | B. Thomee et al. |
Discrete LIRIS-ACCEDE | Short videos annotated for valence and arousal. | Valence and arousal labels. | 9800 | Video | Video emotion elicitation detection | 2015 | [148] | Y. Baveye et al. |
Continuous LIRIS-ACCEDE | Long videos annotated for valence and arousal while also collecting Galvanic Skin Response. | Valence and arousal labels. | 30 | Video | Video emotion elicitation detection | 2015 | [149] | Y. Baveye et al. |
MediaEval LIRIS-ACCEDE | Extension of Discrete LIRIS-ACCEDE including annotations for violence levels of the films. | Violence, valence and arousal labels. | 10900 | Video | Video emotion elicitation detection | 2015 | [150] | Y. Baveye et al. |
Leeds Sports Pose | Articulated human pose annotations in 2000 natural sports images from Flickr. | Rough crop around single person of interest with 14 joint labels | 2000 | Images plus .mat file labels | Human pose estimation | 2010 | [151] | S. Johnson and M. Everingham |
Leeds Sports Pose Extended Training | Articulated human pose annotations in 10,000 natural sports images from Flickr. | 14 joint labels via crowdsourcing | 10000 | Images plus .mat file labels | Human pose estimation | 2011 | [152] | S. Johnson and M. Everingham |
MCQ Dataset | 6 different real multiple choice-based exams (735 answer sheets and 33,540 answer boxes) to evaluate computer vision techniques and systems developed for multiple choice test assessment systems. | None | 735 answer sheets and 33,540 answer boxes | Images and .mat file labels | Development of multiple choice test assessment systems | 2017 | [153][154] | Afifi, M. et al. |
Surveillance Videos | Real surveillance videos covering a long surveillance period (7 days with 24 hours each). | None | 19 surveillance videos (7 days with 24 hours each). | Videos | Data compression | 2016 | [155] | Taj-Eddin, I. A. T. F. et al. |
Can We See Photosynthesis? | 32 videos for eight live and eight dead leaves recorded under both DC and AC lighting conditions. | None | 32 videos | Videos | Liveness detection of plants | 2017 | [156] | Taj-Eddin, I. A. T. F. et al. |
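
The instance counts for FAMOS and PharmaPack in the table above can be reproduced from the figures in their own rows. The following is a minimal sketch with illustrative variable names. Reading the 16384-row FAMOS matrices as flattened square images is an assumption; the table does not state it.

```python
import math

# FAMOS: 5,000 microstructures, each acquired 3 times with 2 cameras.
famos_microstructures = 5_000
famos_acquisitions = 3
famos_cameras = 2
famos_total = famos_microstructures * famos_acquisitions * famos_cameras
print(famos_total)  # 30000

# Assumption (not stated in the table): each 16384-element column of the
# MATLAB matrices is a flattened square image. 16384 is a perfect square.
side = math.isqrt(16_384)
print(side, side * side == 16_384)  # 128 True

# PharmaPack: 1,000 classes with 54 images per class.
pharmapack_total = 1_000 * 54
print(pharmapack_total)  # 54000
```
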
Datasets consisting primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis.
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Stanford Sentiment Treebank | Positive and negative movie reviews | | 11k | Text | Sentiment analysis | 2005 | [1] | Bo Pang & Lillian Lee |
IMDB Reviews | Positive and negative movie reviews | | 50k | Text | Sentiment analysis | 2011 | [2] | Maas et al. |
Movie Review Data | Positive and negative movie reviews | | 2k | Text | Sentiment analysis | 2005 | [3] | Bo Pang & Lillian Lee |
Amazon reviews | US product reviews from Amazon.com. | None. | ~ 82M | Text | Classification, sentiment analysis | 2015 | [157] | McAuley et al. |
OpinRank Review Dataset | Reviews of cars and hotels from Edmunds.com and TripAdvisor respectively. | None. | 42,230 / ~259,000 respectively | Text | Sentiment analysis, clustering | 2011 | [158][159] | K. Ganesan et al. |
MovieLens | 22,000,000 ratings and 580,000 tags applied to 33,000 movies by 240,000 users. | None. | ~ 22M | Text | Regression, clustering, classification | 2016 | [160] | GroupLens Research |
Yahoo! Music User Ratings of Musical Artists | Over 10M ratings of artists by Yahoo users. | None described. | ~ 10M | Text | Clustering, regression | 2004 | [161][162] | Yahoo! |
Car Evaluation Data Set | Car properties and their overall acceptability. | Six categorical features given. | 1728 | Text | Classification | 1997 | [163][164] | M. Bohanec |
YouTube Comedy Slam Preference Dataset | User vote data for pairs of videos shown on YouTube. Users voted on funnier videos. | Video metadata given. | 1,138,562 | Text | Classification | 2012 | [165][166] | |
Skytrax User Reviews Dataset | User reviews of airlines, airports, seats, and lounges from Skytrax. | Ratings are fine-grain and include many aspects of airport experience. | 41396 | Text | Classification, regression | 2015 | [167] | Q. Nguyen |
Teaching Assistant Evaluation Dataset | Teaching assistant reviews. | Features of each instance such as class, class size, and instructor are given. | 151 | Text | Classification | 1997 | [168][169] | W. Loh et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
NYSK Dataset | English news articles about the case relating to allegations of sexual assault against the former IMF director Dominique Strauss-Kahn. | Filtered and presented in XML format. | 10,421 | XML, text | Sentiment analysis, topic extraction | 2013 | [170] | Dermouche, M. et al. |
The Reuters Corpus Volume 1 | Large corpus of Reuters news stories in English. | Fine-grain categorization and topic codes. | 810,000 | Text | Classification, clustering, summarization | 2002 | [171] | Reuters |
The Reuters Corpus Volume 2 | Large corpus of Reuters news stories in multiple languages. | Fine-grain categorization and topic codes. | 487,000 | Text | Classification, clustering, summarization | 2005 | [172] | Reuters |
Thomson Reuters Text Research Collection | Large corpus of news stories. | Details not described. | 1,800,370 | Text | Classification, clustering, summarization | 2009 | [173] | T. Rose et al. |
Saudi Newspapers Corpus | 31,030 Arabic newspaper articles. | Metadata extracted. | 31,030 | JSON | Summarization, clustering | 2015 | [174] | M. Alhagri |
RE3D (Relationship and Entity Extraction Evaluation Dataset) | Entity and Relation marked data from various news and government sources. Sponsored by Dstl | Filtered, categorisation using Baleen types | not known | JSON | Classification, Entity and Relation recognition | 2017 | [175] | Dstl |
ABC Australia News Corpus | Entire news corpus of ABC Australia from 2003 to 2017 | Publish date and headlines | 1,082,477 | CSV | Clustering, Events, Sentiment | 2017 | [176] | R. Kulkarni |
Examiner Pseudo-News Corpus | Clickbait, spam, crowd-sourced headlines from 2010 to 2015 | Publish date and headlines | 3,089,781 | CSV | Clustering, Events, Sentiment | 2017 | [177] | R. Kulkarni |
Worldwide News - Aggregate of 20K Feeds | One week snapshot of all online headlines in 20+ languages | Publish time, URL and headlines | 1,398,431 | CSV | Clustering, Events, Language Detection | 2017 | [178] | R. Kulkarni |
Reuters News Wire Headline | 11+ Years of timestamped events published on the news-wire | Publish time, Headline Text | 16,121,000 | CSV | NLP, Computational Linguistics, Events | 2018 | [179] | R. Kulkarni |
The Irish Times IRS | 12 Years of Events From Ireland | Publish time, Headline Text | 1,422,000 | CSV | NLP, Computational Linguistics, Events | 2018 | [180] | R. Kulkarni |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Enron Email Dataset | Emails from employees at Enron organized into folders. | Attachments removed, invalid email addresses converted to user@enron.com or no_address@enron.com. | ~ 500,000 | Text | Network analysis, sentiment analysis | 2004 (2015) | [181][182] | Klimt, B. and Y. Yang |
Ling-Spam Dataset | Corpus containing both legitimate and spam emails. | Four versions of the corpus, depending on whether a lemmatiser or stop-list was enabled. | | Text | Classification | 2000 | [183][184] | Androutsopoulos, J. et al. |
SMS Spam Collection Dataset | Collected SMS spam messages. | None. | 5574 | Text | Classification | 2011 | [185][186] | T. Almeida et al. |
Twenty Newsgroups Dataset | Messages from 20 different newsgroups. | None. | 20,000 | Text | Natural language processing | 1999 | [187] | T. Mitchell et al. |
Spambase Dataset | Spam emails. | Many text features extracted. | 4601 | Text | Spam detection, classification | 1999 | [188] | M. Hopkins et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
MovieTweetings | Movie rating dataset based on public and well-structured tweets | | ~710,000 | Text | Classification, regression | 2018 | [189] | S. Dooms |
Twitter100k | Pairs of images and tweets | | 100,000 | Text and Images | Cross-media retrieval | 2017 | [190][191] | Y. Hu, et al. |
Sentiment140 | Tweet data from 2009 including original text, time stamp, user and sentiment. | Classified using distant supervision from presence of emoticon in tweet. | 1,578,627 | Tweets, comma-separated values | Sentiment analysis | 2009 | [192][193] | A. Go et al. |
ASU Twitter Dataset | Twitter network data, not actual tweets. Shows connections between a large number of users. | None. | 11,316,811 users, 85,331,846 connections | Text | Clustering, graph analysis | 2009 | [194][195] | R. Zafarani et al. |
SNAP Social Circles: Twitter Database | Large Twitter network data. | Node features, circles, and ego networks. | 1,768,149 | Text | Clustering, graph analysis | 2012 | [196][197] | J. McAuley et al. |
Twitter Dataset for Arabic Sentiment Analysis | Arabic tweets. | Samples hand-labeled as positive or negative. | 2000 | Text | Classification | 2014 | [198][199] | N. Abdulla |
Buzz in Social Media Dataset | Data from Twitter and Tom's Hardware. This dataset focuses on specific buzz topics being discussed on those sites. | Data is windowed so that the user can attempt to predict the events leading up to social media buzz. | 140,000 | Text | Regression, Classification | 2013 | [200][201] | F. Kawala et al. |
Paraphrase and Semantic Similarity in Twitter (PIT) | This dataset focuses on whether tweets have (almost) the same meaning/information or not. Manually labeled. | Tokenization, part-of-speech and named entity tagging | 18,762 | Text | Regression, Classification | 2015 | [202][203] | Xu et al. |
Geoparse Twitter benchmark dataset | This dataset contains tweets during different news events in different countries. Manually labeled location mentions. | location annotations added to JSON metadata | 6,386 | Tweets, JSON | Classification, Information Extraction | 2014 | [204][205] | S.E. Middleton et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
NPS Chat Corpus | Posts from age-specific online chat rooms. | Hand privacy masked, tagged for part of speech and dialogue-act. | ~ 500,000 | XML | NLP, programming, linguistics | 2007 | [206] | Forsyth, E., Lin, J., & Martell, C. |
Twitter Triple Corpus | A-B-A triples extracted from Twitter. | | 4,232 | Text | NLP | 2016 | [207] | Sordoni, A. et al. |
UseNet Corpus | UseNet forum postings. | Anonymized e-mails and URLs. Omitted documents with lengths <500 words or >500,000 words, or that were <90% English. | 7 billion | Text | | 2011 | [208] | Shaoul, C., & Westbury C. |
NUS SMS Corpus | SMS messages collected between two users, with timing analysis. | | ~ 10,000 | XML | NLP | 2011 | [209] | Kan, M. |
Reddit All Comments Corpus | All Reddit comments (as of 2015). | | ~ 1.7 billion | JSON | NLP, research | 2015 | [210] | Stuck_In_the_Matrix |
Ubuntu Dialogue Corpus | Dialogues extracted from Ubuntu chat stream on IRC. | | | CSV | Dialogue Systems Research | 2015 | [211] | Lowe, R. et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Web of Science Dataset | Hierarchical Datasets for Text Classification | None. | 46,985 | Text | Classification, Categorization | 2017 | [212][213] | K. Kowsari et al. |
Legal Case Reports | Federal Court of Australia cases from 2006 to 2009. | None. | 4,000 | Text | Summarization, citation analysis | 2012 | [214][215] | F. Galgani et al. |
Blogger Authorship Corpus | Blog entries of 19,320 people from blogger.com. | Blogger self-provided gender, age, industry, and astrological sign. | 681,288 | Text | Sentiment analysis, summarization, classification | 2006 | [216][217] | J. Schler et al. |
Social Structure of Facebook Networks | Large dataset of the social structure of Facebook. | None. | 100 colleges covered | Text | Network analysis, clustering | 2012 | [218][219] | A. Traud et al. |
Dataset for the Machine Comprehension of Text | Stories and associated questions for testing comprehension of text. | None. | 660 | Text | Natural language processing, machine comprehension | 2013 | [220][221] | M. Richardson et al. |
The Penn Treebank Project | Naturally occurring text annotated for linguistic structure. | Text is parsed into semantic trees. | ~ 1M words | Text | Natural language processing, summarization | 1995 | [222][223] | M. Marcus et al. |
DEXTER Dataset | Task given is to determine, from features given, which articles are about corporate acquisitions. | Features extracted include word stems. Distractor features included. | 2600 | Text | Classification | 2008 | [224] | Reuters |
Google Books N-grams | N-grams from a very large corpus of books | None. | 2.2 TB of text | Text | Classification, clustering, regression | 2011 | [225][226] | |
Personae Corpus | Collected for experiments in Authorship Attribution and Personality Prediction. Consists of 145 Dutch-language essays. | In addition to normal texts, syntactically annotated texts are given. | 145 | Text | Classification, regression | 2008 | [227][228] | K. Luyckx et al. |
CNAE-9 Dataset | Categorization task for free text descriptions of Brazilian companies. | Word frequency has been extracted. | 1080 | Text | Classification | 2012 | [229][230] | P. Ciarelli et al. |
Sentiment Labeled Sentences Dataset | 3000 sentiment labeled sentences. | Sentiment of each sentence has been hand labeled as positive or negative. | 3000 | Text | Classification, sentiment analysis | 2015 | [231][232] | D. Kotzias |
BlogFeedback Dataset | Dataset to predict the number of comments a post will receive based on features of that post. | Many features of each post extracted. | 60,021 | Text | Regression | 2014 | [233][234] | K. Buza |
Stanford Natural Language Inference (SNLI) Corpus | Image captions matched with newly constructed sentences to form entailment, contradiction, or neutral pairs. | Entailment class labels, syntactic parsing by the Stanford PCFG parser | 570,000 | Text | Natural language inference/recognizing textual entailment | 2015 | [235] | S. Bowman et al. |
DSL Corpus Collection (DSLCC) | A multilingual collection of short excerpts of journalistic texts in similar languages and dialects. | None | 294,000 phrases | Text | Discriminating between similar languages | 2017 | [236] | Tan, Liling et al. |
Urban Dictionary Dataset | Corpus of words, votes and definitions | User names anonymised | 2,606,522 | CSV | NLP, Machine comprehension | 2016-05 | [237] | Anonymous |
Datasets of sounds and sound features.
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Zero Resource Speech Challenge 2015 | Spontaneous speech (English), read speech (Xitsonga). | Raw WAV | English: 5h, 12 speakers; Xitsonga: 2h30, 24 speakers | Sound | Unsupervised discovery of speech features/subword units/word units | 2015 | [238][239] www.zerospeech.com/2015 | Versteegh et al. |
Parkinson Speech Dataset | Multiple recordings of people with and without Parkinson's Disease. | Voice features extracted, disease scored by physician using unified Parkinson's disease rating scale | 1,040 | Text | Classification, regression | 2013 | [240][241] | B. E. Sakar et al. |
Spoken Arabic Digits | Spoken Arabic digits from 44 male and 44 female speakers. | Time-series of mel-frequency cepstrum coefficients. | 8,800 | Text | Classification | 2010 | [242][243] | M. Bedda et al. |
ISOLET Dataset | Spoken letter names. | Features extracted from sounds. | 7797 | Text | Classification | 1994 | [244][245] | R. Cole et al. |
Japanese Vowels Dataset | Nine male speakers uttered two Japanese vowels successively. | 12-degree linear prediction analysis applied to each utterance to obtain a discrete-time series with 12 cepstrum coefficients. | 640 | Text | Classification | 1999 | [246][247] | M. Kudo et al. |
Parkinson's Telemonitoring Dataset | Multiple recordings of people with and without Parkinson's Disease. | Sound features extracted. | 5875 | Text | Classification | 2009 | [248][249] | A. Tsanas et al. |
TIMIT | Recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. | Speech is lexically and phonemically transcribed. | 6300 | Text | Speech recognition, classification. | 1986 | [250][251] | J. Garofolo et al. |
Arabic Speech Corpus | A single-speaker, Modern Standard Arabic (MSA) speech corpus with phonetic and orthographic transcripts aligned to phoneme level | Speech is orthographically and phonetically transcribed with stress marks. | ~1900 | Text, WAV | Speech Synthesis, Speech Recognition, Corpus Alignment, Speech Therapy, Education. | 2016 | [252] | N. Halabi |
Persian Consonant Vowel Combination (PCVC) Speech Dataset | Persian phoneme speech dataset. | Raw MAT files: speech samples with exact phoneme-level labels | Persian: 1794 (To be continued...) | Speech sound samples | Speech phoneme and speaker classification | 2018 | [253] PCVC on GitHub | S. Malekzadeh et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Geographical Original of Music Data Set | Audio features of music samples from different locations. | Audio features extracted using MARSYAS software. | 1,059 | Text | Geographical classification, clustering | 2014 | [254][255] | F. Zhou et al. |
Million Song Dataset | Audio features from one million different songs. | Audio features extracted. | 1M | Text | Classification, clustering | 2011 | [256][257] | T. Bertin-Mahieux et al. |
Free Music Archive | Audio under Creative Commons from 100k songs (343 days, 1TiB) with a hierarchy of 161 genres, metadata, user data, free-form text. | Raw audio and audio features. | 106,574 | Text, MP3 | Classification, recommendation | 2017 | [258] | M. Defferrard et al. |
Bach Choral Harmony Dataset | Bach chorale chords. | Audio features extracted. | 5665 | Text | Classification | 2014 | [259][260] | D. Radicioni et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
UrbanSound | Labeled sound recordings of sounds like air conditioners, car horns and children playing. | Sorted into folders by class of events as well as metadata in a JSON file and annotations in a CSV file. | 1,059 | Sound (WAV) | Classification | 2014 | [261][262] | J. Salamon et al. |
AudioSet | 10-second sound snippets from YouTube videos, and an ontology of over 500 labels. | 128-d PCA'd VGG-ish features every 1 second. | 2,084,320 | Text (CSV) and TensorFlow Record files | Classification | 2017 | [263] | J. Gemmeke et al., Google |
Bird Audio Detection challenge | Audio from environmental monitoring stations, plus crowdsourced recordings | | 17,000+ | | Classification | 2016 (2018) | [264][265] | Queen Mary University and IEEE Signal Processing Society |
Datasets containing electric signal information that requires signal processing before further analysis.
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Witty Worm Dataset | Dataset detailing the spread of the Witty worm and the infected computers. | Split into a publicly available set and a restricted set containing more sensitive information like IP and UDP headers. | 55,909 IP addresses | Text | Classification | 2004 | [266][267] | Center for Applied Internet Data Analysis |
Cuff-Less Blood Pressure Estimation Dataset | Cleaned vital signals from human patients which can be used to estimate blood pressure. | 125 Hz vital signs have been cleaned. | 12,000 | Text | Classification, regression | 2015 | [268][269] | M. Kachuee et al. |
Gas Sensor Array Drift Dataset | Measurements from 16 chemical sensors utilized in simulations for drift compensation. | Extensive number of features given. | 13,910 | Text | Classification | 2012 | [270][271] | A. Vergara |
Servo Dataset | Data covering the nonlinear relationships observed in a servo-amplifier circuit. | Levels of various components as a function of other components are given. | 167 | Text | Regression | 1993 | [272][273] | K. Ullrich |
UJIIndoorLoc-Mag Dataset | Indoor localization database to test indoor positioning systems. Data is magnetic field based. | Train and test splits given. | 40,000 | Text | Classification, regression, clustering | 2015 | [274][275] | D. Rambla et al. |
Sensorless Drive Diagnosis Dataset | Electrical signals from motors with defective components. | Statistical features extracted. | 58,508 | Text | Classification | 2015 | [276][277] | M. Bator |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Wearable Computing: Classification of Body Postures and Movements (PUC-Rio) | People performing five standard actions while wearing motion trackers. | None. | 165,632 | Text | Classification | 2013 | [278][279] | Pontifical Catholic University of Rio de Janeiro |
Gesture Phase Segmentation Dataset | Features extracted from video of people doing various gestures. | Features extracted aim at studying gesture phase segmentation. | 9900 | Text | Classification, clustering | 2014 | [280][281] | R. Madeo et al. |
Vicon Physical Action Data Set | 10 normal and 10 aggressive physical actions that measure the human activity tracked by a 3D tracker. | Many parameters recorded by 3D tracker. | 3000 | Text | Classification | 2011 | [282][283] | T. Theodoridis |
Daily and Sports Activities Dataset | Motor sensor data for 19 daily and sports activities. | Many sensors given, no preprocessing done on signals. | 9120 | Text | Classification | 2013 | | B. Barshan et al. |
Human Activity Recognition Using Smartphones Dataset | Gyroscope and accelerometer data from people wearing smartphones and performing normal actions. | Actions performed are labeled, all signals preprocessed for noise. | 10,299 | Text | Classification | 2012 | [284][285] | J. Reyes-Ortiz et al. |
Australian Sign Language Signs | Australian sign language signs captured by motion-tracking gloves. | None. | 2565 | Text | Classification | 2002 | [286][287] | M. Kadous |
Weight Lifting Exercises monitored with Inertial Measurement Units | Five variations of the biceps curl exercise monitored with IMUs. | Some statistics calculated from raw data. | 39,242 | Text | Classification | 2013 | [288][289] | W. Ugulino et al. |
sEMG for Basic Hand movements Dataset | Two databases of surface electromyographic signals of 6 hand movements. | None. | 3000 | Text | Classification | 2014 | [290][291] | C. Sapsanis et al. |
REALDISP Activity Recognition Dataset | Evaluate techniques dealing with the effects of sensor displacement in wearable activity recognition. | None. | 1419 | Text | Classification | 2014 | [291][292] | O. Banos et al. |
Heterogeneity Activity Recognition Dataset | Data from multiple different smart devices for humans performing various activities. | None. | 43,930,257 | Text | Classification, clustering | 2015 | [293][294] | A. Stisen et al. |
Indoor User Movement Prediction from RSS Data | Temporal wireless network data that can be used to track the movement of people in an office. | None. | 13,197 | Text | Classification | 2016 | [295][296] | D. Bacciu |
PAMAP2 Physical Activity Monitoring Dataset | 18 different types of physical activities performed by 9 subjects wearing 3 IMUs. | None. | 3,850,505 | Text | Classification | 2012 | [297] | A. Reiss |
OPPORTUNITY Activity Recognition Dataset | Readings from wearable, object, and ambient sensors, devised to benchmark human activity recognition algorithms. | None. | 2551 | Text | Classification | 2012 | [298][299] | D. Roggen et al. |
Real World Activity Recognition Dataset | Human Activity Recognition from wearable devices. Distinguishes between seven on-body device positions and comprises six different kinds of sensors. | None. | 3,150,000 (per sensor) | Text | Classification | 2016 | [300] | T. Sztyler et al. |
Toronto Rehab Stroke Pose Dataset | 3D human pose estimates (Kinect) of stroke patients and healthy participants performing a set of tasks using a stroke rehabilitation robot. | None. | 10 healthy participants and 9 stroke survivors (3500-6000 frames per person) | CSV | Classification | 2017 | [301][302][303] | E. Dolatabadi et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Wine Dataset | Chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. | 13 properties of each wine are given | 178 | Text | Classification, regression | 1991 | [304][305] | M. Forina et al. |
Combined Cycle Power Plant Data Set | Data from various sensors within a power plant running for 6 years. | None | 9568 | Text | Regression | 2014 | [306][307] | P. Tufekci et al. |
Datasets from physical systems.
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
HIGGS Dataset | Monte Carlo simulations of particle accelerator collisions. | 28 features of each collision are given. | 11M | Text | Classification | 2014 | [308][309][310] | D. Whiteson |
HEPMASS Dataset | Monte Carlo simulations of particle accelerator collisions. Goal is to separate the signal from noise. | 28 features of each collision are given. | 10,500,000 | Text | Classification | 2016 | [309][310][311] | D. Whiteson |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Yacht Hydrodynamics Dataset | Yacht performance based on dimensions. | Six features are given for each yacht. | 308 | Text | Regression | 2013 | [312][313] | R. Lopez |
Robot Execution Failures Dataset | 5 data sets that center around robotic failure to execute common tasks. | Integer valued features such as torque and other sensor measurements. | 463 | Text | Classification | 1999 | [314] | L. Seabra et al. |
Pittsburgh Bridges Dataset | Design description is given in terms of several properties of various bridges. | Various bridge features are given. | 108 | Text | Classification | 1990 | [315][316] | Y. Reich et al. |
Automobile Dataset | Data about automobiles, their insurance risk, and their normalized losses. | Car features extracted. | 205 | Text | Regression | 1987 | [317][318] | J. Schimmer et al. |
Auto MPG Dataset | MPG data for cars. | Eight features of each car given. | 398 | Text | Regression | 1993 | [319] | Carnegie Mellon University |
Energy Efficiency Dataset | Heating and cooling requirements given as a function of building parameters. | Building parameters given. | 768 | Text | Classification, regression | 2012 | [320][321] | A. Xifara et al. |
Airfoil Self-Noise Dataset | A series of aerodynamic and acoustic tests of two and three-dimensional airfoil blade sections. | Data about frequency, angle of attack, etc., are given. | 1503 | Text | Regression | 2014 | [322] | R. Lopez |
Challenger USA Space Shuttle O-Ring Dataset | Attempt to predict O-ring problems given past Challenger data. | Several features of each flight, such as launch temperature, are given. | 23 | Text | Regression | 1993 | [323][324] | D. Draper et al. |
Statlog (Shuttle) Dataset | NASA space shuttle datasets. | Nine features given. | 58,000 | Text | Classification | 2002 | [325] | NASA |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Volcanoes on Venus – JARtool experiment Dataset | Venus images returned by the Magellan spacecraft. | Images are labeled by humans. | not given | Images | Classification | 1991 | [326][327] | M. Burl |
MAGIC Gamma Telescope Dataset | Monte Carlo generated high-energy gamma particle events. | Numerous features extracted from the simulations. | 19,020 | Text | Classification | 2007 | [327][328] | R. Bock |
Solar Flare Dataset | Measurements of the number of certain types of solar flare events occurring in a 24-hour period. | Many solar flare-specific features are given. | 1389 | Text | Regression, classification | 1989 | [329] | G. Bradshaw |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Volcanoes of the World | Volcanic eruption data for all known volcanic events on earth. | Details such as region, subregion, tectonic setting, dominant rock type are given. | 1535 | Text | Regression, classification | 2013 | [330] | E. Venzke et al. |
Seismic-bumps Dataset | Seismic activities from a coal mine. | Seismic activity was classified as hazardous or not. | 2584 | Text | Classification | 2013 | [331][332] | M. Sikora et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Concrete Compressive Strength Dataset | Dataset of concrete properties and compressive strength. | Nine features are given for each sample. | 1030 | Text | Regression | 2007 | [333][334] | I. Yeh |
Concrete Slump Test Dataset | Concrete slump flow given in terms of properties. | Features of concrete given such as fly ash, water, etc. | 103 | Text | Regression | 2009 | [335][336] | I. Yeh |
Musk Dataset | Predict if a molecule, given the features, will be a musk or a non-musk. | 168 features given for each molecule. | 6598 | Text | Classification | 1994 | [337] | Arris Pharmaceutical Corp. |
Steel Plates Faults Dataset | Steel plates of 7 different types. | 27 features given for each sample. | 1941 | Text | Classification | 2010 | [338] | Semeion Research Center |
Datasets from biological systems.
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
EEG Database | Study to examine EEG correlates of genetic predisposition to alcoholism. | Measurements from 64 electrodes placed on the scalp sampled at 256 Hz (3.9 ms epoch) for 1 second. | 122 | Text | Classification | 1999 | [339][340] | H. Begleiter |
P300 Interface Dataset | Data from nine subjects collected using P300-based brain-computer interface for disabled subjects. | Split into four sessions for each subject. MATLAB code given. | 1,224 | Text | Classification | 2008 | [341][342] | U. Hoffman et al. |
Heart Disease Data Set | Attributes of patients with and without heart disease. | 75 attributes given for each patient with some missing values. | 303 | Text | Classification | 1988 | [343][344] | A. Janosi et al. |
Breast Cancer Wisconsin (Diagnostic) Dataset | Dataset of features of breast masses. Diagnosis by physician is given. | 10 features for each sample are given. | 569 | Text | Classification | 1995 | [345][346] | W. Wolberg et al. |
National Survey on Drug Use and Health | Large scale survey on health and drug use in the United States. | None. | 55,268 | Text | Classification, regression | 2012 | [347] | United States Department of Health and Human Services |
Lung Cancer Dataset | Lung cancer dataset without attribute definitions | 56 features are given for each case | 32 | Text | Classification | 1992 | [348][349] | Z. Hong et al. |
Arrhythmia Dataset | Data for a group of patients, of which some have cardiac arrhythmia. | 276 features for each instance. | 452 | Text | Classification | 1998 | [350][351] | H. Altay et al. |
Diabetes 130-US hospitals for years 1999–2008 Dataset | 9 years of readmission data across 130 US hospitals for patients with diabetes. | Many features of each readmission are given. | 100,000 | Text | Classification, clustering | 2014 | [352][353] | J. Clore et al. |
Diabetic Retinopathy Debrecen Dataset | Features extracted from images of eyes with and without diabetic retinopathy. | Features extracted and conditions diagnosed. | 1151 | Text | Classification | 2014 | [354][355] | B. Antal et al. |
Diabetic Retinopathy Messidor Dataset | Methods to evaluate segmentation and indexing techniques in the field of retinal ophthalmology (MESSIDOR) | Features retinopathy grade and risk of macular edema | 1200 | Images, text | Classification, Segmentation | 2008 | [356][357] | Messidor Project |
Liver Disorders Dataset | Data for people with liver disorders. | Seven biological features given for each patient. | 345 | Text | Classification | 1990 | [358][359] | Bupa Medical Research Ltd. |
Thyroid Disease Dataset | 10 databases of thyroid disease patient data. | None. | 7200 | Text | Classification | 1987 | [360][361] | R. Quinlan |
Mesothelioma Dataset | Mesothelioma patient data. | Large number of features, including asbestos exposure, are given. | 324 | Text | Classification | 2016 | [362][363] | A. Tanrikulu et al. |
Parkinson's Vision-Based Pose Estimation Dataset | 2D human pose estimates of Parkinson's patients performing a variety of tasks. | Camera shake has been removed from trajectories. | 134 | Text | Classification, regression | 2017 | [364][365][366] | M. Li et al. |
KEGG Metabolic Reaction Network (Undirected) Dataset | Network of metabolic pathways. A reaction network and a relation network are given. | Detailed features for each network node and pathway are given. | 65,554 | Text | Classification, clustering, regression | 2011 | [367] | M. Naeem et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Abalone Dataset | Physical measurements of Abalone. Weather patterns and location are also given. | None. | 4177 | Text | Regression | 1995 | [368] | Marine Research Laboratories – Taroona |
Zoo Dataset | Artificial dataset covering 7 classes of animals. | Animals are classed into 7 categories and features are given for each. | 101 | Text | Classification | 1990 | [369] | R. Forsyth |
Demospongiae Dataset | Data about marine sponges. | 503 sponges in the Demosponge class are described by various features. | 503 | Text | Classification | 2010 | [370] | E. Armengol et al. |
Splice-junction Gene Sequences Dataset | Primate splice-junction gene sequences (DNA) with associated imperfect domain theory. | None. | 3190 | Text | Classification | 1992 | [349] | G. Towell et al. |
Mice Protein Expression Dataset | Expression levels of 77 proteins measured in the cerebral cortex of mice. | None. | 1080 | Text | Classification, Clustering | 2015 | [371][372] | C. Higuera et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Forest Fires Dataset | Forest fires and their properties. | 13 features of each fire are extracted. | 517 | Text | Regression | 2008 | [373][374] | P. Cortez et al. |
Iris Dataset | Three types of iris plants are described by 4 different attributes. | None. | 150 | Text | Classification | 1936 | [375][376] | R. Fisher |
Plant Species Leaves Dataset | Sixteen samples of leaf each of one-hundred plant species. | Shape descriptor, fine-scale margin, and texture histograms are given. | 1600 | Text | Classification | 2012 | [377][378] | J. Cope et al. |
Mushroom Dataset | Mushroom attributes and classification. | Many properties of each mushroom are given. | 8124 | Text | Classification | 1987 | [379] | J. Schlimmer |
Soybean Dataset | Database of diseased soybean plants. | 35 features for each plant are given. Plants are classified into 19 categories. | 307 | Text | Classification | 1988 | [380] | R. Michalski et al. |
Seeds Dataset | Measurements of geometrical properties of kernels belonging to three different varieties of wheat. | None. | 210 | Text | Classification, clustering | 2012 | [381][382] | Charytanowicz et al. |
Covertype Dataset | Data for predicting forest cover type strictly from cartographic variables. | Many geographical features given. | 581,012 | Text | Classification | 1998 | [383][384] | J. Blackard et al. |
Abscisic Acid Signaling Network Dataset | Data for a plant signaling network. Goal is to determine set of rules that governs the network. | None. | 300 | Text | Causal-discovery | 2008 | [385] | J. Jenkens et al. |
Folio Dataset | 20 photos of leaves for each of 32 species. | None. | 637 | Images, text | Classification, clustering | 2015 | [386][387] | T. Munisami et al. |
Oxford Flower Dataset | 17 category dataset of flowers. | Train/test splits, labeled images. | 1360 | Images, text | Classification | 2006 | [137][388] | M-E Nilsback et al. |
Plant Seedlings Dataset | 12 category dataset of plant seedlings. | Labelled images, segmented images. | 5544 | Images | Classification, detection | 2017 | [389] | Giselsson et al. |
Fruits 360 Dataset | 81 Fruits: Apples (different varieties: Golden, Golden-Red, Granny Smith, Red, Red Delicious), Apricot, Avocado, Avocado ripe, Banana (Yellow, Red), Cactus fruit, Cantaloupe (2 varieties), Carambula, Cherry (different varieties, Rainier), Cherry Wax (Yellow, Red, Black), Clementine, Cocos, Dates, Granadilla, Grape (Pink, White, White2), Grapefruit (Pink, White), Guava, Huckleberry, Kiwi, Kaki, Kumquats, Lemon (normal, Meyer), Lime, Lychee, Mandarine, Mango, Maracuja, Melon Piel de Sapo, Mulberry, Nectarine, Orange, Papaya, Passion fruit, Peach, Pepino, Pear (different varieties, Abate, Monster, Williams), Physalis (normal, with Husk), Pineapple (normal, Mini), Pitahaya Red, Plum, Pomegranate, Quince, Rambutan, Raspberry, Salak, Strawberry (normal, Wedge), Tamarillo, Tangelo, Tomato (different varieties, Maroon, Cherry Red), Walnut. | 100x100 pixels, white background. | 55244 | Images (jpg) | Classification | 2017 | [390][391] | Mihai Oltean, Horea Muresan |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Ecoli Dataset | Protein localization sites. | Various features of the protein localization sites are given. | 336 | Text | Classification | 1996 | [392][393] | K. Nakai et al. |
MicroMass Dataset | Identification of microorganisms from mass-spectrometry data. | Various mass spectrometer features. | 931 | Text | Classification | 2013 | [394][395] | P. Mahe et al. |
Yeast Dataset | Prediction of cellular localization sites of proteins. | Eight features given per instance. | 1484 | Text | Classification | 1996 | [396][397] | K. Nakai et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Tox21 Dataset | Prediction of outcome of biological assays. | Chemical descriptors of molecules are given. | 12707 | Text | Classification | 2016 | [398] | A. Mayr et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Numenta Anomaly Benchmark (NAB) | Data are ordered, timestamped, single-valued metrics. All data files contain anomalies, unless otherwise noted. | None | 50+ files | Comma separated values | Anomaly detection | 2016 (continually updated) | [399] | Numenta |
On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study | Most data files are adapted from UCI Machine Learning Repository data; some are collected from the literature. | Treated for missing values, numerical attributes only, different percentages of anomalies, labels. | 1000+ files | ARFF | Anomaly detection | 2016 (possibly updated with new datasets and/or results) | | Campos et al. |
This section includes datasets that deal with structured data.
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
DBpedia Neural Question Answering (DBNQA) Dataset | A large collection of question-to-SPARQL pairs specially designed for open-domain neural question answering over the DBpedia knowledge base. | This dataset contains a large collection of Open Neural SPARQL Templates and instances for training Neural SPARQL Machines; it was pre-processed by semi-automatic annotation tools as well as by three SPARQL experts. | 894,499 | Question-query pairs | Question answering | 2018 | [401][402] | Hartmann, Soru, and Marx et al. |
Datasets consisting of rows of observations and columns of attributes characterizing those observations. They are typically used for regression analysis or classification, but other types of algorithms can also be used. This section includes datasets that do not fit in the above categories.
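The row-and-column layout described above can be illustrated with a minimal Python sketch. The column names, the values, and the helper functions `load_rows` and `predict_1nn` are hypothetical and are not taken from any dataset listed here; a one-nearest-neighbour rule stands in for whatever classifier a study might actually use.

```python
import csv
import io

# Illustrative values only; not taken from any dataset in this list.
RAW = """height_cm,weight_kg,label
150,50,small
160,60,small
180,85,large
190,95,large
"""

def load_rows(text, target):
    """Split CSV text into attribute rows (X) and a target column (y)."""
    reader = csv.DictReader(io.StringIO(text))
    X, y = [], []
    for row in reader:
        y.append(row.pop(target))                   # the attribute to predict
        X.append([float(v) for v in row.values()])  # remaining numeric attributes
    return X, y

def predict_1nn(X, y, query):
    """Return the label of the observation closest to `query`."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    nearest = min(range(len(X)), key=lambda i: sq_dist(X[i], query))
    return y[nearest]

X, y = load_rows(RAW, target="label")
print(len(X), "observations,", len(X[0]), "attributes")
print(predict_1nn(X, y, [182.0, 88.0]))
```

Each row of `X` is one observation and each position within a row is one attribute; replacing the label column with a numeric target would turn the same layout into a regression problem.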
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Dow Jones Index | Weekly data of stocks from the first and second quarters of 2011. | Calculated values such as percentage change and lags are included. | 750 | Comma separated values | Classification, regression, time series | 2014 | [403][404] | M. Brown et al. |
Statlog (Australian Credit Approval) | Credit card applications either accepted or rejected and attributes about the application. | Attribute names are removed as well as identifying information. Factors have been relabeled. | 690 | Comma separated values | Classification | 1987 | [405][406] | R. Quinlan |
eBay auction data | Auction data for various eBay.com items over auctions of various lengths. | Contains all bids, bidderID, bid times, and opening prices. | ~ 550 | Text | Regression, classification | 2012 | [407][408] | G. Shmueli et al. |
Statlog (German Credit Data) | Binary credit classification into "good" or "bad" with many features. | Various financial features of each person are given. | 1,000 | Text | Classification | 1994 | [409] | H. Hofmann |
Bank Marketing Dataset | Data from a large marketing campaign carried out by a large bank. | Many attributes of the clients contacted are given. Whether the client subscribed to a term deposit is also given. | 45,211 | Text | Classification | 2012 | [410][411] | S. Moro et al. |
Istanbul Stock Exchange Dataset | Several stock indexes tracked for almost two years. | None. | 536 | Text | Classification, regression | 2013 | [412][413] | O. Akbilgic |
Default of Credit Card Clients | Credit default data for Taiwanese credit card clients. | Various features about each account are given. | 30,000 | Text | Classification | 2016 | [414][415] | I. Yeh |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Cloud DataSet | Data about 1024 different clouds. | Image features extracted. | 1024 | Text | Classification, clustering | 1989 | [416] | P. Collard |
El Nino Dataset | Oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific. | 12 weather attributes are measured at each buoy. | 178080 | Text | Regression | 1999 | [417] | Pacific Marine Environmental Laboratory |
Greenhouse Gas Observing Network Dataset | Time-series of greenhouse gas concentrations at 2921 grid cells in California created using simulations of the weather. | None. | 2921 | Text | Regression | 2015 | [418] | D. Lucas |
Atmospheric CO2 from Continuous Air Samples at Mauna Loa Observatory | Continuous air samples in Hawaii, USA. 44 years of records. | None. | 44 years | Text | Regression | 2001 | [419] | Mauna Loa Observatory |
Ionosphere Dataset | Radar data from the ionosphere. Task is to classify into good and bad radar returns. | Many radar features given. | 351 | Text | Classification | 1989 | [361][420] | Johns Hopkins University |
Ozone Level Detection Dataset | Two ground ozone level datasets. | Many features given, including weather conditions at time of measurement. | 2536 | Text | Classification | 2008 | [421][422] | K. Zhang et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Adult Dataset | Census data from 1994 containing demographic features of adults and their income. | Cleaned and anonymized. | 48,842 | Comma separated values | Classification | 1996 | [423] | United States Census Bureau |
Census-Income (KDD) | Weighted census data from the 1994 and 1995 Current Population Surveys. | Split into training and test sets. | 299,285 | Comma separated values | Classification | 2000 | [424][425] | United States Census Bureau |
IPUMS Census Database | Census data from the Los Angeles and Long Beach areas. | None | 256,932 | Text | Classification, regression | 1999 | [426] | IPUMS |
US Census Data 1990 | Partial data from 1990 US census. | Results randomized and useful attributes selected. | 2,458,285 | Text | Classification, regression | 1990 | [427] | United States Census Bureau |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Bike Sharing Dataset | Hourly and daily count of rental bikes in a large city. | Many features, including weather, length of trip, etc., are given. | 17,389 | Text | Regression | 2013 | [428][429] | H. Fanaee-T |
New York City Taxi Trip Data | Trip data for yellow and green taxis in New York City. | Gives pick up and drop off locations, fares, and other details of trips. | 6 years | Text | Classification, clustering | 2015 | | New York City Taxi and Limousine Commission |
Taxi Service Trajectory ECML PKDD | Trajectories of all taxis in a large city. | Many features given, including start and stop points. | 1,710,671 | Text | Clustering, causal-discovery | 2015 | [430][431] | M. Ferreira et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Webpages from Common Crawl 2012 | Large collection of webpages and how they are connected via hyperlinks. | None. | 3.5B | Text | Clustering, classification | 2013 | [432] | V. Granville |
Internet Advertisements Dataset | Dataset for predicting if a given image is an advertisement or not. | Features encode geometry of ads and phrases occurring in the URL. | 3279 | Text | Classification | 1998 | [433][434] | N. Kushmerick |
Internet Usage Dataset | General demographics of internet users. | None. | 10,104 | Text | Classification, clustering | 1999 | [435] | D. Cook |
URL Dataset | 120 days of URL data from a large conference. | Many features of each URL are given. | 2,396,130 | Text | Classification | 2009 | [436][437] | J. Ma |
Phishing Websites Dataset | Dataset of phishing websites. | Many features of each site are given. | 2456 | Text | Classification | 2015 | [438] | R. Mustafa et al. |
Online Retail Dataset | Online transactions for a UK online retailer. | Details of each transaction given. | 541,909 | Text | Classification, clustering | 2015 | [439] | D. Chen |
Freebase Simple Topic Dump | Freebase is an online effort to structure all human knowledge. | Topics from Freebase have been extracted. | large | Text | Classification, clustering | 2011 | [440][441] | Freebase |
Farm Ads Dataset | The text of farm ads from websites. Binary approval or disapproval by content owners is given. | SVMlight sparse vectors of text words in ads calculated. | 4143 | Text | Classification | 2011 | [442][443] | C. Mesterharm et al. |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Poker Hand Dataset | Five-card hands from a standard 52-card deck. | Attributes of each hand are given, including the poker hands formed by the cards it contains. | 1,025,010 | Text | Regression, classification | 2007 | [444] | R. Cattral |
Connect-4 Dataset | Contains all legal 8-ply positions in the game of connect-4 in which neither player has won yet, and in which the next move is not forced. | None. | 67,557 | Text | Classification | 1995 | [445] | J. Tromp |
Chess (King-Rook vs. King) Dataset | Endgame Database for White King and Rook against Black King. | None. | 28,056 | Text | Classification | 1994 | [446][447] | M. Bain et al. |
Chess (King-Rook vs. King-Pawn) Dataset | King+Rook versus King+Pawn on a7. | None. | 3196 | Text | Classification | 1989 | [448] | R. Holte |
Tic-Tac-Toe Endgame Dataset | Binary classification for win conditions in tic-tac-toe. | None. | 958 | Text | Classification | 1991 | [449] | D. Aha |
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Housing Data Set | Median home values of Boston with associated home and neighborhood attributes. | None. | 506 | Text | Regression | 1993 | [450] | D. Harrison et al. |
The Getty Vocabularies | Structured terminology for art and other material culture, archival materials, visual surrogates, and bibliographic materials. | None. | large | Text | Classification | 2015 | [451] | Getty Center |
Yahoo! Front Page Today Module User Click Log | User click log for news articles displayed in the Featured Tab of the Today Module on Yahoo! Front Page. | Conjoint analysis with a bilinear model. | 45,811,883 user visits | Text | Regression, clustering | 2009 | [452][453] | Chu et al. |
British Oceanographic Data Centre | Biological, chemical, physical and geophysical data for oceans. 22K variables tracked. | Various. | 22K variables, many instances | Text | Regression, clustering | 2015 | [454] | British Oceanographic Data Centre |
Congressional Voting Records Dataset | Voting data for all USA representatives on 16 issues. | Beyond the raw voting data, various other features are provided. | 435 | Text | Classification | 1987 | [455] | J. Schlimmer |
Entree Chicago Recommendation Dataset | Record of user interactions with the Entree Chicago recommendation system. | Each user's usage of the app is recorded in detail. | 50,672 | Text | Regression, recommendation | 2000 | [456] | R. Burke |
Insurance Company Benchmark (COIL 2000) | Information on customers of an insurance company. | Many features of each customer and the services they use. | 9,000 | Text | Regression, classification | 2000 | [457][458] | P. van der Putten |
Nursery Dataset | Data from applicants to nursery schools. | Data about applicant's family and various other factors included. | 12,960 | Text | Classification | 1997 | [459][460] | V. Rajkovic et al. |
University Dataset | Data describing attributes of a large number of universities. | None. | 285 | Text | Clustering, classification | 1988 | [461] | S. Souders et al. |
Blood Transfusion Service Center Dataset | Data from a blood transfusion service center. Gives data on donors' return rate, frequency, etc. | None. | 748 | Text | Classification | 2008 | [462][463] | I. Yeh |
Record Linkage Comparison Patterns Dataset | Large dataset of records. Task is to link relevant records together. | Blocking procedure applied to select only certain record pairs. | 5,749,132 | Text | Classification | 2011 | [464][465] | University of Mainz |
Nomao Dataset | Nomao collects data about places from many different sources. Task is to detect items that describe the same place. | Duplicates labeled. | 34,465 | Text | Classification | 2012 | [466][467] | Nomao Labs |
Movie Dataset | Data for 10,000 movies. | Several features for each movie are given. | 10,000 | Text | Clustering, classification | 1999 | [468] | G. Wiederhold |
Open University Learning Analytics Dataset | Information about students and their interactions with a virtual learning environment. | None. | ~ 30,000 | Text | Classification, clustering, regression | 2015 | [469][470] | J. Kuzilek et al. |
Mobile phone records | Telecommunications activity and interactions. | Aggregated per geographical grid cell and per 15-minute interval. | large | Text | Classification, clustering, regression | 2015 | [471] | G. Barlacchi et al. |
Because datasets come in myriad formats and can be difficult to use, considerable work has gone into curating datasets and standardizing their formats so that they are easier to use for machine-learning research.
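As a rough illustration of format standardization, the sketch below converts comma-separated text into a minimal ARFF file, two of the formats that appear in the tables above. The function name `csv_to_arff`, the relation name, and the sample values are hypothetical. The converter assumes a header row, no missing values, no quoting, and attribute names without spaces; it is not the procedure used by any particular repository.

```python
import csv
import io

def csv_to_arff(csv_text, relation):
    """Convert CSV text with a header row into a minimal ARFF string.

    A column is declared numeric if every value parses as a float;
    otherwise it is declared nominal with its sorted distinct values.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    for col, name in enumerate(header):
        values = [r[col] for r in data]
        try:
            for v in values:
                float(v)  # raises ValueError for non-numeric text
            lines.append(f"@attribute {name} numeric")
        except ValueError:
            nominal = ",".join(sorted(set(values)))
            lines.append(f"@attribute {name} {{{nominal}}}")
    lines += ["", "@data"]
    lines += [",".join(r) for r in data]
    return "\n".join(lines) + "\n"

# Illustrative values only; not taken from any dataset in this list.
SAMPLE = "length,width,species\n5.0,3.0,a\n6.5,2.8,b\n"
arff = csv_to_arff(SAMPLE, "toy")
print(arff)
```

The header section declares each attribute's type once, so a consumer does not have to infer it again from the raw values.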