CLINC150
Donated on 5/7/2020
This is an intent classification (text classification) dataset with 150 in-domain intent classes. The main purpose of this dataset is to evaluate various classifiers on out-of-domain performance.
Dataset Characteristics
Text
Subject Area
Other
Associated Tasks
Classification
Feature Type
-
# Instances
23700
# Features
-
Dataset Information
Additional Information
There are 4 versions of the dataset:
- data_full.json: each of the 150 in-domain intent classes has 100 train, 20 val, and 30 test samples. The out-of-domain class has 100 train, 100 val, and 1,000 test samples. This is the main version of the dataset.
- data_small.json: each of the 150 in-domain intent classes has 50 train, 20 val, and 30 test samples. The out-of-domain class has 100 train, 100 val, and 1,000 test samples.
- data_imbalanced.json: each of the 150 in-domain intent classes has either 25, 50, 75, or 100 train samples, plus 20 val and 30 test samples. The out-of-domain class has 100 train, 100 val, and 1,000 test samples.
- data_oos_plus.json: same as data_full.json except there are 250 out-of-domain training samples.
In all versions, the out-of-domain class does not necessarily need to be used at training time.
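The split sizes above determine the total number of samples in each version. The following is a minimal sketch of that arithmetic; the names `VERSIONS` and `total_samples` are made up for illustration. data_imbalanced.json is left out because the description does not say how many classes have each training size, and data_oos_plus.json is assumed to keep the val and test sizes of data_full.json.

```python
# Per-class split sizes (train, val, test) as stated in the description.
N_INTENTS = 150

VERSIONS = {
    "data_full": {"in_domain": (100, 20, 30), "oos": (100, 100, 1000)},
    "data_small": {"in_domain": (50, 20, 30), "oos": (100, 100, 1000)},
    # Assumption: only the out-of-domain training size differs from data_full.
    "data_oos_plus": {"in_domain": (100, 20, 30), "oos": (250, 100, 1000)},
}


def total_samples(version: str) -> int:
    """Total samples = 150 in-domain classes * per-class size + out-of-domain size."""
    spec = VERSIONS[version]
    return N_INTENTS * sum(spec["in_domain"]) + sum(spec["oos"])


for name in VERSIONS:
    print(name, total_samples(name))
```

For data_full.json this gives 150 * 150 + 1,200 = 23,700, which matches the instance count listed for the dataset.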
Has Missing Values?
No
Variable Information
All samples are in text format. No tokenization has been applied. Users of this dataset are free to use whatever sentence representation (e.g. bag-of-words, sentence embeddings) they choose.
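As one illustration of the sentence representations mentioned above, here is a minimal bag-of-words sketch in plain Python. The whitespace tokenization, the helper names, and the sample sentences are all assumptions made for this example; they are not part of the dataset.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())


def build_vocab(texts):
    """Map every token seen in `texts` to a column index (alphabetical order)."""
    tokens = sorted({tok for text in texts for tok in bag_of_words(text)})
    return {tok: i for i, tok in enumerate(tokens)}


def vectorize(text: str, vocab: dict) -> list:
    """Turn a sentence into a count vector; tokens outside the vocabulary are ignored."""
    vec = [0] * len(vocab)
    for tok, count in bag_of_words(text).items():
        if tok in vocab:
            vec[vocab[tok]] = count
    return vec


# Made-up sentences for illustration, not taken from the dataset.
samples = ["what is my balance", "set an alarm for my meeting"]
vocab = build_vocab(samples)
print(vectorize("my balance my alarm", vocab))
```

Any other representation, such as sentence embeddings, can be substituted at the `vectorize` step without changing how the raw text is read.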
Dataset Files
File | Size |
---|---|
clinc150_uci/data_oos_plus.json | 2.4 MB |
clinc150_uci/data_full.json | 2.4 MB |
clinc150_uci/data_imbalanced.json | 1.9 MB |
clinc150_uci/data_small.json | 1.6 MB |
clinc150_uci/LICENSE | 19 KB |
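The JSON files listed above can be read with Python's standard `json` module. The sketch below assumes each file is an object mapping a split name to a list of `[text, intent]` pairs; that layout, the split names, and the helper names are assumptions not stated on this page, so inspect the file before relying on them. A small inline stand-in is used so the example runs without the download.

```python
import json
from collections import Counter


def parse_splits(raw_json: str) -> dict:
    """Parse a JSON string into {split_name: [(text, intent), ...]}.

    Assumes each split is a list of two-element [text, intent] rows.
    """
    data = json.loads(raw_json)
    return {
        split: [(text, intent) for text, intent in rows]
        for split, rows in data.items()
    }


def label_counts(rows) -> Counter:
    """Count how many samples carry each intent label."""
    return Counter(intent for _, intent in rows)


# Tiny made-up stand-in with the assumed layout (not real dataset rows).
example = json.dumps({
    "train": [["what is my balance", "balance"], ["set an alarm", "alarm"]],
    "oos_train": [["how tall is the moon", "oos"]],
})

splits = parse_splits(example)
for name, rows in splits.items():
    print(name, len(rows), dict(label_counts(rows)))

# With a downloaded file (path taken from the table above):
# with open("clinc150_uci/data_full.json", encoding="utf-8") as f:
#     splits = parse_splits(f.read())
```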
pip install ucimlrepo

from ucimlrepo import fetch_ucirepo

# fetch dataset
clinc150 = fetch_ucirepo(id=570)

# data (as pandas dataframes)
X = clinc150.data.features
y = clinc150.data.targets

# metadata
print(clinc150.metadata)

# variable information
print(clinc150.variables)
CLINC150 [Dataset]. (2020). UCI Machine Learning Repository. https://doi.org/10.24432/C5MP58.
License
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.