CLINC150

Donated on 5/7/2020

This is an intent classification (text classification) dataset with 150 in-domain intent classes. Its main purpose is to evaluate how well classifiers perform on out-of-domain inputs.

Dataset Characteristics

Text

Subject Area

Other

Associated Tasks

Classification

Feature Type

# Instances

23700

# Features

Dataset Information

Additional Information

There are 4 versions of the dataset:
- data_full.json: each of the 150 in-domain intent classes has 100 train, 20 val, and 30 test samples. The out-of-domain class has 100 train, 100 val, and 1,000 test samples. This is the main version of the dataset.
- data_small.json: each of the 150 in-domain intent classes has 50 train, 20 val, and 30 test samples. The out-of-domain class has 100 train, 100 val, and 1,000 test samples.
- data_imbalanced.json: each of the 150 in-domain intent classes has either 25, 50, 75, or 100 train, 20 val, and 30 test samples. The out-of-domain class has 100 train, 100 val, and 1,000 test samples.
- data_oos_plus.json: same as data_full.json except that there are 250 out-of-domain training samples instead of 100.

Note that in every version the out-of-domain class does not necessarily need to be used at training time.
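As a quick sanity check, the per-class split sizes stated above can be multiplied out to get the total number of samples in each version. The sketch below uses only the numbers from the description. data_imbalanced.json is left out because the description does not say how many classes fall into each training-size bucket. The names `versions` and `total_samples` are illustrative.

```python
# Split sizes (train, val, test) as stated in the description above.
IN_DOMAIN_CLASSES = 150

versions = {
    "data_full.json": {"in_domain": (100, 20, 30), "oos": (100, 100, 1000)},
    "data_small.json": {"in_domain": (50, 20, 30), "oos": (100, 100, 1000)},
    "data_oos_plus.json": {"in_domain": (100, 20, 30), "oos": (250, 100, 1000)},
}


def total_samples(spec):
    """Total samples = classes * per-class samples + out-of-domain samples."""
    return IN_DOMAIN_CLASSES * sum(spec["in_domain"]) + sum(spec["oos"])


totals = {name: total_samples(spec) for name, spec in versions.items()}
for name, total in totals.items():
    print(f"{name}: {total}")
```

The total for data_full.json comes to 23,700, which matches the instance count listed for the dataset.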

Has Missing Values?

Variable Information

All samples are in text format. No tokenization has been applied. Users of this dataset are free to use whatever sentence representation (e.g. bag-of-words, sentence embeddings) they choose.
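Because the text is untokenized, the choice of representation is left to the user. The following is a minimal sketch of one option mentioned above, a bag-of-words count vector, using naive whitespace tokenization. The example utterances are hypothetical and are not taken from the dataset, and the helper names are illustrative.

```python
from collections import Counter

# Hypothetical utterances for illustration only (not from the dataset).
samples = [
    "what is my account balance",
    "book a flight to boston",
    "what is the weather",
]


def bag_of_words(text):
    """Lowercase, split on whitespace, and count each token."""
    return Counter(text.lower().split())


# Fixed vocabulary built from the samples, sorted for a stable column order.
vocab = sorted({token for s in samples for token in s.lower().split()})


def to_vector(text):
    """Map a sentence to a count vector aligned with `vocab`."""
    counts = bag_of_words(text)
    return [counts[token] for token in vocab]


vectors = [to_vector(s) for s in samples]
```

Tokens that are absent from `vocab` are simply dropped by `to_vector`, which is one simple way to handle unseen words at test time.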

Dataset Files

File	Size
clinc150_uci/data_oos_plus.json	2.4 MB
clinc150_uci/data_full.json	2.4 MB
clinc150_uci/data_imbalanced.json	1.9 MB
clinc150_uci/data_small.json	1.6 MB
clinc150_uci/LICENSE	19 KB
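A file can be inspected with a few lines of Python. This page does not document the JSON layout, so the sketch below rests on an assumption: each file is a JSON object that maps split names to lists of [text, intent] pairs. If the actual layout differs, the unpacking in `count_samples` (a hypothetical helper) needs to be adjusted.

```python
import json
from collections import Counter


def count_samples(path):
    """Count samples per intent in each split of one dataset file.

    ASSUMPTION: the file is a JSON object of the form
    {split_name: [[text, intent], ...], ...}. This layout is not
    stated on this page.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return {
        split: Counter(intent for _text, intent in rows)
        for split, rows in data.items()
    }
```

For example, `count_samples("clinc150_uci/data_full.json")` would return one `Counter` per split, which can be compared against the split sizes given in the description.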

DOI

10.24432/C5MP58

License

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.