CLINC150

Donated on 5/7/2020

This is an intent classification (text classification) dataset with 150 in-domain intent classes. Its main purpose is to evaluate classifiers on out-of-domain performance, that is, on inputs that belong to none of the 150 in-domain intents.

Dataset Characteristics

Text

Subject Area

Other

Associated Tasks

Classification

Feature Type

-

# Instances

23700

# Features

-

Dataset Information

Additional Information

There are 4 versions of the dataset:
- data_full.json: each of the 150 in-domain intent classes has 100 train, 20 val, and 30 test samples. The out-of-domain class has 100 train, 100 val, and 1,000 test samples. This is the main version of the dataset.
- data_small.json: each of the 150 in-domain intent classes has 50 train, 20 val, and 30 test samples. The out-of-domain class has 100 train, 100 val, and 1,000 test samples.
- data_imbalanced.json: each of the 150 in-domain intent classes has either 25, 50, 75, or 100 train, 20 val, and 30 test samples. The out-of-domain class has 100 train, 100 val, and 1,000 test samples.
- data_oos_plus.json: same as data_full.json except there are 250 out-of-domain training samples.

In every version, the out-of-domain class does not necessarily need to be used at training time.
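
As a quick sanity check, the per-class counts above can be multiplied out to get the total size of each version. The sketch below uses only the numbers stated in this description; the function and variable names are illustrative and not part of the dataset. data_imbalanced.json is left out because the description does not say how many classes receive each training size.

```python
# Totals derived from the per-class counts in the dataset description.
# Names here are illustrative, not part of the dataset itself.
IN_DOMAIN_CLASSES = 150


def total_samples(train_per_class, val_per_class, test_per_class,
                  oos_train, oos_val, oos_test):
    """Return the total sample count for one dataset version."""
    in_domain = IN_DOMAIN_CLASSES * (train_per_class + val_per_class + test_per_class)
    out_of_domain = oos_train + oos_val + oos_test
    return in_domain + out_of_domain


# data_full.json: 100/20/30 per in-domain class, 100/100/1000 out-of-domain
full = total_samples(100, 20, 30, 100, 100, 1000)
# data_small.json: 50/20/30 per in-domain class, same out-of-domain counts
small = total_samples(50, 20, 30, 100, 100, 1000)
# data_oos_plus.json: same as full, but 250 out-of-domain training samples
oos_plus = total_samples(100, 20, 30, 250, 100, 1000)

print(full, small, oos_plus)
```

The total for data_full.json works out to 23700, which agrees with the instance count listed for this dataset.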

Has Missing Values?

No

Variable Information

All samples are in text format. No tokenization has been applied. Users of this dataset are free to use whatever sentence representation (e.g. bag-of-words, sentence embeddings) they choose.
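
Since the text is raw, any representation has to be built by the user. The following is a minimal bag-of-words sketch using only the Python standard library. It assumes lowercasing and whitespace tokenization, which is one possible choice and not something the dataset prescribes, and the three example utterances are made up for illustration rather than taken from the dataset.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; the dataset applies no tokenization.
    return text.lower().split()


def build_vocabulary(texts):
    """Map each distinct token to a column index, in order of first appearance."""
    vocab = {}
    for text in texts:
        for token in tokenize(text):
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def bag_of_words(text, vocab):
    """Return a count vector over the vocabulary; unseen tokens are ignored."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocab)
    for token, count in counts.items():
        index = vocab.get(token)
        if index is not None:
            vector[index] = count
    return vector


# Hypothetical utterances, not drawn from the dataset.
samples = [
    "what is my balance",
    "book a flight to boston",
    "what is the weather",
]
vocab = build_vocabulary(samples)
vectors = [bag_of_words(sample, vocab) for sample in samples]
```

Building the vocabulary from training text only means that words first seen at test time, which is likely for out-of-domain queries, simply contribute nothing to the vector.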

