Gas Sensor Array Drift Dataset

Donated on 4/24/2012

This archive contains 13910 measurements from 16 chemical sensors, used in simulations of drift compensation in a discrimination task involving 6 gases at various concentration levels.

Dataset Characteristics

Multivariate

Subject Area

Computer Science

Associated Tasks

Classification

Feature Type

Real

# Instances

13910

# Features

128

Dataset Information

Additional Information

This archive contains 13910 measurements from 16 chemical sensors, used in simulations of drift compensation in a discrimination task involving 6 gases at various concentration levels. The goal is to achieve good performance (or as little degradation as possible) over time, as reported in the paper mentioned below in Section 2: Data collection. The primary purpose of providing this dataset is to make it freely accessible online to the chemo-sensor and artificial intelligence research communities, so that they can develop strategies to cope with sensor/concept drift. The dataset can be used exclusively for research purposes. Commercial purposes are fully excluded.

The dataset was gathered from January 2007 to February 2011 (36 months) in a gas delivery platform facility situated at the ChemoSignals Laboratory in the BioCircuits Institute, University of California San Diego. The platform is operated entirely by a computerized environment, controlled by LabVIEW (National Instruments) software running on a PC fitted with the appropriate serial data acquisition boards. The measurement system provides versatility for obtaining the desired concentrations of the chemical substances of interest with high accuracy and in a highly reproducible manner. This minimizes the common mistakes caused by human intervention and makes it possible to concentrate exclusively on the chemical sensors when compensating for real drift.

The resulting dataset comprises recordings from six distinct pure gaseous substances, namely Ammonia, Acetaldehyde, Acetone, Ethylene, Ethanol, and Toluene, each dosed at a wide variety of concentration values ranging from 5 to 1000 ppmv. See Tables 1 and 2 of the manuscript cited below for details on the gas identity name, concentration values, and time distribution sequence of the measurement recordings considered in this dataset.

Batch10.dat was updated on 10/14/2013 to correct some corrupted values in the last 120 lines of the file.

An extension of this dataset with the concentration values is available as the Gas Sensor Array Drift Dataset at Different Concentrations Data Set: http://archive.ics.uci.edu/ml/datasets/Gas+Sensor+Array+Drift+Dataset+at+Different+Concentrations

Has Missing Values?

No

Variables Table

Variable Name	Role	Type	Description	Units	Missing Values
(rows 1 to 10 of 128; names not given)					no

Additional Variable Information

The response of each sensor is read out as the resistance across its active layer; hence each measurement produces a 16-channel time series, each channel of which is represented by an aggregate of features reflecting all the dynamic processes occurring at the sensor surface in reaction to the chemical substance being evaluated. In particular, two distinct types of features were considered in the creation of this dataset:

(i) The so-called steady-state feature (ΔR), defined as the difference between the maximal resistance change and the baseline, and its normalized version, expressed by the ratio of the maximal resistance and the baseline values when the chemical vapor is present in the test chamber.

(ii) An aggregate of features reflecting the sensor dynamics of the increasing/decaying transient portion of the sensor response during the entire measurement procedure under controlled conditions, namely the exponential moving average (emaα).

This aggregate of features is a transform, borrowed from econometrics and introduced to the chemo-sensing community by Muezzinoglu et al. (2009), that converts the transient portion into a real scalar by estimating the maximum value (the minimum for the decaying portion of the sensor response) of its exponential moving average (emaα). The initial condition is set to zero, and the scalar smoothing parameter of the operator, α, which is set between 0 and 1, defines both the quality of the feature and the time of its occurrence along the time series. Three different values of α were used to obtain three feature values from the pre-recorded rising portion of the sensor response, and the same three α values were used to obtain three additional features from the decaying portion, thus covering the entire sensor response dynamics.
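The feature extraction described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the page does not give the recursion, so the sketch assumes the common form y[k] = (1 - α)·y[k-1] + α·(r[k] - r[k-1]) with y[0] = 0, and it assumes the "maximal resistance" is simply the largest sample of the trace. The function names `ema_transform` and `sensor_features` and the `baseline` argument are hypothetical.

```python
import numpy as np

def ema_transform(r, alpha):
    """Exponential moving average of the first difference of r.

    Assumed recursion (not stated on this page):
        y[k] = (1 - alpha) * y[k-1] + alpha * (r[k] - r[k-1]),  y[0] = 0
    """
    r = np.asarray(r, dtype=float)
    y = np.zeros_like(r)
    for k in range(1, len(r)):
        y[k] = (1.0 - alpha) * y[k - 1] + alpha * (r[k] - r[k - 1])
    return y

def sensor_features(r, baseline, alphas=(0.001, 0.01, 0.1)):
    """Return the 8 features of one sensor trace, in the dataset's order.

    Assumption: the maximal resistance is taken as max(r).
    """
    r = np.asarray(r, dtype=float)
    peak = r.max()
    delta_r = peak - baseline          # steady-state feature
    norm_delta_r = peak / baseline     # normalized steady-state feature
    # Maximum of the EMA captures the rising transient,
    # minimum captures the decaying transient.
    rising = [ema_transform(r, a).max() for a in alphas]
    decaying = [ema_transform(r, a).min() for a in alphas]
    return [delta_r, norm_delta_r, *rising, *decaying]

# Synthetic trace: baseline, exposure to gas, recovery.
trace = [10, 10, 12, 15, 15, 13, 10, 10]
features = sensor_features(trace, baseline=10.0)
```

Because the recursion starts at zero, the rising features are never negative and the decaying features are never positive, which is one quick sanity check on any reimplementation.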
For a more detailed analysis and discussion of these features, and for a graphical illustration of them, please refer to Section 2.3 and Figure 2, respectively, of the annotated manuscript. Once the above features are calculated, a feature vector is formed from the 8 features extracted from each of the 16 sensors considered here. The resulting 128-dimensional feature vector (8 features × 16 sensors) is organized as follows:

ΔR_1, |ΔR|_1, EMAi0.001_1, EMAi0.01_1, EMAi0.1_1, EMAd0.001_1, EMAd0.01_1, EMAd0.1_1, ΔR_2, |ΔR|_2, EMAi0.001_2, EMAi0.01_2, EMAi0.1_2, EMAd0.001_2, EMAd0.01_2, EMAd0.1_2, ..., ΔR_16, |ΔR|_16, EMAi0.001_16, EMAi0.01_16, EMAi0.1_16, EMAd0.001_16, EMAd0.01_16, EMAd0.1_16

Here "ΔR_1" and "|ΔR|_1" are the ΔR and the normalized ΔR features, respectively; "EMAi0.001_1", "EMAi0.01_1", and "EMAi0.1_1" are the emaα of the rising transient portion of the sensor response for α equal to 0.001, 0.01, and 0.1, respectively; and "EMAd0.001_1", "EMAd0.01_1", and "EMAd0.1_1" are the emaα of the decaying transient portion of the sensor response for α equal to 0.001, 0.01, and 0.1, respectively, all corresponding to sensor # 1. The same eight features follow with the suffix _2 for sensor # 2, and so forth up to sensor # 16, thus forming the 128-dimensional feature vector that is fed to the classifiers for training.
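The column ordering above can be generated programmatically, which helps when attaching names to the 128 unnamed columns. The sketch below only restates the layout described in the text; the helper names `feature_names` and `locate` are hypothetical.

```python
def feature_names(n_sensors=16, alphas=("0.001", "0.01", "0.1")):
    """Build the 128 feature names in the order the dataset stores them."""
    names = []
    for s in range(1, n_sensors + 1):
        names.append(f"ΔR_{s}")                        # steady-state feature
        names.append(f"|ΔR|_{s}")                      # normalized steady-state feature
        names.extend(f"EMAi{a}_{s}" for a in alphas)   # rising transient
        names.extend(f"EMAd{a}_{s}" for a in alphas)   # decaying transient
    return names

def locate(feature_number, names=None):
    """Map a 1-based feature number (as in the data files) to (sensor, name)."""
    names = names or feature_names()
    sensor = (feature_number - 1) // 8 + 1
    return sensor, names[feature_number - 1]

names = feature_names()
first_of_sensor_2 = locate(9)
last_feature = locate(128)
```

With this mapping, feature number 9 is the ΔR of sensor 2 and feature number 128 is the decaying-transient emaα with α = 0.1 for sensor 16.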
For processing purposes, the data is organized into ten batches. This reorganization of the data was done to ensure a sufficient, and as uniformly distributed as possible, number of experiments in each class and month when training the classifier.

Dataset organization details. Each row lists the months that were combined to form a batch:

Batch ID	Month IDs
Batch 1	Months 1 and 2
Batch 2	Months 3, 4, 8, 9 and 10
Batch 3	Months 11, 12, and 13
Batch 4	Months 14 and 15
Batch 5	Month 16
Batch 6	Months 17, 18, 19, and 20
Batch 7	Month 21
Batch 8	Months 22 and 23
Batch 9	Months 24 and 30
Batch 10	Month 36

The data format follows the same coding style as libsvm: each line first gives the class the data point belongs to (1: Ethanol; 2: Ethylene; 3: Ammonia; 4: Acetaldehyde; 5: Acetone; 6: Toluene) and then the collection of features in the format x:v, where x stands for the feature number and v for the actual value of the feature. For example, in

1 1:15596.162100 2:1.868245 3:2.371604 4:2.803678 5:7.512213 … 128:-2.654529

the number "1" stands for the class number (in this case Ethanol), whereas the remaining 128 columns list the actual feature values for each measurement recording, organized as described above.

Finally, to make the results presented in the associated article reproducible, please use the following parameter values in the training task:

• folds: 10
• log2c = -5, 10, 1
• log2g = -10, 5, 1
• Scale the features in the training set appropriately to lie between -1 and +1.
• And use the following cross validation parameters:

Batch	C	Gamma (γ)	Rate
1	256.0	0.03125	98.8764
2	64.0	0.00390625	99.7588
3	128.0	0.03125	100.0
4	1.0	0.25	100.0
5	2.0	0.015625	99.4924
6	256.0	0.0009765625	99.5217
7	64.0	0.0625	99.9723
8	1024.0	0.0078125	99.6599
9	2.0	0.00390625	100.0
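As a minimal sketch of working with this format, the code below parses one libsvm-style line into a class label and a dense 128-element vector, then scales features to the range -1 to +1 using minima and maxima taken from the training set only. The helper names `parse_line` and `fit_scaler` are hypothetical, the sample line is shortened for illustration (features absent from a line are left at zero, following the usual sparse convention), and the min-max scaler is an assumption about what "scale appropriately" means rather than the authors' exact procedure.

```python
import numpy as np

GAS_NAMES = {1: "Ethanol", 2: "Ethylene", 3: "Ammonia",
             4: "Acetaldehyde", 5: "Acetone", 6: "Toluene"}

def parse_line(line, n_features=128):
    """Parse 'label idx:value idx:value ...' into (label, dense vector)."""
    tokens = line.split()
    label = int(tokens[0])
    x = np.zeros(n_features)
    for token in tokens[1:]:
        index, value = token.split(":")
        x[int(index) - 1] = float(value)   # feature numbers are 1-based
    return label, x

def fit_scaler(X_train, lo=-1.0, hi=1.0):
    """Return a function that maps each feature to [lo, hi] using training ranges."""
    mins = X_train.min(axis=0)
    maxs = X_train.max(axis=0)
    # Avoid division by zero for constant features.
    span = np.where(maxs > mins, maxs - mins, 1.0)
    def scale(X):
        return lo + (hi - lo) * (np.asarray(X, dtype=float) - mins) / span
    return scale

# Shortened sample line (the real files carry all 128 features).
label, x = parse_line("1 1:15596.162100 2:1.868245 3:2.371604")
gas = GAS_NAMES[label]

# Fit the scaler on a toy training matrix, then reuse it on later data.
X_train = np.array([[0.0, 5.0], [10.0, 5.0]])
scale = fit_scaler(X_train)
X_scaled = scale(X_train)
```

Fitting the ranges on the training batch and reusing them on later batches keeps information about the drifted test data from leaking into training. Values from later batches may then fall outside -1 to +1, which is expected under drift.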

Dataset Files

File	Size
Dataset/batch7.dat	5.8 MB
Dataset/batch10.dat	5.8 MB
Dataset/batch6.dat	3.7 MB
Dataset/batch3.dat	2.6 MB
Dataset/batch2.dat	2 MB

5 of 10 files listed

Download (9.5 MB)

Creators

Alexander Vergara

DOI

10.24432/C5RP6W

License

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.