2 Classification Datasets Overview
Classification experiments were performed on three types of data:
1. Three datasets from the UCI machine learning repository [1]. Experiments
on these datasets were performed by former MSc student Håvard Engum
(co-supervised by the author) [4].
2. Zereal game simulation output has been used as datasets for the section on
"Game Player Classification".
3. Web usage logs from www.jfipa.org have been used as datasets for the section
on "Web Intelligence".
2.1 Classification Tools Overview
The classifiers compared with the MIPSVM implementation in IncRidge are the
classifiers in the Weka toolkit [10] and the C-SVM algorithm in the LIBSVM
toolkit. These toolkits are described briefly below. (Other classifier tools were
considered but not selected, since they did not support 10-fold cross-validation
testing.)
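The 10-fold cross-validation protocol mentioned above can be sketched as follows. This is a minimal, unstratified illustration in plain Python, not the procedure implemented by Weka or LIBSVM. The names k_fold_splits, cross_validated_accuracy, train_fn and predict_fn are hypothetical and introduced only for this example.

```python
import random


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validated_accuracy(features, labels, train_fn, predict_fn, k=10):
    """Average test accuracy over k folds.

    train_fn(features, labels) returns a model and predict_fn(model, x)
    returns a label; both are supplied by the caller (hypothetical interface).
    """
    accuracies = []
    for train, test in k_fold_splits(len(labels), k):
        model = train_fn([features[i] for i in train], [labels[i] for i in train])
        correct = sum(predict_fn(model, features[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)
```

Each sample is used for testing exactly once and for training in the remaining k - 1 rounds, so every classifier is scored on data it did not see during training.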
WEKA WEKA is an acronym for "Waikato Environment for Knowledge Analysis"
and is a software tool that consists of a set of machine learning algorithms,
including classifiers. Weka is implemented in Java and has been successfully run
on all major computer platforms [10]. The Weka classifiers used
are Naive Bayes, C4.5, Logistic Regression, Voted Perceptron and SMO. Naive
Bayes is often the default classifier in many domains, since it is simple and
generally gives good results. Logistic Regression is considered to be the standard
classification method in the domain of medical research [7]. C4.5 has been shown to
perform well compared to other classifiers, in fact outperforming Linear Discriminant
Analysis and Logistical Analysis for the classification of high performance
mutual funds [5]. Sequential Minimal Optimization (SMO) is an approximate
and fast method to train Support Vector Machine classifiers [6].
LIBSVM LIBSVM is integrated software for support vector classification,
regression and distribution estimation. It supports multicategory classification
and different SVM formulations [2].
To get the best results from LIBSVM, a few preprocessing steps can be taken
in advance to improve both its accuracy and its computational
efficiency [3].
The first step is to scale the features to the range [-1, 1] or [0, 1]. This
prevents attributes with greater numeric ranges from dominating those
with smaller numeric ranges. Scaling also avoids the numerical difficulties
that large attribute values cause when the values of kernel functions are
calculated. LIBSVM features a tool, svm-scale, that scales the data.
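The linear scaling described above can be sketched in a few lines of Python. This is an illustration of per-feature min-max scaling, not the svm-scale source. The function names fit_scaling and apply_scaling, and the handling of constant features, are assumptions made for this example.

```python
def fit_scaling(rows, lower=-1.0, upper=1.0):
    """Record the per-feature minimum and maximum of the training rows."""
    columns = list(zip(*rows))
    return {
        "lower": lower,
        "upper": upper,
        "mins": [min(column) for column in columns],
        "maxs": [max(column) for column in columns],
    }


def apply_scaling(rows, params):
    """Map each feature linearly so that its training range becomes [lower, upper]."""
    lower, upper = params["lower"], params["upper"]
    scaled = []
    for row in rows:
        new_row = []
        for value, lo, hi in zip(row, params["mins"], params["maxs"]):
            if hi == lo:
                # Constant feature: there is no range to map, so this sketch
                # places it at the midpoint (an assumption of the example).
                new_row.append((lower + upper) / 2.0)
            else:
                new_row.append(lower + (upper - lower) * (value - lo) / (hi - lo))
        scaled.append(new_row)
    return scaled


train = [[1.0, 100.0], [3.0, 300.0], [2.0, 200.0]]
params = fit_scaling(train)
scaled_train = apply_scaling(train, params)
```

The scaling parameters are fitted on the training data only and then reused for the test data, so both sets are transformed identically. Test values outside the training range can therefore fall outside [-1, 1].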