1. Introduction
In this report, we review Weka: what it is, how to download it and which version
to use. We then present our datasets and explore them to select the ones
suitable for association rule mining. Parameters are set before applying the
Apriori algorithm, which is mainly used to extract the best rules in a relation.
We comment on the results and finally give our conclusion.
2. Weka Review
Weka is a collection of machine learning algorithms written in Java. It is
developed at the University of Waikato in New Zealand and has been tested
successfully under Linux, Windows, and Macintosh operating systems. Weka is used
for data pre-processing, classification, regression, clustering, association
rules, and visualization. Weka is open source software issued under the GNU
General Public License.
Figure 1. Weka Explorer
Versions: Weka 3.4.x is the most recent stable version. There is also a developer
version, Weka 3.5.x. As normal users, we are going to use version 3.4.x.
Requirements: Java must be installed before installing Weka. All Weka versions
up to 3.4.x require Java 1.4 or later. From Weka 3.5.3 onward, Java 5.0 is
required.
Download: Weka can be downloaded from its main website (http://www.cs.waikato.ac.nz/~ml/weka/), where you will
find versions available for all operating systems, such as Mac OS, Windows and
Linux.
GUI: Weka provides a graphical user interface (GUI) for data processing.
Developers can also integrate Weka with their own Java code, so all
functionality in the GUI can be called from within Java code. The main
components of the Weka GUI are the Explorer, the Experimenter, the Knowledge
Flow and the CLI. Clicking the Explorer button displays the screen in Figure 1.
3. Dataset Selection
File Formats: Two file types are mainly used in Weka, namely ARFF and CSV.
ARFF (Attribute-Relation File Format) is a text file containing all the
instances of a specific relation; it also divides the relation into a set of
attributes. The second format is CSV (Comma-Separated Values), a tabular format
for data. Converters in Weka can be used to convert from one file format to
another; for example, it is easy to convert from CSV to ARFF and vice versa.
Figure 2 shows examples of such datasets. To display any file, use the “Open
File” button in the Explorer.
For the Bank dataset, remove the ID field and discretize the age, income and
children fields.
Figure 2. (left) Diabetes dataset, in which nearly all attributes are of type
numeric. (right) Weather.nominal dataset, in which all attributes are of type
nominal.
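To make the relationship between the two file formats concrete, the sketch below builds ARFF text from a small CSV table by hand. It is only an illustration, not Weka's own converter: the function name `csv_to_arff`, the relation name and the sample rows are invented for this example, and every column is assumed to be nominal.

```python
import csv
import io


def csv_to_arff(csv_text, relation):
    """Build ARFF text from CSV text, treating every column as nominal."""
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for i, name in enumerate(header):
        # A nominal attribute lists its allowed values inside braces.
        values = sorted({row[i] for row in data})
        lines.append("@attribute " + name + " {" + ",".join(values) + "}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines)


# Hypothetical sample rows, invented for this illustration.
sample_csv = """outlook,windy,play
sunny,FALSE,no
overcast,TRUE,yes
rainy,FALSE,yes
"""

arff_text = csv_to_arff(sample_csv, "weather_sample")
print(arff_text)
```

The header section (the `@relation` and `@attribute` lines) is what ARFF adds on top of the CSV rows; the `@data` section is the same comma-separated table.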
Datasets: Selection of data depends on its suitability for association rule
mining. Association rule mining with Apriori works only with nominal data.
Nominal data is data with a fixed set of states, such as the attribute “Sex”,
which has only two values, either MALE or FEMALE. The values of a nominal type
are discrete. Searching the available datasets for suitable ones, we take the
following datasets as examples:
- Bank
- Sonar
- Zoo
- Diabetes
- Weather.nominal
When we look at the Sonar and Diabetes datasets, we find that nearly all the
fields hold numeric values. Association is therefore not possible, because the
fields are not discrete; classification can be used instead. Association
relates one field to another, which cannot be done with numeric values. See
Figure 2 for the difference between the Diabetes and Weather.nominal datasets.
For now, we will select the following datasets for the upcoming tests.
- Bank
- Zoo
- Weather.nominal
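The selection above can also be checked mechanically by reading the attribute declarations in an ARFF header. The sketch below is a simplified, hypothetical helper (`non_nominal_attributes` is not part of Weka) that assumes unquoted attribute names and one declaration per line; the two headers are abbreviated stand-ins, not the full datasets.

```python
def non_nominal_attributes(arff_text):
    """Return names of attributes whose ARFF type is not a {...} value list."""
    result = []
    for line in arff_text.splitlines():
        parts = line.strip().split(None, 2)
        if len(parts) == 3 and parts[0].lower() == "@attribute":
            name, attr_type = parts[1], parts[2]
            # Nominal attributes are declared as a brace-enclosed value list.
            if not attr_type.startswith("{"):
                result.append(name)
    return result


# Abbreviated, hypothetical headers used only for illustration.
numeric_header = """@relation diabetes_like
@attribute plas numeric
@attribute age numeric
@attribute class {tested_negative,tested_positive}
"""
nominal_header = """@relation weather_like
@attribute outlook {sunny,overcast,rainy}
@attribute play {yes,no}
"""

print(non_nominal_attributes(numeric_header))  # ['plas', 'age']
print(non_nominal_attributes(nominal_header))  # []
```

An empty result means every attribute is nominal, so the dataset can be passed to Apriori as it is; a non-empty result lists the fields that would first need to be removed or discretized.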
Figure 3. The “Start” button in the Associate tab is inactive because the bank
dataset contains numeric data.
4. Preparing the Datasets for Association Rules Test
After exploring the Bank, Zoo and Weather.nominal datasets, we found that the
fields in the Bank dataset are not all nominal: there are three numeric fields,
namely age, income, and children. To create association rules, we need to use
the Apriori algorithm, which is very sensitive to the data type. If we try to
use the Apriori algorithm while the Bank dataset is open, the “Start” button
will not be active, as shown in Figure 3.
To change numeric data into nominal data, we need to use filters to discretize
the data. Discretizing means mapping each range of numeric values to a single
nominal value. For example, suppose age has a value between 15 and 60 years.
After discretizing, the new values will be either YOUNG or OLD: ages from 15 to
22.5 take the value YOUNG and ages above 22.5 up to 60 take the value OLD. We
can also refine the discretization by using more than two bins, for example
from 15 to 30 is YOUNG, from 31 to 45 is MIDDLE_AGED and from 46 to 60 is OLD,
and so forth. From Weka we are going to use two types of filters:
- Remove filter: In the "Filter" panel, click on the "Choose" button. A window with a list of available filters will be displayed. Scroll down in the list and select "weka.filters.unsupervised.attribute.Remove".
- Discretize filter: In the "Filter" panel, click on the "Choose" button. A window with a list of available filters will be displayed. Scroll down in the list and select "weka.filters.unsupervised.attribute.Discretize".
The filter panel is shown in Figure 4.
Figure 4. Selecting the Discretize filter
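The three-bin age example above can be sketched in a few lines. This is a plain illustration of the idea rather than Weka's Discretize filter; the function name `discretize_age` is made up, the cut points 30 and 45 are taken from the example ranges, and the input ages are invented.

```python
def discretize_age(age):
    """Map a numeric age in the range 15..60 to one of three nominal labels."""
    if not 15 <= age <= 60:
        raise ValueError("age outside the assumed 15-60 range")
    if age <= 30:
        return "YOUNG"
    if age <= 45:
        return "MIDDLE_AGED"
    return "OLD"


ages = [15, 30, 31, 45, 46, 60]  # hypothetical input values
labels = [discretize_age(a) for a in ages]
print(labels)
# ['YOUNG', 'YOUNG', 'MIDDLE_AGED', 'MIDDLE_AGED', 'OLD', 'OLD']
```

After this step the age field holds only three distinct labels, which is the kind of nominal attribute Apriori can work with.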
After selecting the required filter, its parameters will be displayed beside
the “Choose” button. You can still change the parameters by clicking on them.
After setting all parameters, click “Apply” to apply the filter. We need to do
the following:
- For the Bank dataset, remove the ID field and discretize the age, income and children fields.
- For the Zoo dataset, remove the legs field.
- For Weather.nominal, no action is needed.
Finally, it is better to save each dataset into a new file after applying the
filters.
5. Association Rules Mining
In this section we select suitable parameters for each Apriori test. The three
datasets are used one after another: first the bank dataset, then the zoo
dataset and finally the weather.nominal dataset. After selecting suitable
parameters for each test, we pick out the interesting rules.
Bank Dataset: First we open the bank file in the Weka Explorer and click the
"Associate" tab, which opens an interface for association rule algorithms. The
Apriori algorithm is selected by default, and all experiments in this report
are done with it. To change the parameters of this test, click on the text box
beside the name of the algorithm. This box always shows the parameters used for
the current test. The dialog box containing the parameters and their values is
depicted in Figure 5. Here, the various parameters associated with Apriori can
be changed.
Figure 5. Setting the parameters for the Apriori algorithm before applying it
to the bank dataset
Lift is a metric type used in the Apriori algorithm, and we are going to use it
in our test. Lift is calculated by dividing the rule confidence by the support
of the right-hand side (RHS). More formally, for a rule A => B, lift is the
probability that A and B occur together divided by the product of the
probabilities of A and of B occurring separately. A lift of 1 means that A and
B are independent, so the rule carries no information. A lift greater than 1
means that A and B occur together more often than independence would predict,
so they are likely associated. The following parameters are set to the values
specified:
- The maximum number of rules to be displayed is 10; they will be sorted according to their lift value.
- The maximum support value is set to 1.0 and the minimum support is set to 0.1.
- The Apriori algorithm starts creating rules at the maximum support and stops when it has either found the specified number of rules or reached the minimum support. At each step the support is lowered by the delta value, set to 0.05. All parameters are shown in Figure 5.
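As a worked illustration of the lift measure, the sketch below computes support, confidence and lift for a rule A => B over a handful of records, using lift(A => B) = confidence(A => B) / support(B). The records and the helper names are invented for this example and are not taken from the bank dataset.

```python
def support(records, items):
    """Fraction of records that contain every item in `items`."""
    items = set(items)
    return sum(1 for r in records if items <= r) / len(records)


def confidence(records, lhs, rhs):
    """Estimate P(rhs | lhs) as support(lhs and rhs) / support(lhs)."""
    return support(records, set(lhs) | set(rhs)) / support(records, lhs)


def lift(records, lhs, rhs):
    """Confidence of lhs => rhs divided by the support of rhs."""
    return confidence(records, lhs, rhs) / support(records, rhs)


# Hypothetical records; each is the set of attribute=value pairs it contains.
records = [
    {"married=NO", "pep=YES"},
    {"married=NO", "pep=YES"},
    {"married=YES", "pep=NO"},
    {"married=YES", "pep=YES"},
]

rule_lift = lift(records, ["married=NO"], ["pep=YES"])
print(rule_lift)
```

In this toy data the confidence of married=NO => pep=YES is 1.0 and the support of pep=YES is 0.75, so the lift is about 1.33. Being greater than 1, it suggests the two sides occur together more often than they would if they were independent.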
Apply the Apriori algorithm by clicking on the “Start”
button. The following data will be displayed as an output of the Apriori
algorithm.
As shown in the output, we can conclude that the best rules are the first and
the second one, with lift = 1.92. We can translate the first rule into plain
English by saying “Most unmarried bank customers have a PEP (Personal Equity
Plan)”. It is a really interesting discovery.
Zoo Dataset: Now we need to set the parameters for testing the zoo dataset.
We are going to use the same parameters as for the bank dataset, as follows:
- The maximum number of rules to be displayed is 10; they will be sorted according to their lift value.
- The maximum support value is set to 1.0 and the minimum support is set to 0.1.
- The Apriori algorithm starts creating rules at the maximum support and stops when it has either found the specified number of rules or reached the minimum support. At each step the support is lowered by the delta value, set to 0.05.
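The way the support threshold is lowered step by step can be sketched as a simple loop. This is a simplified reading of the description above, not Weka's source code: `count_rules` is a hypothetical callback standing in for the actual rule generation, and the toy rule counts are invented.

```python
def find_support_threshold(count_rules, num_rules=10,
                           upper=1.0, lower=0.1, delta=0.05):
    """Lower the minimum support from `upper` in steps of `delta` until
    enough rules are found or the lower bound is reached."""
    steps = 0
    while True:
        # Compute from the step count to avoid accumulating float error.
        min_support = max(lower, round(upper - steps * delta, 10))
        if count_rules(min_support) >= num_rules or min_support <= lower:
            return min_support
        steps += 1


# Hypothetical stand-in: pretend enough rules appear once support is 0.5.
def toy_count_rules(min_support):
    return 12 if min_support <= 0.5 else 3


print(find_support_threshold(toy_count_rules))
```

With these toy counts the loop stops at a support of 0.5, because that is the first threshold at which at least 10 rules are available. If the requested number of rules is never reached, it stops at the minimum support of 0.1 instead.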
Apply the Apriori algorithm by clicking on the “Start”
button. The following data will be displayed as an output of the Apriori algorithm.
The first interesting rule means “Animals with a backbone are not venomous and
have a tail”. Interesting!
Weather.nominal Dataset: We are going to use the same parameters as for the zoo
test, changing only the metric type to confidence instead of lift. The
remaining parameters are the same as for the zoo dataset.
- The maximum number of rules to be displayed is 10; they will be sorted according to their confidence value.
- The maximum support value is set to 1.0 and the minimum support is set to 0.1.
- The Apriori algorithm starts creating rules at the maximum support and stops when it has either found the specified number of rules or reached the minimum support. At each step the support is lowered by the delta value, set to 0.05.
Apply the Apriori algorithm by clicking on the “Start”
button. The following data will be displayed as an output of the Apriori
algorithm.
6. Conclusion
In this report we have seen how to use Weka to extract the most useful rules
from a dataset. We applied the Apriori algorithm to three datasets and
extracted the 10 most interesting rules for each one. Not all datasets are
suitable for association rule mining: only attributes with nominal values can
be used.
7. Resources
- Association Rules Mining Basics [PDF Slides]
- Weka Tutorial [PDF slides]
- Association rules Exercise [PDF] and Solutions [PDF]
- Dataset [Winrar File]
- PlagScan to detect plagiarism.