Osama Hosam: Association Rules Mining

1. Introduction

In this report, we give a short review of Weka: how to download it, what its latest version is, and so on. We then present our datasets and explore them to pick the ones suitable for association rules mining. Parameters are set before applying the Apriori algorithm, which is mainly used to extract the best rules in a relation. We comment on the results and finally give our conclusion.

2. WEKA review

Weka is a collection of machine learning algorithms written in Java. It is developed at the University of Waikato in New Zealand and has been successfully tested under Linux, Windows, and Macintosh operating systems. Weka is used for data pre-processing, classification, regression, clustering, association rules, and visualization. It is open source software issued under the GNU General Public License.

Figure 1. Weka Explorer

Versions: Weka 3.4.x is the most recent stable version. There is also a developer version, Weka 3.5.x. As normal users, we are going to use version 3.4.x.

Requirements: Java must be installed before installing Weka. All Weka versions up to 3.4.x need Java 1.4 or later. From Weka 3.5.3 onwards, Java 5.0 is required.

Download: Weka can be downloaded from its main website (http://www.cs.waikato.ac.nz/~ml/weka/), where you will find versions for all operating systems, such as Mac OS, Windows, and Linux.

GUI: Weka provides a GUI (Graphical User Interface) for data processing. For developers, Weka can also be integrated with Java code, so all functionality in the GUI can be called from within Java. The main components of the Weka GUI are the Explorer, the Experimenter, the Knowledge Flow and the CLI. Clicking on the Explorer button displays the screen in Figure 1.

3. Dataset Selection

File Formats: Two file types are mainly used in Weka, namely ARFF and CSV. An ARFF (Attribute-Relation File Format) file is a text file that declares the attributes of a relation and then lists all of its instances. A CSV (Comma Separated Values) file stores the data in tabular form. Converters in Weka can be used to convert from one file format to the other; for example, it is easy to convert from CSV to ARFF and vice versa. Figure 2 shows two such datasets. To display any file, use the “Open File” button in the Explorer.

Figure 2. (left) Diabetes dataset, in which almost all attributes are numeric. (right) Weather.nominal dataset, in which all attributes are nominal.
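As a rough illustration of the ARFF layout described above, here is a minimal Python sketch that turns CSV text into ARFF text. It assumes every column is nominal. The function name csv_to_arff and the three sample rows are our own, made up for this example, and are not taken from the Weather.nominal file. In practice the converters built into Weka should be preferred.

```python
import csv
import io

def csv_to_arff(csv_text, relation):
    """Convert CSV text whose columns are all nominal into ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for i, name in enumerate(header):
        # A nominal attribute lists its allowed values inside braces.
        values = sorted({row[i] for row in data})
        lines.append("@attribute " + name + " {" + ",".join(values) + "}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines)

# Hypothetical sample rows, only for illustration.
sample = "outlook,windy,play\nsunny,FALSE,no\nrainy,TRUE,no\novercast,FALSE,yes\n"
arff = csv_to_arff(sample, "weather.sample")
print(arff)
```

The header section is what distinguishes ARFF from CSV: each attribute is declared together with its type, here a set of nominal values, before the data rows start.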

Datasets: Selection of data depends on its suitability for association rules mining. Association rules mining works only with nominal data. Nominal data has a fixed set of discrete states, such as the attribute “Sex”, which has only two values, either MALE or FEMALE. Searching the available datasets for suitable ones, we take the following as examples:

Bank
Sonar
Zoo
Diabetes
Weather.nominal

When we look at the Sonar and Diabetes datasets, we find that almost all the fields have numerical values. So association is not possible with these fields as they are, because they are not discrete; classification can be used instead. Association relates the values of one field to those of another, which cannot be done directly with numerical values. See Figure 2 for the difference between the Diabetes and Weather.nominal datasets. For now, we select the following datasets for the upcoming tests:

Bank
Zoo
Weather.nominal

Figure 3. The “Start” button in the Associate tab is inactive because the Bank dataset contains numeric data.

4. Preparing the Datasets for Association Rules Test

After exploring the Bank, Zoo and Weather.nominal datasets, we found that the fields in the Bank dataset are not all nominal; there are three numerical fields, namely age, income, and children. To create association rules, we use the Apriori algorithm, which is very sensitive to the data type. If we try to use Apriori while the Bank dataset is open, the “Start” button will not be active, as shown in Figure 3.

To change numeric data into nominal data, we need to use filters to discretize the data. Discretizing means mapping each range of numeric values to a single nominal value. For example, age takes values between 15 and 60 years. After discretizing, the new values are either YOUNG or OLD: ages from 15 to 22.5 take the value YOUNG and ages from 22.6 to 60 take the value OLD. We can also discretize more finely by using more than two bins, for example 15 to 30 is YOUNG, 31 to 45 is MIDDLE_AGED, 46 to 60 is OLD, and so forth. From Weka we are going to use two types of filters:

Remove filter: In the "Filter" panel, click on the "Choose" button. A window with a list of available filters will be displayed. Scroll down the list and select "weka.filters.unsupervised.attribute.Remove".
Discretize filter: In the "Filter" panel, click on the "Choose" button. A window with a list of available filters will be displayed. Scroll down the list and select "weka.filters.unsupervised.attribute.Discretize". The filter panel is shown in Figure 4.

Figure 4. Selecting the Discretize filter

After selecting the required filter, its parameters are displayed beside the “Choose” button. You can still change the parameters by clicking on them. After setting all parameters, click “Apply” to apply the filter. We need to do the following:

For the Bank dataset, remove the ID field and discretize the age, income and children fields.
For the Zoo dataset, remove the legs field.
For Weather.nominal, no action is needed.

Finally, it is better to save each dataset to a new file after applying the filters.
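To make the idea of discretizing concrete, here is a minimal Python sketch of equal-width binning, which as far as we know is the kind of binning the unsupervised Discretize filter applies unless told otherwise. The function name equal_width_bins and the list of ages are hypothetical and are not taken from the bank file. Weka itself produces range labels rather than names such as YOUNG, so the labels here are only for readability.

```python
def equal_width_bins(values, labels):
    """Map each numeric value to a label using bins of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / len(labels)
    if width == 0:
        # All values are identical, so everything falls in the first bin.
        return [labels[0]] * len(values)
    result = []
    for v in values:
        # The maximum value would land just past the last bin, so clamp it.
        index = min(int((v - low) / width), len(labels) - 1)
        result.append(labels[index])
    return result

# Hypothetical ages between 15 and 60, only for illustration.
ages = [15, 22, 29, 38, 47, 60]
binned = equal_width_bins(ages, ["YOUNG", "MIDDLE_AGED", "OLD"])
print(binned)
```

With three bins over the range 15 to 60, each bin is 15 years wide, which matches the three-bin example given earlier. In this sketch a value that sits exactly on a boundary goes into the upper bin.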

5. Association Rules Mining

In this section we select suitable parameters for each Apriori test. The three datasets are used one after another: first the bank dataset, then the zoo dataset and finally the weather.nominal dataset. After selecting suitable parameters for each test, we pick out the interesting rules.

Bank Dataset: First we open the bank file in the Weka Explorer and click the "Associate" tab, which opens an interface for association rule algorithms. The Apriori algorithm is selected by default, and all experiments in this report use it. To change the parameters of this test, click on the text box beside the name of the algorithm; this box always shows the parameters used for the current test. The dialog box containing the parameters and their values is depicted in Figure 5. Here, the various parameters associated with Apriori can be changed.

Figure 5. Setting the parameters for Apriori algorithm before applying it on the bank dataset

Lift is one of the metric types available in the Apriori algorithm, and we are going to use it in our test. Lift is calculated by dividing the rule confidence by the support of the Right Hand Side (RHS). More formally, for a rule A => B, lift is the probability that A and B happen together divided by the product of the probabilities that A and B each happen on their own, that is P(A and B) / (P(A) * P(B)). If the lift value is 1, A and B are independent: they occur together only as often as chance would predict. If the lift is greater than 1, A and B occur together more often than chance would predict, so they are likely to be associated. The following parameters are set to the values specified:

The maximum number of rules to be displayed is 10; they are sorted according to their lift value.
The maximum support value is set to 1.0 and the minimum support is set to 0.1.
Apriori starts searching for rules at the maximum support and lowers the support threshold by the delta value, set to 0.05, at each step. It stops when it has found the specified number of rules or has reached the minimum support. All parameters are shown in Figure 5.
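The way these parameters interact can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not Weka's implementation: the function names, the min_lift cut-off of 1.1 and the five toy transactions are all hypothetical. It only shows how the support threshold is lowered by delta until enough rules are found.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search for itemsets meeting min_support."""
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items
             if support(frozenset([i]), transactions) >= min_support]
    found = list(level)
    while level:
        # Join frequent k-itemsets into (k+1)-item candidates, keep the frequent ones.
        candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        level = [c for c in candidates if support(c, transactions) >= min_support]
        found.extend(level)
    return found

def rules_by_lift(transactions, min_support, min_lift):
    """Rules lhs => rhs built from frequent itemsets, best lift first."""
    rules = []
    for itemset in frequent_itemsets(transactions, min_support):
        for size in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(sorted(itemset), size)):
                rhs = itemset - lhs
                lift = support(itemset, transactions) / (
                    support(lhs, transactions) * support(rhs, transactions))
                if lift >= min_lift:
                    rules.append((lift, sorted(lhs), sorted(rhs)))
    return sorted(rules, reverse=True)

def mine(transactions, num_rules=10, upper=1.0, lower=0.1, delta=0.05, min_lift=1.1):
    """Lower the support threshold step by step until enough rules are found."""
    steps = round((upper - lower) / delta)
    for k in range(steps + 1):
        # Rounding avoids floating point drift when subtracting delta repeatedly.
        threshold = round(upper - k * delta, 10)
        rules = rules_by_lift(transactions, threshold, min_lift)
        if len(rules) >= num_rules:
            break
    return threshold, rules[:num_rules]

# Hypothetical transactions, only for illustration.
transactions = [
    {"married=NO", "pep=YES"}, {"married=NO", "pep=YES"},
    {"married=YES", "pep=NO"}, {"married=YES", "pep=NO"},
    {"married=YES", "pep=YES"},
]
threshold, rules = mine(transactions, num_rules=2)
print(threshold, rules)
```

On this toy data no rule is found until the threshold has dropped to 0.4, which is why a low minimum support combined with a small delta lets the search stop as early as possible.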

Apply the Apriori algorithm by clicking on the “Start” button. The following data will be displayed as an output of the Apriori algorithm.

As shown in the figure, the best rules are the first and the second, with lift = 1.92. We can translate the first rule into plain English as “Most unmarried bank customers have a pep, or Personal Equity Plan.” It is a really interesting discovery.
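To see how a lift value like this arises, here is a minimal Python sketch that computes support, confidence and lift from raw counts. The function name rule_metrics and the counts are hypothetical; they are not taken from the bank dataset and do not reproduce the 1.92 above.

```python
def rule_metrics(n_total, n_lhs, n_rhs, n_both):
    """Support, confidence and lift of a rule LHS => RHS from raw counts."""
    support = n_both / n_total
    confidence = n_both / n_lhs
    # Lift is the confidence divided by the support of the right hand side.
    lift = confidence / (n_rhs / n_total)
    return support, confidence, lift

# Hypothetical counts: 100 customers, 40 match the LHS, 50 match the RHS, 30 match both.
support, confidence, lift = rule_metrics(n_total=100, n_lhs=40, n_rhs=50, n_both=30)
print(support, confidence, lift)
```

Here the lift is 1.5, meaning customers who match the left hand side are one and a half times as likely to match the right hand side as a customer picked at random.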

Zoo Dataset: Now we set the parameters for testing the zoo dataset. We use the same parameters as for the bank dataset, as follows:

The maximum number of rules to be displayed is 10; they are sorted according to their lift value.
The maximum support value is set to 1.0 and the minimum support is set to 0.1.
Apriori starts searching for rules at the maximum support and lowers the support threshold by the delta value, set to 0.05, at each step. It stops when it has found the specified number of rules or has reached the minimum support.

Apply the Apriori algorithm by clicking on the “Start” button. The following data will be displayed as an output of the Apriori algorithm.

The first interesting rule means “Animals with a backbone are not venomous and have a tail”. Interesting!

Weather.Nominal Dataset: We use the same parameters as for the zoo test, changing only the metric type to confidence instead of lift. The remaining parameters are the same as for the zoo dataset.

The maximum number of rules to be displayed is 10; they are sorted according to their confidence value, since confidence is now the metric type.
The maximum support value is set to 1.0 and the minimum support is set to 0.1.
Apriori starts searching for rules at the maximum support and lowers the support threshold by the delta value, set to 0.05, at each step. It stops when it has found the specified number of rules or has reached the minimum support.

Apply the Apriori algorithm by clicking on the “Start” button. The following data will be displayed as an output of the Apriori algorithm.

6. Conclusion

In this report we have seen how to use Weka to extract the best rules from a dataset. We applied the Apriori algorithm to three datasets and extracted the 10 most interesting rules for each. Not all datasets are suitable for association rules mining: only attributes with nominal values can be used.

7. Resources

Association Rules Mining Basics [PDF Slides]
Weka Tutorial [PDF slides]
Association rules Exercise [PDF] and Solutions [PDF]
Dataset [Winrar File]
PlagScan to detect plagiarism.
