Exercises used for the Classification course.

Examples of case studies used for assessment in this course.

Exercise A (used in Spring 1993 and 1996).

Design and write a procedure for the variable-linkage method. Combine it with your procedure for calculating the similarity or distance matrix to obtain software for analysing sets of data. Choose suitable sets of test data for the seven examples covered by the data-generation program and present the output as tables or as dendrograms. You will need to study which methods are best suited to which sets of data, and at which values of g the variable-linkage method changes from one type of behaviour to the other.

This will of course require some consideration of what is meant by the "same" behaviour. If all the same objects are grouped together at precisely the same similarity levels, then the behaviour is identical. However, this definition is too restrictive and would result in very few methods being judged to give the same behaviour. The loosest definition accepts as the "same" any two methods which have the same objects grouped together in the final two or three clusters. A slightly more rigid definition accepts any methods which group the objects together in the same order, but at different levels of similarity. This is probably the most realistic definition to take, but you will have to state your choice when you write your report.
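The three definitions above can be made concrete with a short sketch. It is only an illustration and rests on an assumption that is not part of the course software: each method's result is recorded as a list of (level, members) pairs, one per fusion, in the order the clusters were formed, where members is a frozenset of the objects in the newly formed cluster. All function names are hypothetical.

```python
def identical(h1, h2, tol=1e-9):
    """Strictest definition: same clusters formed at the same levels."""
    return len(h1) == len(h2) and all(
        m1 == m2 and abs(l1 - l2) <= tol
        for (l1, m1), (l2, m2) in zip(h1, h2)
    )


def same_order(h1, h2):
    """Same clusters formed in the same order; levels are ignored."""
    return [m for _, m in h1] == [m for _, m in h2]


def final_partition(h, objects, k):
    """Clusters present when only k clusters remain."""
    clusters = {frozenset([o]) for o in objects}
    # Each fusion reduces the number of clusters by one.
    for _, members in h[: len(objects) - k]:
        clusters = {c for c in clusters if not c <= members}
        clusters.add(members)
    return clusters


def same_final_clusters(h1, h2, objects, k=2):
    """Loosest definition: the final k clusters contain the same objects."""
    return final_partition(h1, objects, k) == final_partition(h2, objects, k)


# Invented results for four objects, purely for illustration.
AB, CD = frozenset("AB"), frozenset("CD")
ABC, ABCD = frozenset("ABC"), frozenset("ABCD")
method_1 = [(0.9, AB), (0.8, CD), (0.5, ABCD)]
method_2 = [(0.9, AB), (0.7, CD), (0.2, ABCD)]
method_3 = [(0.9, AB), (0.6, ABC), (0.3, ABCD)]

print(identical(method_1, method_2))                        # False
print(same_order(method_1, method_2))                       # True
print(same_final_clusters(method_1, method_3, "ABCD", 2))   # False
print(same_final_clusters(method_1, method_3, "ABCD", 3))   # True
```

Note that same_order as written is sensitive to ties: two methods that form unrelated clusters in a different sequence would be reported as different, so you may wish to relax it when you state your own definition.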

Having written the software, tested it and used it to study these sets of data, you will need to produce a written report on your work for the assessment for this course. Although discussion between students is encouraged during the early stages of this work, the conclusions and written report should be entirely your own work.

Exercise B (used in Spring 1994 and Spring 1997).

Use the data-generation program to produce nine sets of data and use the procedure you designed to calculate a distance matrix for numeric data. Although the test data are only in two dimensions, if these are considered to be the first two principal axes, then your results will be applicable to more general problems.

The aim of the exercise is to study and compare the hypersphere method and the mode analysis method for each of the nine sets of data. Since each student will have slightly different sets of data, the results obtained will differ in many of the cases. You should start by producing plots of your data and describing what you expect to find. Then write software for these two methods and apply them to your data sets. The results may or may not be what you expect. You will need to discuss these results in your description of the project.

Full documentation of the program is not required for this assessment, since the object of the course is to study classification methods, but you will need some documentation for your own use. You will need to test the program to ensure that it is working correctly, and the written description of this project work will need to include the results from the programs and your analysis of these results.

Exercise C (used in Spring 1995 and Spring 1998).

1. Use the data-generation program to produce your own seven sets of data. The data are numeric and consist of (x,y) pairs for each set of data.

2. Write a procedure to calculate a distance matrix using the general Minkowski metric (or use a procedure written by someone else - I'm assessing the classification method, not the software).

3. Use this procedure in a program to obtain clusters using the four metric methods. Remember the general equation for these methods takes the form:

d(P+Q,R) = a1*d(P,R) + a2*d(Q,R) + b*d(P,Q)

and different values of these parameters will give the four methods (Centroid-sorting Method, Gower's Median Method, Ward's Method and Lance-Williams Flexible-b Method).

4. Taking the recommended value of b=-0.25 for the Lance-Williams method, apply each method in turn to each set of data for both the Manhattan metric and the Euclidean metric. Find some reasonably efficient way of presenting your results (e.g. a dendrogram, or a plot of each result with different clusters indicated by a different colour or symbol) and use it to indicate whether or not the four methods give the same results as each other for each set of data. Also check whether changing the metric changes the results.

5. Produce a short report describing your study and what results you have obtained. Note that each of you has a different set of data and so may obtain different results. You should share the writing and testing of software, but applying this to your own data and interpreting what you have found should be entirely your own work. Remember that your friend's results may not apply to your data.

6. If time permits, you may experiment with additional values of b for the Lance-Williams method or see whether different metrics using higher values of r give different results. By now, you should have found which data set is most likely to show a variation.
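Steps 2 to 4 can be sketched in a few lines of Python. This is a minimal illustration, not the required implementation: the function names and the toy points are invented, and the coefficients are the ones usually quoted in textbooks for these four methods when written in the form d(P+Q,R) = a1*d(P,R) + a2*d(Q,R) + b*d(P,Q). The centroid, median and Ward coefficients are normally derived for squared Euclidean distances; whether to square the input matrix is left to whatever convention was given in the lectures.

```python
def minkowski_matrix(points, r=2.0):
    """Distance matrix using (sum |x_i - y_i|**r)**(1/r).

    r=1 gives the Manhattan metric and r=2 the Euclidean metric.
    """
    n = len(points)
    return [
        [sum(abs(a - b) ** r for a, b in zip(points[i], points[j])) ** (1.0 / r)
         for j in range(n)]
        for i in range(n)
    ]


def coefficients(method, n_p, n_q, n_r, b=-0.25):
    """Return (a1, a2, b) for the update equation, given cluster sizes."""
    if method == "centroid":
        n = n_p + n_q
        return n_p / n, n_q / n, -n_p * n_q / n ** 2
    if method == "median":
        return 0.5, 0.5, -0.25
    if method == "ward":
        n = n_p + n_q + n_r
        return (n_p + n_r) / n, (n_q + n_r) / n, -n_r / n
    if method == "flexible":
        return (1 - b) / 2, (1 - b) / 2, b
    raise ValueError("unknown method: " + method)


def agglomerate(dist, method, b=-0.25):
    """Fuse the closest pair repeatedly; return (level, members) per fusion."""
    members = {i: frozenset([i]) for i in range(len(dist))}
    # Distances between active clusters, keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j] for i in members for j in members if i < j}
    merges = []
    next_id = len(dist)
    while len(members) > 1:
        (p, q), level = min(d.items(), key=lambda kv: kv[1])
        for r in members:
            if r in (p, q):
                continue
            a1, a2, beta = coefficients(
                method, len(members[p]), len(members[q]), len(members[r]), b)
            d_pr = d[min(p, r), max(p, r)]
            d_qr = d[min(q, r), max(q, r)]
            d[r, next_id] = a1 * d_pr + a2 * d_qr + beta * level
        # Drop every distance that involves the two clusters just fused.
        d = {k: v for k, v in d.items() if p not in k and q not in k}
        members[next_id] = members.pop(p) | members.pop(q)
        merges.append((level, members[next_id]))
        next_id += 1
    return merges


# Invented toy data: two well-separated pairs of points.
points = [(0, 0), (0, 1), (5, 0), (5, 1)]
for r in (1, 2):  # Manhattan, then Euclidean
    dist = minkowski_matrix(points, r)
    for method in ("centroid", "median", "ward", "flexible"):
        print(r, method, [sorted(m) for _, m in agglomerate(dist, method)])
```

On data this simple all four methods and both metrics fuse the objects in the same order; your own data sets are where any differences will appear. For step 6, pass a different b to agglomerate or a larger r to minkowski_matrix.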