Examples of case studies used for assessment in this course.
Exercise A (used in Spring 1993 and Spring 1996).
Design and write a procedure for the variable-linkage method. Include this with
your procedure for calculating the
similarity or distance matrix to obtain software for an analysis of sets of
data. Choose suitable sets of test data for the seven examples covered by the
data-generation program and present the output as tables or as dendrograms. You
will need to study which methods are best suited to which sets of data and at
which values of g the variable-linkage method changes from one type of
behaviour to the other.
This will of course require some consideration of
what is meant by the "same" behaviour. Obviously, if all the same objects are
grouped together at precisely the same similarity levels, then the behaviour is
identical. However, this is too restrictive a definition and would result in
very few methods giving similar behaviour. The loosest definition is to
accept as the "same", any two methods which have the same objects grouped
together in the final two or three clusters. A slightly more rigid definition
accepts any methods which group the objects together in the same order, but at
different levels of similarity. This is probably the most realistic definition
to take, but you will have to state your choice when you write your report.
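
The three definitions above can be made concrete with a small sketch. It assumes (this representation is not part of the course material) that each method's output is recorded as a list of merge steps, each step holding the two clusters joined and the level at which they join. The function names are illustrative only.

```python
from typing import FrozenSet, List, Set, Tuple

# One merge step: the two clusters joined and the level at which they join.
# This representation is an assumption made for illustration.
Merge = Tuple[FrozenSet[int], FrozenSet[int], float]


def same_order(a: List[Merge], b: List[Merge]) -> bool:
    """Same objects grouped in the same order; levels are ignored."""
    return len(a) == len(b) and all(
        {x[0], x[1]} == {y[0], y[1]} for x, y in zip(a, b)
    )


def identical(a: List[Merge], b: List[Merge], tol: float = 1e-9) -> bool:
    """Strictest definition: same groupings at the same levels."""
    return same_order(a, b) and all(
        abs(x[2] - y[2]) <= tol for x, y in zip(a, b)
    )


def partition_at(merges: List[Merge], n_objects: int, k: int) -> Set[FrozenSet[int]]:
    """Clusters that remain once only k clusters are left."""
    clusters = {frozenset([i]) for i in range(n_objects)}
    for left, right, _level in merges[: n_objects - k]:
        clusters -= {left, right}
        clusters.add(left | right)
    return clusters


def same_final_clusters(a: List[Merge], b: List[Merge],
                        n_objects: int, k: int = 2) -> bool:
    """Loosest definition: the final k clusters contain the same objects."""
    return partition_at(a, n_objects, k) == partition_at(b, n_objects, k)
```

Two methods that pass same_order but fail identical group the objects in the same sequence at different levels, which is the middle definition described above.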
Having written the software, tested it and used it
to study these sets of data, you will need to produce a written report on your
work for the assessment for this course. Although discussion between students
is encouraged during the early stages of this work, the conclusions and written
report should be entirely your own work.
Exercise B (used in Spring 1994 and Spring 1997).
Use the data-generation program to produce nine sets of data, and use the
procedure you
designed to calculate a distance matrix for numeric data. Although the test
data are only in two dimensions, if these are considered to be the first two
principal axes, then your results will obviously be applicable to more general
problems.
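
As a reminder of what such a procedure might look like, here is a minimal sketch. It assumes a Minkowski-type distance with exponent r (r=1 gives the Manhattan metric, r=2 the Euclidean metric); the function name and the list-of-lists layout are choices made for this example, not the procedure written in the course.

```python
from typing import List, Sequence


def distance_matrix(points: Sequence[Sequence[float]],
                    r: float = 2.0) -> List[List[float]]:
    """Symmetric matrix of Minkowski distances between all pairs of points."""
    n = len(points)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Sum |x_i - x_j|^r over every coordinate, then take the r-th root.
            total = sum(abs(a - b) ** r for a, b in zip(points[i], points[j]))
            d[i][j] = d[j][i] = total ** (1.0 / r)
    return d
```

Because the loop runs over however many coordinates each point has, the same sketch works unchanged for data in more than two dimensions.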
The aim of the exercise is to study and compare the hypersphere method and the mode
analysis method for each of the nine sets of data. Since each student will have
slightly different sets of data, the results obtained will differ in many of
the cases. You should start by producing plots of your data and describing what
you expect to find. Then write software for these two methods and apply them to
your data sets. The results may or may not be what you expect. You will need to
discuss these results in your description of the project.
Full documentation of the program is not required for this assessment, since the
object of the course is to study classification methods, but you will need some
documentation for your own use. You will need to test the program to ensure that
it is working correctly, but the written description of this project work will
need to include the results from the programs and your analysis of these results.
Exercise C (used in Spring 1995 and Spring 1998).
1. Use the data-generation program to produce your own seven sets of data. The
data are numeric and consist of (x,y) pairs for each set of data.
2. Write a procedure to calculate a distance matrix using the general Minkowski
metric (or use a procedure written by someone else - I'm assessing the
classification method, not the software).
3. Use this procedure in a program to obtain clusters using the four metric
methods. Remember the general equation for these methods takes the form:
d(P+Q,R) = a1* d(P,R) + a2* d(Q,R) + b* d(P,Q)
and different values of these parameters will give the four methods
(Centroid-sorting Method, Gower's Median Method, Ward's Method and
Lance-Williams Flexible-b Method).
4. Taking the recommended value of b=-0.25 for the Lance-Williams method, apply
each method in turn to each set of data for both the Manhattan and Euclidean
metrics.
Find some reasonably efficient way of presenting your data (e.g. a plot of each
result with different clusters indicated by a different colour or symbol or a
dendrogram) and use it to indicate whether or not the four methods give the
same results as each other for each set of data. Also check whether the
two metrics give different results.
5. Produce a short report describing your study and what results you have
obtained. Note that each of you has a different set of data and so may obtain
different results. You should share the writing and testing of software, but
applying this to your own data and
interpreting what you have found should be entirely your own work. Remember
that your friend's results may not apply to your data.
6. If time permits, you may experiment with additional values of b for the
Lance-Williams method or see whether different metrics using higher values of
the Minkowski exponent r give different
results. By now, you should have found which data set is most likely to show a
variation.
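
The update equation in step 3 can be sketched as follows. The coefficient choices are the commonly quoted ones for the four methods, where nP, nQ and nR are the numbers of objects in clusters P, Q and R; the centroid, median and Ward forms are usually stated for squared Euclidean distances, so check them against your course notes. The function names and the dictionary-based bookkeeping are assumptions made for this example only.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Cluster = FrozenSet[int]
# Given sizes (nP, nQ, nR), return the parameters (a1, a2, b).
Coeffs = Callable[[int, int, int], Tuple[float, float, float]]


def centroid(n_p: int, n_q: int, n_r: int) -> Tuple[float, float, float]:
    n = n_p + n_q
    return n_p / n, n_q / n, -n_p * n_q / n ** 2


def median(n_p: int, n_q: int, n_r: int) -> Tuple[float, float, float]:
    return 0.5, 0.5, -0.25


def ward(n_p: int, n_q: int, n_r: int) -> Tuple[float, float, float]:
    n = n_p + n_q + n_r
    return (n_p + n_r) / n, (n_q + n_r) / n, -n_r / n


def flexible(b: float = -0.25) -> Coeffs:
    def coeffs(n_p: int, n_q: int, n_r: int) -> Tuple[float, float, float]:
        return (1 - b) / 2, (1 - b) / 2, b
    return coeffs


def agglomerate(dist: List[List[float]],
                coeffs: Coeffs) -> List[Tuple[Cluster, Cluster, float]]:
    """Repeatedly merge the closest pair; return the merges as (P, Q, level)."""
    clusters: List[Cluster] = [frozenset([i]) for i in range(len(dist))]
    d: Dict[FrozenSet[Cluster], float] = {}
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d[frozenset([clusters[i], clusters[j]])] = dist[i][j]
    merges = []
    while len(clusters) > 1:
        pair = min(d, key=d.get)           # closest pair of clusters
        p, q = sorted(pair, key=min)       # fixed order for reproducibility
        level = d[pair]
        merges.append((p, q, level))
        rest = [c for c in clusters if c not in (p, q)]
        updated = {}
        for r in rest:
            a1, a2, b = coeffs(len(p), len(q), len(r))
            # d(P+Q,R) = a1*d(P,R) + a2*d(Q,R) + b*d(P,Q)
            updated[frozenset([p | q, r])] = (
                a1 * d[frozenset([p, r])] + a2 * d[frozenset([q, r])] + b * level
            )
        d = {k: v for k, v in d.items() if p not in k and q not in k}
        d.update(updated)
        clusters = rest + [p | q]
    return merges
```

Calling agglomerate(matrix, ward) or agglomerate(matrix, flexible(-0.25)) on the same distance matrix gives merge lists that can be compared method by method, which is the comparison step 4 asks for.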
Copyright (c) Susan Laflin 1998.