Data Mining - An Introduction Tutorial/Practical, QUB

The Queen's University of Belfast

Parallel Computer Centre


Tutorial/Practical

Example

Clementine

Supplied by Integral Solutions Limited (ISL), Basingstoke, England
Visual Programming Interface
- builds a discovery model
- performs learning task
Uses neural networks and rule induction
Data sources
- ASCII file format, Oracle, Informix, Sybase and Ingres
Clementine has many useful facilities:
- Data Manipulation - constructing new data items derived from existing ones, and breaking the data down into meaningful subsets
- Browsing and Visualisation - displaying aspects of the data using interactive graphics
- Statistics - confirming suspected relationships between factors in the data
- Hypothesis testing - constructing models of how the data behaves and verifying them

Clementine Example

Drug trials

A number of hospital patients, all suffering from the same illness, were treated with a range of drugs
Five different drugs were available, and patients responded differently to each of them
Problem - which drug is appropriate for any given future patient?

Problem solving

Stages

Accessing the data
- read in the data e.g. from a text file with delimiters
- name the fields
  - age
  - sex
  - BP - High, Normal or Low
  - Cholesterol - Normal or High
  - Na - blood sodium concentration
  - K - blood potassium concentration
  - drug - the drug to which the patient responded
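
The data-access stage can be sketched outside Clementine in a few lines of Python. This is only an illustration, not how Clementine reads data: the comma delimiter, the lower-case field names and the three sample rows are assumptions made up for the example.

```python
import csv
import io

# Field names follow the list above (lower-cased for use as dictionary keys).
FIELDS = ["age", "sex", "bp", "cholesterol", "na", "k", "drug"]

# Made-up sample rows standing in for the real data file (not from the tutorial).
SAMPLE = """\
23,F,HIGH,HIGH,0.79,0.03,drugY
47,M,LOW,HIGH,0.74,0.06,drugC
61,F,NORMAL,NORMAL,0.56,0.07,drugX
"""

def read_patients(text_file, delimiter=","):
    """Read delimited records, name each field and convert the numeric ones."""
    reader = csv.DictReader(text_file, fieldnames=FIELDS, delimiter=delimiter)
    records = []
    for row in reader:
        row["age"] = int(row["age"])
        row["na"] = float(row["na"])
        row["k"] = float(row["k"])
        records.append(row)
    return records

patients = read_patients(io.StringIO(SAMPLE))
for p in patients:
    print(p)
```

Passing an open file object instead of the io.StringIO wrapper would read the same layout from disk.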
View the records by using the table node

Fields can be selected, or the data filtered
Display properties of the data e.g.
- what proportion of cases respond to each drug?

Answer - drugY, followed by drugX
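
A rough equivalent of this distribution display, assuming the drug column has already been extracted into a list. The values below are made up for illustration and do not reproduce the tutorial's counts.

```python
from collections import Counter

# Made-up drug column for illustration; not the tutorial's data.
drugs = ["drugY", "drugX", "drugY", "drugA", "drugY", "drugX", "drugC", "drugY"]

def drug_proportions(values):
    """Return (drug, proportion) pairs, most common drug first."""
    counts = Counter(values)
    total = len(values)
    return [(drug, n / total) for drug, n in counts.most_common()]

proportions = drug_proportions(drugs)
for drug, share in proportions:
    print(f"{drug}: {share:.1%}")
```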
Finding a relationship e.g.
- relationship between sodium and potassium levels as displayed in a point plot
- Random scattering - no apparent relationship
Re-examine the data for a particular drug, i.e. drugY, against the sodium to potassium ratio
Calculate the Na/K ratio, i.e. as a derived field or node
Patients with a high Na to K ratio respond best to drugY
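
The derived field can be sketched as follows. The records and the field name na_to_k are assumptions for illustration; only the idea of dividing Na by K and comparing the result across drugs comes from the text above.

```python
# Made-up records for illustration; values are not from the tutorial's data.
patients = [
    {"na": 0.79, "k": 0.03, "drug": "drugY"},
    {"na": 0.74, "k": 0.06, "drug": "drugC"},
    {"na": 0.56, "k": 0.07, "drug": "drugX"},
    {"na": 0.85, "k": 0.02, "drug": "drugY"},
]

def add_na_to_k(records):
    """Return copies of the records with a derived 'na_to_k' field."""
    return [dict(r, na_to_k=r["na"] / r["k"]) for r in records]

def mean_ratio_by_drug(records):
    """Average the derived ratio within each drug group."""
    groups = {}
    for r in records:
        groups.setdefault(r["drug"], []).append(r["na_to_k"])
    return {drug: sum(v) / len(v) for drug, v in groups.items()}

derived = add_na_to_k(patients)
means = mean_ratio_by_drug(derived)
for drug, mean in sorted(means.items()):
    print(f"{drug}: mean Na/K = {mean:.1f}")
```

A group whose mean ratio stands clearly apart from the others is the kind of pattern the point plot makes visible.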

Machine learning

Clementine

Which is the best drug for any patient?
- Filter the unwanted fields
- Define types for the fields
- Build rules and train nets, i.e. by attaching the appropriate nodes
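
Filtering fields and fixing the role of each one might look like this outside Clementine. Which fields to keep is a hypothetical choice made for the sketch (the derived ratio in place of raw Na and K); the tutorial does not say which fields were filtered.

```python
# Hypothetical choice of fields: keep the derived ratio, drop the raw Na and K.
INPUTS = ["age", "sex", "bp", "cholesterol", "na_to_k"]
TARGET = "drug"

# One made-up record for illustration.
record = {"age": 23, "sex": "F", "bp": "HIGH", "cholesterol": "HIGH",
          "na": 0.79, "k": 0.03, "na_to_k": 0.79 / 0.03, "drug": "drugY"}

def split_record(rec):
    """Keep only the input fields and separate out the field to be predicted."""
    features = {name: rec[name] for name in INPUTS}
    return features, rec[TARGET]

features, label = split_record(record)
print(features, "->", label)
```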

Building rules and training nets

Net trained, rules built on 200 example cases

Net and rules completed

Rules formulated

The rules' first decision is based on the same criterion discovered previously, i.e. it allocates drugY to patients with a high Na to K ratio
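
To give a feel for how a first decision of this kind can be found, here is a minimal one-split search. It is not Clementine's rule induction algorithm: it simply tries each cut-off on the Na/K ratio and keeps the one with the fewest misclassified cases. The data are made up.

```python
# Made-up (Na/K ratio, drug) pairs for illustration only.
cases = [(26.3, "drugY"), (12.3, "drugC"), (8.0, "drugX"), (42.5, "drugY"),
         (10.1, "drugA"), (19.8, "drugY"), (13.9, "drugX"), (16.2, "drugY")]

def best_threshold(pairs, target="drugY"):
    """Find the cut-off t for which 'ratio > t means target' makes fewest errors."""
    ratios = sorted(r for r, _ in pairs)
    # Candidate cut-offs are the midpoints between neighbouring ratios.
    candidates = [(a + b) / 2 for a, b in zip(ratios, ratios[1:])]

    def errors(t):
        # Count cases where the rule's prediction disagrees with the actual drug.
        return sum((r > t) != (d == target) for r, d in pairs)

    return min(candidates, key=errors)

threshold = best_threshold(cases)
print(f"if Na/K > {threshold:.2f} then drugY")
```

A full rule inducer would repeat this kind of search over every field and then recurse on the remaining cases.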

UUJ Example

House price prediction

Problem - mass appraisal of property in Northern Ireland
10 attributes per property including:
- Ward No, Area No, Price, Age of house, Number of bedrooms, Detached/Semi, Type of building (bungalow, house, chalet etc.), Heating
Best predictive accuracy
- 82% with mean error 15%
After removal of outliers
- 93% with mean error 7.8%
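
The effect of removing outliers on the mean error can be illustrated as below. Two assumptions are made: that "mean error" means the average absolute error as a percentage of the actual price, and that outliers are the properties above the 100,000 mark mentioned in this example. The price pairs are made up and are not taken from houses.dat.

```python
# Made-up (actual price, predicted price) pairs; not taken from houses.dat.
results = [(40000, 43000), (55000, 52000), (72000, 75000), (150000, 90000)]

def mean_percentage_error(pairs):
    """Average of |predicted - actual| / actual, expressed as a percentage."""
    return 100 * sum(abs(p - a) / a for a, p in pairs) / len(pairs)

def drop_outliers(pairs, max_price):
    """Keep only the cases whose actual price does not exceed max_price."""
    return [(a, p) for a, p in pairs if a <= max_price]

before = mean_percentage_error(results)
after = mean_percentage_error(drop_outliers(results, max_price=100000))
print(f"mean error before: {before:.1f}%, after: {after:.1f}%")
```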

Data file - houses.dat

Graphical Output

Outliers - any property priced over 100,000 had land attached

Statistics Produced

Predictions set

Clementine stream

Clementine Output

Rule Browser

REFERENCES

Knowledge Discovery Data Mine
- http://info.gte.com/~kdd/
University of Ulster Jordanstown
- Database Mining Interest Group (UUDMIG)
- http://iserve1.infj.ulst.ac.uk/main.html
Queen's University Parallel Computer Centre
- Training and Education - course materials
- http://www.pcc.qub.ac.uk/
Articles
- Byte, October 1995


All documents are the responsibility of, and copyright, © their authors and do not represent the views of The Parallel Computer Centre, nor of The Queen's University of Belfast.
Maintained by Alan Rea, email A.Rea@qub.ac.uk
