The Queen's University of Belfast

Parallel Computer Centre
Introduction
Historical Perspective
- The Relational Model
- revolutionised transaction processing systems
- DBMS gave access to the data stored
- OLTPs are good at putting data into databases
- The data explosion

- Increase in use of electronic data gathering devices e.g. point-of-sale, remote sensing devices etc.
- Data storage became easier and cheaper with increasing computing power
Problems
- DBMS gave access to the data stored but no analysis of data
- Analysis required to unearth the hidden relationships within the data i.e. for decision support
- The size of databases has increased, e.g. VLDBs; automated analysis techniques are needed as databases have grown beyond manual extraction
- Obstacles
- typical scientific user knew nothing of commercial business applications
- the business database programmers knew nothing of massively parallel principles
- solution was for database software producers to create easy-to-use tools and form strategic relationships with hardware manufacturers
What is data mining?
Definitions
the non-trivial extraction of implicit, previously unknown, and potentially useful information from data
William J Frawley, Gregory Piatetsky-Shapiro and Christopher J Matheus
variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data, and extracting these in such a way that they can be put to use in areas such as decision support, prediction, forecasting and estimation. The data is often voluminous but, as it stands, is of low value as no direct use can be made of it; it is the hidden information in the data that is useful
Clementine User Guide
Techniques used
- Data mining encompasses a number of different technical approaches, such as:
- clustering,
- data summarization,
- learning classification rules,
- finding dependency networks,
- analysing changes, and
- detecting anomalies.
Summary
- Data mining is the analysis of data and the use of software techniques for finding patterns and regularities in sets of data.
- The computer is responsible for finding the patterns by identifying the underlying rules and features in the data.
- It is possible to `strike gold' in unexpected places as the data mining software extracts patterns not previously discernible, or so obvious that no-one had noticed them before.
- Mining analogy:
- large volumes of data are sifted in an attempt to find something worthwhile
- in a mining operation large amounts of low grade materials are sifted through in order to find something of value.
Comparison Data Mining and DBMS
- DBMS - queries based on the data held e.g.
- last month's sales for each product
- sales grouped by customer age etc.
- list of customers who lapsed their policy
- Data Mining - infer knowledge from the data held to answer queries e.g.
- what characteristics do customers share who lapsed their policies and how do they differ from those who renewed their policies?
- why is the Cleveland division so profitable?
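The DBMS queries above can be answered directly from the data held, whereas the data mining questions need patterns to be inferred from it (see the classification and clustering sketches later in these notes). As a minimal illustration of the DBMS side only, a small pandas sketch over an invented sales table:

import pandas as pd

sales = pd.DataFrame({
    "product":      ["A", "A", "B", "B", "C"],
    "customer_age": [25,   37,  25,  52,  41],
    "amount":       [10,   20,  15,  30,  25],
})

# DBMS-style queries: direct summaries of the data held
print(sales.groupby("product")["amount"].sum())        # sales for each product
print(sales.groupby("customer_age")["amount"].sum())   # sales grouped by customer age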
Characteristics
of a data mining system
- Large quantities of data
- volume of data so great it has to be analysed by automated techniques e.g. POS, satellite information, credit card transactions etc.
- Noisy, incomplete data
- imprecise data is characteristic of all data collection
- databases are usually contaminated by errors, so it cannot be assumed that the data they contain is entirely correct, e.g. some attributes rely on subjective or measurement judgements
- Complex data structure - conventional statistical analysis not possible
- Heterogeneous data stored in legacy systems
Who needs data mining?
Who(ever) has information fastest and uses it wins
Don McKeough,
former president of Coca-Cola
- Businesses are looking for new ways to let end users find the data they need to:
- make decisions
- serve customers and
- gain the competitive edge
Example
Philadelphia Police & Fire Credit Union
- Used data mining to maximise their membership base i.e.
- looked at the multiple relationships with members such as consumer loans, annuities, credit cards etc.
- Information Harvester software was used to identify the most and least profitable members of the organization, the most attractive loan candidates etc.
- Major discovery which was counter-intuitive
- members who had filed for bankruptcy were more inclined to clear debts with the Credit Union than with outside lenders
Data Mining
Applications
- Medicine - drug side effects, hospital cost analysis, genetic sequence analysis, prediction etc.
- Finance - stock market prediction, credit assessment, fraud detection etc.
- Marketing/sales - product analysis, buying patterns, sales prediction, target mailing, identifying `unusual behaviour' etc.
- Knowledge Acquisition
- Scientific discovery - superconductivity research, etc.
- Engineering - automotive diagnostic expert systems, fault detection etc.
Data Mining Goals
Classification
- The DM system learns from examples in the data how to partition or classify it, i.e. it formulates classification rules
- Example - customer database in a bank
- Question - Is a new customer applying for a loan a good investment or not?
- Typical rule formulated -
if STATUS = married and INCOME > 10000
and HOUSE_OWNER = yes
then INVESTMENT_TYPE = good
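A minimal Python sketch of this rule as an executable predicate; the attribute names follow the rule above and the applicant record is invented:

def investment_type(customer):
    """Classify a loan applicant using the rule formulated above."""
    if (customer["STATUS"] == "married"
            and customer["INCOME"] > 10000
            and customer["HOUSE_OWNER"]):
        return "good"
    return "unknown"   # the notes only give the rule for the `good' class

applicant = {"STATUS": "married", "INCOME": 25000, "HOUSE_OWNER": True}
print(investment_type(applicant))   # -> good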
Association
- Rules that associate one attribute of a relation to another
- Set oriented approaches are the most efficient means of discovering such rules
- Example - supermarket database
- 72% of all the records that contain items A and B also contain item C
- the specific percentage of occurrences, 72%, is the confidence factor of the rule
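A minimal sketch of how such a confidence factor is computed: the confidence of the rule {A, B} -> C is the fraction of the records containing A and B that also contain C. The transactions below are invented and give 75% rather than the 72% quoted above:

transactions = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"B", "C"},
    {"A", "B", "C"},
]

antecedent, consequent = {"A", "B"}, {"C"}

with_antecedent = [t for t in transactions if antecedent <= t]
with_both = [t for t in with_antecedent if consequent <= t]

confidence = len(with_both) / len(with_antecedent)
print(f"confidence of A,B -> C: {confidence:.0%}")   # 75% on this toy data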
Sequence/Temporal
- Sequential pattern functions analyse collections of related records and detect frequently occurring patterns over a period of time
- Difference between sequence rules and other rules is the temporal factor
- Example - retailers database
- Can be used to discover the set of purchases that frequently precedes the purchase of a microwave oven
- Example - natural disasters database
- A discovery could be that when there is an earthquake in Los Angeles, Mount Kilimanjaro erupts the next day
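A minimal sketch of a sequential-pattern test over customer purchase histories (each a time-ordered list of purchases); the histories, items and candidate sequence are invented:

def contains_in_order(history, pattern):
    """True if the items of `pattern' occur in `history' in the given order."""
    pos = 0
    for item in history:
        if item == pattern[pos]:
            pos += 1
            if pos == len(pattern):
                return True
    return False

histories = [
    ["kettle", "toaster", "microwave"],
    ["toaster", "microwave"],
    ["kettle", "blender"],
    ["kettle", "toaster", "blender", "microwave"],
]

candidate = ["kettle", "toaster", "microwave"]
support = sum(contains_in_order(h, candidate) for h in histories) / len(histories)
print(f"{support:.0%} of customers follow the sequence {candidate}")   # 50%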
Data Mining and Machine Learning
Differences
- Data Mining (DM) or Knowledge Discovery in Databases (KDD) is about finding understandable knowledge
- Machine Learning (ML) is concerned with improving performance of an agent
- training a neural network to balance a pole is part of ML, but not of KDD
- Efficiency and scalability of the algorithm are more important in DM or KDD
- DM is concerned with very large, real-world databases
- ML typically looks at smaller data sets
Differences
- ML has laboratory type examples for the training set
- DM deals with `real world' data
- Real world data tend to have problems such as:
- missing values
- dynamic data
- pre-existing data
- noise
Data Mining Process
Stages
- Data pre-processing
- heterogeneity resolution
- data cleansing
- data warehousing
- Data Mining Tools applied
- extraction of patterns from the pre-processed data
- Interpretation and evaluation
- user bias, i.e. the user can direct DM tools to areas of interest
- attributes of interest in databases
- goal of discovery
- domain knowledge
- prior knowledge or belief about the domain
Issues in Data Mining
- Noisy data
- Missing values
- Static data
- Sparse data
- Dynamic data
- Relevance
- Interestingness
- Heterogeneity
- Algorithm efficiency
- Size and complexity of data
Techniques
- Set oriented database methods
- Statistics
- Clustering
- Visualisation
- Neural networks
- Rule Induction
- Set oriented approaches/Databases
- make use of DBMSs to discover knowledge, although SQL is limiting
- Statistics
- can be used in several data mining stages
- data cleansing i.e. the removal of erroneous or irrelevant data known as outliers
- EDA, exploratory data analysis e.g. frequency counts, histograms etc.
- data selection - sampling facilities reduce the scale of computation
- attribute re-definition e.g. Body Mass Index, BMI, which is Weight/Height²
- data analysis - measures of association and relationships between attributes, interestingness of rules, classification etc.
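A minimal sketch of two of these statistical steps, outlier removal via a simple z-score test and the BMI attribute re-definition; the sample records are invented:

import statistics

records = [
    {"weight_kg": 70, "height_m": 1.75},
    {"weight_kg": 82, "height_m": 1.80},
    {"weight_kg": 55, "height_m": 1.62},
    {"weight_kg": 68, "height_m": 1.70},
    {"weight_kg": 75, "height_m": 1.78},
    {"weight_kg": 90, "height_m": 1.85},
    {"weight_kg": 640, "height_m": 1.70},   # data-entry error - an outlier
]

weights = [r["weight_kg"] for r in records]
mean, sd = statistics.mean(weights), statistics.stdev(weights)

# data cleansing: drop records more than 2 standard deviations from the mean
clean = [r for r in records if abs(r["weight_kg"] - mean) <= 2 * sd]

# attribute re-definition: add the derived BMI attribute (weight / height squared)
for r in clean:
    r["bmi"] = round(r["weight_kg"] / r["height_m"] ** 2, 1)

print(f"kept {len(clean)} of {len(records)} records")
print(clean)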
- Visualization
- enhances EDA and makes patterns more visible, e.g. NETMAP, a commercial data mining tool, uses this technique
- Clustering i.e. Cluster Analysis
- Clustering and segmentation basically partition the database so that the members of each partition or group are similar according to some criterion or metric
- Clustering according to similarity is a concept which appears in many disciplines e.g. in chemistry the clustering of molecules
- Data mining applications make use of clustering according to similarity e.g. to segment a client/customer base
- It provides sub-groups of a population for further analysis or action - very important when dealing with very large databases
- Can be used for profile generation for target marketing, i.e. previous responses to mailing campaigns can be used to generate a profile of the people who responded, and this profile can then be used to predict responses and filter mailing lists to achieve the best response
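A minimal sketch of segmentation by clustering, here with k-means from scikit-learn (one of several techniques and libraries that could be used); the two features, the customer values and k = 3 are assumptions for illustration:

import numpy as np
from sklearn.cluster import KMeans

customers = np.array([
    [23, 1500], [25, 1700], [31, 1600],      # younger, low spend
    [45, 8200], [47, 9100], [52, 8800],      # middle-aged, high spend
    [63, 3100], [66, 2900], [70, 3300],      # older, moderate spend
], dtype=float)                               # columns: age, annual spend

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)

# each customer is assigned to a segment that can then be profiled and targeted
for segment in range(3):
    members = customers[model.labels_ == segment]
    print(f"segment {segment}: {len(members)} customers, "
          f"mean age {members[:, 0].mean():.0f}, "
          f"mean spend {members[:, 1].mean():.0f}")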
Knowledge Representation Methods
- Neural Networks
- a trained neural network can be thought of as an "expert" in the category of information it has been given to analyse
- provides projections given new situations of interest and answers "what if" questions
- problems include:
- the resulting network is viewed as a black box
- no explanation of the results is given i.e. difficult for the user to interpret the results
- difficult to incorporate user intervention
- slow to train due to their iterative nature

A neural net can be trained to identify the risk of cancer from a number of factors
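A minimal sketch of the idea just described: a small feed-forward network (scikit-learn's MLPClassifier, one possible implementation) trained to predict a risk label from a few factors. The data is randomly generated purely to make the example runnable and carries no medical meaning:

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
factors = rng.random((200, 4))                              # four risk factors, scaled 0-1
risk = (factors[:, 0] + factors[:, 1] > 1.0).astype(int)    # synthetic high/low risk label

net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
net.fit(factors, risk)

# the trained network acts as a black box: it predicts, but offers no explanation
new_case = rng.random((1, 4))
print("predicted risk class:", net.predict(new_case)[0])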
- Decision trees
- used to represent knowledge
- built using a training set of data and can then be used to classify new objects
- problems are:
- opaque structure - difficult to understand
- missing data can cause performance problems
- they become cumbersome for large data sets
- Example
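The original notes included a decision tree figure at this point; as a stand-in, a minimal sketch of building a tree from a small training set and using it to classify a new object (scikit-learn is assumed, and the loan-style attributes echo the earlier classification example and are invented):

from sklearn.tree import DecisionTreeClassifier, export_text

# columns: married (0/1), income, house_owner (0/1)
X = [[1, 25000, 1], [1, 8000, 0], [0, 30000, 1], [0, 5000, 0], [1, 40000, 1]]
y = ["good", "bad", "good", "bad", "good"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# the tree structure itself is the learned knowledge representation
print(export_text(tree, feature_names=["married", "income", "house_owner"]))
print(tree.predict([[1, 12000, 1]]))   # classify a new applicant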

- Rules
- probably the most common form of representation
- tend to be simple and intuitive
- unstructured and less rigid
- problems are:
- difficult to maintain
- inadequate to represent many types of knowledge
- Example format - see the loan classification rule given earlier
- Frames
- templates for holding clusters of related knowledge about a very particular subject
- a natural way to represent knowledge
- has a taxonomy approach
- problem is
- more complex than rule representation
Related Technologies
- Data Warehousing
- On-line Analytical Processing, OLAP
Data Warehousing
Definition
- A data warehouse can be defined as any centralised data repository which can be queried for business benefit
- warehousing makes it possible to
- extract archived operational data
- overcome inconsistencies between different legacy data formats
- integrate data throughout an enterprise, regardless of location, format, or communication requirements
- incorporate additional or expert information
Characteristics of a data warehouse
defined by Bill Inmon (IS guru)
- subject-oriented - data organized by subject instead of application e.g.
- an insurance company would organize their data by customer, premium, and claim, instead of by different products (auto, life, etc.)
- contains only the information necessary for decision support processing
- integrated - encoding of data is often inconsistent e.g.
- gender might be coded as "m" and "f" or 0 and 1, but when data are moved from the operational environment into the data warehouse they assume a consistent coding convention (a small coding sketch follows this list)
- time-variant - the data warehouse contains a place for storing data that are five to 10 years old, or older e.g.
- this data is used for comparisons, trends, and forecasting
- these data are not updated
- non-volatile - data are not updated or changed in any way once they enter the data warehouse
- data are only loaded and accessed
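Picking up the gender-coding example from the `integrated' item above, a minimal sketch of imposing one coding convention as records are moved into the warehouse (which sex the 0/1 codes denote in the source system is assumed):

GENDER_MAP = {"m": "M", "f": "F", 0: "M", 1: "F", "M": "M", "F": "F"}

operational_rows = [
    {"source": "app_1", "gender": "m"},
    {"source": "app_2", "gender": 1},   # assumes 1 codes female in this source
    {"source": "app_3", "gender": "F"},
]

# every row entering the warehouse carries the single agreed coding
warehouse_rows = [{**row, "gender": GENDER_MAP[row["gender"]]} for row in operational_rows]
print(warehouse_rows)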
Data warehousing
Processes
- insulate data - i.e. the current operational information
- preserves the security and integrity of mission-critical OLTP applications
- gives access to the broadest possible base of data
- retrieve data - from a variety of heterogeneous operational databases
- data is transformed and delivered to the data warehouse/store based on a selected model (or mapping definition)
- metadata - information describing the model and definition of the source data elements
- data cleansing - removal of certain aspects of operational data, such as low-level transaction information, which slow down the query times.
- transfer - processed data transferred to the data warehouse, a large database on a high performance box
Uses
of a data warehouse
- a central store against which the queries are run
- uses very simple data structures with very few assumptions about the relationships between the data
- a data mart is a small warehouse which provides subsets of the main store, summarised information
- depending on the requirements of a specific group/department
- marts often use multidimensional databases which can speed up query processing as they can have data structures which reflect the most likely questions
Data Warehouse model

Structure of data inside the data warehouse

An example of levels of summarization of data

Criteria
for a data warehouse
- Load Performance
- require incremental loading of new data on a periodic basis
- must not artificially constrain the volume of data
- Load Processing
- data conversions, filtering, reformatting, integrity checks, physical storage, indexing, and metadata update
- Data Quality Management
- ensure local consistency, global consistency, and referential integrity despite "dirty" sources and massive database size
- Query Performance
- must not be slowed or inhibited by the performance of the data warehouse RDBMS
- Terabyte Scalability
- Data warehouse sizes are growing at astonishing rates so RDBMS must not have any architectural limitations. It must support modular and parallel management.
- Mass User Scalability
- Access to warehouse data must not be limited to an elite few; the warehouse has to support hundreds, even thousands, of concurrent users while maintaining acceptable query performance
- Networked Data Warehouse
- Data warehouses rarely exist in isolation; users must be able to look at and work with multiple warehouses from a single client workstation
- Warehouse Administration
- large scale and time-cyclic nature of the data warehouse demands administrative ease and flexibility
- Integrated Dimensional Analysis
- dimensional support must be inherent in the warehouse RDBMS to provide the highest performance for relational OLAP tools
- Advanced Query Functionality
- End users require advanced analytic calculations, sequential and comparative analysis, and consistent access to detailed and summarized data
Problems with data warehousing
- the rush of companies to jump on the bandwagon as
these companies have slapped `data warehouse' labels on traditional transaction-processing products and co-opted the lexicon of the industry in order to be considered players in this fast-growing category
Chris Erickson, Red Brick
Data warehousing & OLTP
Similarities and Differences

OLTP systems
- OLTP systems designed to maximise transaction capacity but they:
- cannot be repositories of facts and historical data for business analysis
- cannot quickly answer ad hoc queries
- rapid retrieval is almost impossible
- data is inconsistent and changing, duplicate entries exist, entries can be missing
- OLTP offers large amounts of raw data which is not easily understood
- Typical OLTP query is a simple aggregation e.g.
- what is the current account balance for this customer?
Data warehouse systems
- Data warehouses are interested in query processing as opposed to transaction processing
- Typical business analysis query e.g.
- which product line sells best in middle-America and how does this correlate to demographic data?
OLAP
On-line Analytical processing
- Problem is how to process larger and larger databases
- OLAP involves many data items (many thousands or even millions) with complex relationships among them
- Fast response is crucial in OLAP
- Difference between OLAP and OLTP
- OLTP servers handle mission-critical production data accessed through simple queries
- OLAP servers handle management-critical data accessed through an iterative analytical investigation
OLAP
common analytical operations
- Consolidation - involves the aggregation of data i.e. simple roll-ups or complex expressions involving inter-related data
- e.g. sales offices can be rolled-up to districts and districts rolled-up to regions
- Drill-Down - can go in the reverse direction i.e. automatically display detail data which comprises consolidated data
- "Slicing and Dicing" - ability to look at the data base from different viewpoints e.g.
- one slice of the sales database might show all sales of product type within regions;
- another slice might show all sales by sales channel within each product type
- often performed along a time axis in order to analyse trends and find patterns
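A minimal sketch of these operations using a pandas pivot of an invented sales table: consolidation as a roll-up of offices to regions, drill-down to office detail, and one slice of product type within regions:

import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "North", "South", "South", "South"],
    "office":  ["Leeds", "Leeds", "York",  "Bath",  "Bath",  "Poole"],
    "product": ["tv",    "radio", "tv",    "tv",    "radio", "radio"],
    "amount":  [100,      40,      60,      80,      30,      50],
})

# consolidation: roll offices up to regions
print(sales.groupby("region")["amount"].sum())

# drill-down: show the office-level detail behind the regional totals
print(sales.groupby(["region", "office"])["amount"].sum())

# slicing: one slice of the cube - sales of product type within regions
print(sales.pivot_table(index="region", columns="product",
                        values="amount", aggfunc="sum"))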
Knowledge acquisition
using data mining
- Expert systems are models of real world processes
- Much of the information is available straight from the process e.g.
- in production systems, data is collected for monitoring the system
- knowledge can be extracted using data mining tools
- experts can verify the knowledge
- Example
- TIGON project - detection and diagnosis of an industrial gas turbine engine
Siftware History
- The most significant development was retooling database software for massively parallel environments
- Parallel processors can easily assign small, independent transactions to different processors.
- With more processors, more transactions can be executed without reducing throughput
- the same concept applies to executing multiple independent SQL statements i.e. a set of SQL statements can be broken up and allocated to different processors to increase speed
- Multiple data streams allow several operations to proceed simultaneously e.g.
- a customer table can be spread across multiple disks, and independent threads can search each subset of the customer data
- data is partitioned into multiple subsets and performance is increased; the I/O subsystems feed data from the disks to the appropriate threads or streams
- An essential part of designing a database for parallel processing is the partitioning scheme
- large databases are indexed - independent indexes must also be partitioned to maximize performance
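A minimal sketch of the same idea in Python: a customer table split into partitions, with an independent worker process scanning each partition; the range predicate and the number of partitions are arbitrary choices for illustration:

from concurrent.futures import ProcessPoolExecutor

customers = [{"id": i, "balance": (i * 37) % 500} for i in range(10_000)]

def scan(partition, min_balance=400):
    """Each worker scans its own subset of the customer data."""
    return [c["id"] for c in partition if c["balance"] >= min_balance]

def partitioned_search(table, n_partitions=4):
    # partition the data, then let independent workers search in parallel
    partitions = [table[i::n_partitions] for i in range(n_partitions)]
    with ProcessPoolExecutor(max_workers=n_partitions) as pool:
        results = pool.map(scan, partitions)
    return [cid for chunk in results for cid in chunk]

if __name__ == "__main__":
    print(len(partitioned_search(customers)), "customers matched")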
- Commercial Developments
- Oracle was first to market with a parallel database, the ORACLE7 RDBMS
- Red Brick has a strong showing - VPT is a DBMS tuned for data warehouse applications
- IBM is still the world's largest producer of database management software; 80% of the FORTUNE 500, including the top 100 companies, rely on the DB2 database
- INFORMIX is on-line with 8.0
- Sybase and System 10
- Information Harvester - software on the Convex Exemplar; market researchers in retail, insurance, financial and telecommunications firms can analyse large data sets in a short time
Commercial Examples
Information Harvester Inc.
- Founded 1994, based in Cambridge, Mass
- Makes use of conventional statistical analysis techniques by building upon a proprietary tree-based learning algorithm
- generates expert-system-like rules from datasets, initially presented in forms such as numbers, dates, categories, codes, etc.
- Examples of use:
- Healthcare - Michael Reese Medical Associates (MRMA) employed data mining software from Information Harvesting and Vantage Point as a tool for gaining advantage in contract negotiations
- Finance - The Philadelphia Police and Fire Federal Credit Union used data mining to maximize their membership base
Red Brick Company
- California based, specializes in products used for fast, accurate business decisions on large client/server databases
- VPT - Very large data warehouse support, Parallel query processing, Time based data management
- database server - SQL with decision support extensions
- TMU (table management utility) - transforms data to a warehouse schema
- gateway technology supporting client/server access to the warehouse
- Examples of use:
- H.E.B.- Category management in retailing
- Hewlett-Packard: "Discovering" Data To Manage Worldwide Support
- Reference - http://www.redbrick.com
IBM
- IBM provides a number of decision support tools giving a powerful but easy-to-use interface to the data warehouse
- IBM Information Warehouse Solutions - a choice of decision support tools that best meet the needs of the end users
- Customer Partnership Program e.g.
- Visa and IBM announced an agreement in May 1995 signalling their intention to work together
- changes the way in which Visa and its member banks exchange information worldwide i.e. the proposed structure will facilitate the timely delivery of information and critical decision support tools directly to member financial institutions' desktops worldwide
- Reference - http://www.ibm.com
Data mining projects
UU - Jordanstown
- Data mining in the N.Ireland Housing Executive
- Knee disorders classification
- Fault diagnosis in a telecommunication network
- A self-learning urology patient audit system
- Policy lapse/renewal prediction
- House price prediction
UUJ Example
Policy lapse/renewal prediction
- Problem - predicting whether a motor insurance policy will be lapsed or renewed
- 34 attributes stored for each policy
- 14 attributes were deemed relevant
- 2 attributes were derived from underlying attributes
- Predictive accuracy
- In a period of 3 weeks, the system achieved the same accuracy as statistical models developed by the insurance company, which had taken much longer to develop
MKS
The Mining Kernel System
- Based on the interdisciplinary approach of data mining
- Data pre-processing functionality i.e.
- statistical operations for removing outliers
- sampling
- data dimensionality reduction
- dealing with missing data
- Algorithms provided for
- classification
- association
- Facility for the user to indicate what is interesting, so that only interesting rules are presented
- Facility to incorporate domain knowledge into the knowledge discovery process
Conclusion
The future
- Data mining has a lot of potential
- Diversity in the field of application
- Estimated market for data mining is $500 million