DATA MINING
Data mining, a branch of computer
science, is the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management. Data mining is seen as an increasingly important tool by modern businesses to transform data into business intelligence, giving an
informational advantage. It is currently used in a wide range of profiling
practices, such as marketing, surveillance, fraud detection, and scientific discovery.
The related terms data
dredging, data fishing and data snooping
refer to the use of data mining techniques to sample portions of the larger
population data set that are (or may be) too small for reliable statistical
inferences to be made about the validity of any patterns discovered. These
techniques can, however, be used in the creation of new hypotheses to test
against the larger data populations.
Process
Pre-processing
Before data mining algorithms can be used, a target
data set must be assembled. As data mining can only uncover patterns already
present in the data, the target dataset must be large enough to contain these
patterns while remaining concise enough to be mined in an acceptable timeframe.
A common source for data is a data mart or data
warehouse. Pre-processing is essential to analyse multivariate
datasets before clustering or data mining.
The target set is then cleaned. Cleaning removes
observations that contain noise or missing data.
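As a rough illustration of the cleaning step, the sketch below drops observations with missing values and filters out noisy outliers. It assumes pandas is available; the file name and the 3-standard-deviation threshold are invented for the example.

```python
import pandas as pd

# Load the assembled target set (hypothetical file name).
data = pd.read_csv("target_set.csv")

# Remove observations with missing data.
data = data.dropna()

# Remove noisy observations: here, rows where any numeric value lies
# more than 3 standard deviations from its column mean.
numeric = data.select_dtypes("number")
z_scores = (numeric - numeric.mean()) / numeric.std()
data = data[(z_scores.abs() <= 3).all(axis=1)]
```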
The clean data are reduced into feature vectors, one vector
per observation. A feature vector is a summarised version of the raw data
observation. For example, a black and white image of a face which is 100px by
100px would contain 10,000 bits of raw data. This might be turned into a feature
vector by locating the eyes and mouth in the image. Doing so would reduce the
data for each vector from 10,000 bits to three codes for the locations,
dramatically reducing the size of the dataset to be mined, and hence reducing
the processing effort. The features selected depend on the
objective; selecting the "right" features is
fundamental to successful data mining.
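A toy sketch of the face example above, not a real landmark detector: it assumes the image is a 100×100 NumPy array where 0 marks a dark pixel, and the fixed windows where each facial feature is expected to lie are invented for illustration.

```python
import numpy as np

def feature_vector(image: np.ndarray) -> list:
    """Reduce a 100x100 black-and-white face image (0 = dark pixel)
    to three location codes: left eye, right eye, mouth."""
    windows = {
        "left_eye":  (slice(25, 45), slice(15, 45)),
        "right_eye": (slice(25, 45), slice(55, 85)),
        "mouth":     (slice(65, 90), slice(30, 70)),
    }
    codes = []
    for rows, cols in windows.values():
        ys, xs = np.nonzero(image[rows, cols] == 0)  # dark pixels in window
        if ys.size == 0:
            codes.append((-1, -1))                   # feature not found
        else:
            # Centroid of the dark pixels, in full-image coordinates.
            codes.append((rows.start + int(ys.mean()),
                          cols.start + int(xs.mean())))
    return codes
```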
The feature vectors are divided into two sets, the
"training set" and the "test set". The training set is used
to "train" the data mining algorithm(s), while the test set is used
to verify the accuracy of any patterns found.
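A minimal sketch of such a split, assuming NumPy; the placeholder data and the 80/20 ratio are illustrative choices, not something the text prescribes.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
vectors = rng.normal(size=(1000, 3))    # placeholder feature vectors
labels = rng.integers(0, 2, size=1000)  # placeholder desired outputs

# Shuffle, then hold out 20% of the observations as the test set.
indices = rng.permutation(len(vectors))
split = int(0.8 * len(vectors))
train_X, test_X = vectors[indices[:split]], vectors[indices[split:]]
train_y, test_y = labels[indices[:split]], labels[indices[split:]]
```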
Data mining
Data mining commonly involves four classes of tasks (a combined sketch follows this list):
- Clustering – the task of discovering groups and structures in the data that are in some way "similar", without using known structures in the data.
- Classification – the task of generalizing known structure to apply to new data. For example, an email program might attempt to classify an email as legitimate or spam. Common algorithms include decision tree learning, nearest neighbor, naive Bayesian classification, neural networks and support vector machines.
- Regression – the task of finding a function which models the data with the least error.
- Association rule learning – the task of searching for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
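The sketch below touches all four tasks on toy data, assuming scikit-learn is available; the tiny basket-counting loop at the end is a stand-in for a real association-rule algorithm such as Apriori.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Clustering: discover groups without using the known labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Classification: generalize known labels to new data.
clf = GaussianNB().fit(X[:150], y[:150])
print("accuracy:", clf.score(X[150:], y[150:]))

# Regression: fit a function that models the data with least squared error.
target = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1
reg = LinearRegression().fit(X, target)

# Association rules (toy version): how often are two items bought together?
baskets = [{"bread", "butter"}, {"bread", "milk"}, {"bread", "butter", "milk"}]
support = sum("bread" in b and "butter" in b for b in baskets) / len(baskets)
print("support(bread, butter):", support)
```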
Results validation
The final step of knowledge discovery from data is
to verify the patterns produced by the data mining algorithms occur in the
wider data set. Not all patterns found by the data mining algorithms are
necessarily valid. It is common for data mining algorithms to find patterns
in the training set which are not present in the general data set; this is
called overfitting. To overcome this, the evaluation uses a test
set of data on which the data mining algorithm was not
trained. The learnt patterns are applied to this test set and the resulting
output is compared to the desired output. For example, a data mining algorithm
trying to distinguish spam from legitimate emails would be trained on a training
set of sample emails. Once trained, the learnt
patterns would be applied to the test set of emails on which it had not been
trained; the accuracy of these patterns can then be measured by how many
emails they correctly classify. A number of statistical methods may be used to
evaluate the algorithm, such as ROC curves.
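A small sketch of this evaluation step, assuming scikit-learn; the random scores stand in for a trained spam classifier's outputs on the held-out test set.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=100)   # 1 = spam, 0 = legitimate (test labels)

# Placeholder classifier scores, biased toward the true label.
scores = np.clip(y_true * 0.6 + rng.random(100) * 0.5, 0, 1)
y_pred = (scores >= 0.5).astype(int)

print("accuracy:", accuracy_score(y_true, y_pred))
print("ROC AUC :", roc_auc_score(y_true, scores))  # area under the ROC curve
```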
If the learnt patterns do not meet the desired
standards, then it is necessary to reevaluate and change the preprocessing and
data mining. If the learnt patterns do meet the desired standards, the
final step is to interpret them and turn them into knowledge.
USES
1. Games
2. Business
3. Science and engineering