Many of the methods used in data mining are derived from statistics, especially multivariate statistics, and are often only adapted in complexity for use in data mining. They are also a core component of knowledge management solutions.

The loss of accuracy is often accompanied by a loss of statistical validity, so that from a purely statistical point of view some of these procedures can be incorrect. Unsupervised machine learning is closely related to data mining: methods from machine learning often appear in data mining applications, and vice versa.

Research in the area of database systems, in particular on index structures, plays a major role in data mining when it comes to reducing complexity. Typical tasks can be accelerated significantly with an appropriate database index, improving the running time of a data mining algorithm.
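The effect of an index can be illustrated with a minimal sketch: a plain Python dict standing in for a database index, with invented records and attribute names.

```python
# Sketch: how an index turns a repeated linear scan into a direct lookup.
# The records and the "city" attribute are invented for illustration.

records = [
    {"id": 1, "city": "Berlin"},
    {"id": 2, "city": "Paris"},
    {"id": 3, "city": "Berlin"},
]

# Without an index: every query scans all records, O(n) per query.
def scan(city):
    return [r["id"] for r in records if r["city"] == city]

# With an index: one pass to build it, then O(1) lookups per query.
index = {}
for r in records:
    index.setdefault(r["city"], []).append(r["id"])

def lookup(city):
    return index.get(city, [])

print(scan("Berlin"))    # [1, 3]
print(lookup("Berlin"))  # [1, 3]
```

The index trades one preprocessing pass and extra memory for much faster repeated queries, which is exactly the trade-off a data mining algorithm that probes the data many times benefits from.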

Information retrieval (IR) is another field that benefits from data mining. Data mining methods such as cluster analysis are used, for example, to group search results. Text mining and web mining are two specializations of data mining that are closely connected to information retrieval.
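As a hypothetical sketch of clustering search results, the snippet below groups result titles greedily by word overlap (Jaccard similarity). The titles and the threshold are invented; real IR systems use more sophisticated similarity measures.

```python
# Crude stand-in for cluster analysis over search results:
# greedily group titles whose word sets overlap enough.

def jaccard(a, b):
    # Similarity of two titles as overlap of their word sets.
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

def group_results(titles, threshold=0.3):
    clusters = []
    for title in titles:
        for cluster in clusters:
            # Compare against the cluster's first (representative) title.
            if jaccard(title, cluster[0]) >= threshold:
                cluster.append(title)
                break
        else:
            clusters.append([title])
    return clusters

results = [
    "python data mining tutorial",
    "data mining with python",
    "best hiking trails",
]
print(group_results(results))
# [['python data mining tutorial', 'data mining with python'],
#  ['best hiking trails']]
```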

Data collection, that is, the systematic recording of information, is an important prerequisite for obtaining valid results with data mining.

Data mining process

Data mining is the actual analysis step of knowledge discovery. The steps of the iterative process are roughly as follows:

- Focus: data collection and selection, as well as determining what knowledge already exists.
- Preprocessing: data cleansing and integration of the sources; inconsistencies are eliminated, for example by removing incomplete records.
- Transformation into a format appropriate for the analysis step, for example by selecting attributes or discretizing values.
- Data mining: the actual analysis step.
- Evaluation: the patterns found are assessed by the expert, and the achievement of the goals is checked.
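The preprocessing and transformation steps above can be sketched in a few lines. The records, attribute names, and age bins below are invented for illustration.

```python
# Sketch of two process steps: cleansing (remove incomplete records)
# and transformation (discretize a numeric attribute into bins).
# All data and bin boundaries are invented for illustration.

records = [
    {"age": 25, "income": 30000},
    {"age": None, "income": 45000},  # incomplete -> removed in cleansing
    {"age": 61, "income": 80000},
]

# Preprocessing: eliminate inconsistencies by removing incomplete records.
clean = [r for r in records if all(v is not None for v in r.values())]

# Transformation: discretize the numeric attribute "age".
def discretize_age(age):
    if age < 30:
        return "young"
    if age < 60:
        return "middle"
    return "senior"

transformed = [dict(r, age=discretize_age(r["age"])) for r in clean]
print(transformed)
# [{'age': 'young', 'income': 30000}, {'age': 'senior', 'income': 80000}]
```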

Typical tasks of data mining include:

- Outlier detection: identification of unusual records (outliers, errors, changes).
- Cluster analysis: grouping of objects on the basis of similarities.
- Classification: previously unassigned elements are assigned to existing classes.
- Association analysis: identification of correlations and dependencies in the data in the form of rules, such as "A and B are usually followed by C."
- Regression analysis: identification of relationships between dependent and independent variables.
- Summarization: reduction of the data set to a more compact description without significant loss of information.
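The association-rule idea ("A and B usually followed by C") can be made concrete by counting transactions. This is a minimal sketch of rule confidence; the transactions are invented for illustration.

```python
# Sketch of association analysis: estimate the confidence of a rule
# "A and B -> C" by counting transactions. Transactions are invented.

transactions = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"B", "C"},
]

def confidence(antecedent, consequent, transactions):
    # confidence = share of transactions containing the antecedent
    # that also contain the consequent.
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    hits = [t for t in with_antecedent if consequent <= t]
    return len(hits) / len(with_antecedent)

print(confidence({"A", "B"}, {"C"}, transactions))  # 2/3
```

Algorithms such as Apriori build on exactly these counts, additionally filtering rules by how often the antecedent occurs at all (support).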

These tasks can be roughly divided into observation problems (outlier detection, cluster analysis) and prediction problems (classification, regression analysis). Outlier detection: this task searches for data objects that are inconsistent with the rest of the data, for example because they have unusual attribute values. Identified outliers are often verified manually and removed from the data set, since they can distort the results of other methods.
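One common way to flag such unusual attribute values is a z-score rule; the text does not prescribe a specific method, so this is just a minimal sketch with invented data.

```python
# Sketch of outlier detection: flag values far from the mean.
# The data and the 2-standard-deviation cutoff are invented.

import statistics

values = [10.1, 9.8, 10.3, 10.0, 9.9, 42.0]

mean = statistics.mean(values)
stdev = statistics.stdev(values)

# Flag values more than 2 standard deviations from the mean.
outliers = [v for v in values if abs(v - mean) > 2 * stdev]
print(outliers)  # [42.0]
```

In practice a flagged value like 42.0 would then be verified manually, as the text describes, before being removed from the data set.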
