
Data Mining and Classification

 

There has been a significant increase in the amount of data generated each day, and with the advent of smarter technologies this data needs to be classified and sorted before decisions can be drawn from it. Many techniques are available for classifying documents into various categories or labels. Data mining is the process of non-trivial extraction of novel, implicit, and actionable knowledge from large data sets [1]. Popularly referred to as Knowledge Discovery in Databases (KDD), it is the automated extraction of patterns representing knowledge implicitly stored in large databases, data warehouses, and other massive information repositories. Standard data mining methods may be integrated with information retrieval techniques and with the construction or use of hierarchies built specifically for text data, as well as discipline-oriented term categorization systems (such as in chemistry, medicine, law, or economics) [2]. Text databases are databases that contain word descriptions of objects. These word descriptions are usually not simple keywords but rather long sentences or paragraphs, such as product specifications, error or bug reports, warning messages, summary reports, notes, or other documents.


Classification is a data mining technique used to predict group membership for data instances. It builds structures from examples of past decisions that can then be used to make decisions for unseen or future cases. Various classification techniques are used for web page classification, data classification, and so on. Text categorization is the task of automatically assigning texts to predefined categories based on their content, by learning models from categorized collections of documents. Text categorization is a primary requirement of text retrieval systems, and as the amount of online text increases, so does the demand for text categorization to support the analysis and management of text. Although text itself is cheap, it is expensive to determine which class a text belongs to. This information can be obtained from automatic categorization of text at low cost, but building the classifier itself is expensive, because it requires considerable human effort or must be trained on texts that have themselves been manually classified.

The task is usually performed in two stages: 1) the training phase and 2) the testing phase. During the training phase, sample documents are provided to the document classifier for each predefined category. The classifier uses machine learning algorithms to learn a class prediction model from these labelled documents. In the testing phase, unlabelled documents are provided to the classifier, which applies its classification model to determine the categories or classes of the unseen documents. This training-testing approach makes document classification a supervised learning task in which unlabelled documents are categorized into known categories.
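
As a minimal sketch of this two-stage, train-then-test workflow, assuming Python with scikit-learn (the toy documents, labels, and choice of vectorizer and classifier below are illustrative assumptions, not taken from the source):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Training phase: labelled sample documents for each predefined category.
train_docs = [
    "the striker scored a late goal in the final",
    "the midfielder was transferred for a record fee",
    "the central bank raised interest rates again",
    "quarterly profits beat analyst expectations",
]
train_labels = ["sports", "sports", "finance", "finance"]

# Learn a class prediction model from the labelled documents.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(train_docs, train_labels)

# Testing phase: apply the learned model to unseen, unlabelled documents.
test_docs = ["the team won the championship match",
             "investors reacted to the new interest rate"]
print(classifier.predict(test_docs))  # expected: ['sports' 'finance']
```
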
Given the importance of text mining, numerous text mining applications today use some form of text classification, and this has fuelled extensive research in the area. Efficient training, efficient application, and the construction of understandable classifiers remain active areas of text classification research. Document (email) filtering and routing is a very important application in large corporate settings. Spam filtering is perhaps the most common application that affects all of us, with Bayesian or rule-based spam filtering being a necessary component of all client and server mail software. Web directories are an invaluable source of well-categorized information on a broad variety of topics on the web, and although they are still created manually, many applications use them for better information presentation and navigation.

Data mining is the process of applying machine learning techniques to automatically or semi-automatically analyse and extract knowledge from stored data. It can also be defined as a technology that enables data analysis, exploration, and visualization of very large databases at a high level of abstraction. Data mining, also known as Knowledge Discovery in Databases (KDD), comprises two broad categories of tasks:

·         Predictive tasks: The main objective is to predict the value of an attribute based on the already known values of other attributes.

·         Descriptive tasks: The main objective is to derive patterns that capture the underlying relationships in the data.

Based on the main techniques used in data mining, predictive modelling refers to the process of building a model for the target variable as a function of the other variables. The two main types of predictive modelling are classification and regression, as illustrated in the sketch below.
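
As a brief, hypothetical illustration of the difference between the two, assuming scikit-learn (the toy feature values and targets below are invented for the example):

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: the target variable is a discrete class label.
X = [[25, 40000], [35, 60000], [45, 80000], [50, 52000]]  # [age, income]
y_class = ["no", "no", "yes", "yes"]                      # bought the product?
clf = DecisionTreeClassifier().fit(X, y_class)
print(clf.predict([[48, 75000]]))   # predicted class label, e.g. ['yes']

# Regression: the target variable is a continuous value.
y_value = [120.0, 180.0, 260.0, 210.0]                    # amount spent
reg = DecisionTreeRegressor().fit(X, y_value)
print(reg.predict([[48, 75000]]))   # a numeric estimate of the target
```
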

Automated document classification may follow one of these approaches:

·         Rule-based classification: Here, the user groups the documents together, decides on categories, and formulates the rules that define those categories; these rules are essentially query phrases [5]. A matching operator is then applied to classify the documents (a minimal sketch appears after this list). This approach is very accurate for small document sets, and the results always reflect what the user defines, since the user writes the rules. However, defining rules can be tedious for large document sets with many categories: as the document set grows, the user may need to write correspondingly more rules.

·         Machine learning based approach: Here, the machine is trained with a set of sample documents that are already classified into the classes (training data) and, as it learns, it automatically creates classifiers from this data [6]. On one hand this approach shows high predictive performance, but on the other hand it requires an effective training data set.

·         Supervised and unsupervised classification: Supervised classification requires an external mechanism (such as human feedback) to provide information on the correct classification of documents [7, 8, 10]. Unsupervised classification (also called document clustering) is classification done entirely without reference to external information; a clustering sketch also appears after this list.
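
As a minimal sketch of the rule-based approach described in the first bullet, assuming plain Python (the categories and query phrases below are hypothetical examples of user-written rules):

```python
# User-defined rules: each category is defined by a set of query phrases.
rules = {
    "invoice": ["invoice number", "amount due", "payment terms"],
    "bug_report": ["stack trace", "steps to reproduce", "expected behaviour"],
}

def classify(document: str) -> str:
    """Apply a simple matching operator: a document gets the first
    category whose query phrases appear in its text."""
    text = document.lower()
    for category, phrases in rules.items():
        if any(phrase in text for phrase in phrases):
            return category
    return "uncategorized"

print(classify("Please find the amount due on invoice number 4711"))   # invoice
print(classify("Attached is the stack trace and steps to reproduce"))  # bug_report
```
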
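And as a brief sketch of unsupervised classification (document clustering), again assuming scikit-learn; the toy documents are invented and no labels are used, so the groups are discovered purely from the text:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["goal scored in the final minute", "the striker joined a new club",
        "interest rates rose sharply", "the central bank cut rates"]

# No external labels: documents are grouped only by their content.
X = TfidfVectorizer().fit_transform(docs)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(clusters)  # e.g. [0 0 1 1] -- cluster ids, not predefined categories
```
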

Naive Bayes (based on Bayes' theorem): easy to compute and rapid to classify with, but it assumes independent features, and continuous feature values are difficult to handle directly.
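
As a hedged sketch of how a Naive Bayes text classifier is typically used, assuming scikit-learn (the spam/ham documents are invented for illustration; GaussianNB is mentioned in the comment only as one common way to accommodate continuous feature values):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Word counts are discrete, which suits multinomial Naive Bayes;
# continuous feature values would instead need binning or GaussianNB.
docs = ["cheap pills buy now", "meeting agenda attached",
        "win money now", "project status report"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

model = MultinomialNB()          # assumes conditionally independent features
model.fit(X, labels)

print(model.predict(vectorizer.transform(["buy cheap pills"])))  # expected: ['spam']
```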