EEDIS Laboratory: Evolutionary Engineering and Distributed Information Systems

N-grams in Texts Categorization


Authors:	ELBERRICHI Zakaria, Aljohar Badr
Type:	International journal
Journal:	Scientific Journal of King Faisal University (Basic and Applied Sciences) ISSN:
Volume: 8	Issue: 2	Pages: 1428H
Published:	01-01-2007

This paper addresses the automatic classification of documents. The classification is supervised, since it operates on a predefined set of classes. The originality of the proposed approach lies in its vector representation of documents, which is built not on words but on character n-grams, with n ranging from 2 to 5.
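As an illustration of the representation described above, the following minimal sketch counts the character n-grams of a text for n from 2 to 5. The function name `char_ngrams` and its parameters are hypothetical, and no preprocessing (lowercasing, whitespace or punctuation handling) is applied, since the abstract does not say which the authors used.

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=5):
    """Count every character n-gram of length n_min..n_max in text.

    Hypothetical helper: the text is used as-is, with no normalisation.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the text; texts shorter
        # than n simply contribute no n-grams of that length.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# Example: build the n-gram profile of a short string.
profile = char_ngrams("data mining")
```

The resulting counter maps each n-gram to its frequency and can serve as the raw term-count vector of a document.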

Because a large number of n-grams is generated for each class, we used the χ² statistic to reduce the number of characteristic n-grams per class. The vectors were weighted with TF-IDF, and the distance between two vectors was computed with the cosine measure. The experiments were carried out on two corpora well known in the text categorization community, Reuters-21578 and 20 Newsgroups. The approach was evaluated with a function combining precision and recall.
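The three steps named above (χ² selection, TF-IDF weighting, cosine comparison) can be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: the abstract does not give the exact formulas, so the code uses the common 2x2 contingency form of χ², the plain tf × log(N/df) weighting, and cosine similarity (one minus this value gives a distance). All function names are hypothetical.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square score of an n-gram for a class (assumed 2x2 form).

    a: in-class documents containing the n-gram
    b: out-of-class documents containing it
    c: in-class documents without it
    d: out-of-class documents without it
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def tfidf(term_counts, doc_freq, n_docs):
    """Weight raw counts by log inverse document frequency (assumed variant).

    N-grams missing from doc_freq (for example, those dropped by the
    chi-square selection) are ignored.
    """
    return {t: tf * math.log(n_docs / doc_freq[t])
            for t, tf in term_counts.items() if t in doc_freq}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return 0.0 if norm_u == 0 or norm_v == 0 else dot / (norm_u * norm_v)
```

In such a pipeline, one would keep for each class only the n-grams with the highest χ² scores, weight the document and class vectors with `tfidf`, and assign a document to the class whose vector yields the highest cosine similarity.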