Phrase detection Project proposal for Machine Learning course project

88 Citations · 2006
S. Shringarpure

A probabilistic method of finding meaningful phrases in the Twenty-Newsgroups text corpus by indexing terms which have length more than one is explored.

Abstract

Queries made to search engines are normally longer than a single word. In fact, [3] show in an analysis of AltaVista query logs that more than half of the queries have length greater than one. Conventional IR methods propose intersecting the occurrence lists for each word in a phrase, using various methods to reduce the time required for this task. Thus, the average response time of a search engine can be reduced by indexing terms which have length more than one. However, in an index which has N words, there are potentially N^2 bigram phrases, N^3 trigram phrases, and so on. Clearly, it would be infeasible to index all possible bigrams, trigrams, etc. We would therefore like to obtain only those phrases which are “meaningful”, which we define as their co-occurrence not being merely due to chance. We will explore a probabilistic method of finding meaningful phrases in the Twenty-Newsgroups text corpus.
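
To make the "not merely due to chance" criterion concrete, here is a minimal sketch of one possible scoring scheme. The abstract does not name a specific statistic, so the use of pointwise mutual information (PMI) is an assumption made for illustration, as are the names `score_bigrams` and `min_count` and the toy token list. The idea is to compare how often two words are actually adjacent with how often they would be adjacent if they occurred independently.

```python
import math
from collections import Counter


def score_bigrams(tokens, min_count=2):
    """Score adjacent word pairs by PMI (an assumed statistic, not
    necessarily the one used in the proposal).

    Pairs seen fewer than `min_count` times are skipped, because PMI
    is unreliable for very rare events.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        # Observed probability of the pair occurring together.
        p_pair = count / n_bi
        # Probability expected if the two words were independent.
        p_chance = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        scores[(w1, w2)] = math.log2(p_pair / p_chance)
    return scores


# Toy corpus (hypothetical data, standing in for Twenty-Newsgroups text).
tokens = (
    "new york is large . new york is busy . "
    "the city is large . the park is busy ."
).split()

scores = score_bigrams(tokens)
for pair, pmi in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(pmi, 2))
```

A pair whose score is well above zero co-occurs more often than independence would predict, which makes it a candidate for indexing as a phrase. Only pairs above some threshold would be indexed, which avoids storing all N^2 possible bigrams.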