The blog data for this project has already been tokenized. More specifically, the data consists of two parts: training vectors and a dictionary. Each training sample is represented as a vector with approximately 1000 components, where each component is a word id that maps to a word in the dictionary. However, the dictionary is quite crude. Many words are closely related, with similar meanings, and could be treated as the same word. There are also random signs and characters with no specific meaning, such as: u
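The layout described above can be sketched as follows. This is a minimal, hypothetical illustration, not the project's actual data: the dictionary and sample values are made up, and the real vectors have around 1000 components.

```python
# Illustrative sketch of the described data layout (values are invented):
# the dictionary maps each word id to its word, and each training
# sample is a vector of word ids indexing into that dictionary.

dictionary = {0: "great", 1: "awesome", 2: "movie"}  # word id -> word

sample = [1, 2, 0]  # one tokenized blog post, as word ids

# Recover the words of a sample by looking up each id.
words = [dictionary[i] for i in sample]
print(words)
```

Note that under this layout, near-synonyms such as "great" and "awesome" receive distinct ids even though they could reasonably be collapsed into one entry, which is exactly the crudeness noted above.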