This is a demo on the IMDB dataset, where the documents are movie reviews. The labels are “good” or “bad” annotations for each review, derived from its rating.
The purpose of the model is to predict “good” or “bad” from the review text. It is essentially the same problem as predicting “spam” or “not spam” from email text.


--------------------------------train -------------------------
(pos) I highly recommend this movie.
(neg) I do not recommend this movie to anybody.
(neg) It is a waste of time.
(pos) Good fun stuff !
(neg) It's just not worth your time.
--------------------------------------------------------------
---------------------------------- test --------------------------
(neg) I do not recommend this movie unless you are prepared for the biggest waste of money and time of your life.
(neg) This movie was the slowest and most boring so called horror that I have ever seen.
(neg) The film is not worth watching.
(pos) A wonderful film
(pos) This is a really nice and sweet movie that the entire family can enjoy.
-------------------------------------------------------------------

Gather ngrams only once, from the training set. Use these ngrams to compute matching scores for both the training set and the test set. Make sure the same ngrams are used for both sets and that they appear in the same order.

Procedure:

connected to index 
there are 10 documents in the index.
number of training documents = 5
there are 2 classes in the training set.
label distribution in training set:
neg:3, pos:2,
LabelTranslator{intToExt={0=neg, 1=pos}, extToInt={neg=0, pos=1}}
fields to be considered = [body]
gathering 1-grams from field body with slop 0 and minDf 0
gathered 22 1-grams
gathering 2-grams from field body with slop 0 and minDf 0
gathered 21 2-grams
there are 43 ngrams in total
creating training set
allocating 43 columns for training set
training set created
data set saved to /huge1/people/chengli/projects/pyramid/archives/exp35/imdb_toy/1/train
creating test set
allocating 43 columns for test set
test set created
data set saved to /huge1/people/chengli/projects/pyramid/archives/exp35/imdb_toy/1/test

The format that we want: an on-disk sparse matrix.
In each line, the first number is the label. The rest are feature index:feature value pairs. Feature indices start at 0. Since the feature matrix is very sparse, only non-zero feature values are stored; any feature not listed is assumed to have value 0.
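
For example, a hypothetical line (not taken from the demo output) for a document with label 1 and non-zero values 2.0, 1.0, and 3.0 at feature indices 0, 5, and 17 would look like:

    1 0:2.0 5:1.0 17:3.0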


Two steps: 1. gather ngrams; 2. compute matching scores.


Enumerating ngrams:
Scan all training documents. For each document, pull out its term vector, get the sorted list of (position, term) entries, and scan the list to build ngrams.
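
A minimal Python sketch of this step (illustrative only, not Pyramid's implementation; the term vector is assumed to be a position-to-term mapping, and slop 0 means the terms of an ngram must occupy consecutive positions):

    from collections import Counter

    def enumerate_ngrams(term_vector, n):
        # term_vector: dict mapping position -> term,
        # e.g. {0: 'good', 1: 'fun', 2: 'stuff'}
        grams = set()
        for start in term_vector:
            span = range(start, start + n)
            if all(p in term_vector for p in span):
                grams.add(' '.join(term_vector[p] for p in span))
        return grams

    def gather_ngrams(term_vectors, n, min_df=0):
        # Count each ngram's document frequency over the training set
        # and keep those with df >= min_df, in a fixed (sorted) order.
        df = Counter()
        for tv in term_vectors:
            for gram in enumerate_ngrams(tv, n):
                df[gram] += 1
        return sorted(g for g, c in df.items() if c >= min_df)

Whether this reproduces the exact counts in the log above (22 1-grams, 21 2-grams) depends on the tokenizer, which is not shown here.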

Computing matching scores:

Fundamental constraint: we cannot hold the entire dense document-by-ngram matrix in memory.
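
To see the scale (illustrative numbers, not from this toy demo): the full IMDB dataset has tens of thousands of reviews, and gathering 1-grams and 2-grams can easily yield hundreds of thousands to millions of features; a dense matrix of 25,000 documents × 1,000,000 features stored as 8-byte doubles would take 25,000 × 1,000,000 × 8 bytes = 200 GB.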

sparse matrix options:

1. use a sparse matrix library (see the sketches after this list)
    Python: scipy sparse matrices
    http://docs.scipy.org/doc/scipy/reference/sparse.html
    
    Java: Mahout sparse matrix or Guava table
    http://mahout.apache.org/
    http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect/Table.html
    
    WARNING: be careful with the complexity of the operations; each format trades off construction cost against access cost (e.g., in scipy, lil_matrix is cheap to fill incrementally, while csr_matrix is efficient for row slicing but expensive to modify).

2. write your own data structure
    array of hash maps (one hash map of non-zero entries per document row; see the sketch after this list)
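
A minimal sketch of option 1 in Python using scipy (function names and the counting scheme are illustrative assumptions, not Pyramid's API): fill a lil_matrix, which is cheap to update incrementally, then convert to csr_matrix for efficient row access when writing the on-disk format described above. Passing the same ngram_list for the training set and the test set guarantees the same features in the same order.

    import scipy.sparse as sp

    def matching_scores(tokenized_docs, ngram_list):
        # tokenized_docs: list of token lists; ngram_list: ngrams gathered
        # once from the training set, reused in the same order for test.
        index = {g: j for j, g in enumerate(ngram_list)}
        m = sp.lil_matrix((len(tokenized_docs), len(ngram_list)))
        for i, tokens in enumerate(tokenized_docs):
            for n in (1, 2):
                for k in range(len(tokens) - n + 1):
                    gram = ' '.join(tokens[k:k + n])
                    if gram in index:
                        m[i, index[gram]] += 1  # score = match count
        return m.tocsr()  # efficient row slicing for writing out

    def save_sparse(path, labels, m):
        # Write the on-disk format: label, then index:value pairs.
        with open(path, 'w') as f:
            for i, label in enumerate(labels):
                row = m.getrow(i)
                pairs = ['%d:%g' % (j, v)
                         for j, v in zip(row.indices, row.data)]
                f.write(' '.join([str(label)] + pairs) + '\n')

And a minimal sketch of option 2, an array of hash maps (one map per document row, storing only non-zero entries; gets and sets are O(1) expected, and memory is proportional to the number of non-zeros):

    class RowSparseMatrix:
        def __init__(self, num_rows):
            # one hash map per row: column index -> non-zero value
            self.rows = [{} for _ in range(num_rows)]

        def set(self, i, j, value):
            if value:
                self.rows[i][j] = value
            else:
                self.rows[i].pop(j, None)  # never store zeros

        def get(self, i, j):
            return self.rows[i].get(j, 0.0)  # unlisted entries are 0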