How to implement bag-of-words feature hashing in Python?
I'm trying to classify a few thousand documents, each a few lines long. I've used a regular bag of words before, but this time I want to use the hashing trick, and I'm having trouble understanding the implementation. There are around 8000 unique words in the data, so I figure 128*128 = 16384 buckets should be enough.
I'm using these sources:

http://blog.someben.com/2013/01/hashing-lang/
http://www.hpl.hp.com/techreports/2008/hpl-2008-91r1.pdf
Here is the function that generates a feature vector for each document:
    import mmh3

    def add_doc(text):
        text = str.split(text)
        d_input = dict()
        for word in text:
            hashed_token = mmh3.hash(word) % 127
            d_input[hashed_token] = d_input.setdefault(hashed_token, 0) + 1
        return d_input
Now I must be doing something wrong, or not understanding something, because there seems to be a huge number of collisions. Any help would be appreciated.
You should not be modding the hash by 127: that generates only 127 possible outputs, whereas by your own reasoning you want 128^2 = 16384 possible outputs to cover your 8000 unique words.
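For example, here is a minimal sketch of the corrected function, using the same mmh3 hashing as in the question (the NUM_BUCKETS name is just illustrative):

    import mmh3

    NUM_BUCKETS = 128 * 128  # 16384 buckets, comfortably more than 8000 unique words

    def add_doc(text):
        d_input = dict()
        for word in text.split():
            # Mod by the full bucket count, not 127
            hashed_token = mmh3.hash(word) % NUM_BUCKETS
            d_input[hashed_token] = d_input.setdefault(hashed_token, 0) + 1
        return d_input

Note that mmh3.hash returns a signed 32-bit integer, but Python's % operator always yields a non-negative result for a positive modulus, so the bucket index stays in range.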
python hash machine-learning nlp