Sunday, 15 August 2010

hash - How to implement Bag of words feature hashing in python? -




I'm trying to classify a few thousand documents, a few lines each. I've used a regular bag of words before, but I want to use the hashing trick this time, and I'm having trouble understanding the implementation. There are around 8000 unique words in the data, so I figure 128*128 buckets should be enough.

I'm using these sources:

http://blog.someben.com/2013/01/hashing-lang/
http://www.hpl.hp.com/techreports/2008/hpl-2008-91r1.pdf

Here is the function that generates the feature vector for each document:

    import mmh3

    def add_doc(text):
        text = str.split(text)
        d_input = dict()
        for word in text:
            hashed_token = mmh3.hash(word) % 127
            d_input[hashed_token] = d_input.setdefault(hashed_token, 0) + 1
        return(d_input)

Now, I must be doing something wrong, or not understanding something, because there seem to be a huge number of collisions. Any help is appreciated.
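To put a rough number on the collisions described above, here is a small sketch that estimates how many of the 8000 words would land in an already-occupied bucket. It assumes the hash spreads words uniformly at random over the buckets, which is an idealization; the helper name expected_colliding_words is made up for illustration.

```python
def expected_colliding_words(n_words, n_buckets):
    """Expected number of words that fall into an already-occupied bucket.

    Assumes each word is hashed to a bucket uniformly at random
    (an idealization of a good hash such as MurmurHash3).
    """
    # Probability that a given bucket stays empty after n_words insertions
    p_empty = (1.0 - 1.0 / n_buckets) ** n_words
    expected_occupied = n_buckets * (1.0 - p_empty)
    # Every word beyond the first in a bucket counts as a collision
    return n_words - expected_occupied


if __name__ == "__main__":
    for n_buckets in (127, 128 * 128):
        estimate = expected_colliding_words(8000, n_buckets)
        print(n_buckets, "buckets:", round(estimate), "colliding words (estimate)")
```

Under that assumption, 127 buckets force almost every word to share a bucket, while 128*128 buckets still leave roughly a fifth of the vocabulary colliding.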

You should not be taking the hash modulo 127. That generates only 127 possible outputs, whereas by your own reasoning you want 128^2 possible outputs for the 8000 unique words.
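A minimal sketch of that change, assuming the only thing wrong is the modulus: the bucket count becomes 128*128 and everything else follows the function in the question. The constant N_BUCKETS, the token_hash wrapper, and the CRC32 fallback (used only when mmh3 is not installed) are additions for illustration, not part of the original code.

```python
import zlib
from collections import Counter

try:
    import mmh3  # third-party MurmurHash3 binding used in the question

    def token_hash(word):
        return mmh3.hash(word)
except ImportError:
    # Assumption for portability: fall back to CRC32 from the standard library
    def token_hash(word):
        return zlib.crc32(word.encode("utf-8"))

N_BUCKETS = 128 * 128  # 16384 possible feature indices instead of 127


def add_doc(text, n_buckets=N_BUCKETS):
    """Return a sparse feature vector {bucket_index: count} for one document."""
    counts = Counter()
    for word in text.split():
        # Python's % with a positive modulus is never negative,
        # so signed hash values still map into range(n_buckets)
        counts[token_hash(word) % n_buckets] += 1
    return dict(counts)


if __name__ == "__main__":
    print(add_doc("the quick brown fox jumps over the lazy dog"))
```

Collisions do not disappear with this change; they only become much rarer than with 127 buckets.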

python hash machine-learning nlp
