How to implement bag-of-words feature hashing in Python?
I'm trying to classify a few thousand documents, each a few lines long. I've used a regular bag of words before, but this time I want to use the hashing trick, and I'm having trouble understanding the implementation. There are around 8000 unique words in the data, so I figure 128*128 = 16384 buckets should be enough.
I'm using these sources:

http://blog.someben.com/2013/01/hashing-lang/
http://www.hpl.hp.com/techreports/2008/hpl-2008-91r1.pdf
Here is the function that generates a feature vector for each document:
    import mmh3

    def add_doc(text):
        text = str.split(text)
        d_input = dict()
        for word in text:
            hashed_token = mmh3.hash(word) % 127
            d_input[hashed_token] = d_input.setdefault(hashed_token, 0) + 1
        return d_input
Now I must be doing something wrong, or not understanding something, because there seems to be a huge number of collisions. Any help would be appreciated.
You should not be modding the hash by 127: that generates only 127 possible outputs, whereas by your own reasoning you want 128^2 = 16384 possible outputs to cover your 8000 unique words.
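For example, here is a minimal sketch of the corrected function, using the same mmh3 hashing as in the question (the NUM_BUCKETS name is just illustrative):

    import mmh3

    NUM_BUCKETS = 128 * 128  # 16384 buckets, comfortably more than 8000 unique words

    def add_doc(text):
        d_input = dict()
        for word in text.split():
            # Mod by the full bucket count, not 127
            hashed_token = mmh3.hash(word) % NUM_BUCKETS
            d_input[hashed_token] = d_input.setdefault(hashed_token, 0) + 1
        return d_input

Note that mmh3.hash returns a signed 32-bit integer, but Python's % operator always yields a non-negative result for a positive modulus, so the bucket index stays in range.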
python hash machine-learning nlp