Monday, 15 March 2010

Python - Finding word frequencies of list of words in text file




I am trying to speed up a project that counts word frequencies. I have 360+ text files, and I need the total number of words in each file and the number of times each word from a list of words appears. I know how to do this with a single text file.

>>> import nltk
>>> import os
>>> import re
>>> os.chdir(r"c:\users\cameron\desktop\pdf-to-txt")
>>> filename = "1976.03.txt"
>>> textfile = open(filename, "r")
>>> inputstring = textfile.read()
>>> word_list = re.split(r'\s+', inputstring.lower())
>>> print('words in text:', len(word_list))  # prints the number of words in the text file
>>> word_list.count('inflation')  # returns the number of times 'inflation' occurs in the text file
>>> word_list.count('jobs')
>>> word_list.count('output')

It's tedious to get the frequencies of 'inflation', 'jobs', and 'output' individually. Can I put these words into a list and find the frequency of all the words in the list at the same time? Basically, I want to do this with Python.

Example: instead of this:

>>> word_list.count('inflation')
3
>>> word_list.count('jobs')
5
>>> word_list.count('output')
1

I want to do this (I know this isn't real code; it is what I'm asking for help on):

>>> list1 = 'inflation', 'jobs', 'output'
>>> word_list.count(list1)
'inflation', 'jobs', 'output'
3, 5, 1

My list of words is going to have 10-20 terms, so I need to be able to point Python toward a list of words to get the counts of. It would also be nice if the output could be copied and pasted into an Excel spreadsheet, with the words as columns and the frequencies as rows.

Example:

inflation, jobs, output
3, 5, 1

And finally, can anyone help automate this for all of the text files? I figure I can point Python toward the folder and it can do the above word counting from the new list for each of the 360+ text files. It seems easy enough, but I'm a bit stuck. Any help?

An output like this would be fantastic:

filename1
inflation, jobs, output
3, 5, 1

filename2
inflation, jobs, output
7, 2, 4

filename3
inflation, jobs, output
9, 3, 5

Thanks!

collections.Counter() has this covered, if I understand your problem.

The example from the docs would seem to match your problem.

from collections import Counter

# Tally occurrences of words in a list
cnt = Counter()
for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
    cnt[word] += 1
print(cnt)

# Find the ten most common words in Hamlet
import re
words = re.findall(r'\w+', open('hamlet.txt').read().lower())
Counter(words).most_common(10)

From the example above you should be able to do:

import re
import collections

# Split the lower-cased text into words and count every distinct word
words = re.findall(r'\w+', open('1976.03.txt').read().lower())
print(collections.Counter(words))
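Because a Counter returns 0 for any key it has not seen, the counts for a fixed list of words can be read straight out of it. The following is only a minimal sketch, assuming Python 3: the helper name `count_wanted` and the sample text are made up for illustration, and the tab-separated layout is just one guess at something that pastes cleanly into a spreadsheet.

```python
import re
from collections import Counter


def count_wanted(text, wanted):
    """Return (total word count, list of counts, one per word in `wanted`).

    `count_wanted` is a hypothetical helper for this sketch, not a library function.
    """
    words = re.findall(r'\w+', text.lower())
    counts = Counter(words)
    # Counter returns 0 for missing keys, so absent words need no special case.
    return len(words), [counts[w] for w in wanted]


wanted = ['inflation', 'jobs', 'output']
# Made-up sample text standing in for open('1976.03.txt').read()
sample = "Inflation hurts jobs. Jobs drive output, and inflation erodes it."

total, freqs = count_wanted(sample, wanted)
print('words in text:', total)
# Tab-separated rows: the words as a header row, the frequencies underneath.
print('\t'.join(wanted))
print('\t'.join(str(n) for n in freqs))
```

Keeping `wanted` as a list preserves the column order, so the header row and the frequency row always line up.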

EDIT: a naive approach to show one way.

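import re
from collections import Counter

# Split into a list so that "word in wanted" tests whole words, not substrings
wanted = "fish chips steak".split()
cnt = Counter()
words = re.findall(r'\w+', open('1976.03.txt').read().lower())
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)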
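The same idea could be extended to a whole folder of text files. What follows is a rough sketch under stated assumptions rather than a definitive solution: it assumes Python 3, that every file of interest ends in `.txt`, and that one CSV row per file is an acceptable layout for Excel. The function name `count_folder` and its parameters are hypothetical.

```python
import csv
import os
import re
from collections import Counter


def count_folder(folder, wanted, out_path):
    """Write one CSV row per .txt file: filename, total words, one count per wanted word.

    `count_folder`, `folder`, `wanted` and `out_path` are hypothetical names
    chosen for this sketch.
    """
    with open(out_path, 'w', newline='') as out:
        writer = csv.writer(out)
        # Header row: the wanted words become column titles.
        writer.writerow(['filename', 'total_words'] + list(wanted))
        for name in sorted(os.listdir(folder)):
            if not name.endswith('.txt'):
                continue  # assumption: only .txt files are of interest
            with open(os.path.join(folder, name), 'r') as f:
                words = re.findall(r'\w+', f.read().lower())
            counts = Counter(words)
            writer.writerow([name, len(words)] + [counts[w] for w in wanted])


# Example call (the output filename is a placeholder):
# count_folder(r"c:\users\cameron\desktop\pdf-to-txt",
#              ['inflation', 'jobs', 'output'], 'word_counts.csv')
```

A single table with one row per file differs slightly from the per-file blocks shown in the question, but it opens directly in Excel and keeps the words as columns.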

