Friday, 15 April 2011

Python - how to extract specific lines from a data file




I have a problem that I sense should have a quite simple solution. I'm building a model and want to test its accuracy with 10-fold cross-validation. To do this I have to split my training corpus 90%/10% into training and test sections, train my model on the 90%, and test it on the 10%. I want to do this 10 times, taking a different 90%/10% split every time, so that each bit of the corpus has been used as testing data. Then I'll average the results of each 10% test.
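The splitting scheme described above can be sketched in memory before any files are involved. This is only a minimal illustration, not the asker's code: the function name `k_fold_splits` is hypothetical, and it assumes the corpus fits in a list of lines and that contiguous, unshuffled folds are acceptable.

```python
def k_fold_splits(lines, k=10):
    """Yield (train, test) pairs; every line lands in exactly one test set."""
    n = len(lines)
    for i in range(k):
        # Integer arithmetic spreads any remainder across the folds,
        # so fold sizes differ by at most one line.
        start = i * n // k
        stop = (i + 1) * n // k
        test = lines[start:stop]
        train = lines[:start] + lines[stop:]
        yield train, test


if __name__ == "__main__":
    # Stand-in corpus of 30 lines; a real run would read danish.train.
    corpus = ["line %d\n" % i for i in range(1, 31)]
    for fold, (train, test) in enumerate(k_fold_splits(corpus), start=1):
        print(fold, len(train), len(test))
```

Each pass gives one 90%/10% split; training and scoring the model on each pair and averaging the ten scores would complete the cross-validation.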

I have tried to write a script that extracts 10% of the training corpus and writes it to a new file, but so far I don't have it working. What I have done is count the total number of lines in the file and divide that number by 10, to know the size of each of the 10 different test sets I want to extract.

    trainfile = open("danish.train")
    numberoflines = 0
    for line in trainfile:
        numberoflines += 1
    # floor division, so the result is an integer on Python 2 and 3
    lengthtest = numberoflines // 10
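The same count can be obtained a little more compactly. The helper below is a sketch with a hypothetical name (`count_lines`); it assumes the file can be read as text and uses a `with` block so the file is closed afterwards.

```python
def count_lines(path):
    """Return the number of lines in the file at `path`."""
    with open(path) as handle:
        # Iterating over a file object yields one line at a time,
        # so the whole file is never held in memory.
        return sum(1 for _ in handle)
```

With it, the test-set size would be `count_lines("danish.train") // 10`.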

I have found that my own training file consists of 3638 lines, so each test set should consist of 363 lines.

How do I write lines 1-363, lines 364-726, etc. to different test files?

Once you have the count of lines, go back to the beginning of the file and start copying lines out to danish.train.part-01. Whenever the line number is a multiple of the size of the 10% test set, open a new file for the next part.

    #!/usr/bin/env python2.7

    trainfile = open("danish.train")
    numberoflines = 0
    for line in trainfile:
        numberoflines += 1
    # floor division keeps the part size an integer
    lengthtest = numberoflines // 10

    # rewind the file to the beginning
    trainfile.seek(0)

    numberoflines = 0
    file_number = 0
    for line in trainfile:
        # start a new part every lengthtest lines
        if numberoflines % lengthtest == 0:
            file_number += 1
            output = open('danish.train.part-%02d' % file_number, 'w')
        numberoflines += 1
        output.write(line)
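One thing to watch with the script above: when the line count is not an exact multiple of 10, the leftover lines go into an extra file. For the 3638-line file in the question, parts 01 to 10 hold 363 lines each and an eleventh part receives the remaining 8. A possible variant that always writes exactly ten parts is sketched below. It is an alternative, not the answer's method: the name `split_into_parts` is hypothetical, and it assumes the file is small enough to read into memory.

```python
def split_into_parts(path, k=10):
    """Write the lines of `path` to k files named <path>.part-01 ... part-k."""
    with open(path) as source:
        lines = source.readlines()
    n = len(lines)
    written = []
    for i in range(k):
        # Boundaries computed this way spread the remainder over the
        # parts, so exactly k files are produced.
        start = i * n // k
        stop = (i + 1) * n // k
        part_path = "%s.part-%02d" % (path, i + 1)
        with open(part_path, "w") as part:
            part.writelines(lines[start:stop])
        written.append(part_path)
    return written
```

Calling `split_into_parts("danish.train")` would produce parts of 363 or 364 lines each for a 3638-line file, rather than ten parts of 363 plus a short eleventh.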

On this input file (sorry, I don't speak Danish!):

    one
    two
    three
    four
    five
    six
    seven
    eight
    nine
    ten
    eleven
    twelve
    thirteen
    fourteen
    fifteen
    sixteen
    seventeen
    eighteen
    nineteen
    twenty
    twenty-one
    twenty-two
    twenty-three
    twenty-four
    twenty-five
    twenty-six
    twenty-seven
    twenty-eight
    twenty-nine
    thirty

This creates the files:

    danish.train.part-01
    danish.train.part-02
    danish.train.part-03
    danish.train.part-04
    danish.train.part-05
    danish.train.part-06
    danish.train.part-07
    danish.train.part-08
    danish.train.part-09
    danish.train.part-10

And part 5, for example, contains:

    thirteen
    fourteen
    fifteen

python file-io
