Python - How to extract specific lines from a data file
I have a problem that I sense should have quite a simple solution. I'm building a model and want to test its accuracy with 10-fold cross-validation. To do this I have to split my training corpus 90%/10% into training and test sections, train my model on the 90% and test it on the 10%. I want to do this 10 times, taking a different 90%/10% split every time, so that each bit of the corpus has been used as testing data. Then I'll average the results of each 10% test.
I have tried to write a script to extract 10% of the training corpus and write it to a new file, but so far I can't get it working. What I have done is count the total number of lines in the file, then divide that number by 10 to know the size of each of the 10 different test sets I want to extract.
    trainfile = open("danish.train")

    # Count the lines in the training file.
    numberoflines = 0
    for line in trainfile:
        numberoflines += 1

    # Integer division: size of each 10% test set.
    lengthtest = numberoflines // 10
I have found that my own training file consists of 3638 lines, so each test set should consist of 363 lines.
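As a quick check of that arithmetic: 3638 is not an exact multiple of 10, so ten blocks of 363 lines leave a few lines over. A minimal sketch, where the variable names are illustrative and not taken from the script above:

```python
# Line count quoted in the question.
total_lines = 3638
folds = 10

# divmod returns the quotient and the remainder in one call.
fold_size, leftover = divmod(total_lines, folds)

print(fold_size)  # 363
print(leftover)   # 8
```

Those 8 leftover lines have to go somewhere, either into an extra part or spread over the existing ten.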
How do I write lines 1-363, lines 364-726, etc. to different test files?
Once you have the count of lines, go back to the beginning of the file and start copying lines out to danish.train.part-01. Whenever the line number is a multiple of the size of the 10% test set, open a new file for the next part.
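    #!/usr/bin/env python2.7

    trainfile = open("danish.train")

    # First pass: count the lines to get the size of each 10% test set.
    numberoflines = 0
    for line in trainfile:
        numberoflines += 1
    lengthtest = numberoflines // 10

    # Rewind the file to the beginning for the second pass.
    trainfile.seek(0)

    numberoflines = 0
    file_number = 0
    output = None
    for line in trainfile:
        # Start a new part file every lengthtest lines. If the line count
        # is not a multiple of 10, the leftover lines end up in an extra part.
        if numberoflines % lengthtest == 0:
            if output is not None:
                output.close()
            file_number += 1
            output = open('danish.train.part-%02d' % file_number, 'w')
        numberoflines += 1
        output.write(line)

    if output is not None:
        output.close()
    trainfile.close()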
On this input file (sorry, I don't speak Danish!):
    one
    two
    three
    four
    five
    six
    seven
    eight
    nine
    ten
    eleven
    twelve
    thirteen
    fourteen
    fifteen
    sixteen
    seventeen
    eighteen
    nineteen
    twenty
    twenty-one
    twenty-two
    twenty-three
    twenty-four
    twenty-five
    twenty-six
    twenty-seven
    twenty-eight
    twenty-nine
    thirty
This creates the files:
    danish.train.part-01
    danish.train.part-02
    danish.train.part-03
    danish.train.part-04
    danish.train.part-05
    danish.train.part-06
    danish.train.part-07
    danish.train.part-08
    danish.train.part-09
    danish.train.part-10
And part 5, for example, contains:
    thirteen
    fourteen
    fifteen
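For the cross-validation itself, each 10% test part needs a matching 90% training set made of the other nine parts. The following is a rough sketch of one way to build both at once in memory. It is not part of the script above: the function name `split_folds` and the stand-in sample data are assumptions made for illustration. It also spreads any leftover lines across the first few folds, so exactly ten parts result even when the line count is not a multiple of 10.

```python
def split_folds(lines, k=10):
    """Yield (train, test) pairs so that every line is tested exactly once."""
    base, extra = divmod(len(lines), k)
    start = 0
    for i in range(k):
        # The first `extra` folds take one additional line each,
        # so no lines are left over for an eleventh part.
        size = base + (1 if i < extra else 0)
        test = lines[start:start + size]
        train = lines[:start] + lines[start + size:]
        yield train, test
        start += size


# Stand-in data: thirty numbered lines instead of a real corpus.
sample = ["line %d\n" % n for n in range(1, 31)]
folds = list(split_folds(sample))

print(len(folds))        # 10
print(folds[4][1])       # ['line 13\n', 'line 14\n', 'line 15\n']
print(len(folds[4][0]))  # 27
```

With a real corpus, `lines` could come from `open("danish.train").readlines()`, and each pair could then be written out with `writelines` to a pair of files per fold. This holds the whole file in memory, which should be fine for a few thousand lines.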
python file-io