Monday, 15 June 2015

Split sample of R tm corpus objects




I'm using the R tm package and trying to split my corpus into a training set and a testing set, and to encode this in the metadata for selection. What's the easiest way to do this (suppose I'm trying to split my sample in half)?

Here are some things I've tried:

I know that when I type...

    > meta(d)
      MetaID y
    1      0 1
    2      0 1

I see the IDs, but cannot seem to access them (in order to say that the first half belongs in one set and the second half in another). rownames(attributes(d)$DMetaData) gives me the indexes, but this looks ugly, and they're factors.

Now, after converting to a data frame, where d is the dataset, I'd just say:

    half <- floor(dim(d)[1]/2)
    d$train <- d[1:half,]
    d$test <- d[(half+1):(half*2),]

But how can I easily do something like...

meta(d, tag="split") = ifelse((meta(d,"id")<=floor(length(d)/2)),"train","test")

...to get a result like:

    > meta(d)
        MetaID y split
    1        0 1 train
    2        0 1 train
    ...      . .   ...
    100      0 1  test

Unfortunately, meta(d,"id") doesn't work, while meta(d[[1]],"id") == 1 does, but it is redundant. I'm looking for a whole-vector way of accessing the meta ID, or a generally smarter way of subsetting and assigning to the "split" meta variable.

A corpus is just a list, so you can split it like a normal list. Here is an example.

First I create some data, using the sample texts that ship with the tm package:

    txt <- system.file("texts", "txt", package = "tm")
    (ovid <- Corpus(DirSource(txt)))
    A corpus with 5 text documents

Now I split the data into train and test:

    nn <- length(ovid)
    ff <- as.factor(c(rep('train', ceiling(nn/2)),     ## create the split factor you want
                      rep('test', nn - ceiling(nn/2)))) ## you can add a validation set, for example...
    ll <- split(as.matrix(ovid), ff)
    ll
    $test
    A corpus with 2 text documents

    $train
    A corpus with 3 text documents
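The split factor can be tried out on a plain list before involving tm at all. The following is a minimal base R sketch of the same idea; the `docs` list and the `n_train` and `parts` names are hypothetical stand-ins, not part of the original answer.

```r
# Stand-in for a corpus: a plain list of five documents (hypothetical data).
docs <- list("doc one", "doc two", "doc three", "doc four", "doc five")

nn <- length(docs)
n_train <- ceiling(nn / 2)

# The first half (rounded up) is labelled "train", the remainder "test".
ff <- factor(c(rep("train", n_train), rep("test", nn - n_train)))

# split() groups the list elements by factor level. Levels are sorted
# alphabetically, so "test" precedes "train" in the result.
parts <- split(docs, ff)

# Number of documents in each part.
lengths(parts)
```

Because the factor is built by hand, a third level such as "validation" could be added in the same way by extending the `rep()` calls.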

Then I assign the new tag:

    ll <- sapply(names(ll), function(x) {
      meta(ll[[x]], tag = 'split') <- ff[ff == x]
      ll[x]
    })

You can check the result:

    lapply(ll, meta)
    $test.test
      MetaID split
    4      0  test
    5      0  test

    $train.train
      MetaID split
    1      0 train
    2      0 train
    3      0 train
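For the whole-vector assignment the question asks about, a sketch along the following lines may work. It assumes a more recent tm release than the one used above (one where corpora are built with `VCorpus()` and document-level metadata is kept in a data frame), and the `texts` vector is hypothetical sample data. Rather than reading the document IDs, it labels documents by their position in the corpus.

```r
library(tm)

# Hypothetical five-document corpus built from a character vector.
texts <- c("first text", "second text", "third text",
           "fourth text", "fifth text")
d <- VCorpus(VectorSource(texts))

# Whole-vector assignment of a document-level tag: positions up to
# half (rounded down) become "train", the remaining ones "test".
half <- floor(length(d) / 2)
meta(d, "split") <- ifelse(seq_len(length(d)) <= half, "train", "test")

# meta(d) returns the document-level metadata as a data frame,
# so the new tag can be used directly for subsetting.
train <- d[meta(d)$split == "train"]
test  <- d[meta(d)$split == "test"]
```

Keeping the tag on the single corpus, instead of splitting first and tagging each piece, means the split can be changed later by reassigning one vector.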

