Split sample of R tm corpus objects
I'm using the R tm package and trying to split my corpus into a training set and a testing set, and to encode that in the metadata for selection. What's the easiest way to do this (suppose I'm trying to split my sample in half)?
Here are some things I've tried:
I know that when I type...
    > meta(d)
      MetaID Y
    1      0 1
    2      0 1
I see the IDs, but cannot seem to access them (in order to say that the first half belong in one set, and the second half in another set).
    rownames(attributes(d)$DMetaData)
gives me the indexes, but this looks ugly, and they're factors.
With a data frame d, I would just do:
    half <- floor(dim(d)[1]/2)
    d$train <- d[1:half,]
    d$test <- d[(half+1):(half*2),]
But how can I easily do something like...
    meta(d, tag="split") = ifelse((meta(d,"ID")<=floor(length(d)/2)),"train","test")
...to get a result like:
    > meta(d)
        MetaID Y split
    1        0 1 train
    2        0 1 train
    ...      . .   ...
    100      0 1  test
Unfortunately, meta(d,"ID") doesn't work, but meta(d[[1]],"ID") == 1 does, although it is redundant. I'm looking for a whole-vector way of accessing the meta ID, or a smarter way of subsetting and assigning to the "split" meta variable.
A corpus is a list, so you can split it like a normal list. Here is an example:
First I create some data, using the texts shipped within the tm package:
    txt <- system.file("texts", "txt", package = "tm")
    (ovid <- Corpus(DirSource(txt)))
    A corpus with 5 text documents
Now I split my data into train and test:
    nn <- length(ovid)
    ff <- as.factor(c(rep('train',ceiling(nn/2)),     ## you create the split factor as you want
                      rep('test',nn-ceiling(nn/2))))  ## you can add a validation set, for example...
    ll <- split(as.matrix(ovid),ff)
    ll
    $test
    A corpus with 2 text documents

    $train
    A corpus with 3 text documents
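To see what the split factor does on its own, here is a minimal base-R sketch on an ordinary list. The `docs` list and its contents are made up for illustration and only stand in for a corpus; no tm functions are involved.

```r
# Hypothetical stand-in for a corpus: a plain list of five short documents.
docs <- list("arma virumque", "gallia est omnis", "in nova fert",
             "quo usque tandem", "odi et amo")

nn <- length(docs)

# The first ceiling(nn/2) documents are labelled "train", the rest "test".
ff <- factor(c(rep("train", ceiling(nn / 2)),
               rep("test", nn - ceiling(nn / 2))))

# split() groups the list elements by the levels of the factor
# and returns a named list with one element per level.
parts <- split(docs, ff)

# Number of documents in each part.
lengths(parts)
```

Factor levels are sorted alphabetically, which is why "test" is listed before "train" in the result.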
Then I assign the new tag:
    ll <- sapply(names(ll),
                 function(x) {
                   meta(ll[[x]], tag = 'split') <- ff[ff==x]
                   ll[x]
                 })
You can check the result:
    lapply(ll,meta)
    $test.test
      MetaID split
    4      0  test
    5      0  test

    $train.train
      MetaID split
    1      0 train
    2      0 train
    3      0 train
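For the whole-vector assignment the question asks about, the following is a rough sketch that assumes a newer tm release (0.6 or later), where a corpus-level `meta(x, tag) <- value` stores one value per document and `meta(x)` returns those values as a data frame. It is an alternative under that assumption, not the approach shown above, and the names `half`, `train` and `test` are illustrative only.

```r
library(tm)

# Sample texts shipped with tm, as in the answer above.
txt  <- system.file("texts", "txt", package = "tm")
ovid <- VCorpus(DirSource(txt))

n    <- length(ovid)
half <- ceiling(n / 2)

# Assumption: in tm >= 0.6 this sets a per-document ("indexed") metadata
# column for the whole corpus in a single vectorised assignment.
meta(ovid, "split") <- ifelse(seq_len(n) <= half, "train", "test")

# Subset the corpus with a logical vector built from that column.
train <- ovid[meta(ovid)$split == "train"]
test  <- ovid[meta(ovid)$split == "test"]
```

Positions from `seq_len(n)` are used instead of document IDs, so the split does not depend on how the IDs are named.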