Monday, 15 August 2011

r - Subsetting on a factor column parsed by readHTMLTable returns no result




I ran into a weird subsetting problem. I can subset on one column but cannot subset on the other, even though both columns seem to have been parsed by readHTMLTable the same way.

The code to replicate:

require(XML)
theurl <- "http://en.wikipedia.org/wiki/list_of_stock_exchanges"
html <- htmlParse(theurl)
# take the second table on the page
sedata <- readHTMLTable(html)[[2]]
names(sedata) = c("rank","ex","economy","hq","marketcap","tradevalue")
# strip thousands separators and convert to numeric
sedata = transform(sedata,marketcap = as.numeric(gsub(",","",marketcap)))
sedata = transform(sedata,tradevalue = as.numeric(gsub(",","",tradevalue)))

I want to subset the Indian stock exchanges, so I used:

> subset(sedata,sedata$economy == "india")
[1] rank       ex         economy    hq         marketcap  tradevalue
<0 rows> (or 0-length row.names)
> subset(sedata,sedata$economy == " india")
[1] rank       ex         economy    hq         marketcap  tradevalue
<0 rows> (or 0-length row.names)

I don't get any rows back, despite having validated that there are 2 rows that should satisfy the condition. Yet I can do the same thing on the other column, "ex":

> subset(sedata,sedata$ex == "jse limited")
   rank          ex     economy           hq marketcap tradevalue
17   17 jse limited southafrica johannesburg       903        287

I've run other functions, and the 2 columns look the same:

> sapply(sedata,class)
      rank         ex    economy         hq  marketcap tradevalue
  "factor"   "factor"   "factor"   "factor"  "numeric"  "numeric"
> levels(sedata$economy)
[1] " australia" " brazil"    " canada"
[4] " china"     " germany"   " hong kong"
[7] " india"     " japan"     " russia"
...
> levels(sedata$ex)
[1] "australian securities exchange" "bme spanish exchanges"
[3] "bm&f bovespa"                   "bombay stock exchange"
[5] "deutsche börse"                 "hong kong stock exchange"
[7] "jse limited"                    "korea exchange"
...

What did I miss? Did I use the wrong subsetting command? :(

subset(sedata,sedata$economy == " india")

As mentioned in the comments, the issue occurs because of non-standard characters in the data. I am guessing this happens because of the character encoding settings.
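One way to check for such characters is to look at the code point of the leading character. The snippet below is only a sketch: the `economy` factor is a hand-built stand-in, and the assumption that the scraped values start with a non-breaking space (U+00A0) is a guess, not something confirmed from the actual table.

```r
# Hypothetical stand-in for the scraped column (not the real data): the first
# value starts with a non-breaking space (U+00A0), the second with an
# ordinary space.
economy <- factor(c("\u00a0india", " japan"))
values <- as.character(economy)

# The two leading characters print alike but have different code points.
first_nbsp  <- utf8ToInt(substr(values[1], 1, 1))  # 160, non-breaking space
first_space <- utf8ToInt(substr(values[2], 1, 1))  # 32, ordinary space

# Comparing against an ordinary space therefore matches nothing,
# while spelling out the non-breaking space does match.
n_plain <- sum(economy == " india")
n_nbsp  <- sum(economy == "\u00a0india")
```

If the first character of a level in the real data reports something other than 32, an exact comparison typed with the space bar cannot match it.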

Instead of using subset you may want to use standard [ subsetting with grepl. grepl gives a vector of logical values, which you can use to subset the data frame. It allows partial matching as an added bonus.

> sedata[grepl('india', sedata$ex),]
   rank                                           ex            economy     hq marketcap tradevalue
11   11 national stock exchange of republic of india  republic of india bombay      1234        442

Edit

grepl also works within the subset function:

> subset(sedata, grepl('india', sedata$ex) )
   rank                                           ex            economy     hq marketcap tradevalue
11   11 national stock exchange of republic of india  republic of india bombay      1234        442
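If exact matching with == is preferred over partial matching, another option is to clean the column first. This is a sketch under assumptions: `sedata_demo` is a small hand-built data frame, not the scraped table, and the stray character is assumed to be a non-breaking space (U+00A0).

```r
# Hypothetical stand-in for the scraped table, with a non-breaking space
# (U+00A0) in front of each economy value.
sedata_demo <- data.frame(
  ex = c("bombay stock exchange", "hong kong stock exchange"),
  economy = c("\u00a0india", "\u00a0hong kong"),
  stringsAsFactors = TRUE
)

# Replace non-breaking spaces with ordinary ones, trim both ends,
# and rebuild the factor from the cleaned strings.
cleaned <- gsub("\u00a0", " ", as.character(sedata_demo$economy), fixed = TRUE)
sedata_demo$economy <- factor(trimws(cleaned))

# An exact comparison now finds the row.
india_rows <- subset(sedata_demo, economy == "india")
```

After this step the comparison no longer depends on typing the invisible character, at the cost of modifying the column.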
