R - Subsetting on a factor column parsed by readHTMLTable returns no result
I ran into a weird subsetting problem. I can subset on one column but cannot subset on another, even though both columns seem to have been parsed by readHTMLTable in the same way.
The code to replicate:
require(XML)
theurl <- "http://en.wikipedia.org/wiki/list_of_stock_exchanges"
html <- htmlParse(theurl)
sedata <- readHTMLTable(html)[[2]]
names(sedata) = c("rank","ex","economy","hq","marketcap","tradevalue")
# strip the thousands separators before converting to numeric
sedata = transform(sedata,marketcap = as.numeric(gsub(",","",marketcap)))
sedata = transform(sedata,tradevalue = as.numeric(gsub(",","",tradevalue)))
I want to subset the Indian stock exchanges, so I used:
> subset(sedata,sedata$economy == "india")
[1] rank ex economy hq marketcap tradevalue
<0 rows> (or 0-length row.names)
> subset(sedata,sedata$economy == " india")
[1] rank ex economy hq marketcap tradevalue
<0 rows> (or 0-length row.names)
I don't get any rows back, despite having validated that there are 2 rows that should satisfy the condition. Yet I can do the same thing on the other column, "ex":
> subset(sedata,sedata$ex == "jse limited")
   rank          ex     economy           hq marketcap tradevalue
17   17 jse limited southafrica johannesburg       903        287
I've run other functions on them, and the 2 columns look the same:
> sapply(sedata,class)
      rank         ex    economy         hq  marketcap tradevalue
  "factor"   "factor"   "factor"   "factor"  "numeric"  "numeric"
> levels(sedata$economy)
[1] " australia" " brazil" " canada"
[4] " china" " germany" " hong kong"
[7] " india" " japan" " russia"
...
> levels(sedata$ex)
[1] "australian securities exchange" "bme spanish exchanges"
[3] "bm&f bovespa" "bombay stock exchange"
[5] "deutsche börse" "hong kong stock exchange"
[7] "jse limited" "korea exchange"
...
What did I miss? Is the subsetting command I used wrong? :(
subset(sedata,sedata$economy == " india")
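As mentioned in the comments, the issue occurs due to non-standard characters in the data. I am guessing this happens due to character encoding settings: the leading character printed in " india" is probably not an ordinary space, so comparing against " india" finds no match.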
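One way to check this guess is to look at the code points of a level and then strip the offending character. The sketch below uses a small made-up factor in place of sedata$economy, and it assumes the hidden character is a non-breaking space (U+00A0). That is only an assumption about the scraped table, not something confirmed here.

```r
# Hypothetical stand-in for sedata$economy. The leading character is a
# non-breaking space (U+00A0); this is an assumption about the real data.
economy <- factor(c("\u00a0india", "\u00a0japan", "\u00a0india"))

# Comparing against an ordinary space matches nothing.
n_before <- sum(economy == " india")

# Inspect the code points of the first level to see what the character is.
# An ordinary space is 32; a non-breaking space is 160.
first_code_point <- utf8ToInt(levels(economy)[1])[1]

# Remove the non-breaking space from every level, then compare again.
levels(economy) <- gsub("\u00a0", "", levels(economy), fixed = TRUE)
n_after <- sum(economy == "india")
```

If the first code point in the real data turns out to be something other than 160, the same gsub call can be pointed at that character instead.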
Instead of using subset, you may want to use standard [ subsetting with grepl. grepl gives you a vector of logical values, which you can use to subset the data frame. It allows partial matching as an added bonus.
> sedata[grepl('india', sedata$ex),]
   rank                               ex economy     hq marketcap tradevalue
11   11 national stock exchange of india   india bombay      1234        442
Edit
grepl will also work within the subset function:
> subset(sedata, grepl('india', sedata$ex) )
   rank                               ex economy     hq marketcap tradevalue
11   11 national stock exchange of india   india bombay      1234        442
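The same approach can be applied to the economy column that the question was actually about, since a partial match does not care what the leading character is. Here is a minimal sketch on a toy data frame; the name toy, its values and the non-breaking space prefix are all made up for illustration.

```r
# Toy data frame standing in for sedata (values are invented for illustration).
toy <- data.frame(
  ex = c("national stock exchange of india", "jse limited", "bombay stock exchange"),
  economy = c("\u00a0india", "\u00a0south africa", "\u00a0india"),
  stringsAsFactors = TRUE
)

# grepl() returns one logical value per row; a factor is coerced to character.
hits <- grepl("india", toy$economy, fixed = TRUE)

# Standard [ subsetting with the logical vector.
by_bracket <- toy[hits, ]

# Inside subset() the column can be referred to by name, without the toy$ prefix.
by_subset <- subset(toy, grepl("india", economy, fixed = TRUE))
```

Because the match is partial, a pattern such as "india" would also pick up any other value that merely contains those letters, so it is worth checking the returned rows.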