Hee: R - Subsetting on a factor column parsed by readHTMLTable returns no result

Monday, 15 August 2011

I ran into a weird subsetting problem: I can subset on one column but not on another, even though both columns seem to have been parsed by readHTMLTable in the same way.

The code to replicate:

require(XML)
theurl <- "http://en.wikipedia.org/wiki/list_of_stock_exchanges"
html <- htmlParse(theurl)
sedata <- readHTMLTable(html)[[2]]
names(sedata) = c("rank","ex","economy","hq","marketcap","tradevalue")
sedata = transform(sedata,marketcap = as.numeric(gsub(",","",marketcap)))
sedata = transform(sedata,tradevalue = as.numeric(gsub(",","",tradevalue)))

I want to subset the Indian stock exchanges, so I used:

> subset(sedata,sedata$economy == "india")
[1] rank       ex         economy    hq         marketcap  tradevalue
<0 rows> (or 0-length row.names)
> subset(sedata,sedata$economy == " india")
[1] rank       ex         economy    hq         marketcap  tradevalue
<0 rows> (or 0-length row.names)

I don't get any rows back, despite having validated that there are 2 rows that should satisfy the condition. Yet I can do the same thing on the other column, "ex":

> subset(sedata,sedata$ex == "jse limited")
   rank          ex      economy           hq marketcap tradevalue
17   17 jse limited  southafrica johannesburg       903        287

I've run other functions on them, and the 2 columns look the same:

> sapply(sedata,class)
      rank         ex    economy         hq  marketcap tradevalue
  "factor"   "factor"   "factor"   "factor"  "numeric"  "numeric"
> levels(sedata$economy)
 [1] " australia"             " brazil"                " canada"
 [4] " china"                 " germany"               " hong kong"
 [7] " india"                 " japan"                 " russia"
...
> levels(sedata$ex)
 [1] "australian securities exchange"   "bme spanish exchanges"
 [3] "bm&f bovespa"                     "bombay stock exchange"
 [5] "deutsche bÃ¶rse"                  "hong kong stock exchange"
 [7] "jse limited"                      "korea exchange"
...

What did I miss? Did I use the wrong subsetting command? :(

subset(sedata,sedata$economy == " india")

As mentioned in the comments, the issue occurs because of non-standard characters in the data. I'm guessing this happens because of the character encoding settings.
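To see how an invisible character can defeat an exact comparison, here is a minimal sketch on toy data. It assumes the stray leading character is a non-breaking space (U+00A0); the post does not confirm which character is involved, and the vector `economy` below is a hypothetical stand-in for the parsed column.

```r
# Toy stand-in for the parsed column. The leading "\u00a0" (non-breaking
# space) is an ASSUMPTION about what the real data contains.
economy <- factor(c("\u00a0india", "\u00a0japan", "\u00a0india"))

# An ordinary space (0x20) does not match a non-breaking space,
# even though both print as a blank.
n_plain <- sum(economy == " india")
n_nbsp  <- sum(economy == "\u00a0india")

# Inspect the first code point of the first level to reveal the
# hidden character.
first_cp <- utf8ToInt(levels(economy)[1])[1]

print(c(plain = n_plain, nbsp = n_nbsp, first_code_point = first_cp))
```

Running the same `utf8ToInt` check on `levels(sedata$economy)` would show which character the real column actually starts with.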

Instead of using subset, you may want to use standard [ subsetting with grepl. grepl gives a vector of logical values, which you can then use to subset the data frame. As an added bonus, it allows partial matching:

> sedata[grepl('india', sedata$ex),]
   rank                                           ex               economy     hq marketcap tradevalue
11   11 national stock exchange of  republic of india Â  republic of india bombay      1234        442

Edit

grepl also works within the subset function:

> subset(sedata,  grepl('india', sedata$ex) )
   rank                                           ex               economy     hq marketcap tradevalue
11   11 national stock exchange of  republic of india Â  republic of india bombay      1234        442
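If exact matching with == is preferred over partial matching, one option is to normalise the factor levels first. The following is only a sketch under an assumption: that the stray character is a non-breaking space (U+00A0), which the post does not confirm. The names `sedata_toy` and `clean` are hypothetical and used for illustration only.

```r
# Toy data frame standing in for the parsed table. The "\u00a0" prefix
# is an ASSUMPTION about the hidden character in the real column.
sedata_toy <- data.frame(
  ex      = c("bombay stock exchange", "jse limited"),
  economy = c("\u00a0india", "\u00a0south africa"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: turn non-breaking spaces into ordinary spaces,
# then strip leading and trailing whitespace.
clean <- function(x) trimws(gsub("\u00a0", " ", x, fixed = TRUE))

# Rewrite the levels in place; the underlying factor codes are unchanged.
levels(sedata_toy$economy) <- clean(levels(sedata_toy$economy))

# An exact comparison now finds the row.
india_rows <- subset(sedata_toy, economy == "india")
print(india_rows)
```

Cleaning the levels once keeps later comparisons simple, whereas grepl needs no cleaning but may also match rows you did not intend.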
