Using Regular Expressions
1 Writing a Function
plothistory = function(symbol,what){
  what = match.arg(what,c('Open','High','Low','Close','Volume','Adj.Close'))
  data = gethistory(symbol)
  plot(data$Date,data[,what],main=paste(what,'price for',symbol),type='l')
  invisible(data)
}
This function introduces several features of functions that we
haven't seen yet.
- The match.arg function lets us specify a list of acceptable values
for an argument passed to a function, so that an error will occur
if an unacceptable value is provided:
> plothistory('aapl','Last')
Error in match.arg("what", c("Open", "High", "Low", "Close", "Volume", :
'arg' should be one of "Open", "High", "Low", "Close", "Volume", "Adj.Close"
- The invisible function prevents the returned value (data in
this case) from being printed when the output of the function is not
assigned, while still allowing us to assign the output to a variable if we
want to use it later:
> plothistory('aapl','Close')
will produce a plot without displaying the data to the screen, but
> aapl.close = plothistory('aapl','Close')
will produce the plot, and store the data in aapl.close.
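Note that match.arg also performs partial matching: an unambiguous
abbreviation of one of the acceptable values is accepted and expanded to
the full string, which is why it's good practice to assign its return
value back to the argument rather than discard it. A minimal sketch:

```r
# match.arg expands an unambiguous abbreviation to the full choice
choices = c('Open','High','Low','Close','Volume','Adj.Close')
match.arg('Vol', choices)
# [1] "Volume"
```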
2 Another Example
When using google, it's sometimes inconvenient to have to click through
all the pages. Let's write a function that will return the web links
from a google search. If you type a search term into google, for example
a search for 'introduction to r', you'll notice that the address bar of
your browser looks something like this:
http://www.google.com/search?q=introduction+to+r&ie=utf-8&oe=utf-8&aq=t&rls=com.ubuntu:en-US:unofficial&client=firefox-a
For our purposes, we only need the "q=" part of the
search. For our current example, that would be the URL
http://www.google.com/search?q=introduction+to+r
Note that, since blanks aren't allowed in URLs, plus signs are
used in place of spaces. If we were to click on the "Next" link at the
bottom of the page, the URL changes to something like
http://www.google.com/search?hl=en&safe=active&client=firefox-a&rls=com.ubuntu:en-US:unofficial&hs=xHq&q=introduction+to+r&start=10&sa=N
For our purposes, we only need to add the &start= argument
to the web page. Since google displays 10 results per page, the second page
will have start=10, the next page will have start=20, and
so on. Let's read in the first page of this search into R:
> z = readLines('http://www.google.com/search?q=introduction+to+r')
Warning message:
In readLines("http://www.google.com/search?q=introduction+to+r") :
incomplete final line found on 'http://www.google.com/search?q=introduction+to+r'
As always, you can safely ignore the message about the incomplete
final line.
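As an aside, the paged URLs described above can be constructed
mechanically with paste; a small sketch (the page numbers here are just
for illustration):

```r
# construct the query URLs for the first three result pages (10 hits per page)
base = 'http://www.google.com/search?q=introduction+to+r'
urls = c(base, paste(base, '&start=', c(10, 20), sep=''))
urls
```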
Since we're interested in the web links, we only want lines with
"href=" in them. Let's check how many lines we've got,
how long they are, and which ones contain the href string:
> length(z)
[1] 17
> nchar(z)
[1] 369 208 284 26 505 39 40605 1590 460 291 152 248
[13] 317 513 507 5 9
> grep('href=',z)
[1] 5 7
Since the seventh line is by far the longest, it's pretty clear that
essentially all of the links are on that line. Now we can construct a
tagged regular expression to grab all the links.
> hrefpat = 'href *= *"([^"]*)"'
> getexpr = function(s,g)substring(s,g,g+attr(g,'match.length')-1)
> gg = gregexpr(hrefpat,z[[7]])
> res = mapply(getexpr,z[[7]],gg)
> res = sub(hrefpat,'\\1',res)
> res[1:10]
[1] "http://images.google.com/images?q=introduction+to+r&um=1&ie=UTF-8&sa=N&hl=en&tab=wi"
[2] "http://video.google.com/videosearch?q=introduction+to+r&um=1&ie=UTF-8&sa=N&hl=en&tab=wv"
[3] "http://maps.google.com/maps?q=introduction+to+r&um=1&ie=UTF-8&sa=N&hl=en&tab=wl"
[4] "http://news.google.com/news?q=introduction+to+r&um=1&ie=UTF-8&sa=N&hl=en&tab=wn"
[5] "http://www.google.com/products?q=introduction+to+r&um=1&ie=UTF-8&sa=N&hl=en&tab=wf"
[6] "http://mail.google.com/mail/?hl=en&tab=wm"
[7] "http://www.google.com/intl/en/options/"
[8] "/preferences?hl=en"
[9] "https://www.google.com/accounts/Login?hl=en&continue=http://www.google.com/search%3Fq%3Dintroduction%2Bto%2Br"
[10] "http://www.google.com/webhp?hl=en"
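The same extraction steps can be checked on a small test string, which
makes it easier to see what getexpr and the backreference substitution
are doing (the URLs below are made up for the example):

```r
# extract the quoted targets of href attributes from a toy string
hrefpat = 'href *= *"([^"]*)"'
getexpr = function(s,g) substring(s, g, g + attr(g,'match.length') - 1)
s = '<a href="http://one.example/">one</a> <a href = "two.html">two</a>'
g = gregexpr(hrefpat, s)[[1]]          # start positions, with match lengths
links = sub(hrefpat, '\\1', getexpr(s, g))
links
# [1] "http://one.example/" "two.html"
```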
We don't want the internal (google) links - we want external links which
will begin with "http://". Let's extract all the external links,
and then eliminate the ones that just go back to google:
> refs = res[grep('^https?:',res)]
> refs = refs[-grep('google.com/',refs)]
> refs[1:3]
[1] "http://cran.r-project.org/doc/manuals/R-intro.pdf"
[2] "http://74.125.155.132/search?q=cache:d4-KmcWVA-oJ:cran.r-project.org/doc/manuals/R-intro.pdf+introduction+to+r&cd=1&hl=en&ct=clnk&gl=us&ie=UTF-8"
[3] "http://74.125.155.132/search?q=cache:d4-KmcWVA-oJ:cran.r-project.org/doc/manuals/R-intro.pdf+introduction+to+r&cd=1&hl=en&ct=clnk&gl=us&ie=UTF-8"
If you're familiar with google, you may recognize these as the links to
google's cached results. We can easily eliminate them:
> refs = refs[-grep('cache:',refs)]
> length(refs)
[1] 10
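One caveat about negative indexing with grep: if the pattern matches
nothing, grep returns a zero-length vector, and indexing with the negation
of a zero-length vector returns an empty result rather than the whole
vector, silently discarding everything. Logical indexing with grepl avoids
this edge case:

```r
x = c('http://cran.r-project.org/', 'http://example.org/')
x[-grep('cache:', x)]    # no matches: character(0) -- everything is dropped!
x[!grepl('cache:', x)]   # logical indexing keeps all elements as intended
```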
We can test these same steps with some of the other pages from this
query:
> z = readLines('http://www.google.com/search?q=introduction+to+r&start=10')
Warning message:
In readLines("http://www.google.com/search?q=introduction+to+r&start=10") :
incomplete final line found on 'http://www.google.com/search?q=introduction+to+r&start=10'
> hrefpat = 'href *= *"([^"]*)"'
> getexpr = function(s,g)substring(s,g,g+attr(g,'match.length')-1)
> gg = gregexpr(hrefpat,z[[7]])
> res = mapply(getexpr,z[[7]],gg)
Error in substring(s, g, g + attr(g, "match.length") - 1) :
invalid multibyte string at '<93>GNU S'
Unfortunately, there seems to be a problem. Fortunately, it's easy
to fix. The message is telling us that one of the results contains a
character that isn't valid in the character encoding of our current
locale. We can work around this by switching to the C locale:
> Sys.setlocale('LC_ALL','C')
> res = mapply(getexpr,z[[7]],gg)
Since we no longer get the error, we can continue
> res = sub(hrefpat,'\\1',res)
> refs = res[grep('^https?:',res)]
> refs = refs[-grep('google.com/',refs)]
> refs = refs[-grep('cache:',refs)]
> length(refs)
[1] 10
Once again, we found all ten links. These steps naturally suggest
writing a function:
googlerefs = function(term,pg=0){
  getexpr = function(s,g)substring(s,g,g+attr(g,'match.length')-1)
  qurl = paste('http://www.google.com/search?q=',term,sep='')
  if(pg > 0)qurl = paste(qurl,'&start=',pg * 10,sep='')
  qurl = gsub(' ','+',qurl)
  z = readLines(qurl)
  hrefpat = 'href *= *"([^"]*)"'
  wh = grep(hrefpat,z)[2]
  gg = gregexpr(hrefpat,z[[wh]])
  res = mapply(getexpr,z[[wh]],gg)
  res = sub(hrefpat,'\\1',res)
  refs = res[grep('^https?:',res)]
  refs = refs[-grep('google.com/|cache:',refs)]
  names(refs) = NULL
  refs[!is.na(refs)]
}
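One caveat: Sys.setlocale changes the locale for the whole session. If
that's a concern, the previous setting can be saved and restored; a
sketch, using the LC_CTYPE category (which governs multibyte character
handling):

```r
# save the current character-handling locale so it can be restored later
oldloc = Sys.getlocale('LC_CTYPE')
Sys.setlocale('LC_CTYPE', 'C')
# ... run the extraction steps here ...
Sys.setlocale('LC_CTYPE', oldloc)
```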
Now suppose that we want to retrieve the links for the first
ten pages of query results:
> links = sapply(0:9,function(pg)googlerefs('introduction to r',pg))
> links = unlist(links)
> head(links)
[1] "http://cran.r-project.org/doc/manuals/R-intro.pdf"
[2] "http://cran.r-project.org/manuals.html"
[3] "http://www.biostat.wisc.edu/~kbroman/Rintro/"
[4] "http://faculty.washington.edu/tlumley/Rcourse/"
[5] "http://www.stat.cmu.edu/~larry/all-of-statistics/=R/Rintro.pdf"
[6] "http://www.stat.berkeley.edu/~spector/R.pdf"