Using Regular Expressions

1  Writing a Function

plothistory = function(symbol,what){
   # match.arg returns the (possibly partially matched) choice, so save the result
   what = match.arg(what,c('Open','High','Low','Close','Volume','Adj.Close'))
   data = gethistory(symbol)
   plot(data$Date,data[,what],main=paste(what,'price for',symbol),type='l')
   invisible(data)
}

This function introduces two features of functions that we haven't seen yet: match.arg, which checks an argument against a set of allowed choices, and invisible, which returns a value without automatically printing it.
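A minimal sketch of how these features behave, using a toy function (the name pickone and its choices are invented for illustration):

```r
# Hypothetical function illustrating match.arg and invisible
pickone = function(what){
   # match.arg stops with an informative error unless what (partially)
   # matches one of the choices; it returns the full matched choice
   what = match.arg(what,c('Open','High','Low','Close'))
   # invisible returns the value without printing it at the prompt
   invisible(what)
}
x = pickone('Op')    # nothing prints, but x now holds 'Open'
```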

2  Another Example

When using google, it's sometimes inconvenient to have to click through all the pages. Let's write a function that will return the web links from a google search. If you type a search term into google, for example a search for "introduction to r", you'll notice that the address bar of your browser looks something like this:
http://www.google.com/search?q=introduction+to+r&ie=utf-8&oe=utf-8&aq=t&rls=com.ubuntu:en-US:unofficial&client=firefox-a

For our purposes, we only need the "q=" part of the search. For our current example, that would be the URL
http://www.google.com/search?q=introduction+to+r

Note that, since blanks aren't allowed in URLs, plus signs are used in place of spaces. If we were to click on the "Next" link at the bottom of the page, the URL changes to something like
http://www.google.com/search?hl=en&safe=active&client=firefox-a&rls=com.ubuntu:en-US:unofficial&hs=xHq&q=introduction+to+r&start=10&sa=N

For our purposes, we only need to add the &start= argument to the web page. Since google displays 10 results per page, the second page will have start=10, the next page will have start=20, and so on. Let's read in the first page of this search into R:
z = readLines('http://www.google.com/search?q=introduction+to+r')
Warning message:
In readLines("http://www.google.com/search?q=introduction+to+r") :
  incomplete final line found on 'http://www.google.com/search?q=introduction+to+r'

As always, you can safely ignore the message about the incomplete final line.
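The URL manipulations we just worked out (keep only the q= part, replace spaces with plus signs, and add &start= for pages after the first) are easy to wrap up in a small helper; the name makeurl is invented here for illustration:

```r
# Hypothetical helper: build the google search URL for a term and page number
makeurl = function(term, pg=0){
   qurl = paste('http://www.google.com/search?q=', gsub(' ','+',term), sep='')
   if(pg > 0) qurl = paste(qurl, '&start=', pg * 10, sep='')
   qurl
}
makeurl('introduction to r')      # the URL we just read
makeurl('introduction to r', 1)   # the second page of results: start=10
```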
Since we're interested in the web links, we only want lines with "href=" in them. Let's check how many lines we've got, how long they are, and which ones contain the href string:
> length(z)
[1] 17
> nchar(z)
 [1]   369   208   284    26   505    39 40605  1590   460   291   152   248
[13]   317   513   507     5     9
> grep('href=',z)
[1] 5 7 

Since the seventh line is far longer than the others, it's pretty clear that all of the links we're after are on that line.
Now we can construct a tagged regular expression to grab all the links.
> hrefpat = 'href *= *"([^"]*)"'
> getexpr = function(s,g)substring(s,g,g+attr(g,'match.length')-1)
> gg = gregexpr(hrefpat,z[[7]])
> res = mapply(getexpr,z[[7]],gg)
> res = sub(hrefpat,'\\1',res)
> res[1:10]
 [1] "http://images.google.com/images?q=introduction+to+r&um=1&ie=UTF-8&sa=N&hl=en&tab=wi"                          
 [2] "http://video.google.com/videosearch?q=introduction+to+r&um=1&ie=UTF-8&sa=N&hl=en&tab=wv"                      
 [3] "http://maps.google.com/maps?q=introduction+to+r&um=1&ie=UTF-8&sa=N&hl=en&tab=wl"                              
 [4] "http://news.google.com/news?q=introduction+to+r&um=1&ie=UTF-8&sa=N&hl=en&tab=wn"                              
 [5] "http://www.google.com/products?q=introduction+to+r&um=1&ie=UTF-8&sa=N&hl=en&tab=wf"                           
 [6] "http://mail.google.com/mail/?hl=en&tab=wm"                                                                    
 [7] "http://www.google.com/intl/en/options/"                                                                       
 [8] "/preferences?hl=en"                                                                                           
 [9] "https://www.google.com/accounts/Login?hl=en&continue=http://www.google.com/search%3Fq%3Dintroduction%2Bto%2Br"
[10] "http://www.google.com/webhp?hl=en"                    
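
To see exactly what the pattern and the getexpr helper are doing, we can try them on a small invented snippet of HTML; since there's only one string here, we can call getexpr directly instead of going through mapply:

```r
hrefpat = 'href *= *"([^"]*)"'
getexpr = function(s,g)substring(s,g,g+attr(g,'match.length')-1)
# an invented fragment with two links, one with spaces around the =
html = '<a href="http://one.example/">one</a> <a href = "two.html">two</a>'
g = gregexpr(hrefpat,html)[[1]]   # starting positions, with match lengths
res = getexpr(html,g)             # the full href="..." texts
sub(hrefpat,'\\1',res)            # just the quoted parts:
                                  # "http://one.example/" "two.html"
```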

We don't want the internal (google) links - we want the external links, which will begin with "http://" or "https://". Let's extract all the external links, and then eliminate the ones that just go back to google:
> refs = res[grep('^https?:',res)]
> refs = refs[-grep('google.com/',refs)]
> refs[1:3]
 [1] "http://cran.r-project.org/doc/manuals/R-intro.pdf"                                                                                                                   
 [2] "http://74.125.155.132/search?q=cache:d4-KmcWVA-oJ:cran.r-project.org/doc/manuals/R-intro.pdf+introduction+to+r&cd=1&hl=en&ct=clnk&gl=us&ie=UTF-8"                    
 [3] "http://74.125.155.132/search?q=cache:d4-KmcWVA-oJ:cran.r-project.org/doc/manuals/R-intro.pdf+introduction+to+r&cd=1&hl=en&ct=clnk&gl=us&ie=UTF-8"

If you're familiar with google, you may recognize these as the links to google's cached results. We can easily eliminate them:
> refs = refs[-grep('cache:',refs)]
> length(refs)
[1] 10
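
One caution about the refs[-grep(...)] idiom: when the pattern matches nothing, grep returns integer(0), and subscripting with -integer(0) selects nothing at all, silently emptying the vector. A small sketch of the gotcha (using two of the links found above), with grepl as a safer alternative:

```r
x = c('http://cran.r-project.org/','http://www.stat.berkeley.edu/')
grep('cache:',x)         # integer(0): no element matches
x[-grep('cache:',x)]     # character(0): the empty negative index drops everything!
x[!grepl('cache:',x)]    # grepl gives c(FALSE,FALSE), so both elements survive
```

So the negative-grep form is only safe when you know at least one match is present; otherwise use grepl.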

We can test these same steps with some of the other pages from this query:
> z = readLines('http://www.google.com/search?q=introduction+to+r&start=10')
Warning message:
In readLines("http://www.google.com/search?q=introduction+to+r&start=10") :
  incomplete final line found on 'http://www.google.com/search?q=introduction+to+r&start=10'
> hrefpat = 'href *= *"([^"]*)"'
> getexpr = function(s,g)substring(s,g,g+attr(g,'match.length')-1)
> gg = gregexpr(hrefpat,z[[7]])
> res = mapply(getexpr,z[[7]],gg)
Error in substring(s, g, g + attr(g, "match.length") - 1) : 
  invalid multibyte string at '<93>GNU S'

Unfortunately, there seems to be a problem; fortunately, it's easy to fix. The message is telling us that one of the results contains a character that isn't valid in the current locale (the character-set conventions for the language, English, that we're using). We can work around this by switching to the "C" locale, which treats strings as plain bytes:
> Sys.setlocale('LC_ALL','C')
> res = mapply(getexpr,z[[7]],gg)

Since we no longer get the error, we can continue:
> res = sub(hrefpat,'\\1',res)
> refs = res[grep('^https?:',res)]
> refs = refs[-grep('google.com/',refs)]
> refs = refs[-grep('cache:',refs)]
> length(refs)
[1] 10

Once again, it found all ten links. This obviously suggests a function:
googlerefs = function(term,pg=0){
  getexpr = function(s,g)substring(s,g,g+attr(g,'match.length')-1)
  qurl = paste('http://www.google.com/search?q=',term,sep='')
  if(pg > 0)qurl = paste(qurl,'&start=',pg * 10,sep='')
  qurl = gsub(' ','+',qurl)
  z = readLines(qurl)
  hrefpat = 'href *= *"([^"]*)"'
  wh = grep(hrefpat,z)[2]
  gg = gregexpr(hrefpat,z[[wh]])
  res = mapply(getexpr,z[[wh]],gg)
  res = sub(hrefpat,'\\1',res)
  refs = res[grep('^https?:',res)]
  refs = refs[!grepl('google.com/|cache:',refs)]  # grepl, so nothing is lost when there are no matches
  names(refs) = NULL
  refs[!is.na(refs)]
}

Now suppose that we want to retrieve the links for the first ten pages of query results:
> links = sapply(0:9,function(pg)googlerefs('introduction to r',pg))
> links = unlist(links)
> head(links)
[1] "http://cran.r-project.org/doc/manuals/R-intro.pdf"             
[2] "http://cran.r-project.org/manuals.html"                        
[3] "http://www.biostat.wisc.edu/~kbroman/Rintro/"                  
[4] "http://faculty.washington.edu/tlumley/Rcourse/"                
[5] "http://www.stat.cmu.edu/~larry/all-of-statistics/=R/Rintro.pdf"
[6] "http://www.stat.berkeley.edu/~spector/R.pdf"                   



