Computing Resources and Tips

Generally

This is a compilation of help and suggestions useful for computing and version tracking, intended to help students in my group get off the ground. If you think of more, please feel free to tell me.

For help with parallel computing and using the cluster on SCF see their help pages http://statistics.berkeley.edu/computing (in particular the description of the cluster servers and the workshops and tutorials).
How I like my students to set up their work:

I like to have a project account on the SCF that we can both access, rather than storing work on your local SCF machines. This makes it easier to share, create git repositories, etc. I can also ask for significant space allotments for these projects that you won't have on your own account. Also, after students move on, I still have the account! I 'own' the account, and students ssh into it via ssh keys (SCF can help with this; see also https://kb.iu.edu/d/aews). Please don't put your research with me on your local SCF account, unless you are religiously using the git repository. Even then, the big objects that are not in the repository (data and results) should all be stored on the project account, usually on the scratch account.

I use Git for version tracking of our code and paper writing, with the repository on the SCF machines. This allows us to work on the same files and not worry about writing over each other's work (much better than Dropbox). This is primarily for text files (.tex, .Rmd, .R, etc). Do not include a lot of binary files (.pdf, .rdata, etc) on git unless they are fairly 'permanent' results needed for the long term (like pdfs needed by a LaTeX paper, final versions of a paper, etc.) -- just have your copies locally and transfer them around via scp/sftp. In particular, don't put in the compiled .pdf of LaTeX or Knitr files or any intermediate files like .aux, .md, etc.; they should be able to be created locally, and they are always getting overwritten, making a mess of the git repository for no reason.

Please continually commit your changes to git -- not just when I ask to see something. It is much better to have lots of small changes committed than a giant change every month. Please don't be worried that you will 'mess up' the repository. That's the point of git -- it's pretty hard to erase changes permanently.

I want my students to use Knitr to create reports, papers, and theses. This makes sure that the R code is there for making the plots, etc. and that I can tweak the plotting code as I want. There is likely to be extensive analysis that takes a lot of time and probably the cluster to run. I don't expect that to be in the knitr file. Rather, get meaningful, compact summaries of the results of your simulations or extensive data analysis/processing that can be the starting point of the analysis you want to show in the writeup. And of course, anything that doesn't take long (e.g. simple data transformations/cleaning of the data, etc.) should be in the knitr file.

If you are making a report or presentation, do put a copy of the compiled file in the repository, but in a different location (e.g. a 'Reports' folder) so that we can go back to it for reference. This is particularly important if you show something to collaborators and will later be tweaking the code. We need a 'hard' copy of what you showed them -- and this also includes the important individual pictures, not just the pdf/html report.

Always close your R session at the end of each day (without saving your session). If it will take you longer than 5 minutes to get back to where you were by rerunning your code, you should save some intermediate files (and have the corresponding code that you can run in BATCH mode to recreate those intermediate files). If you can't get back to where you were because you did something interactively, then it's good to discover that right away and not after a week of work.

It's better to have .txt files for your intermediate and final results rather than .rdata files, with rare exceptions. This will make the results more transferable and force you to create nice headers and identification for your files, rather than relying on multiple objects that link up. The exceptions are R objects that are too useful to break apart (e.g. the result of lm).

I give some brief help on these tools below.

Git Repositories

I would suggest using something like SourceTree to work with git repositories (and if you are managing a package use the built-in functionality in SourceTree to use gitflow for branches to separate development and public releases).

Here are links to a pretty good tutorial and the Pro Git book.

See the following article about git and GitHub to think how to integrate git into your workflow: Ten Simple Rules for Taking Advantage of git and GitHub

The following is basically just a list of useful commands compiled in one quick place. You should look at tutorials to understand what these are doing.

Getting Started (the minimum you need to learn)

Make Local Git repository: note that if you are making a brand new repository and plan to connect up to a central repository on the SCF server, it's better to just clone it, rather than following these instructions.

Inside the directory where you want the repository type
git init

Add files to the repository with
git add filename
You can use wildcards, or add everything with "*".

Commit files to be versioned with
git commit -a -m "Your message here"
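
The three steps above can be tried out safely in a throwaway directory. The following is a minimal sketch, not a required workflow: the directory name 'myProject', the file names, and the placeholder identity are all made up for illustration, and the `-c init.defaultBranch=master` setting is only there so the branch is called 'master' (as in the rest of these notes) regardless of how your git is configured.

```shell
set -e
work=$(mktemp -d)                        # scratch directory so nothing real is touched
cd "$work"
git -c init.defaultBranch=master init -q myProject
cd myProject
git config user.name "Demo User"         # placeholder identity, stored only in this repository
git config user.email "demo@example.com"

echo 'x <- rnorm(10)' > analysis.R       # a file we want versioned
echo 'scratch' > notes.txt               # a file we do not add

git add analysis.R                       # start tracking this one file
git commit -q -m "Add analysis script"   # record the first version

git status --short                       # notes.txt is listed as untracked (??)
```

Only analysis.R is versioned at this point; notes.txt stays outside the repository until it is added explicitly.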

Working on your local git repository

Once you have your repository set up locally, it will keep track locally of the differences in each file, but only of those files you request to be tracked, and only when you request it. So you need to 'add' files and 'commit' changes.

Check the status of files that have been modified, etc:
git status

To add files to be tracked (or to add files whose modifications should be included in the next commit if already being tracked):
git add filename

To commit changes to files to be tracked:
git commit -a -m "my message"
The '-a' says to add to the commit all modifications of files currently being tracked by git. If you forget this option, probably none of your changes will be committed; this is a common reason students think they have pushed their changes so I can see them, when in fact the changes aren't showing up on the other end. '-a' will not add new files, however. You must do that manually with 'git add'.
If you forget the '-m' in the 'git commit' command, a screen will come up for you to enter your message (usually in Vim, which is a pain).
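
The difference between modified files (picked up by '-a') and brand new files (not picked up) is easy to see in a scratch repository. This is only an illustrative sketch; the file names and identity are invented.

```shell
set -e
work=$(mktemp -d); cd "$work"
git -c init.defaultBranch=master init -q demo; cd demo
git config user.name "Demo User"; git config user.email "demo@example.com"

echo 'a <- 1' > tracked.R
git add tracked.R
git commit -q -m "Start tracking tracked.R"

echo 'a <- 2' > tracked.R                 # modify a file git already tracks
echo 'b <- 1' > new.R                     # create a file git has never seen

git commit -q -a -m "Update tracked.R"    # -a picks up the modification, but not new.R
git status --short                        # new.R still shows up as untracked (??)

git add new.R                             # new files must be added by hand
git commit -q -m "Add new.R"
```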

The above commands only add and commit your files to your local version control. To share it on the central repository, you have to make a connection to the central repository and then sync your changes to the central repository

Connect to existing central repository on SCF Server:

Make New Local Git repository from existing Remote repository (via SSH)
In the location you want the new repository to be created (i.e. 'cloned') type
git clone ssh://addressOfServerDirectory
For example:
git clone ssh://isoform@beren.berkeley.edu/accounts/projects/isoform/gitRepos/projectName

Connect Existing Local Git repository with Remote (see also this online tutorial)
Inside the directory that has the git repository, type
git remote add origin ssh://addressOfServer
For example
git remote add origin ssh://isoform@beren.berkeley.edu/accounts/projects/isoform/gitRepos/projectName

To check what are your remote connections:
git remote -v

To change the url of a remote:
git remote set-url origin ssh://addressOfServer

Syncing with the central repository on SCF Server

git will not allow you to push anything that conflicts with what is on the repository (e.g. changes uploaded to the repository by someone else). It forces you to incorporate those changes locally before you inflict yours on the rest of the world. Usually this is done automatically by default merging, but sometimes, if git can't resolve the differences between the two versions, you have to manually go and fix the files.

So generally you need to first 'commit' your changes, 'pull' from the central repos., if needed fix any differences and commit them (usually done automatically), then 'push' your new (merged) version to the repository.

Commit your changes:
git commit -a -m "My changes"

Pull down the commits from the central:
git pull
This will automatically merge your work with the remote version. If there are conflicts between your changes and the remote's, you will have to manually fix them before you can go any further (see merging below).

Push your (committed) changes to the central:
git push
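
The commit, pull, push cycle can be rehearsed entirely on one machine by letting a local bare repository stand in for the central one. The sketch below makes several assumptions for illustration: the directory names ('central.git', 'mine', 'theirs'), the file names, and the identities are invented, and on the SCF the remote would be an ssh:// address instead of a local path. One thing to be aware of: recent versions of git may refuse a plain 'git pull' when your history and the remote's have diverged until you say how to reconcile them, so the sketch passes --no-rebase to ask for the merge behavior described above.

```shell
set -e
work=$(mktemp -d); cd "$work"

# Stand-in for the central repository.
git -c init.defaultBranch=master init -q --bare central.git

# My working copy: create a file and publish it.
git -c init.defaultBranch=master clone -q central.git mine 2>/dev/null
cd mine
git config user.name "Me"; git config user.email "me@example.com"
echo 'x <- 1' > analysis.R
git add analysis.R
git commit -q -m "Add analysis.R"
git push -q -u origin master
cd ..

# A collaborator clones, adds a different file, and pushes.
git clone -q central.git theirs
cd theirs
git config user.name "Them"; git config user.email "them@example.com"
echo 'notes' > README.txt
git add README.txt
git commit -q -m "Add README"
git push -q
cd ..

# Back in my copy: commit first, then pull, then push.
cd mine
echo 'y <- 2' >> analysis.R
git commit -q -a -m "My changes"
git pull -q --no-rebase --no-edit     # fetches and merges the collaborator's commit
git push -q                           # central now holds the merged history
```

Because the two sets of changes touch different files, git merges them without any manual work.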

Useful for day-to-day

Setup a Central Git on SCF Server: If there isn't an existing central repository, you can create one on the SCF project account.

If there is not a folder in the home directory called 'gitRepos' create it. This is where I like all of the repositories for the account to be saved.

Under the folder gitRepos, make a folder for this repository, preferably with the extension '.git' at the end

Inside that folder, initialize the repository by
git init --bare

Note that because this is a 'bare' repository, it will have no working 'files' that you can look at, only git's internal record of the history that has been pushed to it. This keeps anyone from accidentally trying to work off of this directory.

It can frequently be useful to have a copy on the SCF account that you *can* work off of and run scripts from, etc. You should do this by cloning a copy of the central repository to an appropriate directory on the account (not under 'gitRepos'!). This local copy will need to push and pull to the repository just like you would from your computer.
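
As a rough sketch of the layout described above, the following uses a temporary directory as a stand-in for the project account's home directory; 'projectName' is a placeholder, and the `-c init.defaultBranch=master` setting is only there to keep the branch name consistent with the rest of these notes.

```shell
set -e
account=$(mktemp -d)                     # stand-in for the project account's home directory
cd "$account"

mkdir -p gitRepos/projectName.git        # all central repositories live under gitRepos
cd gitRepos/projectName.git
git -c init.defaultBranch=master init -q --bare
git rev-parse --is-bare-repository       # prints "true"
ls                                       # only git's internal files, nothing to edit
cd "$account"

# A separate working copy on the same account, for running scripts.
git -c init.defaultBranch=master clone -q gitRepos/projectName.git projectName 2>/dev/null
git -C projectName remote -v             # origin points at the bare repository
```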
Merging/Fetching (see also documentation):

If you are combining your information with another's via pull, then git will first 'fetch' changes from the central repos and then 'merge' them together. git will try to resolve any conflicts automatically. If there is a conflict it can't resolve, then you have to manually resolve it before you can push any changes (i.e. you can't add something that will break things for other people).

It can be better to do this separately if there are major changes between them. (post on why better)
git fetch

git merge origin/master

"origin/master" refers to the local copy of origin that you just fetched.

The main reason it can be safer is that you can look and see what will happen and change options (merge has different strategies and options for merging, including 'ours' and 'theirs', that define where the priority should be given). After you've gotten used to using git via pull, I would encourage you to try this way and the techniques below, and to be careful before you use automatic merging. I have 'lost' significant edits via merging -- of course I was able to recover them because they were all on old commits!

Resolving conflicts: Usually the best way to resolve conflicts is to use
git mergetool

This should be done after git merge. This will launch a graphical tool to compare and choose what you want to keep. (You may have to set what the graphical tool should be; Mac has opendiff if you've installed Xcode. On Windows you will probably need to download a tool.)

By default, the results of merging are 'committed'. The program needs a merge message, which can be given with the option -m so it doesn't open an edit screen.

You can do a merge without committing
git merge --no-commit origin/master

This allows you to look at the changes (e.g. with a difftool between HEAD and what's in your local directory, see below). If you're okay with them, you can then commit them like normal.

If you did a merge that resulted in conflicts, you can undo the merge (at least for committed files, see documentation) with
git merge --abort

Advanced look at what a merge will do
If you do git fetch, you can then check out what would happen if you merge before you merge. For example after a fetch,

The following gives you a list of each file that would change:

git diff --name-status origin/master

The following allows you to look, file by file, at the differences between them
git difftool origin/master

You can also do difftool with just a single file,
git difftool origin/master path/to/file

difftool is just the gui version of the diff command, which gives you text that you can also scroll through on your screen. See rholmes's response to this question for a summary of the different syntax used by diff.
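
The fetch, inspect, merge-without-committing, and abort steps from this section can be strung together as follows. This is a sketch under assumptions: the repositories, file names, and identities are invented, and a local bare repository stands in for the central one. One detail worth knowing is that --no-commit cannot stop a fast-forward (the case where you have no new commits of your own), so the sketch adds --no-ff to make sure the merge really does pause before committing.

```shell
set -e
work=$(mktemp -d); cd "$work"
git -c init.defaultBranch=master init -q --bare central.git
git -c init.defaultBranch=master clone -q central.git mine 2>/dev/null
cd mine
git config user.name "Me"; git config user.email "me@example.com"
echo 'x <- 1' > analysis.R
git add analysis.R; git commit -q -m "Add analysis.R"
git push -q -u origin master
cd ..

# A collaborator changes the same file and pushes.
git clone -q central.git theirs
cd theirs
git config user.name "Them"; git config user.email "them@example.com"
echo 'x <- 100' > analysis.R
git commit -q -a -m "Change x"
git push -q
cd ../mine

git fetch -q                                      # updates origin/master; my files are untouched
git diff --name-status origin/master              # lists analysis.R as modified (M)
git merge -q --no-commit --no-ff origin/master    # stage the merge result, but stop before committing
git diff --cached --stat                          # look at what the merge would record
git merge --abort                                 # changed my mind: back to the pre-merge state
```

After the abort, analysis.R is back to my version and no merge is in progress; the fetched commits are still available under origin/master for a later merge.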

Reverting/Undoing

If you want to look at an old commit, you want the checkout command (see good tutorial here)
git checkout XXX
where XXX is the name of the commit. You should make sure that any changes you've made have been committed or you will lose them. checkout brings the entire state of your project at that commit into your directory (replacing the files from your current commit).

To get back to where you were, you type,
git checkout master

checkout is good for looking at old commits and figuring out where what you want is (checkout is also for moving between branches). But if you want to revert to an old commit or start making changes based on an old commit, this isn't a good way to do it (if you accidentally do this, you're in a 'detached HEAD state'; see this help page for what you should do). The better way to get rid of changes and revert is git revert; or, to work on both simultaneously, create a new branch.
How do you find the name of a commit? You usually use the log command
git log

This spits out the information from each commit, including your comments. In the below commit the name of the commit is the long string of numbers/letters (682d89ca9e04d8c274d2d89419f9bb8a1b142c5a) though you don't usually have to type the whole thing when you checkout (probably just enough of the beginning to uniquely identify it)
commit 682d89ca9e04d8c274d2d89419f9bb8a1b142c5a
Author: Elizabeth Purdom
Date:   Fri Aug 7 17:10:55 2015 -0700

    edits to main text
This is when you discover the utility (or not) of the comments you made at each change.

History of a single file:
git log -- filename
Log formats
log has a lot of options for how to format the output (see useful description here). For example try the following commands:
git log --pretty=oneline
git log --stat
git log --author=epurdom -n 3
git log --pretty=format:"%h %ad | %s%d [%an]" --graph --date=short
The last shows the graph of the commits across users.

Undo changes since your last commit (see also this tutorial)
Frequently, you want to undo everything you've done and go back to your last commit. This happens frequently if you use a synchronization program for your computer, and as a result you've 'updated' everything in your git directory, but you don't want those changes via a standard copy. You want to pull them in properly via a pull from the repository. You also might realize that you didn't pull down the changes that are on the repos, and rather than dealing with conflicts, you just want to trash anything you've done. You can do
git checkout -- .
The '.' just means everything in the directory you're in (so do it at the top of the git directory to do everything), but it can also be replaced with a specific filename.
You can also do
git reset --hard HEAD
to undo all uncommitted changes (note the difference: 'checkout' allows you to work on a single file or folder using standard unix abbreviations for files; git reset does everything. This is an in-depth blog on reset).
Note checkout reverts only those files in the git repos. What if you want to get rid of files that haven't been committed yet? For example, you copied those files manually from one computer to another (e.g. from the SCF machine to yours via scp), but then they get added to git on the SCF machine. You won't be able to pull unless you either a) add and commit your local versions of the files to the repository or b) delete your local versions. Committing your version of the files in this setting will be a headache (git doesn't nicely merge pdf files), but if you delete versioned files you will delete them on the repository, so you can't just delete everything in the directory (unless you are ready to reclone the repository) -- you need to get just those not already on the repository.
git clean -f -d
Be very careful, because there's no undoing these (-f means force and -d means remove directories too).
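
Since clean cannot be undone, it may be worth previewing what it would delete first: git clean accepts -n for a dry run. The sketch below is only an illustration in a scratch repository with invented file names.

```shell
set -e
work=$(mktemp -d); cd "$work"
git -c init.defaultBranch=master init -q demo; cd demo
git config user.name "Demo User"; git config user.email "demo@example.com"
echo 'x <- 1' > analysis.R
git add analysis.R; git commit -q -m "Add analysis.R"

echo 'x <- 999' > analysis.R        # uncommitted edit to a tracked file
mkdir copied
echo 'junk' > copied/plot.pdf       # untracked file, e.g. copied over by hand

git checkout -- .                   # throw away edits to tracked files
git clean -n -d                     # dry run: lists copied/ without deleting it
git clean -f -d                     # now actually delete untracked files and directories
git status --short                  # prints nothing: the directory matches the last commit
```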

Undo changes on committed files (see also this tutorial):
It's likely you want to get an old version of a single file to get something you lost, etc. If you want to look at a specific file from an old commit,
git checkout XXX path/to/file
This is different from checking out an entire commit, because now you've pulled just the old version of the file in the place of the current version of the file.

A simple thing to do would be to copy this old file to a new (unversioned) file name, and then get your current version back via
git checkout master path/to/file
Then you can compare the two (e.g. using difftool) and make edits to the current version from the old version.
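
Here is a small sketch of that copy-aside approach, in a scratch repository with invented file names. The commit name is captured with git rev-parse so the example does not depend on a particular hash, and '--' is used to separate the commit from the path, which is a little safer than the bare form above.

```shell
set -e
work=$(mktemp -d); cd "$work"
git -c init.defaultBranch=master init -q demo; cd demo
git config user.name "Demo User"; git config user.email "demo@example.com"

echo 'version 1' > report.Rmd
git add report.Rmd; git commit -q -m "First draft"
old=$(git rev-parse --short HEAD)        # abbreviated name of the first commit

echo 'version 2' > report.Rmd
git commit -q -a -m "Second draft"

git checkout "$old" -- report.Rmd        # the old version replaces the current file
cp report.Rmd report_old.Rmd             # keep it under a new, unversioned name
git checkout master -- report.Rmd        # bring back the current version

git status --short                       # only report_old.Rmd is listed, as untracked
```

At the end, report.Rmd holds the current text and report_old.Rmd holds the old one, ready to be compared.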

If you truly want to revert to the old file (or entirely to an old commit) rather than manually pulling in information from an old file, you probably want to use git revert. See this tutorial for the difference between checkout, revert, reset, and rebase in undoing changes.

Removing file from git without deleting local copy
git rm --cached myfile
Note that this keeps your local copy intact, but other users, once they pull your changes, will see their file get deleted (see stackoverflow question).
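
A quick way to convince yourself of what this does is to try it in a scratch repository. The file name below is made up; the point is only that the file leaves the repository but stays on disk.

```shell
set -e
work=$(mktemp -d); cd "$work"
git -c init.defaultBranch=master init -q demo; cd demo
git config user.name "Demo User"; git config user.email "demo@example.com"

echo 'big binary results' > results.rdata
git add results.rdata
git commit -q -m "Oops: committed a results file"

git rm -q --cached results.rdata          # stop tracking, keep the file on disk
git commit -q -m "Stop tracking results.rdata"

ls results.rdata                          # the local copy is still here
git status --short                        # it is now reported as untracked (??)
```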

Other useful tips

Aliases: this tutorial describes how to do this. It also has a collection of useful aliases
For example
git config --global alias.hist 'log --pretty=format:"%h %ad | %s%d [%an]" --graph --date=short'
creates a (global) git command on your machine (call with 'git hist') that makes a nicely formatted output of the log history.

To always commit .R, .m, .Rout files (always executed at the top of the directory, so it gets everything), see the following link.

For example, create a 'commitx' command:
git config alias.commitx '!git add *.Rout && git add *.m && git add *.R && git commit -a -m'

.gitignore files
Make good use of the .gitignore file so that your 'git status' command doesn't show a lot of directories or files that you don't want. And also so you don't dump annoying files onto the rest of the world when you do git add. There is a public repository of ignore files for common scenarios.
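
As an illustration only, the sketch below writes a small .gitignore for the kind of LaTeX and knitr intermediate files mentioned earlier and checks that git ignores them. The patterns are assumptions about how your files are named (for example, cache directories ending in '_cache'), not a recommended standard list.

```shell
set -e
work=$(mktemp -d); cd "$work"
git -c init.defaultBranch=master init -q demo; cd demo

cat > .gitignore <<'EOF'
# LaTeX intermediate files
*.aux
*.log
# knitr cache and figure directories (naming pattern is an assumption)
*_cache/
*_figure/
EOF

touch paper.tex paper.aux
mkdir report_cache
touch report_cache/chunk.rdb

git status --short                        # lists only .gitignore and paper.tex
git check-ignore -v paper.aux             # shows which pattern causes the file to be ignored
```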

Websites:
http://mislav.uniqpath.com/2010/07/git-tips/

Knitr and RMarkdown
Note that the button 'knit' in RStudio and knit2html do not work exactly the same. RStudio runs it in a vanilla environment, while knit2html by default uses the global environment and also leaves the output in your environment (though you can change that, see below); this can be handy sometimes for debugging, but also means you won't catch problems in your code that your global variables are masking. Also, RStudio will stop compiling if there are errors, while knit2html will compile and create an html, but the html will have the errors printed in it (again, you can change this). Both behaviors can be useful.

Further, current versions of Rmarkdown files created in RStudio may need to be compiled with `render`, rather than `knit2html`, if you are working at the command line.

Set global options at top of the document so you can change from echo=FALSE to echo=TRUE in one blow, for example.
knitr::opts_chunk$set(fig.align="center", cache=TRUE, cache.path = "filename_cache/", message=FALSE, echo=FALSE, results="hide", fig.path="filename_figure/")

Make the names of the chunks without spaces (knitr allows spaces, but then your files have spaces, which is a pain).

Use fig.width and fig.height in chunks so that your figures are well proportioned, especially if you set par(mfrow= ), so you don't get squished plots. You can create objects that define these, so you can change all of them at once.
figWidth2Col<-12
figHeight2Col<-6
```{r MyChunk,fig.width=figWidth2Col,fig.height=figHeight2Col}
par(mfrow=c(1,2))
#code
```

To give numbering to your figures in .html output, see the function capFig donated by a contributor to this thread

To run knitr at the unix/terminal command line, rather than within R (like R CMD Sweave), use Rscript
Rscript --vanilla -e "library(knitr); knit2html('test.Rmd');"
Or if you're using YAML headers, like in RStudio, you need
Rscript --vanilla -e "library(rmarkdown); render('test.Rmd');"
If you get the error
Error: pandoc version 1.12.3 or higher is required and was not found.
see this link for how to set the right location for pandoc when you run render outside of RStudio.

To run knit2html and get a new environment (i.e. don't use variables already in session) set 'envir=new.env()'. I think this is what RStudio does (RStudio may also detach the libraries)
Alternatively, running it in the global environment is a good way to 'load' everything in your working space so you can play around with it (especially true for cached chunks.)

To have it stop when you hit an error (like the 'knit' button in RStudio) set `error=FALSE` in the chunk options (or globally at the top).

Caching

Don't set cache=TRUE unless you are going to pay attention to dependencies between your chunks (e.g. using 'dependson'). Otherwise, your changes won't actually take effect in the chunks downstream. For example,
```{r combine}
#code here
```
```{r norm, dependson="combine"}
#code here
```
There are also options for finding dependencies automatically, which I don't usually use unless I'm doing something simple where I could just as easily delete the whole cache and rerun de novo. This is because I have found the autodependencies spotty in performance at times, though this is not a systematic evaluation.

It's probably a good idea to every so often delete your cache files and run it again regardless. This makes sure that you are catching all the updates you might have made along the way (sometimes I find caching operations finicky and giving unexpected results, especially with dependencies on libraries and reading in files). This is especially true if you are using child documents.

If you use cache and are importing external files in that chunk, you can use 'cache.extra' to make the cache depend on the time stamp of the file. See https://github.com/yihui/knitr/issues/238

Import libraries in an uncached chunk, particularly if you might change the version of a library or if they are used in multiple chunks. Changes to the library won't get detected by caching. Note, however, that any cached chunks that use the functions in that library will still not rerun when the library changes, unless you pass something about the version of the library to 'cache.extra' to invalidate the cache.

You can load cache files using 'lazyLoad' if you need to get the results of a cache.

Organizing multiple files
Generally you will not have a giant file with everything you have done, but it is important to be able to keep a record of how to stitch these together.

Source: You can of course just source a .R file in a knitr chunk. If you do this you need to make sure that it depends on a time stamp of that file (see above).

External Chunk definitions: A disadvantage of the above is that you can't annotate your steps so that you can go back and make some sense of them later, other than with comments. However, knitr allows you to pull out chunks of code from a .R file and use them in a .Rmd/.Rnw file using the `read_chunk` command (http://yihui.name/knitr/demo/externalization/).
Your R code would look like this:
##---- MyFirstChunk
...
##---- MySecondChunk
...
Then you would pull in the chunks like this
```{r readR, cache=FALSE}
knitr::read_chunk('myCode.R')
```
```{r MyFirstChunk, cache=FALSE}
```
```{r MySecondChunk, cache=TRUE, dependson="MyFirstChunk", fig.width=12}
```
This has many nice uses. You can reuse chunks in multiple files; you can source your .R file on its own when you don't want to annotate it (e.g. one .Rmd file gives you a reference as to what the preprocessing steps were, and another more polished version just calls it without comment); you can have more code in your .R file than you reference in your .Rmd file.

Spin: The above has the downside that it removes your annotation and notes about what you did from the actual code. I haven't tried it yet, but there are other options for making inline comments (roxygen-style) that get picked up by the function spin() (see http://yihui.name/knitr/demo/stitch/). You can put the text and the chunk definitions in a .R file (with appropriate commenting) so that if you 'spin' it, they are converted to either text or chunk definitions; otherwise they are just regular comments.

You can also have children .Rmd files that are read in by a parent file (http://yihui.name/knitr/demo/child/). This reads in everything, including the annotation, so it's really more for chapters, supplementary text, etc., where a single .Rmd/.Rnw file is getting too unwieldy. Another example would be if you want to make a similar kind of report over and over again with different input data.

For automating reports over a template, look at the following useful tips

Looping over .Rmd files: Calls rmarkdown::render in a for loop, so that the .Rmd files called make use of variables defined in the global environment (an easy way to set arguments for a .Rmd file).

Function run_chunk for sourcing the chunks from a .R file: This donated function gives the ability to selectively read in chunks defined in a .R file (like the external chunks used in a .Rmd file above), but you can do it in an R session or call it from an R file.

R Tutorials and Help

General Useful Information

Summary of Useful R commands by Elizabeth Purdom. Also limited version in pdf.

Saving/Moving R Objects; Libraries by Elizabeth Purdom

Writing Functions by Elizabeth Purdom (Rmarkdown available here)

Other useful webpages

Writing A Package: I strongly suggest you look at the tools available in `devtools` and RStudio and that you use Roxygen for writing your documentation files.

Building and Maintaining R Packages with devtools and roxygen2

Developing Packages with RStudio

R Packages by Hadley Wickham

Coding Tips

FasteR! HigheR! StrongeR! - A Guide to Speeding Up R Code for Busy People

Specific Teaching Labs (created for various courses)

General Introductions by Elizabeth Purdom:

Short Version--Barebones of R objects, Plotting, etc.

More Advanced -- Includes information regarding writing functions, more about R objects

Most Complete (except for exercises!) -- More advanced plotting examples and techniques, basic statistical techniques

Introduction to R/SPLUS; Bootstrap Calculations; Programming & Debugging by Noureddine El Karoui and Pei Wang

A Note on Frequency Tables by Elizabeth Purdom

Probability Calculations by Debashis Paul and updated by Elizabeth Purdom

Test/Power Calculations by Debashis Paul and updated by Claire Lunch

ANOVA by Elizabeth Purdom

Regression by Debashis Paul and Elizabeth Purdom

Unix
You will have to do a lot of work on the servers and will need to figure out unix commands. A few useful things to learn:

screen: This allows you to have multiple 'windows' on unix. If you are running R interactively, you probably want to do it via screen, because it allows you to leave the R session, go back to the unix terminal, and then re-enter R -- and from any connection. You can log off at the department and log back on at home and still re-enter your R session. However, the rule above still applies. Don't keep an R session running longer than a day without really good reason.

makefiles: These are very useful for automating a series of steps. For example, I always create a makefile that will compile my knitr file or latex file. For extensive simulations or processing of data, especially those that have dependencies (if file X updates, you need to rerun command A, but otherwise don't), you can code the dependencies in a makefile, and make will only rerun what is needed. Knitr caching does this to some degree, but I find it much trickier -- I always wind up 'forcing' it to rerun something or being frustrated that it reran something I didn't think changed.

Last updated 06/05/2015