Generally
This is a compilation of help and suggestions for computing and version tracking, intended to help students in my group get off the ground. If you think of more, please feel free to tell me.
For help with parallel computing and using the cluster on the SCF, see their help pages at http://statistics.berkeley.edu/computing (in particular the description of the cluster servers and the workshops and tutorials).
How I like my students to set up their work:
I like to have a project account that we can both access on the SCF, rather than storing work under your own SCF account. This makes it easier to share, create git repositories, etc. I can also request significant space allotments for these projects that you won't have on your own account. And after students move on, I still have the account! I 'own' the account, and students ssh into it via ssh keys (SCF can help with this; see also https://kb.iu.edu/d/aews). Please don't put your research with me on your own SCF account, except if you are religiously using the git repository -- and even then the big objects not in the repository (data and results) should all be stored on the project account, usually in the scratch space.
I use Git for version tracking of our code and paper writing, with the repository on the SCF machines. This allows us to work on the same files and not worry about writing over each other's work (much better than Dropbox). This is primarily for text files (.tex, .Rmd, .R, etc.). Do not include a lot of binary files (.pdf, .rdata, etc.) in git unless they are fairly 'permanent' results needed long term (like pdfs needed by a LaTeX paper, final versions of a paper, etc.) -- just keep your copies locally and transfer them around via scp/sftp. In particular, don't put in the compiled .pdf of LaTeX or knitr files, or any intermediate files like .aux, .md, etc.; they can be recreated locally, and they are constantly being overwritten, making a mess of the git repository for no reason.
Please continually commit your changes to git -- not just when I ask to see something. It is much better to have lots of small changes committed than a giant change every month. Please don't be worried that you will 'mess up' the repository. That's the point of git -- it's pretty hard to erase changes permanently.
I want my students to use Knitr to create reports, papers, and theses. This makes sure that the R code for making the plots, etc. is there, and that I can tweak the plotting code as I want. There are likely to be extensive analyses that take a lot of time and probably the cluster to run. I don't expect those to be in the knitr file. Rather, get meaningful, compact summaries of the results of your simulations or extensive data analysis/processing that can be the starting point of the analysis you want to show in the writeup. And of course, anything that doesn't take long (e.g. simple data transformations/cleaning of the data, etc.) should be in the knitr file.
If you are making a report or presentation, put a copy of the compiled file in a different location (e.g. a 'Reports' folder) so that we can go back to it for reference. This is particularly important if you show something to collaborators and will later be tweaking the code. We need a 'hard' copy of what you showed them -- and this also includes the important individual pictures, not just the pdf/html report.
Always close your R session at the end of each day (without saving your session). If it would take you longer than 5 minutes to get back to where you were by rerunning your code, you should save some intermediate files (and have the corresponding code that you can run in BATCH mode to recreate those intermediate files). If you can't get back to where you were because you did something interactively, then it's good to discover that right away and not after a week of work.
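For reference, a minimal sketch of rerunning such a script in batch mode from the terminal (the script name is illustrative):
R CMD BATCH --vanilla makeIntermediates.R
This runs the script non-interactively and leaves a log in makeIntermediates.Rout that you can check for errors.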
It's better to have .txt files for your intermediate and final results, rather than .rdata files, except for rare exceptions. This will make the results more transferable and force you to create nice headers and identification for your files, rather than relying on multiple objects that link up. (Some R objects are too useful to break apart, e.g. the result of lm.)
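For example, a sketch of writing and re-reading such a summary as tab-delimited text, under the assumption that `res` holds a data.frame of your summarized results (names are illustrative):
write.table(res, file="simSummary.txt", sep="\t", row.names=FALSE, quote=FALSE)
res <- read.table("simSummary.txt", header=TRUE, sep="\t")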
I give some brief help on these tools below.
Git Repositories
I would suggest using something like SourceTree to work with git repositories (and if you are managing a package use the built-in functionality in SourceTree to use gitflow for branches to separate development and public releases).
Here is a link to a pretty good tutorial and to the Pro Git book.
See the following article about git and GitHub to think how to integrate git into your workflow:
Ten Simple Rules for Taking Advantage of git and GitHub
The following are basically just lists of useful commands compiled in one quick place. You should look at tutorials to understand what these are doing.
Getting Started (Minimal you need to learn)
- Make a local git repository: note that if you are making a brand new repository and plan to connect it to a central repository on the SCF server, it's better to just clone it (see below), rather than initializing from scratch as in the sketch that follows.
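For completeness, a minimal sketch of initializing from scratch (directory and file names are illustrative):
cd myProject
git init
git add myScript.R
git commit -m "first commit"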
Working on your local git repository
Once you have your repository set up locally, it will keep track of the differences in each file, but only for those files you request to be tracked, and only when you request it. So you need to 'add' files and 'commit' changes.
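The basic commands (the file name is illustrative):
git add myScript.R
git commit -m "describe what you changed"
At any point, git status shows which files are tracked, modified, or staged.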
The above commands only add and commit your files to your local version control. To share your work, you have to make a connection to the central repository and then sync your changes to it.
Connect to an existing central repository on the SCF server:
- Make New Local Git repository from existing Remote repository (via SSH)
In the location where you want the new repository to be created (i.e. 'cloned'), type
git clone ssh://addressOfServerDirectory
For example:
git clone ssh://isoform@beren.berkeley.edu/accounts/projects/isoform/gitRepos/projectName
- Connect Existing Local Git repository with Remote (see also this online tutorial)
Inside the directory that has the git repository, type
git remote add origin ssh://addressOfServer
For example:
git remote add origin ssh://isoform@beren.berkeley.edu/accounts/projects/isoform/gitRepos/projectName
- To check what are your remote connections:
git remote -v
- To change the url of a remote:
git remote set-url origin ssh://addressOfServer
Syncing with the central repository on the SCF server
git will not allow you to push anything that conflicts with what is on the repository (e.g. changes uploaded to the repository by someone else). It forces you to incorporate those changes locally before you inflict them on the rest of the world. Usually this is done automatically by default merging, but sometimes, if git can't resolve the differences between the two versions, you have to manually go and fix the files.
So generally you need to first 'commit' your changes, then 'pull' from the central repository, fix any differences and commit them if needed (usually done automatically), then 'push' your new (merged) version to the repository.
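In commands, that cycle typically looks like:
git commit -a -m "my latest changes"
git pull
git push
(commit -a commits all tracked files that have changed; brand new files still need git add first.)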
Useful for day-to-day
Set up a central git repository on the SCF server: If there isn't an existing central repository, you can create one on the SCF project account.
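For example, a sketch assuming the repository should live under the project account's gitRepos directory (the path mirrors the clone examples above):
ssh isoform@beren.berkeley.edu
mkdir /accounts/projects/isoform/gitRepos/projectName
cd /accounts/projects/isoform/gitRepos/projectName
git init --bare
A 'bare' repository has no working copies of the files, which is what you want for a central repository that everyone pushes to and pulls from.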
It can frequently be useful to have a copy on the SCF account that you *can* work off of and run scripts from, etc. You should do this by cloning a copy of the central repository to an appropriate directory on the account (not under 'gitRepos'!). This local copy will need to push and pull to the repository just like you would from your computer.
Merging/Fetching (see also documentation):
If you are combining your information with another's via pull, then git will first 'fetch' changes from the central repository and then 'merge' them together. git will try to resolve any conflicts automatically (merge). If there is a conflict it can't resolve, then you have to manually resolve it before you can push any changes (i.e. you can't add something that will break things for other people).
It can be better to do these two steps separately if there are major changes between the versions (see this post on why):
git fetch
git merge origin/master
"origin/master" refers to the local copy of origin that you just fetched.
The main reason it can be safer is that you can look and see what will happen and change options (merge has different strategies for merging, including 'ours' and 'theirs', that define which side should be given priority). After you've gotten used to using git via pull, I would encourage you to try this way and the techniques below, and be careful before you use automatic merging. I have 'lost' significant edits via merging -- of course I was able to recover them, because they were all on old commits!
- Resolving conflicts: Usually the best way to resolve conflicts is to use
git mergetool
This should be done after git merge. It will launch a graphical tool to compare and choose what you want to keep. (You may have to set which graphical tool should be used; Mac has opendiff if you've installed Xcode. On Windows you will probably need to download a tool.)
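For example, on a Mac you might set:
git config --global merge.tool opendiff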
- By default, the results of merging are 'committed'. The merge needs a commit message, which can be given with the option -m so it doesn't open an edit screen.
- You can do a merge without committing:
git merge --no-commit origin/master
This allows you to look at the changes (e.g. with a difftool between HEAD and what's in your local directory, see below). If you're okay with them, you can then commit them like normal.
- If you did a merge that resulted in conflicts, you can undo the merge (at least for committed files, see documentation) with
git merge --abort
- Advanced look at what a merge will do
If you do git fetch, you can then see what a merge would do before you actually merge. For example, after a fetch:
- The following gives you a list of each file that would change:
git diff --name-status origin/master
- The following allows you to look, file by file, at the differences between them:
git difftool origin/master
You can also do difftool with just a single file,
git difftool origin/master path/to/file
difftool is just the gui version of the diff command, which gives you text output that you can scroll through on your screen. See rholmes' response to this question for a summary of the syntax used by diff.
Reverting/Undoing
If you want to look at an old commit, you want the checkout command (see good tutorial here)
git checkout XXX
where XXX is the name of the commit. You should make sure that any changes you've made have been committed, or you will lose them. checkout brings the entire state of the repository at that commit into your directory (and moves you off your current commit).
To get back to where you were, you type,
git checkout master
checkout is good for looking at old commits and figuring out where what you want is (checkout is also used for moving between branches). But if you want to revert to an old commit or start making changes based on an old commit, this isn't a good way to do it (if you do this accidentally, you're in a 'detached HEAD' state; see this help page for what you should do). The better way to get rid of changes and revert is git revert, or, to work simultaneously on both, create a new branch.
How do you find the name of a commit? You usually use the log command
git log
This spits out the information from each commit, including your comments. In the commit below, the name of the commit is the long string of numbers/letters (682d89ca9e04d8c274d2d89419f9bb8a1b142c5a), though you don't usually have to type the whole thing when you checkout (probably just enough of the beginning to uniquely identify it).
commit 682d89ca9e04d8c274d2d89419f9bb8a1b142c5a
Author: Elizabeth Purdom
Date: Fri Aug 7 17:10:55 2015 -0700
edits to main text
This is when you discover the utility (or not) of the comments you made at each change.
- History of a single file:
git log -- filename
Log formats
log has a lot of options for how to format the output (see useful description here). For example try the following commands:
git log --pretty=oneline
git log --stat
git log --author=epurdom -n 3
git log --pretty=format:"%h %ad | %s%d [%an]" --graph --date=short
The last shows the graph of the commits across users.
- Undo changes since your last commit (see also this tutorial)
Frequently, you want to undo everything you've done and go back to your last commit. This happens frequently if you use a synchronization program on your computer, and as a result you've 'updated' everything in your git directory via a standard copy, but you want to pull those changes in properly from the repository instead. You also might realize that you didn't pull down the changes that are on the repository, and rather than dealing with conflicts, you just want to trash everything you've done. You can do
git checkout -- .
The '.' just means everything in the directory you're in (so do it at the top of the git directory to do everything), but it can also be replaced with a specific filename.
You can also do
git reset --hard HEAD
to undo all uncommitted changes (note the difference: 'checkout' allows you to work on a single file or folder using standard unix abbreviations for files; git reset does everything. This is an in-depth blog on reset).
Note that checkout reverts only those files in the git repository. What if you want to get rid of files that haven't been committed yet? For example, you copied files manually from one computer to another (e.g. from an SCF machine to yours via scp), but then they were added to git on the SCF machine. You won't be able to pull unless you either a) add and commit your local versions of the files to the repository or b) delete your local versions. Committing your versions of the files in this setting will be a headache (git doesn't nicely merge pdf files), but if you delete versioned files you will delete them on the repository too, so you can't just delete everything in the directory (unless you are ready to re-clone the repository) -- you need to remove just those files not already on the repository:
git clean -f -d
Be very careful, because there's no undoing these (-f means force and -d means remove directories too).
- Undo changes on committed files (see also this tutorial):
It's likely you want to get an old version of a single file to get something you lost, etc. If you want to look at a specific file from an old commit,
git checkout XXX path/to/file
This is different from checking out an entire commit, because now you've pulled just the old version of that file into the place of the current version of the file.
A simple thing to do would be to copy this old file to a new (unversioned) file name, and then get your current version back via
git checkout master path/to/file
Then you can compare the two (e.g. using difftool) and make edits to the current version from the old version.
If you truly want to revert to the old file (or entirely to an old commit) rather than manually pulling in information from an old file, you probably want to use git revert. See this tutorial for the difference between checkout, revert, reset, and rebase in undoing changes.
- Removing file from git without deleting local copy
git rm --cached myfile
Note that this keeps your local copy intact, but other users, once they pull your changes, will see their copy of the file deleted (see this stackoverflow question).
Other useful tips
- Aliases: this tutorial describes how to do this. It also has a collection of useful aliases
For example git config --global alias.hist 'log --pretty=format:"%h %ad | %s%d [%an]" --graph --date=short' creates a (global) git command on your machine (call with 'git hist') that makes a nicely formatted output of the log history.
- To always commit .R, .m, and .Rout files (always executed at the top of the directory, so it gets everything), see the following link.
For example, create a 'commitx' command: git config alias.commitx '!git add *.Rout && git add *.m && git add *.R && git commit -a -m'
- .gitignore files: Make good use of the .gitignore file so that your 'git status' command doesn't show a lot of directories or files that you don't want, and so you don't dump annoying files onto the rest of the world when you do git add. There is a public repository of ignore files for common scenarios.
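For example, for a LaTeX/knitr project a .gitignore might contain patterns like the following (adjust to your own file layout):
*.aux
*.log
*.out
*.synctex.gz
.Rhistory
.RData
*_cache/
*_files/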
Knitr and RMarkdown
Note that the 'knit' button in RStudio and knit2html do not work exactly the same. RStudio runs the document in a vanilla environment, while knit2html by default uses the global environment and also leaves the output objects in your environment (though you can change that, see below); this can be handy for debugging, but also means you won't catch problems in your code that your global variables are masking. Also, RStudio will stop compiling if there are errors, while knit2html will compile and create an html file with the errors posted in it (again, you can change this). Both behaviors can be useful.
Further, current versions of RMarkdown created in RStudio may need to be compiled with `render` rather than `knit2html` if you are working at the command line.
- Set global options at the top of the document so you can, for example, change from echo=FALSE to echo=TRUE in one blow:
knitr::opts_chunk$set(fig.align="center", cache=TRUE, cache.path = "filename_cache/", message=FALSE,
echo=FALSE, results="hide", fig.path="filename_figure/")
- Give your chunks names without spaces (knitr allows spaces, but then your figure files have spaces in their names, which is a pain).
- Use fig.width and fig.height in chunks so that your figures are well spaced, especially if you set par(mfrow= ), so you don't get squished plots. You can create an object that defines these, so you can change all of them at once.
figWidth2Col<-12
figHeight2Col<-6
```{r MyChunk,fig.width=figWidth2Col,fig.height=figHeight2Col}
par(mfrow=c(1,2))
#code
```
- To give numbering to your figures in .html output, see the function capFig donated by a contributor to this thread.
- To run knitr at the unix/terminal command line, rather than within R (like R CMD Sweave), use Rscript
Rscript --vanilla -e "library(knitr); knit2html('test.Rmd');"
Or, if you're using yaml headers, like in RStudio, you need
Rscript --vanilla -e "library(rmarkdown); render('test.Rmd');"
If you get the error
Error: pandoc version 1.12.3 or higher is required and was not found.
see this link for how to set the right location for pandoc when you run render outside of RStudio.
- To run knit2html in a new environment (i.e. without using variables already in your session), set 'envir=new.env()'. I think this is what RStudio does (RStudio may also detach the libraries).
Alternatively, running it in the global environment is a good way to 'load' everything into your workspace so you can play around with it (especially true for cached chunks).
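For example, within R:
knit2html('test.Rmd', envir=new.env())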
- To have it stop when you hit an error (like the 'knit' button in RStudio), set `error=FALSE` in the chunk options (or globally at the top).
- Caching
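One useful trick, sketched here with illustrative file and chunk names: make a cached chunk depend on the time stamp of an external file via the cache.extra chunk option, so the cache is rebuilt whenever that file changes.
```{r loadSummary, cache=TRUE, cache.extra=file.info("simSummary.txt")$mtime}
simRes <- read.table("simSummary.txt", header=TRUE)
```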
- Organizing multiple files
Generally you will not have a giant file with everything you have done, but it is important to be able to keep a record of how to stitch these together.
- Source: You can of course just source a .R file in a knitr chunk. If you do this in a cached chunk, you need to make sure the cache depends on a time stamp of that file (see the caching example above).
- External Chunk definitions: A disadvantage of the above is that you can't annotate your steps so that you can go back and make some sense of them later, other than with comments. However, knitr allows you to pull out chunks of code from a .R file and use them in a .Rmd/.Rnw file using the `read_chunk` command (http://yihui.name/knitr/demo/externalization/).
Your R code would look like this:
## ---- MyFirstChunk
...
## ---- MySecondChunk
...
Then you would pull in the chunks like this
```{r readR, cache=FALSE}
knitr::read_chunk('myCode.R')
```
```{r MyFirstChunk, cache=FALSE}
```
```{r MySecondChunk, cache=TRUE, dependson="MyFirstChunk", fig.width=12}
```
This has many nice uses. You can reuse chunks in multiple files; you can source your .R file on its own when you don't want to annotate it (e.g. one .Rmd file gives you a reference as to what the preprocessing steps were, and another more polished version just calls it without comment); you can have more code in your .R file than you reference in your .Rmd file.
- Spin: The above has the downside that it removes your annotation and notes about what you did from the actual code.
I haven't tried it yet, but there are other options for making inline comments (roxygen-style, #') that get picked up by the function spin() (see http://yihui.name/knitr/demo/stitch/). You can put the text and the chunk definitions in a .R file (with appropriate commenting) so that if you 'spin' it, it will convert into text and chunk definitions, but otherwise it reads as regular comments.
- You can also have child .Rmd files that are read in by a parent file (http://yihui.name/knitr/demo/child/). This reads in everything, including the annotation, so it's really more for chapters, supplementary text, etc., where a single .Rmd/.Rnw file is getting too unwieldy. Another example would be if you want to make a similar kind of report over and over again with different input data.
For automating reports over a template, look at the following useful tips
- Looping over .Rmd files: calls rmarkdown::render in a for loop, so that the .Rmd files make use of variables defined in the global environment (an easy way to set arguments for a .Rmd file); see the sketch after this list.
- Function run_chunk for sourcing the chunks from a .R file: this donated function gives the ability to selectively read in chunks defined in a .R file (like the external chunks used in a .Rmd file above), but from an R session or an R script.
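As a sketch of the looping approach (file and variable names are illustrative; the template .Rmd reads dataName from the global environment):
library(rmarkdown)
for (dataName in c("datasetA", "datasetB")) {
  render("reportTemplate.Rmd", output_file = paste0("report_", dataName, ".html"),
         envir = globalenv())
}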
R Tutorials and Help
General Useful Information
Writing A Package: I strongly suggest you look at the tools available in `devtools` and RStudio and that you use Roxygen for writing your documentation files.
Coding Tips
Specific Teaching Labs (created for various courses)
Unix
You will have to do a lot of work on the servers and will need to figure out unix commands. A few useful things to learn
screen: This allows you to have multiple 'windows' on unix. If you are running R interactively, you probably want to do it via screen, because it allows you to exit the R session, go back to the unix terminal, and then re-enter R -- and from any connection. You can log off at the department, log back on at home, and still re-enter your R session. However, the rule above still applies: don't keep an R session running longer than a day without a really good reason.
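The basic cycle is:
screen          (start a new session; then work as usual, e.g. start R)
Ctrl-a d        (detach, leaving everything running)
screen -r       (reattach later, possibly from a different login)
screen -ls lists your sessions if you have more than one.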
makefiles: This is very useful for automating a series of steps. For example, I always create a makefile that will compile my knitr file or latex file. For extensive simulations or processing of data, especially ones with dependencies (if file X updates, you need to rerun command A, but otherwise don't), you can code the dependencies into the makefile, and make will only rerun what is needed. Knitr caching does this to some degree, but I find it much trickier -- I always wind up 'forcing' it to rerun something or being frustrated that it reran something I didn't think had changed.
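For example, a minimal sketch of a makefile for compiling a knitr report (file names are illustrative; the command line must be indented with a tab):
report.html: report.Rmd
	Rscript --vanilla -e "library(rmarkdown); render('report.Rmd')"
Then typing make rebuilds report.html only when report.Rmd has changed.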