R bootcamp, Module 9: Workflows, managing projects and good practices

August 2013, UC Berkeley

Jarrod Millman, Chris Paciorek, and Fernando Pérez

How not to make a mess of it

Programmers are always surrounded by complexity; we cannot avoid it. Our applications are complex because we are ambitious to use our computers in ever more sophisticated ways. Programming is complex because of the large number of conflicting objectives for each of our programming projects. *If our basic tool, the language in which we design and code our programs, is also complicated, the language itself becomes part of the problem rather than part of its solution.

At first I hoped that such a technically unsound project would collapse but I soon realized it was doomed to success. Almost anything in software can be implemented, sold, and even used given enough determination. There is nothing a mere scientist can say that will stand against the flood of a hundred million dollars. But there is one quality that cannot be purchased in this way — and that is reliability. The price of reliability is the pursuit of the utmost simplicity. It is a price which the very rich find most hard to pay.

--- C.A.R. Hoare - The Emperor's Old Clothes - Turing Award Lecture (1980)

Note: remember to start recording.

Reproducible research

An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.

--- Jonathan Buckheit and David Donoho, WaveLab and Reproducible Research (1995)

Style guide

Try to avoid using the names of standard R functions for your objects, but R will generally be fairly smart about things.

c <- 7
c(3, 5)
## [1] 3 5
c
## [1] 7
rm(c)
c
## function (..., recursive = FALSE)  .Primitive("c")

Try to use upperCamelCase or underscores when naming your objects (and don't mix and match). Leave periods for use in the context of S3 objects.

precipExtremes <- rnorm(5)  # precip_extremes
timeInDays <- rnorm(5)  # time_in_days
summary.lm(lm(precipExtremes ~ timeInDays))
## 
## Call:
## lm(formula = precipExtremes ~ timeInDays)
## 
## Residuals:
##       1       2       3       4       5 
##  0.0829 -0.5116  1.1812 -0.3451 -0.4073 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    0.319      0.448    0.71     0.53
## timeInDays    -1.226      0.899   -1.36     0.27
## 
## Residual standard error: 0.806 on 3 degrees of freedom
## Multiple R-squared:  0.382,  Adjusted R-squared:  0.177 
## F-statistic: 1.86 on 1 and 3 DF,  p-value: 0.266

Debugging

As a scripting language, R essentially has a debugger working automatically. But there is an official debugger and other tools that greatly help in figuring out problems.

Let's briefly see these in action. I'll demo with a set of nested functions in which one has an error:

fun1 <- function(x) {
    x <- x * 2
    fun2(x)
}
fun2 <- function(x) {
    fun3(x)
}
fun3 <- function(x) {
    apply(x, 1, mean)
}
  1. We can use debug() to step through a function line by line

  2. After an error occurs, we can use traceback() to look at the call stack

  3. More helpfully, if we set options(error = recover) before running code, we can go into the function in which the error occurred

  4. We can insert browser() inside a function and R will stop there and allow us to proceed with debugging statements

  5. You can temporarily insert code into a function (including built-in functions) with trace(fxnName, edit = TRUE)

Testing

Testing should be performed on multiple levels and begun as early as possible in the development process. For programs that accept input either from a user or file, it is important that the code validates the input is what it expects to receive. Tests that ensure individual code elements (e.g., functions, classes, and class methods) behave correctly are called . Writing unit tests early in the process of implementing new functionality helps you think about what you want a piece of code to do, rather than just how it does it. This practice improves code quality by focusing your attention on use cases rather than getting lost in implementation details.

RUnit is a testing framework for R, which helps automate test setup, creation, execution, and reporting. For more information, see Bioconductor's unit testing guidelines.

Timing your code

premature optimization is the root of all evil

--- Donald Knuth, 1974

There are a few tools in R for timing your code.

system.time(mean(rnorm(1e+07)))
##    user  system elapsed 
##   0.852   0.016   0.869

library(rbenchmark)
x <- rnorm(1e+07)
benchmark(ifelse(x < 0, x, 0), x[x < 0] <- 0, replications = 5, columns = c("replications", 
    "elapsed"))
##   replications elapsed
## 1            5  10.132
## 2            5   1.252

Timing your code

For more advanced assessment of bottlenecks in your code, consider Rprof().

fun1 <- function(x) {
  res <- NULL
  n <- nrow(x)
  for (i in 1:n) if (!any(is.na(x[i,])))
    res <- rbind(res,x[i,])
  res
}

fun2 <- function(x) {
  n <- nrow(x)
  res <- matrix(0,n,ncol(x))
  k <- 1
  for (i in 1:n)
    if (!any(is.na(x[i,]))) {
      res[k,] <- x[i,]
      k <- k+1
    }
  res[1:(k-1),]
}

For a suitable x the following code snippet will create file called profile1.out with collected timing results.

Rprof("profile1.out", line.profiling=TRUE)
print(unix.time(fun1(x)))
Rprof(NULL)

Memory use

You should know how much memory (RAM) the computer you are using has and keep in mind how big your objects are and how much memory you code might use. All objects in R are stored in RAM unlike, e.g., SAS or a database.

If in total, the jobs on a machine approach the physical RAM, the machine will start to use the hard disk as 'virtual memory'. This is called paging or swapping, and once this happens you're often toast.

You can assess memory use with top or ps in Linux/Mac or the Task Manager in Windows.

x <- rnorm(1e+07)
object.size(x)
## 80000040 bytes
1e+07 * 8/1e+06
## [1] 80
print(object.size(x), units = "auto")
## 76.3 Mb
x <- rnorm(1e+08)
gc()
##             used  (Mb) gc trigger  (Mb)  max used  (Mb)
## Ncells    175060   9.4     407500  21.8    342473  18.3
## Vcells 100535290 767.1  122202557 932.4 110541952 843.4
rm()
gc()
##             used  (Mb) gc trigger  (Mb)  max used  (Mb)
## Ncells    175082   9.4     407500  21.8    342473  18.3
## Vcells 100535331 767.1  128392684 979.6 110541952 843.4

Some ideas about reproducibility

Scripting

If you use a good editor (such as RStudio's built-in editor, emacs with ESS, Aquamacs, it's easier to write and understand your code.

You can then run blocks of code within RStudio easily.

To run all the code in an entire file, do source('myCodeFile.R').

To run code as a background job in a Linux/Mac context: R CMD BATCH --no-save myCodeFile.R myOutput.Rout &

Then you don't need to leave RStudio or R or a terminal window open. Your job will run by itself. If you are logged into a remote machine, you can log out and come back later.

IMPORTANT: make sure to write any needed output to a file (e.g. .Rda files, CSV files, text output from print() or cat()).

The UNIX command line

Being able to operate your computer from a command line rather than via point-and-click in a GUI environment provides incredible power and flexibility to the user, as well as allowing for reproducibility and automation in your work.

In Linux and Mac, you just need to open a Terminal window. On the Mac this is in Applications->Utilities->Terminal. Once in the Terminal, you can treat your Mac as if it is a UNIX machine (which it is).

Windows has the old MS-DOS command line that allows some command line functionality. There is something call cygwin that provides UNIX-like functionality.

The UNIX Philosophy

Write programs that do one thing and do it well.

Write programs to work together.

Write programs to handle text streams, because that is a universal interface.

--- Doug McIlroy

For more details, please see:

http://en.wikipedia.org/wiki/Unix_philosophy

Version control

The basic idea is that instead of manually trying to keep track of what changes you've made to code, data, documents, you use software to help you manage the process. This has several benefits:

At a basic level, a simple principle is to have version numbers for all your work: code, datasets, manuscripts. Whenever you make a change to a dataset, increment the version number. For code and manuscripts, increment when you make substantial changes or have obvious breakpoints in your workflow.

Version control architectures

Git concepts: commit

a snapshot of work at a point in time

Credit: ProGit book, by Scott Chacon, CC License.

Credit: ProGit book, by Scott Chacon, CC License.

Git concepts: repository

a group of linked commits (DAG)

Credit: ProGit book, by Scott Chacon, CC License.

Credit: ProGit book, by Scott Chacon, CC License.

Git concepts: hash

toy "implementation"

library("digest")

# first commit
data1 <- "This is the start of my paper2."
meta1 <- "date: 8/20/13"
hash1 <- digest(c(data1, meta1), algo = "sha1")
cat("Hash:", hash1)
## Hash: 32cff5042299244c8c248f6804bb9756912e5492

# second commit, linked to the first
data2 <- "Some more text in my paper..."
meta2 <- "date: 8/20/13"
# Note we add the parent hash here!
hash2 <- digest(c(data2, meta2, hash1), algo = "sha1")
cat("Hash:", hash2)
## Hash: 43925778805d33de28aba5697a967f04613f9153

Stage 1: Local, single-user, linear workflow

Simply type git (or git help) to see a full list of all the 'core' commands. We'll now go through most of these via small practical exercises:

git help
## usage: git [--version] [--exec-path[=<path>]] [--html-path] [--man-path] [--info-path]
##            [-p|--paginate|--no-pager] [--no-replace-objects] [--bare]
##            [--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>]
##            [-c name=value] [--help]
##            <command> [<args>]
## 
## The most commonly used git commands are:
##    add        Add file contents to the index
##    bisect     Find by binary search the change that introduced a bug
##    branch     List, create, or delete branches
##    checkout   Checkout a branch or paths to the working tree
##    clone      Clone a repository into a new directory
##    commit     Record changes to the repository
##    diff       Show changes between commits, commit and working tree, etc
##    fetch      Download objects and refs from another repository
##    grep       Print lines matching a pattern
##    init       Create an empty git repository or reinitialize an existing one
##    log        Show commit logs
##    merge      Join two or more development histories together
##    mv         Move or rename a file, a directory, or a symlink
##    pull       Fetch from and merge with another repository or a local branch
##    push       Update remote refs along with associated objects
##    rebase     Forward-port local commits to the updated upstream head
##    reset      Reset current HEAD to the specified state
##    rm         Remove files from the working tree and from the index
##    show       Show various types of objects
##    status     Show the working tree status
##    tag        Create, list, delete or verify a tag object signed with GPG
## 
## See 'git help <command>' for more information on a specific command.

git init: create an empty repository

cd ~/src
rm -rf jm-git-demo
git init jm-git-demo
## Initialized empty Git repository in /accounts/gen/vis/paciorek/src/jm-git-demo/.git/

Let's look at what git did:

cd ~/src/jm-git-demo
ls -la
## total 3
## drwxr-sr-x 3 paciorek scfstaff  3 Aug 26 12:21 .
## drwxr-sr-x 3 paciorek scfstaff  3 Aug 26 12:21 ..
## drwxr-sr-x 7 paciorek scfstaff 10 Aug 26  2013 .git
cd ~/src/jm-git-demo
ls -l .git
## total 4
## drwxr-sr-x 2 paciorek scfstaff  2 Aug 26 12:21 branches
## -rw-r--r-- 1 paciorek scfstaff 92 Aug 26 12:21 config
## -rw-r--r-- 1 paciorek scfstaff 73 Aug 26 12:21 description
## -rw-r--r-- 1 paciorek scfstaff 23 Aug 26 12:21 HEAD
## drwxr-sr-x 2 paciorek scfstaff 10 Aug 26 12:21 hooks
## drwxr-sr-x 2 paciorek scfstaff  3 Aug 26 12:21 info
## drwxr-sr-x 4 paciorek scfstaff  4 Aug 26  2013 objects
## drwxr-sr-x 4 paciorek scfstaff  4 Aug 26 12:21 refs

git add: adding content to the repository

Now let's edit our first file in the test directory with a text editor... I'm doing it programatically here for automation purposes, but you'd normally be editing by hand

cd ~/src/jm-git-demo
echo "My first bit of text" > file1.txt

Now we can tell git about this new file using the add command:

cd ~/src/jm-git-demo
git add file1.txt

We can now ask git about what happened with status:

cd ~/src/jm-git-demo
git status
## # On branch master
## #
## # Initial commit
## #
## # Changes to be committed:
## #   (use "git rm --cached <file>..." to unstage)
## #
## #    new file:   file1.txt
## #

git commit: permanently record our changes in git's database

For now, we are always going to call git commit either with the -a option or with specific filenames (git commit file1 file2...). This delays the discussion of an aspect of git called the index (often referred to also as the 'staging area') that we will cover later. Most everyday work in regular scientific practice doesn't require understanding the extra moving parts that the index involves, so on a first round we'll bypass it. Later on we will discuss how to use it to achieve more fine-grained control of what and how git records our actions.

cd ~/src/jm-git-demo
git commit -a -m"This is our first commit"
## [master (root-commit) 4bd7f6e] This is our first commit
##  Committer: Christopher Paciorek <paciorek@scf.Berkeley.EDU>
## Your name and email address were configured automatically based
## on your username and hostname. Please check that they are accurate.
## You can suppress this message by setting them explicitly:
## 
##     git config --global user.name "Your Name"
##     git config --global user.email you@example.com
## 
## After doing this, you may fix the identity used for this commit with:
## 
##     git commit --amend --reset-author
## 
##  1 file changed, 1 insertion(+)
##  create mode 100644 file1.txt

In the commit above, we used the -m flag to specify a message at the command line. If we don't do that, git will open the editor we specified in our configuration above and require that we enter a message. By default, git refuses to record changes that don't have a message to go along with them (though you can obviously 'cheat' by using an empty or meaningless string: git only tries to facilitate best practices, it's not your nanny).

git log: what has been committed so far

cd ~/src/jm-git-demo
git log
## commit 4bd7f6e289e22426a21b3841a2782cef53e13d5c
## Author: Christopher Paciorek <paciorek@scf.Berkeley.EDU>
## Date:   Mon Aug 26 12:21:29 2013 -0700
## 
##     This is our first commit

git diff: what have I changed?

Let's do a little bit more work... Again, in practice you'll be editing the files by hand, here we do it via shell commands for the sake of automation (and therefore the reproducibility of this tutorial!)

cd ~/src/jm-git-demo
echo "And now some more text..." >> file1.txt

And now we can ask git what is different:

cd ~/src/jm-git-demo
git diff
## diff --git a/file1.txt b/file1.txt
## index ce645c7..4baa979 100644
## --- a/file1.txt
## +++ b/file1.txt
## @@ -1 +1,2 @@
##  My first bit of text
## +And now some more text...

The cycle of git virtue: work, commit, work, commit, ...

cd ~/src/jm-git-demo
git commit -a -m"I have made great progress on this critical matter."
## [master ceb2b2b] I have made great progress on this critical matter.
##  Committer: Christopher Paciorek <paciorek@scf.Berkeley.EDU>
## Your name and email address were configured automatically based
## on your username and hostname. Please check that they are accurate.
## You can suppress this message by setting them explicitly:
## 
##     git config --global user.name "Your Name"
##     git config --global user.email you@example.com
## 
## After doing this, you may fix the identity used for this commit with:
## 
##     git commit --amend --reset-author
## 
##  1 file changed, 1 insertion(+)

First, let's see what the log shows us now:

cd ~/src/jm-git-demo
git log
## commit ceb2b2b5d633acd7d7ce2e0ae60ed0773b978dde
## Author: Christopher Paciorek <paciorek@scf.Berkeley.EDU>
## Date:   Mon Aug 26 12:21:29 2013 -0700
## 
##     I have made great progress on this critical matter.
## 
## commit 4bd7f6e289e22426a21b3841a2782cef53e13d5c
## Author: Christopher Paciorek <paciorek@scf.Berkeley.EDU>
## Date:   Mon Aug 26 12:21:29 2013 -0700
## 
##     This is our first commit

Sometimes it's handy to see a very summarized version of the log:

cd ~/src/jm-git-demo
git log --oneline --topo-order --graph
## * ceb2b2b I have made great progress on this critical matter.
## * 4bd7f6e This is our first commit

Git supports aliases: new names given to command combinations. Let's make this handy shortlog an alias, so we only have to type git slog and see this compact log:

cd ~/src/jm-git-demo
# We create our alias (this saves it in git's permanent configuration file):
git config --global alias.slog "log --oneline --topo-order --graph"
# And now we can use it
git slog
## * ceb2b2b I have made great progress on this critical matter.
## * 4bd7f6e This is our first commit

git mv and rm: moving and removing files

While git add is used to add fils to the list git tracks, we must also tell it if we want their names to change or for it to stop tracking them. In familiar Unix fashion, the mv and rm git commands do precisely this:

cd ~/src/jm-git-demo
git mv file1.txt file-newname.txt
git status
## # On branch master
## # Changes to be committed:
## #   (use "git reset HEAD <file>..." to unstage)
## #
## #    renamed:    file1.txt -> file-newname.txt
## #

Note that these changes must be committed too, to become permanent! In git's world, until something hasn't been committed, it isn't permanently recorded anywhere.

cd ~/src/jm-git-demo
git commit -a -m"I like this new name better"
echo "Let's look at the log again:"
git slog
## [master 020cdc5] I like this new name better
##  Committer: Christopher Paciorek <paciorek@scf.Berkeley.EDU>
## Your name and email address were configured automatically based
## on your username and hostname. Please check that they are accurate.
## You can suppress this message by setting them explicitly:
## 
##     git config --global user.name "Your Name"
##     git config --global user.email you@example.com
## 
## After doing this, you may fix the identity used for this commit with:
## 
##     git commit --amend --reset-author
## 
##  1 file changed, 0 insertions(+), 0 deletions(-)
##  rename file1.txt => file-newname.txt (100%)
## Let's look at the log again:
## * 020cdc5 I like this new name better
## * ceb2b2b I have made great progress on this critical matter.
## * 4bd7f6e This is our first commit

And git rm works in a similar fashion.

Optional: Exercise

Add a new file file2.txt, commit it, make some changes to it, commit them again, and then remove it (and don't forget to commit this last step!).

Local user, branching: the concept

What is a branch? Simply a label for the 'current' commit in a sequence of ongoing commits:

There can be multiple branches alive at any point in time; the working directory is the state of a special pointer called HEAD. In this example there are two branches, master and testing, and testing is the currently active branch since it's what HEAD points to:

Once new commits are made on a branch, HEAD and the branch label move with the new commits:

This allows the history of both branches to diverge:

But based on this graph structure, git can compute the necessary information to merge the divergent branches back and continue with a unified line of development:

Local user, branching: an example

Let's now illustrate all of this with a concrete example. Let's get our bearings first:

cd ~/src/jm-git-demo
git status
ls
## # On branch master
## nothing to commit (working directory clean)
## file-newname.txt

We are now going to try two different routes of development: on the master branch we will add one file and on the experiment branch, which we will create, we will add a different one. We will then merge the experimental branch into master.

cd ~/src/jm-git-demo
git branch experiment
git checkout experiment
cd ~/src/jm-git-demo
echo "Some crazy idea" > experiment.txt
git add experiment.txt
git commit -a -m"Trying something new"
git slog
## [experiment c5659f7] Trying something new
##  Committer: Christopher Paciorek <paciorek@scf.Berkeley.EDU>
## Your name and email address were configured automatically based
## on your username and hostname. Please check that they are accurate.
## You can suppress this message by setting them explicitly:
## 
##     git config --global user.name "Your Name"
##     git config --global user.email you@example.com
## 
## After doing this, you may fix the identity used for this commit with:
## 
##     git commit --amend --reset-author
## 
##  1 file changed, 1 insertion(+)
##  create mode 100644 experiment.txt
## * c5659f7 Trying something new
## * 020cdc5 I like this new name better
## * ceb2b2b I have made great progress on this critical matter.
## * 4bd7f6e This is our first commit
cd ~/src/jm-git-demo
git checkout master
git slog
## * 020cdc5 I like this new name better
## * ceb2b2b I have made great progress on this critical matter.
## * 4bd7f6e This is our first commit
cd ~/src/jm-git-demo
echo "All the while, more work goes on in master..." >> file-newname.txt
git commit -a -m"The mainline keeps moving"
git slog
## [master 224400a] The mainline keeps moving
##  Committer: Christopher Paciorek <paciorek@scf.Berkeley.EDU>
## Your name and email address were configured automatically based
## on your username and hostname. Please check that they are accurate.
## You can suppress this message by setting them explicitly:
## 
##     git config --global user.name "Your Name"
##     git config --global user.email you@example.com
## 
## After doing this, you may fix the identity used for this commit with:
## 
##     git commit --amend --reset-author
## 
##  1 file changed, 1 insertion(+)
## * 224400a The mainline keeps moving
## * 020cdc5 I like this new name better
## * ceb2b2b I have made great progress on this critical matter.
## * 4bd7f6e This is our first commit
cd ~/src/jm-git-demo
ls
## file-newname.txt
cd ~/src/jm-git-demo
git merge experiment
git slog
## Merge made by the 'recursive' strategy.
##  experiment.txt |    1 +
##  1 file changed, 1 insertion(+)
##  create mode 100644 experiment.txt
## *   9a3ade5 Merge branch 'experiment'
## |\  
## | * c5659f7 Trying something new
## * | 224400a The mainline keeps moving
## |/  
## * 020cdc5 I like this new name better
## * ceb2b2b I have made great progress on this critical matter.
## * 4bd7f6e This is our first commit

Using remotes as a single user

We are now going to introduce the concept of a remote repository: a pointer to another copy of the repository that lives on a different location. This can be simply a different path on the filesystem or a server on the internet.

For this discussion, we'll be using remotes hosted on the GitHub.com service, but you can equally use other services like BitBucket or Gitorious as well as host your own.

cd ~/src/jm-git-demo
ls
echo "Let's see if we have any remote repositories here:"
git remote -v
## experiment.txt
## file-newname.txt
## Let's see if we have any remote repositories here:

Since the above cell didn't produce any output after the git remote -v call, it means we have no remote repositories configured. We will now proceed to do so. Once logged into GitHub, go to the new repository page and make a repository called test. Do not check the box that says Initialize this repository with a README, since we already have an existing repository here. That option is useful when you're starting first at Github and don't have a repo made already on a local computer.

We can now follow the instructions from the next page:

cd ~/src/jm-git-demo
git remote add origin git@github.com:jarrodmillman/test.git
git push -u origin master

Let's see the remote situation again:

cd ~/src/jm-git-demo
git remote -v
## origin   git@github.com:jarrodmillman/test.git (fetch)
## origin   git@github.com:jarrodmillman/test.git (push)

We can now see this repository publicly on github.

Let's see how this can be useful for backup and syncing work between two different computers. I'll simulate a 2nd computer by working in a different directory...

cd ~/src/
# Here I clone my 'test' repo but with a different name, test2, to simulate a 2nd computer
git clone git@github.com:jarrodmillman/test.git test2
cd test2
pwd
git remote -v
## Cloning into 'test2'...
## /accounts/gen/vis/paciorek/src

Let's now make some changes in one 'computer' and synchronize them on the second.

cd ~/src/test2  # working on computer #2
echo "More new content on my experiment" >> experiment.txt
git commit -a -m"More work, on machine #2"
## [master fef70aa] More work, on machine #2
##  Committer: Christopher Paciorek <paciorek@scf.Berkeley.EDU>
## Your name and email address were configured automatically based
## on your username and hostname. Please check that they are accurate.
## You can suppress this message by setting them explicitly:
## 
##     git config --global user.name "Your Name"
##     git config --global user.email you@example.com
## 
## After doing this, you may fix the identity used for this commit with:
## 
##     git commit --amend --reset-author
## 
##  2 files changed, 15 insertions(+), 4 deletions(-)

Now we put this new work up on the github server so it's available from the internet

cd ~/src/test2  # working on computer #2
git push

Now let's fetch that work from machine #1:

cd ~/src/jm-git-demo
git pull

An important aside: conflict management

While git is very good at merging, if two different branches modify the same file in the same location, it simply can't decide which change should prevail. At that point, human intervention is necessary to make the decision. Git will help you by marking the location in the file that has a problem, but it's up to you to resolve the conflict. Let's see how that works by intentionally creating a conflict.

We start by creating a branch and making a change to our experiment file:

cd ~/src/jm-git-demo
git branch trouble
git checkout trouble
echo "This is going to be a problem..." >> experiment.txt
git commit -a -m"Changes in the trouble branch"
## [trouble f58720b] Changes in the trouble branch
##  Committer: Christopher Paciorek <paciorek@scf.Berkeley.EDU>
## Your name and email address were configured automatically based
## on your username and hostname. Please check that they are accurate.
## You can suppress this message by setting them explicitly:
## 
##     git config --global user.name "Your Name"
##     git config --global user.email you@example.com
## 
## After doing this, you may fix the identity used for this commit with:
## 
##     git commit --amend --reset-author
## 
##  1 file changed, 1 insertion(+)

And now we go back to the master branch, where we change the same file:

cd ~/src/jm-git-demo
git checkout master
echo "More work on the master branch..." >> experiment.txt
git commit -a -m"Mainline work"
## [master e94c75d] Mainline work
##  Committer: Christopher Paciorek <paciorek@scf.Berkeley.EDU>
## Your name and email address were configured automatically based
## on your username and hostname. Please check that they are accurate.
## You can suppress this message by setting them explicitly:
## 
##     git config --global user.name "Your Name"
##     git config --global user.email you@example.com
## 
## After doing this, you may fix the identity used for this commit with:
## 
##     git commit --amend --reset-author
## 
##  1 file changed, 1 insertion(+)

So now let's see what happens if we try to merge the trouble branch into master:

cd ~/src/jm-git-demo
git merge trouble
## Auto-merging experiment.txt
## CONFLICT (content): Merge conflict in experiment.txt
## Automatic merge failed; fix conflicts and then commit the result.

Let's see what git has put into our file:

cd ~/src/jm-git-demo
cat experiment.txt
## Some crazy idea
## <<<<<<< HEAD
## More work on the master branch...
## =======
## This is going to be a problem...
## >>>>>>> trouble

At this point, we go into the file with a text editor, decide which changes to keep, and make a new commit that records our decision. To automate my edits, I use the sed command.

cd ~/src/jm-git-demo
sed -i '/^</d' experiment.txt
sed -i '/^>/d' experiment.txt
sed -i '/^=/d' experiment.txt

I've now made the edits, in this case I decided that both pieces of text were useful, so I just accepted both additions.

cd ~/src/jm-git-demo
cat experiment.txt
## Some crazy idea
## More work on the master branch...
## This is going to be a problem...

Let's then make our new commit:

cd ~/src/jm-git-demo
git commit -a -m"Completed merge of trouble, fixing conflicts along the way"
git slog
## [master 6444c1c] Completed merge of trouble, fixing conflicts along the way
##  Committer: Christopher Paciorek <paciorek@scf.Berkeley.EDU>
## Your name and email address were configured automatically based
## on your username and hostname. Please check that they are accurate.
## You can suppress this message by setting them explicitly:
## 
##     git config --global user.name "Your Name"
##     git config --global user.email you@example.com
## 
## After doing this, you may fix the identity used for this commit with:
## 
##     git commit --amend --reset-author
## 
## *   6444c1c Completed merge of trouble, fixing conflicts along the way
## |\  
## | * f58720b Changes in the trouble branch
## * | e94c75d Mainline work
## |/  
## *   9a3ade5 Merge branch 'experiment'
## |\  
## | * c5659f7 Trying something new
## * | 224400a The mainline keeps moving
## |/  
## * 020cdc5 I like this new name better
## * ceb2b2b I have made great progress on this critical matter.
## * 4bd7f6e This is our first commit

Note: While it's a good idea to understand the basics of fixing merge conflicts by hand, in some cases you may find the use of an automated tool useful. Git supports multiple merge tools: a merge tool is a piece of software that conforms to a basic interface and knows how to merge two files into a new one. Since these are typically graphical tools, there are various to choose from for the different operating systems, and as long as they obey a basic command structure, git can work with any of them.

Collaborating on github with a small team

Single remote with shared access: we are going to set up a shared collaboration with one partner (the person sitting next to you). This will show the basic workflow of collaborating on a project with a small team where everyone has write privileges to the same repository.

Note for SVN users: this is similar to the classic SVN workflow, with the distinction that commit and push are separate steps. SVN, having no local repository, commits directly to the shared central resource, so to a first approximation you can think of svn commit as being synonymous with git commit; git push.

We will have two people, let's call them Alice and Bob, sharing a repository. Alice will be the owner of the repo and she will give Bob write privileges.

We begin with a simple synchronization example, much like we just did above, but now between two people instead of one person. Otherwise it's the same:

Next, we will have both parties make non-conflicting changes each, and commit them locally. Then both try to push their changes:

The problem is that Bob's changes create a commit that conflicts with Alice's, so git refuses to apply them. It forces Bob to first do the merge on his machine, so that if there is a conflict in the merge, Bob deals with the conflict manually (git could try to do the merge on the server, but in that case if there's a conflict, the server repo would be left in a conflicted state without a human to fix things up). The solution is for Bob to first pull the changes (pull in git is really fetch+merge), and then push again.

Learning Git

Code review

knitr

Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.

--- Donald Knuth, Literate programming (1984)

Makefile

Make is a popular tool for process automation. It has a declarative syntax for expressing dependencies between sources and targets and a simple (timestamp-based) mechanism for resolving when dependencies need to be rebuilt.

Here is a simplified example taken from the build system for the slides for this bootcamp:

clean:
        rm -rf *.md *.html


%.md: %.Rmd
        ./make_slides $(basename $(@))

Implementing Reproducible Computational Research

In many of today’s research fields, including biomedicine, computational tools are increasingly being used so that the results can be reproduced. Researchers are now encouraged to incorporate software, data, and code in their academic papers so that others can replicate their research results. Edited by three pioneers in this emerging area, this book is the first one to explore this groundbreaking topic. It presents various computational tools useful in research, including cloud computing, data repositories, virtual machines, R’s Sweave function, XML-based programming, and more. It also discusses legal issues and case studies.

In many of today’s research fields, including biomedicine, computational tools are increasingly being used so that the results can be reproduced. Researchers are now encouraged to incorporate software, data, and code in their academic papers so that others can replicate their research results. Edited by three pioneers in this emerging area, this book is the first one to explore this groundbreaking topic. It presents various computational tools useful in research, including cloud computing, data repositories, virtual machines, R’s Sweave function, XML-based programming, and more. It also discusses legal issues and case studies.

Best practices

arXiv:1210.0530v3 [cs.MS] 29 Nov 2012

arXiv:1210.0530v3 [cs.MS] 29 Nov 2012

Don't just "make the dirt fly!"

Breakout

  1. Configure Git for use at the commandline (see next slide).

  2. Clone practice problem: bash cd <where you want to work> git clone https://github.com/berkeley-scf/r-bootcamp-2013.git

  3. Debug fun3 in r-bootcamp-2013/modules/src/profile-example.R.

  4. Profile and interpret results.

Breakout: configure Git

The minimal amount of configuration for git to work without pestering you is to tell it who you are:

git config --global user.name "Jarrod Millman"
git config --global user.email "millman@berkeley.edu"

And how you will edit text files (it will often ask you to edit messages and other information, and thus wants to know how you like to edit your files):

# Put here your preferred editor. If this is not set, git will honor
# the $EDITOR environment variable
git config --global core.editor /usr/bin/jed  # a lightweight unix editor

# On Windows Notepad will do in a pinch, I recommend Notepad++ as a free alternative
# On the mac, you can set nano or emacs as a basic option

# And while we're at it, we also turn on the use of color, which is very useful
git config --global color.ui "auto"