Introduction to Concepts in Computing with Data

1 What's this course about?

The goal of this course is to introduce you to a variety of concepts and methods that are useful when dealing with data. You can think of data as any information that could potentially be studied to learn more about something.

Some simple examples:

Sales records for a company
Won/Lost records for a sports team
Web log listings
Email messages
Demographic Information found on the web

What may be surprising is that some of the data (for example web log listings or email) consists of text, not numbers. It's increasingly important to be able to deal with text in order to look at the wide variety of information that is available.

In most statistics courses, a lot of time is spent working on the preliminaries (formulas, algorithms, statistical concepts) in order to prepare the student for the interesting part of statistics, which is studying data for a problem that matters to you. Unfortunately, by the time these preliminaries are covered, many students are bored or frustrated, and they leave the course with a distorted view of statistics. In this class, we're going to do things differently. We will concentrate on:

Computer Languages - which will let us read in and manipulate data.
Graphical Techniques - which will allow us to display data in a way that makes it easier to understand and easier to make decisions based on the data.
Technologies - so that we can present these techniques to someone who's not as knowledgeable as us without too much misery.

The main computer language that we will be using in the course is a statistical programming environment known as R (http://r-project.org). R is freely downloadable and copyable, and you are strongly encouraged to install R on your own computer for use in this course and beyond. We will use this language for data acquisition, data manipulation, and producing graphical output. R is not the ideal language for all of the tasks we're going to do, but in the interest of efficiency, we'll try to use it for most things, and point you in the direction of other languages that you might want to explore sometime in the future.

Another tool that we'll use is the set of UNIX shell commands, which allow you to store, copy and otherwise manipulate files that contain data, documents or programs. The computer accounts for this course will allow you to access computers running a version of the UNIX operating system.

More and more commonly, data is stored on database servers, not in ordinary files. Most database servers use some version of a language known as the Structured Query Language (SQL). We'll use the open-source MySQL database as an example of an SQL database.

While you shouldn't have any problems understanding any of the statistical techniques we study in this class, a certain level of complexity arises when various techniques are combined and presented as a start-to-finish solution. To learn how we can make the techniques we've developed available to others, we'll look at graphical user interfaces (GUIs) as well as at using web servers to get information from users and display results.

2 Some Basic Concepts in Computing with Data

Most real-life projects that involve data can be broken down into several steps:

Data Acquisition - we need to find (or collect) the data, and get some representation of it into the computer
Data Cleaning - Inevitably, there will be errors in the data, because values were entered incorrectly, because we misunderstood the nature of the data, or because records were duplicated or omitted. Many times data is presented in a form meant for viewing, and extracting it in some other form becomes a challenge.
Data Organization - Depending on what you want to do, you may need to reorganize your data. This is especially true when you need to produce graphical representations of the data. Naturally, we need the appropriate tools to do these tasks.
Data Modeling and Presentation - We may fit a statistical model to our data, or we may just produce a graph that shows what we think is important. Often, a variety of models or graphs needs to be considered. It's important to know what techniques are available and whether they are accepted within a particular user community.
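
As a rough illustration, the four steps above might look like this in R on a tiny, made-up set of sales records. The data, the column names (id, region, amount) and the choice of a bar chart are all assumptions for the sake of the sketch; a real project would read its data from a file, a web page or a database.

```r
# Data acquisition: read made-up sales records from inline text
# (hypothetical data; a real project would read a file or a web page).
raw <- "id,region,amount
1,North,120
2,South,95
2,South,95
3,north,NA
4,South,210
5,North,80"
sales <- read.csv(text = raw, stringsAsFactors = FALSE)

# Data cleaning: drop the duplicated record and the record with a
# missing amount, and make the region labels consistent.
sales <- sales[!duplicated(sales), ]
sales <- sales[!is.na(sales$amount), ]
sales$region <- tolower(sales$region)

# Data organization: total sales for each region.
totals <- tapply(sales$amount, sales$region, sum)

# Data modeling and presentation: a simple bar chart of the totals.
barplot(totals, main = "Total sales by region", ylab = "Amount")
```

Each step here is a single line or two only because the data is so small; with real data, the cleaning and organization steps usually take most of the effort.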

Further, techniques such as simulation and resampling have been developed to allow us to get information even in cases where we don't have exactly the data that we'd like to have.
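
To give a flavor of resampling, here is a minimal sketch of one common form of it (often called the bootstrap) in R. The ten measurements are made up, and the choice of 1000 resamples is arbitrary; the idea is only to show how repeatedly resampling the data we do have can tell us how much an estimate such as the mean might vary.

```r
set.seed(1)  # make the random draws repeatable

# Made-up sample of 10 measurements.
x <- c(12, 15, 9, 20, 14, 11, 17, 13, 16, 10)

# Resampling: draw 1000 samples of the same size from x, with
# replacement, and record the mean of each one.
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))

# The spread of the resampled means gives a rough idea of how
# uncertain the mean of x is.
sd(boot_means)
quantile(boot_means, c(0.025, 0.975))
```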

3 A Short Note on Academic Integrity

One of the main ways you are going to learn in this course is by discussing the material in class and in lab sections, and I feel that this is one of the most important aspects of the course. However, the programs you write are like an essay or term paper, and you should not share them with others, or ask others to give their programs to you.

The University has a very detailed code of student conduct, available at http://students.berkeley.edu/uga/conductiii-vii.asp. Please refer to that document or talk to me if there are any questions.

4 Introduction to R

Here are some basic concepts about R that we'll return to later in more detail:

R can be used as a calculator. Any statements that you type at the R prompt will be executed, and the answer printed:
```
> 12 + 9
[1] 21
> 17.1* 13
[1] 222.3
> 554 /3
[1] 184.6667
```
Notice that spaces can be placed wherever you'd like, as long as they're not in the middle of a number.
R is just as happy to work with vectors (more than one value) as it is with scalars (single values). The function to create a vector from individual values is c():
```
> c(10,12,19) + c(8,5,9)
[1] 18 17 28
> c(1,2,3) * c(3,2,1)
[1] 3 4 3
```
When you use operators like * (multiplication) or + (addition), R does the operations element by element.
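Here is a slightly larger sketch of the same idea, written as a script rather than typed at the prompt. The variable names (prices, quantity) are made up for illustration. It also shows that a single value combined with a vector is reused for every element.

```r
prices   <- c(10, 12, 19)
quantity <- c(8, 5, 9)

# Element by element: 10*8, 12*5, 19*9
revenue <- prices * quantity
print(revenue)

# A single value is reused for every element: 10*2, 12*2, 19*2
doubled <- prices * 2
print(doubled)

# Functions such as sum() collapse a vector back into a single value.
total <- sum(revenue)
print(total)
```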
There are three ways to get help in R:
1. The help() command, which can be abbreviated by ?, will open a help page in a browser with information about a particular command. For example, to get help on the c function mentioned above, you could type:
```
> help(c)
```
  or
```
> ?c
```
2. The help.search() function, which can be abbreviated by ??, will show other functions already installed in your version of R that relate to a particular topic. For example, to find functions that combine things in R, you could type:
```
> ??combine
```
  help.search() only looks for functions already installed on your computer.
3. The RSiteSearch() command will open a browser to a searchable database of questions and answers posted on the R-help mailing list. (See http://www.r-project.org/mail.html for more information on the R-help mailing list.)
When you type the name of an object into R, it will display that object. This can be frustrating when the object is a function and you actually want to run it. For example, if you type q at the R prompt, you'll see:
```
> q
function (save = "default", status = 0, runLast = TRUE) 
.Internal(quit(save, status, runLast))
<environment: namespace:base>
```
To actually execute the q command, type
```
> q()
```
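The same distinction applies to functions you write yourself. In this small sketch, double_it is a made-up function used only for illustration, so that nothing here actually quits R:

```r
# A made-up function for illustration.
double_it <- function(x) 2 * x

# Typing the name alone displays the function; nothing is computed.
double_it

# Adding parentheses (and an argument) actually runs it.
result <- double_it(21)
print(result)

# Built-in commands such as q are ordinary function objects too.
is.function(q)
```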

File translated from TeX by TtH, version 3.67, on 19 Jan 2011, 15:34.