Introduction to Concepts in Computing with Data

1  What's this course about?

The goal of this course is to introduce you to a variety of concepts and methods that are useful when dealing with data. You can think of data as any information that could potentially be studied to learn more about something.
Some simple examples:
  1. Sales records for a company
  2. Won/Lost records for a sports team
  3. Web log listings
  4. Email messages
  5. Demographic information found on the web
What may be surprising is that some of the data (for example web log listings or email) consists of text, not numbers. It's increasingly important to be able to deal with text in order to look at the wide variety of information that is available.
In most statistics courses, a lot of time is spent working on the preliminaries (formulas, algorithms, statistical concepts) in order to prepare the student for the interesting part of statistics, which is studying data for a problem that matters to you. Unfortunately, by the time these preliminaries are covered, many students are bored or frustrated, and they leave the course with a distorted view of statistics. In this class, we're going to do things differently. We will concentrate on:
  1. Computer Languages - which will let us read in and manipulate data.
  2. Graphical Techniques - which will allow us to display data in a way that makes it easier to understand and easier to make decisions based on the data.
  3. Technologies - such as graphical user interfaces and web servers, which let us present these techniques to people who are less familiar with them, without too much misery.
The main computer language that we will be using in the course is a statistical programming environment known as R (http://r-project.org). R is freely downloadable and copyable, and you are strongly encouraged to install R on your own computer for use in this course and beyond. We will use this language for data acquisition, data manipulation, and producing graphical output. R is not the ideal language for all of the tasks we're going to do, but in the interest of efficiency, we'll try to use it for most things, and point you in the direction of other languages that you might want to explore sometime in the future.
We'll also use UNIX shell commands, which allow you to store, copy and otherwise manipulate files that contain data, documents or programs. The computer accounts for this course will allow you to access computers running a version of the UNIX operating system.
More and more commonly, data is stored on database servers, not in ordinary files. Most database servers use some version of a language known as the Structured Query Language (SQL). We'll use the open-source MySQL database as an example of an SQL database.
While you shouldn't have any problems understanding any of the statistical techniques we study in this class, a certain level of complexity arises when various techniques are combined and presented as a start-to-finish solution. To learn how we can make the techniques we've developed available to others, we'll look at graphical user interfaces (GUIs) as well as using web servers to get information from users and display results.

2  Some Basic Concepts in Computing with Data

Most real-life projects that involve data can be broken down into several steps:
  1. Data Acquisition - We need to find (or collect) the data, and get some representation of it into the computer.
  2. Data Cleaning - Inevitably, there will be errors in the data: values were entered incorrectly, we misunderstood the nature of the data, or records were duplicated or omitted. Often the data is formatted for viewing, and extracting it in some other form becomes a challenge.
  3. Data Organization - Depending on what you want to do, you may need to reorganize your data. This is especially true when you need to produce graphical representations of the data. Naturally, we need the appropriate tools to do these tasks.
  4. Data Modeling and Presentation - We may fit a statistical model to our data, or we may just produce a graph that shows what we think is important. Often, a variety of models or graphs needs to be considered. It's important to know what techniques are available and whether they are accepted within a particular user community.
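The four steps above can be sketched in a few lines of R. This is only an illustration under stated assumptions: the sales records, the column names (region, month, amount) and the "n/a" entry are all invented for the example, and real data will usually need more careful cleaning than this.

```r
# Hypothetical sales records, typed in directly so the example is
# self-contained. In practice the data would come from a file or a server.
raw <- "region,month,amount
East,Jan,120
East,Feb,135
West,Jan,98
West,Jan,98
West,Feb,n/a
East,Mar,150"

# 1. Data acquisition: read the text into a data frame.
sales <- read.csv(text = raw, stringsAsFactors = FALSE)

# 2. Data cleaning: drop exact duplicate records, then convert amount to
#    numbers. The entry "n/a" cannot be converted, so it becomes NA and
#    that record is removed.
sales <- unique(sales)
sales$amount <- suppressWarnings(as.numeric(sales$amount))
sales <- sales[!is.na(sales$amount), ]

# 3. Data organization: reorganize to one row per region with its total.
totals <- aggregate(amount ~ region, data = sales, FUN = sum)

# 4. Presentation: a simple bar chart of the totals.
barplot(totals$amount, names.arg = totals$region, ylab = "Total amount")
```

Printing totals after step 3 shows the reorganized data that the chart is drawn from.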
Further, techniques such as simulation and resampling have been developed to allow us to get information even in cases where we don't have exactly the data that we'd like to have.
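As one illustration of resampling, here is a minimal sketch of the bootstrap, a common resampling technique. The data values and the number of resamples are made up for the example, and this is not necessarily the approach the course will take later on.

```r
set.seed(1)  # make the random resampling reproducible

# Hypothetical sample of eight measurements (invented for illustration).
x <- c(12, 15, 9, 20, 14, 11, 17, 13)

# Draw many samples of the same size from x, with replacement,
# and record the mean of each one.
n_boot <- 2000
boot_means <- replicate(n_boot, mean(sample(x, replace = TRUE)))

# The spread of the resampled means estimates the standard error
# of the sample mean.
boot_se <- sd(boot_means)

# A rough 95% percentile interval for the mean.
ci <- quantile(boot_means, c(0.025, 0.975))

print(boot_se)
print(ci)
```

The idea is that the variation among the resampled means stands in for the variation we would see if we could collect many new samples, which we usually cannot do.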

3  A Short Note on Academic Integrity

One of the main ways you are going to learn in this course is by discussing the material in class and in lab sections, and I feel that this is one of the most important aspects of the course. However, the programs you write are like an essay or term paper, and you should not share them with others, or ask others to give their programs to you.
The University has a very detailed code of student conduct, available at http://students.berkeley.edu/uga/conductiii-vii.asp. Please refer to that document or talk to me if there are any questions.

4  Introduction to R

Here are some basic concepts about R that we'll return to later in more detail:


