TILE - Program Evaluation Literature


Summaries of existing program evaluation literature in the context of TILE:

Gross Davis, Barbara and Sheila Humphreys. 1985. Evaluating Intervention Programs-Applications from Women's Programs in Math and Science. New York: Teachers College Press.

Anandam, Kamala and J. Terence Kelly. 1981. "Evaluating the use of Technology in Education." Journal of Educational Technology Systems. 10, 1: 21-31.

Mausner, Bernard, Edward F. Wolff, Richard W. Evans, Mary M. DeBoer, Steven P. Gulkus, Anita D'Amore, and Samuel Hirsch. 1983. "A Program of Computer Assisted Instruction for a Personalized Instructional Course in Statistics." Teaching of Psychology. 10, 4, December: 195-200.

Ross, Steven. 1984. "Matching the Lesson to the Student: Alternative Adaptive Designs for Individualized Learning Systems." Journal of Computer-Based Instruction. 11, 2, Spring: 42-48.

Duncan, Nancy C. 1993. "Evaluation of instructional software: Design considerations and recommendations." Behavior Research Methods, Instruments and Computers.  25, 2: 223-227.

Castellan, N. John. 1993. "Evaluating information technology in teaching and learning." Behavior Research Methods, Instruments and Computers. 25, 2: 233-237.

Ransdell, Sarah. 1993. "Educational software evaluation research: Balancing internal, external, and ecological validity." Behavior Research Methods, Instruments and Computers. 25, 2: 228-232.

Welsh, Josephine. 1993. "The effectiveness of computerized instruction at the college level." Behavior Research Methods, Instruments and Computers. 25, 2: 220-222.


Gross Davis, Barbara and Sheila Humphreys. 1985. Evaluating Intervention Programs-Applications from Women's Programs in Math and Science. New York: Teachers College Press.

A description of three basic types of evaluation is provided:

Preformative: during the planning phases.

Formative: to improve a program that is still being developed.

Summative: reports on the overall quality and effectiveness of a program.

To help in creating the evaluation, the authors provide ideas and processes that can help generate evaluation goals. Possible sources for evaluation questions include:

Discussions with the audiences for the evaluation (e.g., funders)

Discussions among the program staff, including the following issues:

Evaluation goals and objectives can further be defined by re-reading the original proposal, observing the program in action and reading previous evaluations.

Depending on what the evaluation aims to measure (cognitive change, attitudinal change, or both), different instruments or data sources are suggested:

Cognitive:
- administering standardized tests
- teacher's judgment of students' performance
- reviewing students' past assignments/tests

Attitudinal:
- interviewing students
- administering questionnaires

Both:
- observing students in the lab

If one of the questions to be answered concerns cognitive change, administering tests would be appropriate. Two types of tests are identified:

Norm-Referenced Tests provide information on how well an individual or group does in comparison to other individuals taking the same test.

Criterion-Referenced Tests provide information about an individual's or group's performance relative to certain independently defined standards rather than relative to the performance of others.
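To illustrate the distinction, the sketch below scores the same hypothetical result both ways; the scores and the cutoff of 70 are invented for illustration and do not come from the source.

```python
# Illustrative sketch: the same raw score interpreted norm-referenced
# (relative standing in the group) vs. criterion-referenced (pass/fail
# against an independently defined standard). All numbers are hypothetical.

def percentile_rank(score, all_scores):
    """Norm-referenced: percentage of test takers scoring at or below this score."""
    return 100.0 * sum(s <= score for s in all_scores) / len(all_scores)

def meets_criterion(score, cutoff=70):
    """Criterion-referenced: comparison against a fixed, independent standard."""
    return score >= cutoff

scores = [55, 62, 68, 71, 74, 80, 85, 91]   # hypothetical group results
student = 71

print(f"Percentile rank: {percentile_rank(student, scores):.0f}")   # relative standing
print(f"Meets criterion (>= 70): {meets_criterion(student)}")       # absolute standard
```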

The attitudinal dimension, on the other hand, if applied to statistics or math, may include:

The authors recommend drawing up an evaluation plan that specifies the instruments/data sources, the sample, and the data collection times for each evaluation question to be answered. The following is a synthesis of how this might look when applied to programs similar to TILE:

Evaluation question: Is the lab well attended?
Instrument/data source: observation; attendance list
Sample: teacher; audience
Data collection: beginning and end of each session

Evaluation question: What is the immediate impact of a lab session on the audience's attitude/knowledge?
Instrument/data source: questionnaire; interview; test
Sample: audience
Data collection: end of each session

Evaluation question: What is the audience's reaction to the entire program?
Instrument/data source: questionnaire; interview
Sample: those who attended at least one lab session
Data collection: end of program/semester
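For programs like TILE, such a plan can also be kept in a simple machine-readable form so that instruments and collection times are not lost track of. The sketch below is only one possible representation: the record structure is an assumption, while the field values follow the plan above.

```python
# Minimal sketch of the evaluation plan above as a list of records; the
# structure is an assumption, the contents follow the plan.
evaluation_plan = [
    {"question": "Is the lab well attended?",
     "instruments": ["observation", "attendance list"],
     "sample": "teacher; audience",
     "collection": "beginning and end of each session"},
    {"question": "Immediate impact of a lab session on the audience's attitude/knowledge?",
     "instruments": ["questionnaire", "interview", "test"],
     "sample": "audience",
     "collection": "end of each session"},
    {"question": "Audience's reaction to the entire program?",
     "instruments": ["questionnaire", "interview"],
     "sample": "those who attended at least one lab session",
     "collection": "end of program/semester"},
]

for item in evaluation_plan:
    print(f"{item['question']} -> {', '.join(item['instruments'])} ({item['collection']})")
```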

Tools of evaluation are discussed in more detail; the authors list different means of conducting an evaluation and discuss the pros and cons of each:

Questionnaires
Advantages: anonymous; cover a wide range of topics
Disadvantages: structured and inflexible; respondents may interpret questions differently (reliability issue)

Interviews
Advantages: flexible/adaptable; allow in-depth probing
Disadvantages: responses difficult to summarize/analyze

Observations
Advantages: provide information about the natural setting; may uncover issues not revealed through other means
Disadvantages: data difficult to synthesize

Tests
Advantages: may provide the most convincing evidence for some audiences (e.g., funders)
Disadvantages: inappropriate for interventions of limited duration

Documents, records, materials
Advantages: provide background information; may uncover issues not revealed through other sources
Disadvantages: potentially low payoff; interpretations/explanations may be lacking

For questionnaire development, the authors point out the importance of reliability and validity issues:

Reliability Issue: Does the questionnaire mean the same thing to various people at various times?
Validity Issue: Does the questionnaire measure what it purports to measure (are the questions the right indicators for the variable(s) we are trying to measure)?

Examples of how to frame questions, for instance as open- or closed-ended questions, are given. A checklist for good questions is provided:

Gross Davis and Humphreys further suggest the pilot testing of questions by

The questions then can be rewritten, and, if time allows, the pretest can be given to the pilot group again after two weeks to check for reliability.
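The two-week retest described above amounts to correlating the two administrations for the pilot group. A minimal sketch follows, assuming hypothetical pilot scores and Python 3.10+ for statistics.correlation:

```python
# Hypothetical sketch of a test-retest reliability check: the same pilot
# group answers the piloted questions twice, about two weeks apart, and the
# two score sets are correlated. The scores below are invented.
from statistics import correlation  # available in Python 3.10+

week_0 = [12, 15, 9, 18, 14, 11, 16, 13]   # hypothetical pilot scores
week_2 = [13, 14, 10, 17, 15, 10, 16, 12]  # hypothetical retest scores

r = correlation(week_0, week_2)  # Pearson r between the two administrations
print(f"Test-retest reliability (Pearson r): {r:.2f}")
# Values near 1 suggest respondents interpret the questions consistently over
# time; low values flag the reliability problem described above.
```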

For a formative evaluation, questionnaires may be administered in the following ways:

As for interviews, the authors recommend structured interviews (with a choice of answers already provided) or semi-structured interviews (built around a specific set of questions) rather than unstructured interviews--the more you know what information you want and the less experienced the interviewers are, the more structured the interviews should be. As with questionnaires, the framing of questions and pilot testing are also important for interviews.

For the evaluation of courses or curricula, it is particularly important that the interviewer is not the teacher or professor.

Observation as a tool for data gathering is also explored. The authors, again, distinguish between different formats:

A list of what to observe is also provided:

If the program evaluation does not involve a control group, the authors recommend the following alternatives as suitable comparison groups:

 


Anandam, Kamala and J. Terence Kelly. 1981. "Evaluating the use of Technology in Education." Journal of Educational Technology Systems. 10, 1: 21-31.

The authors distinguish between several instructional uses of the computer. In addition to "learning about the computer," they differentiate between:

[see K.L. Zinn, Instructional Uses of Computers in Higher Education, The Fourth Inventory of Computers in Higher Education: An Interpretive Report, EDUCOM, Princeton, New Jersey, pp. 103-126, 1979]

Studies in the 1970s concluded that all of these computerized methods are at least as effective as non-computerized methods in bringing about learning gains. CAI applications, however, have not considered student characteristics and subject-matter uniqueness, due to a "passive view of the student" in a "frame-oriented approach" to learning.

Anandam and Kelly stress that CAI really is more "instruction" than "learning" oriented. For learning to occur, more "individualization" is needed, for which there are four levels:

  1. Arranging a predetermined instructional sequence conditional on different responses to prespecified questions with immediate feedback.
  2. Choosing subsequent instruction based on a dynamic measure of performance on previous materials.
  3. Providing different presentation modes or instructional sequences based on individual differences such as aptitudes, interests, or personality.
  4. Hypothesizing a model of learning for each student consisting of procedures for presentation of instructional materials and assessment of performance (model is modified as learning occurs, allowing student to learn the material and gain insight into learning itself).

[see: G.P. Kearsley, Some Conceptional Issues in Computer-Assisted Instruction, Journal of Computer-based Instruction, 4, pp.8-16, August 1977]
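The first of the four levels above, a predetermined sequence with prespecified questions and immediate feedback, is the frame-oriented design the authors describe as passive. A minimal sketch of what such a sequence looks like is shown below; the frame content is hypothetical and not taken from the article.

```python
# Illustrative sketch of level 1 only (not code from the source): a
# predetermined sequence of frames, each with a prespecified question,
# immediate feedback, and a fixed branch for wrong answers. Frame content
# is hypothetical.

frames = [
    {"text": "The mean is the sum of scores divided by their number.",
     "question": "Mean of 2, 4, 6? ", "answer": "4",
     "feedback_wrong": "Add the scores (12) and divide by 3."},
    {"text": "The median is the middle score of an ordered list.",
     "question": "Median of 1, 3, 9? ", "answer": "3",
     "feedback_wrong": "Order the scores and take the middle one."},
]

def run_sequence(frames):
    for frame in frames:
        print(frame["text"])
        while True:                              # repeat the frame until correct
            reply = input(frame["question"]).strip()
            if reply == frame["answer"]:
                print("Correct.")                # immediate feedback
                break
            print("Not quite. " + frame["feedback_wrong"])

if __name__ == "__main__":
    run_sequence(frames)
```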

The authors conclude that, for effective learning to occur, flexibility and selectivity are needed. To promote effectiveness, the following curriculum questions arise:



Mausner, Bernard, Edward F. Wolff, Richard W. Evans, Mary M. DeBoer, Steven P. Gulkus, Anita D'Amore, and Samuel Hirsch. 1983. "A Program of Computer Assisted Instruction for a Personalized Instructional Course in Statistics." Teaching of Psychology. 10, 4, December: 195-200.

The article is an evaluation of highly interactive computer units, a CAI System Program for Statistics at Beaver College (CAI-Stat).

The objective of the software was to develop a procedure by which students learn underlying concepts of descriptive and inferential statistics (problem solving).

This occurred in a largely self-paced process (students could move as fast or slowly as they liked) of computer interaction with immediate feedback. After each completed unit of the program, tests were administered to the student. Lastly, a final exam involved concepts and problem-solving exercises.

The features of the software included:

The authors identify two principles of courseware design for instructional units:

Principle #1: Context or "Problem-Oriented Instruction"

Principle #2: Employing the Expert's Problem-Solving Procedure: "When several solution principles are to be taught within the context of a relatively complex problem, the order of instruction should follow an expert's order of access to these problem-solving principles"

In the program, most statistical procedures were taught according to these principles.

For the evaluation, a posttest of problem solving ability was administered to two groups and produced the following results:

Experimental group: Computer-based course (mean of 6.61 correct answers)
Control group: Workbook-based course (mean of 4.64 correct answers)

[ The TEST can be found in: Evans, R.W. A computerized course in elementary statistics: Educational objectives and methods. In Proceedings of NECC 1981: National Educational Computing Conference, pp. 254-258 ]
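The article reports only the two group means. The sketch below shows how such a between-groups posttest comparison is commonly analyzed; the individual scores are invented for illustration, and the use of SciPy and Welch's t-test is an assumption, not the authors' procedure.

```python
# Hypothetical sketch of the kind of between-groups comparison reported
# above; the individual scores below are invented, only the two group
# means (6.61 vs. 4.64) come from the article.
from scipy import stats

cai_group      = [7, 6, 8, 7, 5, 7, 6, 7]   # hypothetical computer-based course scores
workbook_group = [5, 4, 6, 5, 4, 5, 4, 4]   # hypothetical workbook-based course scores

t, p = stats.ttest_ind(cai_group, workbook_group, equal_var=False)  # Welch's t-test
print(f"t = {t:.2f}, p = {p:.3f}")
```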

The authors concluded that improved performance in problem-solving ability is most likely a result of the design principles employed in the creation of the courseware.

Furthermore, three criteria for evaluating CAI programs are discussed. The program/software should:

  1. include adequate training of tutors/teachers in using the computer as an instructional aid
  2. fit the needs of the students and should blend into the existing curriculum
  3. utilize the unique capacities of the computer as an interactive tool, especially in the form of branching capabilities, which a textbook cannot offer

Furthermore, an analysis of individual student characteristics related to performance in the CAI course was made.

This was based on:

1) Log-file including data on:

2) Pretest:

  1. Perception of teacher
  2. Anxiety towards math
  3. Value of mathematics in society
  4. Self-concept in math
  5. Enjoyment of math
  6. Motivation in math

3) Final Grades

All of the math-attitude factor scores, except for self-concept, were significantly and highly correlated with final grades, especially math anxiety (correlation: -.70).

On the attitudinal dimension, students reacted positively to the program.

Finally, the authors emphasize the relationship between self-pacing and different types of students: students oriented toward external rewards (grades, graduate school, mastery) expect positive consequences from regular and systematic work, whereas those oriented toward internal rewards (avoidance of anxiety, creativity, long-term memory) expect lower utility from self-pacing. Encouraging students to use the program is therefore vital.



Ross, Steven. 1984. "Matching the Lesson to the Student: Alternative Adaptive Designs for Individualized Learning Systems." Journal of Computer-Based Instruction. 11, 2, Spring: 42-48.

Ross outlines typical educational computer programs, which tend to incorporate the following orientations of "control":

Ross' research broadens the types of instructional properties and bases adaptive decisions on more extensive information about learner background and current needs. He examines adaptation types, including

which were applied to the teaching of basic concepts in an undergraduate statistics course.

"Program Control" was implemented for a self-instructional lesson (CMI) covering 10 algebraic rules which comprised prerequisite learning for a statistics course. The "individual adaptive strategy" in the computer program incorporated the following steps (steps 4 - 6 represented a loop until all lessons were completed):

  1. Entry Test (pretest, aptitude, locus of control, anxiety, etc.)
  2. Regression Prediction (a set of 10 predicted scores was generated for each student using multiple regression equations)
  3. Adaptive Prescription
  4. Lesson
  5. Immediate Posttest (after each lesson)
  6. Refinements (posttest scores were used to refine the prescription for the next lesson)
  7. Cumulative posttest
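A minimal sketch of the loop described in steps 1-7 follows. The regression coefficients, the prescription rule, and the refinement rule are all hypothetical stand-ins; only the overall step structure follows the summary above.

```python
# Minimal sketch (assumptions labeled) of the adaptive CMI loop described
# above. All numeric rules are hypothetical stand-ins.

def predict_scores(entry):
    """Step 2: one predicted score per lesson from entry-test measures
    (hypothetical weights standing in for the 10 regression equations)."""
    base = 0.6 * entry["pretest"] + 0.3 * entry["aptitude"] - 0.2 * entry["anxiety"]
    return [max(0.0, min(1.0, (base + 2 * i) / 100)) for i in range(10)]

def prescribe(predicted):
    """Step 3: weaker predicted performance -> more practice items (assumed rule)."""
    return 15 if predicted < 0.5 else 10 if predicted < 0.75 else 5

def run_course(entry, take_lesson):
    predictions = predict_scores(entry)             # step 2
    results = []
    for lesson, predicted in enumerate(predictions):
        items = prescribe(predicted)                # step 3: adaptive prescription
        posttest = take_lesson(lesson, items)       # steps 4-5: lesson + immediate posttest
        results.append(posttest)
        # Step 6: refine the next prediction using the observed posttest score
        if lesson + 1 < len(predictions):
            predictions[lesson + 1] = 0.5 * predictions[lesson + 1] + 0.5 * posttest
    return sum(results) / len(results)              # stand-in for the cumulative posttest

# Example run with a simulated learner (hypothetical entry-test values):
avg = run_course({"pretest": 60, "aptitude": 70, "anxiety": 40},
                 take_lesson=lambda lesson, items: min(1.0, 0.55 + 0.02 * items))
print(f"simulated cumulative performance: {avg:.2f}")
```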

The evaluation of the computer software occurred in different studies.

Adaptation Study 1

The following are results (based on % answers correct on the posttest) for the first adaptive study, involving several strategies:

  1. Individualized-adaptive strategy: 75% correct
  2. Group-adaptive strategy: 62% correct
  3. Nonadaptive strategy: 57% correct

Based on these findings, the author concludes that an individualized-adaptive strategy is well suited to CMI models of instruction. Four subsequent studies were conducted. Study 2, a replication of the first study, again favored the individualized-adaptive treatment.

Studies 3 and 4 involved an experiment on rewards or incentives, while the purpose of Study 5 was an evaluation of program versus learner control in PSI models.

Program control surpassed learner control and lecture on the immediate posttest, and surpassed all treatments on the delayed posttest. Learner control was associated with the lowest performance. Ross concludes that program control was most beneficial relative to learner control when pretest scores were low.

Ross further emphasizes that adaptation of context may increase conceptual retention.

In an experiment, statistical probability rules were presented in contexts varying in relatedness to subjects' academic majors.

The hypothesis to be tested, thus, was that adaptive (familiar) contexts facilitate assimilation of new information in memory and are more likely to promote meaningful learning.

An evaluation of Education, Medical, and Abstract contexts supported this hypothesis:

Some benefits were observed in comparison with nonadaptive contexts, and especially in comparison with abstract contexts.

In conclusion, the author stresses the importance of adaptive contexts and suggests that they support three types of functions:

  1. Generate interest in the task
  2. Activate relevant past experiences as conceptual anchoring for information
  3. Associate rules in memory with a meaningful set of ideas



Duncan, Nancy C. 1993. "Evaluation of instructional software: Design considerations and recommendations." Behavior Research Methods, Instruments and Computers. 25, 2: 223-227.

Duncan discusses general evaluation questions, including sources of internal invalidity (non-randomness in the selection process), the types of educational activity used for comparison, and outcome measures.
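Non-randomness in selection is typically addressed by random assignment to conditions. The sketch below is a generic illustration with an invented roster, not a procedure from the article.

```python
# Hypothetical sketch: random assignment of a roster to CAI and comparison
# sections, one way to avoid the volunteer-selection bias discussed here.
# The roster and seed are invented for illustration.
import random

roster = [f"student_{i:02d}" for i in range(1, 21)]   # hypothetical class roster

random.seed(42)                 # fixed seed so the split is reproducible
random.shuffle(roster)
half = len(roster) // 2
cai_section, comparison_section = roster[:half], roster[half:]

print("CAI section:       ", cai_section)
print("Comparison section:", comparison_section)
```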

Possible sources of internal invalidity include selection bias: results may be difficult to interpret if participants are selected as volunteers. (Another problem mentioned in this article is students' awareness that some are receiving CAI while others are not.) The author stresses the importance of measuring learner characteristics in order to construct an effective evaluation; however, which characteristics should be assessed, and which instruments should be used to assess them, are flexible and depend on each evaluation design. Besides tests, other possible assessments of student characteristics include:

The potential of using the computer itself to assess the nature of the learning process (similar to "tracing" student use) is also discussed by Duncan. Several packages are mentioned, such as:

Computer-based supplemental exercises are seen by many students not as an educational opportunity but as an added requirement. Making educational software attractive to students is therefore an important task.

The evaluation of TILE concerns two subcategories:

Assessment of students' perceptions of the usefulness of the program and any problems they encountered (some example questions are listed in Ransdell, p. 231, same issue).

Assessment of the effectiveness of the software by comparing test results of "treatment" and "control" groups. The selection of a proper activity for comparison can be a difficult task. This part of the evaluation will run into validity problems, such as the dilemmas of between-group comparisons and factors such as students' study habits and variable interest. Measurement could also take long-term factors into account by looking at how the educational software has stimulated interest in course content, or even in major or career choices.


Castellan, N. John. 1993. "Evaluating information technology in teaching and learning." Behavior Research Methods, Instruments and Computers. 25, 2: 233-237.

Castellan discusses "strategic evaluation," which emphasizes technical accuracy; this is crucial to ensure that students are not hindered by an inability to use the technology. All procedures involving technical or computer skills should be well explained.

Furthermore, pedagogical soundness is emphasized: the software must convey the content and concepts to be learned. Clearly articulated instructional goals are also considered important. Some questions to ask when designing instructional software include: Does the technology encourage testing ideas and concepts? Can the skills and concepts learned be transferred beyond the context in which they were introduced?

Then, substantive fidelity is outlined: the material has to be accurate and worth learning. Moreover, integrative flexibility is emphasized: the class/course structure needs to be modified along with the introduction of the new technology.

Finally, cyclic improvement calls for evaluations during and after the course, so that results can be compared with evaluations made before the software is used. For instance: do students' opinions of CAI (computer-assisted instruction) change after experience with the program?


Ransdell, Sarah. 1993. "Educational software evaluation research: Balancing internal, external, and ecological validity." Behavior Research Methods, Instruments and Computers. 25, 2: 228-232.

Ransdell discusses the difficulties in evaluating software in terms of tradeoffs between internal and external or ecological validity.

Internal validity refers to the "degree to which a design allows for unconfounded results." The problem of comparing computer-based instruction with traditional forms of teaching is one example of an internal-validity risk: media, rather than "messages" (symbols carrying meaning), end up being compared. Further threats to internal validity include "improvements or declines in performance due to students' attendance or study habits" as well as "variable interest in, and difficulty of, individual topics."

External validity is the "degree to which results can be generalized to apply to other populations, settings, or levels of variables." Threats to external and ecological validity include "studies of short duration," "homogeneous samples of college student ability," students' respondent biases, and instructors' observer biases.

The evaluation discussed by the author, involving two groups of students (one from a community college, the other from a university), is based on a survey administered along with a midterm exam. It includes some potentially useful questionnaire items for a formative evaluation:

Q: When watching the computer activities, were you ever frustrated by them? If so, can you describe them?

Q: Did you usually understand the computer program's main objective? If not, which activity was difficult to see what it was for?

Q: Are the concepts and terminology in the computer activities related to the material covered in the class lectures?

Q: Describe any activities particularly interesting to you and those that were tedious or boring.

 



Welsh, Josephine. 1993. "The effectiveness of computerized instruction at the college level." Behavior Research Methods, Instruments and Computers. 25, 2: 220-222.

Five suggestions for a successful software implementation are discussed:

  1. The resource conservation (e.g., efficiency and cost-effectiveness) that is possible with CAI and CBI (computer-based instruction), rather than resource loss, should be emphasized.
  2. Software should be chosen wisely, and teachers should be involved in its evaluation.
  3. Be aware that some students fear computers, even though, at the college level, the problem of attitude is not as great as at the secondary school level.
  4. The use of computers should not be limited to computer-assisted instruction, as the effectiveness of computer-based instruction has been repeatedly demonstrated.
  5. A continued investigation of both CAI and CBI, in the form of comparative research, is still needed.

Relevant to TILE is the author's note that students may pay more attention to screens that contain interactive examples or provide review questions with immediate feedback. Welsh found that students tend to hurry through screens of text and spend time on the demonstrations. While this can make the software attractive, the author cautions that it may not, by itself, compel students to learn.


