class: blueBack

## Teaching evaluations: class act or class action?

### Philip B. Stark

Department of Statistics
University of California, Berkeley

### National Center for the Study of Collective Bargaining in Higher Education and the Professions

### Annual Conference

### Hunter College

### New York, NY
19–20 April 2015

---

## What do SET measure? No consensus.

- .blue[SET scores are highly correlated with students' grade expectations]
Boring et al., 2015; Marsh & Cooper, 1980; Short et al., 2012; Worthington, 2002

--

- .blue[SET scores & enjoyment scores are strongly correlated]
Stark, unpublished, 2014

--

- .blue[SET can be predicted from the students' reaction to 30 seconds of silent video of the instructor; physical attractiveness matters]
Ambady & Rosenthal, 1993

--

- .blue[the instructor's gender, ethnicity, & age matter]
Anderson & Miller, 1997; Basow, 1995; Boring, 2014; Boring et al., 2015; Cramer & Alexitch, 2000; Marsh & Dunkin, 1992; MacNell et al., 2014; Wachtel, 1998; Weinberg et al., 2007; Worthington, 2002

--

- .blue[omnibus questions about curriculum design, effectiveness, etc., are most influenced by factors unrelated to learning]
Worthington, 2002

---

### Gold standard: Randomized, controlled experiments

--

### Carrell & West, 2008

The United States Air Force Academy assigns students to instructors at random in core courses, including follow-on courses. All sections have identical syllabi and exams.

.framed[
Student evaluations are positively correlated with contemporaneous professor value-added and negatively correlated with follow-on student achievement.

.blue[That is, students appear to reward higher grades in the introductory course but punish professors who increase deep learning (introductory-course professor value-added in follow-on courses).]
]

--

### Braga, Paccagnella, & Pellizzari, 2011

Randomized assignment of students to instructors at Bocconi University, Milan

.framed.looser[
.blue[in other words, teachers who are associated with better subsequent performance receive worst evaluations from their students.]
]

---

### MacNell, Driscoll, & Hunt, 2014: Gender Bias

.left-column[
NC State online course. Randomized assignment of students into 4 groups. 2 instructors: 1 male, 1 female.

Each instructor was identified to students by actual gender in 1 section and by false gender in 1 section.

Regardless of actual gender, ratings were substantially higher when the instructor was identified as male, even for "objective" measures, e.g., speed of returning homework. 5-point scale.
]

.right-column[
| Characteristic | F - M |
|:---------------|------:|
| Caring         | -0.52 |
| Consistent     | -0.47 |
| Enthusiastic   | -0.57 |
| Fair           | -0.76 |
| Feedback       | -0.47 |
| Helpful        | -0.46 |
| Knowledgeable  | -0.35 |
| Praise         | -0.67 |
| Professional   | -0.61 |
| Prompt         | -0.80 |
| Respectful     | -0.61 |
| Responsive     | -0.22 |
]

---

### Boring, Ottoboni, & Stark, 2015: .blue[SET measure gender, grade expectations, NOT effectiveness]

.framed.looser[
Natural experiment: 22,665 SETs, 1,177 course sections, 372 instructors, 4,423 students, five years

+ association between SET & final exam scores negative but insignificant (P ≈ 0.57)
+ association between SET & grade expectations positive & highly significant (P ≈ 0.00)
+ association between instructor gender & final exam score insignificant (students of male instructors do worse; P ≈ 0.52 overall, 0.76 male students, 0.68 female students)
+ association between instructor gender & SET highly significant: male students rate male instructors higher (P ≈ 0.00 overall, 0.00 male students, 0.53 female students)
+ results vary widely across disciplines, so one can't "correct" for any of this
]
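The P values above come from nonparametric tests exploiting the quasi-random assignment of students to sections. As a minimal sketch only, on invented numbers rather than the authors' code or data, here is a two-sided permutation test of the kind used to assess such associations:

```python
import numpy as np

def permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    If group labels are irrelevant (the null hypothesis), shuffling
    the pooled values gives the null distribution of the difference.
    """
    rng = np.random.default_rng(seed)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = pooled[:a.size].mean() - pooled[a.size:].mean()
        if abs(diff) >= abs(observed):
            exceed += 1
    return observed, exceed / n_perm

# Invented section-average SET scores (5-point scale) for
# male- vs. female-identified instructors -- illustration only.
male = np.array([4.2, 3.9, 4.5, 4.1, 3.8, 4.4])
female = np.array([3.7, 4.0, 3.5, 3.9, 3.6, 3.8])
diff, p = permutation_test(male, female)
print(f"observed M - F difference: {diff:.2f}, P = {p:.3f}")
```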
---

### Lauer, 2012: Student comments knotty, too

Survey of 185 students, 45 faculty at Rollins College, Winter Park, Florida

> .blue[I once believed that narrative comments on course evaluation forms were straightforward and useful.]

--

Faculty & students ascribe quite different meanings to words such as "fair," "professional," "organized," "challenging," & "respectful."

--

| "not fair" means …                | student % | instructor % |
|:----------------------------------|----------:|-------------:|
| plays favorites                    |      45.8 |         31.7 |
| grading problematic                |       2.3 |         49.2 |
| work is too hard                   |      12.7 |            0 |
| won't "work with you" on problems  |      12.3 |            0 |
| other                              |       6.9 |           19 |
---

### Benton & Cashin, 2012: exemplar SET apologists

.framed[
.blue[
> It is difficult to get a man to understand something, when his salary depends upon his not understanding it!
>
> —Upton Sinclair
]
]

+ Widely cited, unrefereed technical report from a business that sells SET; flawed statistics

--

+ They rebut straw-man positions:
  - Students cannot make consistent judgments.
  - Student ratings are just popularity contests.
  - Students will not appreciate good teaching until they are out of college a few years.
  - Students just want easy courses.
  - Student feedback cannot be used to help improve instruction.

--

+ The two non-absolutist statements they reject are demonstrably true:
  - Student ratings are unreliable and invalid.
  - The time of day the course is offered affects ratings.

--

+ The remaining statement they reject is true, in my experience as a teacher and department chair:
  - Emphasis on student ratings has led to grade inflation.
See also Ewing, 2012; Isely & Singh, 2005; Krautmann & Sander, 1999; McPherson, 2006

---

### Recommendations

1. Do not use SET as a measure of teaching effectiveness. Reliance on SET clearly has disparate impact.
1. Drop omnibus items about "overall teaching effectiveness" and "value of the course."
1. Use teaching portfolios as part of the review process.
1. Use classroom observation as part of milestone reviews.
1. To improve and evaluate teaching fairly and honestly, spend time observing teaching and reviewing teaching materials.