class: blueBack ## Teaching evaluations: truthful or truthy? ### Philip B. Stark
Department of Statistics
University of California, Berkeley ### Third Lisbon Research Workshop on Economics, Statistics and Econometrics of Education ### Lisbon, Portugal
23–24 January 2015 --- Student evaluations of teaching (SET) are the primary "data" used to assess teaching for the purpose of hiring, firing, and promoting faculty: they can make or break the careers of contract faculty. I will summarize evidence that SET are misleading and discriminatory indicators of teaching effectiveness. Indeed, the best studies find SET negatively associated with subsequent student performance and that even ratings of "objective" items, such as whether assignments are returned promptly, are influenced strongly by the gender of the instructor. Other studies suggest that omnibus items such as "overall effectiveness" are particularly affected by students' grade expectations and the gender, attractiveness, and perceived approachability of the instructor. What students mean by "fair," "professional," "organized," "challenging," and "respectful" differs surprisingly from how faculty understand those words. Finally, statistics used to summarize and compare SET across courses, instructors, and disciplines are flawed: ratings of courses of different types, levels, and subjects are incommensurable; and such comparisons generally rely on (inappropriate) averages of ordinal data and ignore response rates, multimodality, scatter, and sources of bias. Calculating averages to one or two decimal places distracts attention from the fact that the underlying data do not measure what they purport to. Much of this is common knowledge. Why, then, do institutions rely on SET for personnel decisions? I suggest three linked economic hypotheses: (1) the perceived cost of more meaningful evaluations is high and falls primarily on individuals who may not benefit from improving evaluation; (2) the perceived cost of firing effective instructors with low SET scores or of promoting ineffective instructors with high SET scores is low and falls primarily on students; and (3) to the extent that SET measure "customer satisfaction," relying on SET may meet the business interests of institutions better than more meaningful measures of teaching effectiveness would. --- ### Joint work with Richard Freishtat Stark & Freishtat, 2014. [An Evaluation of Course Evaluations](https://www.scienceopen.com/document/vid/42e6aae5-246b-4900-8015-dc99b467b6e4), [ScienceOpen](www.scienceopen.com), DOI: 10.14293/S2199-1006.1.SOR-EDU.AOFRQA.v1 [Truthiness (Colbert Report: The Word)](http://thecolbertreport.cc.com/videos/63ite2/the-word---truthiness)
--- ## Student Evaluations of Teaching (SET) + most common method to evaluate teaching + define "effective teaching" for many purposes + primary information about teaching for hiring, firing, tenure, promotion + simple, cheap, painless to administer + survey, on paper or online - Typically on Likert scale of 1-5 or 1-7 - student comments sometimes solicited --- ## Typical items: - .blue[Considering the limitations & possibilities of the subject matter & the course, how would you rate the overall effectiveness of this instructor?] - Considering the limitations & possibilities of the subject matter, how effective was the course? - The instructor presented content in an organized manner - The instructor explained concepts clearly - The instructor was helpful when I had difficulties or questions - The instructor provided clear constructive feedback - The course was effectively organized - The course developed my abilities & skills for the subject - The course developed my ability to think critically about the subject - .blue[On average, how many hours per week have you spent on this course?] -- - ≈40% of students report spending >20 h/w on every course --- ## What's effective teaching? -- + Some students will learn no matter what; some won't no matter what. -- + Effective teaching presumably facilitates learning. -- + Grades generally not a good proxy for learning ("teaching to the test," easy grading, etc.) -- + Students generally not well equipped to judge immediately and accurately how much they learned. -- + Serious problems with confounding --  -- + .blue[Need controlled, randomized experiments] --- ## Confusing validity & reliability + Many studies of SET address *reliability*: -- - Do different students rate the same instructor similarly? -- - Would a student rate the same instructor consistently later? -- + Unrelated to whether SET measure effectiveness. -- A hundred bathroom scales might all report your weight to be exactly the same. -- .blue[That doesn't mean they measured your _height_ accurately.] -- .blue[(Or your weight, for that matter.)] -- #### The question is _validity_: Do SET primarily measure teaching effectiveness? Or something else? -- Are SET a "fair" measure of effectiveness? Or biased? -- Do some instructors predictably get systematically lower SET for reasons other than effectiveness (e.g., gender, attractiveness, ethnicity, rigorous grading, time of day of the course, class size)? --- #### Is reliability (i.e., consistency) a sign of a good measurement? + instructors unlikely to be equally effective with students who have different - backgrounds - preparation - skill/aptitude - study habits - dispositions - maturity - "learning styles" -- + consistency of ratings would suggest the instructor is equally effective with all kinds of students -- + if a laboratory instrument always gives the same reading when its inputs vary substantially, it’s probably broken. -- + standard reporting ignores rater consistency: gives only averages -- + but, can _measure_ IRR in situ in every class: artificial problem --- ### Crunching the numbers + Averages make sense only if the scale is _proportional_ -- + Is the difference between 1 & 2 the same as the difference between 5 & 6? -- + Does a 1 balance a 7 to make two 4s? -- + Does a 3 mean the same thing to every student, in every class—even approximately? -- + Does a 5 in an upper-division elective architecture studio mean the same thing as a 5 in a required freshman econ course with 500 students? 
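A minimal sketch of the problem, using made-up ratings rather than real SET data: the labels 1–7 are only ordered categories, so any monotone re-coding of them is as defensible as the labels themselves, and the choice of re-coding can change how instructors compare "on average."

```python
# Hypothetical ratings, for illustration only (not real SET data).
from statistics import mean

ratings_A = [4, 4, 4, 4]   # everyone lukewarm
ratings_B = [1, 1, 7, 7]   # polarized class; same raw average as A

# One arbitrary but monotone assignment of values to the labels 1..7,
# e.g., treating the step from 6 to 7 as larger than the step from 1 to 2.
recode = {1: 0, 2: 1, 3: 2, 4: 3, 5: 5, 6: 7, 7: 10}

for name, scores in [("A", ratings_A), ("B", ratings_B)]:
    print(name,
          "raw mean:", mean(scores),
          "recoded mean:", mean([recode[s] for s in scores]))
# Raw means tie at 4; under the re-coding, B (mean 5) looks better than A (mean 3).
```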
-- .red[Averaging SET scores doesn't make sense. Comparing average SET scores across courses, instructors, levels, types of classes, and disciplines doesn't make sense. ] --- ### The importance of variability -- Three statisticians go deer hunting. -- The first shoots and misses a meter to the left. -- The second shoots and misses a meter to the right. -- .blue[The third yells **"We got it!"**] -- + Averages throw away valuable information about variability: - (1+1+7+7)/4 = (2+3+5+6)/4 = (1+5+5+5)/4 = (4+4+4+4)/4 = 4 - There's a difference between a polarizing teacher and a teacher with mediocre ratings + .blue[Things that are equal _on average_ are not necessarily commensurable.] --- ## What do SET measure? No consensus. - .blue[SET scores are highly correlated with students' grade expectations]
Marsh & Cooper, 1980; Short et al., 2012; Worthington, 2002 -- - .blue[SET scores & enjoyment scores _very_ strongly correlated]
Stark, unpublished, 2014 -- - .blue[SET can be predicted from the students’ reaction to 30 seconds of silent video of the instructor; physical attractiveness matters]
Ambady & Rosenthal, 1993 -- - .blue[gender, ethnicity, & the instructor’s age matter]
Anderson & Miller, 1997; Basow, 1995; Boring, 2014; Cramer & Alexitch, 2000; Marsh & Dunkin, 1992; MacNell et al., 2014; Wachtel, 1998; Weinberg et al., 2007; Worthington, 2002 -- - .blue[omnibus questions about curriculum design, effectiveness, etc. appear most influenced by factors unrelated to learning]
Worthington, 2002 --- ## The gold standard Three randomized, controlled experiments: .framed.looser[ .blue[ + The US Air Force Academy: Carrell & West, 2008 + Bocconi University, Milan: Braga, Paccagnella, & Pellizzari, 2011 + NC State online course: MacNell, Driscoll, & Hunt, 2014 ]] --- ### Carrell & West, 2008 United States Air Force Academy assigns students to instructors at random in core courses, including follow-on courses. All sections have identical syllabi and exams. .framed[ > .blue[professors who excel at promoting contemporaneous student achievement teach in ways that improve their student evaluations but harm the follow-on achievement of their students in more advanced classes.] > Academic rank, teaching experience, and terminal degree status of professors are negatively correlated with contemporaneous value-added but positively correlated with follow-on course value-added. > Hence, students of less experienced instructors who do not possess a doctorate perform significantly better in the contemporaneous course but perform worse in the follow-on related curriculum. > Student evaluations are positively correlated with contemporaneous professor value-added and negatively correlated with follow-on student achievement. .blue[That is, students appear to reward higher grades in the introductory course but punish professors who increase deep learning (introductory course professor value-added in follow-on courses).] ] --- ### Braga, Paccagnella, & Pellizzari, 2011 Randomized assignment of students to instructors at Bocconi University, Milan > .framed.looser[The effectiveness measures are estimated by comparing the subsequent performance in follow-on coursework of students who are randomly assigned to teachers in each of their compulsory courses. We find that, even in a setting where the syllabuses are fixed, teachers still matter substantially. The average difference in subsequent performance between students who were assigned to the best & worst teachers (on the effectiveness scale) is approximately 43% of a standard deviation in the distribution of exam grades, corresponding to about 5.6% of the average grade. Additionally, we find that our measure of teacher effectiveness is negatively correlated with the students' evaluations of professors: .blue[in other words, teachers who are associated with better subsequent performance receive worst evaluations from their students.] ] --- ### MacNell, Driscoll & Hunt, 2014: Gender Bias .left-column[ NC State online course. Randomized assignments of students into 4 groups. 2 instructors, 1 male 1 female. Each instructor was identified to students by actual gender in 1 section, false gender in 1 section. Regardless of actual gender, substantially higher ratings when each instructor was identified as male, even for "objective" measures, e.g., speed of returning homework. 5-point scale. ] .right-column[
| Characteristic | F - M |
|----------------|-------|
| Caring | -0.52 |
| Consistent | -0.47 |
| Enthusiastic | -0.57 |
| Fair | -0.76 |
| Feedback | -0.47 |
| Helpful | -0.46 |
| Knowledgeable | -0.35 |
| Praise | -0.67 |
| Professional | -0.61 |
| Prompt | -0.80 |
| Respectful | -0.61 |
| Responsive | -0.22 |
] --- ### Boring, 2014: more evidence of gender bias .framed.looser[This paper uses a unique database from a French university to show that student evaluations of teachers (SETs) suffer from gender biases. Male students in particular tend to give higher overall satisfaction scores to male teachers, rewarding them for their perceived higher quality in course delivery style. These gender biases create different incentives for male and female teachers to change behaviors in order to improve their SET scores. Male teachers can increase their SET scores by investing more effort in the characteristics that male students tend to value more. However, female teachers must invest more effort improving the teaching dimensions in which students tend to perceive a slight comparative advantage for women, i.e. course structure, organization and teaching material. Because students do not value these teaching dimensions as much in terms of their ratings of overall satisfaction for a course, male teachers tend to stay longer in position, as they respond better to male students’ incentives. .blue[The results suggest that better teaching is not necessarily measured by SETs.] ] --- ### Lauer, 2012: Student comments knotty, too Survey of 185 students, 45 faculty at Rollins College, Winter Park, Florida > .blue[I once believed that narrative comments on course evaluation forms were straightforward and useful.] -- Faculty & students use the same vocabulary quite differently, ascribing quite different meanings to words such as "fair," "professional," "organized," "challenging," & "respectful." --
| "not fair" means … | student % | instructor % |
|--------------------|-----------|--------------|
| plays favorites | 45.8 | 31.7 |
| grading problematic | 2.3 | 49.2 |
| work is too hard | 12.7 | 0 |
| won't "work with you" on problems | 12.3 | 0 |
| other | 6.9 | 19 |
--- ### SET apologists strike back! .framed[ > I felt compelled to write this blog after reading Philip Stark and Richard Freishtat’s opening salvo from their article, "An Evaluation of Course Evaluations” recently summarized in the Chronicle of Higher Education: >> … it is widely believed that [student ratings of instruction] are primarily a popularity contest; that it’s easy to "game” the ratings; that good teachers get bad ratings and vice versa; and that fear of bad ratings stifles pedagogical innovation and encourages faculty to water down course content (p. 1). > Some people also believe that climate change is a hoax. But does thinking make it so? > Were these the only unsubstantiated claims the authors made, I might have been able to resist writing this blog. [Steve Benton's blog at IDEA](http://ideaedu.org/ideablog/2014/09/evaluation-%E2%80%9C-evaluation-course-evaluations%E2%80%9D-part-i), 29 September 2014 ] Benton's criticisms are hilarious for several reasons, among them, _the very next sentence_ in our paper is .blue["What is the truth?"] --- ### Who supports SET? .framed[ .blue[ >> It is difficult to get a man to understand something, when his salary depends upon his not understanding it! —Upton Sinclair ] ] -- #### Benton's _job_ is to sell SET; that's what his firm, IDEA, does. --- ### Benton & Cashin, 2012: exemplar SET apologists + Widely cited, but it's a technical report from IDEA, a business that sells SET teaching evaluations. -- + Claims SET are reliable and valid. -- + Does not cite Carrell & West (2008) or Braga et al. (2011), the only two randomized experiments I know of published before B&C (2012) -- + As far as I can tell, no study B&C cite in support of validity used randomization. --- ### Benton & Cashin's argument + Straw man: they rebut absolutist positions no sane person would take, e.g. (at p.2) - Students cannot make consistent judgments. - Student ratings are just popularity contests. - Students will not appreciate good teaching until they are out of college a few years. - Students just want easy courses. - Student feedback cannot be used to help improve instruction. -- + The two non-absolutist statements they oppose are demonstrably true: - Student ratings are unreliable and invalid. - The time of day the course is offered affects ratings. -- + The remaining statement they oppose is true, in my experience as a teacher and department chair: - Emphasis on student ratings has led to grade inflation.
See also Ewing, 2012; Isely & Singh, 2005; Krautmann & Sander, 1999; McPherson, 2006 --- ### Benton & Cashin on validity >> Theoretically, the best indicant of effective teaching is student learning. Other things being equal, the students of more effective teachers should learn more. -- I agree. -- >> A number of studies have attempted to examine this hypothesis by comparing multiple-section courses. For example, Benton and colleagues (Benton, Duchon, & Pallett, 2011) examined student ratings in multiple sections of the same course taught by the same instructor. They correlated student ratings of progress on objectives the instructor identified as relevant to the course (using IDEA student ratings) with their performance on exams tied to those objectives. Student ratings correlated positively with four out of five exams and with the course total points (r = .32). .blue[What's wrong with this argument?] --- #### Again, address a straw man hypothesis: - I've never seen a claim that SET have **absolutely no connection** to teaching effectiveness -- - indeed, SET are associated with class enjoyment, which may affect engagement & learning >UCB Dept. Stat, fall 2012, 1486 students rated instructor overall effectiveness & enjoyment of the course. -- .blue[ > Correlation btw instructor effectiveness & enjoyment: 0.75.
Correlation btw course effectiveness & enjoyment: 0.8. ] -- - .red[the question is not whether there's _any_ association between SET and effectiveness.] .blue[the question is _how well_ SET measure effectiveness, and whether factors unrelated to effectiveness are confounded enough that SET is misleading or discriminatory] -- + The association Benton et al. find is at the student level: individual students who rate _the same instructor_ higher tend to get higher scores. -- + How does that show that SET are valid? It seems to show that they are not reliable! --- + As a practical matter, r=0.32 is very weak for n=188 - if relationship is linear (implicit in r), indicates that whatever SET measures, it accounts for only 10% of the variance of student performance (r² = 0.32² ≈ 0.1) - cf. r=0.8 for course effectiveness & enjoyment, n=1486 -- + The underlying data do not come from a randomized experiment. - no real controls (e.g., pretest); no basis for a statistical conclusion - likely confounding from many factors, including - time of day, which could affect the students' enjoyment and performance, and the instructor's teaching. (Benton claims time of day does not influence SET, but there's evidence it does) - whether the section is 1st, 2nd, or 3rd in the week, which could affect the instructor's level of preparation, energy, & enthusiasm - differences among the students in the sections --- ### Benton & Cashin: statistical abuses - Pearson r measures only linear association; the data here are ordinal categorical. Significance levels seem to have been computed (incorrectly) using normal theory. - statistical significance is generally not meaningful absent randomization, and they have none - lack of statistical significance does not imply lack of practical significance - null hypothesis of *no association* between SET & learning is silly: the question is whether SET primarily measure effectiveness or are strongly influenced by other factors - ignore sample sizes and lack of randomization in arguments from lack of statistical significance - I haven't read all their references (yet), but none appears to use randomized assignments: uncontrolled confounding - they report things such as the average r across a variety of observational studies: meaningless --- ### Even Benton & Cashin concede: .framed.looser[ .blue[Writers on faculty evaluation are almost universal in recommending the use of multiple sources of data. No single source of information — including student ratings — provides sufficient information to make a valid judgment about an instructor’s overall teaching effectiveness. Further, there are important aspects of teaching that students are not competent to rate. ]] --- ### SET do not measure teaching effectiveness + Calling something "teaching effectiveness" does not make it so + Averaging Likert ratings is silly: scale isn't proportional + Computing averages to 2 decimal places doesn't make the averages meaningful or comparable + Courses are largely incommensurable: comparing averages across sizes, levels, disciplines, types of course, electives v. requirements, etc., is silly --- ## Response rates + Sample cannot be treated as random. Cannot extrapolate beyond the sample. Margin of error meaningless. + Suppose 70% of the class respond, with an average of 4 on a 7-point scale. Class average could be anywhere between 3.1 & 4.9 (0.7 × 4 + 0.3 × 1 = 3.1; 0.7 × 4 + 0.3 × 7 = 4.9) + Who responds?  --- ### What might we be able to discover about teaching? .looser[ + Is this a dedicated teacher? + Is she engaged in her teaching? 
+ Is she following pedagogical practices found to work in the discipline? + Is she available to students? + Is she putting in appropriate effort? Is she creating new materials, new courses, or new pedagogical approaches? + Is she revising, refreshing, and reworking existing courses using feedback and on-going experiment? + Is she helping keep the department's curriculum up to date? + Is she trying to improve? + Is she improving? + Is she contributing to the college's teaching mission in a serious way? + Is she supervising undergraduates for research, internships, and honors theses? + Is she advising and mentoring students? + Do her students do well when they graduate? ] --- ### Peer observation In 2013, UC Berkeley Department of Statistics adopted as standard practice a more holistic assessment of teaching. Candidates prepare a teaching portfolio, including teaching statement, syllabi, notes, websites, assignments, exams, videos, statements on mentoring, & anything else the candidate wants to include. Dept. chair & promotion committee assess the portfolio. At least before every "milestone" review (mid-career, tenure, full, step VI), a faculty member watches at least one of the candidate's lectures. Complements the portfolio & student comments. Distributions of SET scores are reported, along with response rates. Averages of scores are not reported. Themes of comments are summarized. --- ### How hard/expensive is it? .blue[Classroom observation took the reviewer about four hours, including the observation time itself.] Process included conversations between the candidate and the observer, opportunity for the candidate to respond to the written comments, & provision for a "no-fault do-over." .blue[If done for every milestone review, would be ≈16h over a 40-year _career_: de minimis.] Candidates & reviewer reported that the process was valuable and interesting. Based on that experience, the dean recommended peer observation prior to milestone reviews; the next dean reversed that decision. Room for improvement: Observing more than one class session and more than one course would be better. Adding informal classroom observation and discussion between reviews would be better. Periodic surveys of former students, advisees, and teaching assistants would be useful. But this still improves on using SET alone. --- ### Example letter for a strong teacher (amalgam of real letters) Smith is, by all accounts, an excellent teacher, as confirmed by the classroom observations of Prof. Jones, who calls out Smith's ability to explain key concepts in a broad variety of ways, to hold the attention of the class throughout a 90-minute session, to use both the board and slides effectively, and to engage a large class in discussion. Prof. Jones's peer observation report is included in the case materials; conversations with Jones confirm that the report is Jones's candid opinion: Jones was impressed, and commented in particular on Smith's rapport with the class, Smith's sensitivity to the mood in the room and whether students were following the presentation, Smith's facility in blending derivations on the board with projected computer simulations to illustrate the mathematics, and Smith's ability to construct alternative explanations and illustrations of difficult concepts when students did not follow the first exposition. 
While interpreting "effectiveness" scores is problematic, Smith's teaching evaluation scores are consistently high: in courses with a response rate of 80% or above, less than 1% of students rate Smith below a 6. Smith's classroom skills are evidenced by student comments in teaching evaluations and by the teaching materials in her portfolio. --- #### letter (contd) Examples of comments on Smith's teaching include: > I was dreading taking a statistics course, but after this class, I decided to major in statistics. > the best I've ever met … hands down best teacher I've had in 10 years of university education > overall amazing … she is the best teacher I have ever had > absolutely love it > loves to teach, humble, always helpful > extremely clear … amazing professor > awesome, clear > highly recommended > just an amazing lecturer > great teacher … best instructor to date > inspiring and an excellent role model > the professor is GREAT Critical student comments primarily concerned the difficulty of the material or the homework. None of the critical comments reflected on the pedagogy or teaching effectiveness, only the workload. --- #### letter (contd) I reviewed Smith's syllabus, assignments, exams, lecture notes, and other materials for Statistics X (a prerequisite for many majors), Y (a seminar course she developed), Z (a graduate course she developed for the revised MA program, which she has spearheaded), and Q (a topics course in her research area). They are very high quality and clearly the result of considerable thought and effort. In particular, Smith devoted an enormous amount of time to developing online materials for X over the last five years. The materials required designing and creating a substantial amount of supporting technology, representing at least 500 hours per year of effort to build and maintain. The undertaking is highly creative and advanced the state of the art. Not only are those online materials superb, they are having an impact on pedagogy elsewhere: a Google search shows over 1,200 links to those materials, of which more than half are from other countries. I am quite impressed with the pedagogy, novelty, and functionality. I have a few minor suggestions about the content, which I will discuss with Smith, but those are a matter of taste, not of correctness. The materials for X and Y are extremely polished. Notably, Smith assigned a term project in an introductory course, harnessing the power of inquiry-based learning. I reviewed a handful of the term projects, which were ambitious and impressive. The materials for Z and Q are also well organized and interesting, and demand an impressively high level of performance from the students. The materials for Q include a great selection of data sets and computational examples that are documented well. Overall, the materials are exemplary; I would estimate that they represent well over 1,500 hours of development during the review period. --- #### letter (contd) Smith's lectures in X were webcast in fall 2013. I watched portions of a dozen of Smith's recorded lectures for X—a course I have taught many times. Smith's lectures are excellent: clear, correct, engaging, interactive, well paced, and with well organized and legible boardwork. Smith does an admirable job keeping the students involved in discussion, even in large (300+ student) lectures. Smith is particularly good at keeping the students thinking during the lecture and at inviting questions and comments. 
Smith responds generously and sensitively to questions, and is tuned in well to the mood of the class. Notably, some of Smith's lecture videos have been viewed nearly 300,000 times! This is a testament to the quality of Smith's pedagogy and reach. Moreover, these recorded lectures increase the visibility of the Department and the University, and have garnered unsolicited effusive thanks and praise from across the world. Conversations with teaching assistants indicate that Smith spent a considerable amount of time mentoring them, including weekly meetings and observing their classes several times each semester. She also played a leading role in revising the PhD curriculum in the department. --- #### letter (contd) Smith participated in two campus-wide seminars on improving teaching during the review period, and led a breakout session on working with GSIs. Smith also taught the GSI pedagogy course last semester. Smith has been quite active as an advisor to graduate students. In addition to serving as a member of sixteen exam committees and more than a dozen MA and PhD committees, she advised three PhD recipients (all of whom got jobs in top-ten departments), co-advised two others, and is currently advising three more. Smith advised two MA recipients who went to jobs in industry, co-advised another who went to a job in government, advised one who changed advisors. Smith is currently advising a fifth. Smith supervised three undergraduate honors theses and two undergraduate internships during the review period. This is an exceptionally strong record of teaching and mentoring for an assistant professor. Smith's teaching greatly exceeds expectations. --- ### Example letter for a less engaged teacher During the review period, Smythe taught the standard departmental load of three courses per year, a total of nine courses. Smythe has taught these courses for the last 15 years. The course materials were last updated in 2003. Student comments suggest that the examples could stand refreshing. Students reported having trouble reading Smythe's handwritten transparencies; students have made similar comments for the last 15 years. Students also reported that Smythe cancelled class on several occasions and did not schedule office hours. Smythe did not serve on any PhD oral exam committees or thesis committees during the review period, nor did Smythe supervise any graduate students or undergraduate research students. --- ### How can students help evaluate teaching? Report their own experience of the class. + Did you enjoy the class? + Did you find the class easy or difficult? Interesting or boring? + Did you leave it more enthusiastic or less enthusiastic about the subject matter? + If this was an elective: - did you plan to take a sequel before taking this course? - do you now plan to take a sequel course? + Could you hear the instructor during lectures? + Was the instructor’s handwriting legible? + What was your favorite part of the course? + What was your least favorite part of the course? + Would you recommend this course to other students? --- ### Whom to ask for what + To know whether the teaching is good, look at the teaching. - Subcontracting to students doesn't work. Sad, but true. - Doing better takes some effort. But if we really value teaching, we can put a little effort into evaluating and improving teaching. + Ask the students to find out _about their experience_. + Student comments can be valuable, but interpreting them is knotty. 
+ Be wary of extrapolation, especially if response rates are low. --- ### Are there economic reasons to use SET? + Generally believed that cost of better evaluations is high - it's higher, but it's not high in absolute terms - if we actually care about teaching, there's no option - but the cost falls on faculty and administrators + Who is harmed? - female instructors, typically - students, who otherwise might get better instruction (albeit perhaps instruction they don't like as much) - other good instructors who get bad ratings + Who benefits? - low cost to the institution - the fact that it's numerical gives an air of objectivity - some faculty feel "safe" with SET (the devil you know) --- ### What is the goal? .framed.looser[ .blue[ #### Are we trying to assess and improve teaching or market a service? #### Which do we care about more: learning or customer satisfaction?] ] --- ### Summary of statistical errors behind the policy + naming something "effectiveness" does not make it that: measurement bias + failure to heed randomized, controlled experiments + reliance on observational studies with large potential confounding + misinterpretation of "lack of significance" + misuse of the correlation coefficient; misuse of hypothesis tests + mistreatment of ordinal categorical data + failure to account for data variability + failure to account for nonresponse bias; treating samples of convenience as if they were random samples + comparing incommensurables + irrational belief in quantification --- ### Policy implications 1. SET should not be used as a measure of teaching effectiveness. 1. Well-designed, randomized experiments show that SET have gender biases. Continuing to rely on SET for hiring and promotions invites lawsuits. 1. .blue[Predicted impact of relying on SET for hiring, promotion, etc.:] Discrimination against women, rewarding less effective teaching, punishing more effective teaching 1. To know whether a teacher is good, look at the teaching. Don't subcontract evaluation to students. This will cost more, but if we are serious about teaching, we should stop relying on SET as a proxy for effectiveness. --- ### Recommendations 1. Drop omnibus items about "overall teaching effectiveness" and "value of the course" 1. Do not average or compare averages of SET scores: Such averages do not make sense statistically. Instead, report the distribution of scores, the number of responders, and the response rate. 1. Responders are not a random sample and there's no reason their responses should be representative of the class as a whole: do not extrapolate. 1. Pay attention to student comments but understand their limitations and heed differences in language usage. 1. Avoid comparing teaching effectiveness across courses of different types, levels, sizes, functions, or disciplines. 1. Use teaching portfolios as part of the review process. 1. Use classroom observation as part of milestone reviews. 1. To improve teaching and evaluate teaching fairly and honestly, spend time observing teaching & teaching materials. --- ### Meta-message + It's easy to think we're being objective and rational when we base our decisions on data and numbers. -- + But if the data are subjective (or low quality) or the numbers are not well connected to the goal, it's irrational to rely on them: they are unfit for the purpose. -- + It may be far better to use a qualitative approach involving careful observation and judgement. -- .framed.looser[ .red[ + Not all evidence is numerical. + Not all numbers are evidence. 
+ Beware quantifauxcation! ] ]
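---

### Postscript: a sketch of nonresponse bounds

A minimal illustration, with hypothetical numbers, of the response-rate point made on the earlier slide (70% response with an observed average of 4 on a 7-point scale leaves the class-wide average anywhere between 3.1 and 4.9). The helper function below is purely illustrative, not part of any standard package.

```python
# Hypothetical sketch: if responders are not a random sample, the full-class
# average is only known to lie in an interval that widens as the response rate falls.
def class_average_bounds(observed_mean, response_rate, lo=1, hi=7):
    """Extreme bounds on the full-class mean, assuming every nonresponder
    would have rated at the bottom (lo) or at the top (hi) of the scale."""
    worst = response_rate * observed_mean + (1 - response_rate) * lo
    best = response_rate * observed_mean + (1 - response_rate) * hi
    return round(worst, 2), round(best, 2)

print(class_average_bounds(4.0, 0.70))  # (3.1, 4.9): the example from the talk
print(class_average_bounds(4.0, 0.40))  # lower response rate, wider interval
```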