class: blueBack ## Teaching Evaluations (Mostly) Do Not Measure Teaching Effectiveness ### Department of Applied Mathematics and Statistics
University of California, Santa Cruz
1 February 2016 #### Philip B. Stark
Department of Statistics
University of California, Berkeley
http://www.stat.berkeley.edu/~stark | [@philipbstark](https://twitter.com/philipbstark) #### With Anne Boring (SciencesPo), Kellie Ottoboni (UCB), Richard Freishtat (UCB) --- .center[***The truth will set you free, but first it will piss you off.***] .align-right[—Gloria Steinem]
+ Stark, P.B., & R. Freishtat, 2014. [An Evaluation of Course Evaluations](https://www.scienceopen.com/document/vid/42e6aae5-246b-4900-8015-dc99b467b6e4), [ScienceOpen](https://www.scienceopen.com), DOI: 10.14293/S2199-1006.1.SOR-EDU.AOFRQA.v1 + Boring, A., K. Ottoboni, & P.B. Stark, 2016. [Teaching Evaluations (Mostly) Do Not Measure Teaching Effectiveness](https://www.scienceopen.com/document/vid/818d8ec0-5908-47d8-86b4-5dc38f04b23e), [ScienceOpen](https://www.scienceopen.com), DOI: 10.14293/S2199-1006.1.SOR-EDU.AETBZC.v1 --- ### Abstract Student evaluations of teaching (SET) are widely used in academic personnel decisions as a measure of teaching effectiveness. We show: + SET are biased against female instructors by an amount that is large and statistically significant + the bias affects how students rate even putatively objective aspects of teaching, such as how promptly assignments are graded + the bias varies by discipline and by student gender, among other things + it is not possible to adjust for the bias, because it depends on so many factors + SET are more sensitive to students’ gender bias and grade expectations than they are to teaching effectiveness + gender biases can be large enough to cause more effective instructors to get lower SET than less effective instructors These findings are based on nonparametric statistical tests applied to two datasets: 23,001 SET of 379 instructors by 4,423 students in six mandatory first-year courses in a five-year natural experiment at a French university, and 43 SET for four sections of an online course in a randomized, controlled, blind experiment at a US university. --- ## Student Evaluations of Teaching (SET) + most common method to evaluate teaching + used for hiring, firing, tenure, promotion + simple, cheap, fast to administer --- ## Truthful or Truthy? [Truthiness (Colbert Report: The Word)](http://thecolbertreport.cc.com/videos/63ite2/the-word---truthiness)
"Feels right" that students can rate teaching. --- ### Typical SET items: -- - Considering the limitations & possibilities of the subject matter & the course, how would you rate the overall effectiveness of this instructor? -- - Considering the limitations & possibilities of the subject matter, how effective was the course? -- - Instructor: presentation, explanations, availability, feedback -- - Course: organization, developed skills, developed critical thinking -- - .blue[On average, how many hours per week did you spend on the course?] -- ≈40% of students report spending >20 h/w on every course --- ## Dr. Fox Lecture
Does the survey q. "how effective was the instructor?" measure effectiveness? --- #### Rutgers law dean asks students not to comment on women faculty dress ([IHE](https://www.insidehighered.com/news/2015/01/29/rutgers-camden-law-dean-asks-students-stop-talking-about-women-professors-attire))  --- #### Adjectival frequency by gender on RateMyProfessor ([Ben Schmidt](http://benschmidt.org/profGender/#%7B%22database%22%3A%22RMP%22%2C%22plotType%22%3A%22pointchart%22%2C%22method%22%3A%22return_json%22%2C%22search_limits%22%3A%7B%22word%22%3A%5B%22cold%22%5D%2C%22department__id%22%3A%7B%22%24lte%22%3A25%7D%7D%2C%22aesthetic%22%3A%7B%22x%22%3A%22WordsPerMillion%22%2C%22y%22%3A%22department%22%2C%22color%22%3A%22gender%22%7D%2C%22counttype%22%3A%5B%22WordsPerMillion%22%5D%2C%22groups%22%3A%5B%22department%22%2C%22gender%22%5D%2C%22testGroup%22%3A%22A%22%7D))  --- ## What's effective teaching? -- + Should facilitate learning -- + Grades usually not a good proxy for learning -- + Students generally can't judge how much they learned -- + Serious problems with confounding --  https://xkcd.com/552/ -- + .blue[Need controlled, randomized experiments] --- ## Reliability - Do different students rate the same instructor similarly? - Would a student rate the same instructor consistently later? -- Unrelated to effectiveness. -- 100 scales might all report your weight to be exactly the same. -- .blue[Doesn't mean they measured your _height_ accurately.] -- .blue[(Or your weight.)] -- ## Validity Do SET primarily measure teaching effectiveness? -- Are they biased by, e.g., gender, attractiveness, ethnicity, math content, rigor, time of day, class size? --- ### Crunching the numbers + Averages make sense only if the scale is _proportional_ -- + Is the difference between 1 & 2 the same as the difference between 5 & 6? -- + Does a 1 balance a 7 to make two 4s? -- + Does a 3 mean the same thing to every student, in every class—even approximately? -- + Does a 5 in an upper-division elective architecture studio mean the same as a 5 in a required freshman Econ course with 500 students? -- .red[Averaging SET scores doesn't make sense. Comparing average SET scores across courses, instructors, levels, types of classes, and disciplines doesn't make sense. ] --- ### The importance of variability -- Three statisticians go deer hunting. -- The first shoots and misses a meter to the left. -- The second shoots and misses a meter to the right. -- .blue[The third yells **"We got it!"**] -- + .blue[Things that are equal _on average_ are not necessarily similar.] + Averages throw away valuable information about variability: - (1+1+7+7)/4 = (2+3+5+6)/4 = (1+5+5+5)/4 = (4+4+4+4)/4 = 4 - Polarizing teacher not same as teacher w/ mediocre ratings --- ## What do SET measure? No consensus. - .blue[SET scores are highly correlated with students' grade expectations]
Boring et al., 2016; Johnson, 2003; Marsh & Cooper, 1980; Short et al., 2012; Worthington, 2002 -- - .blue[SET scores & enjoyment scores _very_ strongly correlated]
Stark, unpublished, 2014 -- - .blue[SET can be predicted from the students' reaction to 30 seconds of silent video of the instructor; physical attractiveness matters]
Ambady & Rosenthal, 1993 -- - .blue[instructor gender, ethnicity, & age matter]
Anderson & Miller, 1997; Arbuckle & Williams, 2003; Basow, 1995; Boring, 2014; Boring et al., 2016; Cramer & Alexitch, 2000; Marsh & Dunkin, 1992; MacNell et al., 2014; Wachtel, 1998; Weinberg et al., 2007; Worthington, 2002 -- - .blue[omnibus questions about curriculum design, effectiveness, etc. appear most influenced by factors unrelated to learning]
Worthington, 2002 --- ## Gold standard Three randomized, controlled experiments: .framed.looser[ .blue[ + The US Air Force Academy: Carrell & West, 2008 + Bocconi University, Milan: Braga, Paccagnella, & Pellizzari, 2011 + NC State online course: MacNell, Driscoll, & Hunt, 2014 ]] Also "natural experiment": .framed.looser[ .blue[ + SciencesPo: Boring, 2014 ]] --- ### Carrell & West, 2008 USAF Academy assigns students to instructors at random in core courses, including follow-on courses. Sections have identical syllabi and exams. .framed[.blue[ > professors who excel at promoting contemporaneous student achievement teach in ways that improve their student evaluations but harm the follow-on achievement of their students in more advanced classes. … > students appear to reward higher grades in the introductory course but punish professors who increase deep learning (introductory course professor value-added in follow-on courses). ]] --- ### Braga, Paccagnella, & Pellizzari, 2011 Randomized assignment of students to instructors at Bocconi University, Milan, fixed syllabus: > .framed.looser[ .blue[teacher effectiveness is negatively correlated with the students' evaluations of professors: in other words, teachers who are associated with better subsequent performance receive worst evaluations from their students.] ] --- .left-column[ MacNell, Driscoll & Hunt, 2014 NC State online course. Students randomized into 6 groups, 2 taught by primary prof, 4 by GSIs. 2 GSIs: 1 male 1 female. GSIs identified to students by actual gender in 1 section, false gender in 1 section. 5-point scale. ] .right-column[ .small[
| Characteristic | M - F | perm \\(P\\) | t-test \\(P\\) |
|---|---|---|---|
| Overall | 0.47 | 0.12 | 0.128 |
| Professional | 0.61 | 0.07 | 0.124 |
| Respectful | 0.61 | 0.06 | 0.124 |
| Caring | 0.52 | 0.10 | 0.071 |
| Enthusiastic | 0.57 | 0.06 | 0.112 |
| Communicate | 0.57 | 0.07 | NA |
| Helpful | 0.46 | 0.17 | 0.049 |
| Feedback | 0.47 | 0.16 | 0.054 |
| Prompt | 0.80 | 0.01 | 0.191 |
| Consistent | 0.46 | 0.21 | 0.045 |
| Fair | 0.76 | 0.01 | 0.188 |
| Responsive | 0.22 | 0.48 | 0.013 |
| Praise | 0.67 | 0.01 | 0.153 |
| Knowledge | 0.35 | 0.29 | 0.038 |
| Clear | 0.41 | 0.29 | NA |
] ] --- ### Omnibus tests 99% confidence intervals for \\(P\\) 1. perceived instructor gender plays no role \\([0.0, 5.3\times 10^{-5}]\\) 2. male students rate perceived male and female instructors the same \\([0.460, 0.468]\\) 3. female students rate perceived male and female instructors the same \\([0.0, 5.3\times 10^{-5}]\\) --- ### Exam performance and instructor gender Mean grade and instructor gender (male minus female)
| Instructor gender | difference in means | \\(P\\)-value |
|---|---|---|
| Perceived | 1.76 | 0.54 |
| Actual | -6.81 | 0.02 |
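---

#### Sketch: a two-sample permutation \\(P\\)-value

These \\(P\\)-values come from permutation tests rather than \\(t\\)-tests (see "Whence the \\(P\\)-values?" later). As a minimal illustration only, and not the paper's analysis, which conditions on the full randomization described later, here is a plain two-sample permutation test for a difference in means; the grades below are made-up toy data.

```python
import numpy as np

def perm_pvalue_diff_means(x, y, reps=10_000, seed=0):
    """Two-sided permutation P-value for the difference in means of two groups:
    pool the observations, relabel them at random many times, and count how
    often the relabeled |difference in means| is at least as large as observed."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    observed = abs(np.mean(x) - np.mean(y))
    count = 0
    for _ in range(reps):
        rng.shuffle(pooled)   # relabel the groups by shuffling the pooled data
        count += abs(pooled[:len(x)].mean() - pooled[len(x):].mean()) >= observed
    return count / reps

# Hypothetical toy data, not from the study:
grades_perceived_male = np.array([83.0, 78.0, 90.0, 72.0, 85.0, 79.0])
grades_perceived_female = np.array([88.0, 91.0, 84.0, 95.0, 80.0, 86.0])
print(perm_pvalue_diff_means(grades_perceived_male, grades_perceived_female))
```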
--- ### Boring et al., 2016. SciencesPo data 5 years of data for 6 mandatory freshman classes: History, Political Institutions, Microeconomics, Macroeconomics, Political Science, Sociology + SET mandatory; response rate nearly 100% + anonymous final exams except PI + interim grades before final + 23,001 SET + 379 instructors + 4,423 students + 1,194 sections (950 without PI) + 21 year-by-course strata --- ### Test statistics Correlation between SET and gender within each stratum, averaged across strata. Correlation between SET and average final exam score within each stratum, averaged across strata. --- ### SciencesPo data Average correlation between SET and final exam score
| | strata | \\(\bar{\rho}\\) | \\(P\\) |
|---|---|---|---|
| Overall | 26 (21) | 0.04 | 0.09 |
| History | 5 | 0.16 | 0.01 |
| Political Institutions | 5 | N/A | N/A |
| Macroeconomics | 5 | 0.06 | 0.19 |
| Microeconomics | 5 | -0.01 | 0.55 |
| Political science | 3 | -0.03 | 0.62 |
| Sociology | 3 | -0.02 | 0.61 |
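---

#### Sketch: the stratum-averaged correlation statistic

The test statistic behind this and the following tables is the correlation between SET and the variable of interest (final exam score, instructor gender, …) within each year-by-course stratum, averaged across strata; its null distribution comes from permuting one variable independently within strata. A minimal Python sketch (the function names are hypothetical; this is neither the paper's analysis code nor the permute library's API, and it assumes both variables vary within every stratum):

```python
import numpy as np

def mean_stratum_corr(x, y, strata):
    """Pearson correlation of x and y within each stratum, averaged across strata."""
    return np.mean([np.corrcoef(x[strata == s], y[strata == s])[0, 1]
                    for s in np.unique(strata)])

def stratified_perm_pvalue(x, y, strata, reps=10_000, seed=0):
    """Permute y independently within each stratum and count how often the
    stratum-averaged correlation is at least as large as observed (one-sided)."""
    rng = np.random.default_rng(seed)
    observed = mean_stratum_corr(x, y, strata)
    y_perm = y.copy()
    count = 0
    for _ in range(reps):
        for s in np.unique(strata):
            idx = np.flatnonzero(strata == s)
            y_perm[idx] = rng.permutation(y[idx])   # break the pairing within the stratum
        count += mean_stratum_corr(x, y_perm, strata) >= observed
    return count / reps
```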
--- Average correlation between SET and instructor gender
| | \\(\bar{\rho}\\) | \\(P\\) |
|---|---|---|
| Overall | 0.09 | 0.00 |
| History | 0.11 | 0.08 |
| Political institutions | 0.11 | 0.10 |
| Macroeconomics | 0.10 | 0.16 |
| Microeconomics | 0.09 | 0.16 |
| Political science | 0.04 | 0.63 |
| Sociology | 0.08 | 0.34 |
--- Average correlation between final exam scores and instructor gender
| | \\(\bar{\rho}\\) | \\(P\\) |
|---|---|---|
| Overall | -0.06 | 0.07 |
| History | -0.08 | 0.22 |
| Macroeconomics | -0.06 | 0.37 |
| Microeconomics | -0.06 | 0.37 |
| Political science | -0.03 | 0.70 |
| Sociology | -0.05 | 0.55 |
--- Average correlation between SET and gender concordance
| | Male students: \\(\bar{\rho}\\) | \\(P\\) | Female students: \\(\bar{\rho}\\) | \\(P\\) |
|---|---|---|---|---|
| Overall | 0.15 | 0.00 | 0.05 | 0.09 |
| History | 0.17 | 0.01 | -0.03 | 0.60 |
| Political institutions | 0.12 | 0.08 | -0.11 | 0.12 |
| Macroeconomics | 0.14 | 0.04 | -0.05 | 0.49 |
| Microeconomics | 0.18 | 0.01 | -0.00 | 0.97 |
| Political science | 0.17 | 0.06 | 0.04 | 0.64 |
| Sociology | 0.12 | 0.16 | -0.03 | 0.76 |
--- Average correlation between student performance and gender concordance
| | Male students: \\(\bar{\rho}\\) | \\(P\\) | Female students: \\(\bar{\rho}\\) | \\(P\\) |
|---|---|---|---|---|
| Overall | -0.01 | 0.75 | 0.06 | 0.07 |
| History | -0.15 | 0.03 | -0.02 | 0.74 |
| Macroeconomics | 0.04 | 0.60 | 0.11 | 0.10 |
| Microeconomics | 0.02 | 0.80 | 0.07 | 0.29 |
| Political science | 0.08 | 0.37 | 0.11 | 0.23 |
| Sociology | 0.01 | 0.94 | 0.06 | 0.47 |
--- Average correlation between SET and interim grades
| | \\(\bar{\rho}\\) | \\(P\\) |
|---|---|---|
| Overall | 0.16 | 0.00 |
| History | 0.32 | 0.00 |
| Political institutions | -0.02 | 0.61 |
| Macroeconomics | 0.15 | 0.01 |
| Microeconomics | 0.13 | 0.03 |
| Political science | 0.17 | 0.02 |
| Sociology | 0.24 | 0.00 |
--- ### Main conclusions; multiplicity 1. lack of association between SET and final exam scores (negative result, so multiplicity not an issue) 1. lack of association between instructor gender and final exam scores (negative result, so multiplicity not an issue) 1. association between SET and instructor gender 1. association between SET and interim grades Bonferroni's adjustment for four tests leaves the associations highly significant: adjusted \\(P < 0.01\\). --- ### Whence the \\(P\\)-values? Why not t-test, ANOVA, regression, etc.? -- #### Permutation tests + Under the null hypothesis, the probability distribution of the data is invariant with respect to the action of some group. -- + Condition on the orbit of the data under the action of the group (or a subgroup). -- + Every element of that orbit is equally likely. -- No assumptions about populations, data distributions, etc. --- ### Examples + 2-sample problem + spherical symmetry + association + randomized experiments under "strong null" --- ### Conditional testing Common for permutation tests to condition on some aspect of observed data. Let \\(\mathcal{F} = \\{ F\_j \\}\\) be a countable, measurable partition of the outcome space \\( \mathcal{X} \\). Suppose null is true, and that we test conditionally at level \\(\alpha\\), no matter which \\(F\_j\\) occurs. Then $$ \Pr \\{\mbox{ reject } \\} = \sum\_j \Pr \\{\mbox{reject} | X \in F\_j\\} \Pr \\{ X \in F\_j\\} $$ $$ \le \sum\_j \alpha \times \Pr \\{X \in F\_j \\} $$ $$ = \alpha. $$ --- ### Neyman model generalized for MacNell et al. data Student \\(i\\) represented by ticket w 4 numbers, response to each "treatment." $$ r_{ijk} = \mbox{ SET given by student } i \mbox{ to instructor } j \mbox{ when appears to have gender } k. $$ $$ i = 1, \ldots, N; \;\;\; j = 1, 2; \;\;\; k \in \\{ \mbox{male}, \mbox{female} \\} $$ -- Numbers fixed before randomization into 1 of 4 treatments. -- Randomization reveals 1 of the numbers. -- If gender doesn't matter, $$ r\_{ij\mbox{ male}} = r\_{ij\mbox{ female}}. $$ --- ### Randomization All $$ {{N}\choose{N\_1 N\_2 N\_3 N\_4}} = \frac{N!}{N\_1! N\_2! N\_3! N\_4!} $$ possible assignments of \\(N\_1\\) students to TA 1 identified as male, \\(N\_2\\) students to TA 1 identified as female, etc., were equally likely. -- Hence, so are the $$ {{N\_1 + N\_2} \choose {N\_1}} \times {{N\_3+N\_4} \choose {N\_3}} $$ assignments that keep the same \\(N_1 + N_2\\) students in TA 1's sections & the same \\(N_3 + N_4\\) students in TA 2's sections. For those, we know what the data would have been, if the null hypothesis were true. Condition on assignments to prof & 2 TAs and on nonresponders. -- .blue[Determines (conditional) null distribution of any test statistic] Conditional \\(P\\)-value is fraction of assignments with at least as large test statistic. --- ### Finding \\(P\\) In principle, enumerate all assignments and calculate test statistic for each. But $$ {{23}\choose{11}}{{24}\choose{11}} > 3.3\times 10^{12} $$ possible assignments of 47 students to the 4 TA-led sections that keep constant which students get which TA. Use \\(10^5\\) random assignments. \\(\mbox{SE}(\hat{P}) \le (1/2)/ \sqrt{10^5} \approx 0.0016\\) Can also find confidence intervals for \\(P\\) by inverting binomial tests. --- ### French data If instructor characteristics or grade expectations are unrelated to SET, every pairing within a stratum is equally likely. Condition on all numerical values, but not the pairings. Strata (course by year) assumed to be independent. 
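---

#### Sketch: simulating the conditional null and a confidence interval for \\(P\\)

To make the simulation in "Finding \\(P\\)" concrete, here is a minimal Python sketch (assumptions: it is not the code in the published notebook, and the arrays `ratings`, `ta`, and `perceived_male` are hypothetical stand-ins for the experimental data). It re-randomizes the perceived-gender label within each TA's pool of students, estimates \\(\hat{P}\\) as the fraction of assignments whose test statistic is at least as large as the observed one, and inverts binomial tests (Clopper-Pearson) to get a confidence interval for \\(P\\).

```python
import numpy as np
from scipy.stats import beta

def conditional_perm_count(ratings, ta, perceived_male, reps=100_000, seed=0):
    """Count how many of `reps` random re-assignments of perceived gender,
    drawn separately within each TA's pool of students, give a
    male-minus-female difference in mean ratings at least as large as observed.
    `perceived_male` is a boolean array; `ta` identifies each student's TA."""
    rng = np.random.default_rng(seed)

    def diff(labels):  # perceived-male mean minus perceived-female mean
        return ratings[labels].mean() - ratings[~labels].mean()

    observed = diff(perceived_male)
    count = 0
    for _ in range(reps):
        labels = np.zeros_like(perceived_male)
        for t in np.unique(ta):
            idx = np.flatnonzero(ta == t)
            labels[idx] = rng.permutation(perceived_male[idx])  # same number "male" per TA
        count += diff(labels) >= observed
    return count, reps

def clopper_pearson(k, n, conf=0.99):
    """Confidence interval for a binomial proportion, by inverting binomial tests."""
    alpha = 1 - conf
    lo = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    hi = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lo, hi

# k, n = conditional_perm_count(ratings, ta, perceived_male)
# p_hat = k / n          # SE(p_hat) <= 0.5 / sqrt(n), about 0.0016 for n = 1e5
# print(p_hat, clopper_pearson(k, n, conf=0.99))
```

The [permute](http://statlab.github.io/permute/) library described on the following slides packages tests like these; the sketch above only illustrates the idea.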
--- ### Omnibus alternatives: nonparametric combination of tests Combine test statistics across strata or across variables for a test of the conjunction of hypotheses. Fisher's combining function: $$ X^2 = -2 \sum\_{j=1}^J \ln P\_j. $$ -- \\( X^2 \sim \chi^2\_{2J} \\) if \\(\\{ P\_j \\}\_{j=1}^J\\) are independent & all nulls are true -- But here, tests are dependent: calibrate the combination by simulation. -- + Each replication is a Bernoulli(\\(P\\)) trial; trials are independent. + The number of replications with test statistic ≥ observed is binomial. + Invert binomial hypothesis tests to find confidence bounds for \\(P\\). --- ## [Permute](http://statlab.github.io/permute/) library Millman, Ottoboni, Stark, van der Walt, 2016 [Source code on GitHub](https://github.com/statlab/permute) [Installable from PyPI](https://pypi.python.org/pypi/permute/) --- ### Algorithmic & design considerations + Reproducibility + Appropriate level of abstraction - tests, test statistics - effect sizes & confidence bounds in \\(k\\)-sample problem + Utilities & building blocks to craft "permutation tests for all occasions" + Stratification supported for all tests + Conditioned permutations (e.g., tables w/ fixed margins) + Memory & CPU time - efficient permutation in place (numpy.random.permutation() v numpy.random.shuffle()) - nonparametric combination of tests methodology: "aligned" permutations - Cython & C
| "not fair" means … | student % | instructor % |
|---|---|---|
| plays favorites | 45.8 | 31.7 |
| grading problematic | 2.3 | 49.2 |
| work is too hard | 12.7 | 0 |
| won't "work with you" on problems | 12.3 | 0 |
| other | 6.9 | 19 |
--- ### Contrary views .framed[ > I felt compelled to write this blog after reading Philip Stark and Richard Freishtat’s opening salvo from their article, "An Evaluation of Course Evaluations” recently summarized in the Chronicle of Higher Education: >> … it is widely believed that [student ratings of instruction] are primarily a popularity contest; that it’s easy to "game” the ratings; that good teachers get bad ratings and vice versa; and that fear of bad ratings stifles pedagogical innovation and encourages faculty to water down course content (p. 1). > Some people also believe that climate change is a hoax. But does thinking make it so? > Were these the only unsubstantiated claims the authors made, I might have been able to resist writing this blog. [Steve Benton's blog at IDEA](http://ideaedu.org/ideablog/2014/09/evaluation-%E2%80%9C-evaluation-course-evaluations%E2%80%9D-part-i), 29 September 2014 ] _The very next sentence_ in our paper is .blue["What is the truth?"] --- ### Who supports SET? .framed[ .blue[ >> It is difficult to get a man to understand something, when his salary depends upon his not understanding it! —Upton Sinclair ] ] -- #### Benton's _job_ at IDEA is to sell SET. --- ### Benton & Cashin, 2012: exemplar SET apologists + Widely cited, but it's a technical report from IDEA, a business that sells SET teaching evaluations. -- + Claims SET are reliable and valid. -- + Does not cite Carrell & West (2008) or Braga et al. (2011), the only two randomized experiments I know of published before B&C (2012) -- + As far as I can tell, no study B&C cite in support of validity used randomization. --- ### Benton & Cashin on validity >> Theoretically, the best indicant of effective teaching is student learning. Other things being equal, the students of more effective teachers should learn more. -- I agree. -- >> A number of studies have attempted to examine this hypothesis by comparing multiple-section courses. For example, Benton and colleagues (Benton, Duchon, & Pallett, 2011) examined student ratings in multiple sections of the same course taught by the same instructor. They correlated student ratings of progress on objectives the instructor identified as relevant to the course (using IDEA student ratings) with their performance on exams tied to those objectives. Student ratings correlated positively with four out of five exams and with the course total points (r = .32). .blue[What's wrong with this argument?] --- #### Again, address a straw man hypothesis: - Who claims SET have **absolutely no connection** to teaching effectiveness? -- - SET are associated with class enjoyment, which may affect engagement & learning >UCB Dept. Stat, fall 2012, 1486 students rated instructor overall effectiveness & enjoyment of the course. -- .blue[ > Correlation btw instructor effectiveness & enjoyment: 0.75.
Correlation btw course effectiveness & enjoyment: 0.8. ] -- - .red[The question is not whether there's _any_ association between SET and effectiveness.] .blue[The question is _how well_ SET measure effectiveness, and whether factors unrelated to effectiveness are confounded enough that SET is misleading or discriminatory] -- + The student-level association Benton et al. find means that individual students who rate _the same instructor_ higher get higher scores. -- + How does that show that SET are valid? It seems to show that they are not reliable! --- + \\(r =0.32\\) is very weak for \\(n=188\\) - if relationship is linear, SET accounts for just 10% of the variance of performance - cf. \\(r=0.8\\) for course effectiveness & enjoyment \\(n=1486\\) -- + The underlying data do not come from a randomized experiment. - no real controls (e.g., pretest); no basis for a statistical conclusion - likely confounding from many factors --- ## What's the right question? + Are SET more sensitive to effectiveness or to something else? + Do women and men get comparable SET? + But for their gender, would women get higher SET than they do? (And but for their gender, would men get lower SET than they do?) Need to compare like teaching with like teaching, not an arbitrary collection of women with an arbitrary collection of men. Boring (2014) finds _costs_ of improving SET very different for men and women. --- ## These are not the only biases! + Ethnicity and race + Attractiveness + Accents ... --- ### SET do not measure teaching effectiveness + Calling something "teaching effectiveness" does not make it so + Averaging Likert ratings is silly: scale isn't proportional + Computing averages to 2 decimals doesn't make them meaningful or comparable + Courses are largely incommensurable: comparing averages across sizes, levels, disciplines, types of course, electives v. requirements, etc., is silly --- ## Response rates + Sample cannot be treated as random. Cannot extrapolate beyond the sample. Margin of error meaningless. + Suppose 70% of the class respond, with an average of 4 on a 7-point scale. Class average could be anywhere between 3.1 & 4.9 (if all nonresponders would rate 1, the class average is 0.7 × 4 + 0.3 × 1 = 3.1; if all would rate 7, it is 4.9) + Who responds?  --- ### What might we be able to discover about teaching? .looser[ + Is she dedicated to and engaged in her teaching? + Is she available to students? + Is she putting in appropriate effort? Is she creating new materials, new courses, or new pedagogical approaches? + Is she revising, refreshing, and reworking existing courses using feedback and on-going experiment? + Is she helping keep the department's curriculum up to date? + Is she trying to improve? + Is she contributing to the college’s teaching mission in a serious way? + Is she supervising undergraduates for research, internships, and honors theses? + Is she advising and mentoring students? + Do her students do well when they graduate? ] --- ### Peer observation In 2013, UC Berkeley Department of Statistics adopted as standard practice a more holistic assessment of teaching. Candidates prepare a teaching portfolio, including teaching statement, syllabi, notes, websites, assignments, exams, videos, statements on mentoring, & anything else the candidate wants to include. Dept. chair & promotion committee assess portfolio. At least before every "milestone" review (mid-career, tenure, full, step VI), a faculty member watches at least one of the candidate's lectures. Complements the portfolio. Distributions of SET scores are reported, along with response rates. Average SET not reported. 
Themes of comments are summarized. --- ### How hard/expensive is it? .blue[Classroom observation took the reviewer about four hours, including the observation time itself.] Process included conversations between the candidate and the observer, opportunity for the candidate to respond to the written comments, & provision for a "no-fault do-over." .blue[If done for every milestone review, would be ≈16h over a 40-year _career_: de minimis.] Candidates & reviewer reported that the process was valuable and interesting. Based on that experience, the dean recommended peer observation prior to milestone reviews; the next dean reversed that decision. Room for improvement: Observing more than one class session and more than one course would be better. Adding informal classroom observation and discussion between reviews would be better. Periodic surveys of former students, advisees, and teaching assistants would be useful. But even so, this improves on using SET alone. --- ### Example letter for a strong teacher (amalgam of real letters) Smith is, by all accounts, an excellent teacher, as confirmed by the classroom observations of Prof. Jones, who calls out Smith's ability to explain key concepts in a broad variety of ways, to hold the attention of the class throughout a 90-minute session, to use both the board and slides effectively, and to engage a large class in discussion. Prof. Jones's peer observation report is included in the case materials; conversations with Jones confirm that the report is Jones's candid opinion: Jones was impressed, and commented in particular on Smith's rapport with the class, Smith's sensitivity to the mood in the room and whether students were following the presentation, Smith's facility in blending derivations on the board with projected computer simulations to illustrate the mathematics, and Smith's ability to construct alternative explanations and illustrations of difficult concepts when students did not follow the first exposition. While interpreting "effectiveness" scores is problematic, Smith's teaching evaluation scores are consistently high: in courses with a response rate of 80% or above, less than 1% of students rate Smith below a 6. Smith's classroom skills are evidenced by student comments in teaching evaluations and by the teaching materials in her portfolio. --- #### letter (contd) Examples of comments on Smith's teaching include: > I was dreading taking a statistics course, but after this class, I decided to major in statistics. > the best I've ever met … hands down best teacher I've had in 10 years of university education > overall amazing … she is the best teacher I have ever had > absolutely love it > loves to teach, humble, always helpful > extremely clear … amazing professor > awesome, clear > highly recommended > just an amazing lecturer > great teacher … best instructor to date > inspiring and an excellent role model > the professor is GREAT Critical student comments primarily concerned the difficulty of the material or the homework. None of the critical comments reflected on the pedagogy or teaching effectiveness, only the workload. --- #### letter (contd) I reviewed Smith's syllabus, assignments, exams, lecture notes, and other materials for Statistics X (a prerequisite for many majors), Y (a seminar course she developed), Z (a graduate course she developed for the revised MA program, which she has spearheaded), and Q (a topics course in her research area). They are very high quality and clearly the result of considerable thought and effort. 
In particular, Smith devoted an enormous amount of time to developing online materials for X over the last five years. The materials required designing and creating a substantial amount of supporting technology, representing at least 500 hours per year of effort to build and maintain. The undertaking is highly creative and advanced the state of the art. Not only are those online materials superb, they are having an impact on pedagogy elsewhere: a Google search shows over 1,200 links to those materials, of which more than half are from other countries. I am quite impressed with the pedagogy, novelty, and functionality. I have a few minor suggestions about the content, which I will discuss with Smith, but those are a matter of taste, not of correctness. The materials for X and Y are extremely polished. Notably, Smith assigned a term project in an introductory course, harnessing the power of inquiry-based learning. I reviewed a handful of the term projects, which were ambitious and impressive. The materials for Z and Q are also well organized and interesting, and demand an impressively high level of performance from the students. The materials for Q include a great selection of data sets and computational examples that are documented well. Overall, the materials are exemplary; I would estimate that they represent well over 1,500 hours of development during the review period. --- #### letter (contd) Smith's lectures in X were webcast in fall 2013. I watched portions of a dozen of Smith's recorded lectures for X—a course I have taught many times. Smith's lectures are excellent: clear, correct, engaging, interactive, well paced, and with well organized and legible boardwork. Smith does an admirable job keeping the students involved in discussion, even in large (300+ student) lectures. Smith is particularly good at keeping the students thinking during the lecture and at inviting questions and comments. Smith responds generously and sensitively to questions, and is tuned in well to the mood of the class. Notably, some of Smith's lecture videos have been viewed nearly 300,000 times! This is a testament to the quality of Smith's pedagogy and reach. Moreover, these recorded lectures increase the visibility of the Department and the University, and have garnered unsolicited effusive thanks and praise from across the world. Conversations with teaching assistants indicate that Smith spent a considerable amount of time mentoring them, including weekly meetings and observing their classes several times each semester. She also played a leading role in revising the PhD curriculum in the department. --- #### letter (contd) Smith participated in two campus-wide seminars on improving teaching during the review period, and led a breakout session on working with GSIs. Smith also taught the GSI pedagogy course last semester. Smith has been quite active as an advisor to graduate students. In addition to serving as a member of sixteen exam committees and more than a dozen MA and PhD committees, she advised three PhD recipients (all of whom got jobs in top-ten departments), co-advised two others, and is currently advising three more. Smith advised two MA recipients who went to jobs in industry, co-advised another who went to a job in government, advised one who changed advisors. Smith is currently advising a fifth. Smith supervised three undergraduate honors theses and two undergraduate internships during the review period. 
This is an exceptionally strong record of teaching and mentoring for an assistant professor. Smith's teaching greatly exceeds expectations. --- ### Example letter for a less engaged teacher During the review period, Smythe taught the standard departmental load of three courses per year, a total of nine courses. Smythe has taught these courses for the last 15 years. The course materials were last updated in 2003. Student comments suggest that the examples could stand refreshing. Students reported having trouble reading Smythe's handwritten transparencies; students have made similar comments for the last 15 years. Students also reported that Smythe cancelled class on several occasions with no notice and did not schedule office hours. Smythe did not serve on any PhD oral exam committees or thesis committees during the review period, nor did Smythe supervise any graduate students or undergraduate research students. --- ### How can students help evaluate teaching? Report their own experience of the class. + Did you enjoy the class? + Did you find the class easy or difficult? Interesting or boring? + Did you leave it more enthusiastic or less enthusiastic about the subject matter? + If this was an elective: - did you plan to take a sequel before taking this course? - do you now plan to take a sequel course? + Could you hear the instructor during lectures? + Was the instructor’s handwriting legible? + What was your favorite part of the course? + What was your least favorite part of the course? + Would you recommend this course to other students? --- ### Policy implications 1. Well-designed, randomized experiments show that SET measure gender biases and grade expectations better than they measure teaching effectiveness. 1. .blue[Impact of relying on SET for employment decisions:] Discrimination against women, rewarding less effective teaching, punishing more effective teaching 1. Therefore SET should not be used as a measure of teaching effectiveness and should play no role in employment decisions. 1. To know whether a teacher is good, look at the teaching. Don't subcontract evaluation to students. This will cost more, but if we are serious about teaching, we can't rely on SET as a proxy for effectiveness. 1. Pay attention to student comments but understand their limitations and heed differences in language usage. 1. Responders are not a random sample and there's no reason their responses should be representative of the class as a whole: do not extrapolate. 1. Use teaching portfolios as part of the review process. 1. Use classroom observation as part of milestone reviews. 1. To improve teaching and evaluate teaching fairly and honestly, spend time observing teaching & teaching materials. --- ### A note on open, transparent science MacNell et al. data available at http://n2t.net/ark:/b6078/d1mw2k (Merritt) French law prohibits publishing SciencesPo data  Python code for all analyses is available as a [Jupyter](http://jupyter.org) notebook on [GitHub](https://github.com) at https://github.com/kellieotto/SET-and-Gender-Bias The analysis is in an open language, in an open scientific notebook format, and relies on the open _permute_ library. These slides are in Markdown and MathJax, using remark.js All files are ASCII. --- Paper is in [ScienceOpen](https://www.scienceopen.com) https://www.scienceopen.com/document/vid/818d8ec0-5908-47d8-86b4-5dc38f04b23e (open, post-publication non-anonymous review)  --- ### Meta-message + It's easy to think we're being objective and rational when we base our decisions on data and numbers. 
-- + But if the data are subjective (or low quality) or the numbers are not well connected to the goal, it's irrational to rely on them: they are unfit for the purpose. -- + It may be far better to use a qualitative approach involving careful observation and judgement. -- .framed.looser[ .red[ + Not all evidence is numerical. + Not all numbers are evidence. + Beware quantifauxcation! ] ]