CHAPTER 11: Interpreting Test Results

“Statistics is the science of producing unreliable facts from reliable figures.”
Evan Escar

Test development is a process that continues even after a test is administered. In fact, post hoc test analysis is a crucial aspect of the process. One advantage of selected-response exams is that item and test analysis data can be generated from the test results. Multiple-choice questions are particularly amenable to data analysis. These data reports provide valuable information that assists the teacher in assigning fair scores and improving individual items for future use. Test analysis has three goals: to identify whether any of the questions are flawed, to correct any errors and adjust the raw scores, and to improve the items for future use.

Qualitative and quantitative test reviews are equally important; they complement each other. Once you have the statistical data for a test, you can look at the items from a much more objective viewpoint. Inevitably, you and your colleagues who reviewed the exam before it was administered will be surprised by the results of at least 10% of the new items on a test, even though you followed test development guidelines. Sometimes, student responses to even the most expertly written questions are just unpredictable.

Consider the time involved in item writing and revision as an investment. Multiple-choice items can be analyzed, revised, and banked for future use. These items can be polished over time and adapted for reuse on future tests. In fact, the more you refine your items based on data, the better your tests become. Often, the qualitative student review, as discussed in Chapter 9, “Assembling, Administering, and Scoring a Test,” explains the statistical results of an item and provides suggestions for revising the item for future use. 
Reviewing the quantitative data not only provides an invaluable tool for making objective decisions about individual test items and overall test scores, it also guides you to use your time efficiently to improve your questions and develop a valuable testing resource: an item bank. Keep in mind that the more items you analyze, the more proficient you become at writing and identifying high-quality test items that you can bank and use repeatedly. Therefore, the time you invest is time well spent.

Before the advent of reasonably priced testing software, calculating the statistical results of an exam was a task that was impractical for a classroom teacher. Today, many colleges and universities provide machine scoring with statistical reports of tests and item analyses for multiple-choice classroom exams. The aim of this chapter is to demonstrate just how valuable these data reports are as tools for test interpretation and development; without statistical analysis, you have no assurance that your tests are functioning as you intended.
Overall Test Data Analysis

Most test development software packages provide two levels of test analysis data: the overall analysis of the test and a detailed analysis of each item as it relates to the test as a whole. While your first consideration should be to look at the overall picture, both these data sets are essential for a thorough test analysis. Examining test data is well within your purview once you understand the meaning of each of the values. Remember, you do not have to do any actual calculating. Once you use the data, you will appreciate their value to such an extent that I guarantee you will never again assign grades to a multiple-choice exam without reviewing the statistical analysis.

When a test is scored, the initial result is reported as a raw score, or the number of items that a student answered correctly on the test. Statistical analysis assists you with transforming the raw scores into test grades. Appendix B, “Basic Test Statistics,” provides an overview of the terminology of statistical analysis. Take the time to review some basic statistical references before examining the example of a statistical test report in Table 11.1.

Table 11.1 Sample Test Statistics

Number of items                            100
Number of examinees                        92
Mean                                       75.4
Median                                     77
Low score                                  52
High score                                 93
Alpha                                      0.754
Standard deviation                         7.7
Standard error of measurement (SEM)        3.8
Mean p value                               0.754
Mean point biserial index (PBI)            0.36

Table 11.1 is a sample test analysis report that contains the typical data you would receive from a testing software program. In fact, this report presents more than enough data to help you make informed decisions about test results. Some programs provide even more comprehensive statistics. It is not necessary to make your review too complicated, however; this sample data report is more than sufficient for analysis of a classroom test.

Generally, item statistics for small groups of students are relatively unstable. 
The stability of test analysis data increases as the number of test takers approaches 100. Therefore, when you have a very small group (50 or fewer), you must consider the relative instability of the data when you interpret the analysis. In fact, test and item analysis should not be interpreted dogmatically, no matter how large the number of students. As this discussion illustrates, analysis of test data requires a variety of interpretations, both qualitative and quantitative. The size of the sample is one of the factors you must consider.

The first step in test analysis is to make sure that the data report is complete. Check the number of items and examinees, and verify their accuracy. This sample has 100 items, which means the raw score is equal to the percentage correct, and 92 examinees had their answer sheets scored. Once you verify that these figures are correct, you are ready to analyze the results of the test.

Measures of Central Tendency

Measures of central tendency provide a single value that best represents the typical score in a distribution. The mean, the median, and the mode are the three most commonly used measures of central tendency in education. While the mode, which represents the most frequently obtained score in a distribution, has limited usefulness for interpreting classroom test scores, both the mean and the median provide valuable information.

Mean

Most test analysis programs report the mean, or arithmetic average, of the raw scores on the test; in this case (Table 11.1), it is 75.4. The mean percent score is determined by dividing the mean raw score by the total number of items on the test. The mean percent score is equal to the mean raw score in this example because there are 100 items on the test.

One disadvantage of the mean is that it is sensitive to extreme scores. An extreme score, whether very high or very low, can pull the mean in its direction. 
This effect is particularly problematic when there is a small group of scores (Reynolds, Livingston, & Wilson, 2008).

For example, let’s look at the distributions in Table 11.2. Each distribution has an extreme score of 15, yet the smaller set of scores is affected more dramatically than the larger set. The mean of the “A Scores” is 7.4, a score that does not even appear in the distribution, while the score of 15 has a less dramatic effect on the mean of the “B Scores,” which is 5.95. It is obvious from these examples that you must consider the effect of extreme scores when interpreting the mean on a test.

Table 11.2 Effect of an Extreme Score on the Mean of a Distribution

A Scores        B Scores
5               5
5               5
6               5
6               5
15              5
                5
                5
                5
                5
                5
                6
                6
                6
                6
                6
                6
                6
                6
                6
                15
Mean = 7.4      Mean = 5.95

It is important to examine the relationship of the mean to the passing standard that you have set. If, for example, your passing standard is 75%, this test has an average score at the passing level. Several factors must be considered when interpreting the mean:

Were there extreme scores in the distribution?
What was the quality of teaching, on a range between effective and ineffective?
What was the students’ level of effort to achieve the outcomes, on a range between maximal and minimal?
Where did the material/objectives fit, on a range between too easy and too difficult?
How difficult were the items, on a range between too easy and too hard?

If a test has a very low mean, you should investigate whether one of the factors listed above is the cause. As Haladyna (1997) points out, if you intentionally give difficult tests, you should adjust your grading policy to ensure that you assign grades fairly in relation to the other courses that students take. Similar consideration should be made if your tests are consistently too easy. The ideal goal is to have a test with a mean that reflects a range of student abilities. 
A test should be neither too easy nor too difficult, but should reward those students who are high achievers and should identify those who have not met the course objectives. Chapter 13, “Assigning Grades,” discusses the issues surrounding grade assignment.

Median

The median in this sample is 77, which represents the middle point in this group of raw scores. In a normal distribution, the mean and the median are the same. If a distribution is positively skewed, meaning that the test is very difficult, with most scores at the low end of the distribution and few very high scores, the mean is pulled to the positive end of the distribution and is higher than the median (Reynolds et al., 2008).

A positively skewed distribution, such as the one depicted in Figure 11.1, signals a problem. Why are there so few high scores? What went wrong in the instructional process?

Figure 11.1 A positively skewed distribution

If a distribution is negatively skewed, it means the test was easy for the group, with most scores at the high end of the distribution and few very low scores. The distribution depicted in Figure 11.2 is one you might expect in a nursing class. After all, every student who is admitted to a nursing program is capable of achieving the objectives.

Figure 11.2 A negatively skewed distribution

The mean in a negatively skewed distribution is pulled toward the negative end of the distribution and is lower than the median (Reynolds et al., 2008). Therefore, the mean can give the wrong impression whenever a distribution is seriously skewed. The terms positively skewed and negatively skewed can seem counterintuitive. Remember this tip: A “positively” skewed distribution has its tail in the positive end of the distribution, while a “negatively” skewed distribution has its tail in the negative end of the distribution. 
A distribution that is not skewed has the same or very close values for the mean and the median; it resembles a bell and is referred to as a normal distribution or a bell curve.

The mean and median in the example of Table 11.1 are close, which means there are probably not many extreme scores in the distribution. The mean in this case can be interpreted as representing a typical score. Of course, you should review the score distribution, which is also included in the test analysis report, before you decide whether the mean actually represents a typical score. Interpretation of graphic and frequency distributions is discussed later in this chapter.

Measures of Variability

Because it is impossible to predict the range of scores for a test based on measures of central tendency alone, it is necessary for us to look further when interpreting a set of scores. In fact, two sets of scores could have the same mean and have a very different spread of scores. We need to examine measures of variability to determine how much the scores spread out from the mean or how much dispersion there is in a distribution.

Range

In the example in Table 11.1, the range of the distribution is 41, the difference between the high (93) and low (52) scores on the test.

Range = highest score − lowest score

A small range indicates a concentrated distribution of scores, while a large range means that the scores are spread out and that some students have not done well on the test. This value gives us a rough idea of the variability of the test scores.

Standard Deviation

The standard deviation is a more useful measure of the variability of a score distribution than the range. The standard deviation (reported as 7.7 in Table 11.1) indicates the average distance that scores in a distribution vary from the mean. A large variance indicates that the scores are spread out from the mean, as Chapter 10, “Establishing Evidence of Reliability and Validity,” discusses. 
Conversely, the smaller the variance, the greater the similarity of the scores and the closer they are to the mean. Most software programs report the standard deviation, which is the square root of the variance, in their test analysis reports. The standard deviation tells you the average distance of the scores from the mean, or how much the scores differ, either positively or negatively, from the mean. A small standard deviation means that the scores were bunched together around the mean, while a large standard deviation means that the scores were spread out from the mean.

The standard deviation is most useful for making interpretations based on the normal curve; grading on the normal curve, however, is discouraged for classroom testing, as Chapter 13, “Assigning Grades,” discusses. In the classroom setting, the most useful application of the standard deviation is to help you to understand the reliability coefficient and the standard error of measurement (SEM) for a test.

Reliability Coefficient

Reliability refers to the amount of confidence you can have in a test score, which Chapter 2, “The Language of Assessment,” and Chapter 10, “Establishing Evidence of Reliability and Validity,” discuss at length. The reliability coefficient for our sample in Table 11.1 is reported as alpha, and its value is 0.754.

What reliability coefficient should you expect from the results of your classroom exams? The answer to this question varies, but it relates directly to the level of confidence you must have in the decisions made based on the test results. High-stakes decisions require measurement results with high reliability. In other words, the results of a test that decides whether or not a student graduates from a program of study would require a high level of reliability. 
For this reason, you should never base such a serious decision on just one classroom exam.

Miller, Linn, and Gronlund (2009) agree that the degree of reliability you must require for the results of a classroom test depends largely on the decision to be made based on the test results. Consider the importance of the decision and whether the decision can be reversed. If the reliability coefficient of a test’s results is low, make only tentative decisions, obtain additional data, and, most important, be willing to reverse your decision.

Miller et al. (2009) report that the reliability coefficients of teacher-made tests usually vary between 0.60 and 0.85. Kehoe (1995) maintains that the results of tests of more than 50 items should have reliability coefficients of greater than 0.80, while Frisbie (1988) asserts that teacher-made test results should yield reliability coefficients that average about 0.50 and that 0.85 is the generally acceptable minimum reliability standard when decisions are being made about individuals based on a single test score. Frisbie also states that reliability coefficients of about 0.50 for the results of teacher-made tests can be tolerated when the scores are combined with other scores to assign a grade. In that case, you should be concerned with the reliability of the score that results from combining the scores.

Our sample’s reliability coefficient of 0.754 looks respectable at first glance, according to these standards. This value should not be considered in isolation, however. The factors that affect the reliability coefficient of a test must be taken into account. 
These factors are discussed in detail in Chapter 10, “Establishing Evidence of Reliability and Validity,” and include the following:

Quality of the test items
Item difficulty
Item discrimination
Homogeneity of the test content
Homogeneity of the test group
Test length
Number of examinees
Speed
Test design, administration, and scoring

When reviewing the reliability coefficients of your classroom test results, consider all these factors. If you have a class that consists of a homogeneous group of high-achieving students, you might get a low reliability coefficient on a test of difficult, well-written, heterogeneous items that follow all the guidelines outlined in this text. It is also possible that a low reliability coefficient indicates that the items are either too difficult or too easy for the group of students. On the other hand, you could obtain a high reliability coefficient for a speeded test with a large number of items on narrowly defined content that is administered to a large heterogeneous group of students. Also remember that the testing conditions, quality of teaching, and number of questions and/or examinees are all factors that can affect the reliability of test scores. Low reliability coefficients are most often due to an excess of very easy or very hard items, poorly written items that do not discriminate, or test items that do not represent a unified body of content (Kehoe, 1995).

You must consider all influencing factors when interpreting a reliability coefficient for the results of a test. A test that has a low reliability coefficient could be providing reliable results. Your judgment is a very important part of the equation. As Mark Twain said, “There are three kinds of lies: lies, damned lies, and statistics.” Statistical findings are meaningless in themselves, and they can be distorted to fit erroneous interpretations. It is your informed interpretation of the data that adds the ingredient of fairness to your grade assignments. 
Refer to Chapter 10, “Establishing Evidence of Reliability and Validity,” for a detailed discussion related to reliability estimates of classroom exams.

Standard Error of Measurement (SEM)

The SEM enables us to estimate the amount by which a student’s obtained score might differ from the student’s true score. Chapter 10, “Establishing Evidence of Reliability and Validity,” discusses the implications of the SEM for test score interpretation.

The SEM is very important because, if we assume that an obtained score on a test is necessarily the student’s true score, we will misinterpret the test results (Reynolds et al., 2008). The SEM for our sample in Table 11.1 is 3.8, which means that the true score for an observed raw score of 72 would be expected to fall between 68.2 and 75.8 (72 ± 3.8). This range is referred to as the confidence band.

Suppose that the passing score on a test was predetermined to be 75. Would you consider adding 4 points to all scores on the test if the SEM were 4? Adding points, or scaling grades, for each individual test is not advisable; if you are inclined to scale the grades, it is a better practice to wait for the final grade. Suppose you scale up the grades three points for exam one and the scores on exam two are very high. Will you scale down the scores for exam two? You probably would cause a student revolt.

Scaling individual test scores can alter the predetermined weighting of the components of the course grade. The better practice is to wait until the end of the course; look carefully at the means, medians, reliability coefficients, and SEMs for all exams; consider the final score spread; and then add points (if you judge it necessary) to the final grade assignment (refer to the discussion on grading in Chapter 13, “Assigning Grades”).

The important lesson to be learned from the SEM is that classroom test scores are not absolute, and they do not represent students’ true scores. 
Measurement error is present in all scores, so you must look at the margin of error in a test and be flexible when translating raw scores into test scores and test scores into course grades. If the statistical analysis produced by your test development software does not include the SEM, you can calculate it by using the formula in Exhibit 10.5.
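
The summary values discussed in this section can be checked with a few lines of code. What follows is a minimal Python sketch, not the output of any particular testing package. It recomputes the means in Table 11.2 and derives the SEM from the standard deviation and alpha reported in Table 11.1. It assumes the widely used formula SEM = SD × √(1 − reliability), which is presumed here to correspond to Exhibit 10.5 (not reproduced in this chapter). The function names are illustrative only.

```python
import math
import statistics

# Scores from Table 11.2 (each distribution contains one extreme score of 15).
A_SCORES = [5, 5, 6, 6, 15]
B_SCORES = [5] * 10 + [6] * 9 + [15]


def summarize(scores):
    """Return the mean, median, and range of a list of raw scores."""
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "range": max(scores) - min(scores),
    }


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability); assumed to match Exhibit 10.5."""
    return sd * math.sqrt(1 - reliability)


def confidence_band(observed, sem):
    """Return the band of one SEM on either side of an observed score."""
    return observed - sem, observed + sem


if __name__ == "__main__":
    print(summarize(A_SCORES))
    print(summarize(B_SCORES))
    # Standard deviation and alpha as reported in Table 11.1.
    sem = standard_error_of_measurement(sd=7.7, reliability=0.754)
    print(round(sem, 1))
    print(confidence_band(72, round(sem, 1)))
```

With the values from Table 11.1, this sketch gives an SEM of about 3.8, which agrees with the sample report, and a band of roughly 68.2 to 75.8 around an observed score of 72.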
Score Distribution

A test score distribution complements the analysis of test data by providing you with a description of how the class as a whole performed on the test. A distribution helps you visualize test results and makes the scores easier to interpret. Score distributions are typically reported in a frequency table or in a graphic format.

Table 11.3 is a grouped frequency distribution associated with the sample data in Table 11.1. This frequency distribution groups the raw scores on the test into four-point intervals. In this distribution, you can see, for example, that six students scored from 84 to 87. This distribution provides a visual representation for interpreting the range of raw scores on a test. It enables you to visualize the significance of test analysis data such as the mean and the median and to identify how the scores on the test are clustered.

Table 11.3 Grouped Frequency Distribution

Score interval    Frequency
92–95             1
88–91             1
84–87             6
80–83             15
76–79             24
72–75             16
68–71             13
64–67             10
60–63             2
56–59             1
52–55             3
N                 92
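
A grouped frequency distribution like Table 11.3 is straightforward to build and inspect in code. The following Python sketch is an illustration under stated assumptions rather than the method used by any testing program: the interval counts are copied from Table 11.3, the starting score of 52 and the four-point width are taken from that table, and the helper names (`group_scores`, `count_at_or_below`, `render_histogram`) are hypothetical.

```python
from collections import Counter

# Interval counts copied from Table 11.3: (low, high) -> number of students.
TABLE_11_3 = {
    (92, 95): 1, (88, 91): 1, (84, 87): 6, (80, 83): 15,
    (76, 79): 24, (72, 75): 16, (68, 71): 13, (64, 67): 10,
    (60, 63): 2, (56, 59): 1, (52, 55): 3,
}


def group_scores(scores, lowest=52, width=4):
    """Group raw scores into intervals of `width` points starting at `lowest`."""
    counts = Counter()
    for score in scores:
        start = lowest + ((score - lowest) // width) * width
        counts[(start, start + width - 1)] += 1
    return dict(counts)


def count_at_or_below(freq, cutoff):
    """Count students in intervals whose upper bound is at or below `cutoff`."""
    return sum(n for (_, high), n in freq.items() if high <= cutoff)


def render_histogram(freq):
    """Return a text histogram, highest interval first, one '#' per student."""
    rows = []
    for (low, high), n in sorted(freq.items(), reverse=True):
        rows.append(f"{low}-{high} | {'#' * n} ({n})")
    return "\n".join(rows)


if __name__ == "__main__":
    print(render_histogram(TABLE_11_3))
    print("N =", sum(TABLE_11_3.values()))
    print("Scored 71 or lower:", count_at_or_below(TABLE_11_3, 71))
```

Because the raw scores behind Table 11.3 are not listed in the chapter, `group_scores` is shown only as the step that would produce such a table from a list of raw scores; the histogram and counts are computed from the published interval frequencies.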
Histogram

For many, a graphic representation of a score distribution provides the clearest visualization of a set of scores. A histogram is a bar graph of a score distribution, with the height of each column indicating the number of students who scored in the interval shown on the horizontal axis.

What form should the distribution of a classroom test take? The normal curve is the distribution with which most teachers are familiar. A normal curve is a symmetrical, bell-shaped distribution with the mean and median at the midpoint and scores tapering off toward each extreme, as depicted in Figure 11.3. In a normal distribution, more than two-thirds of the scores are located close to the mean (that is, within plus or minus one standard deviation), while only a few scores are at the very high or very low extremes of the distribution. Virtually all cases fall within three standard deviations of the mean.

Figure 11.3 Normal distribution histogram

The results of most achievement tests will approximate a normal curve when they are administered to large numbers of people (Reynolds et al., 2008). The key word here is large. It is most unlikely that the relatively small number of students in your class represents a normal distribution. You should expect a normal distribution only when you test a large number of people of varying abilities. Chapter 13, “Assigning Grades,” discusses why the use of the normal curve is inadvisable for classroom grading.

Refer again to Figure 11.1; it depicts a positively skewed distribution, meaning that most of the test scores are low. The shape of the distribution does not explain the cause of the problem; rather, it is a red flag that indicates that the test needs further analysis. It might mean that the test was too difficult, the items were poorly written or confusing, the teaching/learning activities were inadequate, student motivation was low, or the objectives were unrealistic. 
This type of distribution alerts you to investigate the cause of the problem and to take corrective action.

A negatively skewed distribution, such as the one depicted in Figure 11.2, means that most of the scores are high, with only a few students at the low end. There could be several reasons for this distribution, the most desirable one being that the teaching was effective and that the students were highly motivated. It could also mean that the test was too easy or that a copy of the test or the answer key circulated among the students. Whatever the case, further investigation is warranted.

In reality, the histograms that result from your classroom tests will not be as clear cut as the samples provided here. What is important is to examine both the frequency distribution and the histogram that represents the distribution to make sense of the raw scores for the test. Ask yourself what results you expected from the test and then determine what actually happened. A pretest should be positively skewed. A medication calculation test where all students are expected to pass with 90% correct should be negatively skewed. The score distribution for a test is a powerful tool for examining test score results.

Figure 11.4 is a histogram of the frequency distribution from Table 11.3. In the case of our sample, the score interval is four points. Note that the sample histogram shows most of the scores clustered between 64 and 87 and confirms that the mean of 75.4 represents the average score for this test.

Figure 11.4 Histogram for sample test data

The analysis for this test is not over yet. This histogram should alert the teacher to closely examine the test. Why did 29 students attain scores of 71 or lower? What was the cause of the extremely poor performance of the six students who achieved a raw score between 52 and 63? Answers to these questions can be found in the analysis of the individual items. 
The analysis of the data, the frequency distribution, and the histogram (in conjunction with the analysis of the individual items) provide the best approach for determining the grade assignment for a test.

Mean Item Difficulty

The item difficulty index (p value) of an item is the proportion of examinees who answered the item correctly. The mean p value identifies the average p value of the items on a test and tells you how difficult the total test is. The mean p value of a test translates easily into the mean percent correct score on the test (a mean p value of 0.754 corresponds to 75.4%, in the case of our sample in Table 11.1). It is easy to see that the difficulty of the individual items is what determines the difficulty of the overall test.

Developing an item bank helps to control the difficulty of future exams. Each item should be examined carefully and revised based on the data provided before it is entered into a bank. When you store items with their data, you have an indication of how difficult the items are when you select them for use on future tests. Remember, the difficulty data associated with an item pertain to its use on a specific exam; the item could perform differently in another item set or with a different group of students. You can get a pretty good idea of how an item will work with a similar group from its history, however.

Your ability to write items at particular difficulty levels improves as you develop expertise in item writing. In addition, the more items you analyze, the more proficient you become at recognizing good items and adapting your personal style to write effective test questions. Item analysis is examined in depth later in this chapter.

It is important to note that an item’s difficulty level is directly related to its ability to discriminate. If the items on a test are too easy (for example, if the average p value is 1.00), there is no discrimination between the high and low scorers because everyone answered every item correctly. 
According to Kehoe (1995), a good test contains items with p values between 0.30 and 0.80; items that are answered correctly or incorrectly by more than 85% of the examinees have poor discrimination power.

Nursing faculty frequently ask, “How difficult should my classroom test be?” There is no standard response to this important question because the answer is a very individual one (see Chapter 4, “Implementing Systematic Test Development”). Several factors must be considered. A test is only as hard or as easy as the teacher makes it. First, you have to look at what the passing score is for the course. In many nursing programs, a grade of “C” is the minimum passing score. In fact, most institutions require that students maintain a grade of “C” or better in their major. Because “C” is considered average, you must decide what is considered average performance in relation to your course objectives. Aim to write items to measure student ability in relation to the course content and objectives only. Ask yourself for each item that you design to measure minimal competence (an average-level question), “Should every student be able to answer this question correctly?”

Keeping the p value of your items in the average range of 0.70 to 0.80, for example, yields a test with a mean between 0.70 and 0.80. You will also want to include some easier items (p value above 0.80) to encourage students at the beginning of a test, and you should include some challenging items (p value below 0.70) to identify the high-achieving students (Frary, 1995). Reynolds et al. (2008) claim that, for maximum reliability, the optimal mean p value for a 4-option multiple-choice exam is 0.74. 
Frary recommends limiting the number of items that more than 90% of the students can answer correctly because these items do not contribute to the reliability of a test’s results.

As long as the items relate to the blueprint, having a range of difficulty levels for test items increases the variability of your test scores. Remember, the higher the score variability, the higher the reliability coefficient of the test results will be. Refer again to Chapter 4, “Implementing Systematic Test Development,” for additional discussion of this important topic.

Mean Biserial

The point biserial index (PBI) represents the discrimination ability of an item, which is the basic measure of item quality for multiple-choice tests (Kehoe, 1995). It identifies the capability of an item to distinguish between those who scored at the high end on the test and those who scored at the low end.

The PBI is calculated with a complicated statistical formula that determines the correlation between the answers to a particular item on a test (correct or incorrect) and the total scores on the test. Basically, the formula compares the responses of the students with the highest overall test scores to the responses of the students with the lowest overall test scores.

The PBI ranges from −1.0 to +1.0; the higher the PBI, the better the item discriminates between the high and low achievers on the test. A positive PBI for an item indicates that students who achieved high scores on the test chose the correct answer for that item more frequently than students who had low scores. If students who achieved low scores on the test chose the correct answer for an item more frequently than students who had high scores, the item will have a negative PBI. When there is little or no difference between the proportion of high-scoring and low-scoring students who select the correct answer, the item will have a low PBI and will contribute nothing to the reliability coefficient of the test because it does not discriminate. 
The higher the average item discrimination ability, the higher the reliability coefficient of the test results (Ebel, 1979, p. 268).

The PBI is easily available to classroom teachers who have access to test development software. The PBI provides valuable information for refining test items. If you do not have access to test development software to calculate the PBI, it is possible to calculate a similar value that will help you determine the quality of your test items.

The item discrimination index (D) of an item is the difference between the percentage of high-scoring students who answered the item correctly and the percentage of low-scoring students who answered the item correctly. While commercial test developers seldom use the D value today, it is very useful for classroom teachers who do not have access to the PBI (Brookhart & Nitko, 2014). There are dozens of formulas for calculating the D value; Brookhart and Nitko suggest using the formula in Exhibit 11.1. The D value for the item is the D value for the correct answer.

Exhibit 11.1 Formula for calculating D for each option

D = pu − pl
pu = percent (as a decimal) of the upper group selecting the option
pl = percent (as a decimal) of the lower group selecting the option

Item analysis is the most valuable tool for increasing the fairness of a test and improving the quality of your test items for future use. If you do not have access to these data, Exhibit 11.2 explains how you can calculate the statistics by hand. 
While the process may seem tedious, it will provide you with important information that is crucial for examining your test items.

Exhibit 11.2 Hand-calculating item analysis data

1. Arrange the tests in order from the highest to the lowest score.
2. Put the answer sheets of the top-third scorers in a pile.
3. Put the answer sheets of the bottom-third scorers in a pile.
4. For each item on the test, tally how many students from each group selected each option.
5. Compute the difficulty index (p) for each item (this calculation includes only the top and bottom third, but it is sufficient for classroom test analysis).
6. Compute the response frequencies for each option (the percentage of students who selected each option).
7. Compute the discrimination index (D) for the correct answer.
8. Compute the discrimination index (D) for each distractor option.
9. Use the data to analyze the quality of the item (refer to the detailed discussion later in this chapter).

Item 10              A           B           *C          D           Total
Top group            4 (0.13)    3 (0.10)    21 (0.70)   2 (0.06)    30
Bottom group         9 (0.30)    6 (0.20)    10 (0.33)   5 (0.17)    30
Total                13          9           31          7           60
p value of option    0.21        0.15        0.52        0.12
D value of option    −0.17       −0.10       0.37        −0.09

Just as the mean p value represents the average difficulty of the items on the test, the mean PBI represents the average discrimination power of the items on the test. A high mean PBI indicates that the test contains high-quality items (that the items on the test discriminated between the high and low achievers). Table 11.4 is adapted from Ebel’s (1979, p. 267) proposed range for evaluating discrimination indices on the correct answers for items on classroom tests.

Table 11.4 Range for PBI on the Correct Answers in a Classroom Test

>0.40        Very good
0.30–0.39    Good: examine stem and options for clarity
0.20–0.29    Marginal: identify problems with stem and/or options
0.10–0.19    Weak: revise stem and/or options before banking item
0.00–0.10    Very weak: consider rejecting or accepting multiple answers
<0.00        Unacceptable: reject or accept multiple answers for the item

Haladyna (1997) defines a highly discriminating item on a classroom exam as one that has a PBI above 0.20 and a p value between 0.60 and 0.90 (p. 240). Kehoe (1995) maintains that items that have a PBI of less than 0.15 should be restructured. The mean PBI for our sample in Table 11.1 is 0.36. According to these criteria, our sample is well within the reasonably good category. For the purposes of overall test analysis, both the mean PBI (0.36) and the mean p value (0.754) for our sample are acceptable values, which means that the items on average were challenging and provided distinction between those who scored at the high end on the test and those who scored at the low end. However, you need to examine these values for each individual item before final decisions can be made about the exam.
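The hand calculation in Exhibit 11.2 and the ranges in Table 11.4 can be sketched in a few lines of Python. This is only an illustration, not the method used by any particular scoring program: the function names are hypothetical, the tallies are the Item 10 counts from Exhibit 11.2, and the handling of values that fall exactly on or between the printed boundaries of Table 11.4 (for example, exactly 0.40) is an assumption.

```python
# Illustrative sketch only; function names are hypothetical.

def option_stats(top_counts, bottom_counts):
    """Return {option: (p, D)} from upper- and lower-group tallies.

    p is the proportion of both groups combined that chose the option;
    D is the upper-group proportion minus the lower-group proportion.
    """
    n_top = sum(top_counts.values())
    n_bottom = sum(bottom_counts.values())
    stats = {}
    for option in top_counts:
        p_upper = top_counts[option] / n_top
        p_lower = bottom_counts[option] / n_bottom
        p = (top_counts[option] + bottom_counts[option]) / (n_top + n_bottom)
        stats[option] = (p, p_upper - p_lower)
    return stats


def rate_pbi(pbi):
    """Label a PBI using the bands of Table 11.4.

    Assumption: boundary values go to the higher band (0.40 -> "Very good").
    """
    if pbi >= 0.40:
        return "Very good"
    if pbi >= 0.30:
        return "Good"
    if pbi >= 0.20:
        return "Marginal"
    if pbi >= 0.10:
        return "Weak"
    if pbi >= 0.00:
        return "Very weak"
    return "Unacceptable"


# Tallies for Item 10 in Exhibit 11.2 (option C is the key).
top = {"A": 4, "B": 3, "C": 21, "D": 2}
bottom = {"A": 9, "B": 6, "C": 10, "D": 5}

stats = option_stats(top, bottom)
for option, (p, d) in stats.items():
    print(f"{option}: p = {p:.2f}, D = {d:.2f}")

# The sample's mean PBI reported in the text.
print(rate_pbi(0.36))
```

Because the sketch works from unrounded proportions, a printed value can differ in the last digit from a table that subtracts already-rounded figures.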
Individual Item Analysis

Once you have examined the big picture of the test, it is time to analyze each of the pieces. In addition to summary statistics for a test, test development software programs provide statistical analysis of each item on the test. This analysis includes the p value and PBI for the correct answer and each of the distractors for every item. Multiple-choice classroom tests can be improved by developing a pool of good items that can be used for future tests. The statistical information from a test item analysis is an invaluable tool for both interpreting test results and improving your items for future use. However, the quality of an item cannot be determined from its statistical data alone. A qualitative review of the actual item alongside the data is the most effective way to determine the need for revision of an item.

General guidelines for using item analysis data to evaluate the items on multiple-choice exams include the following:

Interpret the data with the size of the sample in mind; the larger the sample, the more dependable the data. Many software programs accumulate item data over time, which increases the overall sample size. However, it is important to note that item analysis data refer to an item’s performance on a particular test with a particular group. When the item is included with a different set of items to form a new test or is used in the same test with a different group of students, the data are subject to change. Therefore, accumulated data provide only a general estimate of how an item is performing. Once an item bank is established, however, you will find that most of the items perform consistently over time, particularly those that are edited based on item analysis data.

Do not use item analysis data dogmatically. The data should be used as a guide for your professional judgment and expertise.

Do not use item analysis data in isolation. Good data do not guarantee a good item.
It is important to review the item to determine its cognitive level.

Review the data analysis for every option in an item. As the previous discussion of the quantitative test data analysis points out, the p value and the PBI for an item correspond to the p value and the PBI for the item’s correct answer. These data give a picture of how well the item is working. However, it is also important to examine the analysis data for each of the distractors and to examine the question itself before deciding whether or not the question is a valuable one.

Consider potential revisions for every item, but items that have a PBI of less than 0.20 should definitely be restructured. These items are not discriminating well between the high and low achievers on the test. Frequently, you will find that these items violate the guidelines for item writing. Examine the data for each option and rewrite the items; refer to Chapter 5, “Selected-Response Format: Developing Multiple-Choice Items,” for guidelines for item construction.

Determine the desirability of very difficult items. When an item has a p value of less than 0.30, it usually indicates that the item is too difficult for the group. The general rule is to eliminate an item that fewer than 30% of the group answers correctly. However, it is not wise to set rigid parameters for retaining or eliminating an item; you should use your professional judgment to determine whether an item is appropriately difficult. If a very difficult item discriminates well, it may be a challenging item that should be retained. Each item should be carefully reviewed to determine whether the item difficulty level is desirable or if item revision is indicated.

Determine the desirability of very easy items as well; an item that has a difficulty level of greater than 0.90 may be too easy for the group. You cannot remove very easy items from a test (unless you want to cause a student revolt).
However, items with very high p values should be examined carefully and revised before they are entered into an item bank.

Check that every distractor has been selected. Distractors should be distracting. Those that are not chosen by any test taker are not working as you planned. For example, if only the correct answer and one distractor are chosen for a four-option multiple-choice item, then the item is actually a two-option item. Options that are not selected are not contributing to the test and probably should be revised. In many cases, distractors that are not selected may be implausible. Be sure to consider the size of the group when determining why an option was not selected. For example, if your sample is 10 students, you may decide that no one selected an option because there were too few students taking the test. You may decide to try the option on a future test. If no one selects an option after a few administrations of the item, it would be wise to revise the option.

Look carefully at each option’s PBI. An incorrect option that has a positive PBI or a correct option that has a negative PBI usually means that ambiguity in the stem or the option confused the students. If an item’s correct option has a negative PBI and one or more distractors have a positive PBI, it means that the students who were the low achievers on the test selected the correct response more frequently than the students who were the high achievers on the test. This finding usually indicates that the high-achieving students were misled by item ambiguity or that the item is a trick item. It may also mean that all the students were so confused by the item they were guessing. One situation that should raise a red flag occurs when an item’s correct option has a positive PBI and one of the distractors also has a positive PBI. These results indicate that the distractors were confusing because they attracted some of the high-achieving students.
These items require careful examination, both quantitatively and qualitatively, to determine whether to remove them from the test and how to revise them for future use.

Table 11.5 provides a sample of the data printout that would be included in a detailed item analysis of a test. A careful review of these data provides insight into the role that item data can play to assist you in translating how items work on a test. You will see how important it is to examine all aspects of data analysis before making scoring decisions on a test.

Table 11.5 Detailed Item Analysis

        Item Statistics        Option Statistics
Item    p Value    PBI         Option   p Value for Option   PBI       Key
1       0.71       0.42        A        0.21                 −0.30
                               B        0.03                 −0.47
                               C        0.05                 −0.29
                               D        0.71                 0.42      D
2       0.70       0.26        A        0.13                 −0.19
                               B        0.11                 0.02
                               C        0.70                 0.26      C
                               D        0.06                 −0.23
3       0.77       0.10        A        0.07                 −0.26
                               B        0.04                 −0.19
                               C        0.12                 0.09
                               D        0.77                 0.10      D
4       0.76       0.27        A        0.76                 0.27      A
                               B        0.00                 —
                               C        0.24                 −0.27
                               D        0.00                 —
5       0.258      −0.013      A        0.045                −0.063
                               B        0.258                −0.013    B
                               C        0.000                —
                               D        0.697                0.040
6       0.32       0.48        A        0.32                 0.48      A
                               B        0.13                 −0.24
                               C        0.18                 −0.04
                               D        0.38                 −0.29
7       1.00       —           A        0.00                 —
                               B        0.00                 —
                               C        1.00                 —         C
                               D        0.00                 —
8       0.57       0.168       A        0.04                 −0.147
                               B        0.07                 −0.147
                               C        0.32                 0.14
                               D        0.57                 0.168     D
9       0.352      −0.22       A        0.402                −0.01
                               B        0.183                0.15
                               C        0.352                −0.22     C
                               D        0.422                0.10
10      0.714      0.43        A        0.159                −0.18
                               B        0.0635               −0.20
                               C        0.714                0.43      C
                               D        0.0635               −0.33

Note that the p value and the PBI for each item in the Item Statistics category correspond to the p value and PBI of the correct option in the Option Statistics category. For example, item 1 has a p value of 0.71 and a PBI of 0.42, the same data as for option D (the correct answer). The p value figure in the Option Statistics category refers to the proportion of students who chose each option, while the PBI value refers to the biserial for each option. A comprehensive assessment requires that you examine both the difficulty (p value) and the PBI of each item on the test and each distractor for each item.

The difficulty level of the items determines the mean of the test.
Your professional judgment is the guide for determining the acceptability of the difficulty level of the items. The suggestions for acceptability of the PBI (see Table 11.4) can be used as a guide to evaluate the discrimination ability of the items. Remember, the high end of this range is a goal for which you should strive. Analysis of these data will not only help you make decisions about the test scores, it will also guide you in improving the items for future use.

Keep in mind that you cannot look at the data in isolation. The data alert you to review the actual item; use the data as a guide. If you have a very small student sample, the data will not be as accurate as they would be with a large sample. You have a dual objective with item analysis review: to help you make scoring decisions about the test at hand and to improve the items in order to bank them for future use. Although your judgment is the key factor, the data are very valuable tools for guiding your judgment.

The item analysis data in Table 11.5 are an example of the item analysis data printout many test development software programs provide after a test is scored. The table illustrates how data can be utilized to make informed decisions about test items. Each of the first four items has an acceptable difficulty level. If difficulty were the only available statistical information, you might infer that the items were all operating equally well on the test. Close examination of all the data provides insight into the true functioning of the items and alerts you to examine the items that have questionable data.

Item 1 in Table 11.6 is an example of an item that worked well statistically. The item has a difficulty level of 0.71 and a PBI of 0.42; these findings indicate that the item discriminated very well between the high achievers and the low achievers on the test.
In addition, each of the incorrect options has a negative PBI, which indicates that the low-scoring students chose the distractors more frequently than the high-scoring students. This picture is exactly what you want to have for your items: the correct answer in an acceptable difficulty range with a high positive PBI and all distractors with a negative PBI. Remember, however, that data are not the only parameter for measuring the value of an item. Qualitative review of the item is just as important as the quantitative review: Just because an item has good statistical data does not mean that it is measuring anything of value.

Table 11.6 Item with Very Good Data

Item 1
p Value    PBI     Option   p Value for Option   PBI       Key
0.71       0.42    A        0.21                 −0.30
                   B        0.03                 −0.47
                   C        0.05                 −0.29
                   D        0.71                 0.42      D

Item 2 in Table 11.7 has approximately the same p value as item 1. Although the PBI of the item is acceptable, this item was not as statistically effective as item 1 because option B, an incorrect option, has a positive PBI, which means that high-achieving students chose that option. Perhaps the distractor was confusing or tricky. In any event, this information is a red flag for you to take a closer look to determine whether there is a problem with the option. If you looked only at the difficulty level for this item, you would miss the problem with distractor B.

Table 11.7 Item with Positive PBI on an Incorrect Option

Item 2
p Value    PBI     Option   p Value for Option   PBI       Key
0.70       0.26    A        0.13                 −0.19
                   B        0.11                 0.02
                   C        0.70                 0.26      C
                   D        0.06                 −0.23

When you examine the actual item and discuss students’ reasoning with those students who selected the option, you might decide that the test scores should be adjusted to accept option B as a correct answer. On the other hand, you may decide that it is not appropriate to give credit for option B.
You are not obligated to give credit for an incorrect option, only to examine questionable data.

Once you identify what attracted the high achievers to choose option B, you can edit the option so that it is clearly incorrect before you put the item in your item bank. The next time you use the item, it is likely that the high achievers will not be attracted to option B and will choose the correct option, thus increasing the discrimination ability of the item.

Another example of the problems that result from restricting your data review to the difficulty level of the items is item 3 in Table 11.8. While the item’s difficulty level could be considered in the acceptable range, the correct answer (D) has very poor discrimination ability, and option C has a positive PBI. This item should be reviewed and option C considered for acceptance as a correct answer on this test.

Table 11.8 Item with Low PBI on Key and Positive PBI on Incorrect Option

Item 3
p Value    PBI     Option   p Value for Option   PBI       Key
0.77       0.10    A        0.07                 −0.26
                   B        0.04                 −0.19
                   C        0.12                 0.09
                   D        0.77                 0.10      D

Once you identify what attracted the high achievers to choose option C, you can edit the option so that it is clearly incorrect before you put the item in your item bank. Doing so will make it more likely that the high achievers will choose the correct option if you use the question in a future test because they will not be distracted by the flawed distractor, and the discrimination ability of the item will increase.

One goal of effective item writing is to develop plausible distractors that are attractive to the uninformed students. For a distractor to contribute to an item, someone must choose it. When no one chooses an option, it is considered a nondistractor. Item 4 in Table 11.9 has two nondistractors: No one chose either option B or D, and thus it is most likely that these options are not plausible.
Although the item has an acceptable difficulty level and it had acceptable discrimination, it was effectively a true–false item. The item should be reviewed and the two options that were not chosen should be revised before the item is banked. You will probably find that when you use the item in a future test, its discrimination power will increase because the low achievers will be attracted to the revised B and D options.

Table 11.9 Item with Two Nondistractors

Item 4
p Value    PBI     Option   p Value for Option   PBI       Key
0.76       0.27    A        0.76                 0.27      A
                   B        0.00                 —
                   C        0.24                 −0.27
                   D        0.00                 —

Item 5 in Table 11.10 is an example of an item where the correct answer has a negative PBI. In addition, option C was not chosen, and option D has a positive PBI and was chosen by most students. These findings indicate that the students were confused and were probably guessing. Perhaps the item is miskeyed, but even if it is miskeyed, option D has very weak discrimination power and needs to be revised. Based on the data of this item, you should review it with the students and consider discarding it from this test and revising it extensively before entering it into your item bank.

Table 11.10 Very Difficult Item with Negative PBI on Key and One Nondistractor

Item 5
p Value    PBI       Option   p Value for Option   PBI       Key
0.258      −0.013    A        0.045                −0.063
                     B        0.258                −0.013    B
                     C        0.000                —
                     D        0.697                0.040

This is a perfect example of why a student review is so important. You want to find out why almost 70% of the students selected the incorrect response. Was the item flawed? Were the learning experiences ineffective? If you can identify the reasons for the flaws in this item, you have a great opportunity to revise the options so the item will provide the information you require the next time you use it in a test.

The results in Table 11.10 wave another red flag. If this question is testing an important concept, the students need another opportunity to master the content.
Perhaps you need to offer new learning opportunities and retest the content on a future test.

Look carefully at item 6 in Table 11.11. If you were to look only at the difficulty level for this item, you might discard it because it is so difficult. The data indicate, however, that the item has excellent discrimination capability and each of the three distractors has a negative PBI. Perhaps the item is a challenging one that enables the test to identify high-achieving students. Or maybe the item was misleading; even though option D has a negative PBI, 38% of the students selected this incorrect option. To determine the true value of this item, you need to keep the data in mind while you conduct a qualitative review of the item itself.

Table 11.11 Difficult Item with Acceptable Data

Item 6
p Value    PBI     Option   p Value for Option   PBI       Key
0.32       0.48    A        0.32                 0.48      A
                   B        0.13                 −0.24
                   C        0.18                 −0.04
                   D        0.38                 −0.29

Item 7 in Table 11.12 illustrates how an item that is too easy cannot discriminate. Everyone answered this question correctly, and no one chose any of the distractors. This item might be too easy, or maybe it is testing a concept that you want to be certain that the students understand. Be sure you are really honest with yourself when evaluating this item. As was mentioned previously, items that more than 90% of students or fewer than 30% answer correctly lend very little to the quality of a test. Items such as number 7 must be carefully reviewed and revised before they are entered into an item bank. When you have an item that has a very high or very low p value, be especially careful to examine each option to determine its plausibility.

Table 11.12 Easy Item

Item 7
p Value    PBI     Option   p Value for Option   PBI       Key
1.00       —       A        0.00                 —
                   B        0.00                 —
                   C        1.00                 —         C
                   D        0.00                 —

While items 2 and 3 each have a positive PBI on an incorrect option, the incorrect options on both items had very low p values.
In contrast, item 8 in Table 11.13 has a positive PBI on distractor C, and distractor C has a substantially high p value: 32% of the students chose option C. These data require close investigation of the item. Why were so many students attracted to option C? Was there an element of truth in it? If so, option C should be accepted as correct. This finding is another example of the need to have a colleague review your test before you administer it. Remember to ask your colleague to pay special attention to the incorrect options to ensure that they are absolutely incorrect because the incorrect options usually cause the flaws in an item. Better to identify a flawed item before an exam than to have to make adjustments after it is given.

Table 11.13 Item with Weak PBI on Key and Positive PBI on One Distractor

Item 8
p Value    PBI      Option   p Value for Option   PBI       Key
0.57       0.168    A        0.04                 −0.147
                    B        0.07                 −0.417
                    C        0.32                 0.12
                    D        0.57                 0.168     D

The data for item 9 in Table 11.14 indicate that the question has serious flaws. First, the correct answer has a negative PBI, which means that the low achievers on the test selected the option more frequently than the high achievers did. In addition, options A and D were chosen more frequently than the correct answer, and options B and D both have a positive PBI. These data are consistent with a very confusing or negative stem. Or perhaps it was keyed incorrectly and option D is the correct answer. These data are typical for an item with a negative stem and underscore how confusing negative stems can be. Obviously, the item must be reviewed and, unless it was keyed incorrectly, the item should be removed from the test.

Table 11.14 Item with Negative PBI on Key and Positive PBI on Two Distractors

Item 9
p Value    PBI      Option   p Value for Option   PBI       Key
0.352      −0.22    A        0.402                −0.01
                    B        0.183                0.15
                    C        0.352                −0.22     C
                    D        0.422                0.10

Item 10 in Table 11.15 is another example of an item that has very good data.
Every option was selected, the correct option has a positive PBI, and each of the distractors has a negative PBI. Remember, the data are not the whole story. An item can have very good data and not measure anything of importance. Therefore, it is essential to review the actual item to determine its value.

Table 11.15 Item with Very Good Data

Item 10
p Value    PBI     Option   p Value for Option   PBI       Key
0.714      0.43    A        0.159                −0.147
                   B        0.0635               −0.217
                   C        0.714                0.43      C
                   D        0.0635               −0.168

It is important to examine your very good items carefully. Faculty members often focus on what doesn’t work and overlook the items that make a positive contribution to the test. When you identify items that have very good data and that measure important concepts, use them as models to write parallel items. The more exposure you have to what constitutes a good item, the better your item-writing skills will be. Table 11.16 summarizes the item data in Tables 11.6 through 11.15.

Table 11.16 Examples of Item Data

Item 1     Table 11.6     Item with very good data
Item 2     Table 11.7     Item with positive PBI on incorrect option
Item 3     Table 11.8     Item with low PBI on key and positive PBI on incorrect option
Item 4     Table 11.9     Item with two nondistractors
Item 5     Table 11.10    Difficult item with negative PBI on key and one nondistractor
Item 6     Table 11.11    Difficult item with acceptable data
Item 7     Table 11.12    Easy item
Item 8     Table 11.13    Item with weak PBI on key and positive PBI on one distractor
Item 9     Table 11.14    Item with negative PBI on key and positive PBI on two distractors
Item 10    Table 11.15    Item with very good data

It is clear from the sample item analysis review in this chapter that the statistical data provided in most software reports can provide valuable information for making scoring decisions and for improving your items for future use. An understanding of the basic concepts goes a long way toward helping you translate these data.
Follow the steps outlined in Exhibit 11.3 to guide you through the test and item analysis process and review.

Exhibit 11.3 Statistical test analysis

Overall Analysis

- Check data for completeness.
- Assess the relationship of the mean and the median.
- Examine the relationship of the mean to the passing standard.
- Check the score range and standard deviation.
- Examine the reliability coefficient.
- Determine the SEM.
- Examine the mean p value.
- Evaluate the mean PBI.
- Examine the score frequency distribution.
- Assess the test’s histogram.

Individual Item Analysis

- Assess each item’s p value.
- Examine each item’s PBI.
- Verify that the key has a positive PBI.
- Identify whether any distractor has a positive PBI.
- Identify distractors that were not chosen.
- Review items that breach minimum standards.
- Consider discarding or accepting multiple answers on items that are flawed.
- Revise items based on data and student review before entering them into an item bank.
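The Individual Item Analysis checks in Exhibit 11.3 lend themselves to a simple screening routine. The following Python sketch is one possible way to flag items for review, not a description of any scoring package: the function name, the data layout, and the flag wording are hypothetical, and the thresholds (p below 0.30, p above 0.90, PBI below 0.20) are the rules of thumb quoted earlier in this chapter. The option data are items 1, 2, 4, and 5 from Table 11.5, with None standing in for an undefined PBI.

```python
# Illustrative sketch only; names and data layout are hypothetical.

def review_item(key, options, min_pbi=0.20, p_low=0.30, p_high=0.90):
    """Return a list of flags for one item.

    options maps each option letter to (p value, PBI); PBI is None
    when no one chose the option or the value is undefined.
    """
    flags = []
    p_key, pbi_key = options[key]
    if p_key < p_low:
        flags.append("very difficult")
    elif p_key > p_high:
        flags.append("very easy")
    if pbi_key is None:
        flags.append("key PBI undefined")
    elif pbi_key <= 0:
        flags.append("key PBI not positive")
    elif pbi_key < min_pbi:
        flags.append("key PBI below minimum")
    for option, (p, pbi) in options.items():
        if option == key:
            continue
        if p == 0:
            flags.append(f"option {option} not chosen")
        elif pbi is not None and pbi > 0:
            flags.append(f"distractor {option} has positive PBI")
    return flags


# (key, option data) for items 1, 2, 4, and 5 in Table 11.5.
items = {
    1: ("D", {"A": (0.21, -0.30), "B": (0.03, -0.47),
              "C": (0.05, -0.29), "D": (0.71, 0.42)}),
    2: ("C", {"A": (0.13, -0.19), "B": (0.11, 0.02),
              "C": (0.70, 0.26), "D": (0.06, -0.23)}),
    4: ("A", {"A": (0.76, 0.27), "B": (0.00, None),
              "C": (0.24, -0.27), "D": (0.00, None)}),
    5: ("B", {"A": (0.045, -0.063), "B": (0.258, -0.013),
              "C": (0.000, None), "D": (0.697, 0.040)}),
}

report = {number: review_item(key, options)
          for number, (key, options) in items.items()}
for number, flags in report.items():
    print(number, flags or ["no flags"])
```

A flag is only a prompt to look at the item itself; as the chapter stresses, the decision to rescore, revise, or retain an item rests on a qualitative review and professional judgment.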
Using Item Analysis to Improve Items

In addition to using item analysis data to analyze the functioning of an item so that decisions can be made about an overall test score, you can also use the data as a powerful vehicle for improving the reliability and validity of classroom test results by guiding the revision and improvement of the individual test items. The revised items can then be included in an item bank, which provides a valuable resource for future test development.

Multiple-choice items lend themselves most effectively to item analysis. The statistics obtained for each item provide information to guide you in improving the item. Remember that there are no dogmatic criteria for revising items; the data should be used as a guide in conjunction with your professional judgment. Item revision should be based on both qualitative and quantitative analyses of the item.

Examine the items in Exhibits 11.4 through 11.9. These examples illustrate how qualitative and quantitative analyses can be combined to improve your items for item banking. These specific examples illustrate the potential of item analysis. The combination of statistical data and your professional expertise provides a powerful tool for test development that should not be overlooked.

Exhibit 11.4 Item analysis example 1

Original

A client has returned to her room from the recovery room S/P Billroth procedure. The nurse’s initial assessment includes BP 110/65, pulse 86 and regular, respirations 18, and temperature 100.8. Abdominal dressing is dry and intact. Levin tube is set to low suction draining 25–30 cc/hr; yellow-brown drainage is noted.

The nurse caring for the client notes the amount and color of the drainage from the Levin tube. The nurse concludes that the color indicates the presence of:

A. normal gastric contents with residual bile.
B. normal gastric contents with old blood.
C. empty stomach with residual fecal matter.
D. empty stomach with residual gastric contents.

Correct answer: B
p value: 0.97
PBI: 0.02

Choice   Proportion Response   PBI
A        0.00                  —
B        0.97                  0.02
C        0.03                  −0.02
D        0.00                  —

Revised

A nurse assesses a client who had a Billroth procedure 24 hours ago and identifies the following client findings: blood pressure, 110/65 mmHg; pulse, 87/min and regular; respiration, 18/min; oral temperature, 100.0°F (37.8°C); nasogastric tube to low suction, draining 20–25 mL/hr of yellow-brown drainage. The nurse should recognize that these findings indicate:

A. normal stomach contents with old blood.
B. impending hemorrhage from the stomach.
C. reflux of intestinal contents with old blood.
D. leakage of stomach contents into the peritoneum.

Correct answer: A
p value: 0.72
PBI: 0.29

Choice   Proportion Response   PBI
A        0.72                  0.29
B        0.18                  −0.09
C        0.07                  −0.22
D        0.03                  −0.23

Exhibit 11.5 Item analysis example 2

Original

A young man who has tumbled down a ski slope without one of his skis is lying at the foot of the slope complaining of severe pain in his lower leg. The nurse should:

A. elevate his injured leg.
B. cover both legs with a blanket.
C. assess for compartment syndrome.
D. splint his injured leg.

Correct answer: D
p value: 0.87
PBI: 0.14

Choice   Proportion Response   PBI
A        0.08                  −0.15
B        0.05                  −0.05
C        0.00                  0.00
D        0.87                  0.14

Revised

A high school student falls down a flight of stairs at school and reports severe pain in the right lower leg. Which of these actions should a school nurse take?

A. Splint the student’s right leg.
B. Elevate the student’s right leg.
C. Cover the student’s right leg with a blanket.
D. Assess the student’s right leg for range of motion.

Correct answer: A
p value: 0.71
PBI: 0.37

Choice   Proportion Response   PBI
A        0.71                  0.37
B        0.07                  −0.20
C        0.03                  −0.10
D        0.19                  −0.48

This item illustrates an easy item; 87% of the students answered it correctly.

The options in the original item are not homogeneous, which probably accounts for the poor biserial (0.14) for the question. While the correct answer has the only positive biserial (0.14) and two of the distractors have a negative biserial, option C was not chosen at all.

The revised item removes extraneous information, eliminates gender, increases the homogeneity of the options, and substitutes an alternative for option C. Note that the item analysis data indicate that the rewritten item was very effective. Because of the improvements, the revised item discriminates much more effectively than the original, it is more difficult, every option was selected, and each of the distractors has a negative biserial.

Exhibit 11.6 Item analysis example 3

Original

The pharmacological action of antacids such as Mylanta is to:

A. coat the stomach mucosa.
B. decrease gastric motility.
C. elevate gastric pH.
D. decrease duodenal pH.

Correct answer: C
p value: 0.56
PBI: 0.24

Choice   Proportion Response   PBI
A        0.44                  −0.24
B        0.00                  —
C        0.56                  0.24
D        0.00                  —

Revised

A nurse should explain to a client that antacids, such as aluminum hydroxide (Mylanta), act to:

A. elevate gastric pH.
B. coat the gastric mucosa.
C. inhibit acid production.
D. neutralize lactose intolerance.

Correct answer: A
p value: 0.67
PBI: 0.43

Choice   Proportion Response   PBI
A        0.67                  0.43
B        0.10                  −0.33
C        0.18                  −0.26
D        0.04                  −0.29

This item demonstrates how important it is to look at the whole picture.
The item was difficult for the group, with a p value of 0.56 and a reasonable point biserial (0.24). If your analysis stopped at this point, you would not see that options B and D were not chosen. You would miss the fact that the item is really a two-option item.

The revised item replaces both distractors that were not selected in the original item. It is evident from the statistical analysis that the revised item is a much more effective item.

Although it is still relatively difficult, it has an excellent point biserial, every option was selected, and all distractors have negative biserials. Option D was the weakest selection.

This is a case where your professional judgment is needed to decide whether to revise the option or retest the item as it is to collect additional data.

Exhibit 11.7 Item analysis example 4

Original

Increasing the flow rate of total parenteral nutrition (TPN) above the prescribed rate is dangerous because it can result in:

A. osmotic diuresis and hypoglycemia.
B. hypoglycemia and dumping syndrome.
C. dumping syndrome and electrolyte imbalance.
D. electrolyte imbalance and osmotic diuresis.

Correct answer: D
p value: 0.17
PBI: 0.03

Choice   Proportion Response   PBI
A        0.70                  −0.13
B        0.11                  0.17
C        0.03                  −0.01
D        0.17                  0.03

Revised

A client is receiving total parenteral nutrition (TPN). A nurse should ensure that the TPN does not exceed the prescribed flow rate to prevent:

A. hypoglycemia.
B. pneumothorax.
C. dumping syndrome.
D. electrolyte imbalance.

Correct answer: D
p value: 0.66
PBI: 0.23

Choice   Proportion Response   PBI
A        0.09                  −0.23
B        0.09                  −0.12
C        0.15                  −0.10
D        0.66                  0.23

With a p value of 0.17 and a point biserial of 0.03, this item would definitely be a candidate for elimination from the test. About 70% of the students chose distractor A, which has a negative biserial (−0.13), whereas distractor B, which was chosen by about 11% of the students, has a positive biserial (0.17).
The difficulty level of the item and point biserial results indicate that the question is ambiguous, confusing, and in need of revision.An examination of the item helps to explain the item analysis. This item violates item-writing guidelines: Options A and C are partially correct, and all the options overlap with at least one other option. This probably accounts for the students’ confusion with the question. Should the item simply be discarded? If faculty members believe that the item is testing important information, it is worth it to attempt a rewrite, especially because you can use the item analysis as a guide.The rewrite of the item presented here simplifies the question and still tests the key concept. The p value of the revised item (0.66) keeps it in the difficult range. However, the item analysis shows that this version is much more effective than the original. The item has an acceptable point biserial, every option was selected, and every distractor has a negative biserial. Remember, every item should be considered for revision. Would you make any changes before reusing this item on a test?Exhibit 11.8 Item analysis example 5OriginalA nurse caring for a patient with a chest tube connected to a Pleur-Evac system knows that:bubbling in the water seal should be intermittent.bubbling should be continuous and constant.there should be no bubbling in the water seal.bubbling will be seen in the suction regulator.RevisedA client has a chest tube connected to an underwater-seal drainage system. 
Which of these observations of the drainage system should the nurse recognize as indicating that the system is functioning properly?
A. Fluctuations in the water-seal chamber
B. Fluctuations in the collection chamber
C. Continuous bubbling in the collection chamber
D. Continuous bubbling in the water-seal chamber

Original item statistics
Correct answer: C | p value: 0.84 | PBI: 0.07
Choice | Proportion Response | PBI
A | 0.05 | −0.16
B | 0.11 | 0.03
C | 0.84 | 0.07
D | 0.00 | n/a

Revised item statistics
Correct answer: A | p value: 0.78 | PBI: 0.30
Choice | Proportion Response | PBI
A | 0.78 | 0.30
B | 0.05 | −0.27
C | 0.03 | −0.02
D | 0.12 | −0.16

The original item has a p value of 0.84 and a very weak biserial (0.07). One likely reason that the item is a poor discriminator is that distractor B has a positive biserial and no one chose distractor D. This is another item whose faults would be overlooked without careful item analysis.

The revision is composed of more homogeneous options that increased both the difficulty and discrimination values of the item. The revised item is more difficult than the original, many of the flaws are removed, and the item no longer confuses the high-achieving students.
All options are chosen, and all distractors have negative biserials.

Further revision of this item for retesting is a matter of professional judgment.

Exhibit 11.9 Item analysis example 6

Original
A nurse would recognize that a patient is attempting to resist infection when diagnostic laboratory values reveal an elevated:
A. red blood cell count.
B. white blood cell count.
C. partial thromboplastin time.
D. hematocrit.

Revised
When assessing a male client, a nurse identifies that the client has the laboratory findings identified in the chart below.

Laboratory Value | Normal | Client
Red blood cells | 4.2–6.9 million/cu mm | 4.5 million/cu mm
White blood cells | 4,300–10,800/cu mm | 15,000/cu mm
Hematocrit | 45%–62% | 58%

Which of these measures should the nurse include in the client’s care plan?
A. Monitor the client’s body temperature.
B. Move the client to an isolation room.
C. Observe the client for bleeding.
D. Advise the client to eat iron-rich foods.

Original item statistics
Correct answer: B | p value: 1 | PBI: 0
Choice | Proportion Response | PBI
A | 0 |
B | 1.00 | 0.0
C | 0 |
D | 0 |

Revised item statistics
Correct answer: A | p value: 0.76 | PBI: 0.52
Choice | Proportion Response | PBI
A | 0.76 | 0.52
B | 0.05 | −0.27
C | 0.03 | −0.02
D | 0.12 | −0.16

The original is an example of an item that is too easy. An item such as this does not contribute to the validity of the test’s results. First, it represents recall. All the student has to do is remember that an elevated white blood cell count indicates that the patient has an infection. No interpretation is required. No judgment needs to be made. No action needs to be taken.

However, the question does address an important concept: It is important for the student to recognize the signs of infection. And just as important, the student should know what to do when a client has an infection.

The revision addresses the criteria for developing items that assess critical thinking. Instead of being told that the client has elevated WBCs, the student is required to interpret a chart of laboratory values.
Once the student identifies the problem, the student has to decide which of the actions is appropriate.

The data indicate that the revision is effective. The item’s difficulty is increased, and the biserial is an excellent one. In addition, all the options are chosen, and each of the distractors has a negative biserial.

Incorporating Student Comments

The item analysis examples illustrate that quantitative analysis cannot be used in isolation; your professional judgment is critical to successful item revision. Your qualitative review and the comments that accompany posttest review by both students and your colleagues provide a valuable resource for editing items before they are banked.

When revising items, it is most important to write distractors that attract the uninformed students. Student comments during test review often lend themselves to the creation of effective distractors. Be sure to take careful notes of student remarks during review sessions. These remarks provide you with valuable leads for developing effective distractors.
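The screening applied by eye in the item analysis examples above can be sketched in a few lines of Python. This is only an illustration, not a feature of any particular testing software: the function name `flag_item`, the data layout, and the 0.20 cutoff for the key's point biserial are assumptions made for this sketch (the cutoff is borrowed from the Kehoe example cited in this chapter). The option statistics are those of the original and revised items in Exhibit 11.8.

```python
from typing import Dict, List, Optional, Tuple

# Each option maps to (proportion choosing it, option PBI or None if nobody chose it).
OptionStats = Dict[str, Tuple[float, Optional[float]]]


def flag_item(options: OptionStats, key: str, min_pbi: float = 0.20) -> List[str]:
    """Return review flags for one item; an empty list means nothing was flagged.

    The min_pbi cutoff is an assumed rule of thumb, not a fixed standard.
    """
    flags = []
    _, key_pbi = options[key]
    if key_pbi is None or key_pbi < min_pbi:
        flags.append(f"key {key}: PBI below {min_pbi:.2f}")
    for letter, (proportion, pbi) in sorted(options.items()):
        if letter == key:
            continue
        if proportion == 0:
            flags.append(f"distractor {letter}: not chosen")
        elif pbi is not None and pbi > 0:
            flags.append(f"distractor {letter}: positive PBI")
    return flags


# Original item from Exhibit 11.8 (correct answer C).
exhibit_11_8_original = {
    "A": (0.05, -0.16),
    "B": (0.11, 0.03),
    "C": (0.84, 0.07),
    "D": (0.00, None),
}

# Revised item from Exhibit 11.8 (correct answer A).
exhibit_11_8_revised = {
    "A": (0.78, 0.30),
    "B": (0.05, -0.27),
    "C": (0.03, -0.02),
    "D": (0.12, -0.16),
}

print(flag_item(exhibit_11_8_original, key="C"))
print(flag_item(exhibit_11_8_revised, key="A"))
```

A flag is only a prompt for review. As the examples show, deciding whether to revise, retest, or discard an item still depends on your professional judgment and on the student comments.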
Assigning Test Scores

Once you have collected data from qualitative, statistical, and student review, you can assign scores to an exam. While you might decide to discard an item or accept more than one correct answer for a question, it is usually best not to add points to individual exams. Adding points in this way is referred to as scaling scores; see Chapter 13, “Assigning Grades,” for more information.

Flawed Items

Items that are seriously flawed should not be counted as part of the final test score. Eliminating poorly functioning items from a test can increase the test’s reliability coefficient. Kehoe (1995) provides an example of how eliminating seven items that had PBIs below 0.20 from a test of 30 items with a reliability coefficient of 0.79 resulted in a 23-item test with a reliability coefficient of 0.88.

It is impossible to determine whether an item is seriously flawed based on either quantitative or qualitative analysis alone. To assess the items fairly, you must include the overall test data, item analysis data, student review, and your qualitative review of the items in question. Chapter 14, “Instituting Item Banking and Test Development Software,” offers guidelines for using computerized item banking to examine all aspects of an item to determine its quality. If you conclude that an item is flawed based on your comprehensive analysis, you might decide to accept more than one option as the correct answer for the item or to eliminate the item from the test. It is important to note that, because of measurement error, giving the benefit of the doubt to the students is usually the fairest approach.

Adjusting Test Scores to Account for Flawed Items

How do you adjust a test score to account for a flawed item? You could simply add a point to everyone’s score, which would obviously increase everyone’s score. However, those who answered the flawed item correctly in the first place would actually receive two points for the flawed item.
This approach would be considered psychometrically unsound. Two alternate approaches are commonly followed. Let’s examine them both so you can decide which one is best suited to your grading philosophy.

The first approach is to discard the flawed item by adjusting the key to accept all the possible answers while keeping the original number of possible points. Table 11.17 illustrates this approach for a 10-item test with 1 flawed item.

Table 11.17 Flawed Item Removed, Key Adjusted, Possible Points Maintained

Student A (Passing = 70%)
Initial Possible Points | Initial Raw Score | Percent Correct | Flawed Items Removed | Initial Answer on Flawed Item | Revised Score | Revised Percent Correct
10 | 7/10 | 70% | 1 | Correct | 7/10 | 70%

Item | Key | Student A Answer | Raw Score* | Adjusted Score*
1 | A | A | c | c
2 | C | C | c | c
3 | B | B | c | c
4 | C | D | x | x
5 | A B C D | D | c | c
6 | B | A | x | x
7 | D | C | x | x
8 | A | A | c | c
9 | C | C | c | c
10 | B | B | c | c
SCORE | | | 7/10 | 7/10

Student B (Passing = 70%)
Initial Possible Points | Initial Raw Score | Percent Correct | Flawed Items Removed | Initial Answer on Flawed Item | Revised Score | Revised Percent Correct
10 | 7/10 | 70% | 1 | Incorrect | 8/10 | 80%

Item | Key | Student B Answer | Raw Score* | Adjusted Score*
1 | A | A | c | c
2 | C | C | c | c
3 | B | B | c | c
4 | C | C | c | c
5 | A B C D | B | x | c
6 | B | B | c | c
7 | D | C | x | x
8 | A | A | c | c
9 | C | C | c | c
10 | B | A | x | x
SCORE | | | 7/10 | 8/10

* c = correct, x = incorrect

Note that Student A answered the flawed item (number 5) correctly and has a raw score of 7/10. When the key is adjusted to accept all the possible answer options, Student A’s score remains at 7/10, or 70%, because the number of possible points is kept at 10.

Table 11.17 illustrates that Student B also had an original raw score of 7/10, or 70%. Because Student B initially answered the flawed item incorrectly, however, Student B’s score increases to 8/10 when all the possible options are accepted as correct. Student B’s adjusted score is 80%.

Now, let’s examine Table 11.18. This table illustrates the results of four students on a test of 50 items with three flawed items removed.
All four students have a raw score of 76%, which is passing for this course.

Table 11.18 Three Flawed Items Removed from 50-Item Test, Key Adjusted, Possible Points Maintained
(Passing = 76% for all four students. Each flawed-item cell shows the student’s raw score answer and the points added to the score.)

Student | Possible Points | Raw Score | Percent Correct | Flawed Item 1 | Flawed Item 2 | Flawed Item 3 | Adjusted Score | Adjusted Percent Correct
A | 50 | 38 | 76% | Correct, 0 | Correct, 0 | Correct, 0 | 38/50 | 76%
B | 50 | 38 | 76% | Correct, 0 | Correct, 0 | Incorrect, 1 | 39/50 | 78%
C | 50 | 38 | 76% | Correct, 0 | Incorrect, 1 | Incorrect, 1 | 40/50 | 80%
D | 50 | 38 | 76% | Incorrect, 1 | Incorrect, 1 | Incorrect, 1 | 41/50 | 82%

Student A answered all the flawed items correctly, so Student A’s revised score remains at 76% once the flawed items are removed. Student B answered one of the flawed items incorrectly, which means that Student B’s adjusted score increases to 78% when all the options on the flawed items are accepted. Similarly, Student C’s score increases to 80% and Student D’s score increases to 82% when all the options on the flawed items are accepted as correct.

The second approach to accounting for flawed items involves removing the flawed items and adjusting the total number of possible points. Table 11.19 illustrates this approach for the same 10-item test with one flawed item that was described in Table 11.17.
As Table 11.19 illustrates, however, the final results are quite different from the results in Table 11.17.

Table 11.19 Flawed Item Removed, Possible Points Adjusted

Student A (Passing = 70%)
Initial Possible Points | Initial Raw Score | Percent Correct | Flawed Items Removed | Adjusted Possible Points | Initial Answer on Flawed Item | Revised Score | Revised Percent Correct
10 | 7/10 | 70% | 1 | 9 | Correct | 6/9 | 66.7%

Item | Key | Student A Answer | Raw Score* | Adjusted Score*
1 | A | A | c | c
2 | C | C | c | c
3 | B | B | c | c
4 | C | D | x | x
5 | D | D | c | Removed
6 | B | A | x | x
7 | D | C | x | x
8 | A | A | c | c
9 | C | C | c | c
10 | B | B | c | c
SCORE | | | 7/10 | 6/9

Student B (Passing = 70%)
Initial Possible Points | Initial Raw Score | Percent Correct | Flawed Items Removed | Adjusted Possible Points | Initial Answer on Flawed Item | Revised Score | Revised Percent Correct
10 | 7/10 | 70% | 1 | 9 | Incorrect | 7/9 | 77.8%

Item | Key | Student B Answer | Raw Score* | Adjusted Score*
1 | A | A | c | c
2 | C | C | c | c
3 | B | B | c | c
4 | C | C | c | c
5 | D | B | x | Removed
6 | B | B | c | c
7 | D | C | x | x
8 | A | A | c | c
9 | C | C | c | c
10 | B | A | x | x
SCORE | | | 7/10 | 7/9

* c = correct, x = incorrect

Let’s look first at Student A in Table 11.19. Student A has an initial raw score of 7/10, which is a passing score of 70%. As in Table 11.17, Student A answered the flawed item (number 5) correctly. When the flawed item is removed from the test, however, Student A’s passing score of 7/10 changes to 6/9, which is 66.7%, a failing score.

On the other hand, Student B, who also has an initial passing score of 7/10, has answered the flawed item (number 5) incorrectly. So, when item number 5 is eliminated from the test, Student B’s score becomes 7/9, or 77.8%.
As you can see, adjusting the possible number of points has a more dramatic effect on the students’ final scores than simply adjusting the key and accepting all the responses as correct.

The effect of adjusting the possible number of points is illustrated even more clearly in Table 11.20.

Table 11.20 Three Flawed Items Removed from a 50-Item Test, Possible Points Adjusted
(Passing = 76% for all four students. Each flawed-item cell shows the student’s raw score answer and the resulting change to the raw score.)

Student | Possible Points | Raw Score | Percent Correct | Flawed Item 1 | Flawed Item 2 | Flawed Item 3 | Adjusted Possible Points | Adjusted Score | Adjusted Percent Correct
A | 50 | 38 | 76% | Correct, −1 | Correct, −1 | Correct, −1 | 47 | 35/47 | 74.5%
B | 50 | 38 | 76% | Correct, −1 | Correct, −1 | Incorrect, 0 | 47 | 36/47 | 76.6%
C | 50 | 38 | 76% | Correct, −1 | Incorrect, 0 | Incorrect, 0 | 47 | 37/47 | 78.7%
D | 50 | 38 | 76% | Incorrect, 0 | Incorrect, 0 | Incorrect, 0 | 47 | 38/47 | 80.9%

Both Table 11.18 and Table 11.20 illustrate the results for four students on the same 50-item test that has three flawed items. The key-adjusted approach in Table 11.18 results in all four students obtaining a passing grade, while the adjusting-points approach illustrated in Table 11.20 results in Student A receiving a failing grade.

Which approach will you choose to use for flawed items? The choice is up to you. Remember, however, that fairness to students is the pivotal issue here. Students should not be held accountable for defective items in a test.
Also keep in mind that the students will have great difficulty accepting that they have gone from a passing to a failing score, which is why I recommend using the key-adjusted approach.

However, the choice is a faculty decision. As long as the students are informed of the method that will be implemented, faculty members are justified in implementing whichever method they deem appropriate. The ultimate goal is to remove all defective items from the test bank so you are not caught in this dilemma. This goal can be accomplished by following the guidelines presented in this text.

Whichever approach you decide to use to adjust scores for flawed items, make sure that these items are revised before they are entered into the item bank to ensure they are not reused in their defective condition. Flawed items often have the potential to be revised as very effective items. Careful editing that considers the qualitative and quantitative analysis and that incorporates student comments can assist you with transforming these items into items that contribute to valid and reliable results from your measurement instruments.

Returning Scores to Students

Teachers must carefully consider the issue of confidentiality when returning scores to students. Several test development programs provide individual score reports that can be distributed to students confidentially, and many schools have the ability to distribute scores confidentially on the Internet. If your school’s practice is to post student grades, be careful to follow the school’s protocol and assign a secret identification number to each student.

Another important consideration is timeliness. Teachers often assign strict deadlines for assignment submissions, yet they are very lax about returning the same assignments in a timely manner.
Although careful consideration of grade assignment requires time, students should not be required to wait so long that the feedback from an exam or written assignment is meaningless.

On the other hand, students often want their test scores before they leave the classroom. It is to the students’ advantage to wait. Faculty members need several days to review an exam carefully and examine the data before returning scores to students. For a discussion related to scoring and student test review, refer to Chapter 9, “Assembling, Administering, and Scoring a Test.” Keep in mind that, in the interest of fairness, you should set a return deadline with the students and adhere to that deadline.
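Before you settle on a policy for flawed items, it can help to see the arithmetic behind the two score-adjustment approaches side by side. The following Python sketch is only an illustration of the calculations described in this chapter; the function names `key_adjusted_percent` and `points_adjusted_percent` are hypothetical and are not taken from any testing software. The inputs are the four students from the 50-item example: each has a raw score of 38, and they missed zero, one, two, and three of the three flawed items, respectively.

```python
def key_adjusted_percent(raw_score: int, total_items: int, flawed_missed: int) -> float:
    """First approach: accept every option on the flawed items.

    Students gain one point for each flawed item they originally missed,
    and the number of possible points stays the same.
    """
    return 100 * (raw_score + flawed_missed) / total_items


def points_adjusted_percent(
    raw_score: int, total_items: int, flawed_total: int, flawed_missed: int
) -> float:
    """Second approach: remove the flawed items from the test.

    Students lose the points they earned on the flawed items,
    and the number of possible points drops by the number of flawed items.
    """
    flawed_correct = flawed_total - flawed_missed
    return 100 * (raw_score - flawed_correct) / (total_items - flawed_total)


# Number of flawed items each student originally answered incorrectly.
flawed_missed_by_student = {"A": 0, "B": 1, "C": 2, "D": 3}

for student, missed in flawed_missed_by_student.items():
    key_adjusted = key_adjusted_percent(38, 50, missed)
    points_adjusted = points_adjusted_percent(38, 50, 3, missed)
    print(
        f"Student {student}: key adjusted {key_adjusted:.1f}%, "
        f"possible points adjusted {points_adjusted:.1f}%"
    )
```

Running the same two functions on the 10-item example (raw score 7, one flawed item) shows why the second approach can move a student from a passing to a failing score: a student who answered the flawed item correctly drops from 7/10 to 6/9.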
Summary

Systematic test and item analysis procedures ensure the fairness and accuracy of individual items and the test as a whole. It is impossible to evaluate the effectiveness of an item on a test without examining all the relevant data. Statistical test and item analysis data provide the essential tools for objective review of test results. This chapter provides actual examples of test analysis data to illustrate how these data can be used for objective interpretation of test results. While your first attempt at conducting these analyses will be time consuming, the procedure will become streamlined as you become more proficient and as you incorporate improved items from your item bank in your exams.

Although these data can be very useful for improving your assessments, you must remember that the data are only useful as general indicators, not as precise measures. Data interpretation should be used as only one of the considerations to make when determining the fairness of an assessment; your professional judgment must guide the process.

Learning Activities

Define the measures of central tendency included in Table 11.1. Explain how the measures of central tendency would skew a distribution negatively or positively.

Define standard deviation. Explain why you would expect a small standard deviation on a test administered to a homogeneous group of high-achieving students.

Which of these tests would most likely yield a low reliability coefficient?
A test of 50 challenging items administered to a large group of heterogeneous students
A test of 10 easy items administered to a small group of homogeneous students
Explain your answer.

Define p value. Define discrimination ability. How are the difficulty and discrimination power of an item related? Explain why a high mean PBI indicates that a test contains high-quality items.

How would you adjust test scores for flawed items?
Explain your rationale for the approach you select.

Review and interpret the data from the item analysis in the table below. Discuss your analysis with your colleagues.

Item | Item p Value | Item PBI | Option | Option p Value | Option PBI | Key
1 | 0.737 | 0.15 | A | 0.053 | −0.21 |
  |  |  | B | 0.053 | −0.33 |
  |  |  | C | 0.737 | 0.15 | C
  |  |  | D | 0.158 | 0.15 |
2 | 0.789 | 0.02 | A | 0.105 | 0.06 |
  |  |  | B | 0.789 | 0.02 | B
  |  |  | C | 0.053 | 0.16 |
  |  |  | D | 0.053 | 0.21 |
3 | 0.632 | 0.44 | A | 0.053 | 0.04 |
  |  |  | B | 0.056 | 0.45 |
  |  |  | C | 0.263 | 0.28 |
  |  |  | D | 0.632 | 0.44 | D
4 | 0.632 | 0.01 | A | 0.211 | 0.12 |
  |  |  | B | 0.00 | 0.00 |
  |  |  | C | 0.632 | 0.01 | C
  |  |  | D | 0.158 | 0.15 |
5 | 0.895 | 0.57 | A | 0.053 | −0.33 |
  |  |  | B | 0.053 | 0.45 |
  |  |  | C | 0.895 | 0.57 | C
  |  |  | D | 0.00 | 0.00 |

Review the items and their data in the tables below. Rewrite the items to improve their performance. Consider both the data and the item analysis for each item. Discuss your revisions with your colleagues. In each table, an asterisk (*) marks the correct answer.

A nurse is assessing a patient who has insulin-dependent diabetes mellitus (type 1). The patient has a blood sugar of 60 mg/dL. Which of these additional findings should the nurse expect to identify?
A. Weakness, diaphoresis, and confusion
B. Kussmaul’s respirations, acetone breath, and headache
C. Polydipsia, polyuria, and polyphagia
D. Lethargy, flushed face, and somnolence

Option | A* | B | C | D
PBI | 0.05 | −0.06 | −0.22 | 0.17
p value | 0.809 | 0.048 | 0.067 | 0.076

Which of these measures should a nurse include in the care plan for a patient who had a graft of the femoral artery 8 hours ago?
A. Keeping the patient’s legs elevated above the level of the heart
B. Encouraging increased fluid intake
C. Comparing pedal pulses every 2 hours
D. Assisting the patient to do isometric leg exercises

Option | A | B | C* | D
PBI | −0.039 | 0.00 | 0.31 | −0.08
p value | 0.144 | 0.00 | 0.81 | 0.046

A patient who had a transurethral prostatectomy (TURP) 6 hours ago has a continuous bladder irrigation (CBI).
The patient asks a nurse, “Why do I need this irrigation?” Which of these explanations should the nurse offer the patient?
A. “The irrigation will stop the bleeding in your bladder.”
B. “The irrigation keeps the catheter from being blocked by blood clots.”
C. “The irrigation promotes normal urine production until healing occurs.”
D. “The irrigation provides a route for giving antibiotics directly into the bladder.”

Option | A | B* | C | D
PBI | 0.04 | −0.18 | 0.06 | 0.00
p value | 0.095 | 0.841 | 0.064 | 0.00

A patient who had abdominal surgery 2 hours ago is in the postanesthesia unit. The patient has an intravenous infusing at 100 mL/hr. A nurse assesses that the patient has dyspnea, moist cough, and an O2 saturation of 92%. Which of these actions should the nurse take first?
A. Monitor the patient’s heart rate and blood pressure
B. Assess the patient for peripheral edema
C. Notify the physician
D. Slow the patient’s intravenous rate to 10 mL/hr

Option | A | B | C | D*
PBI | 0.07 | −0.21 | 0.07 | 0.14
p value | 0.011 | 0.492 | 0.064 | 0.333

A patient who had a Billroth II procedure this morning has all of these prescriptions. Which one should the nurse question?
A. Isotonic leg exercises every 2 hours
B. Ambulate tomorrow morning
C. Irrigate the nasogastric tube every 2 hours
D. Assist the patient to cough and deep breathe

Option | A | B | C* | D
PBI | −0.15 | 0.28 | 0.20 | −0.23
p value | 0.064 | 0.032 | 0.571 | 0.333

Web Links

EXCEL Spreadsheets for Classical Test Analysis
http://languagetesting.info/statistics/excel.html

Introductory Statistics
http://www.psychstat.missouristate.edu/introbook/sbk13m.htm

Schreyer Institute for Teaching Excellence
http://www.schreyerinstitute.psu.edu/Tools/ItemAnalysis/

Test Item Analysis Using an Excel Spreadsheet
http://www.eflclub.com/elvin/publications/2003/itemanalysis.html

References

Brookhart, S. M., & Nitko, A. J. (2014). Educational assessment of students (7th ed.). Upper Saddle River, NJ: Pearson Education.

Ebel, R. L. (1979). Essentials of educational measurement (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.

Frary, R. (1995). More multiple-choice item writing do’s and don’ts.
Blacksburg, VA: Virginia Polytechnic Institute and State University.

Frisbie, D. A. (1988). Reliability of scores from teacher-made tests. Educational Measurement: Issues and Practice, 7, 25–35.

Haladyna, T. M. (1997). Writing test items to evaluate higher order thinking. Needham Heights, MA: Allyn and Bacon.

Kehoe, J. (1995). Basic item analysis for multiple-choice tests. Practical Assessment, Research & Evaluation, 4(10). Retrieved from http://pareonline.net/getvn.asp?v=4&n=10

Miller, M. D., Linn, R. L., & Gronlund, N. E. (2009). Measurement and assessment in teaching (10th ed.). Upper Saddle River, NJ: Pearson Education.

Reynolds, C. R., Livingston, R. B., & Wilson, V. (2008). Measurement and assessment in education (2nd ed.). Boston, MA: Allyn & Bacon.