
The Item Analysis Report provides information about overall test reliability and about individual items for use in item and test improvement. Listed at the top of the Item Analysis Report are:
The Kuder-Richardson Formula 20 reliability coefficient is an index of the internal consistency of the test. The test is made up of items intended to measure knowledge of the subject being tested. The items in the test are not the only items that could have been used; they are simply a sample of all possible items. If a different sample of items were used, students would obtain somewhat different scores, but it would be expected that the higher-scoring students on one examination would be the higher-scoring students on the other examination and vice versa. The reliability coefficient indicates the degree to which this expectation of consistency is met.
The reliability coefficient can range from 0 to 1. A coefficient of 0 for a test indicates no consistency: a person cannot generalize from a score on the test to the broader domain of knowledge the test is supposed to represent. A coefficient of 1 indicates that exactly the same relative performance of each student would be expected on another test of the same domain of knowledge. Most commercial standardized achievement tests have reliabilities of about .9, but reliabilities of classroom examinations are seldom that high.
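As an illustration of how such a coefficient can be computed, the following is a minimal sketch of the usual textbook form of KR-20, (k / (k - 1)) * (1 - sum(p*q) / variance of total scores). It assumes items are scored 0/1 and that the variance divides by the number of students; the function name `kr20` and the data layout are hypothetical, and the report's own computation may differ in detail.

```python
def kr20(responses):
    """Kuder-Richardson Formula 20 for dichotomously scored items.

    responses: list of rows, one per student; each row is a list of
    0/1 item scores (1 = correct). Layout is assumed for illustration.
    """
    n_students = len(responses)
    n_items = len(responses[0])

    # Total score for each student and its population variance (divide by N).
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_students
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_students
    if var_total == 0:
        raise ValueError("total scores have zero variance; KR-20 is undefined")

    # Sum over items of p * q, where p is the proportion answering correctly.
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_students
        pq_sum += p * (1 - p)

    return (n_items / (n_items - 1)) * (1 - pq_sum / var_total)


# Five students, four items (made-up data).
example = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(example), 3))
```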
The most important factors affecting the reliability coefficient are:
The reliability coefficient does not indicate anything about the validity of a test. The test can be highly reliable, but if the content of the items does not represent what is supposed to be measured, the test will not be valid.
Item statistics provide data about students' responses to each test item in order to help judge its effectiveness. The two characteristics of the items of most interest are difficulty and discrimination. Also of interest is the effectiveness of the distracters (wrong answers). The following information is printed for each item in the Item Analysis Report.
ITEM NUMBER. The number of the item in the test. Any items omitted from scoring are not listed.
CORRECT RESP*. The keyed correct response. The asterisk indicates that the correct response is also marked by an asterisk next to the percent of responses to each choice listed to the right.
ITEM DIFF. Item difficulty. This shows the percent of students who answered the item correctly. Although this statistic (also referred to as p-value) is called item difficulty, note that the higher the percentage, the easier the item.
Items in the middle range of difficulty provide the best discrimination among students. Items that are quite difficult or quite easy do not discriminate among students. Such items may be useful if the intent of the test is to determine whether the students have all mastered the material, but they contribute little to the test if the intent is to determine which students know the most and which know the least. For tests intended to discriminate among students, optimum results are obtained when the items are similar in difficulty and have average p-values somewhat higher than halfway between a chance level and 1.00. For 5-, 4-, 3-, and 2-response (T/F) multiple-choice tests, these p-values are 60, 62, 67, and 75 percent, respectively.
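The target values quoted above appear to correspond to the midpoint between the chance level (1 divided by the number of response choices) and 1.00. That reading is an assumption, not something the report states; the short sketch below simply computes the midpoint so the figures can be checked. The function name `midpoint_p_value` is hypothetical.

```python
def midpoint_p_value(n_choices):
    """Proportion halfway between the chance level and 1.00.

    Assumes pure guessing succeeds with probability 1 / n_choices.
    """
    chance = 1 / n_choices
    return (chance + 1.0) / 2


for k in (5, 4, 3, 2):
    # Printed as a percentage, rounded to one decimal place.
    print(k, "choices:", round(100 * midpoint_p_value(k), 1))
```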
If an item is very difficult (low p-value), it may cover content that students have not learned, or it may be a faulty item, possibly even keyed incorrectly.
P-values are extremely dependent on the group for which they are computed. They do not indicate difficulty level in groups of different ability. Also, the p-values in standard and extended item analyses are appropriate for tests where most students have attempted every item. P-values for items not reached by substantial numbers of students will be seriously distorted. For such tests, the TESTAN item analysis is more appropriate.
POINT-BISERIAL CORREL. The point-biserial correlation coefficient shows the correlation between the item and the total score on the test and is used as an index of item discrimination. On highly discriminating items, students who know more about the subject matter in general (i.e., have higher total scores on the test) do better than those who know less. Point-biserial correlations can have negative values, indicating negative discrimination, when students who scored well on the total test did less well on the item than students with lower scores. It has been suggested that most items on a test should have point-biserial correlations of .29 or greater in a class of about 50 students or greater than .20 in a class of about 100 students, if the items are of moderate difficulty.
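A point-biserial correlation is the ordinary Pearson correlation between a 0/1 item score and a continuous score. The sketch below computes it that way, under the assumption that the total score includes the item itself; the report does not say whether it applies any correction, and the function name `point_biserial` is hypothetical.

```python
import math


def point_biserial(item_scores, total_scores):
    """Pearson correlation between 0/1 item scores and total test scores."""
    n = len(item_scores)
    mean_item = sum(item_scores) / n
    mean_total = sum(total_scores) / n

    # Covariance and standard deviations, all dividing by N.
    cov = sum(
        (x - mean_item) * (y - mean_total)
        for x, y in zip(item_scores, total_scores)
    ) / n
    sd_item = math.sqrt(sum((x - mean_item) ** 2 for x in item_scores) / n)
    sd_total = math.sqrt(sum((y - mean_total) ** 2 for y in total_scores) / n)
    if sd_item == 0 or sd_total == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / (sd_item * sd_total)


# Made-up data: only the lowest-scoring student missed the item.
item = [1, 1, 1, 1, 0]
totals = [4, 3, 2, 1, 0]
print(round(point_biserial(item, totals), 3))
```

Reversing the item scores, so that only the lowest-scoring student answers correctly, yields a negative value of the same size, which is the negative discrimination described above.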
AVG SCORE COR. RES. The average score on the test of those who got the item correct. This score should be higher than the mean test score for the entire class, which is assured if the point-biserial correlation is positive.
FREQUENCY AND PERCENT OF RESPONSE CHOICES. These columns show the number (FRQ) and percent (%) of students who chose each of the possible answers to the item. The figures for the correct response are in bold face and identified with an asterisk. The last column on the right shows the number and percent of students for whom the item was not scored because it was either omitted or because the student marked more than one answer. Incorrect answers (distracters) that are chosen by very few students (less than 5 percent) are serving no purpose in the test and should be examined for possible improvement.
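The frequency and percent columns, and the 5 percent rule of thumb for distracters, can be sketched as follows. The representation is an assumption made for illustration: each student's response is a letter, with `None` standing for an omitted or multiply marked item, and percentages are taken over all students. The names `response_table` and `weak_distracters` are hypothetical.

```python
from collections import Counter


def response_table(responses, choices="ABCDE"):
    """Return {choice: (frequency, percent)} for one item.

    Percentages are based on all students, including those whose
    response was not scored (represented here as None).
    """
    n = len(responses)
    counts = Counter(responses)
    return {c: (counts.get(c, 0), 100 * counts.get(c, 0) / n) for c in choices}


def weak_distracters(responses, key, choices="ABCDE", threshold=5.0):
    """List the wrong answers chosen by fewer than `threshold` percent."""
    table = response_table(responses, choices)
    return [c for c in choices if c != key and table[c][1] < threshold]


# Made-up responses from 50 students to an item keyed "A".
responses = ["A"] * 30 + ["B"] * 10 + ["C"] * 8 + ["D"] * 1 + [None] * 1
print(response_table(responses))
print(weak_distracters(responses, key="A"))
```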
MEAN. The mean is the arithmetic average, the sum of all scores divided by the total number of scores. The mean is the statistic that most people think of as "the average." It is ordinarily the best measure of central tendency if the units on the score scale are equal, a reasonable assumption for most tests, and if the distribution of scores is symmetrical. Extreme scores heavily influence the mean. Consequently, if the distribution is skewed, i.e., there are extreme scores at one end that are not balanced by extreme scores at the other, the mean may not adequately represent the center of the distribution.
STANDARD DEVIATION. The variance is the average squared deviation around the mean. The mean is subtracted from each score; the resulting difference is squared; and the sum of the squared differences is divided by the total number of scores. The standard deviation is the square root of the variance, and it represents a distance on the score scale. If the test scores are distributed as a normal curve, about 68 percent of the scores are between one standard deviation below the mean and one standard deviation above, while about 95 percent of the scores are between plus and minus two standard deviations.
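The mean, variance, and standard deviation as defined above (dividing by the total number of scores) can be computed with a few lines. This is a minimal sketch; the function name `describe` is hypothetical.

```python
import math


def describe(scores):
    """Return (mean, variance, standard deviation), dividing by N."""
    n = len(scores)
    mean = sum(scores) / n
    # Average squared deviation around the mean.
    variance = sum((s - mean) ** 2 for s in scores) / n
    return mean, variance, math.sqrt(variance)


# Made-up scores for eight students.
mean, variance, sd = describe([2, 4, 4, 4, 5, 5, 7, 9])
print(mean, variance, sd)
```

Python's standard library offers the same divide-by-N calculations as `statistics.pvariance` and `statistics.pstdev`.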
An Extended Item Analysis Report is available at the same charge as the standard report. It gives all the information described above and other additional information. Whereas the standard item analysis can be run for part scores, the extended item analysis is available only for the total test.
Included in the Extended Item Analysis are the number (FRQ) and percent (%) of each response for students in the highest and lowest 27% of scores, and the difference between the percentages. The figures for the correct response are in bold face and marked with an asterisk. The percentage difference for the correct response is a discrimination index that was originally developed as an efficient proxy for item-score correlation coefficients before computers made the latter easier to calculate than the former. The percentage difference is an easily understood index that some users are accustomed to using. In addition, the differences for the incorrect responses (which should be negative) provide additional evidence, alongside the total percentages of endorsement, of the effectiveness of these options as distracters.
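The high-low percentage difference for the correct response can be sketched as below. Two details are assumptions, since the report does not specify them: the group size is 27% of the class rounded to the nearest whole student, and students tied at a group boundary are assigned in their original order. The function name `hi_lo_difference` is hypothetical.

```python
def hi_lo_difference(item_correct, total_scores, fraction=0.27):
    """Percent correct in the top group minus percent correct in the bottom group.

    item_correct: 0/1 per student; total_scores: total test score per student.
    """
    n = len(total_scores)
    group_size = max(1, round(n * fraction))

    # Student indices ordered from lowest to highest total score.
    order = sorted(range(n), key=lambda i: total_scores[i])
    low_group = order[:group_size]
    high_group = order[-group_size:]

    pct_high = 100 * sum(item_correct[i] for i in high_group) / group_size
    pct_low = 100 * sum(item_correct[i] for i in low_group) / group_size
    return pct_high - pct_low


# Made-up class of ten: the five highest scorers answered the item correctly.
totals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
correct = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(hi_lo_difference(correct, totals))
```

Applying the same function to a 0/1 indicator for each wrong answer gives the differences for the distracters.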
The mean score of the group who answered the item incorrectly and the mean score of the group who selected the correct response are shown on the extended analysis.
CONFIDENCE LEVEL (DISC) OF HI-LO %. This number shows the probability that the difference between the high and low groups is due to chance. The confidence level depends on the size of both the actual difference and the group.
REGARDING THE PRESENT TEST. If examination of the item analysis results indicates one or more seriously flawed items, you may want to modify the test scores, either by re-scoring the entire group with a new key or by adjusting the scores of students affected by the faulty item(s).
REGARDING THE INSTRUCTION. Extremely difficult items, if they have no apparent defects, may indicate that the material was not adequately covered in the instruction. A review or a different approach to the topic may be indicated. Similarly, the most popular answer to a negatively discriminating item may point to a misunderstanding shared by many students.
REGARDING FUTURE TESTS. The primary application of item analyses is to improve future tests by identifying items that are not performing as expected, so that they can be improved.
Item data are influenced by chance errors, the nature of the group tested, the number of students tested, and the instruction the class has received. The other items in the test also matter: if most of the items relate to a certain content area, a small number of items related to different content are likely to have lower discrimination indices. Whether or not an item measures an important instructional objective is a more important consideration than the magnitude of the difficulty and validity indices. One should not be too hasty in discarding items with poor statistics from a single administration. If an item discriminates positively, is clear and unambiguous, is free from technical defects, and measures an important instructional objective, it may be retained for another try in the future. Item statistics should probably be used more for item improvement than for discarding items.
Remember that the item analysis applications described above apply to tests whose objective is to provide maximum discrimination among all students taking the test. Different test characteristics are required if the objective is to determine whether all students have achieved mastery of certain material or to provide the greatest reliability of measurement at a specific cutting point.
TESTAN is a more comprehensive test scoring, analysis, and reporting system. A selection of reports similar to the standard reports is available, but each one includes choice of output format, choice of high and low group sizes, choice of validity criterion (may be other than total test score) and choice of scatter plots of difficulty and validity indices. The standard, extended and TESTAN item analyses are based on classical test theory, which has the advantage of relative simplicity and familiarity. Item response theory analysis, which may be of interest to some users, especially with large classes, also is available in TESTAN. The costs of class lists, individual reports, and item analysis are the same for TESTAN as for the standard reports, but descriptive statistics, which are required for all other reports, cost $11.70, including all total and part scores and histograms. TESTAN item analysis labels are $6.50.