ITEM ANALYSIS REPORT

The Item Analysis Report provides information about overall test reliability and about individual items for use in item and test improvement. The top of the Item Analysis Report lists:

  • The size of the group
  • The number of items in the test
  • The mean and standard deviation of test scores
  • The reliability coefficient

MEAN

The mean is the arithmetic average, the sum of all scores divided by the total number of scores, and is the statistic most people think of as "the average". If the units on the score scale are equal, which is true of most tests, and if the distribution of scores is symmetrical, the mean is generally the best measure of central tendency. However, extreme scores heavily influence the mean, so if the distribution is skewed, the mean may not be a good indicator of the center of the distribution. (For more information about central tendency, see the Descriptive Statistics Report described below.)
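
For example, three hypothetical scores of 70, 80, and 90 have a mean of (70 + 80 + 90) / 3 = 80.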

STANDARD DEVIATION

The standard deviation is the square root of the variance, and it represents a distance on the score scale. If the test scores are distributed as a normal curve, about 68 percent of the scores fall between one standard deviation below the mean and one standard deviation above it, while about 95 percent of the scores fall within two standard deviations of the mean.
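
As a minimal sketch of these two statistics, assuming only a plain list of scores (the figures below are made up, not output from an actual report):

    import statistics

    scores = [72, 85, 78, 90, 66, 81, 74, 88]    # hypothetical test scores

    mean = statistics.mean(scores)               # arithmetic average
    # Population standard deviation (divides by N); whether a given report divides
    # by N or by N - 1 is an assumption that would need to be checked.
    sd = statistics.pstdev(scores)

    print(f"Mean: {mean:.2f}")
    print(f"Standard deviation: {sd:.2f}")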

RELIABILITY

Reliability is an index of the degree to which a test is consistent and stable in measuring what it is intended to measure. USAS uses the Kuder-Richardson Formula 20 reliability coefficient.
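
For reference, Kuder-Richardson Formula 20 combines the number of items, the proportion of test-takers answering each item correctly, and the variance of the total scores. The sketch below illustrates the standard formula on a small, made-up matrix of scored responses; it is not USAS code, and the data are hypothetical.

    # Kuder-Richardson Formula 20:
    #   KR-20 = (k / (k - 1)) * (1 - sum(p_i * q_i) / variance of total scores)
    # where k is the number of items, p_i is the proportion answering item i
    # correctly, and q_i = 1 - p_i.

    def kr20(responses):
        """responses: one row per test-taker, each row a list of 0/1 item scores."""
        n = len(responses)                     # number of test-takers
        k = len(responses[0])                  # number of items
        totals = [sum(row) for row in responses]
        mean_total = sum(totals) / n
        # Population variance of the total scores (sample variance is also sometimes used).
        var_total = sum((t - mean_total) ** 2 for t in totals) / n
        p = [sum(row[i] for row in responses) / n for i in range(k)]
        sum_pq = sum(pi * (1 - pi) for pi in p)
        return (k / (k - 1)) * (1 - sum_pq / var_total)

    # Hypothetical scored responses: 5 test-takers by 4 items (1 = correct, 0 = incorrect).
    data = [
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
        [1, 1, 0, 0],
    ]
    print(f"KR-20 reliability: {kr20(data):.2f}")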

The reliability coefficient can range from 0 to 1. A coefficient of 0 indicates that the scores reflect nothing but measurement error; a coefficient of 1 indicates that the test measures the domain of knowledge without error: if the individual took another test of the same domain, their score would be expected to demonstrate the same amount of knowledge.

Commercial standardized achievement tests may have reliabilities of up to .9, but the reliability of classroom examinations is seldom that high.

The most important factors affecting the reliability coefficient are:

  • The homogeneity of the test-takers: The greater the variability within the group, the higher the reliability coefficient
  • The homogeneity of the items in the test: The higher the inter-correlations among the items, the higher the reliability coefficient
  • The length of the test: The more items in the test, the higher the reliability coefficient

The reliability coefficient does not indicate anything about the validity of a test. The test can be highly reliable, but if the content of the items does not represent what is supposed to be measured, the test will not be valid. An example is a scale that is always 10% off. It will be consistent, but the weight displayed will be wrong.

ITEM STATISTICS

Item statistics provide data about test-takers' responses to each test item in order to help judge item effectiveness. The following information is printed for each item in the Item Analysis Report.

  1. Item. The number of the item in the test. Any items omitted from scoring are not listed.
  2. Correct Response(s). The keyed correct response(s) for the item. The correct item response is also indicated in bold in the Response Frequency (Percent) column.
  3. Item Difficulty. Item difficulty shows the percent of test-takers who answered the item correctly. Although this statistic is called item difficulty, note that the higher the percentage, the easier the item.

    Items in the middle range of difficulty provide the best discrimination among test-takers. Items with quite high or quite low difficulty may be useful for indicating whether all the test-takers mastered the material.

    For tests intended to discriminate among test-takers (that is, to determine who knows the most and who knows the least), optimum results are obtained when the test items are similar in difficulty and have average Item Difficulty values somewhat higher than halfway between the “chance” score and 100 percent. For multiple-choice tests with 5-, 4-, 3-, or 2-response (true/false) items, these optimum Item Difficulty values are about 60, 62, 67, and 75 percent, respectively.

    If an item is very difficult (low Item Difficulty), it may cover content that test-takers have not learned, or it may be a faulty item, possibly even keyed incorrectly.

    Item Difficulty values are extremely dependent on the group for which they are computed. They do not indicate difficulty level in groups of different ability. Also, the Item Difficulty values in standard and extended item analyses are appropriate for tests where most test-takers have attempted every item.
  4. Point Biserial Correlation. The point biserial correlation coefficient shows the correlation between the item and the total score on the test and is used as an index of item discrimination. On highly discriminating items, test-takers who know more about the subject matter in general (i.e., have higher total scores on the test) do better than those who know less. Point biserial correlations can have negative values, indicating negative discrimination, when test-takers who scored well on the total test did less well on the item than those with lower scores. It has been suggested that, for items of moderate difficulty, most items on a test should have point biserial correlations of .29 or greater in a class of about 50 test-takers, or greater than .20 in a class of about 100 test-takers. (A computational sketch of these item statistics follows this list.)
  5. Average Score Cor. Res. The average score on the test of those who responded correctly to the item. This score should be higher than the mean test score for the entire class, which will be the case whenever the point biserial correlation is positive.
  6. Response Frequency (Percent). These columns show the number (Frequency) and percent of test-takers who chose each of the possible answers to the item. The figures for the correct response are in bold face. The last row for each item, labeled Omit/Mult, shows the number and percent of test-takers for whom the item was not scored because it was either omitted or because the test-taker marked more than one answer. Incorrect answers (distracters) that are chosen by very few test-takers (less than 5 percent) are serving no purpose in the test and should be examined for possible improvement.
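
The item statistics above can be illustrated with a short sketch. The scored responses below are hypothetical (1 = correct, 0 = incorrect), the total score includes the item being analyzed (a common convention, though not necessarily the USAS one), and the code illustrates the standard formulas rather than the USAS implementation:

    # Hypothetical scored responses: rows are test-takers, columns are items.
    data = [
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
        [1, 1, 0, 0],
    ]

    n = len(data)                                # number of test-takers
    totals = [sum(row) for row in data]          # each test-taker's total score
    mean_total = sum(totals) / n
    sd_total = (sum((t - mean_total) ** 2 for t in totals) / n) ** 0.5

    for i in range(len(data[0])):
        item = [row[i] for row in data]
        p = sum(item) / n                        # Item Difficulty (proportion answering correctly)
        correct_scores = [t for t, x in zip(totals, item) if x == 1]
        avg_correct = sum(correct_scores) / len(correct_scores)   # Average Score Cor. Res.
        # Point biserial correlation between the item and the total score:
        #   r_pb = (mean of correct responders - overall mean) / SD * sqrt(p / (1 - p))
        r_pb = (avg_correct - mean_total) / sd_total * (p / (1 - p)) ** 0.5
        print(f"Item {i + 1}: difficulty = {p:.0%}, point biserial = {r_pb:.2f}, "
              f"average score of correct responders = {avg_correct:.2f}")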

EXTENDED ITEM ANALYSIS

An Extended Item Analysis Report is available at the same charge as the standard report. It includes everything in the standard report plus additional information. While the standard item statistics can be run for part scores, the extended item analysis is available only for the total test.

The Extended Item Analysis includes the number (Response Frequency) and percent (%) of each response for test-takers in the highest and lowest 27% of scores, and the difference between the percentages. The figures for the correct response are in bold face. The percentage difference for the correct response is a discrimination index originally developed as a proxy for item-score correlation coefficients, which were difficult to calculate before computers were readily available for statistical analysis. The percentage difference is an easily understood index that some users are accustomed to using. In addition, the differences for the incorrect responses (which should be negative), together with the total percentages of endorsement, provide additional evidence about how effectively these options are working as distracters.
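
A brief sketch of the high/low percentage-difference index, again with made-up total scores and item responses rather than USAS output:

    # totals: hypothetical total scores; item: 0/1 scores on one item for the same test-takers.
    totals = [34, 29, 41, 22, 37, 25, 30, 39, 19, 27, 33]
    item   = [ 1,  0,  1,  0,  0,  0,  1,  1,  0,  1,  1]

    n = len(totals)
    cut = max(1, round(0.27 * n))                  # size of the upper and lower 27% groups
    order = sorted(range(n), key=lambda i: totals[i])
    low, high = order[:cut], order[-cut:]

    pct_high = 100 * sum(item[i] for i in high) / cut   # percent correct in the top 27%
    pct_low = 100 * sum(item[i] for i in low) / cut     # percent correct in the bottom 27%
    print(f"High group: {pct_high:.0f}%  Low group: {pct_low:.0f}%  "
          f"Difference (discrimination index): {pct_high - pct_low:.0f}")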

The mean score of the group that answered the item incorrectly and the mean score of the group that selected the correct response are shown on the extended analysis.

The Confidence Level shows the probability that the difference between the high and low groups is not due to chance; it depends both on the size of the difference and on the size of the groups.
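
The exact formula used for this probability is not described here. One common way to assess whether a high/low difference of a given size could plausibly arise by chance is a two-proportion z-test; the sketch below uses that generic approach, with made-up counts, purely as an illustration and not as the USAS method:

    from math import erf, sqrt

    def high_low_confidence(correct_high, correct_low, group_size):
        """Approximate confidence that a high/low difference is not due to chance,
        using a two-proportion z-test. A generic approach for illustration only,
        not necessarily the formula used by USAS."""
        p_high = correct_high / group_size
        p_low = correct_low / group_size
        pooled = (correct_high + correct_low) / (2 * group_size)
        se = sqrt(pooled * (1 - pooled) * 2 / group_size)    # standard error of the difference
        if se == 0:
            return None                                      # no variability: test undefined
        z = abs(p_high - p_low) / se
        p_value = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))     # two-sided p-value
        return 1 - p_value

    # Hypothetical counts: 24 of 27 high-group and 13 of 27 low-group test-takers answered correctly.
    print(f"Confidence level: {high_low_confidence(24, 13, 27):.3f}")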

USING ITEM ANALYSIS RESULTS

Regarding the Present Test: If examination of the item analysis results indicates one or more seriously flawed items, you may want to modify the test scores, either by re-scoring the entire group with a new key or by adjusting the scores of test-takers affected by the faulty item(s).

Regarding the Instruction: Extremely difficult items, if they have no apparent defects, may indicate that the material was not adequately covered in the instruction. A review or a different approach to the topic may be indicated. Similarly, the most popular answer to a negatively discriminating item may point to a misunderstanding shared by many test-takers.

Regarding Future Tests: The primary application of item analyses is to improve future tests by identifying items that are not performing as expected, so that they can be improved.

  • An extremely easy item may identify a topic that all test-takers have learned; alternatively, the item may have no plausible distracters
  • Difficult or negatively discriminating items may be confusing or ambiguous, or may have more than one reasonably correct answer
  • Seldom chosen incorrect answers should be examined to see if they contain irrelevant clues. If no more than 5 percent of test-takers, over time, select a given response, that response is contributing little to the item
  • Incorrect answers chosen more frequently by high scoring test-takers than by low-scoring test-takers should be examined to determine why they are discriminating negatively
  • Plotting the items on a chart with the difficulty level as one axis and the validity index as the other may be helpful in differentiating items that contribute to the test's objectives from those that may require modification
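
As a sketch of the last suggestion above, item difficulty can be plotted against the discrimination (point biserial) values; the numbers below are made up, and matplotlib is only one possible tool for such a chart:

    import matplotlib.pyplot as plt

    # Hypothetical item statistics as they might be read from an item analysis report.
    difficulty = [85, 62, 44, 91, 58, 37, 70]                     # percent correct per item
    discrimination = [0.12, 0.41, 0.35, 0.05, 0.48, -0.10, 0.30]  # point biserial per item

    plt.scatter(difficulty, discrimination)
    for i, (d, r) in enumerate(zip(difficulty, discrimination), start=1):
        plt.annotate(str(i), (d, r))             # label each point with its item number
    plt.axhline(0, linestyle="--")               # items below this line discriminate negatively
    plt.xlabel("Item Difficulty (percent correct)")
    plt.ylabel("Point Biserial Correlation")
    plt.title("Item difficulty vs. discrimination")
    plt.show()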

Item data are influenced by chance errors, the nature of the group tested, the number of people tested, and the instruction they have received. The other items in the test also matter: if most of the items in the test relate to a certain content area, a small number of items related to different content are likely to have lower discrimination indices. Whether an item measures an important instructional objective is a more important consideration than the magnitude of the difficulty and validity indices. One should not be too hasty in discarding items with poor statistics from a single administration. If an item discriminates positively, is clear and unambiguous, is free from technical defects, and measures an important instructional objective, it may be retained for another try in the future. Item statistics should probably be used more for item improvement than for discarding items.