In light of Amnesty International’s decision not to publish a report covering events taking place in 2013, we have received a number of inquiries concerning the comparability of Amnesty International, State Department, and Human Rights Watch scores. Inquiries typically raise the following questions:
Can State Department scores be substituted for missing Amnesty scores?
Should only one set of scores be used or is it better to average across Amnesty, State Department, and HRW scores?
In the past we have responded to these questions with a fair amount of hand-waving. We suggested that there are no clear answers and recommended that scholars err on the side of caution. Adding to the ambiguity, our main data release includes average scores for each year (PTS2014.xls). Here I want to advocate for a more cautious approach and highlight that averaging scores to deal with missing values is likely problematic.
It is worth remembering that the Political Terror Scale aims to produce a standards-based measure of states’ physical integrity rights abuse. To produce this measure, we rely on annual reports published by Amnesty International, the U.S. State Department, and more recently Human Rights Watch. For each set of reports we code separate scores. PTS-A is coded relying exclusively on the reports published by Amnesty International. PTS-S relies only on the State Department reports and PTS-H on reports by Human Rights Watch.
While we believe that the resulting scores provide a good indication of the state of physical integrity rights in a given country for a given year, we cannot claim that they represent the true state of a country’s human rights practices. In other words, scores represent the state of human rights as reported by a given reporting organization. Amnesty, for example, may report abuses which are not mentioned in the State Department’s report. The State Department could focus on areas which did not receive the attention of Human Rights Watch. By and large, then, PTS scores are a reflection of human rights records as seen and reported by the reporting organization. They may approximate the “truth,” but they likely contain some amount of subjectivity or bias depending on each organization’s mandate, focus, agenda, monitoring capacity, and/or resource constraints.
Whether scores are comparable thus depends on one’s assumptions about these potential reporting biases. What follows below is a comparison of PTS scores based on Amnesty International and U.S. State Department reports. As we only have two years worth of Human Rights Watch scores, they are not considered in the discussion. We do expect, however, that some of the patterns we find, as well as the conclusions we reach, apply similarly to PTS scores based on the reports published by Human Rights Watch.
Figure 1 shows average PTS scores for both Amnesty International and the State Department. For the purpose of this comparison, only countries for which both an Amnesty and a State Department score are available were considered. As can be seen, for the period from 1976 to about 1990, the State Department average was consistently lower than the average for Amnesty scores. In other words, Amnesty’s reports were perhaps overly “harsh” or critical compared to the State Department’s. Alternatively, it could be that the State Department was particularly lenient or uncritical. Since then, yearly averages have converged, with the average of State Department scores being slightly higher in recent years.
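For readers wishing to replicate this restriction in their own analyses, a minimal sketch in Python with pandas might look as follows. The data frame, column names, and scores below are made up for illustration; the actual PTS release uses its own variable names.

```python
import pandas as pd

# Hypothetical example data; the real PTS release has one row per
# country-year, with NA where a score was not coded.
pts = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "year":    [1985, 1986, 1985, 1986, 1985, 1986],
    "PTS_A":   [3, 3, 4, None, 2, 2],
    "PTS_S":   [2, 3, 3, 3, None, 2],
})

# Keep only country-years where BOTH scores exist before averaging,
# so the two yearly series are computed over the same observations.
both = pts.dropna(subset=["PTS_A", "PTS_S"])
yearly = both.groupby("year")[["PTS_A", "PTS_S"]].mean()
print(yearly)
```

The point of dropping incomplete rows first is that otherwise the two averages would be computed over different country samples, which by itself can produce apparent divergence between the series.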
Some have argued that non-governmental organizations (NGOs) such as Amnesty International have incentives to consistently report bad news even if states’ human rights records improve (Simmons 2009). If human rights records across the world improve sufficiently, Amnesty International’s ability to mobilize members and attract donations would arguably be eroded. In short, Amnesty International has an incentive to change its standards, or to focus its attention on violations ignored in the past, in order to remain relevant. Incentives to strategically adjust reporting standards may also be a problem for the U.S. State Department’s annual reports, and these reports have been criticized as well. Their content was allegedly biased to make U.S. allies and U.S. foreign aid recipients appear in a more favorable light (Poe and Tate 1994; Poe, Carey, and Vazquez 2001).
Replacing a missing Amnesty score prior to 1990 with one based on the State Department source material is thus problematic. Even more recent scores (from 1990 onward) may not be substitutable. Consider Figure 2a and Figure 2b.
The differences between State Department and Amnesty scores (for the 4,853 observations for which both scores exist) are generally not more than one level or category. For 58 percent of the observations, the State Department and Amnesty scores are identical. For about 39 percent of the observations, scores differ by one category or level (e.g. PTS-S = 4 and PTS-A = 3). Only for about 3 percent of observations is the difference between the two scores larger than one category. For a quarter of the cases, Amnesty scores are higher (worse) than the State Department’s. Amnesty scores are lower for 17 percent of the observations.
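Tabulating these shares is straightforward once the paired scores are in hand. The sketch below uses invented scores rather than the actual 4,853 observations:

```python
import pandas as pd

# Hypothetical paired scores; in the real data each row is a
# country-year for which both PTS-S and PTS-A were coded.
df = pd.DataFrame({
    "PTS_S": [4, 3, 2, 5, 1, 3, 2, 4, 3, 2],
    "PTS_A": [3, 3, 2, 5, 2, 4, 2, 2, 3, 3],
})

diff = df["PTS_S"] - df["PTS_A"]
share_equal   = (diff == 0).mean()        # identical scores
share_off_one = (diff.abs() == 1).mean()  # differ by one category
share_large   = (diff.abs() > 1).mean()   # differ by more than one
print(share_equal, share_off_one, share_large)
```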
It is important to note that these differences are not limited to the pre-1990 period; disagreement remains substantial for more recent scores. As can be seen in Figure 2b, disagreement between State Department and Amnesty scores is high prior to the 1990s – peaking in 1978, when the two PTS scales were in agreement for only 25 percent of the coded observations. Even after 1990, however, disagreement remains relatively high, fluctuating around 40 percent. Assuming disagreement is not random, this calls uncritical averaging and/or substitution of scores into question.
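A per-year agreement rate like the one plotted in Figure 2b can be computed as below. Again, the years and scores are invented for the sake of a self-contained example:

```python
import pandas as pd

# Hypothetical data; the real series runs from 1976 onward.
df = pd.DataFrame({
    "year":  [1978, 1978, 1978, 1978, 1995, 1995, 1995],
    "PTS_S": [4, 3, 2, 5, 3, 2, 4],
    "PTS_A": [3, 4, 3, 4, 3, 2, 3],
})

# Share of coded observations per year where the two scales agree.
agreement = (
    df.assign(agree=df["PTS_S"] == df["PTS_A"])
      .groupby("year")["agree"]
      .mean()
)
print(agreement)
```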
The case against averaging is further strengthened when considering disagreement of PTS scores by country. Consider the histogram presented in Figure 3, which shows the distribution of average differences by country.
As should be expected, the average differences cluster around zero. For most countries, then, differences in scores appear random and cancel each other out: the State Department score for Country X may be high compared to the Amnesty score in one year but lower in another, so on average the scores are in agreement. For a substantial number of countries, however, Amnesty scores are consistently higher (left tail of the figure) or consistently lower (right tail). This is troubling news for those seeking to replace missing scores for these countries.
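The quantity binned in the histogram is simply the mean score difference per country. A minimal sketch, again with made-up countries and scores:

```python
import pandas as pd

# Hypothetical paired scores for a few countries; the real figure
# plots one average difference (PTS_S - PTS_A) per country.
df = pd.DataFrame({
    "country": ["X", "X", "X", "Y", "Y", "Z", "Z"],
    "PTS_S":   [3, 2, 4, 4, 5, 2, 3],
    "PTS_A":   [4, 2, 3, 3, 4, 2, 3],
})

# Values near zero mean the two sources agree on average for that
# country; values far from zero indicate a systematic gap.
avg_diff = (
    df.assign(diff=df["PTS_S"] - df["PTS_A"])
      .groupby("country")["diff"]
      .mean()
)
print(avg_diff)
```

Here country X’s yearly differences cancel out on average, while country Y shows a systematic gap of the kind that populates the tails of Figure 3.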
To add some substance to the discussion, below are the results of comparing the two measures (PTS-S and PTS-A) for each country by means of a set of paired t-tests. Paired t-tests allow for a quick-and-dirty evaluation of the ratings or biases of two raters (in our case, two reporting organizations: Amnesty International and the State Department). Presented are the average differences between PTS-S and PTS-A scores by country, as well as 90 percent confidence intervals.
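For a single country, the test and its 90 percent confidence interval can be sketched as follows. The score series below are fabricated; the real analysis repeats this for every country in the data:

```python
from scipy import stats

# Hypothetical yearly scores for one country.
pts_s = [4, 4, 5, 4, 4, 5, 4, 5]
pts_a = [3, 4, 4, 3, 4, 4, 3, 4]

# Paired t-test: is the mean of (PTS_S - PTS_A) distinguishable
# from zero?
t_stat, p_value = stats.ttest_rel(pts_s, pts_a)

diffs = [s - a for s, a in zip(pts_s, pts_a)]
mean_diff = sum(diffs) / len(diffs)

# 90 percent confidence interval around the mean difference.
lo, hi = stats.t.interval(0.90, df=len(diffs) - 1,
                          loc=mean_diff, scale=stats.sem(diffs))
print(mean_diff, t_stat, p_value, (lo, hi))
```

In this fabricated example the interval lies entirely above zero, which is the situation the colored dots in the figure flag: a difference statistically discernible from zero.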
Countries with dots to the right of the red line (which marks no difference in scores) receive, on average, higher (worse) State Department scores. Countries with dots to the left of the red line receive higher (worse) Amnesty scores. Magenta dots indicate differences that are statistically discernible from zero (i.e., Amnesty scores are significantly lower, or State Department scores significantly higher). Blue dots indicate the opposite (i.e., Amnesty scores are significantly higher, or State Department scores significantly lower).
The results presented above suggest that systematic disagreement between the scores is concentrated in a set of specific countries. The State Department’s reports are arguably more critical of former Eastern Bloc countries (e.g. Azerbaijan, Turkmenistan, Kazakhstan, Belarus, Georgia, and Russia). Other perhaps unsurprising examples include Vietnam, Mozambique, Angola, and Nicaragua. Amnesty, on the other hand, appears to be more critical of prominent U.S. allies such as West Germany, Colombia, Saudi Arabia, and Israel.
The take-away should be plain: replacing missing scores with, or averaging across, the two (three) scales is likely problematic. Scholars ought to take great care to justify either approach and to evaluate on a case-by-case basis if and when Political Terror Scale scores are indeed comparable. The above results suggest that it is unlikely that the bias in one scale can easily be “fixed” or “washed out” by substituting the existing score from the other scale.
Extreme care should also be taken when interpreting scores as representations of the true human rights conditions in a given country. It is safer to view them as representations of human rights records as seen from the perspectives of different monitoring organizations.
Both PTS-A and PTS-S present a biased view of physical integrity rights violations – this is especially likely for the 1970s and the early to mid 1980s. Yet even with almost 40 years’ worth of data, these biases remain problematic.