Citation Statistics: International Mathematical Union Report



SUMMARY: The IMU’s cautions are welcome: Research metrics do need to be validated; they need to be multiple, rather than a single, unidimensional index; they need to be separately validated for each discipline, and the weights on the multiple metrics need to be calibrated and adjusted both for the discipline being assessed and for the properties on which it is being ranked. The RAE 2008 database provides the ideal opportunity to do all this discipline-specific validation and calibration, because it is providing parallel data from both peer panel rankings and metrics. The metrics, however, should be as rich and diverse as possible, to capitalize on this unique opportunity for joint validation. Open Access not only provides more metrics; it also augments them, as well as providing open safeguards against manipulation and misuse.


Numbers with a number of problems
International Mathematical Union Citation Statistics Report. Robert Adler, John Ewing (Chair), Peter Taylor

Charles Oppenheim wrote (in the American Scientist Open Access Forum):

CHARLES OPPENHEIM: “I’ve now read the whole report. Yes, it tilts at the usual windmills, and rightly dismisses the use of Impact factors for anything but crude comparisons, but it fails to address the fundamental issue, which is: citation and other metrics correlate superbly with subjective peer review. Both methods have their faults, but they are clearly measuring the same (or closely related) things. Ergo, if you have to evaluate research in some way, there is no reason NOT to use them! It also keeps referring to examples from the field of maths, which is a very strange subject citation-wise.”

I have now read the IMU report too, and agree with Charles that it makes many valid points but it misunderstands the one fundamental point concerning the question at hand: Can and should metrics be used in place of peer-panel based rankings in the UK Research Assessment Exercise (RAE) and its successors and homologues elsewhere? And there the answer is a definite Yes.

The IMU critique points out that research metrics in particular and statistics in general are often misused, and this is certainly true. It also points out that metrics are often used without validation. This too is correct. There is also a simplistic tendency to try to use one single metric, rather than multiple metrics that can complement and correct one another. There too, a practical and methodological error is correctly pointed out. It is also true that the “journal impact factor” has many flaws, and should on no account be used to rank individual papers or researchers, and especially not alone, as a single metric.

But what all this valuable, valid cautionary discussion overlooks is not only the possibility but the empirically demonstrated fact that there exist metrics that are highly correlated with human expert rankings. It follows that to the degree that such metrics account for the same variance, they can substitute for the human rankings. The substitution is desirable, because expert rankings are extremely costly in terms of expert time and resources. Moreover, a metric that can be shown to be highly correlated with an already validated predictor variable (such as expert rankings) thereby itself becomes a validated predictor variable. And this is why the answer to the basic question of whether the RAE’s decision to convert to metrics was a sound one is: Yes.

Nevertheless, the IMU’s cautions are welcome: Metrics do need to be validated; they do need to be multiple, rather than a single, unidimensional index; they do have to be separately validated for each discipline, and the weights on the multiple metrics need to be calibrated and adjusted both for the discipline being assessed and for the properties on which it is being ranked. The RAE 2008 database provides the ideal opportunity to do all this discipline-specific validation and calibration, because it is providing parallel data from both peer panel rankings and metrics. The metrics, however, should be as rich and diverse as possible, to capitalize on this unique opportunity for joint validation.

Here are some comments on particular points in the IMU report. (All quotes are from the report):

The meaning of a citation can be even more subjective than peer review.

True. But if there is a non-metric criterion measure — such as peer review — on which we already rely, then metrics can be cross-validated against that criterion measure, and this is exactly what the RAE 2008 database makes it possible to do, for all disciplines, at the level of an entire sizeable nation’s total research output…

The sole reliance on citation data provides at best an incomplete and often shallow understanding of research — an understanding that is valid only when reinforced by other judgments.

This is correct. But the empirical fact has turned out to be that a department’s total article/author citation counts are highly correlated with its peer rankings in the RAE in every discipline tested. This does not mean that citation counts are the only metric that should be used, or that they account for 100% of the variance in peer rankings. But it is strong evidence that citation counts should be among the metrics used, and it constitutes a (pairwise) validation.
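As a rough illustration of what a pairwise validation involves, the sketch below computes a Pearson correlation between departmental citation counts and peer ratings, and reports r squared as the share of criterion variance accounted for. The data and the names `citation_counts` and `peer_ratings` are invented for illustration; they are not RAE results, and an actual study might well prefer a rank correlation.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical data (assumption): one entry per department in one discipline.
citation_counts = [120, 340, 90, 560, 410, 230]
peer_ratings = [3.0, 4.0, 2.0, 5.0, 5.0, 4.0]

r = pearson_r(citation_counts, peer_ratings)
print(f"r = {r:.2f}; share of criterion variance accounted for (r^2) = {r * r:.2f}")
```

A high r here would validate the metric only for the discipline sampled; the same check would have to be repeated discipline by discipline.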

Using the impact factor alone to judge a journal is like using weight alone to judge a person’s health.
For papers, instead of relying on the actual count of citations to compare individual papers, people frequently substitute the impact factor of the journals in which the papers appear.

As noted, this is a foolish error if the journal impact factor is used alone, but it may enhance predictivity and hence validity if added to a battery of jointly validated metrics.

The validity of statistics such as the impact factor and h-index is neither well understood nor well studied.

The h-index (and its variants) were created ad hoc, without validation. They turn out to be highly correlated with citation counts (for obvious reasons, since they are in part based on them). Again, they are all welcome in a battery of metrics to be jointly cross-validated against peer rankings or other already-validated or face-valid metrics.
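To make the dependence on citation counts concrete, here is a minimal sketch of the usual h-index definition (the largest h such that h papers have at least h citations each). The two researchers and their per-paper counts are hypothetical.

```python
def h_index(citations):
    """Largest h such that at least h papers have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical per-paper citation counts (assumption, for illustration only).
researcher_a = [25, 8, 5, 3, 3, 1, 0]
researcher_b = [6, 6, 5, 5, 4, 4, 2]

for name, papers in (("A", researcher_a), ("B", researcher_b)):
    print(name, "total citations:", sum(papers), "h-index:", h_index(papers))
```

Because the index is computed from the same per-paper counts as the citation total, the two tend to move together, yet they can still rank individuals differently, as in this invented pair. That is one reason to validate them jointly rather than treat either as a proxy for the other.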

citation data provide only a limited and incomplete view of research quality, and the statistics derived from citation data are sometimes poorly understood and misused.

It is certainly true that there are many more potential metrics of research performance, productivity, impact and quality than just citation metrics (e.g., download counts, student counts, research funding, etc.). They should all be jointly validated, discipline by discipline, and each metric should be weighted according to what percentage of the criterion variance (e.g., RAE 2008 peer rankings) it predicts.

relying primarily on metrics (statistics) derived from citation data rather than a variety of methods, including judgments by scientists themselves…

The whole point is to cross-validate the metrics against the peer judgments, and then use the weighted metrics in place of the peer judgments, in accordance with their validated predictive power.

bibliometrics (using counts of journal articles and their citations) will be a central quality index in this system [RAE]

Yes, but the successor of RAE is not yet clear on which metrics it will use, and whether and how it will validate them. There is still some risk that a small number of metrics will simply be picked a priori, without systematic validation. It is to be hoped that the IMU critique, along with other critiques and recommendations, will result in the use of the 2008 parallel metric/peer data for a systematic and exhaustive cross-validation exercise, separately for each discipline. Future assessments can then use the metric battery, with initialized weights (specific to each discipline), and can calibrate and optimize them across the years, as more data accumulates — including spot-checks cross-validating periodically against “light-touch” peer rankings and other validated or face-valid measures.

sole reliance on citation-based metrics replaces one kind of judgment with another. Instead of subjective peer review one has the subjective interpretation of a citation’s meaning.

Correct. This is why multiple metrics are needed, and why they need to be systematically cross-validated against already-validated or face-valid criteria (such as peer judgment).

Research usually has multiple goals, both short-term and long, and it is therefore reasonable that its value must be judged by multiple criteria.

Yes, and this means multiple, validated metrics. (Time-course parameters, such as growth and decay rates of download, citation and other metrics are themselves metrics.)

many things, both real and abstract, that cannot be simply ordered, in the sense that each two can be compared

Yes, we should not compare the incomparable and incommensurable. But whatever we are already comparing, by other means, can be used to cross-validate metrics. (And of course it should be done discipline by discipline, and sometimes even by sub-discipline, rather than by treating all research as if it were of the same kind, with the same metrics and weights.)

plea to use multiple methods to assess the quality of research

Valid plea, but the multiple “methods” means multiple metrics, to be tested for reliability and validity against already validated methods.

Measures of esteem such as invitations, membership on editorial boards, and awards often measure quality. In some disciplines and in some countries, grant funding can play a role. And peer review — the judgment of fellow scientists — is an important component of assessment.

These are all sensible candidate metrics to be included, alongside citation and other candidate metrics, in the multiple regression equation to be cross-validated jointly against already validated criteria, such as peer rankings (especially in RAE 2008).

lure of a simple process and simple numbers (preferably a single number) seems to overcome common sense and good judgment.

Validation should definitely be done with multiple metrics, jointly, using multiple regression analysis, not with a single metric, and not one at a time.
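A joint validation of this kind can be sketched as an ordinary least-squares regression of the criterion (peer rankings) on several metrics at once. The version below uses only the standard library and solves the normal equations directly; the three metrics, their units, and all the numbers are assumptions made up for illustration, and a real analysis would use a statistics package and far more data.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("metrics are collinear; weights are not identifiable")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_metric_weights(metric_rows, criterion):
    """Least squares fit of criterion ~ intercept + weighted metrics.

    Returns [intercept, weight_1, ..., weight_k].
    """
    design = [[1.0] + list(row) for row in metric_rows]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, criterion)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def r_squared(metric_rows, criterion, weights):
    """Share of criterion variance accounted for by the fitted battery."""
    predicted = [weights[0] + sum(w * v for w, v in zip(weights[1:], row))
                 for row in metric_rows]
    mean_y = sum(criterion) / len(criterion)
    ss_res = sum((y - p) ** 2 for y, p in zip(criterion, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in criterion)
    return 1.0 - ss_res / ss_tot


# Hypothetical departments in one discipline (assumption). Columns:
# citations per author, downloads (thousands), funding (millions).
metrics = [
    [12.0, 3.1, 0.8], [30.5, 6.4, 2.1], [8.2, 2.0, 0.5], [45.0, 9.8, 3.0],
    [22.1, 7.5, 1.2], [38.4, 5.9, 2.8], [15.3, 4.4, 1.9], [27.0, 8.1, 1.0],
]
peer_rankings = [2.0, 4.0, 1.5, 5.0, 3.5, 4.5, 3.0, 3.5]

weights = fit_metric_weights(metrics, peer_rankings)
print("intercept and weights:", [round(w, 3) for w in weights])
print("R^2:", round(r_squared(metrics, peer_rankings, weights), 3))
```

The raw weights depend on each metric's units, so comparing the contributions of different metrics would require standardizing them first. The fit would also be repeated separately for each discipline.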

special citation culture of mathematics, with low citation counts for journals, papers, and authors, makes it especially vulnerable to the abuse of citation statistics.

Metric validation and weighting should be done separately, field by field.

For some fields, such as bio-medical sciences, this is appropriate because most published articles receive most of their citations soon after publication. In other fields, such as mathematics, most citations occur beyond the two-year period.

Chronometrics — growth and decay rates and other time-based parameters for download, citations and other time-based, cumulative measures — should be among the battery of candidate metrics for validation.
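One simple way to turn a time course into a candidate metric is to fit an exponential rate to yearly counts and report a half-life. The sketch below does this with a least-squares line through the logarithms of the counts; the exponential form, the function names, and the two yearly series are all assumptions for illustration, not a model proposed in the report.

```python
from math import exp, log


def fit_exponential_rate(counts):
    """Fit counts[t] ~ c0 * exp(rate * t) by least squares on log(counts).

    Returns (c0, rate); rate < 0 indicates decay, rate > 0 indicates growth.
    """
    if len(counts) < 2 or any(c <= 0 for c in counts):
        raise ValueError("need at least two strictly positive counts")
    ts = list(range(len(counts)))
    logs = [log(c) for c in counts]
    mean_t = sum(ts) / len(ts)
    mean_l = sum(logs) / len(logs)
    slope = (sum((t - mean_t) * (l - mean_l) for t, l in zip(ts, logs))
             / sum((t - mean_t) ** 2 for t in ts))
    intercept = mean_l - slope * mean_t
    return exp(intercept), slope


def half_life(rate):
    """Time for the yearly count to halve; defined only for decay (rate < 0)."""
    if rate >= 0:
        raise ValueError("half-life is only defined for a negative rate")
    return log(2) / -rate


# Hypothetical citations per year after the peak year (assumption).
fast_field = [40, 22, 12, 7, 4]   # citations arrive early and fade quickly
slow_field = [10, 9, 8, 8, 7]     # citations accumulate over a long period

for label, series in (("fast", fast_field), ("slow", slow_field)):
    c0, rate = fit_exponential_rate(series)
    print(f"{label}: rate = {rate:.3f} per year, half-life = {half_life(rate):.1f} years")
```

The fitted rate or half-life can then enter the battery as one more candidate metric, to be weighted like any other.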

The impact factor varies considerably among disciplines… The impact factor can vary considerably from year to year, and the variation tends to be larger for smaller journals.

All true. Hence the journal impact factor — perhaps with various time constants — should be part of the battery of candidate metrics, not simply used a priori.
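The idea of varying the time constant can be sketched by generalizing the standard two-year impact factor (citations received in a census year to items published in the two preceding years, divided by the number of citable items published in those years) to an arbitrary window. The function name, the `window` parameter, and the journal data are hypothetical.

```python
def windowed_impact_factor(citations_by_pub_year, items_by_pub_year,
                           census_year, window=2):
    """Citations received in census_year to items published in the
    preceding `window` years, divided by the citable items in those years."""
    years = range(census_year - window, census_year)
    cites = sum(citations_by_pub_year.get(y, 0) for y in years)
    items = sum(items_by_pub_year.get(y, 0) for y in years)
    if items == 0:
        raise ValueError("no citable items in the window")
    return cites / items


# Hypothetical journal (assumption): 50 citable items per year, with the
# citations received during 2008 broken down by the cited item's year.
items_by_pub_year = {2003: 50, 2004: 50, 2005: 50, 2006: 50, 2007: 50}
citations_in_2008 = {2003: 90, 2004: 80, 2005: 60, 2006: 30, 2007: 10}

for window in (2, 5):
    value = windowed_impact_factor(citations_in_2008, items_by_pub_year, 2008, window)
    print(f"{window}-year window: {value:.2f}")
```

In this invented case most citations go to older items, so the two-year figure understates the journal's citation activity relative to the five-year figure, which is the pattern the report describes for mathematics.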

The most important criticism of the impact factor is that its meaning is not well understood. When using the impact factor to compare two journals, there is no a priori model that defines what it means to be “better”. The only model derives from the impact factor itself — a larger impact factor means a better journal… How does the impact factor measure quality? Is it the best statistic to measure quality? What precisely does it measure? Remarkably little is known…

And this is because the journal impact factor (like most other metrics) has not been cross-validated against face-valid criteria, such as peer rankings.

employing other criteria to refine the ranking and verify that the groups make sense

In other words, systematic cross-validation is needed.

impact factor cannot be used to compare journals across disciplines

All metrics should be independently validated for each discipline.

impact factor may not accurately reflect the full range of citation activity in some disciplines, both because not all journals are indexed and because the time period is too short. Other statistics based on longer periods of time and more journals may be better indicators of quality. Finally, citations are only one way to judge journals, and should be supplemented with other information

Chronometrics. And multiple metrics.

The impact factor and similar citation-based statistics can be misused when ranking journals, but there is a more fundamental and more insidious misuse: Using the impact factor to compare individual papers, people, programs, or even disciplines

Individual citation counts and other metrics: Multiple metrics, jointly validated.

the distribution of citation counts for individual papers in a journal is highly skewed, approximating a so-called power law… highly skewed distribution and the narrow window of time used to compute the impact factor

To the extent that distributions are pertinent, they too can be parametrized and taken into account in validating metrics. Comparing like with like (e.g., discipline by discipline) should also help maximize comparability.

using the impact factor as a proxy for actual citation counts for individual papers

No need to use one metric as a proxy for another. Jointly validate them all.

if you want to rank a person’s papers using only citations to measure the quality of a particular paper, you must begin by counting that paper’s citations. The impact factor of the journal in which the paper appears is not a reliable substitute.

Correct, but this obvious truth does not need to be repeated so many times, and it is an argument against single metrics in general, and against the journal impact factor as a single metric in particular. But there’s nothing wrong with using it in a battery of metrics for validation.

h-index Hirsch extols the virtues of the h-index by claiming that “h is preferable to other single-number criteria commonly used to evaluate scientific output of a researcher…”[Hirsch 2005, p. 1], but he neither defines “preferable” nor explains why one wants to find “single-number criteria.”… Much of the analysis consists of showing “convergent validity,” that is, the h-index correlates well with other publication/citation metrics, such as the number of published papers or the total number of citations. This correlation is unremarkable, since all these variables are functions of the same basic phenomenon…

The h-index is again a single metric. And cross-validation only works against either an already validated or a face-valid criterion, not just another unvalidated metric. And the only way multiple metrics, all inter-correlated, can be partitioned and weighted is with multiple regression analysis — and once again against a criterion, such as peer rankings.

Some might argue that the meaning of citations is immaterial because citation-based statistics are highly correlated with some other measure of research quality (such as peer review).

Not only might some say it: Many have said it, and they are quite right. That means citation counts have been validated against peer review, pairwise. Now it is time to cross-validate an entire spectrum of candidate metrics, so each can be weighted for its predictive contribution.

The conclusion seems to be that citation-based statistics, regardless of their precise meaning, should replace other methods of assessment, because they often agree with them. Aside from the circularity of this argument, the fallacy of such reasoning is easy to see.

The argument is circular only if unvalidated metrics are being cross-correlated with other unvalidated metrics. Then it’s a skyhook. But when they are cross-validated against a criterion like peer rankings, which have been the predominant basis for the RAE for 20 years, they are being cross-validated against a face-valid criterion — for which they can indeed be subsequently substituted, if the correlation turns out to be high enough.

“Damned lies and statistics”

Yes, one can lie with unvalidated metrics and statistics. But we are talking here about validating metrics against validated or face-valid criteria. In that case, the metrics lie no more (or less) than the criteria did, before the substitution.

Several groups have pushed the idea of using Google Scholar to implement citation-based statistics, such as the h-index, but the data contained in Google Scholar is often inaccurate (since things like author names are automatically extracted from web postings)…

This is correct. But Google Scholar’s accuracy is growing daily, with growing content, and there are ways to triangulate author identity from such data even before the (inevitable) unique author identifier is adopted.

Citation statistics for individual scientists are sometimes difficult to obtain because authors are not uniquely identified…

True, but a good approximation is — or will soon be — possible (not for arbitrary search on the works of “Lee,” but, for example, for all the works of all the authors in the UK university LDAPs).

Citation counts seem to be correlated with quality, and there is an intuitive understanding that high-quality articles are highly-cited.

The intuition is replaced by objective data once the correlation with peer rankings of quality is demonstrated (and replaced in proportion to the proportion of the criterion variance accounted for by the predictor metric).

But as explained above, some articles, especially in some disciplines, are highly-cited for reasons other than high quality, and it does not follow that highly-cited articles are necessarily high quality.

This is why validation/weighting of metrics must be done separately, discipline by discipline, and why citation metrics alone are not enough: multiple metrics are needed to take into account multiple influences on quality and impact, and to weight them accordingly.

The precise interpretation of rankings based on citation statistics needs to be better understood.

Once a sufficiently broad and predictive battery of metrics is validated and its weights initialized (e.g., in RAE 2008), further interpretation and fine-tuning can follow.

In addition, if citation statistics play a central role in research assessment, it is clear that authors, editors, and even publishers will find ways to manipulate the system to their advantage.

True, but inasmuch as the new metric batteries will be Open Access, there will also be multiple metrics for detecting metric anomalies, inconsistency and manipulation, and for naming and shaming the manipulators, which will serve to control the temptation.

Harnad, S. (2001) Research access, impact and assessment. Times Higher Education Supplement 1487: p. 16.

Harnad, S., Carr, L., Brody, T. & Oppenheim, C. (2003) Mandated online RAE CVs Linked to University Eprint Archives: Improving the UK Research Assessment Exercise whilst making it cheaper and easier. Ariadne 35.

Brody, T., Kampa, S., Harnad, S., Carr, L. and Hitchcock, S. (2003) Digitometric Services for Open Archives Environments. In: Proceedings of European Conference on Digital Libraries 2003, pp. 207-220, Trondheim, Norway.

Harnad, S. (2007) Open Access Scientometrics and the UK Research Assessment Exercise. In Proceedings of 11th Annual Meeting of the International Society for Scientometrics and Informetrics 11(1), pp. 27-33, Madrid, Spain. Torres-Salinas, D. and Moed, H. F., Eds.

Brody, T., Carr, L., Harnad, S. and Swan, A. (2007) Time to Convert to Metrics. Research Fortnight pp. 17-18.

Brody, T., Carr, L., Gingras, Y., Hajjem, C., Harnad, S. and Swan, A. (2007) Incentivizing the Open Access Research Web: Publication-Archiving, Data-Archiving and Scientometrics. CTWatch Quarterly 3(3).

Harnad, S. (2008) Self-Archiving, Metrics and Mandates. Science Editor 31(2): 57-59.

Harnad, S. (2008) Validating Research Performance Metrics Against Peer Rankings. Ethics in Science and Environmental Politics 8 (11) doi:10.3354/esep00088


Loet Leydesdorff wrote in the ASIS&T Special Interest Group on Metrics:
LL: “It seems to me that it is difficult to generalize from one setting in which human experts and certain ranks coincided to the existence of such correlations across the board. Much may depend on how the experts are selected. I did some research in which referee reports did not correlate with citation and publication measures.”

Much may depend on how the experts are selected, but that was just as true during the 20 years in which rankings by experts were the sole criterion for the rankings in the UK Research Assessment Exercise (RAE). (In validating predictive metrics one must not endeavor to be Holier than the Pope: Your predictor can at best hope to be as good as, but not better than, your criterion.)

That said: All correlations to date between total departmental author citation counts (not journal impact factors!) and RAE peer rankings have been positive, sizable, and statistically significant for the RAE, in all disciplines and all years tested. Variance there will be, always, but a good-sized component from citations alone seems to be well-established. Please see the studies of Professor Oppenheim and others, for example as cited in:

Harnad, S., Carr, L., Brody, T. & Oppenheim, C. (2003) Mandated online RAE CVs Linked to University Eprint Archives: Improving the UK Research Assessment Exercise whilst making it cheaper and easier. Ariadne 35.

LL: “Human experts are necessarily selected from a population of experts, and it is often difficult to delineate between fields of expertise.”

Correct. And the RAE rankings are done separately, discipline by discipline; the validation of the metrics should be done that way too.

Perhaps there is sometimes a case for separate rankings even at sub-disciplinary level. I expect the departments will be able to sort that out. (And note that the RAE correlations do not constitute a validation of metrics for evaluating individuals: I am confident that that too will be possible, but it will require many more metrics and much more validation.)

LL: “Similarly, we know from quite some research that citation and publication practices are field-specific and that fields are not so easy to delineate. Results may be very sensitive to choices made, for example, in terms of citation windows.”

As noted, some of the variance in peer judgments will depend on the sample of peers chosen; that is unavoidable. That is also why “light touch” peer re-validation, spot-checks, updates and optimizations on the initialized metric weights are also a good idea, across the years.

As to the need to evaluate sub-disciplines independently: that question exceeds the scope of metrics and metric validation.

LL: “Thus, I am a bit doubtful about your claims of an ‘empirically demonstrated fact’.”

Within the scope mentioned (the RAE peer rankings, for disciplines such as they have been partitioned for the past two decades), there are ample grounds for confidence in the empirical results to date.

(And please note that this has nothing to do with journal impact factors, journal field classification, or journal rankings. It is about the RAE and the ranking of university departments by peer panels, as correlated with citation counts.)

Stevan Harnad
American Scientist Open Access Forum

Stealing Empire – read, listen and join the subversion

This weekend, from 14-17 June, the Cape Town Book Fair takes over the Cape Town International Convention Centre, so this blog is about a new book, Stealing Empire, by Adam Haupt, published by the HSRC Press. Last year close on 50,000 visitors attended, giving the lie to the idea that South Africans don't read and are not attracted to books. As Dave Chislett said today in his new blog – the Chiz – on The Times newspaper's blog site, the problem is not that people don't read – witness the high circulation of popular newspapers – but rather that publishers do not publish for them, nor bookshops target readers beyond the safe urban middle class.

In celebration of the Book Fair, today I am therefore pointing to a book by a UCT colleague and partner in the PALM project, Adam Haupt, that does not target the popular readership Dave is talking about, but explores some of the issues of global media dominance that are part of the problem. Published by the HSRC Press, this is a scholarly title, but provides an incisive and lively account of the ways in which global corporate media interests dominate and appropriate 'aspects of youth, race, gender, cultural expression and technology for their own enrichment – much to the detriment of all society.' However, the real appeal of the book is not only the study of how this appropriation works, but also of how, in a country like South Africa, countercultures like that of the hip-hop activists in the Cape Flats of Cape Town in turn use new media and IP subversion to appropriate their own space. The book explores the MP3 revolution and Napster and digital sampling in hip-hop and explores alternatives to proprietary approaches to the production of culture and knowledge. This is a theorised account of dominant culture and subversion, drawing largely on Michael Hardt and Antonio Negri's concept of Empire. This use of theory, said UCT deputy-Vice-Chancellor at the launch a few weeks ago, is in itself an act of appropriation and subversion. We in the developing world, Martin argued, are not supposed to theorise; rather, we are required to provide the raw materials for the theorists of the North.

The extra treat is that you can listen to a podcast on the book that includes discussion of the book and material from what was a very lively launch. The book is published by the HSRC Press, which launched the book at the Book Lounge in Cape Town, with performances from Burni, of the Cape Town feminist hip-hop group Godessa, and Caco the Noble Savage, a hip-hop activist with a wonderfully ironic take on the impact of globalisation that is the subject of the book. Being able to listen to the artists that Adam is talking about provides an added dimension to the reading of the book: a must-read accompanied by a must-listen.

Given that this is an HSRC Press book, it is available full text online for free download. Print copies are available for sale in South Africa and in many other countries through print-on-demand distribution arrangements. So enjoy the Book Fair, but read Adam's book, too, to get a critical perspective of the forces at play.

Adam will be speaking in a panel at the Book Fair on Saturday afternoon – “Holding us together or pulling us apart? The role of the South African Media in the creation and mutation of identities.”

Position Statement on Open Access now on CLA website

The Canadian Library Association / Association canadienne des bibliothèques Position Statement on Open Access for Canadian Libraries, approved by the CLA Executive on May 21, 2008, has just been posted on the CLA website, at: http://www.cla.ca/AM/Template.cfm?Section=Position_Statements&Template=/CM/ContentDisplay.cfm&ContentID=5306

The text of the position statement is:

Whereas connecting users with the information they need is one of the library’s most essential functions, and access to information is one of librarianship’s most cherished values, therefore CLA recommends that Canadian libraries of all types strongly support and encourage open access.

CLA encourages Canadian libraries of all types to:

  • support and encourage policies requiring open access to research supported by Canadian public funding, as defined above. If delay or embargo periods are permitted to accommodate publisher concerns, these should be considered temporary, to provide publishers with an opportunity to adjust, and a review period should be built in, with a view to decreasing or eliminating any delay or embargo period.
  • raise awareness of library patrons and other key stakeholders about open access, both the concept and the many open access resources, through means appropriate to each library, such as education campaigns and promoting open access resources.
  • support the development of open access in all of its varieties, including gold (OA publishing) and green (OA self-archiving). Libraries should consider providing economic and technical support for open access publishing, by supporting open access journals or by participating in the payment of article processing fees for open access. The latter could occur through redirection of funds that would otherwise support journal subscriptions, or through taking a leadership position in coordinating payments by other bodies, such as academic or government departments or funding agencies.
  • support and encourage authors to retain their copyright, for example through the use of the CARL / SPARC Author’s Addendum, or through the use of Creative Commons licensing.

Editor’s Note

The Journal of Electronic Publishing Vol. 11 Issue 2, 2008-05-30.

“For more than a decade, electronic journals—periodicals that are distributed over computer networks—have operated on the periphery of academe, largely spurned by authors, publishers, and readers as no match for the traditional printed journal,” the Chronicle of Higher Education wrote in 1991.

Book Review: Modern Language Association of America. MLA Style Manual and Guide to Scholarly Publishing

The Journal of Electronic Publishing Vol. 11 Issue 2, 2008-05-30.

I purchased my first copy of the MLA Style Manual during my first year of college, and it has had a special place in my heart since then. So sensibly organized, and so easy to skim with its effective use of typography! While I’ve been pressed into relationships with other style guides since then—including an ongoing, troubled relationship with the Chicago Manual of Style—I find myself longing for the old days, when no other style guide clouded my thoughts.

O’Reilly Media’s Tools of Change Conference 2008

The Journal of Electronic Publishing Vol. 11 Issue 2, 2008-05-30.

If it can be argued that the value of a conference can be measured by the length and breadth of the discussion that it generates, the recent Tools of Change for Publishing conference in February 2008 exceeded expectations. The many debates among attendees and panelists, copious blog descriptions and analyses, and plenty of glowing reviews and conference reports have produced their own set of discussions on line, a valuable “social currency,” according to Douglas Rushkoff, who spoke at the conference.