Assessing a company’s environmental, social and governance (ESG) behavior is a qualitative, subjective undertaking. New studies show that the major firms issuing ESG “ratings” use criteria that differ enough to make research findings unreliable when those findings rest on the firms’ databases.

Given that an estimated $30 trillion in assets is invested based on ESG ratings, the ratings providers are influential institutions that inform a wide range of decisions in both business and finance. The trend toward ESG investing has also led to a large body of academic research on its impact, and these studies often rely on ESG ratings for their empirical analyses. The problem is that while corporate bond credit ratings from different agencies are highly correlated – credit ratings from Moody’s and Standard & Poor’s are correlated at 0.994 – ESG ratings from the various providers do not show the same consistency. This can lead to inconsistent research findings.

Florian Berg, Julian Koelbel and Roberto Rigobon, authors of the August 2019 study “Aggregate Confusion: The Divergence of ESG Ratings,” contribute to the literature by investigating the divergence of ESG ratings. Their data come from six prominent rating agencies: KLD Research & Analytics (MSCI Stats), Sustainalytics, Vigeo Eiris (Moody’s), RobecoSAM (S&P Global), ASSET4 (Refinitiv) and MSCI IVA. They began by sorting all the indicators supplied by the different data providers into a common taxonomy of 64 categories and 641 indicators. They then calculated category scores for each rating by taking simple averages of the indicators that belong to the same category. Next, to obtain comparable aggregation rules, they re-estimated each original rating from these category scores, estimating the weight of each category in a simple non-negative linear regression. Finally, they decomposed the divergence in scores into three sources:

  • Different scope of categories – the set of attributes that together constitute the overall concept of ESG performance. Attributes such as greenhouse gas emissions, employee turnover, human rights and lobbying may be included in the scope of one rating but not another.
  • Different measurement of categories – the indicators that provide numerical measures of the attributes. For example, if two raters want to measure discrimination against women, one rater could look at the gender pay gap, while the other could use the percentage of women on the board and/or in the workforce. The two measures may be correlated but will likely deliver somewhat different results.
  • Different weights of categories – an aggregation rule that combines the set of indicators representing numerical measures of the attributes into a single rating. Rating agencies take different views on the relative importance of attributes and whether performance in one attribute compensates for another. For example, a rating agency that is more concerned with carbon emissions than electromagnetic fields will assign different weights than a rating agency that cares equally about both issues. Different industries might also have different weights, as some attributes are judged more important to some industries than others.
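
The two estimation steps described above (category scores as simple averages of indicators, then category weights from a non-negative linear regression) can be illustrated with a minimal Python sketch. It is not the authors' code: the indicator names, the three-category taxonomy, the 0-to-1 indicator scale and all numbers are hypothetical, and the projected gradient descent routine is only a dependency-free stand-in for whatever non-negative least squares solver a real analysis would use.

```python
from statistics import mean


def category_scores(indicators, taxonomy):
    """Average one firm's indicator values within each taxonomy category."""
    scores = {}
    for category, names in taxonomy.items():
        values = [indicators[name] for name in names if name in indicators]
        if values:  # no indicators at all: category is outside this rater's scope
            scores[category] = mean(values)
    return scores


def nonnegative_weights(X, y, steps=5000, learning_rate=0.1):
    """Least-squares fit of y on X with every weight constrained to be >= 0.

    Uses projected gradient descent (an illustrative stand-in for a
    proper non-negative least squares solver).
    X: one row of category scores per firm; y: that rater's published rating.
    """
    n_firms, n_categories = len(X), len(X[0])
    w = [0.0] * n_categories
    for _ in range(steps):
        residuals = [sum(wj * xj for wj, xj in zip(w, row)) - target
                     for row, target in zip(X, y)]
        gradient = [2.0 / n_firms * sum(r * row[j] for r, row in zip(residuals, X))
                    for j in range(n_categories)]
        # Gradient step, then clip negatives to zero (the projection).
        w = [max(0.0, wj - learning_rate * gj) for wj, gj in zip(w, gradient)]
    return w


# Hypothetical rater: reports three indicators, none on lobbying or pay gap.
indicators = {"ghg_emissions": 0.25, "energy_use": 0.75, "board_women_pct": 0.5}
taxonomy = {
    "climate": ["ghg_emissions", "energy_use"],
    "diversity": ["board_women_pct", "gender_pay_gap"],
    "lobbying": ["lobbying_spend"],
}
scores = category_scores(indicators, taxonomy)

# Hypothetical category scores (climate, diversity) for four firms, with
# ratings built from known weights so the recovered weights can be checked.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.2]]
y = [0.7 * climate + 0.3 * diversity for climate, diversity in X]
weights = nonnegative_weights(X, y)

print(scores)
print([round(w, 3) for w in weights])
```

In this toy setup the missing "lobbying" category shows a difference in scope, the choice of which indicators feed "diversity" shows a difference in measurement, and the fitted weights show the aggregation rule. With real data, a library solver such as `scipy.optimize.nnls` would be the more usual choice than the hand-rolled loop.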