What is correspondence analysis?
Correspondence analysis, also called reciprocal averaging, is a data visualization technique for finding and displaying the relationships between categories. It plots the categories as points on a graph, so the associations between two or more categorical variables can be seen at a glance.
It is a multivariate statistical tool that was first proposed in 1935 by Herman Otto Hartley. Hartley wrote a paper on contingency tables that paved the way for Jean-Paul Benzécri to develop the analysis technique in the 1960s that we know today. Since its development, it has grown in popularity and the ways that it is applied.
A correspondence analysis starts from a contingency table (a table of frequencies) that shows how observations are distributed across the categories of the variables. Each cell in the table is transformed relative to its row and column totals, producing values that describe association rather than raw frequency. The resulting values are then graphed to show those relationships visually.
How does correspondence analysis work?
Not everything in life runs on a perfect scale from zero to ten, nor does a simple scale cover all the attributes and categories needed. This is where correspondence analysis shines. Essentially, it takes a table of data and turns it into comparisons that allow inferences to be drawn. For instance, take a table of a year's sales data broken down by department and by month.
The analysis first computes an expected value for each cell of this table: the row total multiplied by the column total, divided by the grand total. This expected value is then subtracted from the observed figure in that cell. These “residual” numbers show the association, or lack thereof, between the row and column labels. So the analysis is not showing how much money a department made in a certain month; it’s showing the association between that month and that department’s figures.
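The residual calculation can be sketched in a few lines of Python. This is a minimal illustration only: the department names, months, and sales figures are hypothetical, and full correspondence analysis usually goes one step further and standardizes these residuals.

```python
# Hypothetical sales counts: rows are departments, columns are months.
departments = ["Clothing", "Homewares", "Electronics"]
months = ["June", "July", "December"]
observed = [
    [120, 180, 150],
    [80, 70, 200],
    [100, 110, 190],
]


def expected_counts(table):
    """Expected count per cell if rows and columns were unrelated."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Row total times column total, divided by the grand total.
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def residuals(table):
    """Observed minus expected count for every cell."""
    expected = expected_counts(table)
    return [
        [obs - exp for obs, exp in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]


for name, row in zip(departments, residuals(observed)):
    print(name, [round(value, 1) for value in row])
```

A positive residual means the department sold more in that month than its overall share would predict; a negative residual means it sold less. Every row of residuals sums to zero, which is why they describe association rather than volume.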
The points on the graph show the relationships between the categories; broadly, the closer two points are, the stronger the association between them. Do more people buy homewares in December? Is there any relationship between month and clothing sales? For instance, if a store has a big clothing sale in July, the clothing point could be expected to sit closer to July than to the other months. Each of the horizontal and vertical dimensions is labeled with the percentage of the variance in the data that it explains.
But that reading is overly simplistic, because correspondence analysis shows relativities. It’s not showing which month has the highest sales; it shows that clothing sales spiked only 29 percent in July, while homewares spiked 82 percent in December.
If the organization is only interested in how sales have changed over time or which department sells the most, then raw data and simple tables will be a better way of showing the data.
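For readers who want to see where the plotted points and the percentage of variance on each axis come from, here is a rough sketch using NumPy. It assumes the common formulation based on a singular value decomposition of the standardized residuals; the function name and the sales figures are hypothetical, and dedicated statistical packages may scale the coordinates differently.

```python
import numpy as np


def correspondence_analysis(table):
    """Return row coordinates, column coordinates, and variance share per axis."""
    counts = np.asarray(table, dtype=float)
    P = counts / counts.sum()            # proportions of the grand total
    r = P.sum(axis=1)                    # row masses (row shares)
    c = P.sum(axis=0)                    # column masses (column shares)
    expected = np.outer(r, c)            # proportions expected under no association
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    k = min(counts.shape) - 1            # number of meaningful dimensions
    U, sv, Vt = U[:, :k], sv[:k], Vt[:k, :]
    row_coords = (U * sv) / np.sqrt(r)[:, None]     # principal coordinates
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]
    explained = sv**2 / np.sum(sv**2)    # share of variance (inertia) per axis
    return row_coords, col_coords, explained


# Hypothetical sales counts: departments (rows) by months (columns).
sales = [[120, 180, 150], [80, 70, 200], [100, 110, 190]]
rows, cols, explained = correspondence_analysis(sales)
print("Share of variance per axis:", np.round(explained, 3))
```

The first two columns of `rows` and `cols` are the horizontal and vertical positions that would be drawn on the map, and `explained` holds the percentages that label the axes. The sketch assumes the table shows at least some association; a table with none at all would give zero variance to share out.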
Uses of correspondence analysis
For a business, correspondence analysis makes a variety of relationships easy to understand. For example, brand mapping is a form of correspondence analysis. Brand maps are used to place business attributes and products on a graph. If products are placed closely together on the map, it indicates that their images or profiles are similar, which can help inform strategy.
For marketing, a correspondence analysis can answer questions such as:
- Are there gaps in the market that could be filled by this business?
- Is the brand positioning correct?
- Could the business differentiate itself from the competition?
- What attributes do the competitors own, or, alternatively, does this business own?
For example, think of a very simple correspondence analysis. The X variable that runs across the horizontal line is value for money, with affordable at one end and high-end at the other. The Y variable, running vertically, is healthiness, running from very healthy to very unhealthy.
Fast food companies are plotted on the graph using a variety of data points. Being more affordable and more unhealthy, McDonald’s would be placed in one quadrant, while a make-your-own salad bar might be on the expensive but healthy quadrant. Placing all the major fast food companies on the graph shows very clearly where there’s a lot of competition or where there is literally a gap in the market.
Correspondence analysis is valuable in terms of brand perceptions for a few reasons. It does away with the interference from brand size; there’s no misleading effect from being an overly large company. It also gives a quick and intuitive overview of brand attribute relationships that aren’t presented by other graphical techniques.
Alternatives to correspondence analysis
The point of correspondence analysis is to compare categories. There are a few other statistical methods that go some of the way to performing the same or a similar task, including chi-square tests, principal components analysis, and factor analysis, which will be explored in greater detail below.
Chi-square tests assess the relationship between categories numerically rather than graphically. They give you a “goodness of fit” statistic, measuring how well the observed data fits the expected distribution. However, they need one test per pair of variables, and so once you have a group of variables to compare, they become cumbersome.
Chi-square tests also examine if rows and columns have a statistically significant association. While correspondence analysis is related to the chi-square, it is not an inferential method for testing theories and hypotheses.
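To make the link between the two methods concrete, the following sketch computes the chi-square statistic for a contingency table by hand. The sales figures are hypothetical. The third value returned, the statistic divided by the number of observations, is what correspondence analysis calls the total inertia: the overall amount of association that the map then breaks down across its axes.

```python
def chi_square_statistic(table):
    """Return (chi-square statistic, degrees of freedom, total inertia)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for row, row_total in zip(table, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, degrees_of_freedom, statistic / n


# Hypothetical sales counts: departments (rows) by months (columns).
sales = [[120, 180, 150], [80, 70, 200], [100, 110, 190]]
statistic, dof, inertia = chi_square_statistic(sales)
print(round(statistic, 2), dof, round(inertia, 4))
```

The statistic is compared against a chi-square distribution with the given degrees of freedom to decide significance; with 4 degrees of freedom, the 5 percent critical value is about 9.49. Correspondence analysis uses the same quantities but stops short of that comparison.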
Principal components analysis (PCA) and factor analysis (FA)
These data reduction techniques are regularly used to capture the variation across a set of variables, but they are designed for continuous variables. Factor analysis has a proposed extension for ordinal and binary variables, but it assumes that continuous variables with a bivariate normal distribution underlie the observed categories. Principal components analysis builds linear combinations of the observed variables, while factor analysis models latent variables.
Benefits of correspondence analysis
The benefits of correspondence analysis are:
It shows the relationships between categories
The way the information is presented visually means that anyone can easily understand the strength of the relationships between categories with a bit of training or explanation.
It is objective and makes no assumptions
Because the map is built from each figure's relationship to the other results rather than from the raw results themselves, correspondence analysis is very objective. There are no underlying distributional assumptions, and so it accommodates all categorical variables.
There are multiple variables
The obvious strength of correspondence analysis is that it easily and simply handles multiple variables. Few other statistical methods do this with such ease.
It makes things simpler
Unlike many other data science tools, correspondence analysis takes a huge, unwieldy table with multiple variables and categories and, in the end, provides a simple visualization.
Limitations and challenges of correspondence analysis
It is easily misunderstood
Because correspondence analysis shows relative relationships, people reading the graph often misconstrue the results. Concluding that two categories are unrelated simply because their points sit far apart is a mistake born of not understanding how the graph is built.
Solution: For most purposes, a simple sales table or bar graph would be far more easily read and understood than a correspondence analysis.
The data must be consistent
Correspondence analysis is only useful when there are at least two rows and two columns to the data. There should be no missing data, no negative data, and all the data must have an identical scale.
Many tables, for instance, have a column or row devoted to totals, the sum of all that row or column. However, that means the table cannot be turned into a correspondence analysis chart because the totals are on a different scale to the remainder of the table.
Some tables include percentages as well as counts. This will make the data useless, so the percentages need to be removed.
Solution: Many analysis tools will automatically remove totals, percentages, or non-count data lines. They can also transform the data onto a consistent scale and remove negatives. Without these adjustments, however, the analysis is useless.
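Where a tool does not do this cleaning automatically, part of it can be sketched by hand. The function below is a hypothetical helper, not part of any particular package: it assumes totals are marked with the label "Total", drops that row and column, and rejects missing or negative values. It does not detect percentages, which would still need to be removed separately.

```python
def prepare_count_table(header, rows, total_label="Total"):
    """Drop the totals row and column, then check the remaining counts.

    header: column labels; the first entry names the row-label column.
    rows:   lists holding a row label followed by that row's values.
    """
    keep = [i for i, name in enumerate(header) if i > 0 and name != total_label]
    labels, table = [], []
    for row in rows:
        if row[0] == total_label:
            continue  # skip the totals row
        values = [row[i] for i in keep]
        for value in values:
            if value is None:
                raise ValueError(f"missing value in row {row[0]!r}")
            if value < 0:
                raise ValueError(f"negative value in row {row[0]!r}")
        labels.append(row[0])
        table.append(values)
    if len(table) < 2 or len(keep) < 2:
        raise ValueError("need at least two rows and two columns")
    return labels, [header[i] for i in keep], table


# Hypothetical table that still contains a totals row and column.
header = ["Department", "July", "December", "Total"]
rows = [
    ["Clothing", 180, 150, 330],
    ["Homewares", 70, 200, 270],
    ["Total", 250, 350, 600],
]
labels, columns, table = prepare_count_table(header, rows)
print(labels, columns, table)
```

Raising an error on bad data, rather than silently fixing it, is a deliberate choice here: a missing or negative count usually signals a problem upstream that is worth checking before any map is drawn.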
Correspondence analysis is too influenced by outliers
When the data is being averaged out in the multivariate table, if there is outlying data, it skews the entire outcome. The influence of outlying data is huge and can cause the entire analysis to be misrepresented.
Solution: Short of removing all outliers, there is no clear solution to this. Besides ensuring the figures are correct, there are no other methods to produce more accurate graphical relationships. However, outliers are tempered to some degree by the averaging out of the data points, and some scientists say that outliers are a strength of correspondence analysis, not a weakness.
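Although there is no fix, it is possible to check how much each row drives the result. The sketch below is one possible diagnostic, not a standard prescribed by this article: it computes each row's share of the total inertia (the chi-square statistic divided by the number of observations), using hypothetical sales figures.

```python
def row_inertia_shares(table):
    """Share of the total inertia contributed by each row of the table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    contributions = []
    for row, row_total in zip(table, row_totals):
        contribution = 0.0
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            contribution += (observed - expected) ** 2 / expected
        contributions.append(contribution / n)
    total = sum(contributions)
    return [value / total for value in contributions]


# Hypothetical sales counts: departments (rows) by months (columns).
departments = ["Clothing", "Homewares", "Electronics"]
sales = [[120, 180, 150], [80, 70, 200], [100, 110, 190]]
for name, share in zip(departments, row_inertia_shares(sales)):
    print(f"{name}: {share:.1%}")
```

A row that accounts for a very large share of the inertia is shaping most of the map, so its figures are the first ones worth double-checking.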
Scaling of coordinates on the maps
When the correspondence analysis graph is drawn up, the row and column coordinates are created. However, these can be drawn in such a way that the resulting relationships are not displayed accurately. This may result in an odd-looking map, with clusters of data bunched closely together and other data points placed a long way apart.
Solution: If there is a large variance across coordinates, there is no way to minimize the distance between the points without changing the scaling on the map.
Lack of statistical significance
Unlike chi-square tests, which clearly show statistical significance, correspondence analysis only shows a relationship. It offers no way to measure whether these relationships are significant or whether their strength is due to anything other than chance.
Correspondence analysis is still the generally accepted method
Despite the range of challenges inherent in correspondence analysis, it is still the generally accepted way of visually displaying the relationship and association between two or more categorical variables.
While primarily used in scientific endeavors, there is a place for correspondence analysis in business. Correspondence analysis can be a valuable tool, as long as the people viewing the map understand that it is not plotting the raw data points, but the relationship between variables. Once it’s understood how these maps are created and what the analysis is about, correspondence analysis is a powerful tool that ignores brand-sizing effects and delivers powerful, easily interpretable insights into relationships within a brand and between brands.