Do explanation methods for machine-learning models work correctly?

Researchers have developed a method to check whether popular techniques for understanding machine-learning models are working correctly.

Imagine a team of physicians using a neural network to detect cancer in mammogram images. Even when the machine-learning model appears to perform well, it could be focusing on image features that are coincidentally correlated with cancer, such as a watermark or a timestamp, rather than genuine signs of the disease.

To test these models, researchers use feature-attribution methods, techniques that are meant to identify which parts of an image are most important to the neural network’s prediction. But what if the attribution method misses features that matter to the model? Since researchers don’t know which features are important in the first place, they have no way of knowing that their evaluation method falls short.

MIT researchers have devised a way to modify the original data so they know exactly which features are important to the model. They then use this modified dataset to evaluate whether feature-attribution methods correctly identify those important features.

The researchers found that even the most popular methods can miss the important features in an image, and that some methods perform worse than a random baseline. This could have serious implications for neural networks used in high-stakes settings such as medical diagnosis, says Yilun Zhou, an electrical engineering and computer science graduate student in the Computer Science and Artificial Intelligence Laboratory (CSAIL).

These methods are widely used, especially in high-stakes scenarios such as detecting cancer from X-rays or CT scans, Zhou notes, but the feature-attribution methods themselves may be wrong. They can highlight something that does not correspond to the true feature the model uses to make a prediction, which the team found to often be the case. Anyone who wants to use these methods as evidence that a model is working correctly, he says, should first make sure the attribution method itself is working correctly.

Zhou co-authored the paper with Serena Booth, a fellow EECS graduate student; Marco Tulio Ribeiro, a researcher at Microsoft Research; and senior author Julie Shah, an MIT professor of aeronautics and astronautics.

Focus on the features

In image classification, every pixel of an image is a feature the neural network can use to make predictions, so there are millions of possible features it could focus on. Suppose researchers want to build a model that helps amateur photographers improve: they could train it to distinguish photos taken by professional photographers from photos taken by casual tourists, then use it to compare amateur photos to professional ones and offer feedback. Ideally, the model would pick up on the artistic elements of professional images, such as composition, color space, and post-processing. But it just so happens that professionally shot photos often include a watermark of the photographer’s name, while few tourist photos do, so the model could take a shortcut and simply learn to find the watermark.

We don’t want to tell photographers that adding a watermark is all it takes to succeed, so we want to make sure the model focuses on the artistic features rather than on the presence of a watermark, Zhou says. It is tempting to use feature-attribution methods to analyze the model, but there is no guarantee they will work properly, since the model could be relying on artistic features, the watermark, or any other features.

“We don’t know what spurious correlations are in the dataset,” Booth says. There could be many things that are completely imperceptible to a human, such as the resolution of an image, and even if a feature is not perceptible to us, a neural network can likely pull it out and use it to classify images. That is the root of the problem: we do not understand our datasets very well, and it may not even be possible to understand them completely.

To get around this, the researchers modified the dataset to weaken all correlations between the original images and the data labels, which guarantees that none of the original features remain important.

They then add a new feature that is so obvious the neural network has to focus on it to make its prediction, for example, bright rectangles of different colors for different image classes.

Any model that achieves very high confidence must be focusing on the colored rectangle the researchers put in, Zhou says. That lets them check whether the feature-attribution methods rush to highlight that location rather than anything else.
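
As a rough illustration of this setup, the sketch below (in Python, with made-up names such as build_modified_dataset and patch_size; it is not the authors’ code) shuffles the labels so that no original feature remains predictive and then stamps a bright, class-specific patch onto each image.

```python
import numpy as np

def build_modified_dataset(images, labels, patch_size=8, seed=0):
    """Illustrative sketch: destroy the original image-label correlations,
    then inject one obvious feature the model is forced to rely on."""
    rng = np.random.default_rng(seed)
    images = np.array(images, dtype=np.float32)

    # 1) Shuffle the labels so no original feature stays correlated with them.
    new_labels = rng.permutation(labels)

    # 2) Stamp a bright patch whose intensity is keyed to the (new) class,
    #    making it the only reliable signal for prediction.
    n_classes = int(np.max(new_labels)) + 1
    for img, y in zip(images, new_labels):
        img[:patch_size, :patch_size] = 255.0 * (y + 1) / n_classes

    return images, new_labels
```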

Results that are “especially alarming”

The researchers used this technique to test a variety of feature-attribution methods. These methods produce a saliency map, which shows how the important features are distributed across an image. For instance, if the neural network is classifying images of birds, the saliency map might show that 80 percent of the important features are concentrated around the bird’s beak.
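
The paper covers a range of attribution methods; purely as an illustration of the kind of map they produce, a minimal vanilla gradient saliency map can be computed as in the PyTorch sketch below (an assumed example, not necessarily one of the methods used in the study).

```python
import torch

def gradient_saliency(model, image, target_class):
    """Minimal vanilla-gradient saliency: how sensitive the target class
    score is to each pixel of the input image."""
    model.eval()
    x = image.clone().unsqueeze(0).requires_grad_(True)  # add batch dimension
    score = model(x)[0, target_class]                     # score of the target class
    score.backward()                                      # gradients w.r.t. pixels
    # Collapse the color channels, leaving one importance value per pixel.
    return x.grad.abs().squeeze(0).max(dim=0).values
```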

After removing all the correlations in the original data, the researchers manipulated the images in various ways, such as blurring parts of the picture, adjusting the brightness, or adding a watermark. If a feature-attribution method is working correctly, nearly 100 percent of the features it marks as important should be located around the area the researchers manipulated.
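
One illustrative way to score this, sketched below under the assumption that the saliency map and a mask of the manipulated region are available as arrays (the exact metric and baseline used in the paper may differ), is to measure what fraction of the most salient pixels fall inside that region: a value near 1.0 means the attribution method found the injected feature, while a value near chance matches a random baseline.

```python
import numpy as np

def fraction_in_region(saliency, region_mask, top_frac=0.1):
    """What fraction of the top `top_frac` most salient pixels lie inside
    the manipulated region? Higher is better for the attribution method."""
    saliency = np.asarray(saliency, dtype=np.float64).ravel()
    region_mask = np.asarray(region_mask, dtype=bool).ravel()
    k = max(1, int(top_frac * saliency.size))
    top_idx = np.argpartition(-saliency, k - 1)[:k]  # indices of the top-k pixels
    return float(region_mask[top_idx].mean())
```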

The results were not encouraging. None of the feature-attribution methods came close to the 100 percent goal; most barely reached the 50 percent level of a random baseline, and some performed even worse. In some cases, the methods failed to highlight the new feature even though it was the only feature the model could have been using to make its prediction.

“None of these methods seems to be reliable across all types of spurious correlations. This is especially alarming because, with natural datasets, we don’t know which spurious correlations might apply,” Zhou says. It could be any number of factors, he adds; the researchers had assumed these methods could be trusted to tell them the truth, but their experiment shows how hard they are to trust.

They also found that all of the feature-attribution methods were better at detecting an anomaly than recognizing the absence of one; they could spot a watermark more readily than they could establish that an image does not contain one. In this case, it would be harder for humans to trust a model that gives a negative prediction.

The team’s work shows that feature-attribution methods should be tested before they are applied to real-world models, particularly in high-stakes scenarios.

Researchers and practitioners may employ explanation techniques such as feature-attribution methods to build a person’s trust in a model, Shah explains, but that trust is not founded unless the explanation technique has first been rigorously evaluated. An explanation technique may be used to help calibrate a person’s trust in a model, but it is equally important to calibrate a person’s trust in the explanations of that model.

The researchers plan to use their evaluation procedure to study more subtle and realistic features that could create spurious correlations. They also want to explore how to help humans better understand saliency maps so they can make better decisions based on a neural network’s predictions.