
Every linear regression in R rests on four assumptions about the data. When the assumptions hold, the slope estimates, the p-values, and the confidence intervals are all trustworthy. When the assumptions break, those numbers become unreliable, even if the regression itself runs without an error message.
R produces four diagnostic plots that test each of these assumptions visually. One line of code generates all four. Most students who lose marks on regression assignments do so by skipping this step or by reading the plots without knowing what each one is testing.
This article covers the four plots one at a time: what each one tests, what a passing plot looks like, what failure looks like, and what to do when one of them fails.
Why regression has assumptions in the first place
A regression line is a kind of compromise. It cannot pass through every point in your data, so it sits in the position that makes the total error as small as possible. The math behind this compromise rests on four conditions: the relationship between predictor and outcome is genuinely linear, the residuals are normally distributed, the residuals have equal spread across the range of fitted values, and no single observation has outsized influence on the fit.
Each condition has its own diagnostic plot, and each plot has a clear visual signature for pass and fail. The plots are not optional checks for advanced students. They are part of the regression report. A regression without diagnostics is incomplete in the same way a chemistry experiment without controls is incomplete.
Setting up a working example
The walkthrough uses the mtcars dataset, which is built into R. A simple regression of mpg (miles per gallon) on wt (car weight in 1,000 lbs) gives a clean example with one predictor and 32 observations.
model <- lm(mpg ~ wt, data = mtcars)
That single line fits the model and stores it in an object called model. Once the model is fitted, all four diagnostic plots come from one plotting command, with a line of setup first:

par(mfrow = c(2, 2))
plot(model)
The par() command splits the plot window into a 2 by 2 grid. plot(model) then fills the four cells of that grid with the diagnostic plots. The next four sections take each plot in turn.
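You do not have to draw all four at once. plot() also accepts a which argument that selects a single panel, which is handy when a write-up only needs one of them:

plot(model, which = 1)  # Residuals vs Fitted only
plot(model, which = 2)  # Normal Q-Q only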
For students who want a fuller walkthrough of how the model is fitted and how the summary() output is read line by line, the full process is explained in Linear regression in R. The focus from here onwards is the diagnostic plots and what they tell you about the regression.
Plot 1: Residuals vs Fitted (the linearity check)
What it tests
The Residuals vs Fitted plot tests whether the relationship between your predictor and your outcome is genuinely a straight line. The y-axis shows the residuals, which are the gaps between actual mpg and predicted mpg. The x-axis shows the fitted values, which are the model’s predictions.
If the relationship really is linear, the residuals bounce around zero with no obvious pattern. There is no part of the predictor’s range where the model is consistently over-predicting or under-predicting.
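Both axes can be pulled straight out of the fitted model object, which is a useful sanity check when reading the plot:

head(fitted(model))     # the model's predictions (x-axis)
head(residuals(model))  # actual mpg minus predicted mpg (y-axis)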
What a passing plot looks like
The dots form a random cloud around the horizontal zero line. The smoothed red line drawn through the dots stays roughly flat across the entire plot. There is no curve, no funnel, no obvious shape.
What failure looks like
A clear curve in the red line is the most common failure pattern. The line dips in the middle and rises at the ends, or rises in the middle and falls at the ends. Either shape signals that a straight line is too simple for the actual relationship.
In the mtcars mpg-on-weight regression, this plot shows a slight curve. The red line is not perfectly flat. That points to a mild non-linearity, which is fixable by adding a squared term to the model.
What to do if it fails
The simplest fix is to add a polynomial term to the regression. For the mtcars example, that looks like this:

model_poly <- lm(mpg ~ wt + I(wt^2), data = mtcars)
summary(model_poly)
The I(wt^2) term lets the regression bend, which often removes the curve in the residual plot. Other fixes include log-transforming the outcome with log(mpg), or switching to a non-linear model entirely.
Whichever fix you pick, refit the model and run the diagnostic plots again to confirm the curve has gone.
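For the polynomial refit above, that means one more round of the same plotting commands:

par(mfrow = c(2, 2))
plot(model_poly)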
Plot 2: Normal Q-Q (the normality check)
What it tests
The Normal Q-Q plot tests whether the residuals follow a normal distribution. This matters because the p-values and the confidence intervals in the regression output assume the residuals are normally distributed. If they are not, those numbers become misleading, even when the regression itself looks fine.
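For students who want a number alongside the picture, base R's shapiro.test() runs a formal normality test on the residuals. It complements the plot rather than replacing it:

shapiro.test(residuals(model))  # small p-value suggests non-normal residuals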
What a passing plot looks like
The dots fall closely along the diagonal reference line, especially in the middle of the plot. A few points wandering off the line at the very top or very bottom are normal and not a cause for concern.
What failure looks like
A clear S-shape, where the dots curve away from the line at both ends, is the classic sign of heavy tails. That means the residuals have more extreme values at the high and low ends than a normal distribution allows. A reverse S-shape signals light tails, where the residuals cluster too tightly around the centre.
In the mtcars example, the Q-Q plot mostly hugs the line, with one or two points (the Toyota Corolla and the Fiat 128) drifting upwards at the top right. That is borderline, not a clear failure.
What to do if it fails
If the Q-Q plot fails badly, the usual response is to transform the outcome variable. A log transformation is the most common starting point, applied with log(mpg) inside the formula:
model_log <- lm(log(mpg) ~ wt, data = mtcars)
Other transformations include the square root for count-like outcomes, or the Box-Cox method available in the MASS package, which finds the best transformation automatically. After transforming, refit and check the Q-Q plot again. If the plot still fails after transformation, the residuals are genuinely non-normal, and a different family of model (such as glm() with a non-Gaussian family) is the right move.
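As a sketch of that Box-Cox step, the boxcox() function in MASS plots a likelihood curve over candidate power transformations of the outcome; a peak near zero points to the log:

library(MASS)
boxcox(model)  # lambda near 0 suggests log(mpg); lambda near 0.5 suggests sqrt(mpg)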
Plot 3: Scale-Location (the equal variance check)
What it tests
The Scale-Location plot tests whether the residuals have the same spread across the entire range of fitted values. The technical name for this assumption is homoscedasticity. The opposite situation, where the spread changes systematically, is called heteroscedasticity.
This matters because if the model is more accurate for some predictions than for others, the standard errors reported in the summary output are wrong. The slope estimate itself remains unbiased, but the p-values and confidence intervals around it become unreliable.
What a passing plot looks like
The smoothed red line is roughly horizontal, and the spread of dots above and below it is even across the entire x-axis. The y-axis shows the square root of the absolute standardised residuals, so a flat line at any height is fine. What matters is that the line does not slope.
What failure looks like
A clear slope in the red line, usually rising from left to right, is the textbook signature of heteroscedasticity. The dots themselves often form a funnel shape: tightly clustered on one side of the plot, fanned out on the other.
This pattern is common in datasets where the outcome is a count, a percentage, or a strictly positive quantity (like income, sales, or population). Bigger predicted values come with bigger errors, and the funnel opens to the right.
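For a numerical companion to this visual check, the Breusch-Pagan test from the lmtest package (used again in the next fix) returns a p-value for heteroscedasticity:

library(lmtest)
bptest(model)  # small p-value suggests the residual spread is not constant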
What to do if it fails
Two responses work. The first is the same outcome transformation used for non-normal residuals: log or square root, applied to the outcome variable. A log transformation often flattens both the Q-Q plot and the Scale-Location plot at the same time, which is why it is the most common first move.
The second response keeps the original model but uses robust standard errors. These adjust the standard errors to account for the heteroscedasticity, leaving the coefficients alone. The sandwich and lmtest packages handle this in two extra lines:
library(sandwich)
library(lmtest)
coeftest(model, vcov = vcovHC(model, type = "HC1"))
This is the standard fix in econometrics coursework, and it is the approach Wooldridge teaches in his textbook. The output looks like a regular summary() coefficient table, but the standard errors and p-values are now corrected.
Plot 4: Residuals vs Leverage (the influential points check)
What it tests
The Residuals vs Leverage plot identifies individual observations that are pulling the regression line in their direction. Some points have a large residual (the model predicts them poorly). Some points have high leverage (their predictor value is far from the average). The combination of the two is what causes a single observation to dominate the entire regression.
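Both ingredients can be extracted numerically from the fitted model, which makes it easy to check a specific car by name:

hatvalues(model)       # leverage of each observation
cooks.distance(model)  # combined influence of residual and leverage
head(sort(cooks.distance(model), decreasing = TRUE), 3)  # the three most influential cars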
What a passing plot looks like
All the dots sit comfortably inside the dashed Cook’s distance contours. There are no points labelled with row names that sit far from the rest of the cloud. The red line stays roughly flat near zero.
What failure looks like
One or more points sit outside the dashed Cook’s distance line, usually in the top-right or bottom-right corner of the plot. R automatically labels these points with their row name from the dataset, which makes them easy to identify.
In the mtcars example, the Chrysler Imperial often appears as a high-leverage point. It is one of the heaviest cars in the dataset, and it has a residual large enough to be worth investigating before the regression is reported.
What to do if it fails
The first move is to look at the flagged observation and check whether it is genuine. Sometimes high-leverage points come from data entry errors, like a decimal in the wrong place or a missing zero, which is a quick fix. Sometimes they come from genuinely unusual cases that belong to a different population than the rest of the data.
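A quick way to run that check is to pull the flagged row out by name and see where it sits in the distribution:

mtcars["Chrysler Imperial", ]  # the flagged observation
summary(mtcars$wt)             # where its weight falls relative to the other cars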
If the point is genuine, two options remain. Refit the regression without it and report both versions to show how much it influences the result. Or keep it and add a footnote acknowledging its influence in the assignment write-up. Removing data points without explanation is not acceptable in academic work, but reporting both versions usually is.
To refit without the Chrysler Imperial:
model_no_chrysler <- lm(mpg ~ wt, data = mtcars[-which(rownames(mtcars) == "Chrysler Imperial"), ])
summary(model_no_chrysler)
Compare the slope and the R-squared between the two models. A small change means the original model is robust. A large change means the Chrysler Imperial was exerting an outsized pull on the regression, and the original result needs a footnote.
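In code, the comparison takes four lines:

coef(model)              # slope and intercept with all 32 cars
coef(model_no_chrysler)  # slope and intercept without the Chrysler Imperial
summary(model)$r.squared
summary(model_no_chrysler)$r.squared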
How to write up the diagnostics in your assignment
A common reason students still lose marks on regression assignments, even after running the diagnostic plots, is failing to write about them in plain language. The plots themselves are not enough. The marker wants to see that you understood what each plot showed.
Here is the kind of paragraph that works for a passing case:
“The four diagnostic plots were inspected to test the regression assumptions. The Residuals vs Fitted plot showed a roughly flat smoothed line, supporting the linearity assumption. The Normal Q-Q plot showed residuals closely following the diagonal, supporting the normality assumption. The Scale-Location plot showed even spread, supporting homoscedasticity. The Residuals vs Leverage plot showed all observations within the Cook’s distance contours, indicating no single point unduly influenced the regression.”
And here is what to write when one assumption fails:
“The Scale-Location plot showed a clear upward slope in the smoothed line, suggesting heteroscedasticity. To address this, robust standard errors were calculated using the sandwich package with the HC1 estimator. The corrected p-value for weight remained below 0.001, supporting the conclusion that the relationship is statistically significant despite the violation of the equal-variance assumption.”
This second paragraph is what separates a competent regression report from an excellent one. It shows that you understood the violation, knew the appropriate fix, applied it correctly, and confirmed the conclusion still holds.
A worked example of a bad regression
To see what failure looks like in practice, fit a regression that is deliberately mis-specified. The example below predicts mpg from displacement (engine size in cubic inches) without accounting for the obvious non-linearity in that relationship.
bad_model <- lm(mpg ~ disp, data = mtcars)
par(mfrow = c(2, 2))
plot(bad_model)
The Residuals vs Fitted plot now shows a strong U-shape in the red line. The model under-predicts mpg for cars with very small or very large engines, and over-predicts in the middle. That is the diagnostic plot pointing to a curved relationship, not a straight one.
The fix is the polynomial term:
good_model <- lm(mpg ~ disp + I(disp^2), data = mtcars)
par(mfrow = c(2, 2))
plot(good_model)
Run the new diagnostic plots and the U-shape in the Residuals vs Fitted plot is largely gone. The red line is much flatter. The model now reflects the curved relationship between engine size and fuel economy that the simple straight line was missing.
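Because the two models are nested, anova() offers a numerical companion to the visual before-and-after, testing whether the squared term genuinely improves the fit:

anova(bad_model, good_model)  # small p-value means the polynomial model fits better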
This kind of before-and-after comparison is one of the most powerful things to include in a regression assignment. It shows the marker that you ran the diagnostics, identified the problem, applied a fix, and verified the fix worked. That is the full statistical workflow, written up in three paragraphs and four plots.
When the diagnostics keep failing
Sometimes a regression is the wrong tool for the data. If the outcome is binary (yes or no, pass or fail), linear regression is the wrong model and no transformation of the outcome fixes the diagnostic plots. Logistic regression is the appropriate choice, fitted with glm(y ~ x, family = binomial).
If the outcome is a count of events (number of accidents per month, number of customers per day), Poisson regression or negative binomial regression is the right family. If the data has a hierarchical structure (students within classrooms, patients within hospitals), mixed-effects models from the lme4 package are the next step. The diagnostic plots are useful here too, as they tell you when the model you have chosen is not capturing the structure in your data.
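As a sketch of what those alternatives look like as code (the variable names here are hypothetical, not columns in any built-in dataset):

glm(accidents ~ traffic_volume, family = poisson, data = your_data)    # count outcome
MASS::glm.nb(accidents ~ traffic_volume, data = your_data)             # over-dispersed counts
lme4::lmer(score ~ hours_studied + (1 | classroom), data = your_data)  # students within classrooms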
If the diagnostic plots have failed and the appropriate fix is not obvious, getting a second opinion before submission is faster than working through the textbook chapter by chapter.
R programming homework help from a verified expert covers exactly this situation: identifying which assumption has failed, picking the right fix, and writing up the result in language that earns marks.