
Linear regression in R needs one function: lm(). You hand it a formula and a dataset, and it draws a straight line through your data. That is most of the job done. The rest is reading the output and checking a few plots to make sure the line is a fair summary of the data.
Open RStudio. The dataset for this walkthrough is mtcars, which comes built into R, so there is nothing to load.
What linear regression actually does
Linear regression draws the best straight line through a scatter plot.
Picture 30 students plotted on a graph, with study hours on the x-axis and exam scores on the y-axis. The dots form a rough cloud. Linear regression looks at that cloud and finds the one straight line that gets closest to the points overall.
Once it has the line, R reports two things about it: the slope and the intercept. The slope answers a useful question, which is how much y changes when x goes up by 1. If the slope is 4.5, every extra study hour adds 4.5 points to the exam score, on average. The intercept is the predicted y when x is zero, so the exam score for a student who studied for zero hours.
Algebraically, the model is y = a + bx + error, where a is the intercept and b is the slope. The error term is the gap between what the line predicts and what actually happened. R takes care of the math, so the formula is for context.
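For a single predictor, the least-squares estimates have a simple closed form, so lm()'s numbers can be reproduced by hand. A minimal sketch in base R (the variable names x and y are just local shorthands, not part of the dataset):

```r
# lm() finds a and b by least squares; with one predictor the answers
# reduce to a covariance ratio and the means.
x <- mtcars$wt   # predictor: weight
y <- mtcars$mpg  # outcome: miles per gallon

b <- cov(x, y) / var(x)      # slope: change in mpg per 1,000 lbs
a <- mean(y) - b * mean(x)   # intercept: line passes through the means

c(intercept = a, slope = b)  # matches coef(lm(mpg ~ wt, data = mtcars))
```

The same two numbers come out of lm() in the next section; the point of the sketch is only that nothing mysterious happens inside the fit.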
The dataset: mtcars
The mtcars dataset has 32 rows, one row per car, taken from a 1974 Motor Trend article. Type head(mtcars) to see the first six rows.
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
The y variable for the rest of this walkthrough is mpg (miles per gallon). The x variable is wt (car weight in 1,000 lbs). Heavier cars use more fuel, so the regression slope ought to be negative: we expect mpg to drop as weight goes up.
Fit the model with lm()
Fitting the model is one line of code. The syntax is lm(y ~ x, data = your_data). The tilde reads as “explained by” out loud, so the formula below says “mpg explained by wt”.
model <- lm(mpg ~ wt, data = mtcars)
That line does the whole fit. The result is saved into an object called model, which now holds the intercept, the slope, and a long list of other values that the next functions use. To see the basics, type model on its own line and press enter.
model

Call:
lm(formula = mpg ~ wt, data = mtcars)

Coefficients:
(Intercept)           wt
     37.285       -5.344
The intercept is 37.285 and the slope on wt is -5.344. Together they describe the regression line. A car of zero weight has a predicted mpg of 37.285, which is mathematically valid even though no real car weighs nothing. For every extra 1,000 lbs of weight, the prediction drops by 5.344 mpg.
Plug in a real weight to see how the line works. A 3,000-lb car has wt = 3, so the predicted mpg is 37.285 + (-5.344 × 3), which is 21.25.
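The same arithmetic can be done in R with coef(), which returns the fitted intercept and slope as a named vector. A short sketch (the model is refit here so the chunk runs on its own):

```r
model <- lm(mpg ~ wt, data = mtcars)  # same fit as above

b <- coef(model)                       # named vector: (Intercept), wt
unname(b["(Intercept)"] + b["wt"] * 3) # predicted mpg for a 3,000-lb car
# about 21.25, matching the hand calculation
```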
The negative slope matches the intuition. The next question is whether -5.344 is a real effect or just noise from this particular sample of 32 cars. That is what summary() tells us.
Read the summary() output
Run summary(model) to see the full regression report.
summary(model)
Call:
lm(formula = mpg ~ wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.5432 -2.3647 -0.1252 1.4096 6.8727
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
The output looks dense, but for a beginner write-up, six numbers carry almost all of the meaning. Here is what to read and how to interpret it.
The Estimate column has the intercept and slope, same numbers as before. The slope of -5.3445 is the most important number in the output, because it is the actual relationship between weight and fuel economy.
The Pr(>|t|) column is the p-value. For wt it is 1.29e-10, which is 0.000000000129 written in plain decimals. Statisticians treat anything below 0.05 as significant, and a value this small is far below that threshold. The three asterisks at the end of the row are R’s shorthand for very significant.
Multiple R-squared is 0.7528. The interpretation is that weight alone explains 75.3 percent of the variation in mpg across these 32 cars. Anything above 0.7 is generally considered strong, although what counts as a good R-squared varies a lot by field.
Adjusted R-squared (0.7446) does the same job, but it adds a small penalty for each extra predictor in the model. Use this version when comparing two models that have different numbers of predictors.
Residual standard error is 3.046. In plain English, a typical prediction from this model misses the actual mpg by about 3 mpg. Residuals are the gaps between predicted mpg and actual mpg, and 3.046 measures the typical size of those gaps.
The F-statistic at the bottom asks whether the model as a whole performs better than no model at all. A p-value of 1.294e-10 answers that with a clear yes.
So the regression is telling us that weight has a strong, statistically significant effect on mpg, and that weight on its own explains roughly three-quarters of why one car gets different mileage from another in this dataset.
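For a write-up, the same numbers can be pulled out of the summary object directly rather than read off the printed report. A sketch of the accessors:

```r
model <- lm(mpg ~ wt, data = mtcars)
s <- summary(model)

s$coefficients   # estimates, std. errors, t values, p-values as a matrix
s$r.squared      # Multiple R-squared, 0.7528
s$adj.r.squared  # Adjusted R-squared, 0.7446
s$sigma          # residual standard error, 3.046
```

Using the stored values avoids retyping numbers into a report, which is a common source of rounding mistakes.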
One thing the summary leaves out is whether the model is actually valid. That answer comes from the diagnostic plots, and skipping them is the most common reason students lose marks on regression assignments. Checking regression assumptions in R goes through the full diagnostic process step by step.
Check the assumptions with diagnostic plots
All four diagnostic plots come from two extra lines of code on top of the existing model.

par(mfrow = c(2, 2))
plot(model)
The par() command splits the plot window into a 2 by 2 grid. plot(model) then fills the four cells of that grid with the diagnostic plots.
The first plot, Residuals vs Fitted, checks whether the relationship between x and y is genuinely a straight line. A roughly flat red line and dots scattered randomly around zero is the pass case. A clear curve in the line is a warning that the relationship bends in some way that the straight regression line is missing.
The second plot, Normal Q-Q, checks whether the residuals follow a normal distribution. Dots that hug the diagonal line are what you want. Heavy tails or a noticeable bend at the ends is a sign that the residuals are not normal, and that puts the p-values on shaky ground.
The third plot, Scale-Location, checks the assumption of equal residual spread across the range of fitted values, which goes by the name homoscedasticity. A flat line and even point spread is the pass case. A funnel shape, with dots more spread on one side than the other, signals heteroscedasticity, which is the situation where the model is more accurate for some predictions than others.
The fourth plot, Residuals vs Leverage, picks out individual observations that have an outsized pull on the regression. Any dot past the dashed Cook’s distance line is bending the fit toward itself. In mtcars the Chrysler Imperial often shows up as a leverage point, and it is worth a look before reporting results.
A clean regression gives you four scatter clouds with no obvious pattern in any of them. If a clear pattern shows up, that is the regression telling you something is off, and the type of pattern usually points to what.
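The leverage check in the fourth plot also has a numeric counterpart. A common rule of thumb (a convention, not something plot() reports) flags any observation whose Cook's distance exceeds 4/n. A sketch:

```r
model <- lm(mpg ~ wt, data = mtcars)
cd <- cooks.distance(model)         # one influence value per car

sort(cd, decreasing = TRUE)[1:3]    # the three cars with the strongest pull

names(cd)[cd > 4 / nrow(mtcars)]    # rule-of-thumb flag: above 4/n
```

In this model the Chrysler Imperial is among the flagged cars, which matches what the fourth plot shows.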
Adding a second predictor
To add more predictors, put a + in the formula.

model2 <- lm(mpg ~ wt + hp, data = mtcars)
summary(model2)
The output now reports a coefficient for wt and another for hp. The wt coefficient changes a bit, because the model is now controlling for horsepower at the same time. R-squared rises from 0.753 to 0.827, so horsepower explains roughly another 7 percentage points of the variation in mpg, on top of weight.
Adding more predictors is not always a good idea. Each one uses up a degree of freedom and risks introducing multicollinearity, which is the situation where two predictors overlap so much that the coefficients become unstable. The right metric to compare on in this case is Adjusted R-squared. If it stops going up when you add a new predictor, the predictor is not pulling its weight.
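The Adjusted R-squared comparison is one line per model once the fits exist. A sketch of the check:

```r
model  <- lm(mpg ~ wt, data = mtcars)
model2 <- lm(mpg ~ wt + hp, data = mtcars)

summary(model)$adj.r.squared   # one predictor
summary(model2)$adj.r.squared  # two predictors; higher, so hp earns its place
```

If the second number had come out at or below the first, that would be the signal that hp is not pulling its weight despite the higher Multiple R-squared.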
Predict new values with predict()
Once the model is fitted, predict() gives you the predicted mpg for any car weight you feed in.
new_cars <- data.frame(wt = c(2.5, 3.0, 4.0))
predict(model, newdata = new_cars)

       1        2        3
23.92395 21.25171 15.90724
So a 2,500-lb car has a predicted mpg of 23.9, a 3,000-lb car drops to 21.3, and a 4,000-lb car falls to 15.9. The numbers come straight out of the regression equation 37.285 + (-5.344 × wt).
To get 95 percent confidence intervals around each prediction, add interval = "confidence" to the call. (These intervals cover the average mpg at a given weight; for the wider range that covers an individual car, use interval = "prediction" instead.)
predict(model, newdata = new_cars, interval = "confidence")

       fit      lwr      upr
1 23.92395 22.55284 25.29506
2 21.25171 20.13104 22.37239
3 15.90724 14.59986 17.21462
The fit column is the prediction. The lwr and upr columns are the lower and upper bounds of the interval. The intervals get tighter near the average car weight in the dataset and wider towards the extremes, which is why predictions near the edges of the data carry more uncertainty.
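The distinction between the two interval types is easy to see side by side. A sketch comparing them at the same three weights:

```r
model <- lm(mpg ~ wt, data = mtcars)
new_cars <- data.frame(wt = c(2.5, 3.0, 4.0))

# "confidence" bounds the average mpg at each weight;
# "prediction" bounds a single new car, so it is always wider
predict(model, newdata = new_cars, interval = "confidence")
predict(model, newdata = new_cars, interval = "prediction")
```

Assignments that ask for "the range a new car's mpg is likely to fall in" are asking for the prediction interval, not the confidence interval.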
Where students lose marks
Across hundreds of graded R regression assignments, four mistakes show up far more often than any others.
The first is skipping the diagnostic plots. Reporting a regression without checking residuals is a bit like submitting code without running it. Most marking rubrics deduct points specifically for missing diagnostics, even when the regression itself is fine.
The second is reporting R-squared without context. An R-squared of 0.4 looks poor in a textbook physics example, fairly decent in behavioural data, and sometimes excellent in social science research. The number on its own does not say much, so it is worth a sentence of comparison.
The third is misinterpreting the coefficients. A wt coefficient of -5.34 means “every extra 1,000 lbs of weight reduces predicted mpg by 5.34”, not “weight reduces mpg by 534 percent”. When writing up results, always restate the coefficient in real units, with the variable spelled out.
The fourth is mixing up statistical significance with practical importance. A predictor with a p-value of 0.001 and a coefficient of 0.0002 is statistically significant and practically meaningless. Read the coefficient and the p-value alongside each other before drawing any conclusions.
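One way to keep significance and size in view together is confint(), which reports a 95 percent interval for each coefficient, so the magnitude of the effect and its uncertainty sit side by side. A sketch:

```r
model <- lm(mpg ~ wt, data = mtcars)
confint(model)  # 95% intervals for the intercept and the wt slope
# the wt interval sits entirely below zero: significant, and also large
```

A coefficient whose interval excludes zero but spans only tiny values would be the statistically-significant-but-practically-meaningless case from the paragraph above.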
If the output is not making sense and the deadline is closing in, R programming homework help from a verified expert takes apart the model choice, the output, and the diagnostic plots with you before you submit.
Next steps
The same lm() syntax handles a lot more than this walkthrough covered. Interaction terms (written as wt * hp), categorical predictors handled through factors, and polynomial terms through poly() all slot into the same formula structure.
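As a quick illustration of those formula variations on mtcars (a sketch only; the coefficients are not interpreted here):

```r
# interaction: wt * hp expands to wt + hp + wt:hp
m_int  <- lm(mpg ~ wt * hp, data = mtcars)

# categorical predictor: factor(cyl) becomes dummy variables for 6 and 8 cyl
m_fac  <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# polynomial: a quadratic in weight via orthogonal polynomial terms
m_poly <- lm(mpg ~ poly(wt, 2), data = mtcars)
```

Each of these still answers to summary(), plot(), and predict() in exactly the same way as the simple model.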
Once the outcome variable becomes binary instead of numeric, for example pass or fail, the function changes from lm() to glm(), and Logistic regression in R picks up the story from there.
If the assignment specifies robust standard errors, clustered standard errors, or weighted least squares, the syntax expands further with extra packages, but the underlying workflow stays the same: fit the model, run summary() on it, then go through the diagnostics.