Beautiful Tables for Models
Introduction to gtsummary
Next, we’ll use the gtsummary package, an extension of gt that can incorporate statistical model outputs into publication-ready tables.
As with any R package, we first need to install gtsummary (once), then load it for use during this session.
# install.packages("gtsummary")  # run once if the package is not yet installed
library(gtsummary)
Descriptive statistics
The base R approach to summarizing a data frame, summary(), produces output that is thorough but not formatted for publication.
summary(data)
year month day julian_date sampling_date
Min. :2013 Min. : 1.00 Min. : 1.00 Min. :13177 Min. : 1.0
1st Qu.:2013 1st Qu.: 4.00 1st Qu.: 8.00 1st Qu.:13262 1st Qu.:13.0
Median :2014 Median : 6.00 Median :15.00 Median :13680 Median :25.5
Mean :2014 Mean : 6.36 Mean :14.71 Mean :13693 Mean :25.5
3rd Qu.:2014 3rd Qu.: 9.00 3rd Qu.:22.00 3rd Qu.:14093 3rd Qu.:38.0
Max. :2014 Max. :12.00 Max. :31.00 Max. :14956 Max. :50.0
lake h2o_temp_c air_temp_c wind_speed_kph
Length:150 Min. :13.90 Min. :12.50 Min. : 0.000
Class :character 1st Qu.:17.95 1st Qu.:19.62 1st Qu.: 3.299
Mode :character Median :21.50 Median :23.02 Median : 6.437
Mean :21.63 Mean :23.19 Mean : 7.170
3rd Qu.:25.25 3rd Qu.:26.69 3rd Qu.: 8.851
Max. :32.78 Max. :40.67 Max. :38.624
NA's :3
ph total_alkalinity_ppm chlorophyll_a_ugL salinity_ppt
Min. :7.980 Min. : 59.0 Min. : 0.740 Min. :0.1625
1st Qu.:8.290 1st Qu.: 87.0 1st Qu.: 1.317 1st Qu.:0.3853
Median :8.480 Median : 96.0 Median : 1.665 Median :0.4200
Mean :8.506 Mean :103.1 Mean : 2.406 Mean :0.4003
3rd Qu.:8.688 3rd Qu.:107.0 3rd Qu.: 2.717 3rd Qu.:0.4300
Max. :9.340 Max. :271.0 Max. :10.300 Max. :0.4900
NA's :3
pco2_water_ppm bacteria_number_per_L total_nitrogen_mgL
Min. : 21.9 Min. : 2583 Min. : 0.156
1st Qu.: 108.9 1st Qu.: 18063 1st Qu.: 0.600
Median : 192.4 Median : 51006 Median : 0.940
Mean : 241.9 Mean : 70075 Mean : 1.333
3rd Qu.: 332.8 3rd Qu.: 89100 3rd Qu.: 1.625
Max. :1017.7 Max. :532016 Max. :15.300
NA's :6
total_phosphorus._mgL dissolved_organic_carbon_mgL
Min. :-0.00640 Min. :2.568
1st Qu.: 0.01158 1st Qu.:3.158
Median : 0.01880 Median :3.537
Mean : 0.02115 Mean :3.577
3rd Qu.: 0.02620 3rd Qu.:3.899
Max. : 0.19800 Max. :5.213
particulate_organic_nitrogen_mgL particulate_organic_carbon_mgL
Min. :0.00440 Min. :0.05262
1st Qu.:0.01880 1st Qu.:0.16528
Median :0.02720 Median :0.23582
Mean :0.03621 Mean :0.28978
3rd Qu.:0.04720 3rd Qu.:0.33362
Max. :0.17400 Max. :1.55655
NA's :3 NA's :3
co2_flux_mmol_m_day pco2_atmosphere_ppm zooplankton_community_avg_legnth_mm
Min. :-112.4010 Min. :392.8 Min. :0.2232
1st Qu.: -3.4505 1st Qu.:394.7 1st Qu.:0.4736
Median : -1.2790 Median :397.5 Median :0.6011
Mean : -3.2167 Mean :397.4 Mean :0.5845
3rd Qu.: -0.1205 3rd Qu.:399.9 3rd Qu.:0.7227
Max. : 9.2800 Max. :401.9 Max. :1.0829
NA's :19 NA's :4
zooplankton_community_biomass_mgL
Min. : 0.1157
1st Qu.: 2.7016
Median : 4.4441
Mean : 6.7327
3rd Qu.: 8.1071
Max. :35.5967
NA's :4
The equivalent summary statistics function from gtsummary is tbl_summary().
tbl_summary(data)
| Characteristic | N = 150¹ |
|---|---|
| year | |
| 2013 | 75 (50%) |
| 2014 | 75 (50%) |
| month | 6 (4, 9) |
| day | 15 (8, 22) |
| julian_date | 13,681 (13,261, 14,093) |
| sampling_date | 26 (13, 38) |
| lake | |
| Miramar | 50 (33%) |
| Murray | 50 (33%) |
| Poway | 50 (33%) |
| h2o_temp_c | 21.5 (17.9, 25.3) |
| Unknown | 3 |
| air_temp_c | 23.0 (19.6, 26.7) |
| wind_speed_kph | 6.4 (3.2, 8.9) |
| ph | 8.48 (8.29, 8.69) |
| total_alkalinity_ppm | 96 (87, 107) |
| chlorophyll_a_ugL | 1.67 (1.31, 2.72) |
| salinity_ppt | 0.42 (0.38, 0.43) |
| Unknown | 3 |
| pco2_water_ppm | 192 (109, 333) |
| bacteria_number_per_L | 51,007 (18,048, 89,100) |
| Unknown | 6 |
| total_nitrogen_mgL | 0.94 (0.60, 1.63) |
| total_phosphorus._mgL | 0.019 (0.011, 0.026) |
| dissolved_organic_carbon_mgL | 3.54 (3.16, 3.91) |
| particulate_organic_nitrogen_mgL | 0.027 (0.019, 0.047) |
| Unknown | 3 |
| particulate_organic_carbon_mgL | 0.24 (0.17, 0.33) |
| Unknown | 3 |
| co2_flux_mmol_m_day | -1.3 (-3.5, -0.1) |
| pco2_atmosphere_ppm | 397.50 (394.66, 400.78) |
| Unknown | 19 |
| zooplankton_community_avg_legnth_mm | 0.60 (0.47, 0.72) |
| Unknown | 4 |
| zooplankton_community_biomass_mgL | 4 (3, 8) |
| Unknown | 4 |
| ¹ n (%); Median (Q1, Q3) | |
The default summary statistic depends on the variable type. Categorical variables get counts and percentages, while continuous variables get the median (first quartile, third quartile). Any NA values are counted in a separate row labeled “Unknown”.
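To make the default continuous summary concrete, here is a minimal base R sketch that computes a median, quartiles, and a missing-value count by hand. It uses the Ozone column of the built-in airquality data set rather than the lake data, and it assumes base R's default quantile definition. gtsummary may use a different definition, so its quartiles can differ slightly from those of summary().

```r
# Toy input: the Ozone column of the built-in airquality data set,
# which contains missing values (this is not the lake data from the tutorial).
x <- airquality$Ozone

# Median and quartiles, ignoring NA values.
# Assumption: base R's default quantile definition; gtsummary may use
# another one, so the values it reports can differ slightly.
q <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)

# Number of missing values, the quantity tbl_summary() shows as "Unknown"
n_unknown <- sum(is.na(x))

# Assemble a "Median (Q1, Q3)" string in the style of the table above
summary_string <- sprintf("%.1f (%.1f, %.1f)", q[["50%"]], q[["25%"]], q[["75%"]])

summary_string
n_unknown
```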
These defaults are customizable. For instance, we can restrict the table to a subset of variables and report the median, minimum, and maximum for continuous variables:
data |>
  tbl_summary(
    include = c(lake, h2o_temp_c, chlorophyll_a_ugL),
    statistic = list(
      # change only how continuous variables are summarized
      all_continuous() ~ "{median} ({min}, {max})"
    )
  )
| Characteristic | N = 150¹ |
|---|---|
| lake | |
| Miramar | 50 (33%) |
| Murray | 50 (33%) |
| Poway | 50 (33%) |
| h2o_temp_c | 21.5 (13.9, 32.8) |
| Unknown | 3 |
| chlorophyll_a_ugL | 1.67 (0.74, 10.30) |
| ¹ n (%); Median (Min, Max) | |
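The same statistic argument accepts other summary placeholders. As a hedged sketch, the example below reports the mean and standard deviation and sets the number of decimal places with the digits argument. It uses the built-in mtcars data set as a stand-in for the lake data, so the variable names are illustrative only.

```r
library(gtsummary)

# Stand-in data: built-in mtcars (not the lake data used in this tutorial).
# mpg is continuous; cyl has only three distinct values, so tbl_summary()
# treats it as categorical by default.
tbl <- mtcars |>
  tbl_summary(
    include = c(mpg, cyl),
    statistic = list(
      # report mean (SD) instead of the default median (Q1, Q3)
      all_continuous() ~ "{mean} ({sd})"
    ),
    digits = list(
      # one decimal place for continuous summaries
      all_continuous() ~ 1
    )
  )

tbl
```

Categorical variables are left untouched here because the selector all_continuous() only matches continuous variables.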
We can also produce crosstab tables, for instance with descriptive statistics for selected variables for each lake.
data |>
  tbl_summary(
    by = lake,
    include = c(h2o_temp_c, chlorophyll_a_ugL),
    statistic = list(
      # change only how continuous variables are summarized
      all_continuous() ~ "{median} ({min}, {max})"
    )
  )
| Characteristic | Miramar N = 50¹ | Murray N = 50¹ | Poway N = 50¹ |
|---|---|---|---|
| h2o_temp_c | 21.4 (14.2, 32.8) | 21.7 (14.7, 30.8) | 21.4 (13.9, 29.9) |
| Unknown | 1 | 2 | 0 |
| chlorophyll_a_ugL | 1.47 (0.74, 4.63) | 3.85 (0.96, 10.30) | 1.48 (0.92, 2.77) |
| ¹ Median (Min, Max) | | | |
We can also customize the table with clearer labels and a different label for missing values:
data |>
  tbl_summary(
    by = lake,
    include = c(h2o_temp_c, chlorophyll_a_ugL),
    statistic = list(
      # change only how continuous variables are summarized
      all_continuous() ~ "{median} ({min}, {max})"
    ),
    label = list(
      h2o_temp_c = "Water Temperature (C)",
      chlorophyll_a_ugL = "Chlorophyll a (ugL)"
    ),
    missing_text = "Missing values"
  )
| Characteristic | Miramar N = 50¹ | Murray N = 50¹ | Poway N = 50¹ |
|---|---|---|---|
| Water Temperature (C) | 21.4 (14.2, 32.8) | 21.7 (14.7, 30.8) | 21.4 (13.9, 29.9) |
| Missing values | 1 | 2 | 0 |
| Chlorophyll a (ugL) | 1.47 (0.74, 4.63) | 3.85 (0.96, 10.30) | 1.48 (0.92, 2.77) |
| ¹ Median (Min, Max) | | | |
Statistical tests
If we wanted to run statistical analyses on this data, a usual approach would be to use a separate R package to run the test and then extract values of interest, like coefficients and p-values, to put into a table. gtsummary streamlines this process with an all-in-one approach to formatting statistical test outputs into a table.
There are many standard statistical tests integrated into gtsummary including t-test, ANOVA, chi-square, regression, survey sample methods, and more.
Let’s look at a couple of examples in practice.
Table with significance
Run a one-way ANOVA to test whether chlorophyll differs significantly among the three lakes. Note that oneway.test() does not assume equal variances by default:
oneway.test(chlorophyll_a_ugL ~ as.factor(lake), data = data)
One-way analysis of means (not assuming equal variances)
data: chlorophyll_a_ugL and as.factor(lake)
F = 30.207, num df = 2.000, denom df = 79.289, p-value = 1.769e-10
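Pulling these numbers into a table by hand might look like the following base R sketch. It uses the built-in PlantGrowth data set (plant weight for three treatment groups) as a stand-in for the lake data, and the column names of the resulting data frame are arbitrary choices for illustration.

```r
# Stand-in data: built-in PlantGrowth (weight measured in three groups),
# not the lake data used in this tutorial.
res <- oneway.test(weight ~ group, data = PlantGrowth)

# oneway.test() returns an "htest" object; extract the pieces of interest.
# The column names below are arbitrary, chosen only for this illustration.
manual_tbl <- data.frame(
  test      = res$method,
  statistic = unname(res$statistic),
  num_df    = unname(res$parameter[1]),
  denom_df  = unname(res$parameter[2]),
  p_value   = res$p.value
)

manual_tbl
```

Every additional test or variable means repeating this extraction, which is the bookkeeping that a single formatted table avoids.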
To report these results, we would have to copy the values into a table by hand. Alternatively, tbl_summary() combined with add_p() from gtsummary runs the same analysis and formats the results for us.
The tbl_summary() function calculates descriptive statistics for continuous, categorical, and dichotomous variables in R, and presents the results in a beautiful, customizable summary table ready for publication (for example, Table 1 or demographic tables).
Let’s run the same one-way ANOVA as above:
data |>
  # specify which variables we want in the table:
  # here, a row for chlorophyll with one column per lake
  tbl_summary(by = lake, include = c(chlorophyll_a_ugL)) |>
  # specify which test to run on which variable to obtain p-values:
  # here, a one-way ANOVA on chlorophyll
  add_p(test = chlorophyll_a_ugL ~ "oneway.test")
| Characteristic | Miramar N = 50¹ | Murray N = 50¹ | Poway N = 50¹ | p-value² |
|---|---|---|---|---|
| chlorophyll_a_ugL | 1.47 (1.17, 1.95) | 3.85 (2.18, 5.10) | 1.48 (1.21, 1.72) | <0.001 |
| ¹ Median (Q1, Q3) | | | | |
| ² One-way analysis of means (not assuming equal variances) | | | | |
Caveats
1) Structure tbl_summary() to match the test that add_p() will run.
Note that if we don’t include by = lake in tbl_summary(), add_p() won’t know which groups to compare for this ANOVA.
data |>
  # no by argument, so the data are not grouped into lake-specific columns
  tbl_summary(include = c(chlorophyll_a_ugL)) |>
  add_p(test = chlorophyll_a_ugL ~ "oneway.test")
Error in `add_p()`:
! Cannot run `add_p()` when `tbl_summary(by)` argument not included.
2) Default tests will vary based on data type.
If we don’t specify a test, add_p() chooses a default based on whether the variable is continuous or categorical, how many groups are being compared, and so on.
For example, if we don’t request "oneway.test" for chlorophyll, the comparison of this continuous variable across the three lakes defaults to a Kruskal-Wallis rank sum test.
The default test used in add_p() primarily depends on these factors:
- whether the variable is categorical/dichotomous vs continuous
- number of levels in the tbl_summary(by) variable
- whether the add_p(group) argument is specified
- whether the add_p(adj.vars) argument is specified
data |>
  tbl_summary(by = lake, include = c(chlorophyll_a_ugL)) |>
  add_p()
| Characteristic | Miramar N = 50¹ | Murray N = 50¹ | Poway N = 50¹ | p-value² |
|---|---|---|---|---|
| chlorophyll_a_ugL | 1.47 (1.17, 1.95) | 3.85 (2.18, 5.10) | 1.48 (1.21, 1.72) | <0.001 |
| ¹ Median (Q1, Q3) | | | | |
| ² Kruskal-Wallis rank sum test | | | | |
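Rather than relying on the defaults, tests can also be assigned by variable type. The sketch below is one possible way to do this, assuming the built-in mtcars data set as a stand-in for the lake data; the choice of a t-test and Fisher's exact test is illustrative, not a recommendation for any particular analysis.

```r
library(gtsummary)

# Stand-in data: built-in mtcars, grouped by transmission type (am, two groups).
# mpg is continuous; cyl is treated as categorical because it has few levels.
tbl <- mtcars |>
  tbl_summary(by = am, include = c(mpg, cyl)) |>
  add_p(
    test = list(
      # illustrative choices only: one test per variable type
      all_continuous()  ~ "t.test",
      all_categorical() ~ "fisher.test"
    )
  )

tbl
```

The footnotes of the resulting table name the test used for each row, which makes it easy to confirm that the intended test was run.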
Table with regression results
When running a regression, we typically want all of the model results (coefficients, confidence intervals, and p-values) formatted into a single table.
Predict chlorophyll from water temperature, nitrogen, phosphorus, and dissolved organic carbon:
# use lm() to create the model object
mod1 <- lm(
  chlorophyll_a_ugL ~ h2o_temp_c +
    total_nitrogen_mgL +
    total_phosphorus._mgL +
    dissolved_organic_carbon_mgL,
  data = data
)
# use summary() to view the results
summary(mod1)
Call:
lm(formula = chlorophyll_a_ugL ~ h2o_temp_c + total_nitrogen_mgL +
total_phosphorus._mgL + dissolved_organic_carbon_mgL, data = data)
Residuals:
Min 1Q Median 3Q Max
-3.1835 -0.9147 -0.1819 0.5542 5.9682
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.49966 0.87816 -2.846 0.00508 **
h2o_temp_c -0.08191 0.02806 -2.919 0.00408 **
total_nitrogen_mgL 0.10940 0.08108 1.349 0.17941
total_phosphorus._mgL -3.30033 5.50135 -0.600 0.54952
dissolved_organic_carbon_mgL 1.83202 0.22498 8.143 1.78e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.465 on 142 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.3223, Adjusted R-squared: 0.3032
F-statistic: 16.88 on 4 and 142 DF, p-value: 2.41e-11
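Turning output like this into a coefficient table by hand could look like the following base R sketch. It fits a small model on the built-in mtcars data set as a stand-in for the lake model, and the column names of the result are arbitrary choices for illustration.

```r
# Stand-in model: built-in mtcars data, not the lake model fitted above
mod <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficient matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(mod))

# 95% confidence intervals, one row per coefficient (same order as coefs)
ci <- confint(mod, level = 0.95)

# Combine into one data frame; column names are arbitrary illustrations
manual_tbl <- data.frame(
  term      = rownames(coefs),
  beta      = coefs[, "Estimate"],
  ci_low    = ci[, 1],
  ci_high   = ci[, 2],
  p_value   = coefs[, "Pr(>|t|)"],
  row.names = NULL
)

# Drop the intercept and round the numeric columns for display
manual_tbl <- manual_tbl[manual_tbl$term != "(Intercept)", ]
manual_tbl[, -1] <- round(manual_tbl[, -1], 3)

manual_tbl
```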
The equivalent function in gtsummary is tbl_regression(), which produces a formatted table of regression results.
tbl_regression(mod1)
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| h2o_temp_c | -0.08 | -0.14, -0.03 | 0.004 |
| total_nitrogen_mgL | 0.11 | -0.05, 0.27 | 0.2 |
| total_phosphorus._mgL | -3.3 | -14, 7.6 | 0.5 |
| dissolved_organic_carbon_mgL | 1.8 | 1.4, 2.3 | <0.001 |
| Abbreviation: CI = Confidence Interval | |||
This table is customizable, using a mix of gt and gtsummary functions for formatting the output table.
tbl_regression(
  mod1,
  # update variable labels
  label = list(
    h2o_temp_c = "Water Temperature (C)",
    total_nitrogen_mgL = "Total Nitrogen (mgL)",
    total_phosphorus._mgL = "Total Phosphorus (mgL)",
    dissolved_organic_carbon_mgL = "Dissolved Organic Carbon (mgL)"
  )
) |>
  # convert to a gt object so that gt functions can be used below
  as_gt() |>
  # add a source note; gt::md() renders the Markdown italics
  gt::tab_source_note(
    source_note = gt::md("Source: _Adamczyk EM, Shurin JB (2015) Seasonal Changes in Plankton Food Web Structure and Carbon Dioxide Flux from Southern California Reservoirs. PLoS ONE 10(10): e0140464. https://doi.org/10.1371/journal.pone.0140464_")
  )
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| Water Temperature (C) | -0.08 | -0.14, -0.03 | 0.004 |
| Total Nitrogen (mgL) | 0.11 | -0.05, 0.27 | 0.2 |
| Total Phosphorus (mgL) | -3.3 | -14, 7.6 | 0.5 |
| Dissolved Organic Carbon (mgL) | 1.8 | 1.4, 2.3 | <0.001 |
| Abbreviation: CI = Confidence Interval | |||
| Source: Adamczyk EM, Shurin JB (2015) Seasonal Changes in Plankton Food Web Structure and Carbon Dioxide Flux from Southern California Reservoirs. PLoS ONE 10(10): e0140464. https://doi.org/10.1371/journal.pone.0140464 | |||
Additional resources
- Primary website for gtsummary including how to cite use of the package.
- gtsummary reference manual with complete details on functions and arguments.
- Examples of gtsummary in practice from R Graph Gallery.