Beautiful Tables for Models

Introduction to gtsummary

Next, we’ll use the gtsummary package, an extension of gt that can incorporate statistical model outputs into publication-ready tables.

As with any R package, we first need to install gtsummary (once per machine), then load it with library() at the start of each session.

library(gtsummary)
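
The library() call above only loads a package that is already installed. A minimal sketch of the full install-then-load step, assuming the package comes from CRAN and that we only want to install it when it is missing:

```r
# Install gtsummary only if it is not already available (assumes CRAN access)
if (!requireNamespace("gtsummary", quietly = TRUE)) {
  install.packages("gtsummary")
}

# Load the package for the current session
library(gtsummary)
```

The guard keeps the script from reinstalling the package every time it is run.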

Descriptive statistics

The base R approach to summary statistics, summary(), gives a thorough table for a data frame, but one that is not formatted for publication.

summary(data)
      year          month            day         julian_date    sampling_date 
 Min.   :2013   Min.   : 1.00   Min.   : 1.00   Min.   :13177   Min.   : 1.0  
 1st Qu.:2013   1st Qu.: 4.00   1st Qu.: 8.00   1st Qu.:13262   1st Qu.:13.0  
 Median :2014   Median : 6.00   Median :15.00   Median :13680   Median :25.5  
 Mean   :2014   Mean   : 6.36   Mean   :14.71   Mean   :13693   Mean   :25.5  
 3rd Qu.:2014   3rd Qu.: 9.00   3rd Qu.:22.00   3rd Qu.:14093   3rd Qu.:38.0  
 Max.   :2014   Max.   :12.00   Max.   :31.00   Max.   :14956   Max.   :50.0  
                                                                              
     lake             h2o_temp_c      air_temp_c    wind_speed_kph  
 Length:150         Min.   :13.90   Min.   :12.50   Min.   : 0.000  
 Class :character   1st Qu.:17.95   1st Qu.:19.62   1st Qu.: 3.299  
 Mode  :character   Median :21.50   Median :23.02   Median : 6.437  
                    Mean   :21.63   Mean   :23.19   Mean   : 7.170  
                    3rd Qu.:25.25   3rd Qu.:26.69   3rd Qu.: 8.851  
                    Max.   :32.78   Max.   :40.67   Max.   :38.624  
                    NA's   :3                                       
       ph        total_alkalinity_ppm chlorophyll_a_ugL  salinity_ppt   
 Min.   :7.980   Min.   : 59.0        Min.   : 0.740    Min.   :0.1625  
 1st Qu.:8.290   1st Qu.: 87.0        1st Qu.: 1.317    1st Qu.:0.3853  
 Median :8.480   Median : 96.0        Median : 1.665    Median :0.4200  
 Mean   :8.506   Mean   :103.1        Mean   : 2.406    Mean   :0.4003  
 3rd Qu.:8.688   3rd Qu.:107.0        3rd Qu.: 2.717    3rd Qu.:0.4300  
 Max.   :9.340   Max.   :271.0        Max.   :10.300    Max.   :0.4900  
                                                        NA's   :3       
 pco2_water_ppm   bacteria_number_per_L total_nitrogen_mgL
 Min.   :  21.9   Min.   :  2583        Min.   : 0.156    
 1st Qu.: 108.9   1st Qu.: 18063        1st Qu.: 0.600    
 Median : 192.4   Median : 51006        Median : 0.940    
 Mean   : 241.9   Mean   : 70075        Mean   : 1.333    
 3rd Qu.: 332.8   3rd Qu.: 89100        3rd Qu.: 1.625    
 Max.   :1017.7   Max.   :532016        Max.   :15.300    
                  NA's   :6                               
 total_phosphorus._mgL dissolved_organic_carbon_mgL
 Min.   :-0.00640      Min.   :2.568               
 1st Qu.: 0.01158      1st Qu.:3.158               
 Median : 0.01880      Median :3.537               
 Mean   : 0.02115      Mean   :3.577               
 3rd Qu.: 0.02620      3rd Qu.:3.899               
 Max.   : 0.19800      Max.   :5.213               
                                                   
 particulate_organic_nitrogen_mgL particulate_organic_carbon_mgL
 Min.   :0.00440                  Min.   :0.05262               
 1st Qu.:0.01880                  1st Qu.:0.16528               
 Median :0.02720                  Median :0.23582               
 Mean   :0.03621                  Mean   :0.28978               
 3rd Qu.:0.04720                  3rd Qu.:0.33362               
 Max.   :0.17400                  Max.   :1.55655               
 NA's   :3                        NA's   :3                     
 co2_flux_mmol_m_day pco2_atmosphere_ppm zooplankton_community_avg_legnth_mm
 Min.   :-112.4010   Min.   :392.8       Min.   :0.2232                     
 1st Qu.:  -3.4505   1st Qu.:394.7       1st Qu.:0.4736                     
 Median :  -1.2790   Median :397.5       Median :0.6011                     
 Mean   :  -3.2167   Mean   :397.4       Mean   :0.5845                     
 3rd Qu.:  -0.1205   3rd Qu.:399.9       3rd Qu.:0.7227                     
 Max.   :   9.2800   Max.   :401.9       Max.   :1.0829                     
                     NA's   :19          NA's   :4                          
 zooplankton_community_biomass_mgL
 Min.   : 0.1157                  
 1st Qu.: 2.7016                  
 Median : 4.4441                  
 Mean   : 6.7327                  
 3rd Qu.: 8.1071                  
 Max.   :35.5967                  
 NA's   :4                        

The equivalent summary statistics function from gtsummary is tbl_summary().

tbl_summary(data)
Characteristic                        N = 150¹
year
    2013                              75 (50%)
    2014                              75 (50%)
month                                 6 (4, 9)
day                                   15 (8, 22)
julian_date                           13,681 (13,261, 14,093)
sampling_date                         26 (13, 38)
lake
    Miramar                           50 (33%)
    Murray                            50 (33%)
    Poway                             50 (33%)
h2o_temp_c                            21.5 (17.9, 25.3)
    Unknown                           3
air_temp_c                            23.0 (19.6, 26.7)
wind_speed_kph                        6.4 (3.2, 8.9)
ph                                    8.48 (8.29, 8.69)
total_alkalinity_ppm                  96 (87, 107)
chlorophyll_a_ugL                     1.67 (1.31, 2.72)
salinity_ppt                          0.42 (0.38, 0.43)
    Unknown                           3
pco2_water_ppm                        192 (109, 333)
bacteria_number_per_L                 51,007 (18,048, 89,100)
    Unknown                           6
total_nitrogen_mgL                    0.94 (0.60, 1.63)
total_phosphorus._mgL                 0.019 (0.011, 0.026)
dissolved_organic_carbon_mgL          3.54 (3.16, 3.91)
particulate_organic_nitrogen_mgL      0.027 (0.019, 0.047)
    Unknown                           3
particulate_organic_carbon_mgL        0.24 (0.17, 0.33)
    Unknown                           3
co2_flux_mmol_m_day                   -1.3 (-3.5, -0.1)
pco2_atmosphere_ppm                   397.50 (394.66, 400.78)
    Unknown                           19
zooplankton_community_avg_legnth_mm   0.60 (0.47, 0.72)
    Unknown                           4
zooplankton_community_biomass_mgL     4 (3, 8)
    Unknown                           4
¹ n (%); Median (Q1, Q3)

The default summary statistic depends on variable type. Categorical variables get counts and percentages, while continuous variables get the median (first quartile, third quartile). Any NA values are counted in a separate row labeled “Unknown”.
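
Note that gtsummary guesses the type: a numeric column with only a few distinct values is summarized as categorical, which is why year is shown with counts above. A minimal sketch of overriding that guess with the type argument, using the built-in mtcars data as a stand-in (it is not the lake data from this lesson):

```r
library(gtsummary)

# mtcars is a built-in stand-in dataset, not the lake data used in this lesson.
# cyl has only three distinct values, so by default it is summarized as categorical.
tbl_default <- mtcars |>
  tbl_summary(include = c(mpg, cyl))

# Ask for cyl to be summarized as a continuous variable instead
tbl_typed <- mtcars |>
  tbl_summary(
    include = c(mpg, cyl),
    type = list(cyl ~ "continuous")
  )

tbl_typed
```

The same pattern should apply to a column such as month in the lake data, if we wanted to change how it is summarized.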

These defaults are customizable. For instance, we can include only a subset of variables and report the median, minimum, and maximum for continuous variables:

data |>
  tbl_summary(
    include = c(lake, h2o_temp_c, chlorophyll_a_ugL),
    statistic = list(
      # change only how continuous variables are presented
      all_continuous() ~ "{median} ({min}, {max})"
    )
  )
Characteristic      N = 150¹
lake
    Miramar         50 (33%)
    Murray          50 (33%)
    Poway           50 (33%)
h2o_temp_c          21.5 (13.9, 32.8)
    Unknown         3
chlorophyll_a_ugL   1.67 (0.74, 10.30)
¹ n (%); Median (Min, Max)
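
The statistic template accepts other summary names as well. As a sketch, again using the built-in mtcars data as a stand-in, here is a mean (standard deviation) layout with the number of decimal places set through the digits argument:

```r
library(gtsummary)

# mtcars is a built-in stand-in dataset, not the lake data used in this lesson
tbl_mean_sd <- mtcars |>
  tbl_summary(
    include = c(mpg, hp),
    # report mean (SD) instead of the default median (Q1, Q3)
    statistic = list(all_continuous() ~ "{mean} ({sd})"),
    # round every continuous statistic to one decimal place
    digits = list(all_continuous() ~ 1)
  )

tbl_mean_sd
```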

We can also produce crosstab tables, for instance with descriptive statistics for selected variables for each lake.

data |> 
  tbl_summary(
    by = lake,
    include = c(h2o_temp_c, chlorophyll_a_ugL),
    statistic = list(
      # change only how continuous variables are presented
      all_continuous() ~ "{median} ({min}, {max})"
    )
  )
Characteristic      Miramar, N = 50¹     Murray, N = 50¹      Poway, N = 50¹
h2o_temp_c          21.4 (14.2, 32.8)    21.7 (14.7, 30.8)    21.4 (13.9, 29.9)
    Unknown         1                    2                    0
chlorophyll_a_ugL   1.47 (0.74, 4.63)    3.85 (0.96, 10.30)   1.48 (0.92, 2.77)
¹ Median (Min, Max)
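
A grouped table shows only the per-group columns. If we also want a column for the whole sample, gtsummary provides add_overall(), and add_n() adds a count of non-missing observations. A minimal sketch on the built-in mtcars data (a stand-in, grouped by the transmission column am rather than by lake):

```r
library(gtsummary)

# mtcars is a built-in stand-in dataset, not the lake data used in this lesson
tbl_by <- mtcars |>
  tbl_summary(by = am, include = c(mpg, hp)) |>
  # add a column summarizing all rows together
  add_overall() |>
  # add a column with the number of non-missing observations per variable
  add_n()

tbl_by
```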

We can also customize the table with clearer labels and a different label for the missing-value row:

data |> 
  tbl_summary(
    by = lake,
    include = c(h2o_temp_c, chlorophyll_a_ugL),
    statistic = list(
      # change only how continuous variables are presented
      all_continuous() ~ "{median} ({min}, {max})"
    ),
    label = list(
      h2o_temp_c = "Water Temperature (C)",
      chlorophyll_a_ugL = "Chlorophyll a (ugL)"
    ),
    missing_text = "Missing values"
  )
Characteristic          Miramar, N = 50¹     Murray, N = 50¹      Poway, N = 50¹
Water Temperature (C)   21.4 (14.2, 32.8)    21.7 (14.7, 30.8)    21.4 (13.9, 29.9)
    Missing values      1                    2                    0
Chlorophyll a (ugL)     1.47 (0.74, 4.63)    3.85 (0.96, 10.30)   1.48 (0.92, 2.77)
¹ Median (Min, Max)
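
Beyond the arguments of tbl_summary() itself, gtsummary has helper functions for styling the finished table. As a sketch on the built-in mtcars data (a stand-in for the lake data), modify_header() renames a column header and bold_labels() bolds the variable labels:

```r
library(gtsummary)

# mtcars is a built-in stand-in dataset, not the lake data used in this lesson
tbl_styled <- mtcars |>
  tbl_summary(
    by = am,
    include = c(mpg, hp),
    label = list(mpg ~ "Miles per gallon", hp ~ "Horsepower")
  ) |>
  # rename the first column header (markdown syntax is interpreted)
  modify_header(label = "**Variable**") |>
  # bold the variable label rows
  bold_labels()

tbl_styled
```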

Statistical tests

If we wanted to run statistical analyses on these data, a common approach would be to run the test with a separate function or package and then extract the values of interest, like coefficients and p-values, to put into a table. gtsummary streamlines this process with an all-in-one approach to formatting statistical test outputs into a table.

There are many standard statistical tests integrated into gtsummary including t-test, ANOVA, chi-square, regression, survey sample methods, and more.

Let’s look at a couple of examples in practice.

Table with significance

Run a one-way ANOVA to test whether chlorophyll a differs significantly among the three lakes. Note that oneway.test() does not assume equal variances by default, so this is Welch's version of the test:

oneway.test(chlorophyll_a_ugL ~ as.factor(lake), data = data)

    One-way analysis of means (not assuming equal variances)

data:  chlorophyll_a_ugL and as.factor(lake)
F = 30.207, num df = 2.000, denom df = 79.289, p-value = 1.769e-10

If we wanted to report these results, we’d have to extract the values and arrange them into a table by hand. Alternatively, we can run the same analysis with tbl_summary() and add_p() from gtsummary, which build the table for us.
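
To make the manual route concrete, here is a sketch of pulling the pieces of a oneway.test() result into a one-row data frame in base R. It uses the built-in iris data as a stand-in, and the column names of anova_row are our own choice:

```r
# iris is a built-in stand-in dataset, not the lake data used in this lesson
res <- oneway.test(Sepal.Length ~ Species, data = iris)

# Pull the pieces of the test result into a one-row data frame by hand
anova_row <- data.frame(
  test     = res$method,
  f_stat   = unname(res$statistic),
  num_df   = unname(res$parameter["num df"]),
  denom_df = unname(res$parameter["denom df"]),
  p_value  = res$p.value
)

anova_row
```

Every extra test or variable means repeating this by hand, and the result still needs formatting before it is ready for publication.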

The tbl_summary() function calculates descriptive statistics for continuous, categorical, and dichotomous variables in R, and presents the results in a beautiful, customizable summary table ready for publication (for example, Table 1 or demographic tables).

Let’s run the same one-way ANOVA as above:

data |> 
  # specify which variables go in the table: a row for chlorophyll, with one column per lake
  tbl_summary(by = lake, include = c(chlorophyll_a_ugL)) |>
  # specify which test to run on which variable to get the p-value
  # here, a one-way ANOVA on chlorophyll
  add_p(test = chlorophyll_a_ugL ~ "oneway.test")
Characteristic      Miramar, N = 50¹     Murray, N = 50¹      Poway, N = 50¹       p-value²
chlorophyll_a_ugL   1.47 (1.17, 1.95)    3.85 (2.18, 5.10)    1.48 (1.21, 1.72)    <0.001
¹ Median (Q1, Q3)
² One-way analysis of means (not assuming equal variances)
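
As the footnote says, "oneway.test" does not assume equal variances. If the classical equal-variance ANOVA is wanted instead, gtsummary also lists an "aov" test. A minimal sketch on the built-in iris data (a stand-in for the lake data); depending on the installed version, gtsummary may ask for additional helper packages the first time a test is run:

```r
library(gtsummary)

# iris is a built-in stand-in dataset, not the lake data used in this lesson
tbl_aov <- iris |>
  tbl_summary(by = Species, include = Sepal.Length) |>
  # classical one-way ANOVA, which assumes equal variances across groups
  add_p(test = Sepal.Length ~ "aov")

tbl_aov
```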

Caveats

1) Structure tbl_summary() to match the test that add_p() will run.

Note that if we don’t include by = lake in tbl_summary(), add_p() won’t know which groups to compare for this ANOVA.

data |> 
  #no specification that we are grouping our data into lake-specific columns
  tbl_summary(include = c(chlorophyll_a_ugL)) |>
  add_p(test = chlorophyll_a_ugL ~ "oneway.test")
Error in `add_p()`:
! Cannot run `add_p()` when `tbl_summary(by)` argument not included.

2) Default tests will vary based on data type.

If we don’t specify a test, add_p() uses a default, which depends on whether the data are continuous or categorical, how many groups there are, and so on.

For our chlorophyll comparison, leaving out the ANOVA specification means the analysis defaults to a Kruskal-Wallis rank sum test.

The default test used in add_p() primarily depends on these factors:

  • whether the variable is categorical/dichotomous vs continuous
  • number of levels in the tbl_summary(by) variable
  • whether the add_p(group) argument is specified
  • whether the add_p(adj.vars) argument is specified

data |> 
  tbl_summary(by = lake, include = c(chlorophyll_a_ugL)) |>
  add_p()
  add_p()
Characteristic      Miramar, N = 50¹     Murray, N = 50¹      Poway, N = 50¹       p-value²
chlorophyll_a_ugL   1.47 (1.17, 1.95)    3.85 (2.18, 5.10)    1.48 (1.21, 1.72)    <0.001
¹ Median (Q1, Q3)
² Kruskal-Wallis rank sum test
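
To see the defaults vary by variable type, we can put a continuous and a categorical variable in the same table. The sketch below uses the built-in mtcars data as a stand-in, then overrides the test for just one variable. The footnotes of the printed tables report which test was used for each row:

```r
library(gtsummary)

# mtcars is a built-in stand-in dataset, not the lake data used in this lesson.
# mpg is continuous; cyl has three distinct values and is treated as categorical.
tbl_tests <- mtcars |>
  tbl_summary(by = am, include = c(mpg, cyl)) |>
  # no test specified: add_p() picks a default for each variable
  add_p()

# Override the default for mpg only; cyl keeps its default test
tbl_override <- mtcars |>
  tbl_summary(by = am, include = c(mpg, cyl)) |>
  add_p(test = list(mpg ~ "t.test"))

tbl_tests
tbl_override
```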

Table with regression results

When running a regression, we usually want the key model results (coefficients, confidence intervals, and p-values) formatted into a table.

Predict chlorophyll from water temperature, nitrogen, phosphorus, and dissolved organic carbon:

#use lm() function to create model object
mod1 <- lm(chlorophyll_a_ugL ~ h2o_temp_c + 
                              total_nitrogen_mgL + 
                              total_phosphorus._mgL + 
                              dissolved_organic_carbon_mgL, 
           data = data)
#use summary() to view result
summary(mod1)

Call:
lm(formula = chlorophyll_a_ugL ~ h2o_temp_c + total_nitrogen_mgL + 
    total_phosphorus._mgL + dissolved_organic_carbon_mgL, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1835 -0.9147 -0.1819  0.5542  5.9682 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -2.49966    0.87816  -2.846  0.00508 ** 
h2o_temp_c                   -0.08191    0.02806  -2.919  0.00408 ** 
total_nitrogen_mgL            0.10940    0.08108   1.349  0.17941    
total_phosphorus._mgL        -3.30033    5.50135  -0.600  0.54952    
dissolved_organic_carbon_mgL  1.83202    0.22498   8.143 1.78e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.465 on 142 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-squared:  0.3223,    Adjusted R-squared:  0.3032 
F-statistic: 16.88 on 4 and 142 DF,  p-value: 2.41e-11

The equivalent function in gtsummary is tbl_regression(), which produces a formatted table of regression results.

tbl_regression(mod1)
Characteristic                 Beta    95% CI         p-value
h2o_temp_c                     -0.08   -0.14, -0.03   0.004
total_nitrogen_mgL             0.11    -0.05, 0.27    0.2
total_phosphorus._mgL          -3.3    -14, 7.6       0.5
dissolved_organic_carbon_mgL   1.8     1.4, 2.3       <0.001
Abbreviation: CI = Confidence Interval
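
For comparison, assembling the same columns by hand in base R might look like the sketch below. It fits a small model on the built-in mtcars data as a stand-in, and the column names of manual_tbl are our own choice:

```r
# mtcars is a built-in stand-in dataset, not the lake data used in this lesson
mod <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficient matrix (estimate, std. error, t value, p-value) and 95% CIs
coefs <- coef(summary(mod))
ci <- confint(mod, level = 0.95)

# Combine the pieces into one data frame
manual_tbl <- data.frame(
  term    = rownames(coefs),
  beta    = coefs[, "Estimate"],
  ci_low  = ci[, 1],
  ci_high = ci[, 2],
  p_value = coefs[, "Pr(>|t|)"],
  row.names = NULL
)

# Drop the intercept row, which tbl_regression() also hides by default
manual_tbl <- manual_tbl[manual_tbl$term != "(Intercept)", ]

manual_tbl
```

This gives the numbers, but rounding, labels, and layout would still have to be handled separately.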

This table is customizable with a mix of gtsummary and gt functions. Converting the table with as_gt() makes the gt formatting functions available.

tbl_regression(mod1,
               #update variable names
               label = list(h2o_temp_c = "Water Temperature (C)", 
                            total_nitrogen_mgL = "Total Nitrogen (mgL)", 
                            total_phosphorus._mgL = "Total Phosphorus (mgL)", 
                            dissolved_organic_carbon_mgL = "Dissolved Organic Carbon (mgL)")) |>
  # convert to a gt object in order to use gt functions below
  as_gt() |>
  # add source note; gt::md() interprets the markdown italics
  gt::tab_source_note(
    source_note = gt::md("Source: _Adamczyk EM, Shurin JB (2015) Seasonal Changes in Plankton Food Web Structure and Carbon Dioxide Flux from Southern California Reservoirs. PLoS ONE 10(10): e0140464. https://doi.org/10.1371/journal.pone.0140464_")
  )
Characteristic                   Beta    95% CI         p-value
Water Temperature (C)            -0.08   -0.14, -0.03   0.004
Total Nitrogen (mgL)             0.11    -0.05, 0.27    0.2
Total Phosphorus (mgL)           -3.3    -14, 7.6       0.5
Dissolved Organic Carbon (mgL)   1.8     1.4, 2.3       <0.001
Abbreviation: CI = Confidence Interval
Source: Adamczyk EM, Shurin JB (2015) Seasonal Changes in Plankton Food Web Structure and Carbon Dioxide Flux from Southern California Reservoirs. PLoS ONE 10(10): e0140464. https://doi.org/10.1371/journal.pone.0140464
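
Some customization can also be done before converting to gt. The sketch below, on the built-in mtcars data as a stand-in, keeps the intercept row, bolds p-values below 0.05, and reports the model R-squared values in a source note. It assumes the helper packages that tbl_regression() relies on (such as broom) are installed:

```r
library(gtsummary)

# mtcars is a built-in stand-in dataset, not the lake data used in this lesson
mod <- lm(mpg ~ wt + hp, data = mtcars)

tbl_mod <- tbl_regression(mod, intercept = TRUE) |>
  # bold p-values below the 0.05 threshold
  bold_p(t = 0.05) |>
  # append model-level statistics as a source note
  add_glance_source_note(include = c(r.squared, adj.r.squared))

tbl_mod
```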

Additional resources