Using R for Everything: ggplot2 可视化 RDA 结果

关于 RDA(Redundancy analysis 冗余分析)是什么,相信对于已经有可视化需求的同学来说已经不用更多的解释了。

在 R 中常用来进行 RDA 分析和绘制工作的是 veganggvegan 这两个包。然而,在实际使用中,最常遇到的问题是虽然这些包内建的 plot 等功能可以绘制出基本可用的包,但想要进一步的定制图形却没那么容易。

想要绘制出一副自己满意、编辑满意、导师满意最主要的是审稿人满意的 RDA 结果,作为最强可视化工具之一的 ggplot2 包毋庸置疑是最佳的选择。

我们需要什么样的 RDA 图

首先,我们来思考我们需要什么样的 RDA 图?按照世纪需求以及审稿人的建议:

I would recommend showing in bold the variables with significant correlations

笔者最后的目标是绘制一幅:

1. 显示物种信息(实际上是响应变量矩阵);
2. 环境变量(实际上是解释变量矩阵);
3. 在两轴上能显示各自的解释度;
4. 标记有显著性的解释变量。

以笔者对 plot.rda 以及 autoplot.rda 这两个 vegan 和 ggvegan 内建函数的浅薄了解,似乎很难完成。

如何「标记有显著性的解释变量」?

如果进行过 RDA 分析不难发现,使用 vegan 内建的 rda 是没有标记解释变量的显著性的。其实这种显著性需要 vegan 内建的 envfit 函数来得到一个及其近似的结果。通常这个方法在论文中会写作:

Monte Carlo permutation (999 permutations) was used to identify axes with significant eigenvalues and species-environment correlations.

由于是进行了 999 次(默认参数可以修改)的蒙特卡洛抽样,因此这个结果是及其近似的结果,直接用作 RDA 中解释变量的显著性是没有问题的。然而,envfit 输出的结果并非标准的 data.frame 或者类似的结果,无法方便的输出或者进行分析,后续我们还会进行结果提取的步骤,暂且按下不表。

RDA 和 ENVFIT 分析

此例,我们使用随机抽取的 150 条土壤线虫群落群落数据以及对应的环境数据,由于不影响后续的理解以及版权考量,对各参数名不再解释。

每一步的操作以及原因以注释的形式呈现。

library('tidyverse')

# read Environmental Variables
df_env <- read_csv('df_env_smp.csv')
# read Community composition matrix
df_com <- read_csv('df_com_smp.csv')

# View the structure of data
head(df_env)
head(df_com)
Parsed with column specification:
cols(
  pH = col_double(),
  MOI = col_double(),
  TN = col_double(),
  TP = col_double(),
  AP = col_double(),
  NH4 = col_double(),
  NO3 = col_double(),
  SOC = col_double(),
  MAT = col_double(),
  MAP = col_double(),
  PBM = col_double(),
  PCV = col_double(),
  PSR = col_double()
)

Parsed with column specification:
cols(
  pp = col_double(),
  ba = col_double(),
  fu = col_double(),
  pr = col_double(),
  om = col_double()
)
A tibble: 6 × 13
pHMOITNTPAPNH4NO3SOCMATMAPPBMPCVPSR
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
-1.6396547-0.1610384-0.1162406-0.44892589 0.17966965 0.4471754-0.5148515-0.292278803 0.8420221 0.02487222 0.17638650 0.426595458 0.2275781
0.9429126-0.5549069-0.7919730 0.02952209-0.78701735-0.3426843-0.4197756-0.697789311-1.5519670 0.27013035-0.02465146 0.008333099-0.8352947
-1.3387571-0.2762522 0.1998817-0.14900328-0.10273314 0.7179844-0.4922144 0.461059956-0.8480169 1.21243154 0.73829039-0.793336422 0.8349340
-1.2035947-0.1658411-0.2600847-0.45606691 0.09277663-0.1328073-0.1677489-0.007572046 0.7833595-0.87351309-0.80968336-0.514494850-1.2908116
-1.1762407 1.0043821 0.2918750-0.47748995 2.24338416 0.7924569-0.5827629 0.035809220-0.2893264-0.22471654 0.12645186 0.182609082 0.9867730
0.4263991 0.5467350-0.3570957-0.12043922 0.54896611-0.3652517-0.5540892-0.360483891-0.8480169 1.21243154 2.11789718-0.026522097 0.6830950
A tibble: 6 × 5
ppbafuprom
<dbl><dbl><dbl><dbl><dbl>
2.55000028.010000 2.5500002.55000015.280000
1.311046 7.316707 1.3378841.337884 5.953381
8.25000013.255000 3.3550000.00000014.850000
102.49000018.51666720.9466670.00000023.500000
21.35000052.72000014.8466670.000000 5.623333
2.490000 4.965000 1.2450000.000000 2.475000
library('vegan')

# Performing RDA and viewing the results
res_rda <- rda(df_com, df_env)
res_rda

# Performing ENVFIT
res_envfit <- envfit(df_com, df_env)

# However, the result of ENVFIT is not in the data.frame format, we should extract useful information from it.
res_envfit
Call: rda(X = df_com, Y = df_env)

                Inertia Proportion Rank
Total         2031.7341     1.0000     
Constrained    895.7317     0.4409    5
Unconstrained 1136.0024     0.5591    5
Inertia is variance 

Eigenvalues for constrained axes:
 RDA1  RDA2  RDA3  RDA4  RDA5 
754.0  88.3  45.4   6.2   1.8 

Eigenvalues for unconstrained axes:
  PC1   PC2   PC3   PC4   PC5 
747.4 235.0  96.3  44.7  12.6 





***VECTORS

          pp       ba     r2 Pr(>r)    
pH  -0.94935  0.31421 0.0809  0.003 ** 
MOI  0.79143  0.61126 0.2636  0.001 ***
TN   0.91399  0.40574 0.0887  0.004 ** 
TP   0.86358 -0.50421 0.0002  0.984    
AP   0.99269  0.12071 0.0575  0.026 *  
NH4  0.94495 -0.32722 0.0462  0.033 *  
NO3  0.52786  0.84933 0.1430  0.002 ** 
SOC  0.81629  0.57764 0.1440  0.001 ***
MAT  0.10537 -0.99443 0.0118  0.405    
MAP -0.18615 -0.98252 0.0073  0.579    
PBM  0.75360  0.65733 0.0096  0.508    
PCV  0.57508  0.81810 0.0380  0.072 .  
PSR  0.88521 -0.46519 0.0062  0.639    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Permutation: free
Number of permutations: 999

正如我们在注释提到的,ENVFIT 并不是以 data.frame 的形式提供数据,因此我们还需要通过一些提取的操作才能获得与结果相同的表格。由于这类操作经常使用,因此我们将其包装成 function 并命名为 envfit_to_df

# Here, env_obj indicates the result of envfit. In this case, it's the res_envfit.
# r2_dig is the significant figure of R2
# p_dig is the significant figure of p value
envfit_to_df <- function(env_obj,
                         r2_dig = 6,
                         p_dig = 3) {
  r2_fmt <- as.character(paste('%.', r2_dig, 'f', sep = ''))
  p_fmt <- as.character(paste('%.', p_dig, 'f', sep = ''))
  tibble(
      # the name of explainary variables
    factor = names(env_obj$vectors$r),
      # list or vector of R2
    r2 = env_obj$vectors$r,
      # list or vector of p values
    pvals = env_obj$vectors$pvals
  ) %>%
    # generate significant levels by p values
    mutate(sig = case_when(
      pvals <= 0.001 ~ '***',
      pvals <= 0.01 ~ '**',
      pvals <= 0.05 ~ '*',
      TRUE ~ ' '
    )) %>%
    # format the significant figure by format definition before.
    mutate(pvals = sprintf('%.3f', pvals),
           r2 = sprintf(r2_fmt, r2))
}
# Convert result of ENVFIT to data.frame, in fact, it's a tibble.
df_env <- envfit_to_df(res_envfit, r2_dig = 3)
df_env
A tibble: 13 × 4
factorr2pvalssig
<chr><chr><chr><chr>
pH 0.0810.003**
MOI0.2640.001***
TN 0.0890.004**
TP 0.0000.984
AP 0.0580.026*
NH40.0460.033*
NO30.1430.002**
SOC0.1440.001***
MAT0.0120.405
MAP0.0070.579
PBM0.0100.508
PCV0.0380.072
PSR0.0060.639

截止到目前,我们已经准备完成了 RDA 分析以及 ENVFIT 分析,并将数据转换成为了满足可视化需求的格式。

接下来就是需要进行可视化作业。此例,笔者仅需要显示环境变量与物种信息,不需要显示样地信息,因此绘制中仅保留了所需的信息,如需显示样地信息,请按实际需求更改。

绘制 RDA 图形是常用的操作,因此同样将它包装成为 function 并命名为 ggRDA

# Here, rda_obj means the object which is from vegan::rda
# sp_size means the text size of species
# arrow_txt_size means the environmet variable names at the end of the arrow
# Because not every RDA plot needs indicate significant correlations, the envfit_df is optional here.
ggRDA <- function(rda_obj,sp_size = 4,arrow_txt_size = 4, envfit_df) {
    # ggplot doesn't support rda object directly, we use ggvegan::fortify function to convert the rda to data.frame
    fmod <- fortify(rda_obj)
    # to get the arrow of biplot, we plot rda by vegan::plot.rda function firstly. 
    # The arrow attributes contain in the attributes(plot_obejct$biplot)$arrow.mul
    basplot <- plot(rda_obj)
    mult <- attributes(basplot$biplot)$arrow.mul
  
    # To check if envfit_df exists or not
    # If envfit_df exists, join the fortified rda_obj and envfit to mark which variable is significant.
    if(missingArg(envfit_df)){
        bplt_df <- filter(fmod, Score == "biplot") %>%
        # If there is no requirement to mark significant variable
        # use the sytle of sinificant (black bolder solid arrow)
        # to paint the arrow
        mutate(bold = 'sig')
    } else {
        bplt_df <- filter(fmod, Score == "biplot") %>%
        left_join(envfit_df, by = c('Label' = 'factor')) %>%
        # To mark the significant variables as sig, not significant variables as ns
        # these information are stored in bold column
        mutate(bold = ifelse(str_detect(sig, fixed('*')), 'sig', 'ns'))
    }
  ggplot(fmod, aes(x = RDA1, y = RDA2)) +
    coord_fixed() +
    geom_segment(data = bplt_df,
                 # mult and RDA1/RDA2 are from fortified RDA data.frame
                 # they contain the direction and effects of every variabl
                 # their products are the direction and length of arrows
                 aes(x = 0, xend = mult * RDA1,
                     y = 0, yend = mult * RDA2,
                     # Use different arrow size to indicate the significant level
                     size = bold,
                     # Use different arrow color to indicate the significant level
                     color = bold,
                     # Use different arrow linetype to indicate the significant level
                     linetype = bold),
                 #############################
                 # Q:Why use three different attibution to control the significant levels?
                     # It is redundancy, isn't it?
                 # A: In fact, it's not easy to recongize the significant level by one kind attribution
                     # Becasue it is not delicate to indicate it with supper bold and supper thin arrow,
                     # by the same logic, high contrast colors are not delicate neither.
                     # As for the line type, some arrow are really short, it's not easy to recognize
                     # weather it is solid or dashed line at all.
                     # To sum up, we use three different attributions
                     # to indicate the same difference to avoid any misleading.
                 #############################
                 # to control the size of the header of arrow
                 arrow = arrow(length = unit(0.25, "cm")), 
                ) +
    # Add the text of variable name at the end of arrow
    geom_text(data =  subset(fmod, Score == "biplot"),
              aes(x = (mult + mult/10) * RDA1,
                   #we add 10% to the text to push it slightly out from arrows
                  y = (mult + mult/10) * RDA2,
                  label = Label),
              size = arrow_txt_size,
               #otherwise you could use hjust and vjust. I prefer this option
              hjust = 0.5) +
    # Add the text of species
    geom_text(
      data = subset(fmod, Score == "species"),
      aes(colour = "species",label = Label),
      size = sp_size
    )
}

虽然使用 ggRDA 可以直接绘制图形,但通常为了美观,对于特定的参数还需要进一步的调整。

注意,由于在 ggRDA 中使用了 vegan::plot.rda 绘制图像,所以在下面的调用 ggRDA 会首先绘制一次简易的 rda 图像之后,再显示出 ggplot2 绘制的图形,不影响后续的输出保存。

library(ggvegan)
# Get the amount of explanation by each axis
# gernerally, we choose the first two axes.
exp_by_x <- (as.list(res_rda$CCA$eig)$RDA1)/(res_rda$tot.chi) * 100
exp_by_y <- (as.list(res_rda$CCA$eig)$RDA2)/(res_rda$tot.chi) * 100
    
ggRDA(res_rda, envfit_df = df_env, sp_size = 5) +
    # Generally theme_classic is a good choice to paint a figure
    theme_classic() +
    # In general, we don't need to show the legend in RDA figure
    theme(legend.position = "none") +
    xlab(paste('RDA1 (', round(exp_by_x, 2), '%)', sep = '')) +
    ylab(paste('RDA2 (', round(exp_by_y, 2), '%)', sep = '')) +
    # scale_XXXXX_manual series provide the ability
    # to define the style of legend by variable value
    scale_size_manual(values = c('ns' = .6,
                                 'sig' = .8)) +
    # Q: What's species here? I don't remember their is a significant level which is called 'species'
    # A: Indeed, their is no significant 'species'. However, 
        # the species name in RDA which is generated from geom_text contains colour attribution.
    scale_colour_manual(values = c(
        'ns' = '#606060',
        'sig' = 'black',
        'species' = 'red'
      )) +
    scale_linetype_manual(values = c('ns' = 8, 'sig' = 1))

最后很简单了,使用 ggsave 将 RDA 图保存成你需要的格式就可以。

Tips: 自然科学期刊的投稿系统通常支持上传 PDF 格式的图片,根据实际情况也推荐使用 ggsave 输出 PDF 格式。

此时无需设置 DPI 也无法设置 DPI,这是因为,ggsave 保存的 PDF 文件会优先将图片输出为矢量图,简而言之就是图片无论如何放大,都不会变得模糊,而期刊的排版系统也能很好的处理这种矢量图。

只要给 PDF 图片设定一个合适的宽高,就无需担心图片会变得不清晰等奇奇怪怪看似玄学的问题了。

不过如果有修改字体的需求,也许 PDF 会报错,类似于字体无法嵌入,此时输出的 PDF 文件所有的文字都会消失。关于解决这个问题,后续也许会进行进一步的讨论。

ggsave('RDA.pdf',width = 7)
Saving 7 x 7 in image