Using R for Everything: ggplot2 可视化 RDA 结果

关于 RDA（Redundancy analysis 冗余分析）是什么，相信对于已经有可视化需求的同学来说已经不用更多的解释了。

在 R 中常用来进行 RDA 分析和绘制工作的是 vegan 和 ggvegan 这两个包。然而，在实际使用中，最常遇到的问题是虽然这些包内建的 plot 等功能可以绘制出基本可用的包，但想要进一步的定制图形却没那么容易。

想要绘制出一副自己满意、编辑满意、导师满意最主要的是审稿人满意的 RDA 结果，作为最强可视化工具之一的 ggplot2 包毋庸置疑是最佳的选择。

我们需要什么样的 RDA 图

首先，我们来思考我们需要什么样的 RDA 图？按照世纪需求以及审稿人的建议：

I would recommend showing in bold the variables with significant correlations

笔者最后的目标是绘制一幅：

1. 显示物种信息（实际上是响应变量矩阵）；
2. 环境变量（实际上是解释变量矩阵）；
3. 在两轴上能显示各自的解释度；
4. 标记有显著性的解释变量。

以笔者对 plot.rda 以及 autoplot.rda 这两个 vegan 和 ggvegan 内建函数的浅薄了解，似乎很难完成。

如何「标记有显著性的解释变量」？

如果进行过 RDA 分析不难发现，使用 vegan 内建的 rda 是没有标记解释变量的显著性的。其实这种显著性需要 vegan 内建的 envfit 函数来得到一个及其近似的结果。通常这个方法在论文中会写作：

Monte Carlo permutation (999 permutations) was used to identify axes with significant eigenvalues and species-environment correlations.

由于是进行了 999 次（默认参数可以修改）的蒙特卡洛抽样，因此这个结果是及其近似的结果，直接用作 RDA 中解释变量的显著性是没有问题的。然而，envfit 输出的结果并非标准的 data.frame 或者类似的结果，无法方便的输出或者进行分析，后续我们还会进行结果提取的步骤，暂且按下不表。

RDA 和 ENVFIT 分析

此例，我们使用随机抽取的 150 条土壤线虫群落群落数据以及对应的环境数据，由于不影响后续的理解以及版权考量，对各参数名不再解释。

每一步的操作以及原因以注释的形式呈现。

library('tidyverse')

# read Environmental Variables
df_env <- read_csv('df_env_smp.csv')
# read Community composition matrix
df_com <- read_csv('df_com_smp.csv')

# View the structure of data
head(df_env)
head(df_com)

Parsed with column specification:
cols(
  pH = [32mcol_double()[39m,
  MOI = [32mcol_double()[39m,
  TN = [32mcol_double()[39m,
  TP = [32mcol_double()[39m,
  AP = [32mcol_double()[39m,
  NH4 = [32mcol_double()[39m,
  NO3 = [32mcol_double()[39m,
  SOC = [32mcol_double()[39m,
  MAT = [32mcol_double()[39m,
  MAP = [32mcol_double()[39m,
  PBM = [32mcol_double()[39m,
  PCV = [32mcol_double()[39m,
  PSR = [32mcol_double()[39m
)

Parsed with column specification:
cols(
  pp = [32mcol_double()[39m,
  ba = [32mcol_double()[39m,
  fu = [32mcol_double()[39m,
  pr = [32mcol_double()[39m,
  om = [32mcol_double()[39m
)

A tibble: 6 × 13
pH	MOI	TN	TP	AP	NH4	NO3	SOC	MAT	MAP	PBM	PCV	PSR
<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
-1.6396547	-0.1610384	-0.1162406	-0.44892589	0.17966965	0.4471754	-0.5148515	-0.292278803	0.8420221	0.02487222	0.17638650	0.426595458	0.2275781
0.9429126	-0.5549069	-0.7919730	0.02952209	-0.78701735	-0.3426843	-0.4197756	-0.697789311	-1.5519670	0.27013035	-0.02465146	0.008333099	-0.8352947
-1.3387571	-0.2762522	0.1998817	-0.14900328	-0.10273314	0.7179844	-0.4922144	0.461059956	-0.8480169	1.21243154	0.73829039	-0.793336422	0.8349340
-1.2035947	-0.1658411	-0.2600847	-0.45606691	0.09277663	-0.1328073	-0.1677489	-0.007572046	0.7833595	-0.87351309	-0.80968336	-0.514494850	-1.2908116
-1.1762407	1.0043821	0.2918750	-0.47748995	2.24338416	0.7924569	-0.5827629	0.035809220	-0.2893264	-0.22471654	0.12645186	0.182609082	0.9867730
0.4263991	0.5467350	-0.3570957	-0.12043922	0.54896611	-0.3652517	-0.5540892	-0.360483891	-0.8480169	1.21243154	2.11789718	-0.026522097	0.6830950

A tibble: 6 × 5
pp	ba	fu	pr	om
<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
2.550000	28.010000	2.550000	2.550000	15.280000
1.311046	7.316707	1.337884	1.337884	5.953381
8.250000	13.255000	3.355000	0.000000	14.850000
102.490000	18.516667	20.946667	0.000000	23.500000
21.350000	52.720000	14.846667	0.000000	5.623333
2.490000	4.965000	1.245000	0.000000	2.475000

library('vegan')

# Performing RDA and viewing the results
res_rda <- rda(df_com, df_env)
res_rda

# Performing ENVFIT
res_envfit <- envfit(df_com, df_env)

# However, the result of ENVFIT is not in the data.frame format, we should extract useful information from it.
res_envfit

Call: rda(X = df_com, Y = df_env)

                Inertia Proportion Rank
Total         2031.7341     1.0000     
Constrained    895.7317     0.4409    5
Unconstrained 1136.0024     0.5591    5
Inertia is variance 

Eigenvalues for constrained axes:
 RDA1  RDA2  RDA3  RDA4  RDA5 
754.0  88.3  45.4   6.2   1.8 

Eigenvalues for unconstrained axes:
  PC1   PC2   PC3   PC4   PC5 
747.4 235.0  96.3  44.7  12.6 





***VECTORS

          pp       ba     r2 Pr(>r)    
pH  -0.94935  0.31421 0.0809  0.003 ** 
MOI  0.79143  0.61126 0.2636  0.001 ***
TN   0.91399  0.40574 0.0887  0.004 ** 
TP   0.86358 -0.50421 0.0002  0.984    
AP   0.99269  0.12071 0.0575  0.026 *  
NH4  0.94495 -0.32722 0.0462  0.033 *  
NO3  0.52786  0.84933 0.1430  0.002 ** 
SOC  0.81629  0.57764 0.1440  0.001 ***
MAT  0.10537 -0.99443 0.0118  0.405    
MAP -0.18615 -0.98252 0.0073  0.579    
PBM  0.75360  0.65733 0.0096  0.508    
PCV  0.57508  0.81810 0.0380  0.072 .  
PSR  0.88521 -0.46519 0.0062  0.639    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Permutation: free
Number of permutations: 999

正如我们在注释提到的，ENVFIT 并不是以 data.frame 的形式提供数据，因此我们还需要通过一些提取的操作才能获得与结果相同的表格。由于这类操作经常使用，因此我们将其包装成 function 并命名为 envfit_to_df

# Here, env_obj indicates the result of envfit. In this case, it's the res_envfit.
# r2_dig is the significant figure of R2
# p_dig is the significant figure of p value
envfit_to_df <- function(env_obj,
                         r2_dig = 6,
                         p_dig = 3) {
  r2_fmt <- as.character(paste('%.', r2_dig, 'f', sep = ''))
  p_fmt <- as.character(paste('%.', p_dig, 'f', sep = ''))
  tibble(
      # the name of explainary variables
    factor = names(env_obj$vectors$r),
      # list or vector of R2
    r2 = env_obj$vectors$r,
      # list or vector of p values
    pvals = env_obj$vectors$pvals
  ) %>%
    # generate significant levels by p values
    mutate(sig = case_when(
      pvals <= 0.001 ~ '***',
      pvals <= 0.01 ~ '**',
      pvals <= 0.05 ~ '*',
      TRUE ~ ' '
    )) %>%
    # format the significant figure by format definition before.
    mutate(pvals = sprintf('%.3f', pvals),
           r2 = sprintf(r2_fmt, r2))
}
# Convert result of ENVFIT to data.frame, in fact, it's a tibble.
df_env <- envfit_to_df(res_envfit, r2_dig = 3)
df_env

A tibble: 13 × 4
factor	r2	pvals	sig
<chr>	<chr>	<chr>	<chr>
pH	0.081	0.003	**
MOI	0.264	0.001	***
TN	0.089	0.004	**
TP	0.000	0.984
AP	0.058	0.026	*
NH4	0.046	0.033	*
NO3	0.143	0.002	**
SOC	0.144	0.001	***
MAT	0.012	0.405
MAP	0.007	0.579
PBM	0.010	0.508
PCV	0.038	0.072
PSR	0.006	0.639

截止到目前，我们已经准备完成了 RDA 分析以及 ENVFIT 分析，并将数据转换成为了满足可视化需求的格式。

接下来就是需要进行可视化作业。此例，笔者仅需要显示环境变量与物种信息，不需要显示样地信息，因此绘制中仅保留了所需的信息，如需显示样地信息，请按实际需求更改。

绘制 RDA 图形是常用的操作，因此同样将它包装成为 function 并命名为 ggRDA

# Here, rda_obj means the object which is from vegan::rda
# sp_size means the text size of species
# arrow_txt_size means the environmet variable names at the end of the arrow
# Because not every RDA plot needs indicate significant correlations, the envfit_df is optional here.
ggRDA <- function(rda_obj,sp_size = 4,arrow_txt_size = 4, envfit_df) {
    # ggplot doesn't support rda object directly, we use ggvegan::fortify function to convert the rda to data.frame
    fmod <- fortify(rda_obj)
    # to get the arrow of biplot, we plot rda by vegan::plot.rda function firstly. 
    # The arrow attributes contain in the attributes(plot_obejct$biplot)$arrow.mul
    basplot <- plot(rda_obj)
    mult <- attributes(basplot$biplot)$arrow.mul
  
    # To check if envfit_df exists or not
    # If envfit_df exists, join the fortified rda_obj and envfit to mark which variable is significant.
    if(missingArg(envfit_df)){
        bplt_df <- filter(fmod, Score == "biplot") %>%
        # If there is no requirement to mark significant variable
        # use the sytle of sinificant (black bolder solid arrow)
        # to paint the arrow
        mutate(bold = 'sig')
    } else {
        bplt_df <- filter(fmod, Score == "biplot") %>%
        left_join(envfit_df, by = c('Label' = 'factor')) %>%
        # To mark the significant variables as sig, not significant variables as ns
        # these information are stored in bold column
        mutate(bold = ifelse(str_detect(sig, fixed('*')), 'sig', 'ns'))
    }
  ggplot(fmod, aes(x = RDA1, y = RDA2)) +
    coord_fixed() +
    geom_segment(data = bplt_df,
                 # mult and RDA1/RDA2 are from fortified RDA data.frame
                 # they contain the direction and effects of every variabl
                 # their products are the direction and length of arrows
                 aes(x = 0, xend = mult * RDA1,
                     y = 0, yend = mult * RDA2,
                     # Use different arrow size to indicate the significant level
                     size = bold,
                     # Use different arrow color to indicate the significant level
                     color = bold,
                     # Use different arrow linetype to indicate the significant level
                     linetype = bold),
                 #############################
                 # Q:Why use three different attibution to control the significant levels?
                     # It is redundancy, isn't it?
                 # A: In fact, it's not easy to recongize the significant level by one kind attribution
                     # Becasue it is not delicate to indicate it with supper bold and supper thin arrow,
                     # by the same logic, high contrast colors are not delicate neither.
                     # As for the line type, some arrow are really short, it's not easy to recognize
                     # weather it is solid or dashed line at all.
                     # To sum up, we use three different attributions
                     # to indicate the same difference to avoid any misleading.
                 #############################
                 # to control the size of the header of arrow
                 arrow = arrow(length = unit(0.25, "cm")), 
                ) +
    # Add the text of variable name at the end of arrow
    geom_text(data =  subset(fmod, Score == "biplot"),
              aes(x = (mult + mult/10) * RDA1,
                   #we add 10% to the text to push it slightly out from arrows
                  y = (mult + mult/10) * RDA2,
                  label = Label),
              size = arrow_txt_size,
               #otherwise you could use hjust and vjust. I prefer this option
              hjust = 0.5) +
    # Add the text of species
    geom_text(
      data = subset(fmod, Score == "species"),
      aes(colour = "species",label = Label),
      size = sp_size
    )
}

虽然使用 ggRDA 可以直接绘制图形，但通常为了美观，对于特定的参数还需要进一步的调整。

注意，由于在 ggRDA 中使用了 vegan::plot.rda 绘制图像，所以在下面的调用 ggRDA 会首先绘制一次简易的 rda 图像之后，再显示出 ggplot2 绘制的图形，不影响后续的输出保存。

library(ggvegan)
# Get the amount of explanation by each axis
# gernerally, we choose the first two axes.
exp_by_x <- (as.list(res_rda$CCA$eig)$RDA1)/(res_rda$tot.chi) * 100
exp_by_y <- (as.list(res_rda$CCA$eig)$RDA2)/(res_rda$tot.chi) * 100
    
ggRDA(res_rda, envfit_df = df_env, sp_size = 5) +
    # Generally theme_classic is a good choice to paint a figure
    theme_classic() +
    # In general, we don't need to show the legend in RDA figure
    theme(legend.position = "none") +
    xlab(paste('RDA1 (', round(exp_by_x, 2), '%)', sep = '')) +
    ylab(paste('RDA2 (', round(exp_by_y, 2), '%)', sep = '')) +
    # scale_XXXXX_manual series provide the ability
    # to define the style of legend by variable value
    scale_size_manual(values = c('ns' = .6,
                                 'sig' = .8)) +
    # Q: What's species here? I don't remember their is a significant level which is called 'species'
    # A: Indeed, their is no significant 'species'. However, 
        # the species name in RDA which is generated from geom_text contains colour attribution.
    scale_colour_manual(values = c(
        'ns' = '#606060',
        'sig' = 'black',
        'species' = 'red'
      )) +
    scale_linetype_manual(values = c('ns' = 8, 'sig' = 1))

最后很简单了，使用 ggsave 将 RDA 图保存成你需要的格式就可以。

Tips: 自然科学期刊的投稿系统通常支持上传 PDF 格式的图片，根据实际情况也推荐使用 ggsave 输出 PDF 格式。

此时无需设置 DPI 也无法设置 DPI，这是因为，ggsave 保存的 PDF 文件会优先将图片输出为矢量图，简而言之就是图片无论如何放大，都不会变得模糊，而期刊的排版系统也能很好的处理这种矢量图。

只要给 PDF 图片设定一个合适的宽高，就无需担心图片会变得不清晰等奇奇怪怪看似玄学的问题了。

不过如果有修改字体的需求，也许 PDF 会报错，类似于字体无法嵌入，此时输出的 PDF 文件所有的文字都会消失。关于解决这个问题，后续也许会进行进一步的讨论。

ggsave('RDA.pdf',width = 7)

Saving 7 x 7 in image

洗衣机用户不会用洗衣机

R for Everything: 使用 ggplot2 可视化 RDA 结果

Using R for Everything: ggplot2 可视化 RDA 结果

我们需要什么样的 RDA 图

如何「标记有显著性的解释变量」？

RDA 和 ENVFIT 分析

设备管理器中的 MVSI Card Reader USB Device driver 是什么？如何移除其占用盘符？