Using R for Everything: ggplot2 可视化 RDA 结果
关于 RDA(Redundancy analysis 冗余分析)是什么,相信对于已经有可视化需求的同学来说已经不用更多的解释了。
在 R 中常用来进行 RDA 分析和绘制工作的是 vegan 和 ggvegan 这两个包。然而,在实际使用中,最常遇到的问题是虽然这些包内建的 plot 等功能可以绘制出基本可用的包,但想要进一步的定制图形却没那么容易。
想要绘制出一副自己满意、编辑满意、导师满意最主要的是审稿人满意的 RDA 结果,作为最强可视化工具之一的 ggplot2 包毋庸置疑是最佳的选择。
我们需要什么样的 RDA 图
首先,我们来思考我们需要什么样的 RDA 图?按照世纪需求以及审稿人的建议:
I would recommend showing in bold the variables with significant correlations
笔者最后的目标是绘制一幅:
1. 显示物种信息(实际上是响应变量矩阵);
2. 环境变量(实际上是解释变量矩阵);
3. 在两轴上能显示各自的解释度;
4. 标记有显著性的解释变量。
以笔者对 plot.rda 以及 autoplot.rda 这两个 vegan 和 ggvegan 内建函数的浅薄了解,似乎很难完成。
如何「标记有显著性的解释变量」?
如果进行过 RDA 分析不难发现,使用 vegan 内建的 rda 是没有标记解释变量的显著性的。其实这种显著性需要 vegan 内建的 envfit 函数来得到一个及其近似的结果。通常这个方法在论文中会写作:
Monte Carlo permutation (999 permutations) was used to identify axes with significant eigenvalues and species-environment correlations.
由于是进行了 999 次(默认参数可以修改)的蒙特卡洛抽样,因此这个结果是及其近似的结果,直接用作 RDA 中解释变量的显著性是没有问题的。然而,envfit 输出的结果并非标准的 data.frame 或者类似的结果,无法方便的输出或者进行分析,后续我们还会进行结果提取的步骤,暂且按下不表。
RDA 和 ENVFIT 分析
此例,我们使用随机抽取的 150 条土壤线虫群落群落数据以及对应的环境数据,由于不影响后续的理解以及版权考量,对各参数名不再解释。
每一步的操作以及原因以注释的形式呈现。
library('tidyverse')
# read Environmental Variables
df_env <- read_csv('df_env_smp.csv')
# read Community composition matrix
df_com <- read_csv('df_com_smp.csv')
# View the structure of data
head(df_env)
head(df_com)
Parsed with column specification:
cols(
pH = [32mcol_double()[39m,
MOI = [32mcol_double()[39m,
TN = [32mcol_double()[39m,
TP = [32mcol_double()[39m,
AP = [32mcol_double()[39m,
NH4 = [32mcol_double()[39m,
NO3 = [32mcol_double()[39m,
SOC = [32mcol_double()[39m,
MAT = [32mcol_double()[39m,
MAP = [32mcol_double()[39m,
PBM = [32mcol_double()[39m,
PCV = [32mcol_double()[39m,
PSR = [32mcol_double()[39m
)
Parsed with column specification:
cols(
pp = [32mcol_double()[39m,
ba = [32mcol_double()[39m,
fu = [32mcol_double()[39m,
pr = [32mcol_double()[39m,
om = [32mcol_double()[39m
)
pH | MOI | TN | TP | AP | NH4 | NO3 | SOC | MAT | MAP | PBM | PCV | PSR |
---|---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
-1.6396547 | -0.1610384 | -0.1162406 | -0.44892589 | 0.17966965 | 0.4471754 | -0.5148515 | -0.292278803 | 0.8420221 | 0.02487222 | 0.17638650 | 0.426595458 | 0.2275781 |
0.9429126 | -0.5549069 | -0.7919730 | 0.02952209 | -0.78701735 | -0.3426843 | -0.4197756 | -0.697789311 | -1.5519670 | 0.27013035 | -0.02465146 | 0.008333099 | -0.8352947 |
-1.3387571 | -0.2762522 | 0.1998817 | -0.14900328 | -0.10273314 | 0.7179844 | -0.4922144 | 0.461059956 | -0.8480169 | 1.21243154 | 0.73829039 | -0.793336422 | 0.8349340 |
-1.2035947 | -0.1658411 | -0.2600847 | -0.45606691 | 0.09277663 | -0.1328073 | -0.1677489 | -0.007572046 | 0.7833595 | -0.87351309 | -0.80968336 | -0.514494850 | -1.2908116 |
-1.1762407 | 1.0043821 | 0.2918750 | -0.47748995 | 2.24338416 | 0.7924569 | -0.5827629 | 0.035809220 | -0.2893264 | -0.22471654 | 0.12645186 | 0.182609082 | 0.9867730 |
0.4263991 | 0.5467350 | -0.3570957 | -0.12043922 | 0.54896611 | -0.3652517 | -0.5540892 | -0.360483891 | -0.8480169 | 1.21243154 | 2.11789718 | -0.026522097 | 0.6830950 |
pp | ba | fu | pr | om |
---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
2.550000 | 28.010000 | 2.550000 | 2.550000 | 15.280000 |
1.311046 | 7.316707 | 1.337884 | 1.337884 | 5.953381 |
8.250000 | 13.255000 | 3.355000 | 0.000000 | 14.850000 |
102.490000 | 18.516667 | 20.946667 | 0.000000 | 23.500000 |
21.350000 | 52.720000 | 14.846667 | 0.000000 | 5.623333 |
2.490000 | 4.965000 | 1.245000 | 0.000000 | 2.475000 |
library('vegan')
# Performing RDA and viewing the results
res_rda <- rda(df_com, df_env)
res_rda
# Performing ENVFIT
res_envfit <- envfit(df_com, df_env)
# However, the result of ENVFIT is not in the data.frame format, we should extract useful information from it.
res_envfit
Call: rda(X = df_com, Y = df_env)
Inertia Proportion Rank
Total 2031.7341 1.0000
Constrained 895.7317 0.4409 5
Unconstrained 1136.0024 0.5591 5
Inertia is variance
Eigenvalues for constrained axes:
RDA1 RDA2 RDA3 RDA4 RDA5
754.0 88.3 45.4 6.2 1.8
Eigenvalues for unconstrained axes:
PC1 PC2 PC3 PC4 PC5
747.4 235.0 96.3 44.7 12.6
***VECTORS
pp ba r2 Pr(>r)
pH -0.94935 0.31421 0.0809 0.003 **
MOI 0.79143 0.61126 0.2636 0.001 ***
TN 0.91399 0.40574 0.0887 0.004 **
TP 0.86358 -0.50421 0.0002 0.984
AP 0.99269 0.12071 0.0575 0.026 *
NH4 0.94495 -0.32722 0.0462 0.033 *
NO3 0.52786 0.84933 0.1430 0.002 **
SOC 0.81629 0.57764 0.1440 0.001 ***
MAT 0.10537 -0.99443 0.0118 0.405
MAP -0.18615 -0.98252 0.0073 0.579
PBM 0.75360 0.65733 0.0096 0.508
PCV 0.57508 0.81810 0.0380 0.072 .
PSR 0.88521 -0.46519 0.0062 0.639
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Permutation: free
Number of permutations: 999
正如我们在注释提到的,ENVFIT 并不是以 data.frame 的形式提供数据,因此我们还需要通过一些提取的操作才能获得与结果相同的表格。由于这类操作经常使用,因此我们将其包装成 function 并命名为 envfit_to_df
# Here, env_obj indicates the result of envfit. In this case, it's the res_envfit.
# r2_dig is the significant figure of R2
# p_dig is the significant figure of p value
envfit_to_df <- function(env_obj,
r2_dig = 6,
p_dig = 3) {
r2_fmt <- as.character(paste('%.', r2_dig, 'f', sep = ''))
p_fmt <- as.character(paste('%.', p_dig, 'f', sep = ''))
tibble(
# the name of explainary variables
factor = names(env_obj$vectors$r),
# list or vector of R2
r2 = env_obj$vectors$r,
# list or vector of p values
pvals = env_obj$vectors$pvals
) %>%
# generate significant levels by p values
mutate(sig = case_when(
pvals <= 0.001 ~ '***',
pvals <= 0.01 ~ '**',
pvals <= 0.05 ~ '*',
TRUE ~ ' '
)) %>%
# format the significant figure by format definition before.
mutate(pvals = sprintf('%.3f', pvals),
r2 = sprintf(r2_fmt, r2))
}
# Convert result of ENVFIT to data.frame, in fact, it's a tibble.
df_env <- envfit_to_df(res_envfit, r2_dig = 3)
df_env
factor | r2 | pvals | sig |
---|---|---|---|
<chr> | <chr> | <chr> | <chr> |
pH | 0.081 | 0.003 | ** |
MOI | 0.264 | 0.001 | *** |
TN | 0.089 | 0.004 | ** |
TP | 0.000 | 0.984 | |
AP | 0.058 | 0.026 | * |
NH4 | 0.046 | 0.033 | * |
NO3 | 0.143 | 0.002 | ** |
SOC | 0.144 | 0.001 | *** |
MAT | 0.012 | 0.405 | |
MAP | 0.007 | 0.579 | |
PBM | 0.010 | 0.508 | |
PCV | 0.038 | 0.072 | |
PSR | 0.006 | 0.639 |
截止到目前,我们已经准备完成了 RDA 分析以及 ENVFIT 分析,并将数据转换成为了满足可视化需求的格式。
接下来就是需要进行可视化作业。此例,笔者仅需要显示环境变量与物种信息,不需要显示样地信息,因此绘制中仅保留了所需的信息,如需显示样地信息,请按实际需求更改。
绘制 RDA 图形是常用的操作,因此同样将它包装成为 function 并命名为 ggRDA
# Here, rda_obj means the object which is from vegan::rda
# sp_size means the text size of species
# arrow_txt_size means the environmet variable names at the end of the arrow
# Because not every RDA plot needs indicate significant correlations, the envfit_df is optional here.
ggRDA <- function(rda_obj,sp_size = 4,arrow_txt_size = 4, envfit_df) {
# ggplot doesn't support rda object directly, we use ggvegan::fortify function to convert the rda to data.frame
fmod <- fortify(rda_obj)
# to get the arrow of biplot, we plot rda by vegan::plot.rda function firstly.
# The arrow attributes contain in the attributes(plot_obejct$biplot)$arrow.mul
basplot <- plot(rda_obj)
mult <- attributes(basplot$biplot)$arrow.mul
# To check if envfit_df exists or not
# If envfit_df exists, join the fortified rda_obj and envfit to mark which variable is significant.
if(missingArg(envfit_df)){
bplt_df <- filter(fmod, Score == "biplot") %>%
# If there is no requirement to mark significant variable
# use the sytle of sinificant (black bolder solid arrow)
# to paint the arrow
mutate(bold = 'sig')
} else {
bplt_df <- filter(fmod, Score == "biplot") %>%
left_join(envfit_df, by = c('Label' = 'factor')) %>%
# To mark the significant variables as sig, not significant variables as ns
# these information are stored in bold column
mutate(bold = ifelse(str_detect(sig, fixed('*')), 'sig', 'ns'))
}
ggplot(fmod, aes(x = RDA1, y = RDA2)) +
coord_fixed() +
geom_segment(data = bplt_df,
# mult and RDA1/RDA2 are from fortified RDA data.frame
# they contain the direction and effects of every variabl
# their products are the direction and length of arrows
aes(x = 0, xend = mult * RDA1,
y = 0, yend = mult * RDA2,
# Use different arrow size to indicate the significant level
size = bold,
# Use different arrow color to indicate the significant level
color = bold,
# Use different arrow linetype to indicate the significant level
linetype = bold),
#############################
# Q:Why use three different attibution to control the significant levels?
# It is redundancy, isn't it?
# A: In fact, it's not easy to recongize the significant level by one kind attribution
# Becasue it is not delicate to indicate it with supper bold and supper thin arrow,
# by the same logic, high contrast colors are not delicate neither.
# As for the line type, some arrow are really short, it's not easy to recognize
# weather it is solid or dashed line at all.
# To sum up, we use three different attributions
# to indicate the same difference to avoid any misleading.
#############################
# to control the size of the header of arrow
arrow = arrow(length = unit(0.25, "cm")),
) +
# Add the text of variable name at the end of arrow
geom_text(data = subset(fmod, Score == "biplot"),
aes(x = (mult + mult/10) * RDA1,
#we add 10% to the text to push it slightly out from arrows
y = (mult + mult/10) * RDA2,
label = Label),
size = arrow_txt_size,
#otherwise you could use hjust and vjust. I prefer this option
hjust = 0.5) +
# Add the text of species
geom_text(
data = subset(fmod, Score == "species"),
aes(colour = "species",label = Label),
size = sp_size
)
}
虽然使用 ggRDA 可以直接绘制图形,但通常为了美观,对于特定的参数还需要进一步的调整。
注意,由于在 ggRDA 中使用了 vegan::plot.rda 绘制图像,所以在下面的调用 ggRDA 会首先绘制一次简易的 rda 图像之后,再显示出 ggplot2 绘制的图形,不影响后续的输出保存。
library(ggvegan)
# Get the amount of explanation by each axis
# gernerally, we choose the first two axes.
exp_by_x <- (as.list(res_rda$CCA$eig)$RDA1)/(res_rda$tot.chi) * 100
exp_by_y <- (as.list(res_rda$CCA$eig)$RDA2)/(res_rda$tot.chi) * 100
ggRDA(res_rda, envfit_df = df_env, sp_size = 5) +
# Generally theme_classic is a good choice to paint a figure
theme_classic() +
# In general, we don't need to show the legend in RDA figure
theme(legend.position = "none") +
xlab(paste('RDA1 (', round(exp_by_x, 2), '%)', sep = '')) +
ylab(paste('RDA2 (', round(exp_by_y, 2), '%)', sep = '')) +
# scale_XXXXX_manual series provide the ability
# to define the style of legend by variable value
scale_size_manual(values = c('ns' = .6,
'sig' = .8)) +
# Q: What's species here? I don't remember their is a significant level which is called 'species'
# A: Indeed, their is no significant 'species'. However,
# the species name in RDA which is generated from geom_text contains colour attribution.
scale_colour_manual(values = c(
'ns' = '#606060',
'sig' = 'black',
'species' = 'red'
)) +
scale_linetype_manual(values = c('ns' = 8, 'sig' = 1))
最后很简单了,使用 ggsave 将 RDA 图保存成你需要的格式就可以。
Tips: 自然科学期刊的投稿系统通常支持上传 PDF 格式的图片,根据实际情况也推荐使用 ggsave 输出 PDF 格式。
此时无需设置 DPI 也无法设置 DPI,这是因为,ggsave 保存的 PDF 文件会优先将图片输出为矢量图,简而言之就是图片无论如何放大,都不会变得模糊,而期刊的排版系统也能很好的处理这种矢量图。
只要给 PDF 图片设定一个合适的宽高,就无需担心图片会变得不清晰等奇奇怪怪看似玄学的问题了。
不过如果有修改字体的需求,也许 PDF 会报错,类似于字体无法嵌入,此时输出的 PDF 文件所有的文字都会消失。关于解决这个问题,后续也许会进行进一步的讨论。
ggsave('RDA.pdf',width = 7)
Saving 7 x 7 in image