估参算法改进

  • 将sapply全部改为Rcpp进行循环,整体的估参速度提升很多,如5组annotation联合估参时,改进前需要7h,而改进后只需要1.2h
  • 多个参数估参时,optim method 由默认的Nelder-Mead改为BFGS,避免陷入局部最优
  • prior由原来固定的0.05,改为一个可变的参数,和annotations的effect size一起由optim估参决定

ASD

SNV model

effect size of functional annotations

number of the affected individuals: 6430

separately estimated rr of all annotations:https://yfu1116.github.io/project/2021-05-11-rr-all-annota-ASD-CHD/

feature selection: 固定prior为0.05,用optim对每个annotation单独估参,选择置信区间在0以上的进行joint estimation

joint_logRR:

  • jointly estimate logRR and prior using optim(method = "BFGS"), where logRR is a vector, and prior is a scalar

  • result: prior = 0.118159391983715

1
readRDS("/storage11_7T/fuy/TADA-A/cell_WES/DNM/2021-05-11_selected16_joint_separate_rr.rds")
A data.frame: 16 × 5
joint_logRRannotaseparate_logRRupper_boundlower_bound
<dbl><chr><dbl><dbl><dbl>
0.7753066hnRNPL binding regions 1.5994883.0837530.1152234
0.2631773coding constraint >= 90 1.8974442.1033771.6915119
0.8483160dbSNP.RBP-Var: likely to affcted RNA-bind, RNA 2nd structure 1.8442032.2785401.4098657
0.6809386RNA modifications, including m6a, m1A, m5C, and etc.(filter_low_RMVar_hb) 1.3120181.6459880.9780486
0.5315570deepsea: mix ago_adult_brain.BA4 & ago_adult_brain.Cingulate.gyrus 2.4839182.9038372.0639991
0.5140172deepsea: mix elavl_Adult_brain.all_human_samples & elavl_Adult_brain.BA9_Alzheimer & elavl_Adult_brain.BA92.5928613.0063682.1793531
0.24716361<= mpc score < 2 1.6216011.8358551.4073458
0.7175324mpc score >= 2 2.3214052.5479512.0948598
0.7615884interaction-disrupting: mutations annotated as interface residues and probably damaging by PolyPhen-2 2.6004932.8280702.3729150
0.7574717pathogenic missense: primateAI_MVP_mix 1.7571851.8924051.6219658
0.2998243spidex_lower10pct 1.7733911.9670281.5797548
0.3567299spliceAI accept gain >= 0.05 1.9972803.6649530.3296074
0.9058033spliceAI donor gain >= 0.05 2.8003073.4769402.1236739
1.1840759ptv in [0,0.5) 2.0815982.5579301.6052671
1.0524383ptv in [0.5,0.995) 2.2097102.9645291.4548916
2.5145196ptv in [0.995,1) 3.4937913.7594863.2280951

heatmap of similarity bewteen two annotations

x-axis: count of overlap / count of annota1 mutations in our windows (r1), y-axis: count of overlap / count of annota2 mutations in our windows (r2)

red denotes the ratio over 40%, and blue denotes the ratio less than 40%

gBAwa4.png

r1: count of overlap / count of annota1 mutations in our windows, r2: count of overlap / count of annota2 mutations in our windows

以40%为分界线,两两相似度较高的annotation如下:

1
2
library(data.table)
readRDS("/storage11_7T/fuy/TADA-A/annotation/overlp/results/asd_chd_red.txt")
A data.table: 9 × 4
annota1annota2r1r2
<chr><chr><dbl><dbl>
pathogenic missense: primateAI mpc12 0.40270.3804
pathogenic missense: primateAI MVP 0.43370.2908
pathogenic missense: primateAI interaction-disrupting: interface residues and probably damaging by PolyPhen-20.10350.4279
pathogenic missense: primateAI mpc2 0.20760.6311
spidex ptv.0-05 0.18380.6232
spidex ptv.995 0.06390.8053
spidex ptv.05-995 0.04970.6226
pathogenic missense: MVP interaction-disrupting: interface residues and probably damaging by PolyPhen-20.09040.5569
pathogenic missense: MVP mpc2 0.10800.4895

functional analysis

SNV model: 找出80个风险基因,其中8个基因中没有覆盖任何DNM,因此将其去除

indel model

将SNV模型得到的基因致病的posterior作为indel模型的prior,用EM算法得到frameshift的effect size,details: https://yfu1116.github.io/project/2021-05-13-indel-model-ASD/

Results:

RR(relative risk) = 3.64713076316748

logRR = 1.29394076617897

跟SNV的结果相比,indel model新增了17个新基因,这些基因的功能富集在neuron projection morphogenesis, Neurodevelopmental Disorders, Autistic behavior等(details: https://yfu1116.github.io/project/2021-05-13_md_indel-snv_ASD_riskG_GO_DisGeNET_17/)) ,一定程度上证明了indel model的可行性:

SNV model + indel model

相对于cell那篇文章里找出的102个风险基因,我们的SNV+indel model 找出了62个与其一致的基因,和27个novel ASD risk genes,这些novel risk genes的功能富集如下:

novel risk gene list:

GO

gNAU8U.png

DisGeNET (gene-disease association)

gNEQJK.png

potentially damaging DNMs

DNMs with functional annotations in novel risk genes :https://github.com/yfu1116/project/blob/master/files/ASD_annota_merge_info.xlsx?raw=true

打算重点看synonymous / constrained transcriptional or post-transcriptional regulatory variants

CHD

effect size of functional annotations

number of the affected individuals: 2871

separately estimated rr of all annotations: https://yfu1116.github.io/project/2021-05-11-rr-all-annota-ASD-CHD/

feature selection: 固定prior为0.05,用optim对每个annotation单独估参,选择置信区间在0以上的进行joint estimation

joint_logRR:

  • estimate logRR and prior using optim(method = "BFGS"), where logRR is a vector, and prior is a scalar

  • result: prior = 0.0293059350146621

1
readRDS("/storage11_7T/fuy/TADA-A/CHD/rds/sig_0512_all_annota_rr.rds")
A data.frame: 16 × 5
joint_logRRannotaseparately_logRRupper_boundlower_bound
<dbl><chr><dbl><dbl><dbl>
3 0.11161697coding constraint >= 90 1.6876062.192012 1.18320074
5 0.70350768RADAR: top5% variants in the post-transcriptional regulome of RBP 1.8148822.339178 1.29058489
6 1.48950949dbSNP.RBP-Var: likely to affcted RNA-bind, RNA 2nd structure 2.1946272.734925 1.65432873
10 1.02046920deepsea: mix ago_adult_brain.BA4 & ago_adult_brain.Cingulate.gyrus 2.4305483.074458 1.78663714
11-0.18538841deepsea: mix elavl_Adult_brain.all_human_samples & elavl_Adult_brain.BA9_Alzheimer & elavl_Adult_brain.BA91.7387953.155253 0.32233776
12 0.324938151<= mpc score < 2 1.8132932.139593 1.48699171
13 0.61086386mpc score >= 2 2.0586412.554155 1.56312683
14 0.03448242interaction-disrupting: mutations annotated as interface residues and probably damaging by PolyPhen-2 2.3076272.803729 1.81152465
15 1.75734201pathogenic missense: primateAI_MVP_mix 2.0351332.225165 1.84509985
16 0.63166845spidex_lower10pct 2.0307932.296126 1.76545964
17 0.96439229spliceAI accept gain >= 0.05 2.9440044.195226 1.69278194
18 0.19183110spliceAI donor gain >= 0.05 2.4721563.889113 1.05519861
20 0.63851827spliceAI donor loss >= 0.05 2.1563314.376120-0.06345714
21 1.12710657ptv in [0,0.5) 2.2837682.944003 1.62353425
22 2.61386398ptv in [0.5,995) 3.0396573.769662 2.30965246
23 3.02368867ptv in [0.995,1) 3.7017284.160876 3.24258001

risk genes

known CHD risk genelist is from the paper: https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-019-0709-8

SNV model: 20 CHD risk genes are identified, of which 12 are novel genes

INDEL model: effect size of frameshift = 6.29013716769851, 2 new risk genes are identified(“CIT”, “KDM5B”)

SNV + indel model: 21 CHD risk genes are identified, of which 13 are novel genes. Details:

Functional analysis for these novel genes

GO

gBUQPA.png

DisGeNET (gene-disease association)

gBU0Gn.png

potentially damaging DNMs

DNMs with functional annotations in novel risk genes :https://github.com/yfu1116/project/blob/master/files/CHD_annota_merge_info.xlsx?raw=true

next plan

find potential transcriptional or post-transcriptional variants in the etiology of ASD

question

1) frameshift effect size偏小的原因

2) 当一个变异有多重注释,如(protein-truncating (score: 0-0.05) and RNA modifying),如何确认哪种作用机制的影响大?