估参算法改进
- 将sapply全部改为Rcpp进行循环,整体的估参速度提升很多,如5组annotation联合估参时,改进前需要7h,而改进后只需要1.2h
- 多个参数估参时,optim method 由默认的
Nelder-Mead改为BFGS,避免陷入局部最优
- prior由原来固定的0.05,改为一个可变的参数,和annotations的effect size一起由optim估参决定
ASD
SNV model
effect size of functional annotations
number of the affected individuals: 6430
separately estimated rr of all annotations:https://yfu1116.github.io/project/2021-05-11-rr-all-annota-ASD-CHD/
feature selection: 固定prior为0.05,用optim对每个annotation单独估参,选择置信区间在0以上的进行joint estimation
joint_logRR:
jointly estimate logRR and prior using
optim(method = "BFGS"), where logRR is a vector, and prior is a scalarresult: prior = 0.118159391983715
1 | readRDS("/storage11_7T/fuy/TADA-A/cell_WES/DNM/2021-05-11_selected16_joint_separate_rr.rds") |
| joint_logRR | annota | separate_logRR | upper_bound | lower_bound |
|---|---|---|---|---|
| <dbl> | <chr> | <dbl> | <dbl> | <dbl> |
| 0.7753066 | hnRNPL binding regions | 1.599488 | 3.083753 | 0.1152234 |
| 0.2631773 | coding constraint >= 90 | 1.897444 | 2.103377 | 1.6915119 |
| 0.8483160 | dbSNP.RBP-Var: likely to affcted RNA-bind, RNA 2nd structure | 1.844203 | 2.278540 | 1.4098657 |
| 0.6809386 | RNA modifications, including m6a, m1A, m5C, and etc.(filter_low_RMVar_hb) | 1.312018 | 1.645988 | 0.9780486 |
| 0.5315570 | deepsea: mix ago_adult_brain.BA4 & ago_adult_brain.Cingulate.gyrus | 2.483918 | 2.903837 | 2.0639991 |
| 0.5140172 | deepsea: mix elavl_Adult_brain.all_human_samples & elavl_Adult_brain.BA9_Alzheimer & elavl_Adult_brain.BA9 | 2.592861 | 3.006368 | 2.1793531 |
| 0.2471636 | 1<= mpc score < 2 | 1.621601 | 1.835855 | 1.4073458 |
| 0.7175324 | mpc score >= 2 | 2.321405 | 2.547951 | 2.0948598 |
| 0.7615884 | interaction-disrupting: mutations annotated as interface residues and probably damaging by PolyPhen-2 | 2.600493 | 2.828070 | 2.3729150 |
| 0.7574717 | pathogenic missense: primateAI_MVP_mix | 1.757185 | 1.892405 | 1.6219658 |
| 0.2998243 | spidex_lower10pct | 1.773391 | 1.967028 | 1.5797548 |
| 0.3567299 | spliceAI accept gain >= 0.05 | 1.997280 | 3.664953 | 0.3296074 |
| 0.9058033 | spliceAI donor gain >= 0.05 | 2.800307 | 3.476940 | 2.1236739 |
| 1.1840759 | ptv in [0,0.5) | 2.081598 | 2.557930 | 1.6052671 |
| 1.0524383 | ptv in [0.5,0.995) | 2.209710 | 2.964529 | 1.4548916 |
| 2.5145196 | ptv in [0.995,1) | 3.493791 | 3.759486 | 3.2280951 |
heatmap of similarity bewteen two annotations
x-axis: count of overlap / count of annota1 mutations in our windows (r1), y-axis: count of overlap / count of annota2 mutations in our windows (r2)
red denotes the ratio over 40%, and blue denotes the ratio less than 40%

r1: count of overlap / count of annota1 mutations in our windows, r2: count of overlap / count of annota2 mutations in our windows
以40%为分界线,两两相似度较高的annotation如下:
1 | library(data.table) |
| annota1 | annota2 | r1 | r2 |
|---|---|---|---|
| <chr> | <chr> | <dbl> | <dbl> |
| pathogenic missense: primateAI | mpc12 | 0.4027 | 0.3804 |
| pathogenic missense: primateAI | MVP | 0.4337 | 0.2908 |
| pathogenic missense: primateAI | interaction-disrupting: interface residues and probably damaging by PolyPhen-2 | 0.1035 | 0.4279 |
| pathogenic missense: primateAI | mpc2 | 0.2076 | 0.6311 |
| spidex | ptv.0-05 | 0.1838 | 0.6232 |
| spidex | ptv.995 | 0.0639 | 0.8053 |
| spidex | ptv.05-995 | 0.0497 | 0.6226 |
| pathogenic missense: MVP | interaction-disrupting: interface residues and probably damaging by PolyPhen-2 | 0.0904 | 0.5569 |
| pathogenic missense: MVP | mpc2 | 0.1080 | 0.4895 |
functional analysis
SNV model: 找出80个风险基因,其中8个基因中没有覆盖任何DNM,因此将其去除
indel model
将SNV模型得到的基因致病的posterior作为indel模型的prior,用EM算法得到frameshift的effect size,details: https://yfu1116.github.io/project/2021-05-13-indel-model-ASD/
Results:
RR(relative risk) = 3.64713076316748
logRR = 1.29394076617897
跟SNV的结果相比,indel model新增了17个新基因,这些基因的功能富集在neuron projection morphogenesis, Neurodevelopmental Disorders, Autistic behavior等(details: https://yfu1116.github.io/project/2021-05-13_md_indel-snv_ASD_riskG_GO_DisGeNET_17/)) ,一定程度上证明了indel model的可行性:
SNV model + indel model
相对于cell那篇文章里找出的102个风险基因,我们的SNV+indel model 找出了62个与其一致的基因,和27个novel ASD risk genes,这些novel risk genes的功能富集如下:
novel risk gene list:
GO

DisGeNET (gene-disease association)

potentially damaging DNMs
DNMs with functional annotations in novel risk genes :https://github.com/yfu1116/project/blob/master/files/ASD_annota_merge_info.xlsx?raw=true
打算重点看synonymous / constrained transcriptional or post-transcriptional regulatory variants
CHD
effect size of functional annotations
number of the affected individuals: 2871
separately estimated rr of all annotations: https://yfu1116.github.io/project/2021-05-11-rr-all-annota-ASD-CHD/
feature selection: 固定prior为0.05,用optim对每个annotation单独估参,选择置信区间在0以上的进行joint estimation
joint_logRR:
estimate logRR and prior using
optim(method = "BFGS"), where logRR is a vector, and prior is a scalarresult: prior = 0.0293059350146621
1 | readRDS("/storage11_7T/fuy/TADA-A/CHD/rds/sig_0512_all_annota_rr.rds") |
| joint_logRR | annota | separately_logRR | upper_bound | lower_bound | |
|---|---|---|---|---|---|
| <dbl> | <chr> | <dbl> | <dbl> | <dbl> | |
| 3 | 0.11161697 | coding constraint >= 90 | 1.687606 | 2.192012 | 1.18320074 |
| 5 | 0.70350768 | RADAR: top5% variants in the post-transcriptional regulome of RBP | 1.814882 | 2.339178 | 1.29058489 |
| 6 | 1.48950949 | dbSNP.RBP-Var: likely to affcted RNA-bind, RNA 2nd structure | 2.194627 | 2.734925 | 1.65432873 |
| 10 | 1.02046920 | deepsea: mix ago_adult_brain.BA4 & ago_adult_brain.Cingulate.gyrus | 2.430548 | 3.074458 | 1.78663714 |
| 11 | -0.18538841 | deepsea: mix elavl_Adult_brain.all_human_samples & elavl_Adult_brain.BA9_Alzheimer & elavl_Adult_brain.BA9 | 1.738795 | 3.155253 | 0.32233776 |
| 12 | 0.32493815 | 1<= mpc score < 2 | 1.813293 | 2.139593 | 1.48699171 |
| 13 | 0.61086386 | mpc score >= 2 | 2.058641 | 2.554155 | 1.56312683 |
| 14 | 0.03448242 | interaction-disrupting: mutations annotated as interface residues and probably damaging by PolyPhen-2 | 2.307627 | 2.803729 | 1.81152465 |
| 15 | 1.75734201 | pathogenic missense: primateAI_MVP_mix | 2.035133 | 2.225165 | 1.84509985 |
| 16 | 0.63166845 | spidex_lower10pct | 2.030793 | 2.296126 | 1.76545964 |
| 17 | 0.96439229 | spliceAI accept gain >= 0.05 | 2.944004 | 4.195226 | 1.69278194 |
| 18 | 0.19183110 | spliceAI donor gain >= 0.05 | 2.472156 | 3.889113 | 1.05519861 |
| 20 | 0.63851827 | spliceAI donor loss >= 0.05 | 2.156331 | 4.376120 | -0.06345714 |
| 21 | 1.12710657 | ptv in [0,0.5) | 2.283768 | 2.944003 | 1.62353425 |
| 22 | 2.61386398 | ptv in [0.5,995) | 3.039657 | 3.769662 | 2.30965246 |
| 23 | 3.02368867 | ptv in [0.995,1) | 3.701728 | 4.160876 | 3.24258001 |
risk genes
known CHD risk genelist is from the paper: https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-019-0709-8
SNV model: 20 CHD risk genes are identified, of which 12 are novel genes
INDEL model: effect size of frameshift = 6.29013716769851, 2 new risk genes are identified(“CIT”, “KDM5B”)
SNV + indel model: 21 CHD risk genes are identified, of which 13 are novel genes. Details:
Functional analysis for these novel genes
GO

DisGeNET (gene-disease association)

potentially damaging DNMs
DNMs with functional annotations in novel risk genes :https://github.com/yfu1116/project/blob/master/files/CHD_annota_merge_info.xlsx?raw=true
next plan
find potential transcriptional or post-transcriptional variants in the etiology of ASD
question
1) frameshift effect size偏小的原因
2) 当一个变异有多重注释,如(protein-truncating (score: 0-0.05) and RNA modifying),如何确认哪种作用机制的影响大?