Current progress and analysis

See sections 1 and 2 for details.

Next steps

100-round simulation (not yet fully finished)
Finish the CHD part
Select candidate genes

ASD

simulation

parameters:

3 annotations: (1) LoF: ptv > 0.995; (2) union of spidex_low3 and spliceai, without ptv; (3) mpc > 2. log-RR = 3, 2, 1 respectively (matching the sim_logRR column in the results table)

pi = 0.05, Sample size N = 6000, num_of_genes = 5000
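
A minimal sketch of how one simulation round with these parameters could be generated, assuming a TADA-style Poisson model: each gene is a risk gene with probability pi, and in risk genes the expected count of each annotation is multiplied by exp(log-RR). The baseline mutation rates in `mu` and the short annotation labels are made-up placeholders, not the rates used in the actual runs.

```r
set.seed(1)

# Parameters taken from the notes
pi_true   <- 0.05
N         <- 6000
num_genes <- 5000
# log-RR per annotation, as listed in the sim_logRR column of the results table
log_rr    <- c(lof = 3, splice = 2, mpc2 = 1)

# ASSUMPTION: hypothetical per-gene, per-annotation baseline mutation rates
mu <- matrix(10^runif(num_genes * length(log_rr), -7, -5),
             nrow = num_genes, dimnames = list(NULL, names(log_rr)))

# Latent risk status of each gene (1 = risk gene)
is_risk <- rbinom(num_genes, 1, pi_true)

# Expected counts: 2 * N * mu, inflated by exp(log_rr) in risk genes only
rr_mat   <- exp(outer(is_risk, log_rr))   # equals 1 for non-risk genes
expected <- 2 * N * mu * rr_mat
counts   <- matrix(rpois(length(expected), expected),
                   nrow = num_genes, dimnames = dimnames(mu))

colSums(counts)
```

Repeating this for 100 seeds and refitting the model on each `counts` matrix would give the per-round estimates that are averaged below.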

average over 100 simulation rounds

effect size

For every annotation, the simulated log-RR (sim_logRR) lies within the confidence interval of the estimate.
pi_estimated = 0.055 (simulated pi = 0.05)

readRDS("/storage11_7T/fuy/TADA-A/cell_WES/DNM/simulation/rr_allinfo.dt.rds")

A data.frame: 4 × 7
joint_estim_pi	joint_fix_pi0.05	sim_logRR	annota	separate_fix_pi	upper_bound	lower_bound
<dbl>	<dbl>	<dbl>	<chr>	<dbl>	<dbl>	<dbl>
2.60486096	2.6759428	3.0	Lof: ptv > 0.995	2.732214	3.7379131	1.72651467
1.63200522	1.6966532	2.0	union of spidex_low3 and spliceai, without ptv	1.835084	2.7070583	0.96310953
0.65504221	0.6385113	1.0	mpc2	1.208715	2.6279897	-0.21055962
0.05489308	NA	0.1	pi_estimate	NA	0.1691957	-0.05940955
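
The coverage statement can be checked directly from the table. The sketch below re-enters the three log-RR rows by hand (values copied from the output above) and tests whether each simulated value falls between lower_bound and upper_bound; the `covered` and `bias` columns are names introduced here for illustration.

```r
res <- data.frame(
  annota         = c("Lof: ptv > 0.995",
                     "union of spidex_low3 and spliceai, without ptv",
                     "mpc2"),
  joint_estim_pi = c(2.60486096, 1.63200522, 0.65504221),
  sim_logRR      = c(3, 2, 1),
  lower_bound    = c(1.72651467, 0.96310953, -0.21055962),
  upper_bound    = c(3.7379131, 2.7070583, 2.6279897),
  stringsAsFactors = FALSE
)

# TRUE when the simulated log-RR lies inside the interval
res$covered <- res$sim_logRR >= res$lower_bound &
               res$sim_logRR <= res$upper_bound

# Difference between the joint estimate and the simulated value
res$bias <- res$joint_estim_pi - res$sim_logRR

res[, c("annota", "sim_logRR", "covered", "bias")]
```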

FDR

The empirical FDR (g_false / g_identified) is below the nominal FDR cutoff at every threshold tested.

all_g = rep(4896,4) 
risk_g = rep(239,4)
pi = rep(239/4896,4)
FDR_cutoff = c(.2,.4,.6,.8)
g_identified = c(2,6,19,122)
g_false = c(0,2,10,95)
df = data.frame(all_g,risk_g,pi,FDR_cutoff,g_identified,g_false)
df$FDR = df$g_false / df$g_identified
df

A data.frame: 4 × 7
all_g	risk_g	pi	FDR_cutoff	g_identified	g_false	FDR
<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
4896	239	0.04881536	0.2	2	0	0.0000000
4896	239	0.04881536	0.4	6	2	0.3333333
4896	239	0.04881536	0.6	19	10	0.5263158
4896	239	0.04881536	0.8	122	95	0.7786885
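
For context, a sketch of how genes might be called at a given cutoff, assuming the usual Bayesian FDR (q-value) procedure of TADA-style analyses: rank genes by posterior probability of being a risk gene and take the running mean of the posterior null probabilities. The notes do not state which procedure produced g_identified, and the posterior values below are made up.

```r
# ASSUMPTION: hypothetical posterior probabilities of being a risk gene
post_risk <- c(0.99, 0.95, 0.70, 0.40, 0.10, 0.02)

bayes_qvalue <- function(post_risk) {
  ord  <- order(post_risk, decreasing = TRUE)
  null <- 1 - post_risk[ord]              # posterior null probability, best gene first
  q    <- cumsum(null) / seq_along(null)  # expected FDR among the top-k genes
  q[order(ord)]                           # return q-values in the input order
}

q      <- bayes_qvalue(post_risk)
called <- q < 0.2   # genes identified at FDR cutoff 0.2
sum(called)
```

In a simulation, the empirical FDR at that cutoff is then the fraction of `called` genes whose true status is non-risk, which is what the FDR column above reports.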

effect size of 14 SNV annotations

https://yfu1116.github.io/project/2021-05-20-fix-pi-VS-estim-pi-effect-size-num-enrich/

effect size of frameshift

Calibrate the frameshift rate using the number of non-frameshift indels observed in siblings:

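fit = glm(observed_nonfs_count ~ SNV_rate_2N, family = poisson(link = "log"), data = df)

# fitted non-frameshift rate per gene from the Poisson regression
coefs = coef(fit)
nonfs_rate_2N = exp(coefs[1] + coefs[2] * df$SNV_rate_2N)

# scale by the overall ratio of frameshift to non-frameshift counts
frameshift_rate_2N = nonfs_rate_2N * (fs_count / nonfs_count)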
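
A self-contained sketch of the calibration steps above on synthetic data, in case it helps to see them run end to end. The simulated `df`, the coefficients used to generate it, and the values of `fs_count` and `nonfs_count` are all hypothetical; only the model formula and the scaling step come from the notes.

```r
set.seed(1)

# ASSUMPTION: synthetic per-gene data standing in for the sibling table
n_genes <- 2000
df <- data.frame(SNV_rate_2N = runif(n_genes, 0.01, 1))
df$observed_nonfs_count <- rpois(n_genes, exp(-4 + 1.5 * df$SNV_rate_2N))

# Poisson regression of non-frameshift counts on the SNV rate
fit <- glm(observed_nonfs_count ~ SNV_rate_2N,
           family = poisson(link = "log"), data = df)

# Same quantity as exp(coef[1] + coef[2] * SNV_rate_2N)
nonfs_rate_2N <- predict(fit, type = "response")

# ASSUMPTION: hypothetical overall frameshift and non-frameshift counts
fs_count    <- 120
nonfs_count <- 60

frameshift_rate_2N <- nonfs_rate_2N * (fs_count / nonfs_count)
summary(frameshift_rate_2N)
```

Note that the formula puts SNV_rate_2N itself, not its logarithm, on the log scale, following the model as written in the notes.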

estimate effect size using EM

RR = 22.5, logRR = 3.1
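
The notes do not spell out the EM, so the following is only one plausible form: a two-component Poisson mixture in which a gene is a risk gene with probability pi, and the frameshift count has mean `mu` in non-risk genes and `rr * mu` in risk genes. The function `em_rr` and the simulated data are hypothetical and are not the implementation behind the reported RR.

```r
set.seed(1)

# ASSUMPTION: synthetic data; mu plays the role of frameshift_rate_2N
n_genes <- 5000
mu      <- runif(n_genes, 0.01, 0.1)
is_risk <- rbinom(n_genes, 1, 0.05)
x       <- rpois(n_genes, mu * ifelse(is_risk == 1, 22.5, 1))

em_rr <- function(x, mu, pi0 = 0.1, rr0 = 2, n_iter = 200) {
  pi_hat <- pi0
  rr     <- rr0
  loglik <- numeric(n_iter)
  for (it in seq_len(n_iter)) {
    # E-step: posterior probability that each gene is a risk gene
    p1 <- pi_hat * dpois(x, rr * mu)
    p0 <- (1 - pi_hat) * dpois(x, mu)
    loglik[it] <- sum(log(p1 + p0))   # log-likelihood at the current parameters
    z <- p1 / (p1 + p0)
    # M-step: closed-form updates for the mixing proportion and relative risk
    pi_hat <- mean(z)
    rr     <- sum(z * x) / sum(z * mu)
  }
  list(pi = pi_hat, rr = rr, log_rr = log(rr), loglik = loglik)
}

fit <- em_rr(x, mu)
fit[c("pi", "rr", "log_rr")]
```

The relative-risk update is the weighted ratio of observed to expected counts, which is the maximizer of the expected complete-data Poisson log-likelihood for this model.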

Details: https://yfu1116.github.io/project/2021-05-24-nonfs-rate-fs-gama-Copy1/

novel genes

desc = c("baseline (MPC + PTV)",
         "all SNV annota",
         "all SNV annota + frameshift")

pi_method1 = c("fix","fix","--")
num_risk_g1 = c(42,66,"--")
pi_method2 = c("estimate","estimate","estimate")
num_risk_g2 = c(54,81,172)
data.frame(desc,pi_method1,num_risk_g1 ,pi_method2,num_risk_g2 )

A data.frame: 3 × 5
desc	pi_method1	num_risk_g1	pi_method2	num_risk_g2
<fct>	<fct>	<fct>	<fct>	<dbl>
baseline (MPC + PTV)	fix	42	estimate	54
all SNV annota	fix	66	estimate	81
all SNV annota + frameshift	--	--	estimate	172

comparison = c("`snv (estim pi)` vs. `snv (fix pi)`",
              "`frameshift + snv` vs. `snv`",
              "`frameshift + snv` vs. `baseline`")
risk_g = c("81 vs. 66","172 vs. 81","172 vs. 54")
novel_g = c(15,93,126)
novel_g_enrich_terms = c("Neurodevelopmental Disorders","Neurodevelopmental Disorders","Neurodevelopmental Disorders")
conclusion = c("`estimate pi` outperforms `fix pi`","frameshift model works","other annota helps to identify ND genes ")
data.frame(comparison,risk_g,novel_g,novel_g_enrich_terms,conclusion)

A data.frame: 3 × 5
comparison	risk_g	novel_g	novel_g_enrich_terms	conclusion
<fct>	<fct>	<dbl>	<fct>	<fct>
`snv (estim pi)` vs. `snv (fix pi)`	81 vs. 66	15	Neurodevelopmental Disorders	`estimate pi` outperforms `fix pi`
`frameshift + snv` vs. `snv`	172 vs. 81	93	Neurodevelopmental Disorders	frameshift model works
`frameshift + snv` vs. `baseline`	172 vs. 54	126	Neurodevelopmental Disorders	other annota helps to identify ND genes
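
A small sketch of how the novel_g counts could be derived, assuming "novel" means genes called by the first model but not by the second. The gene identifiers are placeholders, not actual results.

```r
# ASSUMPTION: placeholder gene identifiers, not genes from the analysis
baseline_genes <- c("GENE_A", "GENE_B", "GENE_C")
fs_snv_genes   <- c("GENE_A", "GENE_B", "GENE_D", "GENE_E")

novel_genes  <- setdiff(fs_snv_genes, baseline_genes)   # called only by the larger model
lost_genes   <- setdiff(baseline_genes, fs_snv_genes)   # called only by the reference model
shared_genes <- intersect(fs_snv_genes, baseline_genes)

c(novel = length(novel_genes), lost = length(lost_genes), shared = length(shared_genes))
```

Under that reading, novel_g need not equal the difference between the two counts: 126 novel genes out of 172 would imply 46 genes shared with the 54 baseline genes.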

Enrichment analysis for novel ASD risk genes from frameshift + snv vs. baseline (merge genes from the same family)

DisGeNET (gene-disease association)

Additional issue:

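On overlapping genes:

The windows retain overlapping intervals. For some genes, such as the UGT family, most of the exon regions overlap, so the model identifies all of these overlapping genes together. This inflates the gene list and lowers the enrichment of the relevant category (Neurodevelopmental Disorders) in the enrichment analysis. Therefore only one gene from each overlapping group is kept.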

UGT1A3
UGT1A4
UGT1A5
UGT1A7
UGT1A10
UGT1A9
UGT1A1
UGT1A6
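
One way the "keep only one" step could be sketched: group genes whose intervals overlap on the same chromosome and keep a single representative per group. The coordinates below are made up, GENE_X and GENE_Y are placeholders, and each gene is reduced to one interval for simplicity, whereas the notes describe overlap at the exon level.

```r
# ASSUMPTION: made-up coordinates purely for illustration
genes <- data.frame(
  gene  = c("UGT1A1", "UGT1A3", "UGT1A4", "GENE_X", "GENE_Y"),
  chr   = c("chr2", "chr2", "chr2", "chr2", "chr5"),
  start = c(100, 120, 150, 900, 100),
  end   = c(500, 500, 500, 950, 300),
  stringsAsFactors = FALSE
)

genes <- genes[order(genes$chr, genes$start), ]

# Start a new cluster when the chromosome changes or the interval begins
# after everything seen so far in the current cluster has ended
cluster <- integer(nrow(genes))
cur_end <- -Inf
cur_chr <- ""
id      <- 0L
for (i in seq_len(nrow(genes))) {
  if (genes$chr[i] != cur_chr || genes$start[i] > cur_end) {
    id      <- id + 1L
    cur_chr <- genes$chr[i]
    cur_end <- genes$end[i]
  } else {
    cur_end <- max(cur_end, genes$end[i])
  }
  cluster[i] <- id
}
genes$cluster <- cluster

# Keep one representative per cluster (here simply the first by position)
kept <- genes[!duplicated(genes$cluster), ]
kept$gene
```

Which member to keep is a design choice; picking the first by position is arbitrary, and the gene with the strongest signal would be an equally reasonable representative.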

potentially damaging mutations

CHD

effect size of all SNV annotations

effect size of frameshift

novel genes

Enrichment analysis

candidate genes

potentially damaging variants