Current progress and analysis

  • See 1, 2 for details

Next steps

  • 100 rounds of simulation (not yet finished running)
  • Finish the CHD part
  • Select candidate genes

ASD

simulation

parameters:

3 annotations (Lof: ptv > 0.995; union of spidex_low3 and spliceai, without ptv; mpc > 2), log-RR = 3, 2, 1 respectively

pi = 0.05, Sample size N = 6000, num_of_genes = 5000

average of 100 rounds simulation
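
To make the setup concrete, here is a minimal sketch of how one simulation round could be generated under these parameters. It is not the TADA-A simulation code itself: the gene-level Poisson model, the toy mutation-rate range, and the names `mu`, `is_risk`, and `dnm` are assumptions for illustration. The log-RR values follow the sim_logRR column of the results table.

```r
set.seed(1)

num_of_genes <- 5000
N            <- 6000    # sample size
pi_risk      <- 0.05    # proportion of risk genes
log_rr <- c(lof = 3, splice = 2, mpc2 = 1)   # simulated log relative risks

# ASSUMPTION: toy per-gene, per-annotation mutation rates (not real rates)
mu <- matrix(runif(num_of_genes * 3, 1e-7, 1e-5), ncol = 3,
             dimnames = list(NULL, names(log_rr)))

# Latent risk status of each gene
is_risk <- rbinom(num_of_genes, 1, pi_risk)

# Relative risk is exp(logRR) in risk genes and 1 in all other genes
rr <- exp(outer(is_risk, log_rr))

# Expected de novo counts over 2N chromosomes, then Poisson sampling
expected <- 2 * N * mu * rr
dnm <- matrix(rpois(length(expected), expected), ncol = 3,
              dimnames = dimnames(mu))

head(dnm)
```

Repeating this with different seeds and averaging the estimates would correspond to the multi-round averaging described above.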

effect size

  • sim_logRR is within the confidence interval
  • pi_estimated = 0.055
readRDS("/storage11_7T/fuy/TADA-A/cell_WES/DNM/simulation/rr_allinfo.dt.rds")
A data.frame: 4 × 7
joint_estim_pi  joint_fix_pi0.05  sim_logRR  annota                                          separate_fix_pi  upper_bound  lower_bound
<dbl>           <dbl>             <dbl>      <chr>                                           <dbl>            <dbl>        <dbl>
2.60486096      2.6759428         3.0        Lof: ptv > 0.995                                2.732214         3.7379131     1.72651467
1.63200522      1.6966532         2.0        union of spidex_low3 and spliceai, without ptv  1.835084         2.7070583     0.96310953
0.65504221      0.6385113         1.0        mpc2                                            1.208715         2.6279897    -0.21055962
0.05489308      NA                0.1        pi_estimate                                     NA               0.1691957    -0.05940955
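
The coverage claim can be checked directly from the table. The sketch below re-enters the three annotation rows by hand and tests whether sim_logRR falls between the reported bounds. The `covered` and `bias` columns are added here for illustration and are not part of the saved object.

```r
res <- data.frame(
  annota         = c("Lof: ptv > 0.995",
                     "union of spidex_low3 and spliceai, without ptv",
                     "mpc2"),
  sim_logRR      = c(3, 2, 1),
  joint_estim_pi = c(2.60486096, 1.63200522, 0.65504221),
  lower_bound    = c(1.72651467, 0.96310953, -0.21055962),
  upper_bound    = c(3.7379131, 2.7070583, 2.6279897)
)

# Is the simulated value inside the reported interval?
res$covered <- res$sim_logRR >= res$lower_bound &
               res$sim_logRR <= res$upper_bound

# Difference between the joint estimate (pi estimated) and the simulated value
res$bias <- res$joint_estim_pi - res$sim_logRR

res[, c("annota", "covered", "bias")]
```

All three intervals contain the simulated value, although the joint estimates sit somewhat below the simulated log-RR in each row.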

FDR

The observed FDR is below the nominal cutoff at every cutoff tested.

all_g = rep(4896,4) 
risk_g = rep(239,4)
pi = rep(239/4896,4)
FDR_cutoff = c(.2,.4,.6,.8)
g_identified = c(2,6,19,122)
g_false = c(0,2,10,95)
df = data.frame(all_g,risk_g,pi,FDR_cutoff,g_identified,g_false)
df$FDR = df$g_false / df$g_identified
df
A data.frame: 4 × 7
all_g  risk_g  pi          FDR_cutoff  g_identified  g_false  FDR
<dbl>  <dbl>   <dbl>       <dbl>       <dbl>         <dbl>    <dbl>
4896   239     0.04881536  0.2           2             0      0.0000000
4896   239     0.04881536  0.4           6             2      0.3333333
4896   239     0.04881536  0.6          19            10      0.5263158
4896   239     0.04881536  0.8         122            95      0.7786885
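
As a small follow-up on the same numbers, the sketch below checks the FDR claim programmatically and also derives how many true risk genes are recovered at each cutoff. The `g_true` and `recall` columns are not in the original analysis. They assume recall is defined as true positives divided by the 239 simulated risk genes.

```r
df <- data.frame(
  FDR_cutoff   = c(.2, .4, .6, .8),
  g_identified = c(2, 6, 19, 122),
  g_false      = c(0, 2, 10, 95)
)
risk_g <- 239   # number of simulated risk genes

# Observed FDR and whether it stays under the nominal cutoff
df$FDR          <- df$g_false / df$g_identified
df$below_cutoff <- df$FDR < df$FDR_cutoff

# ASSUMED definition: recall = true positives / all simulated risk genes
df$g_true <- df$g_identified - df$g_false
df$recall <- df$g_true / risk_g

df
```

This makes it visible that FDR control holds while the number of correctly identified risk genes remains small relative to 239.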

effect size of 14 SNV annotations

https://yfu1116.github.io/project/2021-05-20-fix-pi-VS-estim-pi-effect-size-num-enrich/

effect size of frameshift

  • calibrate frameshift rate using the number of non-frameshift from siblings
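# Poisson regression of the sibling non-frameshift counts on the SNV mutation rate
fit = glm(observed_nonfs_count ~ SNV_rate_2N, family = poisson(link = "log"), data = df)
coefs = coef(fit)

# Calibrated non-frameshift rate per gene
nonfs_rate_2N = exp(coefs[1] + coefs[2] * df$SNV_rate_2N)

# Scale by the frameshift / non-frameshift count ratio (fs_count and nonfs_count are the two counts)
frameshift_rate_2N = nonfs_rate_2N * (fs_count / nonfs_count)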
  • estimate effect size using EM

    RR = 22.5, logRR = 3.1
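
For readers unfamiliar with the EM step, here is a minimal sketch of EM for a two-component Poisson mixture with a shared relative risk. It is an illustration under stated assumptions, not the code used for the estimate above. The model (risk genes have rate mu * RR, other genes have rate mu), the function name `em_rr`, and the toy data are all hypothetical. The simulated RR is set to 22.5 only to mirror the scale of the reported value, and log(22.5) is about 3.11, consistent with the logRR quoted above.

```r
set.seed(42)

# Toy data (ASSUMED): mu is the expected frameshift count per gene under the null
n_genes <- 5000
mu      <- runif(n_genes, 0.01, 0.1)
z_true  <- rbinom(n_genes, 1, 0.05)
x       <- rpois(n_genes, mu * ifelse(z_true == 1, 22.5, 1))

em_rr <- function(x, mu, pi_init = 0.05, rr_init = 2,
                  estimate_pi = TRUE, max_iter = 500, tol = 1e-8) {
  pi_hat <- pi_init
  rr_hat <- rr_init
  for (iter in seq_len(max_iter)) {
    # E-step: posterior probability that each gene is a risk gene
    l1 <- pi_hat * dpois(x, mu * rr_hat)
    l0 <- (1 - pi_hat) * dpois(x, mu)
    w  <- l1 / (l1 + l0)

    # M-step: weighted Poisson MLE of RR, and optionally of pi
    rr_new <- sum(w * x) / sum(w * mu)
    pi_new <- if (estimate_pi) mean(w) else pi_hat

    converged <- abs(rr_new - rr_hat) < tol && abs(pi_new - pi_hat) < tol
    rr_hat <- rr_new
    pi_hat <- pi_new
    if (converged) break
  }
  list(pi = pi_hat, RR = rr_hat, logRR = log(rr_hat))
}

fit <- em_rr(x, mu)
fit
```

Setting `estimate_pi = FALSE` keeps pi at its initial value, which corresponds to the "fix pi" variant compared elsewhere in these notes.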

details: https://yfu1116.github.io/project/2021-05-24-nonfs-rate-fs-gama-Copy1/

novel genes

desc = c("baseline (MPC + PTV)",
"all SNV annota",
"all SNV annota + frameshift")

pi_method1 = c("fix","fix","--")
num_risk_g1 = c(42,66,"--")
pi_method2 = c("estimate","estimate","estimate")
num_risk_g2 = c(54,81,172)
data.frame(desc,pi_method1,num_risk_g1 ,pi_method2,num_risk_g2 )
A data.frame: 3 × 5
desc                         pi_method1  num_risk_g1  pi_method2  num_risk_g2
<fct>                        <fct>       <fct>        <fct>       <dbl>
baseline (MPC + PTV)         fix         42           estimate     54
all SNV annota               fix         66           estimate     81
all SNV annota + frameshift  --          --           estimate    172
comparison = c("`snv (estim pi)` vs. `snv (fix pi)`",
"`frameshift + snv` vs. `snv`",
"`frameshift + snv` vs. `baseline`")
risk_g = c("81 vs. 66","172 vs. 81","172 vs. 54")
novel_g = c(15,93,126)
novel_g_enrich_terms = c("Neurodevelopmental Disorders","Neurodevelopmental Disorders","Neurodevelopmental Disorders")
conclusion = c("`estimate pi` outperforms `fix pi`","frameshift model works","other annotations help identify ND genes")
data.frame(comparison,risk_g,novel_g,novel_g_enrich_terms,conclusion)
A data.frame: 3 × 5
comparison                           risk_g      novel_g  novel_g_enrich_terms          conclusion
<fct>                                <fct>       <dbl>    <fct>                         <fct>
`snv (estim pi)` vs. `snv (fix pi)`  81 vs. 66    15      Neurodevelopmental Disorders  `estimate pi` outperforms `fix pi`
`frameshift + snv` vs. `snv`         172 vs. 81   93      Neurodevelopmental Disorders  frameshift model works
`frameshift + snv` vs. `baseline`    172 vs. 54  126      Neurodevelopmental Disorders  other annotations help identify ND genes

Enrichment analysis for novel ASD risk genes from frameshift + snv vs. baseline (merge genes from the same family)

GO

2PLWCt.png

DisGeNET (gene-disease association)

2POSrF.png

Additional issue:

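About overlapping genes:

The windows keep overlapping intervals. For some genes, such as the UGT family, most of the exon regions overlap, so the model picks up all of these overlapping genes together. This also lowers the enrichment of the relevant category (Neurodevelopmental Disorders) in the enrichment analysis, so only one gene from each such group is kept.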

UGT1A3
UGT1A4
UGT1A5
UGT1A7
UGT1A10
UGT1A9
UGT1A1
UGT1A6
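
One simple way to keep a single representative per group is sketched below. It is a name-based shortcut and not necessarily how the merge was done in the analysis. The rule (strip trailing digits to form a family key) is an assumption that works for this UGT1A list but would wrongly merge unrelated genes that happen to share a prefix, so a coordinate-based overlap check would be more robust in general.

```r
genes <- c("UGT1A3", "UGT1A4", "UGT1A5", "UGT1A7",
           "UGT1A10", "UGT1A9", "UGT1A1", "UGT1A6")

# ASSUMED rule: the family key is the gene symbol without its trailing digits
family <- sub("[0-9]+$", "", genes)

# Keep the first gene seen in each family
representatives <- genes[!duplicated(family)]

representatives
```

With this list, every gene maps to the key "UGT1A" and only the first entry is retained.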

potentially damaging mutations

candidates:

CHD

effect size of all SNV annotations

  • ptv 0-0.05 behaves oddly in the joint estimation; remove it and rerun the joint estimation

effect size of frameshift

novel genes

Enrichment analysis

candidate genes

potentially damaging variants