Check RR, prior, and FDR by simulation

parameters: 3 annotations (Lof: ptv > 0.995, union of spidex_low3 and spliceai, without ptv, mpc>2) , log-RR = 3, 2,1

pi = 0.05, Sample size N = 6000, num_of_genes = 5000

logRR and pi (averaged over 100 rounds of simulation)

1
readRDS("/storage11_7T/fuy/TADA-A/cell_WES/DNM/simulation/org.rr.mean.rds")
A matrix: 3 × 4 of type chr
annotaLof: ptv > 0.995union of spidex_low3 and spliceai, without ptvmpc>2 prior
sim_par3.00 2.00 1.00 0.05
joint_estim_average2.7197 1.7714 0.60390.0926
  • estimated parameter大体上接近simulation的设定值,但有时会估出CI (Confidence Interval) 特别大的结果,如:
1
readRDS("/storage11_7T/fuy/TADA-A/cell_WES/DNM/simulation/dt19_allrrinfo.rds")
A data.frame: 4 × 9
annotajoint_estimjoint_estim_CIfix_pi0.05_jointfix_pi0.05_sepfix_pi0.05_sep_CIobs_DNMbg_DNMburden(obs/bg)
<chr><dbl><chr><chr><chr><chr><dbl><dbl><dbl>
Lof: ptv > 0.995 1.2852623[-0.103, 2.674] 2.256291 2.469966 [1.275, 3.665] 15101.5000000
union of spidex_low3 and spliceai, without ptv 1.1044304[0.16, 2.049] 1.986388 2.044128 [1.361, 2.727] 81581.3965517
mpc>2 -3.4387649[-25.695, 18.818]-4.936559 -1 [-5.296, 3.296] 49610.8032787
prior 0.1991692[-0.057, 0.455] fix prior = 0.05fix prior = 0.05fix prior = 0.05NANA NA

FDR (averaged over 100 rounds of simulation)

  • it seems that 0.1 is feasible.
1
readRDS("/storage11_7T/fuy/TADA-A/cell_WES/DNM/simulation/org.fdr.check.res.rds")
A data.frame: 8 × 2
cutoff.lstfdr.average
<dbl><dbl>
0.020.02165
0.050.04120
0.080.08554
0.100.10238
0.150.17765
0.200.24224
0.300.31463
0.500.53028

two simulated dataset (‘Rdata’ files for mutation rate and mutation count)

  • dt9 (estimated RRs are normal)
  • dt19(estimated RRs are abnormal)

download: https://github.com/yfu1116/project/tree/master/files/RData

ASD

RR of 14 functional annotations

effect size of frameshift

method 1

  • derived inframe rate using GLM
1
glm(formula = inframe count from the affected ~ calibrated_SNV_rate * 2N + GC_content, family = poisson(link = "log"), data = df)

method 2

  • took the RR of nonsense as the RR of frameshift (fix prior = 0.05), logRR = 1.952505, CI=[1.782561, 2.122449]

There is a big difference of results between these two methods.

novel genes (method 1)

  • FDR = 0.1
  • 158 risk genes were identified, including 105 novel genes with overlapping genes excluded

For overlapping genes:

window中保留了overlap的区间,因此有些与risk genes重叠较多的基因,如UGT系列,也会被模型找出,目前采取的办法是只保留文章里DNM对应的那个

1
UGT1A3, UGT1A4, UGT1A5, UGT1A7, UGT1A10, UGT1A9, UGT1A1, UGT1A6

Enrichment analysis

105 novel ASD risk genes (frameshift + snv vs. baseline; remove the redundant)

GO

fYpoPx.png

DisGeNET (gene-disease association)

fY9iQS.png

potentially damaging mutations

CHD

effect size of 13 SNV annotations, pi=0.022

https://yfu1116.github.io/project/2021-06-01-CHD-selected13-all-rr-info/

effect size of frameshift

method 1

  • RR =52.85 , logRR=3.97

method 2

  • RR 105.6, = logRR = 4.66

novel genes (method 1)

Enrichment analysis

32 risk genes

GO
fULVJK.png

DisGeNET (gene-disease association)

fUL2y4.png

GWAS signals

We do not find any CHD GWAS signals associated with these risk genes.

potentially damaging variants

all info: https://github.com/yfu1116/project/blob/master/files/filtered.indel%2Bsnv_CHD_GO_DisGeNET.xlsx

new DD dataset to be explored

Question

  • How to explain the outliers in RRs (account for 20%) estimated using simulated data?
  • There is a big difference of frameshift RR between method 1 and method 2. I’m confused why the RR of nonsense could be a substitution for the RR of frameshift.