Check RR, prior, and FDR by simulation
parameters: 3 annotations (Lof: ptv > 0.995, union of spidex_low3 and spliceai, without ptv, mpc>2) , log-RR = 3, 2,1
pi = 0.05, Sample size N = 6000, num_of_genes = 5000
logRR and pi (averaged over 100 rounds of simulation)
1 | readRDS("/storage11_7T/fuy/TADA-A/cell_WES/DNM/simulation/org.rr.mean.rds") |
| annota | Lof: ptv > 0.995 | union of spidex_low3 and spliceai, without ptv | mpc>2 | prior |
|---|---|---|---|---|
| sim_par | 3.00 | 2.00 | 1.00 | 0.05 |
| joint_estim_average | 2.7197 | 1.7714 | 0.6039 | 0.0926 |
- estimated parameter大体上接近simulation的设定值,但有时会估出CI (Confidence Interval) 特别大的结果,如:
1 | readRDS("/storage11_7T/fuy/TADA-A/cell_WES/DNM/simulation/dt19_allrrinfo.rds") |
| annota | joint_estim | joint_estim_CI | fix_pi0.05_joint | fix_pi0.05_sep | fix_pi0.05_sep_CI | obs_DNM | bg_DNM | burden(obs/bg) |
|---|---|---|---|---|---|---|---|---|
| <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> |
| Lof: ptv > 0.995 | 1.2852623 | [-0.103, 2.674] | 2.256291 | 2.469966 | [1.275, 3.665] | 15 | 10 | 1.5000000 |
| union of spidex_low3 and spliceai, without ptv | 1.1044304 | [0.16, 2.049] | 1.986388 | 2.044128 | [1.361, 2.727] | 81 | 58 | 1.3965517 |
| mpc>2 | -3.4387649 | [-25.695, 18.818] | -4.936559 | -1 | [-5.296, 3.296] | 49 | 61 | 0.8032787 |
| prior | 0.1991692 | [-0.057, 0.455] | fix prior = 0.05 | fix prior = 0.05 | fix prior = 0.05 | NA | NA | NA |
FDR (averaged over 100 rounds of simulation)
- it seems that 0.1 is feasible.
1 | readRDS("/storage11_7T/fuy/TADA-A/cell_WES/DNM/simulation/org.fdr.check.res.rds") |
| cutoff.lst | fdr.average |
|---|---|
| <dbl> | <dbl> |
| 0.02 | 0.02165 |
| 0.05 | 0.04120 |
| 0.08 | 0.08554 |
| 0.10 | 0.10238 |
| 0.15 | 0.17765 |
| 0.20 | 0.24224 |
| 0.30 | 0.31463 |
| 0.50 | 0.53028 |
two simulated dataset (‘Rdata’ files for mutation rate and mutation count)
- dt9 (estimated RRs are normal)
- dt19(estimated RRs are abnormal)
download: https://github.com/yfu1116/project/tree/master/files/RData
ASD
RR of 14 functional annotations
SNV rate was calibrated using the number of synonymous mutations.
We fixed the prior as 0.05 in the initial feature selection,and took all significant ones into joint estimation, where pi was also regarded as a parameter.
pi = 0.12 in joint estimation
details: https://yfu1116.github.io/project/2021-05-20-fix-pi-VS-estim-pi-effect-size-num-enrich/
effect size of frameshift
method 1
- derived inframe rate using GLM
1 | glm(formula = inframe count from the affected ~ calibrated_SNV_rate * 2N + GC_content, family = poisson(link = "log"), data = df) |
frameshift rate = (frameshift count / inframe count from the unaffected) * inframe rate
estimated relative risk of frameshift using EM, RR = 17.7, logRR = 2.9
details: https://yfu1116.github.io/project/2021-08-10-nonfs-rate-fs-gama-report/
method 2
- took the RR of nonsense as the RR of frameshift (fix prior = 0.05), logRR = 1.952505, CI=[1.782561, 2.122449]
There is a big difference of results between these two methods.
novel genes (method 1)
- FDR = 0.1
- 158 risk genes were identified, including
105 novel geneswith overlapping genes excluded
For overlapping genes:
window中保留了overlap的区间,因此有些与risk genes重叠较多的基因,如UGT系列,也会被模型找出,目前采取的办法是只保留文章里DNM对应的那个
1 | UGT1A3, UGT1A4, UGT1A5, UGT1A7, UGT1A10, UGT1A9, UGT1A1, UGT1A6 |
- comparison between the results of baseline (MPC+PTV) and the annotations we used
https://yfu1116.github.io/project/2021-08-10-ASD-CHD-res-comparison/
Enrichment analysis
105 novel ASD risk genes (frameshift + snv vs. baseline; remove the redundant)
GO

DisGeNET (gene-disease association)

potentially damaging mutations
Some of the synonymous DNMs might be pathogenic by disrupting splicing or RNA-binding. For instance, a synonymous mutation in KCNMA1 gene, might contribute to the etiology of ASD through regulation of argonaute protein (AGO) binding.(to be further investigated)
CHD
effect size of 13 SNV annotations, pi=0.022
https://yfu1116.github.io/project/2021-06-01-CHD-selected13-all-rr-info/
effect size of frameshift
method 1
- RR =52.85 , logRR=3.97
method 2
- RR 105.6, = logRR = 4.66
novel genes (method 1)
- FDR = 0.1
- 32 risk genes were identified, including
24 novel genes(compared to the baseline (MPC+PTV)) with overlapping genes excluded - results comparison between baseline and other annotations
https://yfu1116.github.io/project/2021-08-10-ASD-CHD-res-comparison/
Enrichment analysis
32 risk genes
GO
DisGeNET (gene-disease association)

GWAS signals
We do not find any CHD GWAS signals associated with these risk genes.
potentially damaging variants
all info: https://github.com/yfu1116/project/blob/master/files/filtered.indel%2Bsnv_CHD_GO_DisGeNET.xlsx
new DD dataset to be explored
- https://www.nature.com/articles/s41586-020-2832-5#Sec2
- 45,221 coding and splicing DNMs in 31,058 individuals
Question
- How to explain the outliers in RRs (account for 20%) estimated using simulated data?
- There is a big difference of frameshift RR between method 1 and method 2. I’m confused why the RR of nonsense could be a substitution for the RR of frameshift.