Check RR, prior, and FDR by simulation

parameters: 3 annotations (Lof: ptv > 0.995, union of spidex_low3 and spliceai, without ptv, mpc>2) , log-RR = 3, 2,1

pi = 0.05, Sample size N = 6000, num_of_genes = 5000

logRR and pi （averaged over 100 rounds of simulation）

1	readRDS("/storage11_7T/fuy/TADA-A/cell_WES/DNM/simulation/org.rr.mean.rds")

A matrix: 3 × 4 of type chr
annota	Lof: ptv > 0.995	union of spidex_low3 and spliceai, without ptv	mpc>2	prior
sim_par	3.00	2.00	1.00	0.05
joint_estim_average	2.7197	1.7714	0.6039	0.0926

estimated parameter大体上接近simulation的设定值，但有时会估出CI (Confidence Interval) 特别大的结果，如：

1	readRDS("/storage11_7T/fuy/TADA-A/cell_WES/DNM/simulation/dt19_allrrinfo.rds")

A data.frame: 4 × 9
annota	joint_estim	joint_estim_CI	fix_pi0.05_joint	fix_pi0.05_sep	fix_pi0.05_sep_CI	obs_DNM	bg_DNM	burden(obs/bg)
<chr>	<dbl>	<chr>	<chr>	<chr>	<chr>	<dbl>	<dbl>	<dbl>
Lof: ptv > 0.995	1.2852623	[-0.103, 2.674]	2.256291	2.469966	[1.275, 3.665]	15	10	1.5000000
union of spidex_low3 and spliceai, without ptv	1.1044304	[0.16, 2.049]	1.986388	2.044128	[1.361, 2.727]	81	58	1.3965517
mpc>2	-3.4387649	[-25.695, 18.818]	-4.936559	-1	[-5.296, 3.296]	49	61	0.8032787
prior	0.1991692	[-0.057, 0.455]	fix prior = 0.05	fix prior = 0.05	fix prior = 0.05	NA	NA	NA

FDR (averaged over 100 rounds of simulation)

it seems that 0.1 is feasible.

1	readRDS("/storage11_7T/fuy/TADA-A/cell_WES/DNM/simulation/org.fdr.check.res.rds")

A data.frame: 8 × 2
cutoff.lst	fdr.average
<dbl>	<dbl>
0.02	0.02165
0.05	0.04120
0.08	0.08554
0.10	0.10238
0.15	0.17765
0.20	0.24224
0.30	0.31463
0.50	0.53028

two simulated dataset (‘Rdata’ files for mutation rate and mutation count)

dt9 (estimated RRs are normal)
dt19(estimated RRs are abnormal)

download: https://github.com/yfu1116/project/tree/master/files/RData

ASD

RR of 14 functional annotations

SNV rate was calibrated using the number of synonymous mutations.
We fixed the prior as 0.05 in the initial feature selection,and took all significant ones into joint estimation, where pi was also regarded as a parameter.
pi = 0.12 in joint estimation
details: https://yfu1116.github.io/project/2021-05-20-fix-pi-VS-estim-pi-effect-size-num-enrich/

effect size of frameshift

method 1

derived inframe rate using GLM

1	glm(formula = inframe count from the affected ~ calibrated_SNV_rate * 2N + GC_content, family = poisson(link = "log"), data = df)

frameshift rate = (frameshift count / inframe count from the unaffected) * inframe rate
estimated relative risk of frameshift using EM, RR = 17.7, logRR = 2.9
details: https://yfu1116.github.io/project/2021-08-10-nonfs-rate-fs-gama-report/

method 2

took the RR of nonsense as the RR of frameshift (fix prior = 0.05), logRR = 1.952505, CI=[1.782561, 2.122449]

There is a big difference of results between these two methods.

novel genes (method 1)

FDR = 0.1
158 risk genes were identified, including 105 novel genes with overlapping genes excluded

For overlapping genes:

window中保留了overlap的区间，因此有些与risk genes重叠较多的基因，如UGT系列，也会被模型找出，目前采取的办法是只保留文章里DNM对应的那个

1	UGT1A3, UGT1A4, UGT1A5, UGT1A7, UGT1A10, UGT1A9, UGT1A1, UGT1A6

comparison between the results of baseline (MPC+PTV) and the annotations we used
https://yfu1116.github.io/project/2021-08-10-ASD-CHD-res-comparison/

Enrichment analysis

105 novel ASD risk genes (frameshift + snv vs. baseline; remove the redundant)

DisGeNET (gene-disease association)

potentially damaging mutations

Some of the synonymous DNMs might be pathogenic by disrupting splicing or RNA-binding. For instance, a synonymous mutation in KCNMA1 gene, might contribute to the etiology of ASD through regulation of argonaute protein (AGO) binding.(to be further investigated)
all info: https://github.com/yfu1116/project/blob/master/files/filtered.indel%2Bsnv-baseline_ASD_GO_DisGeNET.xlsx

CHD

effect size of 13 SNV annotations, pi=0.022

https://yfu1116.github.io/project/2021-06-01-CHD-selected13-all-rr-info/

effect size of frameshift

method 1

RR =52.85 , logRR=3.97

method 2

RR 105.6, = logRR = 4.66

novel genes (method 1)

FDR = 0.1
32 risk genes were identified, including 24 novel genes (compared to the baseline (MPC+PTV)) with overlapping genes excluded
results comparison between baseline and other annotations
https://yfu1116.github.io/project/2021-08-10-ASD-CHD-res-comparison/

Enrichment analysis

32 risk genes

DisGeNET (gene-disease association)

GWAS signals

We do not find any CHD GWAS signals associated with these risk genes.

potentially damaging variants

all info: https://github.com/yfu1116/project/blob/master/files/filtered.indel%2Bsnv_CHD_GO_DisGeNET.xlsx

new DD dataset to be explored

https://www.nature.com/articles/s41586-020-2832-5#Sec2
45,221 coding and splicing DNMs in 31,058 individuals

Question

How to explain the outliers in RRs (account for 20%) estimated using simulated data?
There is a big difference of frameshift RR between method 1 and method 2. I’m confused why the RR of nonsense could be a substitution for the RR of frameshift.