Causal models: describe the mechanism that relates variables, so that the underlying system generating them can be understood
-Can be represented using graphs
-What does it mean for a variable X to cause variable Y? (X -> Y)
Ex1: Simple Example
Assumption: Two girls are taking an exam. X is Girl1's answer to an exam question and Y is Girl2's answer to the same question
Girl1: First X is realized as X=x
Girl2: Then Y is realized conditioned on X=x
=> There is a clear causal connection between random variables X and Y
First X is realized, then Y is realized based on the value of X
How can we see the asymmetry between the random variables?
The asymmetry cannot be seen from the observed data itself, but it can be revealed by performing an experiment (an intervention)
Situation1:
The teacher tells the answer to Girl1
(1)X is forced to take a certain value x
(2)Because of the underlying mechanism, Y is then realized from the forced value of X, so Girl2's answer changes as well
Situation2:
The teacher tells the answer to Girl2
(1)The value of Y is changed
(2)Because of the underlying mechanism, the value of X does not change
=> This is how we can see the asymmetry between random variables in a causal system
Summary
If you force a value on X, then this affects the answer Y
If you force a value on Y, then this doesn’t affect the answer X
This relation is asymmetric, and if the asymmetry can be observed, we can tell that X causes Y rather than Y causing X
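The asymmetry in the summary can be sketched as a small simulation. This is only an illustration under stated assumptions: the four answer options "A" to "D", the rule that Girl2 copies Girl1 exactly, and the function name sample_exam are hypothetical and not part of the lecture.

```python
import random

def sample_exam(rng, do_x=None, do_y=None):
    """Sample (x, y) for the two-student exam; do_x / do_y force a value."""
    # Girl1 picks her answer on her own unless the teacher forces it.
    x = rng.choice("ABCD") if do_x is None else do_x
    # Assumed mechanism: Girl2 copies Girl1 (Y depends on X)
    # unless the teacher forces her answer.
    y = x if do_y is None else do_y
    return x, y

rng = random.Random(0)
n = 10_000

# Situation1: intervene on X. Y follows the forced value.
forced_x = [sample_exam(rng, do_x="A") for _ in range(n)]
share_y_is_a = sum(y == "A" for _, y in forced_x) / n

# Situation2: intervene on Y. X keeps its original distribution.
forced_y = [sample_exam(rng, do_y="A") for _ in range(n)]
share_x_is_a = sum(x == "A" for x, _ in forced_y) / n

print(share_y_is_a)  # 1.0: forcing X changes Y
print(share_x_is_a)  # roughly 0.25: forcing Y leaves X untouched
```

Forcing X moves Y, while forcing Y leaves the distribution of X where it was, which is the signature of X -> Y.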
Ex2: Realistic Example
Assumption: We are given a dataset of binary variables
S: Heavy Smoker | C: Lung cancer before 60 |
---|---|
0 | 0 |
1 | 1 |
0 | 1 |
1 | 1 |
S: Binary variable indicating whether the person is a heavy smoker
C: Binary variable indicating whether the person gets lung cancer before age 60
How do we check if they are causally related?
Multiple explanations for the observed correlations might exist
Situation1:
Smoking -> Cancer
Smoking indeed causes cancer, so there is a causal dependency that shows up in the data
Situation2:
Gene -> Smoking
Gene -> Cancer
A more intricate explanation: there may be a gene that, when active, makes a person more likely to smoke, and the same gene also makes that person more likely to get cancer
=> This can also induce the observed correlation between smoking and cancer
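To see why observational data alone cannot separate the two situations, here is a minimal sketch that samples from both graphs and compares cancer rates between smokers and non-smokers. All probabilities and function names are made up for illustration and do not come from the lecture.

```python
import random

def sample_direct(rng):
    """Situation1: Smoking -> Cancer (hypothetical probabilities)."""
    s = rng.random() < 0.5
    c = rng.random() < (0.6 if s else 0.2)
    return s, c

def sample_gene(rng):
    """Situation2: Gene -> Smoking and Gene -> Cancer, no Smoking -> Cancer edge."""
    g = rng.random() < 0.5                  # the gene is never recorded in the dataset
    s = rng.random() < (0.8 if g else 0.2)
    c = rng.random() < (0.7 if g else 0.1)  # cancer depends on the gene only
    return s, c

def cancer_rate_gap(samples):
    """Estimate P(C=1 | S=1) - P(C=1 | S=0) from observational samples."""
    smokers = [c for s, c in samples if s]
    non_smokers = [c for s, c in samples if not s]
    return sum(smokers) / len(smokers) - sum(non_smokers) / len(non_smokers)

rng = random.Random(0)
gap_direct = cancer_rate_gap([sample_direct(rng) for _ in range(20_000)])
gap_gene = cancer_rate_gap([sample_gene(rng) for _ in range(20_000)])

# Both gaps come out clearly positive, so both graphs explain the correlation.
print(gap_direct, gap_gene)
```

In both cases smokers show a higher cancer rate, even though smoking has no effect on cancer in the second model.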
Task: Distinguish the two causal graphs from one another to find out which one is the true one
(1)Find 100 people at random -> Force them to smoke
(2)Collect data on whether they get cancer or not
(3)Find 100 other people at random -> Force them not to smoke
(4)Collect the same data for this second group
=>If cancer rates differ between the two subpopulations, then we can say smoking causes cancer.
This is a randomized experiment, a design that has been used in clinical trials for decades. It is also known as an intervention.
The intervention (randomized experiment) helps explain where the correlation comes from
How a randomized experiment (randomizing who smokes and who doesn’t) helps, in the language of DAGs:
Pre-intervention: Before the intervention, there are two candidate causal graphs
Situation2: Gene -> Smoking, Gene -> Cancer
Situation1: Smoking -> Cancer
Post-intervention: After the intervention
Situation2: Gene -/-> Smoking (edge cut), Gene -> Cancer
(1)Randomizing who smokes effectively cuts the edge between the Gene and Smoking variables
(2)Data is then observed from this modified system
(3)In Situation2, after cutting the edge, smoking and cancer become independent
(4)In Situation1, intervening on smoking does not change the dependency, so the correlation between smoking and cancer still exists
=> This is how an intervention or randomized experiment helps distinguish the two causal graphs from one another
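The randomized experiment can be sketched in code by forcing the smoking value and comparing cancer rates under each candidate graph. The probabilities and function names are hypothetical, and the sketch uses 20,000 people per group instead of 100 only to keep sampling noise small.

```python
import random

def cancer_direct(rng, do_s):
    """Situation1 (Smoking -> Cancer): cancer depends on the forced smoking value."""
    return rng.random() < (0.6 if do_s else 0.2)

def cancer_gene(rng, do_s):
    """Situation2: the Gene -> Smoking edge is cut by the intervention,
    and cancer depends on the gene only, so do_s has no effect."""
    g = rng.random() < 0.5
    return rng.random() < (0.7 if g else 0.1)

def experiment_gap(model, rng, n=20_000):
    """Cancer rate among forced smokers minus rate among forced non-smokers."""
    treated = sum(model(rng, True) for _ in range(n)) / n
    control = sum(model(rng, False) for _ in range(n)) / n
    return treated - control

rng = random.Random(0)
gap_direct = experiment_gap(cancer_direct, rng)  # stays clearly positive
gap_gene = experiment_gap(cancer_gene, rng)      # close to zero
print(gap_direct, gap_gene)
```

Under the gene graph the gap vanishes once smoking is randomized, while under Smoking -> Cancer it persists, which is what lets the experiment tell the graphs apart.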
This experiment may be impossible to perform, because forcing people to smoke is not ethical
If an experiment cannot be performed but the causal graph is known from expert knowledge, then I can predict what would happen under the experiment without actually performing it
=> If we know which graph is the true causal graph, we can predict whether cancer rates will change when intervening on smoking
X causes Y (X -> Y)
-X is sampled from some distribution Px
-E is sampled from some distribution Pe
E is independent of X and is a variable I never get to observe
It is not on the graph, but I know that it affects my variable Y
-Y can be written as a function of X and E together: Y = f(X, E)
where X and E are independent
-In general, in a bigger graph, each variable Y is a function of its parent set Pa(Y) and its own noise term: Y = f(Pa(Y), E)
E is again independent of all the parents
=> Typical assumption when using structural equations
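A minimal sketch of sampling from such structural equations follows. The choices here are assumptions for illustration only: Px and Pe are taken to be standard normal, and f(x, e) = 2x + e is a hypothetical structural function.

```python
import random

def f(x, e):
    # Hypothetical structural function; the notes only require Y = f(X, E).
    return 2.0 * x + e

def sample_scm(rng):
    x = rng.gauss(0.0, 1.0)  # X ~ Px (assumed standard normal)
    e = rng.gauss(0.0, 1.0)  # E ~ Pe, drawn independently of X
    y = f(x, e)              # Y is a function of its parent X and the noise E
    return x, y              # E is never observed, so it is not returned

rng = random.Random(0)
data = [sample_scm(rng) for _ in range(10_000)]

# The observed dependence of Y on X: least-squares slope of Y on X.
n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in data) / n
var_x = sum((x - mean_x) ** 2 for x, _ in data) / n
slope = cov_xy / var_x
print(slope)  # near 2 for this choice of f
```

Because E is independent of X, the slope of Y on X recovers the coefficient used in the assumed f.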
What happens if I intervene and then sample from the resulting distribution?
Start from the causal graph and the structural equations that come with it
1)Perform an intervention on Y
Set Y to a chosen value, forcing it to take the value lowercase y
2)Replace Y with its realization y everywhere the variable Y appears in the structural equations
3)Y no longer depends on X, but is set to the value I decide
The post-intervention causal graph (the one on the right-hand side in the lecture) no longer has the edge X -> Y
4)These modified structural equations now induce another joint distribution on the random variables, which is the distribution I am sampling from if I perform this experiment
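The intervention steps above can be sketched by replacing the equation for Y with a constant. As before, the normal distributions and the function 2x + e are hypothetical choices, not taken from the lecture.

```python
import random

def sample(rng, do_y=None):
    """Sample (x, y); if do_y is given, the equation for Y is replaced by Y = do_y."""
    x = rng.gauss(0.0, 1.0)  # X ~ Px, unaffected by intervening on Y
    e = rng.gauss(0.0, 1.0)  # E ~ Pe, independent of X
    y = 2.0 * x + e if do_y is None else do_y
    return x, y

def covariance(pairs):
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n

rng = random.Random(0)
observational = [sample(rng) for _ in range(10_000)]
interventional = [sample(rng, do_y=3.0) for _ in range(10_000)]

cov_obs = covariance(observational)  # clearly positive: Y depends on X
cov_do = covariance(interventional)  # zero: Y no longer depends on X
mean_x_do = sum(x for x, _ in interventional) / len(interventional)
print(cov_obs, cov_do, mean_x_do)
```

The modified equations induce a different joint distribution: Y is fixed at 3.0, while X keeps the distribution it had before the intervention.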
An SCM (structural causal model) is not the only way to represent causal relations. A causal Bayesian network is a less detailed description
-A causal Bayesian network is a DAG, interpreted as a causal graph, together with a joint distribution that factorizes with respect to the graph
-From these two, we can calculate interventional distributions using the factorization formula and the causal graph, assuming that there are no latent variables
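As a hedged illustration of that calculation, the sketch below uses the truncated factorization: to intervene on a variable, drop its own conditional factor from the joint and fix its value. The graph (Gene -> Smoking, Gene -> Cancer, Smoking -> Cancer, with Gene observed) and every number in the tables are assumptions made up for this example.

```python
from itertools import product

# Hypothetical conditional probability tables for G -> S, G -> C, S -> C.
p_g = {1: 0.5, 0: 0.5}           # P(G=g)
p_s_given_g = {1: 0.8, 0: 0.2}   # P(S=1 | G=g)
p_c_given_sg = {(1, 1): 0.9, (1, 0): 0.5,
                (0, 1): 0.6, (0, 0): 0.1}  # P(C=1 | S=s, G=g)

def bern(p_one, value):
    """Probability of a binary value given P(value = 1)."""
    return p_one if value == 1 else 1.0 - p_one

def joint(g, s, c):
    """Observational factorization P(g) P(s | g) P(c | s, g)."""
    return p_g[g] * bern(p_s_given_g[g], s) * bern(p_c_given_sg[(s, g)], c)

def joint_do_s(g, c, forced_s):
    """Truncated factorization: the factor P(s | g) is dropped, S is fixed."""
    return p_g[g] * bern(p_c_given_sg[(forced_s, g)], c)

def p_cancer_do(forced_s):
    """P(C=1 | do(S=forced_s))."""
    return sum(joint_do_s(g, 1, forced_s) for g in (0, 1))

def p_cancer_given(s):
    """P(C=1 | S=s), the ordinary conditional probability."""
    numerator = sum(joint(g, s, 1) for g in (0, 1))
    denominator = sum(joint(g, s, c) for g, c in product((0, 1), repeat=2))
    return numerator / denominator

print(p_cancer_do(1), p_cancer_given(1))
```

With these tables the interventional probability P(C=1 | do(S=1)) differs from the conditional P(C=1 | S=1), because conditioning on smoking also shifts the distribution of the gene while intervening does not.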
Reference:
https://www.youtube.com/watch?v=Czk3aczfZlk&t=850s