Principal component analysis

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components.

Principal components analysis is a powerful approach for correcting for stratification in genome-wide association studies. This document provides the pipeline of three methods for principal component analysis using genomic data.

  • Pipeline

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    #qc
    plink --tfile <file> --allow-extra-chr --chr-set 33 --geno 0.1 --hwe 0.0000001 --maf 0.01 --mind 0.1 --make-bed --out <file>_qc

    plink --bfile <file>_qc --chr-set 33 --allow-extra-chr --indep-pairwise 100 25 0.25
    plink --bfile <file>_qc --chr-set 33 --allow-extra-chr --extract plink.prune.in --make-bed --out <file>_qc.prune

    ##gcta PCA
    gcta64 --bfile <file>_qc.prune --autosome --autosome-num 33 --make-grm --out <file>_qc
    gcta64 --grm <file>_qc --pca 20 --out <file>_qc

    ##Plink MDS
    plink --bfile <file>_qc.prune --chr-set 38 --allow-extra-chr --genome
    plink --bfile <file>_qc --chr-set 38 --allow-extra-chr --read-genome plink.genome --cluster --mds-plot 10

    ##smartPCA
    smartpca -i <file>.ped -a <file>.pedsnp -b <file>.pedind -o <file>.pca -k 10 -p <file>.plot -e <file>.eval -l <file>.log
  • Visualization in R

    1
    plot(pca[,1],pca[,2],xlab = "eigenvector1",ylab = "eigenvector2")