Analysis

Workflow

Author

Daniel Cusimano

Published

August 15, 2024

Data Analysis in R Workflow

Purpose and Design

This page is still in progress. From the data obtained via SlicerMorph and the step described on the processing page, I am running ANOVA and PCA analyses. Running the statistics and understanding the superficial insights is pretty straightforward. Some of the more detailed analysis, however, still need to be worked out.

This markdown file utilizes codechunks.

Load needed packages

require(geomorph) #for GM analysis
require(SlicerMorphR) #for importing Slicer data
require(plyr) #for wireframe specs
require(dplyr) #for data processing/cleaning
require(tidyr) #for data processing/cleaning
require(skimr) #for nice visualization of data 
require(knitr) #for qmd building
require(gapminder) #for plot aesthetics

Importing Data

The files used in this analysis were produced in the processing code (SLprocess.r) illustrated in the Data Processing page on this website. The files listed below represent only the material thus far used. More data may be added as this project continues.

data_location1 <- "Data/Processed_data/output.rds"
d1 <- readRDS(data_location1)

data_location2 <- "Data/Processed_data/Coords.rds"
Coords <- readRDS(data_location2) 

data_location3 <- "Data/Processed_data/gpa.rds"
d1array.gpa <- readRDS(data_location3)

data_location4 <- "Data/Processed_data/SMlog.rds"
SMlog <- readRDS(data_location4)

data_location5 <- "Data/Processed_data/PD.rds"
PD <- readRDS(data_location5)



###Work in progress####

PCA

The primary statistical approach first applied is a Principal Components Analysis. The next three plots below illustrate the initial results of PC1 by PC2, followed by color coded modifiers to visualize how the plotted variation relates to the sample demographics.

par(mar=c(1,1,1,1))

d1.pca<-gm.prcomp(d1array.gpa$coords)
d1.pca


Ordination type: Principal Component Analysis 
Centering by OLS mean
Orthogonal projection of OLS residuals
Number of observations: 15 
Number of vectors 14 

Importance of Components:
                             Comp1        Comp2        Comp3        Comp4
Eigenvalues            0.001284241 0.0008394793 0.0007064908 0.0004946383
Proportion of Variance 0.231890428 0.1515815806 0.1275683414 0.0893149409
Cumulative Proportion  0.231890428 0.3834720082 0.5110403496 0.6003552905
                              Comp5        Comp6        Comp7        Comp8
Eigenvalues            0.0004420162 0.0003562317 0.0002984485 0.0002513745
Proportion of Variance 0.0798131715 0.0643233985 0.0538897120 0.0453897296
Cumulative Proportion  0.6801684621 0.7444918605 0.7983815725 0.8437713021
                              Comp9       Comp10       Comp11       Comp12
Eigenvalues            0.0002313215 0.0001697308 0.0001513341 0.0001270698
Proportion of Variance 0.0417688424 0.0306476375 0.0273258209 0.0229445070
Cumulative Proportion  0.8855401445 0.9161877820 0.9435136029 0.9664581099
                             Comp13       Comp14
Eigenvalues            0.0001010387 8.472084e-05
Proportion of Variance 0.0182441710 1.529772e-02
Cumulative Proportion  0.9847022809 1.000000e+00

plot(d1.pca)

 SlicerMorph.MS <- read.table(file = paste(SMlog$output.path,
                                            SMlog$MeanShape,
                                            sep="/"),
                               sep=",", header = TRUE, row.names=1)
                               
                               
plot(d1.pca, main = "PCA by Sex",
col=d1$Sex,
pch=16
)
legend("bottomright", pch = 20, col=unique(d1$Sex), legend = unique(d1$Sex))

plot(d1.pca, main = "PCA by Ancestry",
col=d1$Ancestry,
pch=16)
legend("bottomright", pch = 20, col=unique(d1$Ancestry), legend = unique(d1$Ancestry))

The collage below shows the original PC1 by PC2 compared with analyses by three different planes. Unsurprisingly, the original analysis shows the most variation and the strongest clustering by ancestry. This collage also labels the PC axes for me which I haven’t figured out how do yet for the larger plots. Everything is still a work in progress.

par(mfrow= c(2,2))
plot(d1.pca, main = "PCA",
col=d1$Ancestry,
pch=16)
legend("topright", pch = 20, col=unique(d1$Ancestry), legend = unique(d1$Ancestry))

plot(d1.pca, main = "PCA",
    axis1=1, axis2=3,
    col=d1$Ancestry,
    pch=16
)
legend("topright", pch = 20, col=unique(d1$Ancestry), legend = unique(d1$Ancestry))

plot(d1.pca, main = "PCA",
    axis1=2, axis2=3,
    col=d1$Ancestry,
    pch=16
)
legend("topright", pch = 20, col=unique(d1$Ancestry), legend = unique(d1$Ancestry))

plot(d1.pca, main = "PCA",
    axis1=3, axis2=4,
    col=d1$Ancestry,
    pch=16
)
legend("topright", pch = 20, col=unique(d1$Ancestry), legend = unique(d1$Ancestry))

gdf <- geomorph.data.frame(PD,
Ancestry = d1$Ancestry,
Sex = d1$Sex,
Csize = d1$centeroid)
attributes(gdf)

$names
[1] ""         "Ancestry" "Sex"      "Csize"   

$class
[1] "geomorph.data.frame"

lm.fit <- procD.lm(Coords~Ancestry*Sex, data=gdf)


Warning: Because variables in the linear model are redundant,
the linear model design has been truncated (via QR decomposition).
Original X columns: 10
Final X columns (rank): 7
Check coefficients or degrees of freedom in ANOVA to see changes.

summary(lm.fit)


Analysis of Variance, using Residual Randomization
Permutation procedure: Randomization of null model residuals 
Number of permutations: 1000 
Estimation method: Ordinary Least Squares 
Sums of Squares and Cross-products: Type I 
Effect sizes (Z) based on F distributions

             Df        SS        MS     Rsq      F        Z Pr(>F)  
Ancestry      4 0.0090834 0.0022709 0.37550 1.5073  1.63054  0.056 .
Sex           1 0.0019193 0.0019193 0.07934 1.2739  0.69194  0.234  
Ancestry:Sex  1 0.0011348 0.0011348 0.04691 0.7532 -0.46291  0.675  
Residuals     8 0.0120528 0.0015066 0.49825                         
Total        14 0.0241903                                           
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Call: procD.lm(f1 = Coords ~ Ancestry * Sex, data = gdf)

plot(lm.fit)

anova(procD.lm(Coords~Csize + Ancestry*Sex, data=gdf))


Warning: Because variables in the linear model are redundant,
the linear model design has been truncated (via QR decomposition).
Original X columns: 11
Final X columns (rank): 8
Check coefficients or degrees of freedom in ANOVA to see changes.


Analysis of Variance, using Residual Randomization
Permutation procedure: Randomization of null model residuals 
Number of permutations: 1000 
Estimation method: Ordinary Least Squares 
Sums of Squares and Cross-products: Type I 
Effect sizes (Z) based on F distributions

             Df        SS        MS     Rsq      F        Z Pr(>F)   
Csize         1 0.0034837 0.0034837 0.14401 2.5410  2.97834  0.001 **
Ancestry      4 0.0082665 0.0020666 0.34173 1.5074  1.49535  0.079 . 
Sex           1 0.0018098 0.0018098 0.07481 1.3200  0.70917  0.247   
Ancestry:Sex  1 0.0010333 0.0010333 0.04272 0.7537 -0.43578  0.658   
Residuals     7 0.0095969 0.0013710 0.39673                          
Total        14 0.0241903                                            
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Call: procD.lm(f1 = Coords ~ Csize + Ancestry * Sex, data = gdf)