R - Split-sample design
I am working on a large data set containing 9,000 observations belonging to different groups. I would like to use a method called split-sample design to analyze the data. Let me explain in detail what I want to do. The data has the following structure:
    groupid performance commitment affect size
       1234           5          4      2    2
       1234           6          8      9    2
       2235           4          3      2    5
       2235           4          3      2    5
       2235           2          1      7    5
       2235           2          1      7    5
       2235           2          6     10    5
       3678           3          5      5    4
       3678           7          3      5    4
       3678           5          2      6    4
       3678           1          4      6    4
Now, I want to aggregate the data in the following way: for each group, I want to use the average performance score of the first half of the group and the average commitment and affect scores of the second half of the group to create one new observation. For uneven group sizes, I would like to drop one observation within the group (e.g. the last observation in the group) to create an even group size. However, I would like to do this in two steps. First, the data should look like this:
    groupid performance commitment affect size
       1234           5          8      9    2
       2235           4          1      7    5
       2235           4          1      7    5
       3678           3          2      6    4
       3678           7          4      6    4
In the next step, I want to aggregate the data. The new data set should have one observation per group and look like this:
    groupid performance commitment affect size
       1234           5          8      9    2
       2235           4          1      7    5
       3678           5          3      6    4
Again, please note that the last observation of group 2235 was dropped, since the group size is an uneven number.
Is there a package out there to split and aggregate data in this way? If not, how would I go ahead and code this? I would be grateful for any advice, since I have no idea how to approach this elegantly, other than writing a bunch of for loops.
Here is the code for the above example:
    groupid     <- c(1234, 1234, 2235, 2235, 2235, 2235, 2235, 3678, 3678, 3678, 3678)
    performance <- c(5, 6, 4, 4, 2, 2, 2, 3, 7, 5, 1)
    commitment  <- c(4, 8, 3, 3, 1, 1, 6, 5, 3, 2, 4)
    affect      <- c(2, 9, 2, 2, 7, 7, 10, 5, 5, 6, 6)
    size        <- c(2, 2, 5, 5, 5, 5, 5, 4, 4, 4, 4)
    mydata <- data.frame(groupid, performance, commitment, affect, size)
Many thanks!!
Here is a solution:
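    library(plyr)

    # One row per group: mean performance of the first half,
    # mean commitment and affect of the second half.
    # Note: the odd-sized group 2235 is not trimmed here. As the output shows,
    # head() used 2 rows and tail() used 3, so avecom and aveaff for 2235
    # (2.667 and 8) differ from the 1 and 7 asked for in the question.
    mydata1 <- ddply(mydata, .(groupid), summarize,
                     aveper = mean(head(performance, length(groupid) / 2)),
                     avecom = mean(tail(commitment, length(groupid) / 2)),
                     aveaff = mean(tail(affect, length(groupid) / 2)),
                     avesiz = mean(size))

    > mydata1
      groupid aveper avecom aveaff avesiz
    1    1234      5  8.000      9      2
    2    2235      4  2.667      8      5
    3    3678      5  3.000      6      4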
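If the odd-sized groups should lose their last observation before averaging, as the question asks, something along these lines might work in base R. This is only a sketch: `split_half_summary` is a made-up helper name, and it assumes the rows within each group are already in the order you want to split on.

```r
# Example data from the question
mydata <- data.frame(
  groupid     = c(1234, 1234, 2235, 2235, 2235, 2235, 2235, 3678, 3678, 3678, 3678),
  performance = c(5, 6, 4, 4, 2, 2, 2, 3, 7, 5, 1),
  commitment  = c(4, 8, 3, 3, 1, 1, 6, 5, 3, 2, 4),
  affect      = c(2, 9, 2, 2, 7, 7, 10, 5, 5, 6, 6),
  size        = c(2, 2, 5, 5, 5, 5, 5, 4, 4, 4, 4)
)

# Hypothetical helper (not from plyr or any package): one summary row per group
split_half_summary <- function(g) {
  half   <- nrow(g) %/% 2          # integer half, so an odd group ignores its last row
  first  <- seq_len(half)          # rows 1 .. half
  second <- half + seq_len(half)   # rows half+1 .. 2*half
  data.frame(
    groupid     = g$groupid[1],
    performance = mean(g$performance[first]),
    commitment  = mean(g$commitment[second]),
    affect      = mean(g$affect[second]),
    size        = g$size[1]
  )
}

# Split by group, summarize each piece, and stack the rows again
result <- do.call(rbind, lapply(split(mydata, mydata$groupid), split_half_summary))
rownames(result) <- NULL
result
```

On the example data this should reproduce the three-row table from the question (2235 gets 4, 1, 7). A group with a single observation would end up with NaN means, so those would need separate handling.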
Update:
    library(plyr)

    # Attach the group-level averages and the group length to every row
    mydata2 <- ddply(mydata, .(groupid), transform,
                     aveper = mean(head(performance, length(groupid) / 2)),
                     avecom = mean(tail(commitment, length(groupid) / 2)),
                     aveaff = mean(tail(affect, length(groupid) / 2)),
                     avesiz = mean(size),
                     lengr  = length(groupid))

    # Drop the last observation of the odd-sized group 2235 (row 7)
    mydata2 <- mydata2[-7, ]

    > mydata2
       groupid performance commitment affect size aveper avecom aveaff avesiz lengr
    1     1234           5          4      2    2      5  8.000      9      2     2
    2     1234           6          8      9    2      5  8.000      9      2     2
    3     2235           4          3      2    5      4  2.667      8      5     5
    4     2235           4          3      2    5      4  2.667      8      5     5
    5     2235           2          1      7    5      4  2.667      8      5     5
    6     2235           2          1      7    5      4  2.667      8      5     5
    8     3678           3          5      5    4      5  3.000      6      4     4
    9     3678           7          3      5    4      5  3.000      6      4     4
    10    3678           5          2      6    4      5  3.000      6      4     4
    11    3678           1          4      6    4      5  3.000      6      4     4

    # Assumes you have taken care of the uneven groups.
    # Keep the first half of the rows of each group.
    mydata3 <- Map(function(x) head(mydata2[mydata2$groupid == x, ],
                                    head(mydata2$lengr[which(mydata2$groupid == x)], 1) / 2),
                   unique(mydata2$groupid))
    mydata4 <- ldply(mydata3)            # stack the list back into one data frame
    mydata5 <- mydata4[, c(1, 6:9)]      # groupid plus the four averages

    > mydata5
      groupid aveper avecom aveaff avesiz
    1    1234      5  8.000      9      2
    2    2235      4  2.667      8      5
    3    2235      4  2.667      8      5
    4    3678      5  3.000      6      4
    5    3678      5  3.000      6      4
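For the two-step layout described in the question (first pair up the halves row by row, then aggregate), a base R sketch could look like the following. `pair_halves` is a hypothetical helper name, and the code again assumes that the row order within each group defines the halves and that odd groups simply lose their last row.

```r
# Example data from the question
mydata <- data.frame(
  groupid     = c(1234, 1234, 2235, 2235, 2235, 2235, 2235, 3678, 3678, 3678, 3678),
  performance = c(5, 6, 4, 4, 2, 2, 2, 3, 7, 5, 1),
  commitment  = c(4, 8, 3, 3, 1, 1, 6, 5, 3, 2, 4),
  affect      = c(2, 9, 2, 2, 7, 7, 10, 5, 5, 6, 6),
  size        = c(2, 2, 5, 5, 5, 5, 5, 4, 4, 4, 4)
)

# Hypothetical helper: pair row i of the first half with row i of the second half
pair_halves <- function(g) {
  half   <- nrow(g) %/% 2          # an odd group ignores its last row
  first  <- seq_len(half)
  second <- half + seq_len(half)
  data.frame(
    groupid     = g$groupid[first],
    performance = g$performance[first],   # taken from the first half
    commitment  = g$commitment[second],   # taken from the second half
    affect      = g$affect[second],       # taken from the second half
    size        = g$size[first]
  )
}

# Step 1: the intermediate table with half as many rows per group
step1 <- do.call(rbind, lapply(split(mydata, mydata$groupid), pair_halves))
rownames(step1) <- NULL

# Step 2: one row per group, averaging every column
step2 <- aggregate(cbind(performance, commitment, affect, size) ~ groupid,
                   data = step1, FUN = mean)
step1
step2
```

Keeping `step1` as its own data frame makes it easy to inspect the paired rows before averaging them, which seems to be the point of doing it in two steps.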