Statistics Question

**stalefish3169** · 07-12-2012, 07:03 PM

Plot your data using a qq plot. Any outliers should be obvious.

Consider using a log transformation to "fix" your data if it is skewed.
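
As a rough illustration of the log-transform suggestion, the sketch below measures skewness before and after taking logs. The scores are made up for illustration, and the `skewness` helper is a hypothetical name, not something from the thread. A Q-Q plot would need a plotting library, so only the numeric check is shown.

```python
import math
import statistics

def skewness(values):
    """Skewness as the mean cubed z-score (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Hypothetical right-skewed scores (made up for illustration).
scores = [1.2, 1.5, 1.7, 2.0, 2.1, 2.4, 3.0, 4.5, 9.0, 20.0]

# The log transform only works for strictly positive values.
logged = [math.log(v) for v in scores]

raw_skew = skewness(scores)
log_skew = skewness(logged)
print(f"skewness before log: {raw_skew:.2f}")
print(f"skewness after log:  {log_skew:.2f}")
```

A skewness closer to zero after the transform suggests the logged data are nearer to symmetric, though it does not by itself show they are normal.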

**idiot** · 07-12-2012, 07:27 PM

thanks for the helpful ideas.

the wiki links just remind me that i have no clue what i'm doing.

**Long duc dong** · 07-13-2012, 10:34 AM

Originally Posted by CantDog

I'm trying to go through a data set and find which values are 'outliers'. Not sure of the best way to do this. I first thought ANOVA, then a simple boxplot, but I thought that might not work because the data is not really Gaussian, and I don't know much about nonparametric tests.

Experiment Background:
Basically a test is administered to 5 subjects. I want to see if a given question seems to be a problem for multiple subjects. The only thing is the scores vary wildly between subjects, like a genius and a 3 year old taking the same test, so a score on a given question that would be fantastic for the 3 year old would be really bad for the genius. I want to see if there is a way to compare each question to see if a bad score for one subject (compared to his overall results) is also a bad score for another subject, compared to her overall results. Does that make sense?

Ideas?

I might use a regression model with multiple variables. This will allow you to find out what the effect of the question is, independent of variables such as age or intelligence.

You will come up with a model which is something like this:

b1*x1 + b2*x2 + b3*x3 = y, where each b value is the change in y given a one-unit change in its x (x1, x2, or x3) while all other variables stay the same. This will allow you to test the difficulty of a given question independently of factors such as age.

An example might be home prices. You might have a y variable, which is home price, modeled as a function of a location variable (x1) plus a square footage variable (x2) plus a number of bedrooms variable (x3). This would look like:

y = b1*x1 + b2*x2 + b3*x3, which would give you your predicted y for given scores of x1, x2, x3.

The coefficient vector is the least-squares estimate $\hat{b} = (X^T X)^{-1} X^T y$, where the columns of $X$ are x1, x2, and x3; its entries are your b1, b2, and b3.

You could do this to show if certain questions have lower scores, relative to the ability of the test takers.
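
The least-squares formula above can be sketched without any libraries. Everything in this example is an assumption for illustration: the design matrix `X`, the coefficients `true_b`, and the helper names are made up, and the model has no intercept, matching the equation as posted. In practice a numerical library would normally do the solve.

```python
def transpose(m):
    """Return the transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply two list-of-rows matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def solve(a, rhs):
    """Solve a @ b = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

# Hypothetical design matrix: columns are x1, x2, x3 (made-up values).
X = [
    [1.0, 2.0, 0.0],
    [2.0, 1.0, 1.0],
    [3.0, 0.0, 2.0],
    [4.0, 3.0, 1.0],
    [5.0, 1.0, 3.0],
]
# Hypothetical "true" coefficients used to generate noise-free y values.
true_b = [2.0, -1.0, 0.5]
y = [sum(b * x for b, x in zip(true_b, row)) for row in X]

# Normal equations: (X^T X) b = X^T y.
Xt = transpose(X)
XtX = matmul(Xt, X)
Xty = [sum(x * v for x, v in zip(col, y)) for col in Xt]
b_hat = solve(XtX, Xty)
print([round(b, 6) for b in b_hat])
```

Because the y values here are generated without noise, the recovered coefficients should match `true_b` up to rounding; with real data they would only be estimates.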

EDIT: I looked at your question again, and my solution is neither necessary nor efficient. It would probably be easier to just figure out what the average score and standard deviation are for each subject, and then test each score using a normal table. This will show you how far off each test score is from the mean, how many standard deviations off it is. If most of the questions are pretty clustered, and one is pretty far off, this will show that. You could also do a box plot for each subject, which would not be too hard given that you only have 5 subjects, and look for consistencies. If a question really is a bit of an outlier, you should be able to see it consistently in all of the plots.
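
A minimal sketch of that per-subject standardization, using made-up scores (three hypothetical subjects and six questions rather than the five subjects in the question). Each score is converted to a z-score within its own subject, so a question that is hard for everyone shows up as a low z-score in every row.

```python
import statistics

# Hypothetical scores: one list per subject, one entry per question.
scores = {
    "genius": [95, 98, 97, 80, 96, 99],
    "adult": [70, 75, 72, 50, 74, 71],
    "toddler": [10, 14, 12, 3, 13, 11],
}

def z_scores(values):
    """Standardize scores against the subject's own mean and standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

z_by_subject = {name: z_scores(vals) for name, vals in scores.items()}

# Average the z-scores across subjects, question by question.
n_questions = len(next(iter(scores.values())))
mean_z = [
    statistics.fmean(z[q] for z in z_by_subject.values())
    for q in range(n_questions)
]

hardest = min(range(n_questions), key=lambda q: mean_z[q])
print(f"question {hardest + 1} has the lowest mean z-score: {mean_z[hardest]:.2f}")
```

With only a handful of questions per subject the z-scores are rough, so this is better read as a screening step than as a formal test.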

You could also try to find a parametric distribution for the data, using maximum likelihood estimation, and then test the fit using a pp plot, but that could be tough to do. It might not be easy to find a distribution that fits your data, so it might not be worth your while. You only have 5 subjects, so finding the questions which are outliers should not be all that difficult. I suspect a box plot will show what you need.

Even easier plots might help too. Just plotting question score on the y and question number on the x, and doing this for each subject, will show you if any questions are consistent outliers.

**idiot** · 07-24-2012, 01:23 PM

Sorry guys......

One quick question. I have to pick if this would require a paired t-test or an independent t-test.

I think it is independent, but can't be wrong or I will miss tons of points on the following calculations.

rental car prices were compared between company x and y in ten different cities.
the car is a small compact

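| city | company x | company y |
|------|-----------|-----------|
| a | 5 | 3 |
| b | 4 | 5 |
| c | 6 | 5 |
| d | 3 | 5 |
| e | 5 | 4 |
| f | 6 | 3 |
| g | 7 | 4 |
| h | 5 | 7 |
| i | 2 | 4 |
| j | 4 | 6 |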

this is just a generic replica

thanks

**WrongWay** · 07-24-2012, 01:36 PM

Originally Posted by idiot

...

this is just a generic replica

thanks

"generic replica", jesus. do your own homework! you can do this. and do be careful with trusting some of the answers you have already received on previous questions on this thread. this statistician is done with it.

**Mazderati** · 07-24-2012, 06:56 PM

The fact that data is being pulled from two independent sources makes your assumption correct.
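
Both test statistics can be computed directly from the numbers in the question. The sketch below uses only Python's standard library and assumes the table rows really are matched by city. When each row links two measurements on the same unit like this, the paired form is the usual textbook choice, so it is worth comparing against the independent (pooled-variance) form before committing to either.

```python
import math
import statistics

# Prices from the post, one pair per city (a through j).
company_x = [5, 4, 6, 3, 5, 6, 7, 5, 2, 4]
company_y = [3, 5, 5, 5, 4, 3, 4, 7, 4, 6]
n = len(company_x)

# Paired t: work on the per-city differences.
diffs = [x - y for x, y in zip(company_x, company_y)]
t_paired = statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
df_paired = n - 1

# Independent (pooled-variance) t: treat the two columns as unrelated samples.
# With equal sample sizes the pooled variance is the average of the two variances.
pooled_var = (statistics.variance(company_x) + statistics.variance(company_y)) / 2
mean_diff = statistics.fmean(company_x) - statistics.fmean(company_y)
t_indep = mean_diff / math.sqrt(pooled_var * 2 / n)
df_indep = 2 * n - 2

print(f"paired:      t = {t_paired:.3f}, df = {df_paired}")
print(f"independent: t = {t_indep:.3f}, df = {df_indep}")
```

The two forms use different standard errors and different degrees of freedom, which is why picking the wrong one changes every calculation that follows.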

**klar** · 07-25-2012, 02:35 AM

[Attachment: ancova coeff.jpg]

[Attachment: ancova plot.jpg]

oh can i join in here?

i have dates of when the snow melts at several different loggers for about 10 years. snow melt seems to be shifting to earlier dates as time goes on for all loggers. i was told to do an analysis of covariance to see if the trend is significant. no significance using a t-test of correlation between snowmelt date and year. i used snowfall data for each winter (same for all loggers) to group the years into 3 categories (little snow, normal and a lot of snow) and used that as a covariate. i get the attached results (for the mean date of snowmelt over all loggers). i have zero idea what i am doing. can someone tell me if there is some fundamental flaw in my thinking? not enough data (10 years x 8 loggers)? ancova not applicable because ??? trying to make up our own statistics?

edit: hm, i can't get my exciting plot to show up...

**Brock Landers** · 07-25-2012, 08:50 AM

are the loggers together and susceptible to the same weather patterns? if so it's probably treated as 1 environment x 10 years. Can you go back any farther? More years would help. You could also do 1940-1950 versus the current 10 year period and see if there's significant deviation.

**klar** · 07-25-2012, 09:56 AM

thanks for the input! they are reasonably close together in a treeless, high alpine, fairly dry environment (a couple of sq km between 2000 and 3300m asl) but differ in aspect, exposure to wind etc. While subject to the same general weather patterns, the snowpack at each spot is really more defined by how much sun/wind/skier traffic it sees. (ie logger under groomer has snow longer than very high logger on an exposed, south facing rock face, even though both get the same weather). i used the accumulated snowfall from a close by weather station as covariate in the ancova to eliminate the "weather" effect. (at least that's what i think i did.) the loggers have only been in place since 2000 and any older data i could dig up would not really be comparable because snow is so location specific. it does seem like a whole lot of random weirdness for all loggers to be showing that trend. i am a total statistics idiot. i feel like my brain ties itself in knots every time i try some educational reading.

**idiot** · 04-07-2014, 03:51 PM

I'm currently doing my senior project (soil science). It involves comparing two wetlands, one is a reference, and one is a wetland that would be the same but the hydrology has been changed. Caltrans is our project site for mitigation work on HW101.

We have taken samples from the Caltrans site (larger), and also from the reference site. There are more samples from the Caltrans site. We have already done the lab work on bulk density, organic matter, and salinity. We have a data set for both sites. I have sucked at stats in the past, and continue to suck at stats.

I think the correct test here is just a simple t-test to see if there is a significant difference between bulk density in the reference compared to the Caltrans site etc...

Questions: Excel won't do these tests on two different sized data sets. So do I use a random number generator to remove some points from the larger set? I don't want to do this, and would rather include all the data.

Is there a better statistical test that could be used?

Is there a more interesting way to compare the sites that I am not aware of?

Thanks, and also feel free to rag on me again wrong way, I don't mind

**flyman683** · 04-07-2014, 04:05 PM

Originally Posted by idiot

I'm currently doing my senior project (soil science). It involves comparing two wetlands, one is a reference, and one is a wetland that would be the same but the hydrology has been changed.
...

Is there a more interesting way to compare the sites that I am not aware of?

Just to be an ass, I'll note that it is a somewhat large assumption that the reference wetland and the study plot "would be the same". Not necessarily bad, but something to note if this project involves report writing/technical writing.

Oftentimes, I've found that something many science/engineering types miss is being clear about (or even realizing) the assumptions required to do their study in the way it was done.

Oh, and I don't remember enough of my stats work in excel, but I think Matlab would work around the issue using some matrix manipulation steps that I also don't remember - that's probably not helpful is it?

**wendigo** · 04-07-2014, 04:12 PM

Yes there are a few things missing:

1. what is your overall question/hypothesis? this will determine your methods/analyses.
2. was this set up as a BACI (before-after, control-impact) or was the impact already in place when the project started?
3. assumptions of normality - use a qqplot, histogram and/or Kolmogorov-Smirnov test to examine for normality (if the plots look reasonable it probably is).
4. if the data are normal you can do an F-test for equal variance between the two locations; if the variances are equal then you can use a t-test adjusted for unequal sample sizes.

Yes you were supposed to plan for the analyses before starting the project...
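
Points 3 and 4 can be sketched with the standard library alone. The bulk density values below are made up (the real data are not in the thread), and the two lists are deliberately different lengths. The sketch computes the variance ratio used in an F-test and Welch's t statistic, which is one common choice when sample sizes or variances differ; turning either statistic into a p-value needs an F or t table, or a stats package.

```python
import math
import statistics

# Hypothetical bulk density values (g/cm^3); sample sizes differ on purpose.
project_site = [1.32, 1.28, 1.41, 1.36, 1.30, 1.45, 1.38, 1.33]
reference = [1.05, 1.12, 0.98, 1.10, 1.02]

n1, n2 = len(project_site), len(reference)
m1, m2 = statistics.fmean(project_site), statistics.fmean(reference)
v1, v2 = statistics.variance(project_site), statistics.variance(reference)

# F statistic for equal variances: larger sample variance over the smaller.
f_stat = max(v1, v2) / min(v1, v2)

# Welch's t statistic: does not assume equal variances or equal sample sizes.
se = math.sqrt(v1 / n1 + v2 / n2)
t_welch = (m1 - m2) / se

# Welch-Satterthwaite approximation to the degrees of freedom.
df_welch = (v1 / n1 + v2 / n2) ** 2 / (
    (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
)

print(f"F = {f_stat:.2f}, Welch t = {t_welch:.2f}, df = {df_welch:.1f}")
```

No data points need to be thrown away to make the groups the same size. As far as I know, Excel's `T.TEST` also accepts ranges of different lengths when the type argument is 2 (equal variance) or 3 (unequal variance); only the paired type needs matching lengths.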

**idiot** · 04-07-2014, 04:20 PM

This does involve report writing/tech writing, but for this study they would have been roughly the same. The reference and our project site are/were estuarine wetlands (tidal salt marsh) along the Mad River Slough. Our site had a levee installed 80 years ago to keep it from flooding and to convert it to pasture. It still floods enough to be considered a wetland by the Army Corps of Engineers three-parameter approach to delineating wetlands. Easily. The difference is that because the main input of water now is from rain, and not tide, it is now classified as a freshwater wetland, and the salinity has largely been removed by leaching from rainfall.

You bring up a good point though. One that I questioned for half of the semester.

**idiot** · 04-07-2014, 04:31 PM

Originally Posted by wendigo

Yes there are a few things missing:

1. what is your overall question/hypothesis? this will determine your methods/analyses.

Wetlands accumulate organic matter in an anaerobic environment. We propose that the reference wetland will have accumulated more organic matter than our project site because of the reduction of flooding due to the levee. We also propose that if there is higher organic matter in the reference there will be lower bulk density. We also propose that without tidal influence the salinity of the soil will be reduced by leaching.

Originally Posted by wendigo

2. was this set up as a BACI (before-after, control-impact) or was the impact already in place when the project started?

The levee went in 80 years ago, so if I am understanding the question, the impact was in place when the project started.

Originally Posted by wendigo

3. assumptions of normality - use a qqplot, histogram and/or Kolmogorov-Smirnov test to examine for normality (if the plots look reasonable it probably is).
4. if the data are normal you can do an F-test for equal variance between the two locations; if the variances are equal then you can use a t-test adjusted for unequal sample sizes.

Roger that.

Originally Posted by wendigo

Yes you were supposed to plan for the analyses before starting the project...

I planned to use simple t-tests to detect significant differences between the sites, however, I am wondering if there is anything that could be improved or modified.

**jma233** · 04-07-2014, 04:33 PM

A back-of-the-envelope calculation I often do is to calculate the 95% confidence interval for each population and see if they overlap. I think Excel can handle this. I've even used this in some low-level reports with associated histograms because it's an easy way to visualize the data for the statistics impaired.
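
A minimal sketch of that overlap check, with made-up organic matter percentages and a hypothetical `ci95` helper. It uses the normal critical value (about 1.96), which is an assumption that makes the intervals somewhat too narrow for samples this small; a t critical value would be the more careful choice.

```python
import math
import statistics

def ci95(values):
    """Approximate 95% confidence interval for the mean (normal approximation)."""
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    z = statistics.NormalDist().inv_cdf(0.975)  # about 1.96
    return mean - z * se, mean + z * se

# Hypothetical organic matter percentages for two sites (made-up values).
site_a = [12.1, 13.4, 11.8, 12.9, 13.0, 12.5]
site_b = [8.2, 9.1, 7.9, 8.8, 8.5]

lo_a, hi_a = ci95(site_a)
lo_b, hi_b = ci95(site_b)

# Two intervals overlap when each one starts before the other ends.
overlap = lo_a <= hi_b and lo_b <= hi_a

print(f"site A: ({lo_a:.2f}, {hi_a:.2f})")
print(f"site B: ({lo_b:.2f}, {hi_b:.2f})")
print("intervals overlap" if overlap else "intervals do not overlap")
```

Intervals that do not overlap point to a real difference, but overlapping intervals do not rule one out, so the check is best treated as a quick screen before a proper test.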

**irul&ublo** · 04-07-2014, 10:18 PM

Originally Posted by TruckeeLocal

That's true as far as it goes. But as the population increases it becomes more obvious which data points are unduly influencing the overall results. Eliminating these may, or may not, be appropriate. For example when looking at aviation incidents in California ...

Salton Sea (Death Valley) turns out to be the unsafest airport when incidents are matched to operations. But there's been all of one incident in the last 20 years.
Equivalently San Francisco International turns out to be the safest because of the immense number of operations compared to incidents. But there's next to no general aviation who are the ones who seem to have the death wish.

Are those two examples outliers or part of the distribution? I, as a statistician by training, would eliminate Salton Sea because of the small dataset of incidents and operations and bitch about the excessive influence San Francisco (and LAX) have on the overall distribution model.

That San Francisco thing was valid only until Wi Too Lo and Ho Lee Fuk tried to land