Statistics theory

Statistics refers primarily to a branch of mathematics that specializes in enumeration, or counted, data and their relation to measured data.[1] The term may also refer to a fact of classification, the chief source of all statistics, and the field has a close relationship to psychometric applications in the social sciences.

An individual statistic is a derived numerical value, such as a mean, a coefficient of correlation, or some other single measure of descriptive statistics. It may also refer to a quantity associated with an average, such as a median or a standard deviation, or to some other value computed from a set of data.[2]

More precisely, in mathematical statistics and in general usage, a statistic is defined as any measurable function of a data sample.[3] A data sample is described by instances of a random variable of interest, such as height, weight, polling results, or test performance, obtained by random sampling of a population.

Simple illustration

Suppose one wishes to embark on a quantitative study of the height of adult males in some country C. How should one go about doing this, and how can the data be summarized? In statistics, the approach taken is to model the quantity of interest, "height of an adult man from country C", as a random variable X, say, taking values in [0, 5] (measured in metres) and distributed according to some unknown probability distribution[4] F on [0, 5]. One important theme in statistics is the development of theoretically sound methods, firmly grounded in probability theory, for learning something about the postulated random variable X and its distribution F by collecting samples; for this particular example, the heights of a number of men randomly drawn from the adult male population of C.

Suppose that N men, labeled $1, 2, \ldots, N$, have been randomly drawn by simple random sampling (this means that each man in the population is equally likely to be selected in the sampling process) and that their heights are $x_1, x_2, \ldots, x_N$, respectively. An important yet subtle point to note here is that, due to random sampling, the data sample obtained is actually an instance or realization of a sequence of independent random variables $X_1, X_2, \ldots, X_N$, with each random variable $X_i$ distributed identically according to the distribution of X (that is, each $X_i$ has the distribution F). Such a sequence is referred to in statistics as independent and identically distributed (i.i.d.) random variables. To further clarify this point, suppose that there are two other investigators, Tim and Allen, who are also interested in the same quantitative study, and they in turn also randomly sample N adult males from the population of C. Let Tim's height data sample be $y_1, y_2, \ldots, y_N$ and Allen's be $z_1, z_2, \ldots, z_N$; then both samples are also realizations of the i.i.d. sequence $X_1, X_2, \ldots, X_N$, just as the first sample was.
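To make the sampling model concrete, here is a minimal Python sketch (added for illustration; it is not part of the cited material). It assumes, purely hypothetically, that the unknown distribution F is normal with mean 1.75 m and standard deviation 0.07 m, and it draws three independent samples, standing in for the first investigator's, Tim's, and Allen's data, as realizations of the same i.i.d. sequence.

import random

# Hypothetical stand-in for the unknown distribution F: Normal(1.75 m, 0.07 m).
MEAN_HEIGHT = 1.75
SD_HEIGHT = 0.07
N = 100  # sample size

def draw_sample(n, rng):
    """Simple random sampling: n independent draws from the same distribution F."""
    return [rng.gauss(MEAN_HEIGHT, SD_HEIGHT) for _ in range(n)]

rng = random.Random(0)  # seeded for reproducibility

# Each investigator's data set is one realization of the i.i.d. sequence X_1, ..., X_N.
first_sample = draw_sample(N, rng)  # x_1, ..., x_N
tim_sample = draw_sample(N, rng)    # y_1, ..., y_N
allen_sample = draw_sample(N, rng)  # z_1, ..., z_N

for label, sample in (("first", first_sample), ("Tim", tim_sample), ("Allen", allen_sample)):
    print(label, "sample mean height (m):", round(sum(sample) / N, 3))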

From a data sample one may define a statistic T as $T = f(X_1, X_2, \ldots, X_N)$ for some real-valued function f which is measurable (here with respect to the Borel sets of $\mathbb{R}^N$). Two examples of commonly used statistics are:

  1. $\bar{X}_N = \frac{1}{N} \sum_{i=1}^{N} X_i$. This statistic is known as the sample mean.
  2. $S_N^2 = \frac{1}{N} \sum_{i=1}^{N} (X_i - \bar{X}_N)^2$. This statistic is known as the sample variance. Often the alternative definition $\frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X}_N)^2$ of sample variance is preferred because it is an unbiased estimator of the variance of X, while the former is a biased estimator. Both statistics are computed in the short sketch that follows this list.
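The following short Python sketch (an illustration added here, with arbitrary made-up heights rather than data from the article) computes the sample mean and both the biased and the unbiased versions of the sample variance.

# Arbitrary illustrative heights in metres (not real data).
heights = [1.68, 1.74, 1.81, 1.79, 1.70, 1.77]
N = len(heights)

# Sample mean: (1/N) * sum of the observations.
sample_mean = sum(heights) / N

# Biased sample variance: divide the sum of squared deviations by N.
biased_variance = sum((x - sample_mean) ** 2 for x in heights) / N

# Unbiased sample variance: divide by N - 1 instead (Bessel's correction).
unbiased_variance = sum((x - sample_mean) ** 2 for x in heights) / (N - 1)

print("sample mean:", round(sample_mean, 4))
print("biased variance (1/N):", round(biased_variance, 6))
print("unbiased variance (1/(N-1)):", round(unbiased_variance, 6))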

Summary statistics

* Descriptive statistics

Measurements of central tendency

* Mean
* Median

Measurements of variation

* Standard deviation (SD), a measure of variation among individual observations.
* Standard error of the mean (SEM), which measures how accurately you know the mean of a population and is always smaller than the SD.[5]
* Variance
* The 95% confidence interval for a mean is the sample mean plus or minus 1.96 standard errors; a brief numerical sketch follows this list.
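As a numerical illustration of the quantities listed above, the sketch below (added here; the data are arbitrary and hypothetical) computes the standard deviation, the standard error of the mean, and an approximate 95% confidence interval for the mean.

import math

# Arbitrary illustrative measurements (not real data).
data = [1.68, 1.74, 1.81, 1.79, 1.70, 1.77, 1.73, 1.76]
n = len(data)

mean = sum(data) / n

# Standard deviation (SD), using the unbiased (n - 1) form of the variance.
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

# Standard error of the mean (SEM) = SD / sqrt(n); smaller than the SD whenever n > 1.
sem = sd / math.sqrt(n)

# Approximate 95% confidence interval for the mean: mean +/- 1.96 * SEM.
ci_low, ci_high = mean - 1.96 * sem, mean + 1.96 * sem

print("mean:", round(mean, 4))
print("SD:", round(sd, 4), " SEM:", round(sem, 4))
print("approximate 95% CI:", (round(ci_low, 4), round(ci_high, 4)))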

Inferential statistics and hypothesis testing

Frequentist method

This approach uses mathematical formulas to calculate the deductive probability (p-value) of obtaining an experimental result at least as extreme as the one observed, assuming the null hypothesis is true.[6] This approach can also generate confidence intervals.
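A minimal sketch of such a frequentist calculation is shown below (added for illustration; the data, the hypothesized mean, and the choice of a simple z-test with a normal approximation are assumptions made here, not methods prescribed by the cited source). It computes a two-sided p-value and a 95% confidence interval for a sample mean against a null-hypothesis value.

import math

def normal_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Arbitrary illustrative sample (not real data) and a hypothesized population mean under H0.
data = [1.68, 1.74, 1.81, 1.79, 1.70, 1.77, 1.73, 1.76]
mu0 = 1.70

n = len(data)
mean = sum(data) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
sem = sd / math.sqrt(n)

# z statistic and two-sided p-value under the normal approximation.
z = (mean - mu0) / sem
p_value = 2.0 * (1.0 - normal_cdf(abs(z)))

# 95% confidence interval for the population mean.
ci = (mean - 1.96 * sem, mean + 1.96 * sem)

print("z:", round(z, 3), " two-sided p-value:", round(p_value, 4))
print("95% confidence interval:", tuple(round(v, 4) for v in ci))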

A problem with frequentist analyses based on p-values is that they may overstate "statistical significance".[7][8] See Bayes factor for details.

Likelihood or Bayesian method

Some argue that the P-value should be interpreted in light of how plausible the hypothesis is, based on the totality of prior research and physiologic knowledge.[9][6][10] This approach can generate Bayesian 95% credibility intervals.[11]
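As a sketch of how a Bayesian 95% credibility interval can arise in the simplest conjugate setting (the normal prior, the assumed known observation variance, and the data below are all hypothetical choices made for this illustration; they are not taken from the cited sources), the following code updates a normal prior for a population mean with a small sample and reports the central 95% credibility interval of the posterior.

import math

# Hypothetical conjugate normal-normal model with known observation variance.
prior_mean, prior_var = 1.75, 0.10 ** 2  # prior belief about the population mean
obs_sd = 0.07                            # assumed known observation standard deviation

# Arbitrary illustrative observations (not real data).
data = [1.68, 1.74, 1.81, 1.79, 1.70, 1.77]
n = len(data)
sample_mean = sum(data) / n

# Posterior for the mean is normal; precisions (1/variance) add, and the posterior mean
# is a precision-weighted average of the prior mean and the sample mean.
post_var = 1.0 / (1.0 / prior_var + n / obs_sd ** 2)
post_mean = post_var * (prior_mean / prior_var + n * sample_mean / obs_sd ** 2)

# Central 95% credibility interval from the normal posterior.
half_width = 1.96 * math.sqrt(post_var)
print("posterior mean:", round(post_mean, 4))
print("95% credibility interval:", (round(post_mean - half_width, 4), round(post_mean + half_width, 4)))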

Classification

See also

References

  1. Trapp, Robert; Dawson, Beth (2004). Basic & Clinical Biostatistics. New York: Lange Medical Books/McGraw-Hill. LCC QH323.5 .D38. LCCN 2005-263. ISBN 0-07-141017-1.
  2. Guilford, J.P.; Fruchter, B. (1978). Fundamental Statistics in Psychology and Education. New York: McGraw-Hill.
  3. Shao, J. (2003). Mathematical Statistics, 2nd ed. Springer Texts in Statistics. New York: Springer-Verlag, p. 100.
  4. This is the case in non-parametric statistics. On the other hand, in parametric statistics the underlying distribution is assumed to be of some particular type, say a normal or exponential distribution, but with unknown parameters that are to be estimated.
  5. "What is the difference between 'standard deviation' and 'standard error of the mean'? Which should I show in tables and graphs?" GraphPad FAQ. http://www1.graphpad.com/faq/viewfaq.cfm?faq=201. Retrieved on 2008-09-18.
  6. Goodman SN (1999). "Toward evidence-based medical statistics. 1: The P value fallacy". Ann Intern Med 130 (12): 995–1004. PMID 10383371.
  7. Goodman SN (1999). "Toward evidence-based medical statistics. 1: The P value fallacy". Ann Intern Med 130 (12): 995–1004. PMID 10383371.
  8. Goodman SN (1999). "Toward evidence-based medical statistics. 2: The Bayes factor". Ann Intern Med 130 (12): 1005–13. PMID 10383350.
  9. Browner WS, Newman TB (1987). "Are all significant P values created equal? The analogy between diagnostic tests and clinical research". JAMA 257: 2459–63. PMID 3573245.
  10. Goodman SN (1999). "Toward evidence-based medical statistics. 2: The Bayes factor". Ann Intern Med 130 (12): 1005–13. PMID 10383350.
  11. Gelfand, Alan E.; Banerjee, Sudipto; Carlin, Bradley P. (2003). Hierarchical Modeling and Analysis for Spatial Data. Monographs on Statistics and Applied Probability. Boca Raton: Chapman & Hall/CRC. LCC QA278.2 .B36. ISBN 1-58488-410-X.