Logistic regression or T test?

A group of persons answers one question. The answer can be yes or no. The researcher wants to know whether age is associated with the type of answer.

The association was assessed by doing a logistic regression where age is the explanatory variable and type of answer (yes, no) is the dependent variable. It was separately addressed by calculating the mean age of the groups that answered yes and no, respectively, and by conducting a T test to compare means.

Both tests were performed following the advice of different persons, and neither of them is sure which is the right way to go. In view of the research question, which would be the better test?

In hypothesis testing, the p-value was not significant for the regression but was significant for the T test. The sample has fewer than 20 cases.

I'm not sure this gets at your real question. You already ran both of the analyses you're asking about. I'm guessing that what you really want to know is something about comparisons between or relations among those tests, for example which is better. Please edit your question to fix that.

Both tests were performed following the advice of different persons, and neither of them is sure whether this is the right way to go. In view of the research question (is age associated with the type of response?), which would be the better test: logistic regression of type of response on age, or a T test comparing the mean age of the persons who answered yes with the mean age of the persons who answered no?

Both tests implicitly model the age-response relationship, but they do so in different ways. Which one to select depends on how you choose to model that relationship. Your choice ought to depend on an underlying theory, if there is one; on what kind of information you want to extract from the results; and on how the sample is selected. This answer discusses these three aspects in order.

I will describe the t-test and logistic regression using language that supposes you are studying a well-defined population of people and wish to make inferences from the sample to this population.

In order to support any kind of statistical inference we must assume the sample is random.

A t-test assumes the people in the sample responding no are a simple random sample of all no-respondents in the population and that the people in the sample responding yes are a simple random sample of all yes-respondents in the population.

A t-test makes additional technical assumptions about the distributions of the ages within each of the two groups in the population. Various versions of the t-test exist to handle the likely possibilities.

Logistic regression assumes all people are a simple random sample of the population. When this population is broken down by age, the separate age groups exhibit different rates of yes responses. These rates, when expressed as log odds (rather than as straight proportions), are assumed to be linearly related with age.

Logistic regression is easily extended to accommodate non-linear relationships between age and response. Such an extension can be used to evaluate the plausibility of the initial linear assumption. It is practicable with large datasets, which afford enough detail to display non-linearities, but is unlikely to be of much use with small datasets. A common rule of thumb, that regression models should have ten times as many observations as parameters, suggests that more than 20 observations (about 30 under that rule) are needed to detect nonlinearity, which needs a third parameter in addition to the intercept and slope of a linear function.
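One way to sketch such an extension is to add a squared age term and compare the two nested fits with a likelihood-ratio test. The example below is only an illustration under stated assumptions: the 32 ages and answers are invented (they are not the question's data), the quadratic term is just one possible non-linear form, and the small Newton-Raphson fitter is a hypothetical helper written with the standard library rather than the routine of any particular package.

```python
import math

def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def prob(beta, row):
    """Inverse logit of the linear predictor for one observation."""
    return 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, row))))

def fit_logistic(X, y, iters=30):
    """Maximum likelihood by Newton-Raphson; returns (coefficients, log-likelihood)."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        p = [prob(beta, row) for row in X]
        grad = [sum((yi - pi) * row[j] for row, yi, pi in zip(X, y, p))
                for j in range(k)]
        hess = [[sum(pi * (1 - pi) * row[i] * row[j] for row, pi in zip(X, p))
                 for j in range(k)] for i in range(k)]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    loglik = sum(math.log(prob(beta, row)) if yi else math.log(1 - prob(beta, row))
                 for row, yi in zip(X, y))
    return beta, loglik

# Invented data: 32 respondents aged 20, 22, ..., 82; 1 = yes, 0 = no.
# Yes answers cluster among the youngest and oldest, with some noise.
ages = list(range(20, 84, 2))
pattern = "1110110100" "000100000010" "0101101111"
answers = [int(ch) for ch in pattern]

# Center and scale age so the Newton steps are numerically well behaved.
z = [(a - 51) / 15 for a in ages]
X_lin = [[1.0, zi] for zi in z]
X_quad = [[1.0, zi, zi * zi] for zi in z]

beta_lin, ll_lin = fit_logistic(X_lin, answers)
beta_quad, ll_quad = fit_logistic(X_quad, answers)

# Likelihood-ratio statistic on 1 degree of freedom; for 1 df the chi-square
# upper tail probability equals erfc(sqrt(stat / 2)).
lr_stat = 2 * (ll_quad - ll_lin)
p_value = math.erfc(math.sqrt(max(lr_stat, 0.0) / 2))
print("LR statistic:", lr_stat, "p-value:", p_value)
```

With fewer than 20 cases, as in the question, a comparison like this would have little chance of distinguishing the two models, which is the point of the rule of thumb.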

A t-test identifies whether the average ages differ between no- and yes-respondents in the population. A logistic regression estimates how the response rate varies by age. As such it is more flexible and capable of supplying more detailed information than the t-test is. On the other hand, it tends to be less powerful than the t-test for the basic purpose of detecting a difference between the average ages in the groups.
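To make the two quantities concrete, here is a minimal sketch that computes both test statistics on the same small sample. Everything in it is an assumption for illustration: the 13 ages are invented (not the question's data), the t statistic is the equal-variance (pooled) version, and the logistic slope is fitted with a hand-written Newton-Raphson loop and judged by a Wald z statistic.

```python
import math
from statistics import mean, variance

# Invented ages for 13 respondents, split by answer.
age_no = [22, 25, 31, 38, 45, 52]
age_yes = [35, 41, 48, 55, 60, 63, 67]

def pooled_t(a, b):
    """Student's two-sample t statistic for mean(b) - mean(a), equal variances."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(b) - mean(a)) / math.sqrt(sp2 * (1 / na + 1 / nb))

def logistic_slope(x, y, iters=25):
    """Newton-Raphson fit of logit P(y=1) = b0 + b1 * (x - mean(x)).

    Returns (b1, standard error of b1). The standard error comes from the
    information matrix at the last iterate, which matches the final estimate
    once the iteration has converged.
    """
    xbar = mean(x)
    xc = [xi - xbar for xi in x]  # centering only changes the intercept
    b0 = b1 = 0.0
    for _ in range(iters):
        p = [1 / (1 + math.exp(-(b0 + b1 * xi))) for xi in xc]
        w = [pi * (1 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, xc))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, xc))
        det = h00 * h11 - h01 ** 2
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, xc))
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b1, math.sqrt(h00 / det)

ages = age_no + age_yes
answers = [0] * len(age_no) + [1] * len(age_yes)

t_stat = pooled_t(age_no, age_yes)          # compared with t on 11 df
slope, se = logistic_slope(ages, answers)   # change in log odds per year
z_stat = slope / se                         # Wald statistic, compared with N(0, 1)
print("t:", t_stat, "slope:", slope, "Wald z:", z_stat)
```

The t statistic summarizes a difference in mean age, while the slope describes how the log odds of a yes change per year of age; the two statistics are referred to different reference distributions, so their p-values need not agree.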

It is possible for the pair of tests to exhibit all four combinations of significance and non-significance. Two of these are problematic:

The t-test is not significant but the logistic regression is. When the assumptions of both tests are plausible, such a result is practically impossible, because the t-test is not trying to detect such a specific relationship as posited by logistic regression. However, when that relationship is sufficiently nonlinear to cause the oldest and youngest subjects to share one opinion and the middle-aged subjects another, then the extension of logistic regression to nonlinear relationships can detect and quantify that situation, which no t-test could detect.

The t-test is significant but the logistic regression is not, as in the question. This often happens, especially when there is a group of younger respondents, a group of older respondents, and few people in between. This creates a great separation between the groups of no- and yes-responders. It is readily detected by the t-test. However, logistic regression would either have relatively little detailed information about how the response rate actually changes with age or else it would have inconclusive information: the case of complete separation, where all older people respond one way and all younger people another way. In that case, though, both tests would usually have very low p-values.

Note that the experimental design can invalidate some of the test assumptions. For instance, if you selected people according to their age in a stratified design, then the t-test's assumption (that each group reflects a simple random sample of ages) becomes questionable. This design would suggest relying on logistic regression. If instead you had two pools, one of no-responders and one of yes-responders, and selected randomly from those to ascertain their age, then the assumptions of logistic regression are doubtful while those of the t-test will hold. That design would suggest using some form of a t-test.

(The second design might seem silly here, but in circumstances where age is replaced by some characteristic that is difficult, costly, or time-consuming to measure it can be appealing.)

The better test is the one that better addresses your question. Neither is just better on its face. The differences here are equivalent to those found when regressing y on x and x on y, and the reasons for different results are similar. The variance being assessed depends on which variable is being treated as the response variable in the model.
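The y-on-x versus x-on-y analogy can be sketched with ordinary least squares. This is only an analogy under stated assumptions: the data are invented, and a straight-line fit to a 0/1 answer stands in for the logistic regression purely to show that the two directions summarize different things.

```python
from statistics import mean

# Invented data: ages and yes/no answers coded 1/0.
ages = [22, 25, 31, 38, 45, 52, 35, 41, 48, 55, 60, 63, 67]
answers = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

def ols_slope(u, v):
    """Least-squares slope from regressing v on u."""
    ubar, vbar = mean(u), mean(v)
    suv = sum((ui - ubar) * (vi - vbar) for ui, vi in zip(u, v))
    suu = sum((ui - ubar) ** 2 for ui in u)
    return suv / suu

def correlation(u, v):
    """Pearson correlation coefficient."""
    ubar, vbar = mean(u), mean(v)
    suv = sum((ui - ubar) * (vi - vbar) for ui, vi in zip(u, v))
    suu = sum((ui - ubar) ** 2 for ui in u)
    svv = sum((vi - vbar) ** 2 for vi in v)
    return suv / (suu * svv) ** 0.5

# Answer treated as the response: change in the proportion of yes per year of age.
answer_on_age = ols_slope(ages, answers)

# Age treated as the response: this slope is the difference in mean age
# between yes and no respondents, the quantity a t-test examines.
age_on_answer = ols_slope(answers, ages)

mean_diff = (mean(a for a, r in zip(ages, answers) if r == 1)
             - mean(a for a, r in zip(ages, answers) if r == 0))
r = correlation(ages, answers)
print(answer_on_age, age_on_answer, mean_diff, r ** 2)
```

The two slopes are not reciprocals of each other; their product is the squared correlation, so they coincide in what they say only when the relationship is perfect.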

Your research question is terribly vague. Perhaps if you considered direction of causality you'd be able to come to a conclusion about which analysis you want to use. Is age causing people to respond yes or is responding yes causing people to get older? It's more likely the former, in which case the variance in the probability of a yes is what you wish to model and therefore the logistic regression is the best choice.

That said, you should examine the assumptions of the tests. Those can be found online at Wikipedia or in your textbooks. It may well be that you have good reasons not to perform the logistic regression and, when that happens, you may need to ask a different question.

This doesn't really answer the question but may still be of some interest. The standard assumption of a two-sample $t$-test is that the conditional distribution of $X$ given a binary variable $Y$ is normal, $$ X \mid Y=i \sim N(\mu_i,\sigma^2). $$ This, together with the assumption that $Y \sim \operatorname{bernoulli}(p)$ marginally, implies that the conditional distribution of the binary variable $Y$ given $X=x$ is
\begin{align}
P(Y=1 \mid X=x) &= \frac{f_{X \mid Y=1}(x)P(Y=1)}{\sum_{i=0}^1 f_{X \mid Y=i}(x)P(Y=i)} \\
&= \frac{p e^{-\frac{1}{2\sigma^2}(x-\mu_1)^2}}{p e^{-\frac{1}{2\sigma^2}(x-\mu_1)^2} + (1-p) e^{-\frac{1}{2\sigma^2}(x-\mu_0)^2}} \\
&= \frac{1}{1+\frac{1-p}{p} e^{-\frac{1}{2\sigma^2}(x-\mu_0)^2+\frac{1}{2\sigma^2}(x-\mu_1)^2}} \\
&= \operatorname{logit}^{-1}(\beta_0 + \beta_1 x)
\end{align}
that is, a logistic regression model with intercept and slope
\begin{align}
\beta_0 &= \ln\frac{p}{1-p} - \frac{1}{2\sigma^2}(\mu_1^2-\mu_0^2) \\
\beta_1 &= \frac{1}{\sigma^2}(\mu_1-\mu_0).
\end{align}

So in this sense the two conditional models are compatible.
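The identity can be checked numerically. The sketch below picks arbitrary, purely hypothetical values for $\mu_0$, $\mu_1$, $\sigma$ and $p$, computes $P(Y=1 \mid X=x)$ directly from Bayes' rule with normal densities, and compares it with the inverse logit of $\beta_0 + \beta_1 x$ using the intercept and slope given above.

```python
import math

# Hypothetical parameter values chosen only for illustration.
mu0, mu1 = 35.0, 50.0   # mean age of no- and yes-respondents
sigma = 10.0            # common standard deviation
p = 0.4                 # marginal probability of a yes

def normal_pdf(x, mu, sd):
    """Density of N(mu, sd^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def posterior_yes(x):
    """P(Y=1 | X=x) from Bayes' rule under the two-sample t-test model."""
    num = p * normal_pdf(x, mu1, sigma)
    return num / (num + (1 - p) * normal_pdf(x, mu0, sigma))

def inv_logit(v):
    return 1.0 / (1.0 + math.exp(-v))

# Intercept and slope implied by the derivation.
beta0 = math.log(p / (1 - p)) - (mu1 ** 2 - mu0 ** 2) / (2 * sigma ** 2)
beta1 = (mu1 - mu0) / sigma ** 2

# Largest disagreement between the two expressions over a grid of ages.
max_gap = max(abs(posterior_yes(x) - inv_logit(beta0 + beta1 * x))
              for x in range(20, 81, 5))
print("largest absolute difference:", max_gap)
```

Note that the equal-variance assumption is what makes the quadratic terms in $x$ cancel; with different variances in the two groups the log odds would no longer be linear in $x$.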
