That depends immensely on the type of effect you're looking for.
Within-subject effects (this happens when one does A, but not when doing B) can be fine with small sample sizes, especially if you can repeat variations on A and B many times. This is pretty common in task-based fMRI. Indeed, I'm not sure why you need >2 participants except to show that the principle is relatively generalizable.
Between-subject comparisons (type A people have this feature, type B people don't) are the problem because people differ in lots of ways and each contributes one measurement, so you have no real way to control for all that extra variation.
Precisely, and agreed 100%. We need far more within-subject designs.
You would still, in general, need many subjects to show the same basic within-subject pattern if you want to claim the pattern is "generalizable", in the sense of "may generalize to most people". But depending on exactly what you are looking at and the strength of the effect, you may not need nearly as many participants as in strictly between-subject designs.
Given the generally low test-retest reliability of task fMRI, even in adults, strictly one-off within-subject designs are also not enough for certain claims. One sort of has to demonstrate that the within-subject effect itself is stable. This may or may not be plausible for certain things, but it really needs to be considered more regularly and explicitly.
Between-subject heterogeneity is a major challenge in neuroimaging. As a developmental researcher, I've found that in structural volumetrics, even after controlling for total brain size, individual variance remains so large that age-brain associations are often difficult to detect and frequently differ between moderately sized cohorts (n=150-300). However, with longitudinal data where each subject serves as their own control, the power to detect change increases substantially—all that between-subject variance disappears with random intercept/slope mixed models. It's striking.
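For anyone who hasn't seen it, this is roughly what that looks like; a minimal sketch with statsmodels, where the file name and the columns (subject, age, gmv) are made-up placeholders, not anything from an actual pipeline:

    # Minimal sketch of the random intercept/slope idea, not anyone's actual analysis.
    # File and column names (subject, age, gmv) are hypothetical.
    import pandas as pd
    import statsmodels.formula.api as smf

    # Long format: one row per subject per timepoint
    df = pd.read_csv("longitudinal_volumes.csv")

    # Fixed effect of age on grey-matter volume; a random intercept and a random
    # age slope per subject absorb the stable between-subject differences, so the
    # age effect is estimated mostly from within-person change.
    model = smf.mixedlm("gmv ~ age", df, groups=df["subject"], re_formula="~age")
    result = model.fit()
    print(result.summary())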
Task-based fMRI has similar individual variability, but with an added complication: adaptive cognition. Once you've performed a task, your brain responds differently the second time. This happens when studies reuse test questions—which is why psychological research develops parallel forms. But adaptation occurs even with parallel forms (commonly used in fMRI for counterbalancing and repeated assessment) because people learn the task type itself. Adaptation even happens within a single scanning session, where BOLD signal amplitude for the same condition typically decreases over time.
These adaptation effects contaminate ICC test-retest reliability estimates when applied naively, as if the brain weren't an organ designed to dynamically respond to its environment. Therefore, some apparent "unreliability" may not reflect the measurement instrument (fMRI) at all, but rather failures in how we analyze and conceptualize task responses over time.
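You can see the contamination in a toy simulation (every number below is invented): give each subject a perfectly stable individual signal, add a uniform drop at retest to stand in for adaptation, and the absolute-agreement ICC(2,1) falls well below the consistency ICC(3,1), even though the individual differences never changed:

    # Toy simulation: a uniform "adaptation" drop at retest lowers ICC(2,1)
    # (absolute agreement) even when individual differences are perfectly stable.
    # ICC(3,1) (consistency) ignores the session shift. All numbers are made up.
    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 60, 2                               # subjects, sessions
    trait = rng.normal(0, 1, n)                # stable individual signal
    noise = rng.normal(0, 0.3, (n, k))
    adaptation = np.array([0.0, -0.8])         # everyone's response drops at retest
    X = trait[:, None] + adaptation[None, :] + noise

    # Two-way ANOVA mean squares (Shrout & Fleiss)
    gm = X.mean()
    rm, cm = X.mean(axis=1), X.mean(axis=0)
    MSR = k * ((rm - gm) ** 2).sum() / (n - 1)                  # subjects
    MSC = n * ((cm - gm) ** 2).sum() / (k - 1)                  # sessions
    MSE = ((X - rm[:, None] - cm[None, :] + gm) ** 2).sum() / ((n - 1) * (k - 1))

    icc2 = (MSR - MSE) / (MSR + (k - 1) * MSE + k * (MSC - MSE) / n)  # absolute agreement
    icc3 = (MSR - MSE) / (MSR + (k - 1) * MSE)                        # consistency
    print(f"ICC(2,1)={icc2:.2f}  ICC(3,1)={icc3:.2f}")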
Yeah, when you start getting into this stuff and see your first dataset with over a hundred MRIs, and actually start manually inspecting things like skull-stripping, it is shocking how dramatically and obviously different people's brains are from each other. The nice clean little textbook drawings and other things you see in a lot of educational materials really hide just how crazy the variation is.
And yeah, part of why we need more within-subject and longitudinal designs is to get at precisely the things you mention. There is no way to know if the low ICCs we see now in fact reflect adaptation to the specific task or to task-taking in general, if they reflect learning that isn't necessarily task-relevant adaptation (e.g. the subject is in a different mood on a later test, and this just leads to a different strategy), if the brain just changes far more than we might expect, or all sorts of other possibilities. I suspect if we ever want fMRI to yield practical or even just really useful theoretical insights, we definitely need to suss out within-subject effects that have high test-retest reliability, regardless of all these possible confounds. Likely, finding such effects will involve more than just changes to analysis, but also far more rigorous experimental designs (both in terms of multi-modal data and tighter protocols, etc).
FWIW, we've also noticed a lot of magic can happen too when you suddenly have proper longitudinal data that lets you control things at the individual level.
They are indeed coupled, but the coupling is complicated and may be situationally dependent.
Honestly, it's hard to imagine many aggregate measurements that aren't. For example, suppose you learn that the average worker's pay increased. Is it because a) the economy is booming or b) the economy crashed and lower-paid workers have all been laid off (and are no longer counted)?
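The arithmetic version of (b), just to make the composition effect concrete (salaries are invented):

    # Toy composition effect: the average rises even though nobody got a raise,
    # purely because the lowest-paid worker dropped out of the sample.
    before = [30_000, 40_000, 100_000]
    after = [40_000, 100_000]          # the 30k worker was laid off
    print(sum(before) / len(before))   # 56666.67
    print(sum(after) / len(after))     # 70000.0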
I read that paper as suggesting that development, behavior, and fMRI are all hard.
It's not at all clear to me that teenagers' brains OR behaviours should be stable across years, especially when it involves decision-making or emotions. Their Figure 3 shows that sensory experiments are a lot more consistent, which seems reasonable.
The technical challenges (registration, motion, etc.) seem like things that will improve, and there are some practical suggestions as well (counterbalancing items, etc.).
While I agree I wouldn't expect too much stability in developing brains, unfortunately there are pretty serious stability issues even in non-developing adult brains (quote below from the paper, for anyone who doesn't want to click through).
I agree it makes a lot of sense that the sensory experiments are more consistent; somatosensory and sensorimotor localization results generally seem to be the most consistent fMRI findings. I am not sure registration or motion correction is really going to help much here, though. I suspect the reality is just that the BOLD response is a lot less longitudinally stable than we thought (the brain is changing more often and more quickly than we expected).
Or, if we do get better at this, it will be through more sophisticated "correction" methods (e.g. deep learners that can predict typical longitudinal BOLD changes, allowing such changes to be better "subtracted out", or something like that). But I am skeptical about progress here given the amount of data needed to develop any kind of corrective improvement in cases where longitudinal reliabilities are this low.
===
> Using ICCs [intraclass correlation coefficients], recent efforts have examined test-retest reliability of task-based fMRI BOLD signal in adults. Bennett and Miller performed a meta-analysis of 13 fMRI studies between 2001 and 2009 that reported ICCs. ICC values ranged from 0.16 to 0.88, with the average reliability being 0.50 across all studies. Others have also suggested a minimal acceptable threshold of task-based fMRI ICC values of 0.4–0.5 to be considered reliable [...] Moreover, Bennett and Miller, as well as a more recent review, highlight that reliability can change on a study-by-study basis depending on several methodical considerations.
fMRI usually measures BOLD, changes in blood oxygenation (well, deoxygenation). The point of the paper is that you can get relative changes like that in lots of ways: you could have more or less blood, or take out more/less oxygen from the same blood.
These can be measured themselves separately (that's exactly what they did here!) and if there's a spatial component, which the figures sort of suggest, you can also look at what a particular spot tends to do. It may also be interesting/important to understand why different parts of the brain seem to use different strategies to meet that demand.
I'll co-sign SubiculumCode's comment -- there's a lot of yelling about how bad fMRI is generally, which is not particularly fair to the fMRI research (or at least the better parts of it) or related to the argument.
The BOLD signal, the thing measured by fMRI, is a proxy for actual brain activity. The logic is that neural firing requires a lot of energy, so active neurons will be using more oxygen for their metabolism, and this oxygen comes from the blood. Thus, if you measure local changes in the oxygenation of blood, you'll know something about how active nearby neurons are. However, it's an indirect and complicated relationship. The blood flow to an area can itself change, or cells could extract more or less oxygen from the blood--the system itself is usually not running at its limits.
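A toy back-of-the-envelope using the Fick principle (CMRO2 = CBF x OEF x arterial O2 content) shows how ambiguous this gets. With made-up numbers, a large flow increase plus a modest rise in oxygen metabolism and a smaller flow increase with no metabolic change leave the extraction fraction (and hence, very roughly, the deoxyhemoglobin signal) looking almost identical:

    # Toy Fick-principle sketch: CMRO2 = CBF * OEF * CaO2. Two very different
    # metabolic scenarios can produce nearly the same change in oxygen extraction.
    # Numbers are illustrative only.
    def oef_ratio(cbf_change, cmro2_change):
        """Relative change in oxygen extraction fraction (arterial O2 held fixed)."""
        return (1 + cmro2_change) / (1 + cbf_change)

    # Scenario A: big flow increase, modest rise in oxygen metabolism
    # Scenario B: smaller flow increase, no change in oxygen metabolism
    print(oef_ratio(cbf_change=0.40, cmro2_change=0.10))  # ~0.79
    print(oef_ratio(cbf_change=0.27, cmro2_change=0.00))  # ~0.79
    # Near-identical drops in extraction fraction, despite one region using
    # ~10% more oxygen than the other.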
Direct measurements from animals, where you can measure (and manipulate) brain activity while measuring BOLD, have shown how complicated this is. Nikos Logothetis's and Ralph Freeman's groups, among many others, did a lot of work on this, especially c. 2000-2010. If you're interested, you could check out this news and views on Logothetis's group's 2001 Nature paper [1]. One of the conclusions of their work is that BOLD is influenced by a lot of things but largely measures the inputs to an area and the synchrony within it, rather than just the average firing rate.
In this paper, the researchers adjust the MRI sequences to compare blood oxygenation, oxygen usage, and blood flow and find that these are not perfectly related. This is a nice demonstration, but not a totally unexpected finding either. The argument in the paper is also not "abandon fMRI" but rather that you need to measure and interpret these things carefully.
In short, the whole area of neurovascular coupling is hard--it includes complicated physics (to make measurements), tricky chemistry, and messy biology, all in a system full of complicated dynamics and feedback.
It looks like the developer was so hooked on the idea of making it minimalistic that he forgot to make it a language-learning app. So it's a blob with a backstory. Design with no substance.
Those quizzes are part of the problem. It was so dispiriting to read, even enjoy, the assignment and then get dinged because you couldn’t remember whether the protagonist put on an otherwise irrelevant blue sweater or red jacket.
> It was so dispiriting to read, even enjoy, the assignment and then get dinged because you couldn’t remember whether the protagonist put on an otherwise irrelevant blue sweater or red jacket.
Maybe things have really changed a lot since I was in school, but that was certainly not the type of question that was asked about set works.
The questions were asked such that the more the student got into the book, the higher the mark they were able to get.
Easy questions (everyone gets this correct if they read the book): Did his friends and family consider $protagonist to be miserly or generous?
Hard questions (only those slightly interested got these correct): Examine the tone of the conversation between $A and $B in $chapter, first from the PoV of $A and then from the PoV of $B. List the differences, if any, between the tone in which $A intended his instructions to be received and the tone in which $B actually understood them.
Very hard questions (for those who got +90% on their English grades): In the story arc for $A it can be claimed that the author intended to mirror the arc for Cordelia from King Lear. Make an argument for or against this claim.
That last one is the real deal; answerable only by students who like to read and have read a lot - it involves having read similar characters from similar stories, then knowing about the role of Cordelia, and at least a basic analysis of her character/integrity, maybe having read more works by this same author (they'll know if the mirroring is accidental or intentional), etc.
We were never asked "what color shirt did $A wear to the outing" types of questions (unless, of course, that was integral to the plot - $A was a double-agent, and a red shirt meant one thing to his handler while a blue shirt meant something else).
Did I like the set works? Mostly not, but I had enough fiction under my belt in my final two years of high-school that I could sail through the very difficult questions, pulling in analogies and character arcs, tone, etc from a multitude of Shakespeare plays, social issue fictional books ("Cry, The Beloved Country", "To Kill a Man's Pride", "To Kill a Mockingbird", etc), thrillers (Frederick Forsyth, et al), SciFi (Frederik Pohl, Isaac Asimov, Philip K. Dick), Horror-ish (Stephen King, Dean R Koontz) and more.
With my teenager now, second-final year of high-school, I keep repeating the mantra of "To get high English marks, you need to demonstrate critical thinking, not usage of fancy words", but alas, he never reads anything that can be considered a book, so his marks never get anywhere near the 90% grade that I regularly averaged :-(
The only books he's ever read are those he's been forced to read in school.
> Hard questions (only those slightly interested got these correct): Examine the tone of the conversation between $A and $B in $chapter, first from the PoV of $A and then from the PoV of $B. List the differences, if any, between the tone in which $A intended his instructions to be received and the tone in which $B actually understood them.
I always got As on these ... but the primary reason was that I was good at bullshitting. They are super easy when you are good at bullshitting. The trick is not to care that your answer sounds royally stupid. Then you will get an A.
And all you need is to be able to check those dialogues when writing the test. If you are expecting me to remember those dialogues, then we are back to the expectation that I basically memorized the book.
> Very hard questions (for those who got +90% on their English grades): In the story arc for $A it can be claimed that the author intended to mirror the arc for Cordelia from King Lear. Make an argument for or against this claim.
Again, I got As ... but they were solidly in the "kind of test that convinces you literature class is stupid" category of questions. Unless there is some kind of actual interesting insight to be had, this question just shows how empty the whole exercise is.
I never understood people who say their English writing is "bullshitting". Oh you read the book and offered a plausible interpretation? That sounds to me like learning and creating.
> I never understood people who say their English writing is "bullshitting". Oh you read the book and offered a plausible interpretation? That sounds to me like learning and creating.
I'm skeptical that watwut was as good in HS English lit as he claims to be: see his other response to me - he basically believes that recycling plot points is both meaningless (correct) and a path to high marks in literature (incorrect) :-/
I mean, it's possible but unlikely that plot examination will lead to good marks.
> I never understood people who say their English writing is "bullshitting". Oh you read the book and offered a plausible interpretation? That sounds to me like learning and creating.
It was bullshitting, because it was not even an attempt at a good analysis of what the author actually wrote or meant to write. I know for a fact that I was not trying to analyze the book. When I am actually trying to analyze a text, I think about it differently and treat it differently.
The thought process was closer to what I do when I am joking around; it is just that the result was put into formal language.
It was a bit like a debate club - you know you are making a sleazy, untrue argument, but it is being rewarded, so.
> That sounds to me like learning and creating.
I don't think I learned much of anything about literature. I did create something. I learned something that, imo, schools should teach less: generating BS on command.
> Again, I got As ... but they were solidly in the "kind of test that convinces you literature class is stupid" category of questions. Unless there is some kind of actual interesting insight to be had, this question just shows how empty the whole exercise is.
You are not making much sense.
You got As in the type of question that required demonstration of a broad swath of literature ... but that just shows you how empty the question is?
> You got As in the type of question that required demonstration of a broad swath of literature ... but that just shows you how empty the question is?
Yeah. I think the key to achieving an A is that you must not care and just let the creativity in your brain go. Thinking back, basically I did what LLMs do today. You have to be able to vaguely associate plot points you vaguely remember from books you did not like. You have to be able to write argument-sounding constructions without caring how true they are. Without feeling ashamed that you wrote something meaningless.
It is the kind of question that does not provide any meaningful insight into anything. The answer does not matter except for the grade. It won't give you any insight into literature. It does not demonstrate that you understood something about the book either. That is why it is an empty question - its only purpose is to prove you vaguely remember plot points.
Kids don't read for fun, but they have a vague idea that books are something educational that is generally good to do. These sorts of exercises will only convince them that reading books is both an unfun drag and a meaningless thing to do.
We certainly had questions like that as part of bigger assessments and they were pretty reasonable.
However, some of the teachers at my school also had short pop quizzes meant to ensure that everyone kept up with the reading. These were usually just some details from the assigned chapters and, IMO, often veered into minutiae. One really was about the color of something, and I don't remember it being particularly plot-relevant or symbolic, even if it was mentioned a few times.
It wasn’t a huge part of one’s grade, but I distinctly remember being frustrated that these quizzes effectively penalized me for “getting into” the book and reading ahead.
It's always fun when people point out an LLM's insane responses to simple questions that shatter the illusion of them having any intelligence. But besides just giving us a good laugh when an AI has a meltdown failing to produce a seahorse emoji, there are other times when it might be valuable to discuss how they respond, such as when those responses might be dangerous, censored, or clearly filled with advertising/bias.
I find a lot of the low-key things helpful: I use an app at the same time and place every day, and it’s nice to have a handy one-tap way to open it. It does a decent job organizing photos and letting me search text in screenshots.
If some dogs chew up an important component, the CERN dog-catcher won't avoid responsibility just by saying "Well, the computer said there weren't any dogs inside the fence, so I believed it."
Instead, they should be taking proactive steps: testing and evaluating the AI, adding manual patrols, etc.