The issue is the "sufficiently anonymized" part. Given a large enough number of dimensions, you may be able to identify students well enough.
For example, if you take all students that took course A at time X, course B at time Y, course C at time Z and so on, eventually you might be able to narrow it down to a very small group, perhaps to even a single student.
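To make that concrete, here's a toy sketch (all data invented) of how quickly a few known (course, term) facts narrow the candidate pool:

```python
import pandas as pd

# Toy enrollment records (invented): student 1 took A in F21, B in S22, C in F22.
enrollments = pd.DataFrame({
    "student_id": [1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
    "course":     ["A", "B", "C", "A", "B", "A", "B", "C", "A", "C"],
    "term":       ["F21", "S22", "F22", "F21", "F22", "F21", "S22", "S22", "S21", "F22"],
})

# Each additional known (course, term) fact shrinks the candidate set.
known_facts = [("A", "F21"), ("B", "S22"), ("C", "F22")]
candidates = set(enrollments["student_id"])
for course, term in known_facts:
    matches = set(enrollments.loc[
        (enrollments["course"] == course) & (enrollments["term"] == term),
        "student_id"])
    candidates &= matches
    print(f"knowing course {course} at {term}: {len(candidates)} candidate(s)")
# Prints 3, then 2, then 1 -- three facts single out one student.
```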
How can the researchers simultaneously publish research and not be allowed to testify to their conclusions in litigation, though? It seems clear that this is not a privacy concern but rather a protective measure.
This will probably follow a power law, too, so it isn't unlikely that you could deanonymize someone given just two courses. Not much information is needed to encode a lot of things.
You're attempting to make an argument against any anonymized data being used in research. You'd have to do better than a hypothetical to make headway with it.
Moreover, the logic would have to carry over to the very common practice of anonymizing data in professional communications (like training), which would have HIPAA implications for some students.
The common anonymizing practices have been used for decades without privacy breaches of note. That track record is also what your argument would have to defeat.
What I mentioned applies to safeguards against de-anonymization in the event of public access, i.e. a published research paper or professional notes left behind on a bus.
Anonymizing a huge data set like this is impossible.
Also, the burden of proof is on those that say that the data has no privacy implications, not on those who are like "ehhh, it's probably safe to release this."
> Anonymizing a huge data set like this is impossible.
That depends. The entire dataset can’t be, of course, because it’s everyone’s student records. But you can probably subset it to the extent that it’s still useful, and perturb it enough to protect individuals while keeping it statistically equivalent.
And you could also generate a bunch of aggregate results that do things like identify average grade differences before and after given periods, while correcting for other differences, without including individual identifiers.
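Sketching what that could look like, with made-up numbers and hypothetical column names:

```python
import pandas as pd

# Toy grade data (invented): two schools, before/after some cutoff period.
grades = pd.DataFrame({
    "school": ["X", "X", "X", "Y", "Y", "Y"] * 2,
    "period": ["before"] * 6 + ["after"] * 6,
    "gpa":    [3.1, 2.8, 3.5, 2.9, 3.3, 3.0,
               2.7, 2.5, 3.2, 2.6, 3.0, 2.8],
})

# Only group means and counts are released; no row identifies a student.
summary = grades.groupby(["school", "period"])["gpa"].agg(["mean", "count"])
print(summary)
```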
You're moving goalposts, friend. What you're suggesting and what OP is suggesting are in two completely different categories of disclosure extent. I don't think anybody here is suggesting that no data should be available to the public.
I agree with the person's criticism of your comment.
Yes, obviously there is a level of aggregation where privacy concerns no longer hold.
But there is no trivial transformation that gives education researchers the data they need while preserving anonymity. Education researchers want to aggregate and statistically sample the data in new ways; pre-aggregating it removes most of the ability to do so. If you want to do a principal component analysis of a few variables -- good luck with aggregate data.
If you provide nearly any data at the student-level, there's a pretty high chance that it can be deanonymized.
At the same time, the state's position of attempting to prevent education researchers from participating in litigation (when using only public, non-restricted data) is egregious.
I’m talking about perturbing the micro data enough to allow researchers to answer their questions while remaining analytically valuable.
For example, with school attendance data you could easily release a dataset at the county level with every student’s record, with a unique generated student ID, race, grade, gender, and absences by year (or even month), and still have 5-20 students in each category, enough to show attendance trends before and after Covid without being able to identify individuals. And, if necessary, suppress really unique race or gender instances (e.g., maybe there’s only one trans Native American student in a school) while still being able to make useful findings about general trends.
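A minimal sketch of that suppression rule, with invented data and an invented K=5 threshold:

```python
import pandas as pd

K = 5  # minimum group size for release (threshold is my assumption)

records = pd.DataFrame({
    "student_id": range(12),              # generated IDs, not real ones
    "county":     ["C1"] * 12,
    "race":       ["a"] * 6 + ["b"] * 5 + ["c"],
    "grade":      [9] * 12,
    "gender":     ["f"] * 12,
    "absences":   [3, 1, 4, 0, 2, 5, 1, 1, 2, 0, 3, 7],
})

# Size of each (county, race, grade, gender) cell, attached to every row.
group_cols = ["county", "race", "grade", "gender"]
sizes = records.groupby(group_cols)["student_id"].transform("size")

releasable = records[sizes >= K]   # cells with 5+ students: publish
suppressed = records[sizes < K]    # e.g. the lone student of race "c": withhold
print(f"released {len(releasable)} rows, suppressed {len(suppressed)}")
```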
I don’t know what specific questions they have, but the state not releasing any data to them and claiming privacy is silly.
The Census and the Department of Ed already do this, and the Department of Ed has a very useful description of how they apply privacy protections and validate that data are sufficiently anonymized for public release: https://studentprivacy.ed.gov/sites/default/files/resource_d...
> For example, with school attendance data you could easily release a dataset at the county level with every student’s record, with a unique generated student ID, race, grade, gender, and absences by year (or even month), and still have 5-20 students in each category, enough to show attendance trends before and after Covid without being able to identify individuals.
Yes, for any kind of specific study you want to do, you can form aggregates that support it. Indeed, lots of aggregate data is already released publicly.
If you want to answer questions like --- "what about attendance on Mondays in students receiving subsidized lunch: what does it predict about that student's attendance in the future?" --- you'll either need the real data, or for the state to basically do your data aggregation for your specific question.
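That Monday question, for instance, takes a per-student join across years -- trivial on microdata, impossible on pre-aggregated tables. A toy sketch with invented columns:

```python
import pandas as pd

# Invented microdata: one row per student per year.
rows = pd.DataFrame({
    "student_id":      [1, 1, 2, 2, 3, 3, 4, 4],
    "year":            [2022, 2023] * 4,
    "subsidized":      [True] * 6 + [False] * 2,
    "monday_absences": [8, 10, 1, 2, 4, 3, 0, 1],
    "total_absences":  [20, 25, 4, 6, 12, 10, 2, 3],
})

# Link each student's 2022 Monday absences to their own 2023 totals --
# this per-student reshaping is exactly what aggregates can't give you.
panel = rows.pivot(index="student_id", columns="year")
sub = panel[panel[("subsidized", 2022)].astype(bool)]
print(sub[("monday_absences", 2022)].corr(sub[("total_absences", 2023)]))
```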
The word "anonymized" needs to be excised from our collective vocabulary. "Anonymization" is not a thing that can be meaningfully done to a dataset about individuals. Coarse aggregation is possible, and the only practical way to achieve this end, but this has its own drawbacks in a research context.
I've actually gone through this process with the CDE and I was denied access. The privacy issue is a huge red herring, used to co-opt well meaning people like you.
I requested data about STAR, the California standardized test used for Lowell's admissions. I wanted rows of the form (randomized student ID, STAR question ID, answered correctly), however they were recorded, and literally nothing more.
They rejected the request because (1) they claimed such records didn't exist, which makes no sense, because how exactly did they administer the test, then; and (2) standardized testing is, in their opinion, carved out from the relevant sunshine law.
Why did I want these records? I wanted to show that scoring well on tests and using them to gate admissions doesn't mean what people think it means. Specifically, that if you administered the test Lowell used (STAR) hardest question first, then terminated the test after the student got N (close to 1) questions wrong, you would select nearly the same list of students. Asking the vast majority of students only one question, which they all get wrong, can't possibly measure how much they study, how comprehensive their knowledge is, etc. But these claims are routinely made in defense of the test and its purpose in selecting a class. This is coming from someone who wants test-based admissions.
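The check itself is simple once you have the response records. Here's the shape of it on simulated data -- the logistic response model and every parameter are my assumptions, not anything from STAR; whether the overlap actually comes out high is exactly what the real records would settle:

```python
import math
import random

random.seed(0)
N_STUDENTS, N_QUESTIONS, N_SEATS = 1000, 50, 100

abilities  = [random.gauss(0, 1) for _ in range(N_STUDENTS)]
difficulty = sorted(random.gauss(0, 1) for _ in range(N_QUESTIONS))  # easy -> hard

def p_correct(ability, diff):
    # Logistic (Rasch-style) response model: an assumption, not STAR's actual model.
    return 1 / (1 + math.exp(diff - ability))

answers = [[random.random() < p_correct(a, d) for d in difficulty]
           for a in abilities]

# Rule 1: full test, rank by total score.
full_scores = [sum(row) for row in answers]

# Rule 2: hardest question first, stop at the first miss; score is the streak.
def hardest_first_score(row):
    streak = 0
    for correct in reversed(row):  # hardest question comes first
        if not correct:
            break
        streak += 1
    return streak

hf_scores = [hardest_first_score(row) for row in answers]

def admitted(scores):
    # Top N_SEATS by score; ties broken arbitrarily, as a real cutoff would be.
    return set(sorted(range(N_STUDENTS), key=lambda i: -scores[i])[:N_SEATS])

overlap = len(admitted(full_scores) & admitted(hf_scores)) / N_SEATS
print(f"overlap between the two admitted classes: {overlap:.0%}")
```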
So clearly political, right? I had to carefully word my request around all these conclusions. If you read the CDE's requirements, they really have specific political goals. You either align with them or you don't. I tried to work around that, and I still failed. They just looked at the absence of a political bent and correctly concluded that it wasn't evidence of absence.
If you want to do good, politically impactful educational research: run your own school. That's what the CDE wants you to do. It's not about discovering how to improve public schools.
Not sure it's fair to say it's a red herring, or that I'm "co-opted" like you suggest. Transparency is kind of my main dig -- I get it. I recently helped a small team of researchers with some FOIA requests to get access to information similar to what you were denied.
But at the end of the day it's fundamentally important to understand at what point transparency and privacy intersect.
> But at the end of the day it's fundamentally important to understand at what point transparency and privacy intersect.
"At the end of the day," these conversations about privacy are like 15 minutes long at private schools. People still keep sending their kids to private schools. I just don't know how much it matters.
They surely care about privacy in their internal research and metrics, but they don't employ a full-time Privacist. They might employ someone who checks the right boxes for them and deals with FERPA shit. But because they are aligned with the parents in delivering the best education, for the most part, they are trusted to do what they want with the data, and that sometimes includes inviting outside collaborators to look at it, with nowhere near the same faff as the CDE.
If you're a journalist and you want to help a private school build a better education, then out of the thousands of private schools, one of them will both let you write about it and let you tell them something they don't wanna hear. Some might use privacy or whatever as the reason they don't want to collaborate with you, but on average it will come down to trust.
The CDE is never going to do that. There's only 1 CDE, and they are there to preserve the status quo.
Very similar things happen when investigating criminal cases. There are possibly hundreds or thousands of instances of some type of misconduct or improper arrest... but none of the defense attorneys with those sorts of cases will talk about it with the press, because of the very real potential harms of doing so. Or the ones that do talk are too high-level, or they might have some ulterior motive like self-promotion. It's really hard to express how many issues are a direct result of lawyers understandably, but systematically, not raising any public awareness about truly awful things.
Have you tried getting your data through CPRA requests? I'm out in Illinois, where our law is pretty decent, and I'm not super familiar with CA's public records nuance, but it's really worth a try. What I do know is that California CPRA officers get away with a strange amount of abuse of the law. But even with that, you might be surprised what records are available. If you do submit some requests, don't expect it to be easy or immediate; expect to be stonewalled and to need to sue at some point. But IME public records suits are pretty hands-off (except when they're not...). And most of the lawyers I've worked with are upfront about what they will and won't litigate over.
One thing you'll find is that basically nobody is looking into most of the awful things you'd expect would have eyes on them. It's very likely you'll be the only one doing those requests, or incrementally figuring out how to get what you want through multiple requests over months. But each step breaks new ground and turns into feedback loops if you can build a community around it.
If going until the student gets one or two of the hardest questions wrong is highly predictive of whether they get selected, that implies that students near the selection threshold are getting very few questions wrong, right?
> Asking the vast majority of students only one question, which they all get wrong, can't possibly measure how much they study, how comprehensive their knowledge is, etc.
This seems like a strawman?
Yes, a single question can't measure those things to a high degree of certainty.
But if you have students that do poorly on all the hard questions, and students that do well on all the hard questions, then asking them a single hard question might be 80% predictive of what group they're in.
Why is it bad for that percentage to be high?
The reason the test has lots of questions is specifically to increase the predictive quality. Being able to loosely predict from a small subset of questions seems reasonable to me. It doesn't mean the test is failing to measure the student's knowledge.
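The arithmetic behind that rough 80% figure is just Bayes' rule. With assumed numbers (90% correct for strong students, 20% for weak, equal group sizes -- all invented for illustration):

```python
# Assumed numbers: strong students answer the hard question correctly 90% of
# the time, weak students 20%, and the two groups are equally common.
p_strong, p_weak, base_rate = 0.9, 0.2, 0.5

p_correct = base_rate * p_strong + (1 - base_rate) * p_weak   # P(correct) = 0.55

# Bayes' rule: P(strong | answered the hard question correctly) ~= 0.82
p_strong_given_correct = base_rate * p_strong / p_correct
print(f"P(strong | correct) = {p_strong_given_correct:.2f}")
```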
> I had to carefully word my request around all these conclusions.
Aren't you clearly saying you already had a desired outcome and were just fishing for the data to confirm it? I mean, I wouldn't give you any data in that case either. It's a strong signal that you are motivated by something other than what the data shows.
Do I think that, in principle, data sets can be anonymized? Of course I do.
Your incredulous tone and excessive ellipses seem to imply you find this position ridiculous, so maybe you'd better be a little less snide and a little more expansive on what, exactly, your problem is.
It's because their confidence in deidentifying data doesn't match the significant risk. If you think it's worth it and are willing to take that risk, that's on you and those you risk harming.
That's not strictly true. There's some recent work (as fascinating as it is incomprehensible) on generating datasets that share most aggregate properties with the actual dataset (measured through joint probability distributions), but do not reveal more than some epsilon of information about any individual contained in the original data set.
These have the potential to revolutionize private computation and analysis, as they provide provable (theoretical) hard limits on the amount of information you can learn about individuals, regardless of the type of analysis performed on the proxy dataset.
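For a feel of the underlying mechanism, here's the simplest version of the idea -- the standard Laplace mechanism from differential privacy, not the specific constructions in that work; the data and epsilon are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 1.0          # privacy budget: smaller eps = stronger guarantee (invented)

true_counts = np.array([412, 97, 33, 5])   # e.g. an absences histogram (invented)
sensitivity = 1    # adding/removing one student changes each count by at most 1

# Laplace mechanism: calibrated noise makes the release eps-differentially private.
noisy_counts = true_counts + rng.laplace(0, sensitivity / eps, size=len(true_counts))
print(np.round(noisy_counts, 1))
# Post-processing invariance: no downstream analysis of the noisy release can
# learn more than eps about any single student.
```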