.....do you genuinely think that the data can be sufficiently anonymized to prot...

doctorpangloss · on July 28, 2023

I've actually gone through this process with the CDE and I was denied access. The privacy issue is a huge red herring, used to co-opt well meaning people like you.

I requested data about STAR, the California standardized test used for Lowell's admissions. I wanted rows of the form (randomized student ID, STAR question ID, answered correctly), however they were recorded, and literally nothing more.

They rejected the request because (1) they claimed such records didn't exist, which makes no sense because how exactly did they administer the test then; and (2) because standardized testing is carved out, in their opinion, from the related sunshine law.

Why did I want these records? I wanted to show that scoring well on tests and using them to gate admissions doesn't mean what people think it means. Specifically, that if you administered the test Lowell used (STAR) by hardest question first, then terminated the test after the student gets N (close to 1) questions wrong, you would select nearly the same list of students. Only asking the vast majority of students only e.g. 1 question, which they all get wrong, can't possibly measure how much they study, how comprehensive their knowledge is, etc. But these claims are routinely made in defense of the test and its purpose in selecting a class. This is coming from someone who wants test based admissions.

So clearly political, right? I had to carefully word my request around all these conclusions. If you read the CDE's requirements, they really have specific political goals. You either align with them or you don't. And I tried to work around that, and I stilled failed. They just looked at the absence of a political bent, and correctly concluded that it wasn't evidence of absence.

If you want to do good, politically impactful educational research: run your own school. That's what the CDE wants you to do. It's not about discovering how to improve public schools.

chaps · on July 28, 2023

Not sure it's fair to say it's a red herring, or that I'm "co-opted" like you suggest. Transparency is kind of my main dig -- I get it. Like, I recently helped a small team of researchers with some FOIA requests to get access to similar information you were denied.

But at the end of the day it's fundamentally important to understand at what point transparency and privacy intersect.

doctorpangloss · on July 28, 2023

I appreciate you're sincere about these issues.

> But at the end of the day it's fundamentally important to understand at what point transparency and privacy intersect.

"At the end of the day," these conversations about privacy are like 15 minutes long at private schools. People still keep sending their kids to private schools. I just don't know how much it matters.

They surely care about privacy in their internal research and metrics, but they don't employ a full time Privacist. They might employ someone who checks the right boxes for them and deals with FERPA shit. But because they are aligned with the parents in delivering the best educations, for the most part, they are trusted to do with data what they want, and that sometimes includes inviting outside collaborators to look at it, without anywhere near the same faff as the CDE.

If you're a journalist and you want to help a private school make a better education, out of the thousands of private schools, one of them will both let you write about it and also tell them something they don't wanna hear. Some might use privacy or whatever as the reason they don't want to collaborate with you, but on average it will be about trust.

The CDE is never going to do that. There's only 1 CDE, and they are there to preserve the status quo.

chaps · on July 28, 2023

For sure :)

Very similar things happen when investigating criminal cases. There's possibly hundreds or thousands of instances of some type of misconduct or improper arrest... but none of the defense attorneys with those sorts of cases will talk about it with the press because of the very real potential harms of talking with the press. Or the ones that do talk are too high level.. or they might have some ulterior motive like self-promotion. It's really hard to express how many issues are a direct result of lawyers understandably, but systematically not raising any public awareness about truly awful things.

Have you tried getting your data through CPRA requests? I'm out in Illinois and our law is pretty decent and not super familiar with CA's public records nuance, but it's really worth a try. What I know though is that California CPRA officers get away with a strange amount of abuse of the law. But even with that, you might be surprised what records are available. So if you do submit some requests, don't exactly expect it to be easy or immediate. Expect to be stonewalled, and need to sue at some point though. But IME public record suits are pretty hands-off (except when they're not..). And most of the lawyers I've worked with are upfront about what they will and won't litigate over.

One thing you'll find is that.. basically nobody is looking into most of the awful things you'd expect would have eyes. It's very likely you'll be the only one doing those requests, or incrementally identifying how to get what you want through multiple requests over months. But each step breaks new ground and turns into feedback loops if you can build a community around it.

Dylan16807 · on July 28, 2023

If going until the student gets one or two of the hardest questions wrong is highly predictive of whether they get selected, that implies that students near the selection threshold are getting very few questions wrong, right?

> Only asking the vast majority of students only e.g. 1 question, which they all get wrong, can't possibly measure how much they study, how comprehensive their knowledge is, etc.

This seems like a strawman?

Yes, a single question can't measure those things to a high degree of certainty.

But if you have students that do poorly on all the hard questions, and students that do well on all the hard questions, then asking them a single hard question might be 80% predictive of what group they're in.

Why is it bad for that percentage to be high?

The reason the test has lots of questions is specifically to increase the predictive quality. Being able to loosely predict from a small subset of questions seems reasonable to me. It doesn't mean the test is failing to measure the student's knowledge.

mrguyorama · on July 28, 2023

>I had to carefully word my request around all these conclusions.

Aren't you clearly saying you already had a desired outcome and were just fishing for the data to confirm it? I mean, I wouldn't give you any data in that case either. It's a strong signal that you are motivated by something other than what the data shows.

Dylan16807 · on July 28, 2023

"I wanted to show X" sounds like a normal hypothesis to me.

What's wrong with how they want to use the data? Sort the questions, run the algorithm, see how well the scores match the real scores.

nvy · on July 28, 2023

The specific data? No idea, never seen it.

Do I think that, in principle, data sets can be anonymized? Of course I do.

Your incredulous tone and excessive ellipsis seems to imply you find this position to be ridiculous, so maybe you'd better be a little less snide and a little more expansive on, what, exactly, your problem is.

chaps · on July 30, 2023

It's because their confidence in deidentifying data doesn't match the significant risk. If you think it's worth it and are willing to take that risk that's on you and those you risk harming.

eimrine · on July 28, 2023

I believe that the data can be sufficiently faked to be anonymized.

pjc50 · on July 28, 2023

That also renders it useless!

IX-103 · on July 28, 2023

That's not strictly true. There's some recent work (as fascinating as it is incomprehensible) on generating datasets that share most aggregate properties with the actual dataset (measured through joint probability distributions), but do not reveal more than some epsilon of information about any individual contained in the original data set.

These have the potential to revolutionize private computation and analysis, as they provide provable hard (theoretical) limits on the amount of information you can learn about individuals regardless of the type of analysis performed on the proxy dataset.

mrguyorama · on July 28, 2023

Two datasets that share many aggregate statistics are not interchangeable.

mrangle · on July 28, 2023

Data has been sufficiently anonymized for decades.