A good test might be to give it only about a third of the tests, then, when it says it's done, run its code against the held-out two-thirds and see how well it did. Of course it may have already seen the other tests during training, but that's not relevant here, since the goal is to find out whether it's just "brute force bumbling" its way through the task, relying heavily on the test suite as bumper rails for feedback, or whether it's actually writing generalizable, bug-free code with active awareness of pitfalls and corner cases. (Then again, the result might be invalidated if this specific project was part of the RL training process, which it may well have been: it's low-hanging fruit to convert any repo with a comprehensive test suite into training data.)
Either way, most tasks don't have the luxury of a thorough test suite, as the test suite itself is the product of arduous effort in debugging and identifying corner cases.
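To make the holdout idea above concrete, here is a rough sketch of how the split and the scoring could look. Everything in it is an assumption for illustration: the function names `split_tests` and `pass_rate`, the test IDs, and the made-up results are hypothetical, and actually running the tests against the model's code is left out.

```python
import random


def split_tests(test_ids, visible_fraction=1 / 3, seed=0):
    """Deterministically shuffle test IDs and split them into (visible, holdout)."""
    ids = sorted(test_ids)  # sort first so the split does not depend on input order
    random.Random(seed).shuffle(ids)
    cut = max(1, round(len(ids) * visible_fraction))
    return ids[:cut], ids[cut:]


def pass_rate(results, test_ids):
    """Fraction of the given tests that passed; `results` maps test ID -> bool."""
    if not test_ids:
        return 0.0
    return sum(1 for t in test_ids if results[t]) / len(test_ids)


if __name__ == "__main__":
    # Hypothetical suite of 30 tests; only the visible third is shown to the model.
    tests = [f"test_{i:02d}" for i in range(30)]
    visible, holdout = split_tests(tests)

    # Made-up outcome: every visible test passes, half of the holdout tests pass.
    results = {t: True for t in visible}
    results.update({t: i % 2 == 0 for i, t in enumerate(holdout)})

    print("visible pass rate:", pass_rate(results, visible))
    print("holdout pass rate:", pass_rate(results, holdout))
```

A big gap between the two numbers would suggest the code was fitted to the visible tests rather than written to generalize. Similar rates would point the other way, subject to the training-data caveat above.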
Your Transamerica Pyramid picture is incredible, even among all the really cool pictures you have there. Quite cool to photograph for Wikipedia like this, the world needs more people like you!
That seems to be the case here. I cannot edit or remove it. The HN mods probably took ownership of it so they could decide what to do with it, especially given the provocative nature of the listing.
I edited the title at one point to add a question mark to the end - this is a standard moderation device when commenters start questioning whether a story is accurate. That had the side effect of preventing you from editing it, though this wasn't my intention.
We're now fact-checking the video to verify how real it is. All we know for sure is that it's part of Volume 8 of the dataset from the DOJ and that it was uploaded to justice.gov.