This is very interesting work. I've been looking for something exactly like this to use on a large C application -- specifically to be able to annotate various API's as sources of different kinds of data, checks on how the data types are permitted to be used together, and operations that transform one kind to another. Compared to taint analysis we want to allow more categories than tainted/untainted, and transforming items between categories. Do you have any recommendations for similar tools that work with C?
It seems one of the major downfalls is that the user has to define all sources and sinks. I might have missed it but how do you systematically define/find these? Personally was interested in a similar topic for a thesis and stumbled upon deepcode.ai which started out of ETH Zurich (https://files.sri.inf.ethz.ch/website/papers/scalable-taint-...). Are there any plans or reasons why you would not want such a system?
The article briefly mentions this, although it might not be super clear from the short description - "We regularly review issues reported through other avenues, such as our bug bounty program, to ensure that we correct any false negatives." We rely on these mechanisms to find places where we're missing taint coverage and write sources and sinks as necessary. As of right now, all the annotations are manual.
I hadn't looked too deeply into the literature there, the paper looks really interesting! We don't have any concrete plans to implement such a system, but I don't think there's any fundamental reason we wouldn't want automatic taint model generation. I'll give the paper a read on Monday to learn more :)
Not sure if you can answer this, but what are some classes of security bugs you can find with Pysa? I've only worked on smaller codebases so security I've dealt with is mostly AuthN/AuthZ.
Pysa can find any bug that you can model as a flow of data from one place to another. That includes your standard web app bugs like SQLi, RCE, etc., also some AuthN/AuthZ bugs depending on how you do your checks. Concretely, this is a list of the vulnerabilities Pysa able to catch out of the box without any customization:
https://github.com/facebook/pyre-check/blob/6975ff55fc59b7b9...
You can find most of security issues with Pysa that you can model as a taint flow problem. Examples could be flows to function that enable code execution or shell injection, SQL injection, SSRF, XSS and many others. As long as you can model the security issue in a taint-flow model then Pysa should be able to detect these issues. These are the configuration we share with Pysa where you can find examples of bug categories we detect https://github.com/facebook/pyre-check/blob/master/stubs/tai...
In addition to what Sinan said, we've had success running on fully untyped codebases. These are you some strategies that you can use to get results on untyped codebases:
https://pyre-check.org/docs/pysa-coverage.html
Types will definitely make Pysa find more issues, but you don't need 100% coverage, or really more than the minimal coverage described in that doc I linked, to start finding some issues.
Pysa will try to analyze all functions regardless of whether they have type hints, but it work better if the function under consideration is typed. Namely, without type hints, it won't be able to pick up on tainted method calls or attribute accesses. However, regular function calls, etc. and standard data structures like dicts and lists should still be tracked normally.
How fast does Pysa typically run? If I want to run it as part of my CI system, how much additional time might I expect it to add? Obviously this varies from code base to code base, but I'm curious what the experience at Instagram is like?
For Instagram (millions of LOC), the analysis gives feedback to engineers in about 65 minutes on average - note that this is in the context of a diff run: We compare the results of a run on the base revision to the proposed changes, running the tool once or twice depending on whether we hit the cache. It's hard to say how long it'll take on your repository as it depends on a lot of factors, but hopefully that provides some intuition.
1. Pysa should work without watchman - it shares some code and infrastructure with Pyre, but doesn't need Watchman to complete its analysis.
2. Hopefully the answer to (1) helps here. Pysa shares some code with Pyre, including the parallelization infrastructure
- the same infrastructure that makes Pyre fast interactively makes Pysa fast on large codebases. Living on the Pyre GitHub repo allows Pysa to use the parallelization infra, in addition to the type checking APIs of Pyre as necessary.
Pyre & Pysa try to do a best-effort analysis of Python 2, and supports Python 2 style taint annotations, but most of the code we analyze at Facebook is Python 3.6+.
The reasons turn out to be decently boring here. Zoncolan [0] and Pyre [1], which Pysa shares core libraries with, are also written in OCaml, and the language made sense to use from the perspective of both sharing code and having people who are proficient and comfortable writing in OCaml working on the project.
Thanks! I'll dig into it later, as I've been looking for a tool like his, and it so happens that the details are interesting to me :) (I built my own namedtuple-style library for sum types, go figure). It didn't seem like the easiest reading though - that could make some potential users bounce... Are you working on docs that expose these features in a friendlier way?
edit: Also, was I correct in that it's basically "The stuff that's specified in the PEP and works in mypy should work"?
You can have multi-threaded OCaml (our server actually has two threads), but only one can run at a time currently. We solve this problem by having a multi-process architecture where workers only communicate through a lock-free hash table in shared memory.
Exactly - the LSP portion of pyre works for files you have open in an editor, but might miss changes due to a rebase or files you edit on the terminal. The watchman integration is there to make sure that pyre's aware of changes outside your editor.
Edit: pyre will show type errors for all files in your repository, not just the ones you have open.
Ah, that's nice. So you can change the type of a function in one file that you are editing, and pyre will show you all the other places that are now using the wrong types.