sinancepel's comments

sinancepel · on Aug 7, 2020

One of the authors of the blog post and software engineer working on Pysa here - happy to answer any questions you may have :)

therealcamino · on Aug 7, 2020

This is very interesting work. I've been looking for something exactly like this to use on a large C application -- specifically to be able to annotate various API's as sources of different kinds of data, checks on how the data types are permitted to be used together, and operations that transform one kind to another. Compared to taint analysis we want to allow more categories than tainted/untainted, and transforming items between categories. Do you have any recommendations for similar tools that work with C?

remram · on Aug 14, 2020

Can the CSS be fixed so that the right side of the text fit in the window? https://imgur.com/a/hnRpLZH

hkstm · on Aug 8, 2020

It seems one of the major downfalls is that the user has to define all sources and sinks. I might have missed it but how do you systematically define/find these? Personally was interested in a similar topic for a thesis and stumbled upon deepcode.ai which started out of ETH Zurich (https://files.sri.inf.ethz.ch/website/papers/scalable-taint-...). Are there any plans or reasons why you would not want such a system?

sinancepel · on Aug 8, 2020

The article briefly mentions this, although it might not be super clear from the short description - "We regularly review issues reported through other avenues, such as our bug bounty program, to ensure that we correct any false negatives." We rely on these mechanisms to find places where we're missing taint coverage and write sources and sinks as necessary. As of right now, all the annotations are manual.

I hadn't looked too deeply into the literature there, the paper looks really interesting! We don't have any concrete plans to implement such a system, but I don't think there's any fundamental reason we wouldn't want automatic taint model generation. I'll give the paper a read on Monday to learn more :)

yingw787 · on Aug 7, 2020

Not sure if you can answer this, but what are some classes of security bugs you can find with Pysa? I've only worked on smaller codebases so security I've dealt with is mostly AuthN/AuthZ.

gbleaney · on Aug 7, 2020

Pysa can find any bug that you can model as a flow of data from one place to another. That includes your standard web app bugs like SQLi, RCE, etc., also some AuthN/AuthZ bugs depending on how you do your checks. Concretely, this is a list of the vulnerabilities Pysa able to catch out of the box without any customization: https://github.com/facebook/pyre-check/blob/6975ff55fc59b7b9...

the_storm · on Aug 7, 2020

You can find most of security issues with Pysa that you can model as a taint flow problem. Examples could be flows to function that enable code execution or shell injection, SQL injection, SSRF, XSS and many others. As long as you can model the security issue in a taint-flow model then Pysa should be able to detect these issues. These are the configuration we share with Pysa where you can find examples of bug categories we detect https://github.com/facebook/pyre-check/blob/master/stubs/tai...

dkarp · on Aug 7, 2020

Does Pysa require the code to be fully type hinted? or will it work on non-type hinted code also?

gbleaney · on Aug 7, 2020

In addition to what Sinan said, we've had success running on fully untyped codebases. These are you some strategies that you can use to get results on untyped codebases: https://pyre-check.org/docs/pysa-coverage.html

Types will definitely make Pysa find more issues, but you don't need 100% coverage, or really more than the minimal coverage described in that doc I linked, to start finding some issues.

sinancepel · on Aug 7, 2020

Pysa will try to analyze all functions regardless of whether they have type hints, but it work better if the function under consideration is typed. Namely, without type hints, it won't be able to pick up on tainted method calls or attribute accesses. However, regular function calls, etc. and standard data structures like dicts and lists should still be tracked normally.

cubes · on Aug 7, 2020

How fast does Pysa typically run? If I want to run it as part of my CI system, how much additional time might I expect it to add? Obviously this varies from code base to code base, but I'm curious what the experience at Instagram is like?

sinancepel · on Aug 7, 2020

For Instagram (millions of LOC), the analysis gives feedback to engineers in about 65 minutes on average - note that this is in the context of a diff run: We compare the results of a run on the base revision to the proposed changes, running the tool once or twice depending on whether we hit the cache. It's hard to say how long it'll take on your repository as it depends on a lot of factors, but hopefully that provides some intuition.

cubes · on Aug 7, 2020

That's super helpful. I'm currently at Eventbrite, and we're probably in the same order of magnitude.

devy · on Aug 7, 2020

Hi,

I have 2 questions:

1. The installation doc specified `watchman` as a dependency. Why is that used? without watchman, would pysa not work?

2. Also, why can the pysa become a separate stand alone tool instead of living in the pyre Github repo?

sinancepel · on Aug 7, 2020

1. Pysa should work without watchman - it shares some code and infrastructure with Pyre, but doesn't need Watchman to complete its analysis.

2. Hopefully the answer to (1) helps here. Pysa shares some code with Pyre, including the parallelization infrastructure - the same infrastructure that makes Pyre fast interactively makes Pysa fast on large codebases. Living on the Pyre GitHub repo allows Pysa to use the parallelization infra, in addition to the type checking APIs of Pyre as necessary.

See also our original post introducing Pyre - our goal from the outset was to build a platform for deeper static analyses: https://www.facebook.com/notes/protect-the-graph/pyre-fast-t...

cubes · on Aug 7, 2020

Python 2 and 3 or 3 only? I looked for this on the linked site, and it wasn't immediately obvious.

sinancepel · on Aug 7, 2020

Pyre & Pysa try to do a best-effort analysis of Python 2, and supports Python 2 style taint annotations, but most of the code we analyze at Facebook is Python 3.6+.

galtwho · on Aug 8, 2020

What is the story behind the name?

Was that always the name?

BerislavLopac · on Aug 8, 2020

My guess: it's an acronym, probably for "Python security analysis" or something similar.

sdigital · on Aug 8, 2020

It reminds me of the Spanish word "paisa"...

quattrofan · on Aug 8, 2020

It says in the article...

cheez · on Aug 8, 2020

You guys use OCaml? Why?

sinancepel · on Aug 8, 2020

The reasons turn out to be decently boring here. Zoncolan [0] and Pyre [1], which Pysa shares core libraries with, are also written in OCaml, and the language made sense to use from the perspective of both sharing code and having people who are proficient and comfortable writing in OCaml working on the project.

[0]: https://engineering.fb.com/security/zoncolan/

[1]: https://github.com/facebook/pyre-check

sinancepel · on May 11, 2018

If you're interested in the details of how pyre handles generics, unions, etc., you can take a look at the end-to-end typechecking tests, which provide lots of examples: https://github.com/facebook/pyre-check/blob/master/analysis/...

uryga · on May 11, 2018

Thanks! I'll dig into it later, as I've been looking for a tool like his, and it so happens that the details are interesting to me :) (I built my own namedtuple-style library for sum types, go figure). It didn't seem like the easiest reading though - that could make some potential users bounce... Are you working on docs that expose these features in a friendlier way?

edit: Also, was I correct in that it's basically "The stuff that's specified in the PEP and works in mypy should work"?

sinancepel · on May 11, 2018

You can have multi-threaded OCaml (our server actually has two threads), but only one can run at a time currently. We solve this problem by having a multi-process architecture where workers only communicate through a lock-free hash table in shared memory.

leshow · on May 11, 2018

Is it fair to say that it runs in parallel if only one thread may run at a time? Isn't that concurrent, but not parallel?

platinumrad · on May 12, 2018

Both responses to your original question mention that multiple processes are run.

sinancepel · on May 11, 2018

If you're having trouble with pip, I'd recommend trying `pip3.6 install pyre-check`.

sinancepel · on May 11, 2018

Yes! Pyre adopts types from the typing module, and relies on typeshed[1] to infer types for standard library functions.

[1]: https://github.com/python/typeshed

sinancepel · on May 11, 2018

Exactly - the LSP portion of pyre works for files you have open in an editor, but might miss changes due to a rebase or files you edit on the terminal. The watchman integration is there to make sure that pyre's aware of changes outside your editor.

Edit: pyre will show type errors for all files in your repository, not just the ones you have open.

neuland · on May 11, 2018

Ah, that's nice. So you can change the type of a function in one file that you are editing, and pyre will show you all the other places that are now using the wrong types.