Original: Please don't upload any private or confidential pdfs right now. I emailed OP two security concerns that trivially allow anybody to see any of the converted pdfs.
This is rather less than secure; output files are named, e.g., "Scan_2020512_{four random lower-case letters}.pdf" into a web-server-readable directory.
That gives a total of 456976 (26^4) possible filenames per day. It's more than feasible to brute-force that many filenames in the hour before files get deleted.
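To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. It assumes the attacker already knows the date part of the filename and that the deletion window is exactly one hour; the variable names are just for illustration.

```python
import string

# Four random lower-case letters, as in the example filename.
ALPHABET = string.ascii_lowercase
SUFFIX_LENGTH = 4

# Every possible suffix for a known date prefix.
keyspace = len(ALPHABET) ** SUFFIX_LENGTH

# Assumption: files live for one hour before being deleted.
WINDOW_SECONDS = 60 * 60

# Requests per second needed to try every filename within the window.
required_rate = keyspace / WINDOW_SECONDS

print(keyspace)              # 456976
print(round(required_rate))  # 127
```

Roughly 127 requests per second covers the whole space, and on average a given file would be found in about half that time.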
OP: I don't think randomly-suffixed file names are an inherently bad way to approach this. But you should definitely consider using a longer random string, and definitely consider not using the `random` module too (it is not secure and is not intended to be).
Thank you for the comments. I agree with you. I will reduce how long files stay on the server (I just hit 40 GB from Hacker News) and implement rate limiting to prevent brute-forcing.
Rate limiting (if by that you mean at the firewall or the web server) is not the way to do it. That shifts the problem somewhere else in the stack, into a place that isn't under version control in the same repository.
Consider: If you moved this on to another server, would you remember to enable rate limiting there? If someone else uses your code, will they know to enable rate limiting?
Rate limiting isn't a bad idea, but your security should not depend on it, especially as you have a way of securing it in your application. `base64.b16encode(os.urandom(8))` will give you a 64-bit, filename-safe, as-close-to-random-as-reasonable suffix that should be long enough to make it brute-force-proof :)
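A minimal sketch of what that could look like, not the project's actual code. The function names and the default `prefix` are made up for illustration (the prefix just mirrors the example filename from earlier in the thread); `secrets.token_hex` is shown as a standard-library alternative that also draws from the OS random source.

```python
import base64
import os
import secrets


def random_suffix_b16(num_bytes: int = 8) -> str:
    """Return an upper-case hex suffix built from OS-level randomness."""
    # os.urandom(8) yields 64 random bits; b16encode turns them into
    # 16 filename-safe characters (0-9, A-F). It returns bytes, so decode.
    return base64.b16encode(os.urandom(num_bytes)).decode("ascii")


def random_suffix_secrets(num_bytes: int = 8) -> str:
    """Alternative: same amount of entropy, lower-case hex."""
    return secrets.token_hex(num_bytes)


def output_filename(prefix: str = "Scan_2020512") -> str:
    """Hypothetical helper: build an output name with a hard-to-guess suffix."""
    return f"{prefix}_{random_suffix_b16()}.pdf"


print(output_filename())
```

With 8 random bytes there are 2^64 possible suffixes instead of 26^4, so guessing filenames stops being practical regardless of whether rate limiting is in place.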
The same reasoning applies to the cron job (I presume) that is cleaning your files - that's something you have to remember to set up for future (re-)deployments.
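One way to keep the cleanup under version control is to do it in the application itself. The following is only a sketch under assumptions: `purge_expired` is a hypothetical helper, the one-hour default matches the deletion window mentioned above, and where you call it from (e.g. before handling each upload) is up to you.

```python
import os
import time
from typing import List, Optional


def purge_expired(directory: str,
                  max_age_seconds: float = 3600.0,
                  now: Optional[float] = None) -> List[str]:
    """Delete regular files older than max_age_seconds; return their names."""
    now = time.time() if now is None else now

    # Snapshot the directory first so we are not deleting while iterating.
    with os.scandir(directory) as it:
        entries = list(it)

    removed = []
    for entry in entries:
        # Skip subdirectories and symlinks; only plain files are purged.
        if not entry.is_file(follow_symlinks=False):
            continue
        age = now - entry.stat(follow_symlinks=False).st_mtime
        if age > max_age_seconds:
            os.remove(entry.path)
            removed.append(entry.name)
    return removed
```

Because the retention period lives in the code, a fresh deployment gets the same behaviour without anyone having to remember a separate cron entry.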
Edit: I'd also like to add that showing your code on HN takes bravery and this is, in fact, a neat tool that solves a problem I really wish didn't exist. So, good work on both counts :)
I know this doesn't really add much to the discussion, I just wanted to let you know I really, really appreciate HN over other sites for comments like this. Ones that help you learn something new in a really intuitive and on top of that "non-condescending" way (for lack of a better word I can think of). Thank you!
Hey, I know I was not as positive & encouraging as I should have been initially, hence the edit on the end. But thank you for the kind words mate, that actually means a lot to me. <3
Thanks! It took me quite a while to prepare, as I had read about a bunch of other servers failing catastrophically when posted on HN due to the sheer amount of traffic.
I will start working on your comments over the weekend; I agree with most of them. I would love for you to follow the GitHub page for any other comments you may have. All are appreciated.
It was while reading my above comments that I realised I should have shut up and contributed code instead, because that's definitely more helpful than being critical on HN, especially to a newcomer & their first project.
So that is what I've decided to do! First step: a PR coming out of getting this up and running on my Ubuntu box. :)
Dotenv libraries are just for dev and other similar environments. In production you should still use normal environment variables (or whatever system you use to load your configuration), as dotenv files stay on the filesystem and are sometimes even committed to your SCM.
haha this is like those domain name search websites that just automatically register the good sounding domain names for themselves once the user types it in.
do you OP! I think it still provides a service, enjoy all the secrets
I thought GitHub had hooks for this kind of thing now? I remember it caught a private key I tried to push to a similar Django repo (not for a prod site or anything), and that was about 2 years ago.