I'd check out the OpenCV documentation and examples. This is basically what I use for face recognition in videos[0]. For recognising cars or other objects, you'd probably want to either train your own model or use something like YOLOv3 through OpenCV (example: [1], but you'd need to steal the video-reading code from the first link[0]).
Thanks. I'm also wondering whether there have been any leaps lately, since I guess this is the same way one would have done it a few years ago. Now that you can upload images to multimodal LLMs and chat about them, I'm wondering if there are easier ways (preferably without uploading a million images to the ChatGPT API and paying the cost).
Like, could I avoid training, specifying much, or becoming very knowledgeable in this domain? Are we there yet?
Could I say "detect the frame where each car passes position X in the video, and then grab the frame where the same car passes position Y", and then use the frame difference to work out the speeds? Or would I still have to write loads of code and do training for something like this?
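For what it's worth, the frame-difference part is just arithmetic once something has given you per-car positions. Here's a rough sketch of that step only, assuming a detector plus tracker (not shown) has already produced, for each car ID, a list of (frame index, x pixel) samples. The function names, the track format, the assumption that cars move in the +x direction, and the need to know the video's fps and the real-world distance between X and Y are all mine, not from any particular library.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical input format: one track per car, as (frame_index, x_pixel)
# samples in frame order, produced by some detector + tracker (not shown).
Track = List[Tuple[int, float]]


def first_crossing_frame(track: Track, line_x: float) -> Optional[int]:
    """Return the first frame where the car is at or past line_x after
    having been before it. Assumes the car moves in the +x direction."""
    for (_, prev_x), (frame, x) in zip(track, track[1:]):
        if prev_x < line_x <= x:
            return frame
    return None  # the car never crossed this line


def speed_kmh(frame_x: int, frame_y: int, fps: float, distance_m: float) -> float:
    """Convert a frame difference into km/h, given the video frame rate
    and the real-world distance in metres between the two lines."""
    elapsed_s = (frame_y - frame_x) / fps
    if elapsed_s <= 0:
        raise ValueError("the car must cross Y after it crosses X")
    return distance_m / elapsed_s * 3.6  # m/s -> km/h


def speeds_for_tracks(
    tracks: Dict[int, Track],
    line_x: float,
    line_y: float,
    fps: float,
    distance_m: float,
) -> Dict[int, float]:
    """Compute a speed for every car that crosses line X and then line Y."""
    speeds: Dict[int, float] = {}
    for car_id, track in tracks.items():
        fx = first_crossing_frame(track, line_x)
        fy = first_crossing_frame(track, line_y)
        if fx is None or fy is None or fy <= fx:
            continue  # skip cars that did not cross both lines in order
        speeds[car_id] = speed_kmh(fx, fy, fps, distance_m)
    return speeds
```

With a 25 fps video and lines 20 m apart, a car that crosses X at frame 5 and Y at frame 30 took one second, which works out to about 72 km/h. The resolution is limited by the frame rate, so the estimate gets coarser the faster the car or the shorter the distance between the lines.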
(I know I'm asking for a lot here, just curious what the SOTA is right now.)
[0] https://github.com/ageitgey/face_recognition/blob/master/exa...
[1] https://github.com/deveth0/python-opencv/tree/master/objectD...