Analyzing videos presents a unique challenge because their content is far richer than that of still images. Processing lengthy videos efficiently therefore requires segmenting them into scenes: analyzing individual scenes offers an efficient alternative to analyzing entire videos. To support this approach, representative frames are extracted from each scene, and each frame is processed and analyzed. In this study, we introduce a novel framework for video analysis that employs state-of-the-art open-source models: Contrastive Language–Image Pre-training (CLIP) for image classification, the Bidirectional and Auto-Regressive Transformers (BART) language model for text classification, the Whisper model for speech recognition, and the YAMNet audio model for audio classification. The proposed methodology starts from a video and harnesses audio features to segment it into scenes. A multimodal approach is then applied to each scene, capturing audio features, text features through Automatic Speech Recognition (ASR), and image features through CLIP-based classification. The fusion of these multimodal features enhances the video intelligence process, making it a highly effective approach. This approach extends to a variety of video intelligence tasks, from surveillance applications to comprehensive video analytics. By capitalizing on open-source foundation models and leveraging audio and text features, our framework offers a versatile solution to the intricate task of video analysis, catering to a multitude of real-world applications. We publish the code at https://github.com/akewarmayur/VideoIntelligence for future consideration and study.
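The fusion step described above can be sketched in a minimal form. This is a hypothetical illustration, not the paper's actual implementation: it assumes that each modality (CLIP on frames, BART on ASR transcripts, YAMNet on audio) has already produced per-label scores for a scene, and shows one simple late-fusion scheme that combines them with per-modality weights. All function names, labels, weights, and scores below are illustrative.

```python
# Hypothetical late-fusion sketch for per-scene, per-label scores.
# Assumes each modality yields a {label: score} dict for the scene;
# names and numbers are illustrative, not from the paper.

def fuse_scene_scores(modality_scores, weights=None):
    """Combine per-label scores from several modalities into one ranking.

    modality_scores: dict mapping modality name -> {label: score}
    weights: optional dict mapping modality name -> weight (default 1.0)
    Returns {label: fused_score}, normalized by the total weight.
    """
    weights = weights or {}
    fused = {}
    total_weight = 0.0
    for modality, scores in modality_scores.items():
        w = weights.get(modality, 1.0)
        total_weight += w
        for label, score in scores.items():
            fused[label] = fused.get(label, 0.0) + w * score
    return {label: s / total_weight for label, s in fused.items()}


# Illustrative per-scene scores (made-up numbers).
scene_scores = {
    "clip_image": {"explosion": 0.7, "crowd": 0.2},
    "bart_text": {"explosion": 0.6, "crowd": 0.3},
    "yamnet_audio": {"explosion": 0.9, "crowd": 0.1},
}
# Weight the audio modality more heavily, e.g. for audio-driven events.
fused = fuse_scene_scores(scene_scores, weights={"yamnet_audio": 2.0})
top_label = max(fused, key=fused.get)
```

Under this weighting, the audio evidence dominates and the scene is assigned its highest fused-score label; other fusion schemes (max-pooling, learned weights) would slot into the same interface.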