Optasia: A Relational Platform for Efficient Large-Scale Video Analytics
Yao Lu, Aakanksha Chowdhery, Srikanth Kandula
Microsoft
ACM SoCC 2016. PDF

A publicly available version of Optasia is coming via Azure Data Lake.
A tutorial can be found here.

Camera deployments are ubiquitous; however, existing methods to analyze video
feeds from many cameras do not scale and are error-prone. For example, the code
below extracts SIFT keypoints from many video frames.

Spark script to extract SIFT keypoints

import logging
import sys

import cv2
import numpy as np
from pyspark import SparkContext
from pyspark.sql import Row, SQLContext


def extract_sift_features(imgfile_imgbytes):
    """Decode one (filename, bytes) record and return its SIFT descriptors."""
    try:
        imgfilename, imgbytes = imgfile_imgbytes
        nparr = np.frombuffer(imgbytes, np.uint8)
        img = cv2.imdecode(nparr, 0)  # 0 = decode as grayscale
        # OpenCV 2.4 API; newer OpenCV releases expose cv2.SIFT_create() instead.
        extractor = cv2.SIFT()
        kp, descriptors = extractor.detectAndCompute(img, None)
        return [(imgfilename, descriptors)]
    except Exception as e:
        # Log and skip frames that fail to decode or process.
        logging.exception(e)
        return []


if __name__ == "__main__":
    try:
        image_seqfile_path = sys.argv[1]
        feature_parquet_path = sys.argv[2]
        partitions = int(sys.argv[3])
    except (IndexError, ValueError):
        print("Usage: spark-submit sift_extraction.py "
              "<image_input_path> <feature_output_path> <partitions>")
        sys.exit(1)

    sc = SparkContext(appName="sift_extractor")
    sqlContext = SQLContext(sc)

    images = sc.sequenceFile(image_seqfile_path, minSplits=partitions)

    features = images.flatMap(extract_sift_features)
    # Drop images for which no descriptors were computed.
    features = features.filter(lambda x: x[1] is not None)
    features = features.map(lambda x: Row(fileName=x[0], features=x[1].tolist()))
    featuresSchema = sqlContext.createDataFrame(features)
    featuresSchema.registerTempTable("images")
    featuresSchema.write.parquet(feature_parquet_path)

        

To run this script, one must upload the images to Hadoop-compatible storage
(Swift), specify the degree of parallelism (e.g., 100), and run:
spark-submit --executor-memory 8g sift.py swift://spark.swift1/images.hseq ...
swift://spark.swift1/images.parquet 100
There is no guarantee that the resulting execution is optimal.

We present a system, Optasia, that is friendly to end-users while executing the
script efficiently on the cluster. End-users do not need to worry about the degree
of parallelism, task scheduling, etc. The analytic units are highly modular,
decoupling the roles of algorithm and application engineers.

Optasia script to extract SIFT keypoints

USING Optasia;
 
images =
    EXTRACT id : int, 
            frame : string 
    FROM SPARSE STREAMSET @"images/" 
    USING ImageExtractor();
 
feat =
    PROCESS images
    USING SIFTProcessor()
    PRODUCE id,
            feature;
 
OUTPUT feat
TO @"features.txt"
USING DefaultTextOutputter();
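
The separation of roles behind this script can be illustrated outside Optasia with a minimal Python sketch. It is not Optasia's or SCOPE's API: the registry `PROCESSORS`, the `register` decorator, the `process` function, and the `ByteHistogram` processor are all hypothetical names chosen for illustration, and a trivial byte histogram stands in for a real feature extractor such as SIFT. The algorithm engineer writes and registers a per-frame function; the application engineer only names the processor to apply to a set of rows, much as the script above names `SIFTProcessor()`.

```python
# Hypothetical sketch: none of these names belong to Optasia or SCOPE.
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

ImageRow = Tuple[int, bytes]  # (id, frame), mirroring the EXTRACT schema

# Registry populated by algorithm engineers.
PROCESSORS: Dict[str, Callable[[bytes], List[int]]] = {}


def register(name: str):
    """Decorator that publishes a per-frame function under a processor name."""
    def wrap(fn: Callable[[bytes], List[int]]):
        PROCESSORS[name] = fn
        return fn
    return wrap


# --- Algorithm engineer's side -------------------------------------------
@register("ByteHistogram")
def byte_histogram(frame: bytes) -> List[int]:
    """Stand-in for a real extractor: a 4-bin histogram of byte values."""
    bins = [0, 0, 0, 0]
    for b in frame:
        bins[b // 64] += 1
    return bins


# --- Application engineer's side -----------------------------------------
def process(rows: Iterable[ImageRow], using: str) -> Iterator[Tuple[int, List[int]]]:
    """Apply the named processor to every row, producing (id, feature)."""
    fn = PROCESSORS[using]
    for row_id, frame in rows:
        yield row_id, fn(frame)


images = [(1, bytes([0, 10, 200])), (2, bytes([64, 65, 255, 255]))]
feat = list(process(images, using="ByteHistogram"))
print(feat)
```

In this sketch the application code never touches the extraction logic, so a processor can be swapped by changing only the `using` argument. How the rows are partitioned and scheduled is left out entirely, since in Optasia those decisions are made by the system rather than the script author.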

        

Optasia applies query optimization to generate efficient execution plans, so
that resource usage is greatly reduced. More details are described in the paper.

Click here for example dataflow and queries used in the paper.
Click here for example SCOPE wrappers in C#.