Very interesting work here that uses recurrent neural network ideas to predict next frames in a video sequence. It’s amazing how many times LSTM pops up these days. Unsupervised learning is one of the most interesting areas of machine learning at the moment and the potential is seemingly unlimited. This is another example of using LSTM for understanding video representations using LSTM. It’s a fascinating area.
Some every interesting software from Facebook’s AI Research that implements segmentation and labelling of images. Code is available on GitHub that uses Torch as its AI engine. Could be a good addition to rtndf as part of a video pipeline. Even if the segmentation and labelling is slower than real time, it’s possible to use a bypass system to keep the frame rate up while also processing selected key frames. This is done by the OpenFace PPE already. As things may move between key frames in a video pipeline, a strategy might be to buffer frames after the first key frame until the results from the second key frame are available and interpolate the segmentation results for the intermediate frames. Then, the buffered frames can be played out at the correct rate. Obviously this adds latency but might be acceptable in some situations.
Yes, that is me waving my Taylor (made in San Diego 🙂 ) guitar around in a very careless manner. It’s all in a good cause though. Turns out that Inception-v3 is very good at recognizing acoustic and electric guitars. I put together a new rtndf PPE called recognize based on the code here from the TensorFlow repo.
In its simplest mode, the recognize PPE takes an incoming video stream and tries to recognize an object in the entire frame. If it finds something, it adds a label in the bottom left corner of the image and uses that to generate a new output stream. That’s ok, but what’s more interesting is when it works with another PPE, modet. modet detects moving objects in the stream and draws a box around them. It now also adds metadata to the outgoing pipeline messages that can be used by downstream PPEs to do something with the regions where motion has been detected.
recognize can work in a mode where it uses the modet metadata to recognize moving objects in the stream. The screen capture with the guitar is an example. That’s why I was waving it around – it had to be in motion to get detected and recognized. The box is that big because I am in motion too! However, Inception-v3 seems quite able to recognize the dominant object in the image segment. While there is only one recognized object in this example, if there were more regions they would be individually recognized.
Of course, the example data set for Inception-v3 only knows so many things, guitars being an example. However, something I want to use this for is to detect a UPS truck coming up the drive. I’ll probably have to try retraining the final layer to do this.
I am currently reading Python Machine Learning as I wanted to know more about scikit-learn, amongst other things. It’s a very practical guide with just enough theory to make sense of it all. A lot of machine learning books dive pretty deep into the theory, which is great if that’s what you want. On the other hand, if the idea is to get doing something fast, this book seems like a great place to start. It’s always easier to delve into theory when its relevance is clear and there’s nothing like actually writing and running code to get a feel for relevance.
I am currently working with TensorFlow and I thought it’d be interesting to see what kind of performance I could get when processing video and trying to recognize objects with Inception-v3. While I’d like to get TensorFlow integrated with some of my Qt apps, the whole “build with Bazel” thing is holding that up right now (problems with Eigen includes – one day I’ll get back to that). As a way of taking the path of least resistance, I included TensorFlow in an inline MQTT filter written in Python. It subscribes to a video topic sourced from a webcam and outputs recognized objects in the stream.
As can be seen from the screen capture, it’s currently achieving 11 frames per second using 640 x 480 frames with a GTX 970 GPU. With a GTX 960 GPU, the rate falls to around 8 frames per second. This is pretty much what I have seen with other TensorFlow graphs – the GTX 970 is about 50% faster than a GTX 960, probably due to the restricted memory bus width on the GTX 960.
Hopefully I’ll soon have a 10 series GPU – that should be an interesting comparison.
This one looks quite a bit nicer than my previous attempt at this design! The functionality is the same but now a lot of the heavier processing has been moved into a new infrastructure that’s been developed to integrate artificial intelligence and machine learning functions into data flows very efficiently. Now I am able to leverage Apache NiFi‘s extensive range of processors to interface to all kinds of things but also escape the JVM environment to get bare metal performance for the higher level functions including access to GPUs and things like that. In this design I am just using NiFi’s MQTT and Elasticsearch processors but it could just as easily fire processed data into HDFS, Kafka etc.
I am working with a heavily modified version of the OpenFace real-time web demo and needed to be able to detect an unknown face. The standard version of the demo, once trained with at least two people, always chooses the best fit even if it isn’t very good. Turns out it isn’t too hard to modify it to get extra information that’s helpful in making this extra decision.
If you look at demos/web/websocketserver.py in the OpenFace GitHub repo, line 243 looks like this:
self.svm = GridSearchCV(SVC(C=1), param_grid, cv=5).fit(X,y)
To get the extra information, the estimator constructor needs to be changed to:
self.svm = GridSearchCV(SVC(C=1, probability=True), param_grid, cv=5).fit(X,y)
This tells the estimator to return probabilities in addition to the best identity. The predictor is called in line 306:
identity = self.svm.predict(rep)
However, due to the change in the estimator, this new call will return probabilities:
probs = self.svm.predict_proba(rep)
There will be one entry in the probs list for each identity. The best match is obviously the one with the highest probability. I am thresholding this at 0.85 which seems to work reasonably well. If the best probability is lower than 0.85, I set the identity to -1 (=Unknown). Otherwise, identity is set to the index in the probs list with the highest value.