
Image recognition basics – AI for Dummies (1/4)



And why Sarah Connor better watch out…

Much ink has been spilled on Artificial Intelligence, and at times it can feel a bit overwhelming. You have those who tell you AI is going to revolutionize the world, those who tell you AI will dehumanize society, and those who tell you it’s just a fad that will pass.

To celebrate the end of 2017, we thought we’d give you an early present: a 4-part series on Artificial Intelligence applied to images. The objective? Help you navigate those troubled waters with a bit of background, and more importantly, cut through all the mumbo jumbo. How was it done before? What led to the revolution we see today? How does it work? What are the main challenges? All you need to know to impress your guests this Christmas!

So without further ado let’s dive right in…


What Did You Say?

Although it went by less sexy names, Artificial Intelligence applied to images has existed since the 60s: first as image recognition, then as computer vision. What exactly is computer vision, you might be wondering?

Computer Vision is the art and science of making computers understand images.

You might not realize it, but your brain is a beautiful machine. From a single picture, you can retrieve more information than we know what to do with. Have a look at the picture below.

Barry is a cool dog, he likes surfing in Hawaii.

If I were to ask you what’s in the image, you would probably tell me there’s a dog, on the beach, with some kind of bodyboard, wearing red sunglasses and a Hawaiian necklace made of fake flowers with white thread to link them… and so on and so on.

Well, spoiler alert: the day a computer can reach this level of both precision and generality at the same time has not come yet. Fortunately for us (otherwise we’d be out of business), there are already some practical use cases where computer vision proves highly valuable.

Tell Me What You See

So, what do we teach computers then? Simple: recognize, identify and locate objects with different levels of precision. Barry and his friend Ducky will show you what I mean. For simplicity’s sake, I will illustrate the four main tasks used today in real-world applications.

Classification on the left: we’re pretty sure there’s only a dog and no cat. Tagging on the right: there’s both a dog and a duck.

The first and most straightforward task we can accomplish is to identify what is in an image and how sure of it we are (that’d be the probability percentages shown in the two pictures above). For this, there are two main points you need to consider:

  • What is the list of objects you want to detect? That’s what is called the ontology. In the first image, it’s cats and dogs. To keep it very simple, you need to tell the algorithm what classes of objects it should identify beforehand. And as with all things simple… it’s actually more complicated than that. You don’t always have to list all objects, but this is an open area of research called Unsupervised Learning so we’ll steer clear of it for the time being.
  • Are there multiple objects in the same picture? That’s a very significant distinction. If only one item can be present at a time, we call it Classification (left). Otherwise, when several objects can be found in the same picture, it’s what’s known as Tagging (right). The short sketch below shows how this difference plays out in practice.
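
To make the distinction concrete, here is a minimal Python sketch (my own illustration, not something from the original article; the logit values are made up). Classification typically squashes the raw scores through a softmax, so the classes compete and their probabilities sum to 1, while tagging runs each class through its own independent sigmoid, so a dog and a duck can both score high at once.

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(scores - scores.max())
    return e / e.sum()

def sigmoid(scores):
    return 1.0 / (1.0 + np.exp(-scores))

logits = np.array([2.0, -1.0])  # raw scores for [dog, cat], made up

# Classification: probabilities compete and sum to 1 (one winner).
print(softmax(logits))  # ≈ [0.95, 0.05]

# Tagging: each class gets its own independent yes/no probability,
# so several objects can be "present" in the same picture.
print(sigmoid(logits))  # ≈ [0.88, 0.27]
```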
Detection on the left: we know in which box of the image Ducky and Barry are. Segmentation on the right: we have the information at the pixel level.

Now that we’ve answered the What, the question becomes: where are the objects we’re looking for? There are two ways to do it:

  • Detection outputs the rectangle on the image (also called a bounding box) where the objects are. It can be prone to small errors and imprecision in position, but it’s a very robust technology.
  • Segmentation goes one step further. For each pixel (the most atomic element of information in an image), we identify which object, if any, it belongs to. The result is a very precise map, although it requires a lot of carefully annotated data. That’s a tedious task when you have to do it for every pixel, but it’s one that can deliver impressive results… This is one of the reasons why use cases in healthcare, in particular cancer detection, are becoming more and more widespread. The sketch right after this list contrasts the two outputs.
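
To give you a feel for the difference, here is a small Python sketch (all sizes and coordinates are illustrative assumptions, not measurements from the images above). A detection output is just four numbers per object; a segmentation output is a full per-pixel label map, which is strictly more informative but much more expensive to annotate.

```python
import numpy as np

height, width = 480, 640  # assumed image size

# Detection: one bounding box per object, as (x, y, box_width, box_height).
barry_box = (130, 210, 160, 130)  # coarse, but robust to obtain

# Segmentation: one label per pixel (0 = background, 1 = dog, 2 = duck).
mask = np.zeros((height, width), dtype=np.uint8)
mask[210:340, 130:290] = 1  # the pixels that belong to Barry

# A bounding box can always be recovered from a mask...
ys, xs = np.nonzero(mask == 1)
recovered_box = (xs.min(), ys.min(),
                 xs.max() - xs.min() + 1, ys.max() - ys.min() + 1)
print(recovered_box)  # -> (130, 210, 160, 130)

# ...but the mask cannot be recovered from the box, which is why
# segmentation demands painstaking pixel-level annotation.
```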

Those are the four main building blocks of computer vision, but you also have: instance identification, face key-point detection, action recognition, tracking, optical character recognition, image generation, style transfer, denoising, depth estimation, 3D reconstruction, motion estimation, optical flow, etc. You get the idea; there’s lots and lots to do!

Behind The Curtain

Arthur C. Clarke (who you might know because he wrote 2001: A Space Odyssey) said it better than anyone else: “Any sufficiently advanced technology is indistinguishable from magic.” My spin on this quote is that until you explain exactly how something works, it will never be completely understood and accepted. It’s especially true when it comes to AI. Once you start peeling the onion, you realize it’s just another technology with its strengths and weaknesses. You shouldn’t be scared of it any more than you are of electricity.

Magic or trickery? Sometimes you shouldn’t believe what your eyes tell you. Especially if Weird Al is around.

That’s a lot of talking, so let’s get down to business: how did it use to work? I say “used”, even though it’s still the de facto standard in some fields and industries.

The real game changer, and the most fundamental difference between traditional computer vision and what’s now called deep learning, lies in how you build the algorithms.

  • The new ways. With deep learning, everything relies on examples. You need a collection of dog and cat images; the algorithm then builds on the images you’ve given it to make predictions on pictures it has never seen before. This is what’s called generalization. Also, big warning ⚠️ 🚨: you should always be very suspicious of people who speak of algorithms as if they were sentient beings with motives, which is what I just did. Just because they appear to learn like we do doesn’t mean they are actually able to think.
  • The old ways. On the other hand, traditional computer vision is mostly rule-based. What this means is that you look at images of what you want to detect, and then you use your imagination and logical thinking. The objective? Design a set of rules and instructions that will lead to the result you’re looking for. The toy example below contrasts the two approaches.
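
Here is a toy contrast in Python, with every feature, threshold, and label made up for illustration (real systems use far richer inputs, and deep networks learn their features rather than taking hand-picked ones). The point is where the rule comes from: in the old way you write it yourself; in the new way the model fits it from labeled examples.

```python
from sklearn.tree import DecisionTreeClassifier

# The old ways: a rule you designed yourself by staring at examples.
def is_dog_rule_based(has_fur, ear_length_cm, barks):
    return has_fur and ear_length_cm > 4 and barks

# The new ways: a model derives its own rule from labeled examples
# (a tiny decision tree stands in here for a deep network).
X_train = [[1, 6, 1], [1, 3, 0], [0, 0, 0], [1, 5, 1]]  # [fur, ears, barks]
y_train = [1, 0, 0, 1]                                  # 1 = dog, 0 = not dog
model = DecisionTreeClassifier().fit(X_train, y_train)

# Generalization: a prediction on an example the model has never seen.
print(model.predict([[1, 7, 1]]))  # -> [1]
```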

Put Your Seatbelt On

Rules and instructions: not really crystal clear, is it? Let’s have a look at an example.

Wouldn’t it be sweet if you didn’t have to stop at the highway toll booth to pay your fare? What about making sure your crazy neighbor stops speeding down the road when your kids are playing? And what if your garage door could automatically recognize you and open itself? What you need is first to detect license plates, and then to be able to read them.

For now let’s focus on the detection part. There are six main steps that I’m going to illustrate using our company car, the Deepomobile:

The Deepomobile 🚗 at its finest, all prepped up for its grand debut in license plate detection.
  • Step 1. Nothing much to say here except that it’s very convenient to go from point A to point B and get all the heads to turn.
  • Step 2. Here two actions are performed at the same time. First, we transform the image to black and white by merging the red, green, and blue channels. Then, we blur it to remove small artifacts and detect more general shapes.
  • Step 3. The gradient magnitude is computed. Put simply, the gradient is the difference between two adjacent pixels. The higher it is, the more different the pixels are, which is why it’s used to detect edges.
  • Step 4. Non-maximum suppression ensures that even if one edge spans multiple pixels, we only keep the most likely line.
  • Step 5. Hysteresis thresholding reinforces this and provides clean-cut edges.
  • Step 6. The edges are converted to geometric lines, which are in turn used to detect the rectangular shape of the license plate, as in the sketch right after this list.
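
For the curious, here is what steps 2 through 6 might look like in a few lines of OpenCV. This is a sketch under assumptions: the file name is made up, the Canny thresholds (50/150) and the area cutoff are typical starting values you would tune, and cv2.Canny conveniently bundles steps 3, 4, and 5 together.

```python
import cv2

image = cv2.imread("deepomobile.jpg")  # hypothetical input image

# Step 2: merge the red, green, and blue channels into grayscale,
# then blur to suppress small artifacts and keep the general shapes.
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blurred = cv2.GaussianBlur(gray, (5, 5), 0)

# Steps 3-5: Canny computes the gradient magnitude, applies
# non-maximum suppression, and finishes with hysteresis thresholding.
edges = cv2.Canny(blurred, 50, 150)

# Step 6: turn the edges into contours and keep the four-sided ones,
# i.e. candidate license plate rectangles (OpenCV 4 return signature).
contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
for contour in contours:
    perimeter = cv2.arcLength(contour, True)
    approx = cv2.approxPolyDP(contour, 0.02 * perimeter, True)
    if len(approx) == 4 and cv2.contourArea(approx) > 1000:
        x, y, w, h = cv2.boundingRect(approx)
        print(f"Candidate plate at x={x}, y={y}, w={w}, h={h}")
```

Count the magic numbers in there: the blur kernel size, both Canny thresholds, the polygon tolerance, the area cutoff. That is exactly the tuning problem described below.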

Each step has its own set of parameters and needs specific tuning. As a consequence, traditional computer vision techniques are not always reliable when conditions change. For instance, if we designed our license plate detector to work in a garage, then using it outside, in the presence of shadows, at night, or in broad daylight, might yield less than optimal results, rendering it useless.

Long story short, we used to design specific and tailored recipes for each computer vision task. Now, with deep learning, we build algorithms that learn to make their own rules.

That’s it for part 1! Next week we’ll go over how we do things nowadays and what the term deep learning really means. Stay tuned!


If you like what you read, give us a clap 👏, share it with your friends and if you want to make sure not to miss the next one, subscribe with the form below 👇👇👇!

https://upscri.be/edd62c/
