[Boards: 3 / a / aco / adv / an / asp / b / biz / c / cgl / ck / cm / co / d / diy / e / fa / fit / g / gd / gif / h / hc / his / hm / hr / i / ic / int / jp / k / lgbt / lit / m / mlp / mu / n / news / o / out / p / po / pol / qa / r / r9k / s / s4s / sci / soc / sp / t / tg / toy / trash / trv / tv / u / v / vg / vp / vr / w / wg / wsg / wsr / x / y ] [Home]
4chanarchives logo
Coding run time efficiency
Images are sometimes not shown due to bandwidth/network limitations. Refreshing the page usually helps.

You are currently reading a thread in /g/ - Technology

Thread replies: 37
Thread images: 1
File: salad lady.jpg (94 KB, 900x673) Image search: [Google]
salad lady.jpg
94 KB, 900x673
Got a coding (run time) efficiency question here, /g/. Very long, sorry.

So I made this small program in java since it's the language I'm most comfortable with. Basically what it does is compare a sample small part of an image (say, 10x10 pixels, like on the right) with every other image in a database, looping through each pixel in each image looking for an exact match (like on the left). Uses the BufferedImage class and .getRGB() etc.

The database is currently just a folder with subfolders. The program first makes an arraylist of file names (with C:/Database/[foldername], etc) and then goes through each picture. The picture files are named after the related link it can be found at. (Eg. The google logo would be at location (C:/Database/google.com/logo.png, if an image were from google.com/subsite/somepic.png it'd be called C:/Database/google.com/subsite/somepic.png).

So, it works. The only problem is, there's thousands of pictures (at most 100k+, but most of them under 100x100 pixels) it needs to potentially iterate through, and this is related to a contest (details not important, it's legal and everything), so time matters. I know someone made a similar program that can find the answer in under a minute, and mine can take a long time - say 30 minutes + (probably more, I forget the exact amount), depending on how many images it checks before the match.

Now the tips this person gave me were that he used an actual database (MongoDB) and he used Python. Also, apparently he made it a lot faster by checking if the colors in the sample were similar to the colors in the image and it works that way. But I can't figure out how he did that, since template sizes/image sizes/range of colors used in the match image are all variable.


CONTINUED SLIGHTLY ON NEXT POST (sorry this is so long, I'm aware there's a decent chance no one will even read this).
>>
My main questions here are, generally speaking, how to improve the efficiency of this program. I know java is often seen as clunky and slow, but would moving to another language be that much faster? Would C be fine? I don't know any python, and I know a bit of C, so I could try using that. Think the database is much faster than a filesystem? & Any ideas on that last part, the "checking for color similarity"?


I appreciate any input.
>>
No answer OP but I will still give you a bump
>>
C could make it potentially faster,
but first of all you have to get your algorithm right,
a lot of searches can be skipped,
look for something like text search algorithms, since they are similiar, but just in 2d, and then think about how you could apply that knowledge to 3dimensions.
Just think of more and more ways of how you know when to skip parts of the image, and you should be able to make it a lot faster.
Parallelize the process for each image (too bad you can't just spawn 100k threads in java)
Also java sucks in general, just switch for your own good.
Asking online for a contest isn't a really great idea either.
>>
>>51274705
In all honesty, it's a contest for a kids game. I'll barely use the program anyway, this is 95% just for practice. That's why I said the details aren't important.

I appreciate the advice a lot though, I kind of just used the most basic/first algorithm I thought of.
>>
>>51274602
>>51274705
This. You need to be able to check about 2000 pictures per second if you're going to beat the colors dude.

Thresholds and color difference are used a lot in cv stuff, but you don't have time for that.

My suggestion would be look at a specific arrangement of pixels in each image, say and 'x' in each corner and a cross through the center. This should rule out quite a few images, since a perfect pattern match is very unlikely in photgraphs. If you get a match, then you can use your current algo or something else.
>>
>>51274609
I have never seen a better use case for cuda

Too bad 3 people in the universe actually use it for something useful.
>>
>>51274763
Actual software engineer here

depending on how much you care, you should put your images into buckets with bloom filters, then implement your search in parallel over all of your buckets
>>
why don't you just generate a hash for all your images and then compare the hashes?
>>
>>51274989
No idea what that is honestly, but I'll look it up I suppose.

>>51275017
I've heard of bloom filters but know nothing about them, will look into it as well.

>>51275026
>>51274971
Pretty sure you're both misunderstanding, the image I'm searching with (the template) is only a tiny part of the larger image. Say 10 pixels of the 100 pixel main image.
>>
>iterating through every object to find things
Why cant you just use a hash table?
>>
>>51275426
Can I? I'm honestly getting really confused now. You understand what I'm trying to do, right?

Just trying to check if a 10x10 pixel image is inside another 100x100+ pixel image. But if it isn't in the first image it checks, it needs to check the next image. And so on. For 100k+ images. So like using the image used in the OP, the program checks for that tiny brown square inside the image on the left (it is in that image, it's her hair). But if it weren't in that image, the tiny 10 pixel block would need to be compared to the next image in the DB. Then the next image. Etc.

The way I see it, a hash table would only check if 2 images are identical, correct? I don't think I could use a hash table to check if a small image is inside another image in the database.
>>
>>51275488
you gotta do some reading on the rudiments of computer science, then it should become clear what you must do.

be patient
>>
>>51274602
> I want dem fast prorgrammz
> Java

Discarded.
>>
>>51275502
...what? I'm asked a pretty simple question. The way I see it a hash set doesn't help me in this case because comparing hashes only helps when the two images are identical - not if one image is a part of another. I'm just asking for clarification if I'm wrong, because people keep mentioning hash sets.

>>51275542
That's why I asked if this is my main problem. Would rewriting it in C make it much faster or is it more likely my poor algorithm code?
>>
>>51274602
>Hash every possible string of 10 pixels into a hash table
>???
constant lookup time at the cost of 100 tb of storage space
>>
>>51274705
No it cant, since you dont understand why OP´s program is slow
>>
>>51275426
>No it CANT, since you dont understand why OP´s program is slow

read:
>>51274705
>potentially
>>
>>51275896
... please never get into the computer science field, and if you are drop out
>>
Python is generally slow like Java so he just had a good algorithm. C could potentially speed it up but it sounds like you can get a better algorithm too. This I can say with some certainty.

However, I am not a great algorithm designer. One idea I just came up with is looking at the top left pixel of your 10x10 match_image. You compare that pixel to the top left pixel in your input_image. If it doesn't match, then you can jump 10 pixels to the right, because you know that it won't be a match. (If you're reaching the right edge of the image and there's less than 10 pixels left, you don't even have to check for a match. Just go 10 pixels down and repeat.) If the top left pixels do match, then (depending on how much speed you need), THEN you call your special function that compares the two 10x10 areas. You could do it pixel by pixel the whole way too.

Hopefully this makes sense.
>>
>>51275907
I already study CS and am in the top of my class...

Seriously, what do you not understand about the words "could" and "potentially"?
I think your reading comprehension skills might be seriously lacking.

And don't try to deny the fact that using a form of abstraction closer to the hardware can lead to a speedup,
which doesn't mean that in every situation this speedup is worth it.
>>
Only need to know if there is a match? Allowed to process the images ahead of time? (Sounds like it with the guy recommending a database)

consider my stupid idea
create a black and white bitmap of size 4096x4096. every possible r,g,b combination can be represented in this bitmap. apply gzip/deflate or use some compressed image format. at run time you just load up the image's corresponding bitmap and check all 10x10 pixels.

create the map
bitmap : bitset(size: 2^24-1)
for each pixel value in image
bitmap[r<<16 | g<<8 | b] = 1
serialize gzip(bitmap)


then to see if the image contains that color
deserialize(gunzip(bitmapFile))
for each pixel in target (10x10) image
if not bitmap[r<<16|g<<8|b] == 1 then break
at this point it means it contains the image


it'll be mainly zeroes so it should compress well. Worst case: if all the images are incompressible, it'll be 200gb of meta data for 100k images.

to find similar colors you could drop 1 or 2 of the least significant bits for color r,g,b. this also would reduce the bitmap size to 2^21-1 or 2^18-1 which is 8 & 64 times smaller respectively.
>>
>>51275607
your algorithm is shit, senpai. the other guys is going faster because he has a better algorithm.
>>
>>51276001
this sounds like it could speed your searches up a lot OP.

Theoretically this can work
>>
Just use the machine learning system Google uses. They've just open-sourced it a couple hours ago and it comes with noob friendly tutorials.
>>
>>51276266
does it work, though?
given the search square:
0 4
2 3

look in the image square:
1 0 4
1 2 3
5 7 5

if we check the top left and seeing that 1 != 0 we decide to jump right 2 pixels, we'll have missed the pattern.
>>
>>51276309
You applied my algorithm correctly and yes it is wrong. My idea only appeared to be sensible for a moment.
>>
If you're going to bruteforce, just make sure you're checking pixels individually and not calling some helper function that makes 10x10 comparisons for each pixel. You only need to compare pixels until a difference is found.
>>
>>51274602
> Uses the BufferedImage class and .getRGB() etc.
There's your first problem.

The first thing you should do is to convert all of the sample images to the optimal format for searching, so that the conversion doesn't need to be done as part of the search.

My inclination would be to quantise the images to 256 colours using e.g. a 6x6x6 colour cube (without dithering) and store each row of each image as a database entry, along with its image ID and row number. The main reason for quantising is that it allows you to use "text" search; storing RGB data would result in false matches where the 3 bytes of a pixel match the last byte(s) of one pixel and the first byte(s) of the next.

Then, given the small image to search for, you can take a row of the image, quantise it, then find all database entries which contain that sequence of bytes. This gives you a set of image IDs; do the same another row, but only check in the images which passed the previous check. Also check that the row numbers match up (i.e. if row x in the search image matches row y of the candidate, then row x+n must also match row y+n).

For the most part, it boils down to choosing a database with a fast text search.

There are many ways to improve upon this, but they require actual effort. The above is something that can be thrown together quickly given the right database.
>>
>>51274705
>look for something like text search algorithms, since they are similiar, but just in 2d, and then think about how you could apply that knowledge to 3dimensions.

That's so vague I don't know how anyone could find it useful.

>Parallelize the process for each image (too bad you can't just spawn 100k threads in java)

You could but it would be a bad idea. There's no language where you can spawn 100k OS threads and expect anything good, something like go might let you spawn a bunch of pseudo-threads but you're not actually going to beat a decently tuned java program for concurrency in this case. Most likely the scheduler would get it wrong and you'd spend all your time swapping.

But yeah, early exit is critical. If you know you're going to be matching a series of templates against the same set of images repeatedly then you can make performance enhancements through prescreening. The obvious approach would be to calculate all the colors present in each image and store it (only 3 bytes per image assuming RGB) and do the same for each template (only one image so trivially fast). And the template pallet summary with each of the candidates, if the result isn't equal to the pallet summary then you can reject it right there. You'd be doing that first pass in as little as two instructions per candidate which is blazing fast. More cycles of course because of memory, but even if you're totally memory bound you'd be doing memory frequency/2 images per second which will clear your 100k in no time.

Of course you're probably not going to get close enough to metal to make sure that's the case but you've got more than enough wiggle room.
>>
>There's no language where you can spawn 100k OS threads and expect anything good
>Using OS Threads
Green Threads is where its at my friend.
Ever heard of the massive parallelization capabilities of languages like Erlang and Haskell?
You might be surprised by what you find if you do some research.
Context Switches are light as fuck with them and require next to no work.
And thanks to the right abstractions spawning them is totally natural.
>>
>>51274602
cut the smaller image out and hash it
compare hashes
>>
>>51276407
It almost works. You just have to make a condition that you only jump right 2px if there are no matching columns. If you have a matching column, Then only jump right one. Do the same thing for rows, too. As long as there is only one instance of the given color pattern that would still be very fast
>>
>>51276617
This sounds fast to me. Serialising like that and taking advantage of already very good inbuilt search algos can only be better than what you have
>>
>>51276001

> not a great algorithm designer
> gives one of the best answers in the thread
>>
>>51276001

but what if the 'to be matched square' starts at 2, 2 not 10, 10 then you are fucked so that wouldnt work (or i just have no idea what i am talking about)
>>
>>51278402
yup....what came to my mind exactly
Thread replies: 37
Thread images: 1

banner
banner
[Boards: 3 / a / aco / adv / an / asp / b / biz / c / cgl / ck / cm / co / d / diy / e / fa / fit / g / gd / gif / h / hc / his / hm / hr / i / ic / int / jp / k / lgbt / lit / m / mlp / mu / n / news / o / out / p / po / pol / qa / r / r9k / s / s4s / sci / soc / sp / t / tg / toy / trash / trv / tv / u / v / vg / vp / vr / w / wg / wsg / wsr / x / y] [Home]

All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.
If a post contains personal/copyrighted/illegal content you can contact me at [email protected] with that post and thread number and it will be removed as soon as possible.
DMCA Content Takedown via dmca.com
All images are hosted on imgur.com, send takedown notices to them.
This is a 4chan archive - all of the content originated from them. If you need IP information for a Poster - you need to contact them. This website shows only archived content.