Guys I have an interesting coding problem for you that I'm

You are currently reading a thread in /sci/ - Science & Math

Thread replies: 31
Thread images: 1
File: coding.jpg (326 KB, 1438x809)
Guys I have an interesting coding problem for you that I'm really stuck with.

Say I have 2 lists of names. Both are in random order, with a few names appearing in both. I need to find out which names are in group B but not group A. How would I do this in Excel or MATLAB?

To complicate matters further some entries aren't 100% similar but are very similar. For example, in list A a name might be Andrew B. Cosby and in list B just Andrew Cosby, but obviously this is a match and should not be in my answer list.

Thanks guys!
>>
>>8154446
Do you know any other coding languages? I don't think Excel (and I don't know about MATLAB) is the best way to handle lists of strings.
>>
>>8154446
>To complicate matters further some entries aren't 100% similar

I don't know what you mean by that, but you can use the Levenshtein distance with a tolerance level to find similar but not identical strings.
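
A minimal sketch of that idea in Python (not one of the languages OP asked about), assuming a tolerance of 3 edits and case-insensitive comparison. The function names, the sample names, and the tolerance value are all made up for illustration; a sensible tolerance depends on the actual data.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(t) + 1))          # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                          # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def only_in_b(list_a, list_b, tolerance=3):
    """Names in list_b that are more than `tolerance` edits away from every name in list_a."""
    return [b for b in list_b
            if all(levenshtein(a.lower(), b.lower()) > tolerance for a in list_a)]


# Hypothetical sample data
list_a = ["Andrew B. Cosby", "Maria Lopez"]
list_b = ["Andrew Cosby", "John Smith"]
print(only_in_b(list_a, list_b))  # ['John Smith']
```

Dropping "B. " costs exactly 3 edits, so a tolerance of 3 treats the two Andrew Cosby spellings as the same person. A tolerance that loose can also merge genuinely different short names, so check the matches by eye.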

about the other thing,

repmat(A, N_b,1) - repmat(B,N_a,1)'

the zeros are your duplicates.
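
The subtraction trick needs numeric values, so for names the same pairwise grid can be built with equality tests instead. A rough Python sketch of that idea, assuming exact string matches only; the function name is hypothetical.

```python
def names_only_in_b(list_a, list_b):
    # matches[i][j] is True when list_b[i] equals list_a[j]:
    # an N_b x N_a grid, like the repmat difference but with == instead of -
    matches = [[b == a for a in list_a] for b in list_b]
    # A row with no True entry means that name from B never appears in A
    return [b for b, row in zip(list_b, matches) if not any(row)]


print(names_only_in_b(["x", "y"], ["y", "z"]))  # ['z']
```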
>>
>>8154446
If you have data like that you need to standardize it first
>>
>>8154453
I don't really, unfortunately. I'm a maths student, so my coding is limited to calculations in MATLAB. Do you not think MATLAB or Excel could handle something like this?
>>
>>8154457
Would make for slow going but I think you're right. Any ideas how to do it with standardised data?
>>
>>8154456
What language is this?
>>
>>8154462
I mean if you've standardized the data you can just use sets. Or a terrible ugly for loop.
>>
>>8154465

matlab
>>
>>8154479
What command would you use to compare an element of A with an element of B?
>>8154486
What are the inputs in this case?
>>
>>8154491

A, your first list, B, your second list, N_a, length of A, N_b, length of B

you're a math student and you've never used repmat?
>>
>>8154503
Nope, I'll give it a browse

Thanks all for your help, if this works I'll share some of the £65k with you!
>>
>>8154512
>£65k

pfff yea right
>>
uninteresting programming problem in a shit language
>>
>>8154446
First of all you should sort your fucking data.
After that it's pretty simple:
Compare A[0] with the first letter of B[n]
If it's a match, compare the names (just first and last); if the names match, record the name / remove it from the list
Else break the loop and move on to A[1]
this is probably the simplest but it won't be terribly fast
>>
>>8154446

Sounds like a job for setdiff.
>>
it depends how your data is formatted. if you are using VBA for Excel you can use the Left() function and compare the first n characters.
>>
>>8154446
Use sets in Python
Set B - (Set A ∩ Set B)
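
A short sketch of that expression with Python's built-in sets, assuming exact matches; the sample names are invented.

```python
# Hypothetical sample data
list_a = ["Andrew Cosby", "Maria Lopez", "Wei Chen"]
list_b = ["Maria Lopez", "John Smith", "Wei Chen", "Priya Patel"]

set_a, set_b = set(list_a), set(list_b)

# B - (A ∩ B): remove from B everything that is also in A
only_in_b = set_b - (set_a & set_b)

# The intersection step is optional: plain set difference gives the same result
assert only_in_b == set_b - set_a

print(sorted(only_in_b))  # ['John Smith', 'Priya Patel']
```

Sets drop duplicates and ordering, which is why the result is sorted before printing.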
>>
>>8155252
He needs to normalize the data first so that equivalent names are equal
>>
>>8155280
sha4096
>>
>>8154446
Matlab is bad at this because it's a shit language (with shit string support), but you can do something like this:
First go through both lists of names and convert them to upper(or lower) case while also removing things like B. in your Andrew Cosby example (a good way to do this would probably be to take the first and last word).
After that, use the appropriate set operations on the lists.
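
Those two steps could look roughly like this in Python rather than Matlab. The helper names are hypothetical, and keeping only the first and last word is an assumption that will wrongly merge people who share both.

```python
def normalize(name: str) -> str:
    """Lowercase the name and keep only its first and last words."""
    words = name.lower().split()
    if len(words) <= 1:
        return " ".join(words)
    return f"{words[0]} {words[-1]}"


def only_in_b(list_a, list_b):
    """Original spellings from list_b whose normalized form is absent from list_a."""
    keys_a = {normalize(a) for a in list_a}
    return [b for b in list_b if normalize(b) not in keys_a]


print(normalize("Andrew B. Cosby"))                                      # andrew cosby
print(only_in_b(["Andrew B. Cosby"], ["andrew cosby", "John Smith"]))   # ['John Smith']
```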
>>
post it to mechanical turk for peanuts, your time clearly is more valuable

alternatively if your sets are really big make it into a captcha and let people do it for free
>>
>>8154491
FOR EACH X NOT LISTC()[] IN LISTA {
LISTD [] = X
}

Listc() {
For each X in LISTB[] {
LISTC [] = "*" & X & "*"
}
}

hisssss :^)
>>
>>8155709
[code]
void main {
FOR EACH X NOT LISTC()[] IN LISTA {
LISTD[] = X
}
}

static array listc()[] {
For each Y in LISTB[] {
LISTC[] = "*" & X & "*"
}
}
[/code]

There's some python for you.
>>
>>8155717
>>8155709
Don't do this it makes mustard gas

But really this will infinitely loop and segfault Windows. 9/10
>>
not sure i can think of a non O(n^2) way to do it.

just go one by one through list b, checking each value against list a. you should also write an isSimilar() method that takes two names, splits them on whitespace and compares the first and last values (names).
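
Something like this, sketched in Python. The lowercase step is an extra assumption not mentioned above, and the names are made up.

```python
def is_similar(name_1: str, name_2: str) -> bool:
    """True when both names share the same first and last word (case-insensitive)."""
    parts_1, parts_2 = name_1.lower().split(), name_2.lower().split()
    if not parts_1 or not parts_2:
        return False                       # an empty name matches nothing
    return parts_1[0] == parts_2[0] and parts_1[-1] == parts_2[-1]


def only_in_b(list_a, list_b):
    """O(len(a) * len(b)): check every name in b against every name in a."""
    result = []
    for b in list_b:
        if not any(is_similar(a, b) for a in list_a):
            result.append(b)
    return result


print(only_in_b(["Andrew B. Cosby", "Maria Lopez"], ["Andrew Cosby", "John Smith"]))
# ['John Smith']
```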
>>
>>8155736
>O(n^2) way to do it.

Concatenate the lists in 1
Sort the list in N log N
Run through the list and check neighbors in N.

There you go, N log N solution

If the lists are already sorted it's an N solution.
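
A rough Python sketch of the concatenate, sort, check-neighbors approach, assuming exact matches. Tagging each name with its source list is a detail added here so the scan can tell which list a name came from.

```python
def only_in_b_sorted(list_a, list_b):
    # Tag names with their source; 0 sorts before 1, so for equal names
    # the copy from A always comes immediately before the copies from B
    tagged = [(name, 0) for name in list_a] + [(name, 1) for name in list_b]
    tagged.sort()                                    # O(N log N)

    result = []
    last_a = None                                    # most recent name seen from A
    for name, source in tagged:                      # single O(N) pass
        if source == 0:
            last_a = name
        elif name != last_a and (not result or result[-1] != name):
            result.append(name)                      # in B, not in A, not yet recorded
    return result


print(only_in_b_sorted(["c", "a"], ["b", "a", "d", "b"]))  # ['b', 'd']
```

The output comes back in sorted order with duplicates from B collapsed.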

Fucking noobs
>>
>>8154446
Perl has some lovely regular expressions and this amazing data structure known as a hash for just this sort of thing. I encourage you to look it up, even if it's the legacy of legacy.

Python has similar stuff going on, but regexp in Python is a little bit less intuitive for me (please dont ask me how /// is easier than regexp.) And a hash in Python is called a dict.

Matlab has very poor regexp support from what I understand, even though I like it.

You have yourself there a week 1 day 5 regexp problem in Perl
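
For comparison, the regexp-plus-hash approach sketched in Python rather than Perl, using the `re` module and a dict. The pattern only strips initials written as a single letter followed by a period, which is an assumption about how the data looks; the names are invented.

```python
import re

# A single letter followed by a period, e.g. the "B." in "Andrew B. Cosby"
INITIAL = re.compile(r"\b[A-Za-z]\.\s*")


def key(name: str) -> str:
    """Strip initials, lowercase, and collapse whitespace."""
    no_initials = INITIAL.sub("", name)
    return " ".join(no_initials.lower().split())


# Hypothetical sample data
list_a = ["Andrew B. Cosby", "Maria Lopez"]
list_b = ["Andrew Cosby", "John Smith"]

seen = {key(a): a for a in list_a}        # the "hash": normalized key -> original name
only_in_b = [b for b in list_b if key(b) not in seen]
print(only_in_b)  # ['John Smith']
```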
>>
>>8154446
Seeing how you are thinking about excel or matlab you probably don't give a shit about time complexity.

Store both lists as simple arrays.

Take a name from list B and compare it to literally every member of list A. If there is no match (track this with a boolean) then you output this name.

Repeat this for every member in list B and there you have it.

Assuming lists of the same size this is just O(n squared) so it is not absolutely shit, but is literally as bad as you can do.
>>
in R, only considering exact matches:

unique(B[! B %in% A])
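
A rough Python equivalent of that one-liner, also exact matches only. The function name is made up; `dict.fromkeys` is used to drop duplicates while keeping first-seen order.

```python
def unique_not_in(b, a):
    """Elements of b that do not occur in a, de-duplicated, in first-seen order."""
    a_set = set(a)                                   # fast membership tests
    return list(dict.fromkeys(x for x in b if x not in a_set))


print(unique_not_in(["z", "y", "z", "x"], ["y"]))  # ['z', 'x']
```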
>>
>>8156148
>please dont ask me how /// is easier than regexp
it's not the syntax that's shit in python's regex, but the implementation.

they recommend you pre-compile your patterns, but have it set up so you can just pass a pattern string instead of a pattern object, but it's caching behind the scenes so there's sometimes no difference in the behavior no matter how you set up the search

it's a great example of horribly planned pre-optimization
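
A small sketch of the two call styles, assuming a throwaway pattern for middle initials. Both go through the same matching code, and the `re` module caches recently compiled patterns, so the results are identical.

```python
import re

text = "Andrew B. Cosby"

# Style 1: compile the pattern once, then call methods on the pattern object
compiled = re.compile(r"\b[A-Z]\.")
m1 = compiled.search(text)

# Style 2: pass the pattern string to the module-level function;
# re compiles it (or fetches it from its internal cache) behind the scenes
m2 = re.search(r"\b[A-Z]\.", text)

print(m1.group(), m2.group())  # B. B.
```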

All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.
This is a 4chan archive - all of the content originated from them. If you need IP information for a Poster - you need to contact them. This website shows only archived content.