Learning C++ for biologists <string manipulation>

You are currently reading a thread in /g/ - Technology

Thread replies: 14
Thread images: 6
File: barbie3.jpg (49 KB, 670x503)
I'm a biologist -> computational biologist.

I know Perl, Python, and R.

I am pretty experienced compared with most biologist->computational biologists out their (we tend to see a lot of compSci -> computational biology people, who generally have a poor grasp of biology).

I do a lot of string manipulation and file parsing. This is why I know Perl.

I would like to learn C++ because it's faster. I'm looking for resources that will aid me (I have been programming for 5 years).

What books do you recommend for an experienced programmer who wants to do a lot of file parsing in C++?

For example, this is a script I typically write in Perl:

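#!/usr/bin/perl
# Collect the unique records (first four columns) across all BED files.
use strict; use warnings;
my %alleles;
foreach my $f (glob("bed/*.bed")){
    my @f = split /\//, $f;
    my $iid = pop @f;              # sample ID from the file name (not used below)
    $iid =~ s/_alleles\.bed//;
    open my $in, '<', $f or die "Cannot open $f: $!";
    while(<$in>){
        chomp;
        my @r = split /\t/, $_;
        my $k = join "\t", @r[0 .. 3];   # key: CHROM, START, END, TYPE
        $alleles{$k}++;
    }
    close $in;
}
# Write the keys once each, in sorted order.
open my $out, '>', "all_alleles.txt" or die "Cannot open all_alleles.txt: $!";
foreach (sort keys %alleles){ print $out $_,"\n"; }
close $out;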


Obviously I do perform some CPU-limited tasks (not in this script), so I envision I will benefit from C++. I also write tools (in Python/Cython) and would like to use C++ for those too.

**What books or resources are there that can help me learn how to do I/O operations and string manipulation in C++? I am somewhat experienced. Thank you /g/**
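
For comparison, here is a rough idea of what the Perl script above could look like in C++. It is only a sketch under a few assumptions: the names first_fields and collect_keys are made up for illustration, the caller supplies the list of file paths (C++17's std::filesystem::directory_iterator could stand in for glob), and a line with fewer than four fields is kept whole instead of being padded the way the Perl slice would.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <set>
#include <string>
#include <vector>

// Return the first n tab-separated fields of `line`, tabs included.
// A line with n or fewer fields is returned unchanged.
std::string first_fields(const std::string& line, std::size_t n) {
    assert(n > 0);
    std::size_t pos = 0;
    for (std::size_t i = 0; i < n; ++i) {
        pos = line.find('\t', pos);
        if (pos == std::string::npos) return line;  // ran out of tabs
        if (i + 1 < n) ++pos;                       // step past this tab
    }
    return line.substr(0, pos);  // cut just before the n-th tab
}

// Collect the distinct four-column keys from every file in `paths`.
// std::set keeps the keys unique and sorted, like `sort keys %alleles`.
std::set<std::string> collect_keys(const std::vector<std::string>& paths) {
    std::set<std::string> keys;
    for (const std::string& path : paths) {
        std::ifstream in(path);
        if (!in) continue;  // unreadable file: skipped here, a real tool should report it
        std::string line;
        while (std::getline(in, line)) {
            keys.insert(first_fields(line, 4));
        }
    }
    return keys;
}
```

Writing the result out is then a loop over the set with an std::ofstream. Whether this ends up faster than the Perl version would still need to be measured on real data.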
>>
>>55224093
>their

Sorry lads, "there". I am dyslexic.
>>
>>55224107
>moar
So one example I can think of is parsing the allele depths for SNPs.

#!/usr/bin/env python
# Per-chromosome median and mean of REF/ALT allele depths from a single-sample VCF.
import numpy as np

ref_depth = {}
alt_depth = {}
with open('HG00096_SNP.vcf', 'r') as f:
    for l in f:
        if l.startswith('#'): continue
        r = l.rstrip('\n').split('\t')
        fmt = r[8].split(':')
        # Position of the ADP tag in the FORMAT column (falls back to 0 if absent).
        ADP_IND = fmt.index('ADP') if 'ADP' in fmt else 0
        allelic_depth = r[9].split(':')[ADP_IND].split(',')
        if len(allelic_depth) != 2: continue
        allelic_depth = list(map(int, allelic_depth))
        # One list of depths per chromosome.
        ref_depth.setdefault(r[0], []).append(allelic_depth[0])
        alt_depth.setdefault(r[0], []).append(allelic_depth[1])
for chrom in ref_depth:
    print('{} REF MEDIAN: {} REF MEAN: {}'.format(chrom, np.median(ref_depth[chrom]), np.mean(ref_depth[chrom])))
    print('{} ALT MEDIAN: {} ALT MEAN: {}'.format(chrom, np.median(alt_depth[chrom]), np.mean(alt_depth[chrom])))


So this is a little simplistic, but as you can hopefully see I am doing quite a bit of file parsing and manipulation of the lines. I could also improve my code by using C types. I have used Cython but would like to learn C++ for tasks similar to this code.
>>
>>55224333
There are some errors in that script. But forgive me. I hope you saw what I am getting at.
>>
>>55224093
What exactly does a computational biologist do, anyway?
>>
File: auto_del_3d.gif (1 MB, 1044x634)
>>55224615
I am in psychiatric genetics. I find new autism candidate genes focusing on structural variation.

We perform whole genome sequencing on families with an autistic child and look for new mutations in the genome.

My day to day is parsing BED files formatted like this:

#CHROM  START  END  TYPE  IID
chr1    100    500  DEL   HG00096


I also work with BAM files.

I wrote a machine learning algo to predict genotypes of putative copy number variants. It works really well.

I also do the same for schizophrenia.

>But other computational biologists ("bioinformaticians", but I do not like that term) do other things. It's a wide field; I am in genomics and genetics.

>RED: Copy number 0 (two deletions)
>BLUE: Copy number 1 (one deletion)
>GREEN: Copy number 2 (diploid default)
>>
>>55224801
>I find new autism candidates
is this an elaborate joke about you finding autistic children on /g/?
>>
>>55224801
>autism candidate genes
You're in the right place for that
>>
1. Separate your I/O from your computation
2. Profile your code to see which one is slower
3. Optimize the one in need of optimization

For large data sets and cheap algorithms, I/O will generally be the bottleneck: Use non-blocking I/O and cheaper parsing primitives, or memory mapped files

For smaller data sets and expensive algorithms, computation will generally be the bottleneck: Use cache-friendly algorithms and vectorized or parallelized code.
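
A minimal C++ sketch of steps 1 and 2, assuming a plain text input file: the read phase and the compute phase live in separate functions so each can be timed on its own. The names read_all, count_lines and mean_seconds are invented for this example, and counting newlines is only a stand-in for real work. A proper profiler (perf, gprof and the like) gives a far more detailed picture than wall-clock timing.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Phase 1 (I/O only): slurp the whole file into memory.
std::string read_all(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Phase 2 (computation only): stand-in workload that counts newlines.
std::size_t count_lines(const std::string& data) {
    std::size_t n = 0;
    for (char c : data) {
        if (c == '\n') ++n;
    }
    return n;
}

// Run `work` `repeats` times and return the mean wall-clock seconds per run.
template <typename F>
double mean_seconds(F&& work, int repeats) {
    assert(repeats > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) work();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / repeats;
}
```

Timing a call such as mean_seconds([&] { data = read_all(path); }, 5) against mean_seconds([&] { lines = count_lines(data); }, 5) shows which phase dominates. Keep in mind that repeated reads of the same file are usually served from the OS page cache, so the first read is the honest one.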
>>
>>55224828
It's easier to find autism genes in really low IQ children. Aspies are harder.

>>55224834
If people are interested in joining the study I can give the link
>>
File: phred.png (141 KB, 507x515)
>>55224839
>For large data sets and cheap algorithms, I/O will generally be the bottleneck: Use non-blocking I/O and cheaper parsing primitives, or memory mapped files

Finally a legitimate answer.

So it's definitely the I/O that is the bottleneck.

>Is C/C++ faster at I/O than Python/Perl?

Do you mind explaining what non-blocking, parsing primitives are?

Memory map? Is that like an index file (for example we have binary files .bam and their index .bam.bai)?

I come from a high-level language background so forgive my ignorance.
>>
File: m_yoa120017f1.png (33 KB, 520x385)
>>55224828
>>55224834
I did find one aspie candidate gene

http://www.genecards.org/cgi-bin/carddisp.pl?gene=TESC
>>
>>55224891
>Is C/C++ faster at I/O than Python/Perl?
Languages aren't really slower or faster, programs are. Some languages just don't permit you to write some programs, though.

Rewriting your Python and Perl programs in C will not necessarily make them faster, but rewriting them in C and then optimizing them will.

>Do you mind explaining what non-blocking, parsing primitives are?
By non-blocking I meant unlocked I/O (my bad). Unlocked I/O means the I/O routines don't have to run extra locking operations to make sure they're the only thing accessing the stream. (In a single-threaded program that doesn't share its file descriptors, that is always the case anyway.)
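
As a small illustration of the unlocked idea, assuming a POSIX system: getc_unlocked() is the standard POSIX counterpart of getc() that skips the per-call stream lock. The function name count_lines_unlocked is made up for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <stdio.h>  // getc_unlocked is POSIX, declared in <stdio.h>

// Count newline characters using the lock-free character read.
// Only safe while no other thread touches `fp`.
std::size_t count_lines_unlocked(FILE* fp) {
    assert(fp != nullptr);
    std::size_t lines = 0;
    int c;
    while ((c = getc_unlocked(fp)) != EOF) {
        if (c == '\n') ++lines;
    }
    return lines;
}
```

Swapping getc_unlocked for plain getc gives the locked version, which makes for an easy before/after benchmark on the same file.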

>parsing primitive
By parsing primitives I was specifically referring to stuff like the “decode an integer” subroutine. If you use the standard scanf() or whatever in C, your code will usually be significantly slower than a hand-written parser that can make more assumptions about the input format. You can also customize your input format to make it easier to parse, if possible, or use a zero-parse data structure like a memory-mapped array of structs.
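
A hand-written "decode an integer" primitive might look like the sketch below. The name parse_uint is invented here, and the speed comes from the assumptions it bakes in: ASCII digits only, no sign, no leading whitespace, no locale handling and no overflow check.

```cpp
#include <cassert>
#include <cstdint>

// Decode an unsigned decimal integer starting at `p`, stopping at the first
// non-digit or at `end`. On return, `p` points just past the digits consumed.
// Returns 0 if there were no digits at all.
std::uint64_t parse_uint(const char*& p, const char* end) {
    assert(p <= end);
    std::uint64_t value = 0;
    while (p < end && *p >= '0' && *p <= '9') {
        value = value * 10 + static_cast<std::uint64_t>(*p - '0');
        ++p;
    }
    return value;
}
```

For a BED line, the caller would skip to the first tab, call parse_uint for START, step over the next tab, and call it again for END. If a standard routine is preferred, std::from_chars from <charconv> (C++17) covers the same ground with error reporting.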

>Memory map? Is that like an index file (for example we have binary files .bam and their index .bam.bai)
No, it's a way to map a file's contents directly into your program's address space, so you don't have to go through buffered, streaming I/O primitives to process or scan it. This is especially helpful if you only need random access (in particular, to just a few values).
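
To make that concrete, here is a minimal sketch using the POSIX mmap() call (so Linux/macOS only; Windows has a different API). The function name count_lines_mmap is made up, and counting newlines just stands in for whatever scan you actually need.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Count newlines in a file by mapping it read-only into the address space.
// Returns -1 on any failure.
long count_lines_mmap(const char* path) {
    assert(path != nullptr);
    const int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }  // zero-length maps are rejected
    const std::size_t size = static_cast<std::size_t>(st.st_size);
    void* addr = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (addr == MAP_FAILED) return -1;
    // The file's bytes are now readable like an ordinary array.
    const char* bytes = static_cast<const char*>(addr);
    long lines = 0;
    for (std::size_t i = 0; i < size; ++i) {
        if (bytes[i] == '\n') ++lines;
    }
    munmap(addr, size);
    return lines;
}
```

There is no read() loop and no buffer management: the kernel pages the file in on demand, and jumping to bytes[offset] costs nothing beyond the page fault.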

>So it's definitely the I/O that is the bottleneck
Benchmark it, don't assume.
>>
>>55224991
Thanks for all the great information. I understand why people like Larry Wall actually took the time to write Perl.

I was reading a book about learning C, and whatever I read, my immediate thought was "oh, this is why Perl exists".

I know it might sound naive to many programmers out there but for a lot of the work I do Perl is great.

Thanks again for breaking it down.