Learning C++ for biologists <string manipulation>

Thread replies: 14
Thread images: 6

Anonymous
Learning C++ for biologists <string manipulation> 2016-06-23 16:11:44 Post No. 55224093
File: barbie3.jpg (49 KB, 670x503)

I'm a biologist -> computational biologist.

I know Perl, Python, and R.
I am pretty experienced compared with most biologist->computational biologists out their (we tend to see a lot of compSci -> computational biology people, who generally have a poor grasp of biology).

I do a lot of string manipulation and file parsing. This is why I know Perl.

I would like to learn C++ because it's faster. I'm looking for resources that will aid me (I have been programming for 5 years).

What books do you recommend for an experienced programmer who wants to do a lot of file parsing in C++?

For example, this is a script I typically write in Perl:
#!/usr/bin/perl
use strict; use warnings;
# Collect every distinct record (first four columns) across all BED files.
my %alleles;
foreach my $f (glob("bed/*.bed")){
    my @f = split /\//, $f;
    my $iid = pop @f;                 # sample ID from the file name (unused below)
    $iid =~ s/_alleles\.bed//;
    open(my $in, '<', $f) or die "Cannot open $f: $!";
    while(<$in>){
         chomp $_;
         my @r = split /\t/, $_;
         my $k = join "\t", @r[0 .. 3];
         $alleles{$k}++;
    }
    close $in;
}
# Write the sorted, de-duplicated keys.
open(my $out, '>', "all_alleles.txt") or die "Cannot open all_alleles.txt: $!";
foreach (sort keys %alleles){ print $out $_,"\n"; }
close $out;
Obviously I do perform some CPU-limited tasks (not in this script), so I envision I will benefit from C++. I also write tools (in Python/Cython) and would like to use C++ for those too.

**What books or resources can help me learn how to do I/O operations and string manipulation in C++? I am somewhat experienced. Thank you /g/**

>>

Anonymous 2016-06-23 16:12:39 Post No.55224103

>>55224093
>their

Sorry lads, "there". I am dyslexic.

>>

Anonymous 2016-06-23 16:27:24 Post No.55224333
File: revised-barbie-computer-engineer.jpg (584 KB, 1100x550)

>>55224107
>moar
So one example I can think of is parsing the allele depths for SNPs.

#!/usr/bin/env python
import numpy as np

ref_depth = {}
alt_depth = {}
with open('HG00096_SNP.vcf', 'r') as f:
    for l in f:
        if l.startswith('#'): continue
        r = l.rstrip('\n').split('\t')
        fmt = r[8].split(':')
        # position of the ADP entry in the FORMAT column (stays 0 if absent)
        ADP_IND = 0
        for x in range(len(fmt)):
            if fmt[x] == 'ADP': ADP_IND = x
        # pick the ADP entry of the sample column, then split into ref,alt
        allelic_depth = r[9].split(':')[ADP_IND].split(',')
        if len(allelic_depth) != 2: continue
        allelic_depth = list(map(int, allelic_depth))
        # group depths by chromosome (column 0)
        ref_depth.setdefault(r[0], []).append(allelic_depth[0])
        alt_depth.setdefault(r[0], []).append(allelic_depth[1])
for chrom in ref_depth:
    print('{} REF MEDIAN: {} REF MEAN: {}'.format(chrom, np.median(ref_depth[chrom]), np.mean(ref_depth[chrom])))
    print('{} ALT MEDIAN: {} ALT MEAN: {}'.format(chrom, np.median(alt_depth[chrom]), np.mean(alt_depth[chrom])))

So this is a little simplistic, but as you can hopefully see I am doing quite a bit of file parsing and manipulation of the lines. I could also improve my code by using C types. I have used Cython but would like to learn C++ for tasks similar to this code.

>>

Anonymous 2016-06-23 16:29:03 Post No.55224361
File: 141120_FT_BarbieRemixed_05.png.CROP.promovar-mediumlarge.png (924 KB, 590x629)

>>55224333
There are some errors in that script. But forgive me. I hope you saw what I am getting at.

>>

Anonymous 2016-06-23 16:50:38 Post No.55224615

>>55224093
What exactly does a computational biologist do, anyway?

>>

Anonymous 2016-06-23 17:02:44 Post No.55224801
File: auto_del_3d.gif (1 MB, 1044x634)

>>55224615
I am in psychiatric genetics. I find new autism candidate genes focusing on structural variation.

We perform whole genome sequencing on families with an autistic child and look for new mutations in the genome.

My day to day is parsing BED files formatted like this:
#CHROM    START     END     TYPE     IID
chr1    100    500    DEL    HG00096
I also work with BAM files.

I wrote a machine learning algo to predict genotypes of putative copy number variants. It works really well.

I also do the same for schizophrenia.

>But other computational biologists (bioinformaticians, but I do not like that term) do other things. It's a wide field; I am in genomics and genetics.

>RED: Copy number 0 (two deletions)
>BLUE: Copy number 1 (one deletion)
>GREEN: Copy number 2 (diploid default)

>>

Anonymous 2016-06-23 17:05:18 Post No.55224828

>>55224801
>I find new autism candidates
is this an elaborate joke about you finding autistic children on /g/?

>>

Anonymous 2016-06-23 17:05:39 Post No.55224834

>>55224801
>autism candidate genes
You're in the right place for that

>>

Anonymous 2016-06-23 17:06:03 Post No.55224839

1. Separate your I/O from your computation
2. Profile your code to see which one is slower
3. Optimize the one in need of optimization

For large data sets and cheap algorithms, I/O will generally be the bottleneck: Use non-blocking I/O and cheaper parsing primitives, or memory mapped files

For smaller data sets and expensive algorithms, computation will generally be the bottleneck: Use cache-friendly algorithms and vectorized or parallelized code.
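
To make steps 1 and 2 concrete, here is a minimal C++ sketch, not a definitive implementation. It assumes tab-separated input like the BED lines earlier in the thread. The names `read_lines`, `key_of`, `count_keys`, and `time_ms` are made up for illustration. I/O and computation sit in separate functions so that each can be timed on its own.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <istream>
#include <map>
#include <string>
#include <vector>

// I/O phase: pull every line of the stream into memory.
std::vector<std::string> read_lines(std::istream& in) {
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) lines.push_back(line);
    return lines;
}

// Return the first n tab-separated fields of a line, joined by tabs
// (the whole line if it has n or fewer fields).
std::string key_of(const std::string& line, std::size_t n) {
    assert(n > 0);  // precondition: at least one field requested
    std::size_t pos = 0;
    for (std::size_t i = 0; i < n; ++i) {
        pos = line.find('\t', pos);
        if (pos == std::string::npos) return line;
        ++pos;  // step past the tab
    }
    return line.substr(0, pos - 1);  // drop the trailing tab
}

// Compute phase: count how often each key occurs (std::map keeps keys sorted).
std::map<std::string, std::size_t> count_keys(const std::vector<std::string>& lines,
                                              std::size_t n) {
    std::map<std::string, std::size_t> counts;
    for (const auto& l : lines) ++counts[key_of(l, n)];
    return counts;
}

// Run any callable and return the elapsed wall-clock time in milliseconds.
template <class F>
double time_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Wrapping one `time_ms` call around `read_lines` and another around `count_keys` gives a first rough idea of which phase dominates. A real profiler will tell you more.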

>>

Anonymous 2016-06-23 17:06:42 Post No.55224846

>>55224828
It's easier to find autism genes in really low IQ children. Aspies are harder.

>>55224834
If people are interested in joining the study I can give the link

>>

Anonymous 2016-06-23 17:11:31 Post No.55224891
File: phred.png (141 KB, 507x515)

>>55224839
>For large data sets and cheap algorithms, I/O will generally be the bottleneck: Use non-blocking I/O and cheaper parsing primitives, or memory mapped files

Finally a legitimate answer.

So it's definitely the I/O is the bottle neck

>Is C/C++ faster at I/O than Python/Perl?

Do you mind explaining what non-blocking, parsing primitives are?

Memory map? Is that like a index file (for example we have binary files .bam and their index .bam.bai)

I come from a high-level language background so forgive my ignorance.

>>

Anonymous 2016-06-23 17:13:53 Post No.55224915
File: m_yoa120017f1.png (33 KB, 520x385)

>>55224828
>>55224834
I did find one aspie candidate gene

http://www.genecards.org/cgi-bin/carddisp.pl?gene=TESC

>>

Anonymous 2016-06-23 17:20:00 Post No.55224991

>>55224891
>Is C/C++ faster at I/O than Python/Perl?
Languages aren't really slower or faster, programs are. Some languages just don't permit you to write some programs, though.

Rewriting your Python or Perl program in C will not necessarily make it faster, but rewriting it in C and then optimizing it will.

>Do you mind explaining what non-blocking, parsing primitives are?
By non-blocking I meant unlocked I/O (my bad). Unlocked I/O means the I/O routines don't have to insert extra locking operations to make sure they're the only thing accessing the stream. (In a single-threaded program that doesn't share its FDs, this is always the case.)
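
A rough sketch of what unlocked I/O looks like, assuming a POSIX system (`flockfile`, `funlockfile`, and `getc_unlocked` are POSIX stdio calls, not standard C++). The function name `count_lines_unlocked` is made up for this example. It takes the stream lock once and then reads characters without locking each call.

```cpp
#include <cassert>
#include <cstddef>
#include <stdio.h>  // POSIX: flockfile, funlockfile, getc_unlocked

// Count '\n' bytes in an already-open stream.
// The lock is taken once for the whole loop rather than once per character.
std::size_t count_lines_unlocked(FILE* fp) {
    assert(fp != nullptr);
    std::size_t n = 0;
    flockfile(fp);  // acquire the stream lock up front
    int c;
    while ((c = getc_unlocked(fp)) != EOF) {
        if (c == '\n') ++n;
    }
    funlockfile(fp);  // release it when done
    return n;
}
```

Whether this makes a measurable difference depends on the C library and the workload, so time it against plain `getc` before relying on it.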

>parsing primitive
By parsing primitives I was specifically referring to stuff like the "decode an integer" subroutine. If you use the standard scanf() or whatever in C, your code will be significantly slower than a hand-written parser that can make more assumptions about the input format. You can also customize your input format to make it easier to parse, if possible, or use a zero-parse data structure like a memory-mapped array of structs.
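
For illustration, a hand-written "decode an integer" routine could look something like this. It is only a sketch under strong assumptions: well-formed input, ASCII digits only, no sign, no overflow checking. `parse_uint`, `Interval`, and `parse_start_end` are hypothetical names, and the line layout follows the BED example earlier in the thread.

```cpp
#include <cassert>
#include <cstdint>

// Parse an unsigned decimal integer starting at p and advance p past it.
// Stops at the first non-digit; does not check for overflow.
inline std::uint64_t parse_uint(const char*& p) {
    assert(p != nullptr);
    std::uint64_t v = 0;
    while (*p >= '0' && *p <= '9') {
        v = v * 10 + static_cast<std::uint64_t>(*p - '0');
        ++p;
    }
    return v;
}

// Hypothetical record for a line shaped like "chr1\t100\t500\t...".
struct Interval {
    std::uint64_t start;
    std::uint64_t end;
};

// Skip the first column, then decode the next two columns as integers.
inline Interval parse_start_end(const char* line) {
    const char* p = line;
    while (*p != '\0' && *p != '\t') ++p;  // skip the CHROM column
    if (*p == '\t') ++p;
    Interval iv{};
    iv.start = parse_uint(p);
    if (*p == '\t') ++p;
    iv.end = parse_uint(p);
    return iv;
}
```

The speed comes from what the code leaves out: no format string to interpret, no locale handling, no error reporting. That is only safe when you control or trust the input.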

>Memory map? Is that like a index file (for example we have binary files .bam and their index .bam.bai)
No, it's a way to directly map a file's contents into your program's address space so you don't have to go through buffered, streaming I/O primitives in order to process or scan it. This is especially helpful if you only need random access (in particular, to a few values).
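
As a minimal sketch, assuming a POSIX system (`open`, `fstat`, `mmap`, and `munmap` are POSIX calls), mapping a file and scanning it as a plain byte array might look like this. The function name `count_newlines_mmap` is invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>      // std::perror
#include <fcntl.h>     // open
#include <sys/mman.h>  // mmap, munmap
#include <sys/stat.h>  // fstat
#include <unistd.h>    // close

// Count newline bytes in a file by mapping it read-only into memory.
// Returns -1 on error.
long count_newlines_mmap(const char* path) {
    assert(path != nullptr);
    int fd = open(path, O_RDONLY);
    if (fd < 0) { std::perror("open"); return -1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { std::perror("fstat"); close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }  // a zero-length mapping is an error

    const std::size_t len = static_cast<std::size_t>(st.st_size);
    void* addr = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (addr == MAP_FAILED) { std::perror("mmap"); return -1; }

    // From here on the file is just an array of bytes.
    const char* data = static_cast<const char*>(addr);
    long n = 0;
    for (std::size_t i = 0; i < len; ++i) {
        if (data[i] == '\n') ++n;
    }
    munmap(addr, len);
    return n;
}
```

The same pointer could be indexed at any offset without seeking, which is the random-access case mentioned above.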

>So it's definitely the I/O is the bottle neck
Benchmark it, don't assume

>>

Anonymous 2016-06-23 17:27:16 Post No.55225081

>>55224991
Thanks for all the great information. I understand why people like Larry Wall actually took the time to write Perl.

I was reading a book about learning C, and whatever I read, my immediate thought was "oh, this is why Perl exists".

I know it might sound naive to many programmers out there, but for a lot of the work I do Perl is great.

Thanks again for breaking it down.