You can use Python instead of bash, Java, or C
Python can be run interactively or as a program
Create a file using editor, then:
$ python myscript.py
Run interpreter interactively
$ python
Use a python environment, e.g. Anaconda
We recommend Anaconda:
or
radius=2
pi=3.14
diam=radius*2
area=pi*(radius**2)
title="fun with strings"
pi='cherry'
radius=2.5
delicious=True
print(radius)
Lists are like arrays in other languages.
l=[1,2,3,4,5,6,7,8,9]
print(l)
l[3]
l[5:7]
l[2]=3.14
l
l.reverse()
l
dir(l)
l=[1,2,3,4,5,6,7,8,9]
l[2]=[11,12,13]
l[2][0]
l[3:6]=['four to six']
l
l=[x*x for x in range(10)]
l
tuples are like lists, but not modifiable
t=(1,2,3,4,5,6,7,8,9)
t
t[4:6]
t[5]=99
Strings are fully featured types in python.
s='Some String'
s
s[3]='a'
s="hello"
s.split('l')
Dicts are what python calls "hash tables"
coins={'penny':1, 'nickle':5, 'dime':10, 'quarter':25}
coins
coins['dime']
coins[3]
import random
v=random.randint(0,100)
if v < 50:
print ("small", v)
print ("another line")
else:
print ("big", v)
print ("after else")
import random
count=0
while count<100:
count=count+random.randint(0,10)
print (count)
print("done with loop")
For statements take some sort of iterable object and loop once for every value.
for fruit in "this string":
print(fruit)
list(range(10))
for i in range(100):
print(i)
for
loops and dicts
¶If you loop over a dict, you'll get just keys. Use items() for keys and values.
for denom in coins:
print (coins[denom])
While and For loops can skip steps (continue) or terminate early (break).
for i in range(10):
if i%2 != 0: continue
print (i)
for i in range(10):
if i>5: break
print (i)
In the previous example:
for i in range(10):
if i>5: break
print(i)
How did we know that print(i)
was part of the loop? What defines a loop?
Many programming languages use { } or Begin End to delineate blocks of code to treat as a single unit.
Python uses white space (blanks). To define a block of code, indent the block to the same level.
By convention and for readability, indent a consistent number, usually 3 or 4 spaces. Many editors will do this for you.
Functions allow you to write code once and use it many times.
Functions also hide details so code is more understandable.
def area(w, h):
return w*h
area(3, 10)
print("Simple")
import math
x=16
print("The sqrt of %i is %f" % (x, math.sqrt(x)))
print("The sqrt of {} is {}".format(x, math.sqrt(x)))
print("{1} squared is {0}". format(x, math.sqrt(x)))
print("The sqrt of %(x)i is %(xx).3f" % {"x":x, "xx":math.sqrt(x)})
The sqrt of 16 is 4.000000 The sqrt of 16 is 4.0 4.0 squared is 16 The sqrt of 16 is 4.000
l=[1,2,3]
l.clear()
l
# try: index, upper, isupper, startswith, replace
s="This is a String"
print(s.partition("i"))
print(s.split("s"))
print(s.partition.__doc__)
Task: given a file of hundreds or thousands of lines:
FCID,Lane,Sample_ID,SampleRef,index,Description,Control,Recipe,...
160212,1,A1,human,TAAGGCGA-TAGATCGC,None,N,Eland-rna,Mei,Jon_mix10
160212,1,A2,human,CGTACTAG-CTCTCTAT,None,N,Eland-rna,Mei,Jon_mix10
160212,1,A3,human,AGGCAGAA-TATCCTCT,None,N,Eland-rna,Mei,Jon_mix10
160212,1,A4,human,TCCTGAGC-AGAGTAGA,None,N,Eland-rna,Mei,Jon_mix10
...
Remove the last 3 letters from the 5th column:
FCID,Lane,Sample_ID,SampleRef,index,Description,Control,Recipe,...
160212,1,A1,human,TAAGGCGA-TAGAT,None,N,Eland-rna,Mei,Jon_mix10
160212,1,A2,human,CGTACTAG-CTCTC,None,N,Eland-rna,Mei,Jon_mix10
160212,1,A3,human,AGGCAGAA-TATCC,None,N,Eland-rna,Mei,Jon_mix10
160212,1,A4,human,TCCTGAGC-AGAGT,None,N,Eland-rna,Mei,Jon_mix10
...
In tnis example, we'll show:
open the input file
read the first header line, and print it out
for each remaining line in the file
read the line
find the value in the 5th column
truncate it by removing the last three letters
put the line back together
print it out
fp=open('badfile.txt')
fp
Open takes a filename, and returns a file pointer
We'll use that to read from the file.
fp=open('badfile.txt')
print (fp.readline())
We'll call readline() on the file pointer to get a single line from the file. (the header line).
Strip() removes the return at the end of the line.
Then we print it.
fp=open('badfile.txt')
print (fp.readline().strip())
for l in fp:
print(l)
A file pointer is an example of an iterator.
Instead of explicitly calling readline() for each line, we can just loop on the file pointer, getting one line each time.
Since we already read the header, we won't get that line.
fp=open('badfile.txt')
print (fp.readline().strip())
for l in fp:
flds=l.strip().split(',')
flds[4]=flds[4][:-3]
print(flds)
Like before, we strip the return from the line.
We split it into individual elements where we find commas.
The 5th field is referenced by flds[4], since python starts indexing with 0. [:-3] takes all characters of the string until the last 3.
fp=open("badfile.txt")
print (fp.readline().strip())
for l in fp:
flds=l.strip().split(',')
flds[4]=flds[4][:-3]
print (','.join(flds))
Join takes a list of strings, and combines them into one string using the string provided. Then we just print that string.
We would invoke it like this:
$ python Ex1.py badfile.txt
$ python Ex1.py badfile.txt > fixedfile.txt
Imagine you have a directory tree with many subdirectories.
In those directories are files named *.fastq. You want to:
In this example, we'll demonstrate:
for each directory
get a list of files in that directory
for each file in that directory
if that file's name ends with .fastq
create a new file name with .gz added
create a command to do the compression
run that command and check for success
if success
delete the original
else
stop
The conversion command is:
gzip -c file.fastq > file.fastq.gz
We need a way to traverse all the files and directories.
os.walk(dir)
starts at dir and visits every subdirectory below it.
It returns a list of files and subdirectories at each subdirectory.
For example, imagine we have the following dirs and files:
Ex2dir
Ex2dir/d1
Ex2dir/d1/d2
Ex2dir/d1/d2/f2.fastq
Ex2dir/d1/f1.fastq
import os
for d , dirs, files in os.walk('Ex2dir'):
print (d, dirs, files)
The subprocess module has a variety of ways to do this. A simple one:
import subprocess
ret=subprocess.call(cmd, shell=True)
ret is 0 on success, non-zero error code on failure.
import subprocess
ret=subprocess.call('gzip -c myfile.fastq > myfile.fastq.gz', shell=True)
ret
import os, sys, subprocess
sys.argv=['dummy', 'Ex2dir'] # for Jupyter we'll cheat
start=sys.argv[1]
for d, subdirs, files in os.walk(start):
for f in files:
if f.endswith('.fastq'):
fn=d+'/'+f
nfn=fn.replace('.fastq', '.fastq.gz')
cmd='gzip -c '+fn+' > '+nfn
print ("running", cmd)
ret=subprocess.call(cmd, shell=True)
if ret==0:
if os.path.exists(nfn):
os.remove(fn)
else:
print ("Failed on ", fn)
sys.exit(1)
print("Done")
We would invoke it like this:
$ python Ex2.py Ex2dir
Dictionaries associate names with data, and allow quick retrieval by name.
By nesting dictionaries, powerful lookups are fast and easy.
In this example, we'll:
genes.txt describes the locations of genes:
(name, chrom, strand, start, end)
uc001aaa.3 chr1 + 11873 14409
uc010nxr.1 chr1 + 11873 14409
uc010nxq.1 chr1 + 11873 14409
uc009vis.3 chr1 - 14361 16765
uc009vit.3 chr1 - 14361 19759
...
mappedreads.txt describes mapped dna sequences
(name, chrom, position, sequence)
seq1 chr1 674540 ATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGT
seq2 chr19 575000 AGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCG
seq3 chr5 441682 TCTGCATCTGCTCTGGTGTCTTCTGCCATATCACTGC
...
We'd like to be able to quickly determine the genes overlapped by a dna sequence.
First, we need a simple way to determine if two intervals overlap.
intervaltree is a python module that makes that easy.
from intervaltree import IntervalTree
it=IntervalTree()
it[4:7]='gene1'
it[5:10]='gene2'
it[1:11]='gene3'
it
it[1:5]
{'chr1': IntervalTree([Interval(1000, 1100, 'GeneA'),
Interval(2000, 2100, 'GeneB'), ...
'chr2': IntervalTree([Interval(4000, 5100, 'GeneC'),
Interval(7000, 8100, 'GeneD'), ...
'chr3':
...
import sys
from intervaltree import IntervalTree
print("initializing table")
table={}
sys.argv=['dummy', 'genes.txt', 'mappedreads.txt', 'results.txt'] # for Jupyter
for line in open(sys.argv[1]):
genename, chrm, strand, start, end = line.split()
if not chrm in table:
table[chrm]=IntervalTree()
table[chrm][int(start):int(end)]=genename
print("done")
table['chr1'][670000:680000]
open the dna sequence file
for each line in the file:
get chrom, mapped position, and dna sequence
look up the interval tree for that chrom in the dict
search the interval tree for overlaps [pos, pos+len]
print out the gene names
print("reading sequences")
outfp=open(sys.argv[3], 'w')
for line in open(sys.argv[2]):
name, chrm, pos, seq = line.strip().split()
genes=table[chrm][int(pos):int(pos)+len(seq)]
if genes:
print("{}\t{}\t{}\t{}".format(name, chrm, pos, seq), file=outfp)
for gene in genes:
print ('\t{}'.format(gene.data), file=outfp)
print("done")
T = initialT
curr = best = initialsolution
while T > stopT:
candidate = pick two cities, swap and reverse path between them
if cost(candidate) < cost(best):
curr=candidate
best=candidate
elif cost(candidate)-cost(best) < T
curr=candidate
T = T * alpha
import random
def initial(n):
return [[random.randint(-1000,1000), random.randint(-1000,1000)] for i in range(n)]
initial(10)
import matplotlib.pyplot as plt
def plot(path):
xs = []; ys = []
for x,y in path:
xs.append(x)
ys.append(y)
plt.plot(xs, ys, 'co-')
plt.show()
cities=initial(100)
plot(cities)
import math
def dist(a,b):
return math.sqrt((a[0]-b[0])**2+(a[1]-b[1])**2)
def cost(curr):
total=0
n=len(curr)
for i in range(n):
total+=dist(curr[i], curr[(i+1)%n])
return total
cost(cities)
def doswap(curr):
cand=curr[:]
n=len(cand)
i,j = sorted(random.sample(range(n),2))
cand[i:j+1] = reversed(cand[i:j+1])
return cand
curr=initial(5)
cand=doswap(curr)
print(curr)
print(cand)
def p_accept(candcost, currcost, T):
p = math.exp(-abs(candcost - currcost) / T)
return p
p_accept(20010,20000,5)
def main(n):
#random.seed(0)
curr=initial(n)
plot(curr)
currcost=cost(curr)
T=math.sqrt(n)
while T > 1e-10:
cand=doswap(curr)
candcost=cost(cand)
if candcost<currcost:
curr=cand
currcost=candcost
print("forward T %f cost %f" % (T, currcost))
elif p_accept(candcost, currcost, T) > random.random():
curr=cand
currcost=candcost
print("backward T %f cost %f" % (T, currcost))
T=T*.999
plot(curr)
main(100)