Introduction to Python

Robert Bjornson

Yale Center for Research Computing

Mar 2019

What is the Yale Center for Research Computing?

  • Independent center under the Provost's office
  • Created to support your research computing needs
  • Focus is on high performance computing and storage
  • ~15 staff, including applications specialists and system engineers
  • Available to consult with and educate users
  • Manage compute clusters and support users
  • Located at 160 St. Ronan st, at the corner of Edwards and St. Ronan
  • http://research.computing.yale.edu

Why Python?

  • Free, portable, easy to learn
  • Wildly popular, huge and growing community
  • Intuitive, natural syntax
  • Ideal for rapid prototyping but also for large applications
  • Very efficient to write, reasonably efficient to run as is
  • Can be very efficient (numpy, cython, ...)
  • Huge number of packages (modules)

You can use Python to...

  • Convert or filter files
  • Automate repetitive tasks
  • Compute statistics
  • Build processing pipelines
  • Build simple web applications
  • Perform large numerical computations
  • ...

You can use Python instead of bash, Java, or C

Python can be run interactively or as a program

Different ways to run Python

  1. Create a file using editor, then:

    $ python myscript.py

  2. Run interpreter interactively

    $ python

  3. Use a python environment, e.g. Anaconda

We recommend Anaconda:

  • easy to install
  • easy to add additional packages
  • allows creation of custom environments

More involved...

Install Anaconda Python (includes Jupyter notebook)

  1. at www.continuum.io, download python 3.7 version of anaconda, and install
  2. in a terminal, run python

To get tutorial files

To follow presentation in Jupyter notebook

  1. in a terminal, cd to Python-Bootcamp
  2. run this command: jupyter notebook PythonNotebook.ipynb

Python 2 versus 3

  • Two major versions of python in use
  • For beginners, almost the same, major difference is print
  • This tutorial uses python 3

see https://wiki.python.org/moin/Python2orPython3

Basic Python Types: integer, floating point, string, boolean

In [ ]:
radius=2
pi=3.14
diam=radius*2
area=pi*(radius**2)
title="fun with strings"
pi='cherry'
radius=2.5
delicious=True
print(radius)
  • variables do not need to be declared or typed
  • integers and floating points can be used together
  • the same variable can hold different types

Other Python Types: lists

Lists are like arrays in other languages.

In [ ]:
l=[1,2,3,4,5,6,7,8,9]
In [ ]:
print(l)
In [ ]:
l[3]
In [ ]:
l[5:7]
In [ ]:
l[2]=3.14
l
In [ ]:
l.reverse()
l
In [ ]:
dir(l)

Lists are more flexible than arrays, e.g.:

  • Insert or append new elements
  • remove elements
  • nest lists
  • combine values of different types into lists
In [ ]:
l=[1,2,3,4,5,6,7,8,9]
l[2]=[11,12,13]
l[2][0]
In [ ]:
l[3:6]=['four to six']
l
In [ ]:
l=[x*x for x in range(10)]
l

Other Python Types: tuples

tuples are like lists, but not modifiable

In [ ]:
t=(1,2,3,4,5,6,7,8,9)
t
In [ ]:
t[4:6]
In [ ]:
t[5]=99

Other Python Types: strings

Strings are fully featured types in python.

  • strings are defined with ' or "
  • strings cannot be modified
  • strings can be concatenated and sliced much like lists
  • strings are objects with lots of useful methods
In [ ]:
s='Some String'
s  
In [ ]:
s[3]='a'
In [ ]:
s="hello"
s.split('l')

Other Python Types: dictionaries

Dicts are what python calls "hash tables"

  • dicts associate keys with values, which can be of (almost) any type
  • dicts have length, but are not ordered
  • looking up values in dicts is very fast, even if the dict is BIG.
In [ ]:
coins={'penny':1, 'nickle':5, 'dime':10, 'quarter':25}
coins
In [ ]:
coins['dime']
In [ ]:
coins[3]

Control Flow Statements: if

  • if statements allow you to do a test, and do something based on the result
  • else is optional
In [ ]:
import random
v=random.randint(0,100)
if v < 50:
    print ("small", v)
    print ("another line")
else:
    print ("big", v)  
print ("after else")

Control Flow Statements: while

  • While statements execute one or more statements repeatedly until the test is false
In [ ]:
import random
count=0
while count<100:
   count=count+random.randint(0,10)
   print (count)
print("done with loop")

Control Flow Statements: for

For statements take some sort of iterable object and loop once for every value.

In [ ]:
for fruit in "this string":
   print(fruit)
In [ ]:
list(range(10))  
In [ ]:
for i in range(100):
   print(i)

Using for loops and dicts

If you loop over a dict, you'll get just keys. Use items() for keys and values.

In [ ]:
for denom in coins:  
   print (coins[denom])

Control Flow Statements: altering loops

While and For loops can skip steps (continue) or terminate early (break).

In [ ]:
for i in range(10):
   if i%2 != 0: continue
   print (i)
In [ ]:
for i in range(10):
   if i>5: break
   print (i)

Note on code blocks

In the previous example:

In [ ]:
for i in range(10):
   if i>5: break
   print(i)
 

How did we know that print(i) was part of the loop? What defines a loop?

Many programming languages use { } or Begin End to delineate blocks of code to treat as a single unit.

Python uses white space (blanks). To define a block of code, indent the block to the same level.

By convention and for readability, indent a consistent number, usually 3 or 4 spaces. Many editors will do this for you.

Functions

Functions allow you to write code once and use it many times.

Functions also hide details so code is more understandable.

In [ ]:
def area(w, h):
   return w*h

area(3, 10) 

Summary of basic elements of Python

  • 4 basic types: int, float, boolean, string
  • 3 complex types: list, dict, tuple
  • 4 control constructs: if, while, for, def

Basic Printing

In [ ]:
print("Simple")
In [6]:
import math
x=16
print("The sqrt of %i is %f" % (x, math.sqrt(x)))
print("The sqrt of {} is {}".format(x, math.sqrt(x)))
print("{1} squared is {0}". format(x, math.sqrt(x)))
print("The sqrt of %(x)i is %(xx).3f" % {"x":x, "xx":math.sqrt(x)})
The sqrt of 16 is 4.000000
The sqrt of 16 is 4.0
4.0 squared is 16
The sqrt of 16 is 4.000

Objects

  • all python values are objects (lists, strings, dicts, etc.)
  • objects combine value(s) and methods (functions)
  • advanced users can create their own classes of objects
  • all the usual OO stuff: inheritance, data hiding, etc.
  • use dir(obj) to discover an object's methods
In [ ]:
l=[1,2,3]
l.clear()
l
In [ ]:
# try: index, upper, isupper, startswith, replace
s="This is a String"
print(s.partition("i"))
print(s.split("s"))
In [ ]:
print(s.partition.__doc__)

Example 1: File Reformatter

Task: given a file of hundreds or thousands of lines:

FCID,Lane,Sample_ID,SampleRef,index,Description,Control,Recipe,...
160212,1,A1,human,TAAGGCGA-TAGATCGC,None,N,Eland-rna,Mei,Jon_mix10
160212,1,A2,human,CGTACTAG-CTCTCTAT,None,N,Eland-rna,Mei,Jon_mix10
160212,1,A3,human,AGGCAGAA-TATCCTCT,None,N,Eland-rna,Mei,Jon_mix10
160212,1,A4,human,TCCTGAGC-AGAGTAGA,None,N,Eland-rna,Mei,Jon_mix10
...

Remove the last 3 letters from the 5th column:

FCID,Lane,Sample_ID,SampleRef,index,Description,Control,Recipe,...
160212,1,A1,human,TAAGGCGA-TAGAT,None,N,Eland-rna,Mei,Jon_mix10
160212,1,A2,human,CGTACTAG-CTCTC,None,N,Eland-rna,Mei,Jon_mix10
160212,1,A3,human,AGGCAGAA-TATCC,None,N,Eland-rna,Mei,Jon_mix10
160212,1,A4,human,TCCTGAGC-AGAGT,None,N,Eland-rna,Mei,Jon_mix10
...

In tnis example, we'll show:

  • reading lines of a file
  • parsing and modifying the lines
  • writing them back out
  • creating a script to do the above and running it
  • passing the script the file to modify

In pseudocode

 open the input file
 read the first header line, and print it out
 for each remaining line in the file
   read the line
   find the value in the 5th column
   truncate it by removing the last three letters
   put the line back together
   print it out

Step 1: open the input file

In [ ]:
fp=open('badfile.txt')
In [ ]:
fp

Open takes a filename, and returns a file pointer

We'll use that to read from the file.

Step 2: read the first header line, and print it out

In [ ]:
fp=open('badfile.txt')
print (fp.readline())

We'll call readline() on the file pointer to get a single line from the file. (the header line).

Strip() removes the return at the end of the line.

Then we print it.

Step 3: for each remaining line in the file, read the line

In [ ]:
fp=open('badfile.txt')
print (fp.readline().strip())
for l in fp:
  print(l)

A file pointer is an example of an iterator.

Instead of explicitly calling readline() for each line, we can just loop on the file pointer, getting one line each time.

Since we already read the header, we won't get that line.

Step 4: find the value in the 5th column, and remove last 3 letters

In [ ]:
fp=open('badfile.txt')
print (fp.readline().strip())  
for l in fp:
    flds=l.strip().split(',')
    flds[4]=flds[4][:-3]
    print(flds)

Like before, we strip the return from the line.

We split it into individual elements where we find commas.

The 5th field is referenced by flds[4], since python starts indexing with 0. [:-3] takes all characters of the string until the last 3.

Step 5: put the line back together, and print it

In [ ]:
fp=open("badfile.txt")
print (fp.readline().strip())
for l in fp:
    flds=l.strip().split(',')
    flds[4]=flds[4][:-3]
    print (','.join(flds))

Join takes a list of strings, and combines them into one string using the string provided. Then we just print that string.

We would invoke it like this:

$ python Ex1.py badfile.txt

$ python Ex1.py badfile.txt > fixedfile.txt

Example 2: directory walk with file ops

Imagine you have a directory tree with many subdirectories.

In those directories are files named *.fastq. You want to:

  • find them
  • compress them to fastq.gz using a program
  • delete them if the conversion was successful

In this example, we'll demonstrate:

  • traversing an entire directory tree
  • executing a program on files in that tree
  • testing for successful program execution

In psuedocode

for each directory
   get a list of files in that directory
   for each file in that directory
     if that file's name ends with .fastq
       create a new file name with .gz added
       create a command to do the compression
       run that command and check for success
       if success
         delete the original
       else
         stop

The conversion command is: gzip -c file.fastq > file.fastq.gz

Step 1: directory traversal

We need a way to traverse all the files and directories. os.walk(dir) starts at dir and visits every subdirectory below it. It returns a list of files and subdirectories at each subdirectory.

For example, imagine we have the following dirs and files:

Ex2dir
Ex2dir/d1
Ex2dir/d1/d2
Ex2dir/d1/d2/f2.fastq
Ex2dir/d1/f1.fastq
In [ ]:
import os
for d , dirs, files in os.walk('Ex2dir'):
   print (d, dirs, files)

Step 2: Invoking other programs from python

The subprocess module has a variety of ways to do this. A simple one:

import subprocess

ret=subprocess.call(cmd, shell=True)

ret is 0 on success, non-zero error code on failure.

In [ ]:
import subprocess
ret=subprocess.call('gzip -c myfile.fastq > myfile.fastq.gz', shell=True)
ret 

Put it all together

In [ ]:
import os, sys, subprocess
sys.argv=['dummy', 'Ex2dir'] # for Jupyter we'll cheat
start=sys.argv[1] 
for d, subdirs, files in os.walk(start):
    for f in files:
        if f.endswith('.fastq'):
            fn=d+'/'+f
            nfn=fn.replace('.fastq', '.fastq.gz')
            cmd='gzip -c '+fn+' > '+nfn
            print ("running", cmd)
            ret=subprocess.call(cmd, shell=True)
            if ret==0:
                if os.path.exists(nfn):
                    os.remove(fn)
            else:
                print ("Failed on ", fn)
                sys.exit(1)
print("Done")

We would invoke it like this:

$ python Ex2.py Ex2dir

Example 3: Nested Dictionaries

Dictionaries associate names with data, and allow quick retrieval by name.

By nesting dictionaries, powerful lookups are fast and easy.

In this example, we'll:

  • create a dict containing objects
  • load the objects with search data
  • use the dict to retrieve the appropriate object for a search
  • perform the search

genes.txt describes the locations of genes:

(name, chrom, strand, start, end)

uc001aaa.3      chr1    +       11873   14409 
uc010nxr.1      chr1    +       11873   14409 
uc010nxq.1      chr1    +       11873   14409  
uc009vis.3      chr1    -       14361   16765  
uc009vit.3      chr1    -       14361   19759  
...

mappedreads.txt describes mapped dna sequences

(name, chrom, position, sequence)

seq1 chr1  674540   ATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGT
seq2 chr19 575000   AGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCG
seq3 chr5  441682   TCTGCATCTGCTCTGGTGTCTTCTGCCATATCACTGC
...

We'd like to be able to quickly determine the genes overlapped by a dna sequence.

First, we need a simple way to determine if two intervals overlap.

intervaltree is a python module that makes that easy.

In [ ]:
from intervaltree import IntervalTree
it=IntervalTree()
it[4:7]='gene1'
it[5:10]='gene2'
it[1:11]='gene3'
it
In [ ]:
it[1:5]

General plan

  • use interval trees, one for each chromosome
  • organize the trees in a dictionary by chromosome
  • store an interval for each gene the tree for it's chromosome
{'chr1': IntervalTree([Interval(1000, 1100, 'GeneA'), 
                       Interval(2000, 2100, 'GeneB'), ...
 'chr2': IntervalTree([Interval(4000, 5100, 'GeneC'), 
                       Interval(7000, 8100, 'GeneD'), ...
 'chr3':
 ...

In psuedocode

setup the lookup table

 create empty dict
 open the gene file
 for each line in the file
    get gene name, chrom, start, end
    initialize an intervaltree for the chrom, if needed, and add to dict
    add the interval and gene name to the interval tree
In [ ]:
import sys
from intervaltree import IntervalTree

print("initializing table")
table={}
sys.argv=['dummy', 'genes.txt', 'mappedreads.txt', 'results.txt'] # for Jupyter
for line in open(sys.argv[1]):
    genename, chrm, strand, start, end = line.split()
    if not chrm in table:
        table[chrm]=IntervalTree()
    table[chrm][int(start):int(end)]=genename
print("done")
In [ ]:
table['chr1'][670000:680000]

use the interval trees to find overlapped genes

 open the dna sequence file
 for each line in the file:
   get chrom, mapped position, and dna sequence
   look up the interval tree for that chrom in the dict
   search the interval tree for overlaps [pos, pos+len]
   print out the gene names
In [ ]:
print("reading sequences")

outfp=open(sys.argv[3], 'w')
for line in open(sys.argv[2]):
    name, chrm, pos, seq = line.strip().split()
    genes=table[chrm][int(pos):int(pos)+len(seq)]
    if genes:
        print("{}\t{}\t{}\t{}".format(name, chrm, pos, seq), file=outfp)
        for gene in genes:
            print ('\t{}'.format(gene.data), file=outfp)
print("done")

Example 4: Travelling salesman

  • Given N cities, find shortest route visiting each once
  • NP-complete -> very expensive to find optimal solution
  • We'll use a heuristic approach, simulated annealing:
T = initialT
curr = best = initialsolution
while T > stopT:
    candidate = pick two cities, swap and reverse path between them
    if cost(candidate) < cost(best):
        curr=candidate
        best=candidate
    elif cost(candidate)-cost(best) < T
        curr=candidate
    T = T * alpha
In [ ]:
import random
def initial(n):
    return [[random.randint(-1000,1000), random.randint(-1000,1000)] for i in range(n)]
In [ ]:
initial(10)
In [ ]:
import matplotlib.pyplot as plt
def plot(path):
    xs = []; ys = []
    for x,y in path:
        xs.append(x)
        ys.append(y)
    plt.plot(xs, ys, 'co-')
    plt.show()
In [ ]:
cities=initial(100)
plot(cities)
In [ ]:
import math

def dist(a,b):
    return math.sqrt((a[0]-b[0])**2+(a[1]-b[1])**2)

def cost(curr):
    total=0
    n=len(curr)
    for i in range(n):
        total+=dist(curr[i], curr[(i+1)%n])
    return total

cost(cities)
In [ ]:
def doswap(curr):
    cand=curr[:]
    n=len(cand)
    i,j = sorted(random.sample(range(n),2))
    cand[i:j+1] = reversed(cand[i:j+1])
    return cand
In [ ]:
curr=initial(5)
cand=doswap(curr)
print(curr)
print(cand)
In [ ]:
def p_accept(candcost, currcost, T):
    p = math.exp(-abs(candcost - currcost) / T)
    return p

p_accept(20010,20000,5)
In [ ]:
def main(n):
    #random.seed(0)
    curr=initial(n)
    plot(curr)
    currcost=cost(curr)
    T=math.sqrt(n)
    while T > 1e-10:
        cand=doswap(curr)
        candcost=cost(cand)
        if candcost<currcost:
            curr=cand
            currcost=candcost
            print("forward T %f cost %f" % (T, currcost))
        elif p_accept(candcost, currcost, T) > random.random():
            curr=cand
            currcost=candcost
            print("backward T %f cost %f" % (T, currcost))
        T=T*.999
    plot(curr)
In [ ]:
main(100)

Important Packages

  • numpy: high performance arrays
  • scipy: stats, linear alg, etc.
  • multiprocessing: easy parallelism
  • matplotlib: plotting
  • pandas: stats with R-like dataframes
  • flask, web2py: web applications

Python Resources we like