Intro to Python for data science

Start installing + about: http://y-t.io/1

Please fill out the poll: http://y-t.io/2

What we will cover:
  • how to install python and start writing code
  • how to read code written by you or other people
  • how to use other people’s code
  • how to get from your first errors to your first plot

All computing is about ones and zeroes


>>> "{0:b}".format(42)
'101010'
        

Meaning:
42 = 1 × 2^5 + 0 × 2^4 + 1 × 2^3 + 0 × 2^2 + 1 × 2^1 + 0 × 2^0
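
The same expansion can be checked in code. This is a minimal sketch with illustrative variable names (not from the slides): each bit is multiplied by its power of two and the results are summed.

```python
# Rebuild 42 from its binary digits; the names here are illustrative.
bits = "{0:b}".format(42)           # '101010'

total = 0
for power, bit in enumerate(reversed(bits)):
    # the rightmost bit has power 0, the next one power 1, and so on
    total += int(bit) * 2 ** power

print(bits, "->", total)            # 101010 -> 42
print(int(bits, 2))                 # built-in shortcut for the same conversion
```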

All computing is about ones and zeroes


>>> list(map(ord, "I🧡∞"))
[73, 129505, 8734]
        

Meaning:
letter "I" is encoded by number 73
emoji "🧡" is encoded by number 129505
symbol "∞" is encoded by number 8734
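
The mapping works in both directions. A small sketch, assuming the text is stored as UTF-8 (the most common encoding, but an assumption here): ord turns characters into numbers, chr turns them back, and encode shows how many bytes each character needs.

```python
text = "I🧡∞"

codes = [ord(ch) for ch in text]            # characters -> numbers
restored = "".join(chr(n) for n in codes)   # numbers -> characters

# Number of bytes each character takes when stored as UTF-8
byte_lengths = [len(ch.encode("utf-8")) for ch in text]

print(codes)             # [73, 129505, 8734]
print(restored == text)  # True
print(byte_lengths)      # [1, 4, 3]
```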

Memory is taken care of

  • memory is sequential, i.e. address #2 follows address #1
  • everything is of known length
  • if you have some data, it's somewhere in the memory

Memory is taken care of


>>> a = [1,2,3]
>>> b = [1,2,3]
>>> c = a
>>> object.__repr__(a)
'<list object at 0x100c6bf08>'
>>> object.__repr__(b)
'<list object at 0x100cad6c8>'
>>> object.__repr__(c)
'<list object at 0x100c6bf08>'
        

Meaning:
object "a" points to address 0x100c6bf08
object "b" points to address 0x100cad6c8
object "c" points to the same address as object "a"
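
A practical consequence, sketched with the same three names: "is" compares addresses, "==" compares contents, and a change made through c is visible through a because both point to the same list.

```python
a = [1, 2, 3]
b = [1, 2, 3]
c = a

same_object = a is c      # True: two names for one list
separate = a is b         # False: equal contents, but a separate list
equal_values = a == b     # True: the contents are compared

c.append(4)               # changes the list that both a and c point to
print(a)                  # [1, 2, 3, 4]
print(b)                  # [1, 2, 3]
print(id(a) == id(c))     # True: id() returns the object's identity
```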

Python: programming language and a tool

What you downloaded is a program called python (the interpreter); it runs code written in the Python language

A program operates in the context of:

  • input (variable data)
  • output (data, usually depends on the input)
  • state (everything else, e.g. file system)
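
A rough sketch of the three parts in one place. The function name count_run and the file runs.txt are hypothetical, made up for this illustration: the label is the input, the returned message is the output, and the counter file is the state, which is why the same input gives a different output on the second call.

```python
import os
import tempfile

def count_run(label, counter_path):
    """Build a message (output) from label (input) and a counter file (state)."""
    runs = 0
    if os.path.exists(counter_path):
        with open(counter_path) as f:
            runs = int(f.read())
    runs += 1
    with open(counter_path, "w") as f:
        f.write(str(runs))              # update the state for the next call
    return "{} run #{}".format(label, runs)

# hypothetical state file in a fresh temporary directory
counter_path = os.path.join(tempfile.mkdtemp(), "runs.txt")

first = count_run("demo", counter_path)
second = count_run("demo", counter_path)
print(first)    # demo run #1
print(second)   # demo run #2
```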

Python as a language

  • reads from top to bottom, from left to right
  • has sentences, called statements
  • has parts of speech
  • is hierarchical
  • can express same thing in lots of different ways
# find the index of the largest element of array
max_i, max_el = 0, array[0]
for i, el in enumerate(array[1:]):
    if el > max_el:
        max_i, max_el = i + 1, el

Coding is a dialogue


Abstract syntax tree (AST)

1 + 2 * 3 != (1 + 2) * 3
max_i, max_el = 0, array[0]
for i, el in enumerate(array[1:]):
    if el > max_el:
        max_i, max_el = i + 1, el

Abstract syntax tree (AST)

Each node does something to the state and has an output.
Python evaluates the tree bottom-up, node by node.
So:

x = (7 + 2) * 5
is equivalent to:
x = (9) * 5
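
The tree can be inspected with the standard ast module. This is a small sketch with illustrative variable names; the exact text printed by ast.dump differs between Python versions, so no output is shown for it.

```python
import ast

# Parse two expressions that differ only in parentheses
flat = ast.parse("1 + 2 * 3", mode="eval").body
grouped = ast.parse("(1 + 2) * 3", mode="eval").body

# The top node of each tree is the operation that is performed last
print(type(flat.op).__name__)      # Add
print(type(grouped.op).__name__)   # Mult

# Evaluating bottom-up therefore gives different results
print(1 + 2 * 3, (1 + 2) * 3)      # 7 9

# Full tree as text (format depends on the Python version)
print(ast.dump(flat))
```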

Parts of speech, simple

1 reads “number 1” or “integer 1”
1.5 reads “number 1.5” or “float 1.5”
"1" or '1' reads “string "1"”
"""1""" reads “string "1"”
True reads “boolean true”
False reads “boolean false”
None reads just “none”
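
One way to check how Python reads each of these is to ask for its type. A minimal sketch (the list name literals is just for illustration):

```python
literals = [1, 1.5, "1", """1""", True, False, None]

type_names = [type(value).__name__ for value in literals]
print(type_names)
# ['int', 'float', 'str', 'str', 'bool', 'bool', 'NoneType']

# All three quoting styles produce the same string
same_string = "1" == '1' == """1"""
print(same_string)   # True
```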

Parts of speech, complex

(1,2) reads “tuple of numbers 1 and 2”
(1,) reads “tuple of number 1”
[1,2] reads “list of numbers 1 and 2”
{1,2} reads “set of numbers 1 and 2”
{1:2} reads “dictionary with key number 1 to value number 2”
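
The same type check works for the complex ones. The sketch below also shows two common surprises: parentheses without a comma do not make a tuple, and empty braces make a dictionary, not a set.

```python
values = [(1, 2), (1,), [1, 2], {1, 2}, {1: 2}]

type_names = [type(value).__name__ for value in values]
print(type_names)   # ['tuple', 'tuple', 'list', 'set', 'dict']

not_a_tuple = (1)      # just the number 1 in parentheses
empty_braces = {}      # an empty dictionary
empty_set = set()      # the way to write an empty set

print(type(not_a_tuple).__name__)    # int
print(type(empty_braces).__name__)   # dict
print(type(empty_set).__name__)      # set
```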

Expressions, references

x reads “value of object x”
x.y reads “value of attribute y of object x”
x(1,2) reads “call function x with parameters of numbers 1 and 2”
x.y() reads “call method y of object x with no parameters”
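
A short sketch of each form on built-in objects. A complex number is used only because it has both a plain attribute and a method; the variable names are illustrative.

```python
z = 3 + 4j                    # a complex number

real_part = z.real            # attribute: no parentheses, nothing is called
mirrored = z.conjugate()      # method call with no arguments
size = abs(z)                 # function call with one argument
rounded = round(3.14159, 2)   # function call with two arguments

print(real_part)   # 3.0
print(mirrored)    # (3-4j)
print(size)        # 5.0
print(rounded)     # 3.14
```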

Expressions, subscripts

x[1] reads “subscript object x with number 1” or more commonly “value of second element of list/tuple x”
x["a"] reads “subscript object x with string "a"” or more commonly “value by key "a" from dictionary x”
x[1:3] reads “subscript object x with slice from 1 to 3” or more commonly “take the 2nd and 3rd elements of list/tuple x” (the end index 3 is not included)
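
A quick sketch with made-up data. Counting starts at 0, and the end of a slice is not included.

```python
letters = ["a", "b", "c", "d", "e"]

second = letters[1]       # index 1 is the second element
middle = letters[1:3]     # indexes 1 and 2; index 3 is excluded
last = letters[-1]        # negative indexes count from the end

ages = {"a": 30, "b": 25}
age_a = ages["a"]         # look up the value stored under key "a"

print(second)   # b
print(middle)   # ['b', 'c']
print(last)     # e
print(age_a)    # 30
```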

Expressions, operators

Mathematical:
x + y, x - y, x / y, x * y
x ** y for power, x // y for integer division
x % y for the remainder of division
Comparisons:
x > y, x >= y, x == y
Logical:
x and y, x or y
not x, x in y
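
A few of these operators on concrete numbers, as a minimal sketch. Note that / always gives a float, while // rounds down (also for negative numbers).

```python
true_division = 7 / 2       # 3.5
floor_division = 7 // 2     # 3
remainder = 7 % 2           # 1
power = 2 ** 10             # 1024
negative_floor = -7 // 2    # -4, rounds down, not towards zero

is_member = 3 in [1, 2, 3]            # True
combined = not (1 > 2) and 2 >= 2     # True

print(true_division, floor_division, remainder, power, negative_floor)
print(is_member, combined)
```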

Sentence types, aka statements

Assignment:
x = 1
Conditions or if-statement:
if x > y:
    print("Greater")
elif x < y:
    print("Less")
else:
    print("Equal?")
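
The same if-statement wrapped in a function so it can be tried on different values (the name compare is made up for this sketch). One possible reason for the question mark in "Equal?": the else branch runs whenever neither comparison is true, which for the special float value NaN does not mean the values are equal.

```python
def compare(x, y):
    """Return which branch of the if-statement is taken for x and y."""
    if x > y:
        return "Greater"
    elif x < y:
        return "Less"
    else:
        return "Equal?"

print(compare(2, 1))             # Greater
print(compare(1, 2))             # Less
print(compare(1, 1))             # Equal?

# NaN is neither greater than, less than, nor equal to anything
nan = float("nan")
print(compare(nan, 1.0))         # Equal?
print(nan == 1.0)                # False
```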

Sentence types, loops

For-loop:
for item in collection:
    print(item)
else:
    print("the end")

While loop:
while x > 0:
    x = x - 1
else:
    print("the end")
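
The else part of a loop runs only when the loop ends on its own, without break. A minimal sketch, with a made-up function name, that uses this to report "nothing found":

```python
def find_first_negative(numbers):
    """Return the first negative number, or None if there is none."""
    for number in numbers:
        if number < 0:
            result = number
            break               # leave the loop early, else is skipped
    else:
        result = None           # the loop finished without break
    return result

print(find_first_negative([3, -1, -5]))   # -1
print(find_first_negative([1, 2]))        # None

# A while loop repeats as long as its condition is true
x = 3
steps = []
while x > 0:
    steps.append(x)
    x = x - 1
print(steps)                              # [3, 2, 1]
```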

Sentence types, misc.

Expecting errors:
try:
    maybe_works()
except Exception as e:
    print("well, it didn't")
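
It is usually better to name the error you expect instead of catching every Exception. A small sketch (safe_divide is a made-up name):

```python
def safe_divide(a, b):
    """Return a / b, or None when b is zero."""
    try:
        return a / b
    except ZeroDivisionError as e:
        # e holds the error object; printing it shows the message
        print("well, it didn't:", e)
        return None

print(safe_divide(6, 3))   # 2.0
print(safe_divide(1, 0))   # prints the message, then None
```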

Imports (using code from other files):
import antigravity
import pandas as pd
from matplotlib import pyplot as plt
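
The three import forms, sketched with modules from the standard library so the example runs without installing anything (the short names stats and p are arbitrary choices for this sketch):

```python
import math                    # use as math.something
import statistics as stats     # same, under a shorter name
from os import path as p       # take one part of a module and rename it

root = math.sqrt(16)
average = stats.mean([1, 2, 3, 4])
file_name = p.basename("data/file.csv")

print(root)        # 4.0
print(average)     # 2.5
print(file_name)   # file.csv
```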

Sentence types, functions

You can define your function:
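def index_of_max(array):
    max_i, max_el = 0, array[0]
    for i, el in enumerate(array[1:]):
        if el > max_el:
            # remember both the new maximum and where it was found
            max_i, max_el = i + 1, el
    return max_i
print(index_of_max([1,2,3]))  # 2
print(index_of_max([]))       # raises IndexError: array[0] does not exist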
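
The same thing can be expressed in a different way. This is one alternative sketch, not the version from the slides: it uses the built-in max function and returns None for an empty list instead of raising an error (the name index_of_max_builtin is made up).

```python
def index_of_max_builtin(array):
    """Return the index of the largest element, or None for an empty sequence."""
    if not array:
        return None
    # pick the index whose element is the largest; ties go to the first one
    return max(range(len(array)), key=lambda i: array[i])

print(index_of_max_builtin([1, 2, 3]))   # 2
print(index_of_max_builtin([1, 3, 2]))   # 1
print(index_of_max_builtin([5, 1, 5]))   # 0
print(index_of_max_builtin([]))          # None
```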

Errors

EVERYBODY MAKES THEM

Errors

Switch to Errors notebook

Data types

Lists: an ordered collection of objects (of any type)

a = []
a.append(1)     # a = [1]
a.extend([2,3]) # a = [1,2,3]
a.reverse()     # a = [3,2,1]
a.index(1)      # 2 -------^
a.pop()         # a = [3,2]
del a[0]        # a = [2]

Data types

Dictionaries: a mapping between keys and values

a = {1:1, 2:4, 3:9, 4:16}
a[5] = 25              # a = {1:1, 2:4, 3:9, 4:16, 5:25}
a.update({0:0, 1:-1})  # a = {1:-1, 2:4, 3:9, 4:16, 5:25, 0:0}
del a[4]               # a = {1:-1, 2:4, 3:9, 5:25, 0:0}
a.get(1)               # -1
a.pop(2)               # 4; a = {1:-1, 3:9, 5:25, 0:0}
list(a.keys())         # [1, 3, 5, 0]
list(a.values())       # [-1, 9, 25, 0]
list(a.items())        # [(1,-1),(3,9),(5,25),(0,0)]
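
A typical use of a dictionary is counting. A minimal sketch with made-up data: get with a second argument returns a default value when the key is missing, so a new word starts from 0.

```python
words = ["a", "b", "a", "c", "a", "b"]

counts = {}
for word in words:
    # current count (0 if the word is new) plus one
    counts[word] = counts.get(word, 0) + 1

has_c = "c" in counts        # "in" checks the keys
missing = counts.get("z")    # None instead of an error for a missing key

print(counts)    # {'a': 3, 'b': 2, 'c': 1}
print(has_c)     # True
print(missing)   # None
```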

Other people's code

  • distributed as packages (or libraries)
  • package has a name and a version
  • versions change all the time
  • manage packages with pip (or conda)
  • work in projects and manage each project's packages independently with pipenv (or virtualenv or conda)

Most common packages:

  • numpy: for efficient numerical computation and linear algebra
  • pandas: for working with data in tables
  • matplotlib: for plotting data
  • scikit-learn: a collection of machine learning algorithms
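
To see which versions of these packages are installed in the current environment, one option is the standard importlib.metadata module. A sketch assuming Python 3.8 or newer; installed_version is a made-up helper name, and the printed versions depend on your installation.

```python
from importlib import metadata

def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None

for name in ["numpy", "pandas", "matplotlib", "scikit-learn"]:
    # prints e.g. "numpy None" when the package is not installed
    print(name, installed_version(name))
```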

GitHub

  • essential for collaboration
  • backup of your project
  • sharing your project
  • still very useful for personal projects

A brief example of plotting data

See the Example notebook for how matplotlib and pandas can be used to produce basic plots

NUIT workshops

Python: Text Analysis with NLTK (Natural Language ToolKit)
Thursday, December 5; 1-4pm, Mudd Library 2210

Biopython: Introduction-Chicago (familiarity with python)
Monday, December 2; 1-4pm, Galter Health Sciences Library, Library Classroom, Level 2

R: Data Manipulation with the Tidyverse
Thursday, December 5; 9am-noon, Galter Health Sciences Library, Library Classroom, Level 2

Thank you

Come to Data Science Night slack


nikolay.markov@northwestern.edu
@nmarkov on slack