## Visualising Big O notation

I’m currently learning as much computer science as I can on the side. I’ve come across Big O notation a few times already, and while I understand it, I’m much more of a visual guy.

It’s rather easy to use Python and matplotlib to graph how a function’s execution time grows as the size of the input grows. The important thing to note is not total execution time, but rather how the runtime of the function grows in relation to the input size. Plotting this should give us a nice visual representation of the various Big O notations.

Note too that Big O notation describes the worst case. For this reason I’ll make sure to use input values that force each function to do the most work.

# O(1)

O(1) means constant time: no matter the size of the input, the runtime will always be the same. A simple example is finding the middle element of a list. I’ll ensure that each function returns a count of how many operations it performed. This may make the code look just a bit bloated, but for a good reason: the count is what gets plotted.

To find the center of a list we simply divide the length of the list in two and use that as an index. It does not matter whether the list has 10 elements or 100; the same number of steps is performed:

```
def O1(input):
    count = 0
    result = input[len(input) / 2]
    count += 1
    return count
```

I have created 5 lists. The first is length 10, the second length 20, and so on. I’ll collect the returned counts and plot them.
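As a quick sanity check before plotting, the counts can be collected in a few lines. This is a minimal sketch in Python 3 form (note `//` floor division replacing Python 2’s `/`):

```python
def O1(input):
    count = 0
    result = input[len(input) // 2]  # Python 3 floor division
    count += 1
    return count

# five lists of increasing length
sizes = [10, 20, 30, 40, 50]
counts = [O1(list(range(n))) for n in sizes]
print(counts)  # the count never changes with input size
```

Feeding `sizes` and `counts` to matplotlib’s `pyplot.plot` then draws the flat line below.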

### O(1) plot

As can be seen, it doesn’t matter the size of the input. It will always run in the same constant time.

# O(logN)

O(logN) run time increases as the input size goes up, but only as the logarithm of the input size. This means you can increase your input size dramatically while the number of steps barely grows: doubling the input only adds a constant number of steps.

```
def OlogN(input):
    def search(length, count):
        count += 1
        length /= 2
        if length == 1 or length == 0:
            return 1 + count
        else:
            return 1 + search(length, count)
    return 1 + search(len(input), 1)
```

### O(logN) plot

The run time is going up, but look at the size of the inputs at the bottom. I start with 10,000 and move up to 500,000. The number of steps has increased, but not significantly.
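The same comparison can be made numerically rather than on the plot. Here is the function again in Python 3 form (`//=` for integer halving), run on the two extremes:

```python
def OlogN(input):
    def search(length, count):
        count += 1
        length //= 2          # integer halving, Python 3 style
        if length == 1 or length == 0:
            return 1 + count
        else:
            return 1 + search(length, count)
    return 1 + search(len(input), 1)

# a 50x bigger input only adds a handful of steps
small = OlogN(range(10000))
large = OlogN(range(500000))
print(small, large)
```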

# O(N)

O(N) is linear: the run time grows in direct proportion to the input size. They should increase at exactly the same rate.

```
def ON(input, check):
    count = 0
    for number in input:
        count += 1
        if number == check:
            return 1 + count
    return 1 + count
```

### O(N) plot

There is a 1:1 correlation between input size and run time. As expected this produces a linear graph.

# O(N²)

O(N²)’s runtime goes up as the square of the input size. The runtime grows faster than the input, so processing time increases rapidly. This usually happens when you nest one loop inside another, like so:

```
def ON2(input):
    count = 0
    for i in input:
        count += 1
        for j in input:
            count += 1
    return 1 + count
```

# O(N³)

O(N³) is merely O(N²) with another exponent. I wanted to show the difference made by simply increasing the exponent.

```
def ON3(input):
    count = 0
    for i in input:
        count += 1
        for j in input:
            count += 1
            for k in input:
                count += 1
    return 1 + count
```

### O(N³) plot

Graphs increase rapidly as the exponent increases.

# Conclusions

I’ve not shown every single type of algorithm; I just wanted to cover the ones I have the most experience with. It’s nice to have a visual representation of these things, as it really drives home just how fast your runtime can increase with larger inputs.

You can find my code used over here.

## Python and MySQL

Let me preface this post by stating I am not a database expert; I only use databases now and then. The post below probably doesn’t show best practices, so if you have any suggestions feel free to comment.

Over the weekend I’ve been testing various ways to store, update, and retrieve data from a database. At first I was using the sqlite3 package, but I turned to MySQL. Why the move, you ask? Well, I already had MySQL running on the target machine, with scripts backing up all databases every night. Why not just use the existing database there?

For this post I’m using Python 2.7.3, Debian 7.7.0, and MySQL-python 1.2.5.

## Install and set-up

Ensure that the MySQL Python library is installed:

```
root@host:/tmp# pip install MySQL-python
Collecting MySQL-python
100% |################################| 110kB 6.4MB/s
Installing collected packages: MySQL-python
```

I’ve also installed the MySQL server on the test server. I’m using a root password of password, simply because this is a test box.

I’ll now create a database to work with. I want a database that contains BGP AS numbers, AS names, a QUERIED counter, and a date field to record when the AS entry last changed. A user will be created that has full access to this database. The password for now will be password.

```
root@host:~# mysql -u root -p

mysql> create database AS_NUM_NAME;
Query OK, 1 row affected (0.00 sec)

mysql> CREATE USER 'asnuser'@'localhost' IDENTIFIED BY 'password';
Query OK, 0 rows affected (0.00 sec)

mysql> GRANT ALL ON AS_NUM_NAME.* TO 'asnuser'@'localhost';
Query OK, 0 rows affected (0.00 sec)

mysql> USE AS_NUM_NAME;
Database changed

mysql> create table ASN(AS_NUM INT, AS_NAME VARCHAR(50), QUERIED INT, CHANGED DATE);
Query OK, 0 rows affected (0.01 sec)

mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)

mysql> quit;
Bye
```

The MySQLdb library allows us to open a connection to the database and execute SQL commands. The next script will open the database, insert new data, commit those changes, then close the connection:

```
#!/usr/bin/python

import MySQLdb
import time

today = time.strftime("%Y-%m-%d")

db = MySQLdb.connect("localhost", "asnuser", "password", "AS_NUM_NAME")

with db:
    cursor = db.cursor()
    sql = '''INSERT INTO ASN(AS_NUM, AS_NAME, QUERIED, CHANGED) \
             VALUES (%s, %s, %s, %s)'''
    cursor.execute(sql, (1, 'Level 3 Communications, Inc.', 0, today))
```

I’ll give this a run:

`root@host:~# ./add.py`

```
root@host:~# mysql -u asnuser -p

mysql> use AS_NUM_NAME;

mysql> select * from ASN;
+--------+------------------------------+---------+------------+
| AS_NUM | AS_NAME                      | QUERIED | CHANGED    |
+--------+------------------------------+---------+------------+
|      1 | Level 3 Communications, Inc. |       0 | 2015-01-06 |
+--------+------------------------------+---------+------------+
1 row in set (0.00 sec)
```

Of course, doing a single update is not very useful. I’ll create a text file with the first 20 AS numbers and names in use:

```
1     Level 3 Communications, Inc.
2     University of Delaware
3     Massachusetts Institute of Technology
4     University of Southern California
5     Symbolics, Inc.
6     Bull HN Information Systems Inc.
7     UK Defence Research Agency
8     Rice University
9     Carnegie Mellon University
10    CSNET Coordination and Information Center (CSNET-CIC)
11    Harvard University
12    New York University
14    Columbia University
15    DYNAMICS
16    Lawrence Berkeley National Laboratory
17    Purdue University
18    University of Texas at Austin
19    Leidos, Inc.
20    University of Rochester
```

I want to extract the first number, then treat everything after it as the AS name. I’ll then insert them all into the database:

```
#!/usr/bin/python

import MySQLdb
import time

today = time.strftime("%Y-%m-%d")
as_list = []

with open('20_asn') as f:
    for AS in f:
        as_list.append(AS.strip())

db = MySQLdb.connect("localhost", "asnuser", "password", "AS_NUM_NAME")

with db:
    cursor = db.cursor()
    sql = '''INSERT INTO ASN(AS_NUM, AS_NAME, QUERIED, CHANGED) \
             VALUES (%s, %s, %s, %s)'''
    for AS in as_list:
        split_as = AS.split(' ', 1)
        cursor.execute(sql, (int(split_as[0].strip()), split_as[1].strip(), 0, today))
```

After a quick run, let’s log back into the database and check what we have:

```
mysql> SELECT * FROM ASN WHERE AS_NUM = 14;
+--------+---------------------+---------+------------+
| AS_NUM | AS_NAME             | QUERIED | CHANGED    |
+--------+---------------------+---------+------------+
|     14 | Columbia University |       0 | 2015-01-06 |
+--------+---------------------+---------+------------+
1 row in set (0.00 sec)

mysql> SELECT * FROM ASN WHERE AS_NUM = 18;
+--------+-------------------------------+---------+------------+
| AS_NUM | AS_NAME                       | QUERIED | CHANGED    |
+--------+-------------------------------+---------+------------+
|     18 | University of Texas at Austin |       0 | 2015-01-06 |
+--------+-------------------------------+---------+------------+
1 row in set (0.00 sec)
```

## Retrieve data

I’ll write another script now that takes an AS number as an argument and prints out the AS name:

```
#!/usr/bin/python

import MySQLdb
import sys

db = MySQLdb.connect("localhost", "asnuser", "password", "AS_NUM_NAME")
cursor = db.cursor()
# note: concatenating sys.argv into SQL is fine on a test box,
# but a parameterised query would be safer
sql = "SELECT AS_NAME FROM ASN where AS_NUM = " + sys.argv[1]

with db:
    cursor.execute(sql)
    row = cursor.fetchone()
    print row[0]
```

Note: results are always returned as tuples, even when a single value is selected. This is why I print only the first item of the returned tuple with [0].
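A quick illustration of this one-tuple behaviour, using the stdlib sqlite3 module (which follows the same DB-API cursor interface, so no MySQL server is needed to try it):

```python
import sqlite3

db = sqlite3.connect(":memory:")
cur = db.cursor()
cur.execute("CREATE TABLE ASN(AS_NAME TEXT)")
cur.execute("INSERT INTO ASN VALUES ('Symbolics, Inc.')")
cur.execute("SELECT AS_NAME FROM ASN")
row = cur.fetchone()
print(row)     # a 1-tuple, not a bare string
print(row[0])  # the string itself
```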

A quick run gives me what I need:

```
root@host:~# ./check.py 5
Symbolics, Inc.
root@host:~# ./check.py 6
Bull HN Information Systems Inc.
root@host:~# ./check.py 7
UK Defence Research Agency
```

## Updating

I’d like to update the QUERIED field each time I query a value, so I’ll rewrite the last script to update that field when it gets a result. First it should get the AS name and the QUERIED value; then it should increment QUERIED by 1 and print the AS name:

```
#!/usr/bin/python

import MySQLdb
import sys

query_as = sys.argv[1]

db = MySQLdb.connect("localhost", "asnuser", "password", "AS_NUM_NAME")
cursor = db.cursor()
sql = "SELECT * FROM ASN where AS_NUM = " + sys.argv[1]

with db:
    cursor.execute(sql)
    row = cursor.fetchone()
    queried = row[2]
    queried += 1
    sql = """
          UPDATE ASN
          SET QUERIED = %s
          WHERE AS_NUM = %s"""
    cursor.execute(sql, (queried, query_as))
    print row[1]
```

First, check in MySQL that the QUERIED value is 0:

```
mysql> select * from ASN where AS_NUM = 18;
+--------+-------------------------------+---------+------------+
| AS_NUM | AS_NAME                       | QUERIED | CHANGED    |
+--------+-------------------------------+---------+------------+
|     18 | University of Texas at Austin |       0 | 2015-01-06 |
+--------+-------------------------------+---------+------------+
1 row in set (0.00 sec)
```

Query that value three times:

```
root@host:~# ./check.py 18
University of Texas at Austin
root@host:~# ./check.py 18
University of Texas at Austin
root@host:~# ./check.py 18
University of Texas at Austin
```

Now check the database:

```
mysql> select * from ASN where AS_NUM = 18;
+--------+-------------------------------+---------+------------+
| AS_NUM | AS_NAME                       | QUERIED | CHANGED    |
+--------+-------------------------------+---------+------------+
|     18 | University of Texas at Austin |       3 | 2015-01-06 |
+--------+-------------------------------+---------+------------+
1 row in set (0.00 sec)
```

Very useful.

There was an issue created earlier that you may have spotted. I created a record for AS 1, and then when inserting the first 20 I created another AS 1 record. This can be shown with a query:

```
mysql> SELECT * FROM ASN WHERE AS_NUM = 1;
+--------+------------------------------+---------+------------+
| AS_NUM | AS_NAME                      | QUERIED | CHANGED    |
+--------+------------------------------+---------+------------+
|      1 | Level 3 Communications, Inc. |       0 | 2015-01-06 |
|      1 | Level 3 Communications, Inc. |       0 | 2015-01-06 |
+--------+------------------------------+---------+------------+
2 rows in set (0.00 sec)
```

The script I used to populate all 20 was fine for populating the database the first time, but it’s no good for further updates: it simply creates new records with new dates. What we want is to check the database first, then take different actions depending on what we find.

I’ll now get a list of the first 30 AS numbers, then run a script to update the database. It will need to check the following:

• Does the AS number already exist?
• If so, does the name match? (Note: I’m storing only 50 characters of the name, so only the first 50 characters are compared.)
• If the name doesn’t match, update the name and update the CHANGED field.
• If the AS number doesn’t exist at all, create a new record and insert today’s date into the CHANGED field.
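The check-then-update flow above can be sketched against an in-memory sqlite3 database before committing to the MySQL version. The `upsert` helper name and the sqlite backend are just for illustration, not what the final script uses:

```python
import sqlite3
import time

today = time.strftime("%Y-%m-%d")
db = sqlite3.connect(":memory:")
cursor = db.cursor()
cursor.execute("CREATE TABLE ASN(AS_NUM INT, AS_NAME TEXT, QUERIED INT, CHANGED TEXT)")

def upsert(as_num, as_name):
    # does the AS number already exist?
    cursor.execute("SELECT AS_NAME FROM ASN WHERE AS_NUM = ?", (as_num,))
    row = cursor.fetchone()
    if row is None:
        # brand new record, stamped with today's date
        cursor.execute("INSERT INTO ASN VALUES (?, ?, 0, ?)",
                       (as_num, as_name[:50], today))
        return "new"
    if row[0] != as_name[:50]:
        # name changed: update it and the CHANGED field
        cursor.execute("UPDATE ASN SET AS_NAME = ?, CHANGED = ? WHERE AS_NUM = ?",
                       (as_name[:50], today, as_num))
        return "changed"
    return "unchanged"

print(upsert(1, "Level 3 Communications, Inc."))  # new
print(upsert(1, "Level 3 Communications, Inc."))  # unchanged
print(upsert(1, "Level 3 Parent, LLC"))           # changed
```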

I’ll first delete all records with ASN 1 from the database then write the new script.

```
mysql> DELETE FROM ASN WHERE AS_NUM = 1;
Query OK, 2 rows affected (0.00 sec)
```

As this was getting a bit big, I moved the update and create sections into their own methods.

```
#!/usr/bin/python

import MySQLdb
import sys
import time

as_list = []
already, new, changed = 0, 0, 0
today = time.strftime("%Y-%m-%d")

with open('30_asn') as f:
    for AS in f:
        as_list.append(AS.strip())

db = MySQLdb.connect("localhost", "asnuser", "password", "AS_NUM_NAME")
cursor = db.cursor()

def create_sql(AS_NUM, AS_NAME):
    sql = '''INSERT INTO ASN(AS_NUM, AS_NAME, QUERIED, CHANGED) \
             VALUES (%s, %s, %s, %s)'''
    cursor.execute(sql, (AS_NUM, AS_NAME, 0, today))

def update_sql(AS_NUM, AS_NAME):
    sql = """
          UPDATE ASN
          SET AS_NAME = %s, CHANGED = %s
          WHERE AS_NUM = %s"""
    cursor.execute(sql, (AS_NAME, today, AS_NUM))

with db:
    for AS in as_list:
        split_as = AS.split(' ', 1)
        split_as[1] = split_as[1].lstrip()
        sql = "SELECT * FROM ASN where AS_NUM = " + str(split_as[0])
        cursor.execute(sql)
        row = cursor.fetchone()
        if row:
            if row[1] == split_as[1][:50].strip():
                already += 1
            else:
                changed += 1
                update_sql(int(split_as[0].strip()), split_as[1].strip())
        else:
            new += 1
            create_sql(int(split_as[0].strip()), split_as[1].strip())

print "New AS: " + str(new)
print "Changed: " + str(changed)
print "Unchanged: " + str(already)
```

Let’s do a quick run:

```
root@host:~# ./insert.py
New AS: 11
Changed: 0
Unchanged: 20
```

Great: AS 1 was re-added, along with the 10 new ones, and nothing that was already there was changed:

```
mysql> SELECT * FROM ASN;
+--------+---------------------------------------------------+---------+------------+
| AS_NUM | AS_NAME                                           | QUERIED | CHANGED    |
+--------+---------------------------------------------------+---------+------------+
|      1 | Level 3 Communications, Inc.                      |       0 | 2015-01-06 |
|      2 | University of Delaware                            |       2 | 2015-01-06 |
|      3 | Massachusetts Institute of Technology             |       1 | 2015-01-06 |
|      4 | University of Southern California                 |       0 | 2015-01-06 |
|      5 | Symbolics, Inc.                                   |       0 | 2015-01-06 |
|      6 | Bull HN Information Systems Inc.                  |       0 | 2015-01-06 |
|      7 | UK Defence Research Agency                        |       0 | 2015-01-06 |
|      8 | Rice University                                   |       0 | 2015-01-06 |
|      9 | Carnegie Mellon University                        |       0 | 2015-01-06 |
|     10 | CSNET Coordination and Information Center (CSNET- |       1 | 2015-01-06 |
|     11 | Harvard University                                |       0 | 2015-01-06 |
|     12 | New York University                               |       0 | 2015-01-06 |
|     13 | Headquarters, USAISC                              |       0 | 2015-01-06 |
|     14 | Columbia University                               |       0 | 2015-01-06 |
|     15 | DYNAMICS                                          |       0 | 2015-01-06 |
|     16 | Lawrence Berkeley National Laboratory             |       0 | 2015-01-06 |
|     17 | Purdue University                                 |       0 | 2015-01-06 |
|     18 | University of Texas at Austin                     |       3 | 2015-01-06 |
|     19 | Leidos, Inc.                                      |       0 | 2015-01-06 |
|     20 | University of Rochester                           |       2 | 2015-01-06 |
|     21 | The RAND Corporation                              |       0 | 2015-01-06 |
|     22 | Navy Network Information Center (NNIC)            |       0 | 2015-01-06 |
|     23 | National Aeronautics and Space Administration     |       0 | 2015-01-06 |
|     24 | National Aeronautics and Space Administration     |       0 | 2015-01-06 |
|     25 | University of California at Berkeley              |       0 | 2015-01-06 |
|     26 | Cornell University                                |       0 | 2015-01-06 |
|     27 | University of Maryland                            |       0 | 2015-01-06 |
|     28 | Deutsches Zentrum fuer Luft- und Raumfahrt        |       0 | 2015-01-06 |
|     29 | Yale University                                   |       0 | 2015-01-06 |
|     30 | SRI International                                 |       0 | 2015-01-06 |
+--------+---------------------------------------------------+---------+------------+
30 rows in set (0.00 sec)
```

## Python paths and Cron logging

I created two new Twitter accounts yesterday, and the number of followers in such a short time is great to see. Feel free to follow them here – @bgp4_table and @bgp6_table

The accounts get updated through Python, and that Python script is run via a cron job once every six hours.

I noticed that when I ran the script manually, it worked fine. When the cron job ran it, nothing happened. As a cron job produces no console output, it made me wonder what the issue was.

## Cron log

The first thing I needed to do was configure cron to log its output. I’ve done it like so in my crontab:

`0 0,6,12,18 * * * /home/scripts/tweet.py > /home/scripts/tweet.log 2>&1`

This directs both stream 1 and stream 2 output to the log file I created. In case you’re not aware, standard stream 0 is input, stream 1 is output, and stream 2 is errors. The command above ensures I’m logging both output, if any, and errors, if any.
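A quick way to convince yourself both streams land in the file (using a throwaway log path):

```shell
# one echo prints to stdout, the other to stderr; both are captured
sh -c 'echo normal output; echo an error >&2' > /tmp/demo.log 2>&1
cat /tmp/demo.log
```

Note the order matters: `2>&1` must come after the `>` redirection, otherwise stderr would still go to the terminal.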

## Python paths

On the next cron run, my log was created and it was plain to see:

```
$ less tweet.log
Traceback (most recent call last):
  File "/home/scripts/tweet.py", line 10, in <module>
    with open("v4_count") as f:
IOError: [Errno 2] No such file or directory: 'v4_count'
```

As part of my script, I save the last values in text files located in the same directory as the script. When the script runs again, it reads the last value, gets the new value, then works out a delta to display. It then writes the latest value back to the file for the next run. This is an example of one of those reads:

```
# Pull last delta
with open("v4_count") as f:
```

This ran fine when I ran the script manually, but I was always running the script from within the same folder, so Python was finding and opening the file in that folder. The cron job was calling the script from somewhere else, where v4_count did not exist.

The simple fix for this was to change all references to the full path:

```
# Pull last delta
with open("/home/scripts/v4_count") as f:
```

This time, on the next run, no more problems :)
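Hardcoding the path works, but another option (a sketch, not what my script actually uses) is to build paths relative to the script itself, so the working directory stops mattering:

```python
import os

# directory containing this script, regardless of where cron runs it from
script_dir = os.path.dirname(os.path.abspath(__file__))
count_file = os.path.join(script_dir, "v4_count")
```

Any `open(count_file)` then resolves to the same file whether the script is run manually or by cron.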

## When and when not to multithread

At the end of my last post on Python multithreading, I said my example was not the best. Let me expand some more on this.

While testing code for the previous post, I noticed that certain code was actually slower when multiple threads were running. Python threads are also not tied to separate CPUs: the interpreter’s Global Interpreter Lock means only one thread executes Python bytecode at a time. If we were talking about a bigger application in which we wanted threads running on different CPUs, you are in fact looking for multiprocessing.

Consider the following code. It simply counts to 99,999, doubles each number, then prints it to the screen. First I’ll do this as a single-threaded app, then multithreaded, and time them both.

```
#!/usr/bin/python

for i in range(100000):
    i *= 2
    print i
```

```
#!/usr/bin/python

import threading

lock = threading.Lock()

def double(i):
    i *= 2
    with lock:
        print i

for i in range(100000):
    t = threading.Thread(target=double, args=(i,))
    t.start()
```

I’ll now time both versions. I’ll run each three times and take the average:

`time ./single.py`

The single thread is able to do this in 0.411 seconds, while the multithreaded app takes a full 16.409 seconds.

Now I’ll do a test where multithreading makes a big difference. I have a list of 500 random URLs. I want to fetch each one and display the page contents. Not all URLs respond, so I’ve given each fetch a three-second timeout.

The single thread app is coded like so:

```
#!/usr/bin/python

import urllib2

with open("urls.txt", "r") as f:
    urls = f.read().splitlines()

for url in urls:
    request = urllib2.Request(url)
    try:
        response = urllib2.urlopen(request, timeout = 3)
        page = response.read()
        print page
    except:
        pass
```
This takes a full 11 minutes and 40 seconds to run. Now the multithreaded version:

```
#!/usr/bin/python

import threading
import urllib2

lock = threading.Lock()

def fetch(url):
    try:
        response = urllib2.urlopen(url, timeout = 3)
        page = response.read()
        with lock:
            print page
    except:
        with lock:
            print "could not fetch " + url

with open("urls.txt", "r") as f:
    urls = f.read().splitlines()

for url in urls:
    t = threading.Thread(target=fetch, args=(url,))
    t.start()
```

This time the average over 3 runs is only 1 minute and 40 seconds.
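The speedup comes purely from overlapping the waiting. The effect is easy to reproduce without any network at all by standing in `time.sleep` for the slow fetch (a Python 3 sketch so it runs standalone):

```python
import threading
import time

def fake_fetch(results, i):
    time.sleep(0.2)              # stand-in for a slow network call
    results[i] = "page %d" % i

results = {}
threads = []
start = time.time()
for i in range(20):
    t = threading.Thread(target=fake_fetch, args=(results, i))
    t.start()
    threads.append(t)
for t in threads:
    t.join()
elapsed = time.time() - start

# serially this would take 20 * 0.2 = 4 seconds; threaded, the
# sleeps overlap and the whole run finishes in well under that
print("fetched %d pages in %.2f seconds" % (len(results), elapsed))
```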

I am however still locking output to the screen. This may be bad practice, but let’s assume I don’t really care about visible output. Maybe I just want to throw some commands somewhere, or something simple like ping. If I didn’t lock before printing, how quickly could this actually run?

```
#!/usr/bin/python

import threading
import urllib2

def fetch(url):
    try:
        response = urllib2.urlopen(url, timeout = 3)
    except:
        pass

with open("urls.txt", "r") as f:
    urls = f.read().splitlines()

for url in urls:
    t = threading.Thread(target=fetch, args=(url,))
    t.start()
```

This completes in 1 minute and 13 seconds. Not as much of an improvement as I hoped for, but it does tell us one thing: Python is not running ALL the threads at exactly the same time. If it were, the maximum run time would be just over three seconds, as that’s the timeout.

I’ll load up Wireshark and run the test again, to see how many threads are sending HTTP GETs at the same time. When I start the threads, I can see 26 threads all starting within a second of each other. Only a full 7 seconds later do others start:

After that, I see more threads being added as others end. The timing seems random later as each page has a different response time.

This seems to be an OS imposed limit.

## Conclusions

• Multithreading in Python has benefits only in specific cases.
• It mainly helps when you are requesting data from many different sources; there’s no need to query them one at a time.
• Python’s initial single thread is rather efficient all by itself.

I’m certainly not going to rewrite all my current code to use multithreading. Going back to my original OSPF checker, it certainly would be good to pull information off multiple devices at the same time, but the rest of the app I’d still keep as a single thread.

The first ‘proper’ Python app I made logged onto a list of devices and pulled out OSPF state. This worked perfectly fine. The app correctly works out whether it can log into a device or not, and waits a few seconds to ensure a device actually responds.

The issue is that if I have a list of, say, 1000 devices and 500 of them don’t respond, the wait time rapidly adds up as the app looks at each one in turn. Would it not be better for the app to log into multiple devices in parallel? This would drastically reduce the runtime.

Consider the following code:

```
#!/usr/bin/python

import threading

def thread_test():
    print "I am a thread"
    return

for i in range(4):
    t = threading.Thread(target=thread_test)
    t.start()
```

A function called thread_test is defined. I then spawn four threads, each of which runs the function. I should therefore see four lines printed:

```
$ ./thread.py
I am a thread
I am a thread
I am a thread
I am a thread
```
Of course getting them all to do exactly the same thing is a bit boring. I may have a list of items I want to print. Let’s pass each item as an argument and print them out:

```
#!/usr/bin/python

import threading

list_of_items = ["cat", "banana", "house", "phone"]

def print_item(item):
    print "I am a " + item
    return

for word in list_of_items:
    t = threading.Thread(target=print_item, args=(word,))
    t.start()
```

```
$ ./thread.py
I am a cat
I am a banana
I am a house
I am a phone
```

If you’ve run this code, you may notice that sometimes your output gets a bit garbled:

```
$ ./thread.py
I am a catI am a banana

I am a house
I am a phone

I am a cat
I am a banana
I am a house
I am a phone
```

All four threads are trying to write to the screen at the same time. If outputting to the screen, or writing to a file, this output can look rather messy. Especially as the device and thread count goes up.

## Locks

I can use locks to prevent this. Each thread can go about its business, but when one needs to write to the screen or to a file, I ensure only a single thread can do so at a time. As an example I’ll iterate through a list of 100 numbers. All the threads will create their data in memory at pretty much the same time, but only one at a time will be allowed to print and write to the file. I’ll also ensure that the application closes the file only after all threads are completed.

```
#!/usr/bin/python

import threading

lock = threading.Lock()

def print_number(num):
    phrase = "I am number " + str(num)
    lock.acquire()
    print phrase
    f.write(phrase + "\n")
    lock.release()

f = open("text.txt", 'w')
for i in range(100):
    t = threading.Thread(target=print_number, args=(i,))
    t.start()

while threading.active_count() > 1:
    pass
else:
    f.close()
```

Any code between acquiring and releasing the lock can only be run by one thread at a time. The example above isn’t the greatest, but consider that getting data from a remote device and waiting on it can take a few seconds. If all that waiting can happen at the same time, with results written one at a time to a file, it speeds things up immensely.

## Update – 15/12/14

Ben Cardy mentioned in the comments a great shortcut that most readers might miss if not reading all the comments, so I’ll put it up here. My code above acquires and releases a lock when needed. There is a simpler way: if you write with lock:, any code indented under it is essentially wrapped in the acquire and release calls. This is nice as you don’t have to remember to release the lock, and as a bonus the with block releases the lock even if the thread throws an exception.

The last code above could be rewritten like so:

```
#!/usr/bin/python

phrase = "I am number " + str(num)
with lock:
    print phrase
    f.write(phrase + "\n")
```