New Year, New Direction

Published: Jan 2, 2015 by Gavin

When I first envisioned this website, I had intended for it to be a place for me to write about interesting bits of chemistry and other science that caught my attention. As you can see from my publishing record during 2014, that plan hasn’t been going very well. I have had plenty of ideas for things to write about, but there are always other things that seem like more of a priority. Today, I will tell you about some of those other things and my plan to (hopefully) bring more frequent content to Quantum Geranium.

The latter half of 2014 was a time of big changes for me. I ended more than a year of active participation in the union representing postdocs at my university, my position as a postdoctoral fellow in computational chemistry came to an end, and most significantly, I chose to pursue an entirely new career path. The start of 2015 sees me unemployed and moving my professional life in a new direction. After ten years in post-secondary schooling preparing for a life as a chemist, I have decided to change course and seek work as a data scientist. My reasons for doing this could be the subject of their own post, one which I have been attempting to write in a coherent and concise manner since Halloween.

“What is data science?” I hear you ask. I have not found one single answer to that, as it is a broad, currently evolving field. The most all-encompassing definition I can come up with is a field that uses computing tools to analyze large data sets. This can include everything from the Blizzard employees who sifted through user data from the 100 million World of Warcraft accounts that have been created to the teams of researchers who competed in the Netflix Prize to improve on Netflix’s own algorithm for predicting user ratings of movies. As storage has become cheaper, more organizations are collecting and storing large sets of data on everything from consumer behaviour to patient outcomes to minute-by-minute energy usage. It is easier to collect this information than to do anything useful with it, and I would like to be one of the people attempting the latter.

As I apply for jobs as a data scientist in the coming year, there are a few things I am planning to do to hone my skills and improve my analytical capabilities.

1. Learn more modern programming languages

As a computational scientist, I wrote many thousands of lines of code to implement simulation methods for my research, do analysis on the results of that research, and manage the (sometimes hundreds or thousands) individual calculations I had running on various high performance computing resources. To do these tasks, I primarily used Python 2.7, Bash, and Fortran 95.

Over the last six months, I have transitioned almost completely to using Python 3, which has been mostly painless. I would like to start doing more object-oriented programming in Python, as it has been more than six years since I did anything using an OOP paradigm, back when I was taking undergraduate computer science courses using Java.

Last summer, in preparation for attending a workshop on parallel programming for high performance computing, I began working through Stroustrup’s “Programming: Principles and Practice Using C++” and I would like to continue learning C++ to have a compiled language more modern than Fortran 95 in my toolbox. I have also looked at Google’s Go language as another useful addition to my programming toolbox. R is being used as the teaching language in a series of online courses I began in November, so I have been learning that as I work through the course material.

To provide a purpose as I learn new languages and techniques, I’m going to work through the challenges posted to the Daily Programmer subreddit, probably using C++ or Go to solve the easy/intermediate ones at first and Python for the harder ones. I’ll be posting my solutions to them in a GitHub repository and potentially writing about some of the more interesting ones here. For example, for the Halloween challenge, I wrote a fun program that simulates a map of zombies, hunters, and humans. Right now it only prints an ASCII map, but if I update it to include prettier graphics, I’ll probably post a writeup of the project. If anyone knows of other communities where regular programming challenges are posted, I’d be interested in hearing about them as well.

2. Learn more data analysis techniques

As mentioned above, I began a series of online courses about data science on Coursera in November as part of their new specialization program. The specialization is made up of nine four-week courses and a final capstone project. I have already completed the first two courses in the series and am registered for two more in January. These have been useful for understanding the language and concepts of data science and introducing me to the R statistical programming language.

3. Work with real-world data

Alongside the coursework, I will be working on some independent data mining and analysis projects. These will allow me to get some more hands-on experience with real-world data and familiarize myself with data collection techniques I never encountered working as a chemist. As these projects progress, I will be providing updates on them here and writing up complete analyses once I have them.