Data Science Machine and Command Line Setup

Data Scientists require a very particular toolset for their everyday tasks, but unlike software developers, few of them spend a lot of time optimising this toolset for their specific needs. I compiled a simple step-by-step guide that helps to automate the process of setting up a brand new data science machine and making it work for you by customising the command prompt and using a dotfile approach to manage configuration, identity, and access information. This gets you from zero to Data Science in minutes on macOS.

I’ve had to set up new data science laptops twice in the last couple of months and got frustrated with the tedious setup procedures. Installing libraries, customising settings, how do I switch RStudio to night mode again? Moreover, I have two new starters joining my team in the coming weeks which means that more system setups are just around the corner. So I decided to compile a guide with scripts and commands that make this process smoother and faster.

There are many things to be said for an automatic setup over manual installation: speed, reproducibility, a standardised configuration between all team members, and the opportunity for programmatic customisation. Among software developers this approach, called .dotfile configuration, is common practice and great introductions are available here and here. However, so far I have only rarely encountered it on data science teams. This is despite the fact that data scientists frequently work with complex statements at the command line, have to pay particular attention to system setup to ensure reproducibility of their experiments, use version control, and commonly deal with data from a wide range of sources, many of which will require API tokens or access credentials. So think of this as a data science specific dotfile setup. There are three main components to this approach:

  1. use command line tools and package managers instead of graphical installers to automate first-time system setup, because this is faster, more reproducible, and more easily maintainable.
  2. set up a beautiful, efficient, and powerful command line configuration, because it will make everyday tasks easier, because it’s awesome and because we can!
  3. create a .dotfile repository that saves settings, application preferences, API keys, and access tokens, because it is more convenient and more secure than glueing post-its to our monitor or hard-coding passwords and tokens into code that is then pushed to GitHub.

Most parts of this article can be used in isolation, so unlike the British Prime Minister you are free to “cherry-pick” if you are so inclined.

I am assuming here that you’re using macOS. Parts of it may be transferable to a Linux machine, but much of it will need modification. If you’re on Windows… good luck! It may work with the new Ubuntu for Windows? If you get a chance to test this, please let me know in the comment section below.

Initial Setup

We start off by installing Homebrew, the “missing package manager for macOS”! This bit actually requires some user input, so we will split it from the rest of the basic installations.

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
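
If you expect to run these steps more than once, for example while refining the script for new team members, it can help to skip the interactive installer when Homebrew is already present. The following is a minimal sketch of such a guard; the helper name `have` is made up for this illustration and is not part of Homebrew.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given command is available on the PATH.
have() {
  command -v "$1" >/dev/null 2>&1
}

# Only point at the interactive installer when brew is missing.
if have brew; then
  echo "Homebrew already installed, skipping installer"
else
  echo "Homebrew not found: run the installer command above first"
fi
```

The same check can guard any later step that depends on a tool installed earlier in the script.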

Once that is done we can use Homebrew for some additional household essentials.

# we will need these later
brew install wget htop git git-lfs libgit2 keychain

# I like these, so I'll install them here as well
brew cask install google-chrome atom slack vlc spotify dropbox
# You can launch and configure apps like this
open ~/Applications/Dropbox.app

# install gcc and java,
# a lot of the data science tools we will install later depend on them
# (some of these may require your password again)
brew install gcc

brew tap caskroom/versions

brew cask install java
brew cask install java8
brew install jenv

Powerlevel9k Command Line

Now it’s time to beef up our command line. This is something that many software developers and engineers spend a lot of time on, to the point where some are holding competitions to show off their great shells. Many Data Scientists, on the other hand, seem to neglect command line customisation. I think that this is a mistake. Let me convince you by highlighting some of the neat extra features that we can add with a little bit of extra setup effort:

  • beautiful command prompt
  • syntax highlighting
  • auto completion
  • read/write flags
  • execution timing
  • git support with repo status tracking

example powerlevel9k prompt

The most nerdy set of productivity tools on the block! To make this work we will need iTerm2 and zsh. iTerm2 is a macOS terminal replacement with many additional features, such as more display customisation, better hotkeys, and fantastic split pane functionality. Zsh is a shell designed for interactive use. It works particularly well with oh-my-zsh, a configuration tool that helps with setting up everything just the way we like it. They have great stickers, too ;)

While we’re at it we will also install the Powerline terminal fonts, which will be needed for powerlevel9k, the zsh theme of my choosing.

# install iTerm2
brew cask install iterm2

# install zsh
brew install zsh

# get oh-my-zsh configuration tool
sh -c "$(curl -fsSL https://raw.github.com/robbyrussell/oh-my-zsh/master/tools/install.sh)"
# (this may require your password again)

# get powerlevel9k theme for zsh
git clone https://github.com/bhilburn/powerlevel9k.git ~/.oh-my-zsh/custom/themes/powerlevel9k
# and the corresponding font
# (URL is quoted so that zsh does not treat the ? as a glob character)
wget -O /Library/Fonts/font_sourcecodepro_powerline_awesomeregular.ttf "https://github.com/Falkor/dotfiles/blob/master/fonts/SourceCodePro+Powerline+Awesome+Regular.ttf?raw=true"

The oh-my-zsh installation script changes your default shell to zsh and creates the file .zshrc. Just like .bash_profile for bash, this file is automatically sourced when a new zsh session is launched. From now on you should always use .zshrc instead of .bash_profile, for example when setting a new standard conda environment. Notice that .zshrc comes with a lot of options that are commented out. Feel free to go through the file and uncomment the modifications that may be of interest to you.

You should also add iTerm2 to the dock bar and/or assign a hot key of your choosing. Change the colour scheme (Menu bar > Profiles > Open Profiles... > Select "Default" > Edit Profiles...) as you see fit. Definitely change the font to SourceCodePro+Powerline+Awesome Regular. This last step is important as POWERLEVEL9K WON’T WORK PROPERLY WITHOUT THIS and you will end up with cryptic symbols on your prompt instead.

If you don’t have strong feelings about colour style preference, feel free to use my profile template. You can install it as a dynamic profile with the command below. DynamicProfiles enable you to share your preferences between different machines. You can create your own by exporting your profile from the profile menu to a JSON file and copying it to the same location:

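# copy the profile settings for iTerm2 to DynamicProfiles folder
# (-O needs a file name, not a folder; ?raw=true fetches the JSON itself rather than the GitHub page)
wget -O ~/Library/Application\ Support/iTerm2/DynamicProfiles/iterm_profile.json "https://github.com/JanLauGe/.dotfiles/blob/master/iterm_profile.json?raw=true"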

There is a wide range of plugins available for iTerm2 and zsh. I automatically add a few that I find useful by installing them with Homebrew. Afterwards I add them to the .zshrc configuration file with sed or by appending (>>) a string to the end of the file.

In case you’re unfamiliar with these commands: sed looks for a string in a file using a regular expression and replaces the found string with a replacement string. The in-place flag -i '' is Mac specific (the empty string means that no backup copy is kept) and tells sed to overwrite the old file with the new updated version. The >> operator appends to a file or creates the file if it doesn’t exist.
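
To see both commands in action without touching your real configuration, here is a small sketch that works on a throwaway file. The file name and its single line of content are made up for the demonstration. It uses `-i.bak` instead of `-i ''` because that form is accepted by both the macOS (BSD) and the GNU version of sed; the backup file is removed afterwards.

```shell
#!/bin/sh
# Create a throwaway stand-in for ~/.zshrc
demo_rc="$(mktemp "${TMPDIR:-/tmp}/zshrc-demo.XXXXXX")"
printf 'ZSH_THEME="robbyrussell"\n' > "$demo_rc"

# Replace the theme line in place; \/ is a literal slash inside s/.../.../
sed -i.bak 's/ZSH_THEME="robbyrussell"/ZSH_THEME="powerlevel9k\/powerlevel9k"/' "$demo_rc"
rm -f "$demo_rc.bak"

# Append a new line to the end of the file
# (single quotes: the command substitution is written literally, not run now)
echo 'export DEFAULT_USER="$(whoami)"' >> "$demo_rc"

cat "$demo_rc"
```

The file should now contain two lines: the updated theme setting, followed by the appended export statement.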

Side note: Alternatively, we could just copy a pre-existing .zshrc but I felt that adding lines using sed keeps things more transparent and allows for more of a mix-and-match approach where you can choose the bits you like and leave out the ones that are not useful to you.

# change zsh theme to powerlevel9k
# (double quotes around awesome-patched so that the single-quoted sed script is not interrupted)
sed -i '' 's/ZSH_THEME="robbyrussell"/POWERLEVEL9K_MODE="awesome-patched"\
ZSH_THEME="powerlevel9k\/powerlevel9k"/g' ~/.zshrc

# zsh-autosuggestions (for Oh My Zsh) suggests commands from your
# terminal history as you type. Just press → to accept a suggestion!
# Note: $ZSH_CUSTOM/plugins path is by default ~/.oh-my-zsh/custom/plugins
brew install zsh-autosuggestions zsh-syntax-highlighting

# Add the plugins to the list of plugins in ~/.zshrc configuration file :
sed -i '' '/^plugins=(/  a\
 \ \ zsh-autosuggestions \
 \ \ web-search \
 \ \ jsontools \
 \ \ macports \
 \ \ node \
 \ \ osx \
 \ \ sudo \
 \ \ thor \
 \ \ docker \
' ~/.zshrc

# set default user in .zshrc to avoid the nasty username@machine prompt
echo 'export DEFAULT_USER="$(whoami)"' >> ~/.zshrc
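
One caveat with appending via >>: every time the setup script is re-run, the same line is added to the file again. A possible way around this is a small helper that appends a line only if it is not already there. This is a sketch; `append_once` is a name invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: append a line to a file unless that exact line exists.
append_once() {
  line="$1"
  file="$2"
  touch "$file"
  # -q: quiet, -x: match whole lines only, -F: fixed string (no regex)
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

It could then be used as `append_once 'export DEFAULT_USER="$(whoami)"' ~/.zshrc` in place of the echo command, and calling it twice leaves only one copy of the line in the file.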

Data Science Essentials

Data science at the command line is great, but I doubt it will be enough to do all of your day-to-day tasks. We need R & Python, and while the GUI installers for RStudio and Anaconda make the installation child’s play, it would be nice to have it as part of this initial setup script as well. Moreover, I find myself accumulating eclectic collections of packages and libraries. Instead of reinstalling all of these manually I have included them here as well:

#### install anaconda
# May need updating for conda version
wget -O anaconda.sh https://repo.anaconda.com/archive/Anaconda3-5.3.0-MacOSX-x86_64.sh
bash anaconda.sh
rm anaconda.sh
# append conda path to the zsh profile
# ($HOME instead of ~, because ~ is not expanded inside double quotes)
echo 'export PATH="$HOME/anaconda3/bin:$PATH"' >> ~/.zshrc
# reload profile
source ~/.zshrc

# create new anaconda virtual environments
conda update conda
conda config --add channels conda-forge
conda create --name dev2 python=2.7
conda create --name dev3 python=3.6
# and switch to it to avoid using the system python
source activate dev3
# do this every time we start a new session
# assuming you want to use python3 by default
echo 'source activate dev3' >> ~/.zshrc
# Install a few libraries that do not ship with anaconda
pip install awscli tensorflow tensorflow-gpu keras


#### install R and RStudio
# this is required for some advanced plotting
brew cask install xquartz # (will need password again)
brew install --with-x11 r
brew cask install --appdir=/Applications rstudio
# Note the --appdir option which will use /Applications instead of ~/Applications


# set up rJava; this can be a pain!
# I used these instructions: https://zhiyzuo.github.io/installation-rJava/
# consult google if you get stuck here

# set java environmental variables for the profile
echo 'export PATH="$HOME/.jenv/bin:$PATH"' >> ~/.zshrc
# (you may need to update version number here)
echo 'export JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk1.8.0_181.jdk/Contents/Home"' >> ~/.zshrc
echo 'eval "$(jenv init -)"' >> ~/.zshrc
source ~/.zshrc
# make sure to set this to the version that you installed (`java -version`)
jenv add /Library/Java/JavaVirtualMachines/jdk1.8.0_181.jdk/Contents/Home
jenv global oracle64-1.8.0_181
# prepare installation and install rJava by building from source
R CMD javareconf
Rscript -e "install.packages('rJava',\
  repos='http://cran.us.r-project.org',\
  type='source')"

# install R packages
Rscript -e "install.packages(c(\
  'cluster','crayon','crosstalk','curl','CVST','data.table','DBI',\
  'devtools','doMC','dtplyr','foreach','foreign','ggplot2','ggthemes','glmnet',\
  'haven','here','htmltools','htmlwidgets','httr','igraph','jsonlite','knitr',\
  'labeling','lattice','lazyeval','leaflet','lubridate','magrittr','markdown',\
  'mime','praise','psych','purrr','raster','RColorBrewer','Rcpp','readr',\
  'rmarkdown','rpart','rvest','scales','shiny','stringr','survival','testthat',\
  'units','viridis','xml2','aws.s3','checkmate','feather','future',\
  'gapminder','keras','lintr','plotly','plotROC','prettyunits','pROC','progress',\
  'randomForest','ranger','reticulate','rJava','RJDBC','RJSONIO','RODBC',\
  'roxygen2','RPostgreSQL','Rtsne','slackr','sf','stringdist','tensorflow',\
  'text2vec','vegan','xgboost','XML','tidyverse'),\
  repos='http://cran.us.r-project.org')"
# This library for snowflake is only available on github
Rscript -e "library(devtools); install_github('snowflakedb/dplyr-snowflakedb')"
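
The JDK path in the Java step above is hard-coded to one specific version, which is why it may need updating. On macOS the system helper /usr/libexec/java_home can look up the path for you. The following is a sketch under that assumption; the function name `detect_java_home` is made up here, and depending on your macOS version the helper may fall back to a different JDK, so check the printed path before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: print the home directory of an installed JDK matching
# the requested version. Returns non-zero if the macOS helper is not
# available or reports an error.
detect_java_home() {
  [ -x /usr/libexec/java_home ] || return 1
  /usr/libexec/java_home -v "$1" 2>/dev/null
}

if java_home="$(detect_java_home 1.8)"; then
  # This is the line you would append to ~/.zshrc
  echo "export JAVA_HOME=\"$java_home\""
else
  echo "No JDK found automatically; set JAVA_HOME by hand" >&2
fi
```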

Consider adding /bin/zsh to your RStudio global options under Global Options... > Terminal > Custom shell binary to keep your RStudio Terminal sessions in tune with the custom terminal we set up here.

Settings and Access

So now we are done with the basic setup on our local machine. However, there are still SSH keys, API access tokens, and config files to configure. This can take a lot of time and energy, and having different tokens on different machines can be confusing or even unsafe (I have seen far too many people hard-code their AWS credentials into their notebooks!).

I’ve therefore gone for an approach of creating a folder with all the files for identity management and protecting it with a single strong master password. For obvious reasons I will not go into too much detail on my exact approach to this, but let’s just say that we have synced all our identity files to a local folder called .dotfiles. From there we can sync them into our home directory, as succinctly explained by Ajmal Siddiqui in this post.

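# -a copies recursively and keeps permissions;
# the trailing slash copies the folder's contents rather than the folder itself
rsync -a ~/.dotfiles/ ~/

and since we want to do that whenever we start a new terminal session:

echo 'rsync -a ~/.dotfiles/ ~/' >> ~/.zshrc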

This will synchronise all files in the .dotfiles folder to the home directory where they are available to the various applications or our custom scripts that may use them. Files that I now use this for include:

  • .ssh - SSH keys for GitHub, AWS, etc.
  • .aws - AWS credentials needed for the AWS CLI
  • .gitconfig - To track my contributions to version controlled code bases
  • .kaggle.json - Access token to use the new Kaggle API
  • .google - Access token for the Google Maps SDK that I used here
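
Since the sync runs on every new terminal session, it may be worth making it tolerant of a missing folder, for example on a machine where the identity files have not been unlocked yet. Here is a minimal sketch of such a wrapper. The function name `sync_dotfiles` and the fallback to cp are assumptions made for this example, not part of my actual setup.

```shell
#!/bin/sh
# Hypothetical helper: copy the contents of a dotfiles folder into a target
# directory. Does nothing (and succeeds) if the source folder does not exist.
sync_dotfiles() {
  src="$1"
  dest="$2"
  [ -d "$src" ] || return 0
  if command -v rsync >/dev/null 2>&1; then
    # trailing slash on the source: copy its contents, not the folder itself
    rsync -a "$src"/ "$dest"/
  else
    # fallback (assumption): plain recursive copy, hidden files included
    cp -R "$src"/. "$dest"/
  fi
}
```

In .zshrc this would be called as `sync_dotfiles "$HOME/.dotfiles" "$HOME"`.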

So that’s all! As always, I hope it is useful for someone. Please let me know any thoughts you may have in the comments below. Also, follow me on Twitter, connect with me on LinkedIn, and feel free to email me.