File paths and data science projects

Some thoughts on keeping a data science project tidy.

Author

Philip Khor

Published

June 13, 2019

Large data science projects can be a pain to manage. Cookiecutter Data Science recommends the following project folder structure, and I think it’s a good picture of how a data science project should be organized:

├── README.md          
├── data
│   ├── external       
│   ├── interim        
│   ├── processed      
│   └── raw            
├── docs               
├── models            
├── notebooks          
├── references         
├── reports           
│   └── figures        
├── requirements.txt   
├── setup.py           
├── src                
│   ├── __init__.py    
│   ├── data           
│   │   └── make_dataset.py
│   ├── features       
│   │   └── build_features.py
│   ├── models         
│   │   ├── predict_model.py
│   │   └── train_model.py
│   └── visualization  
│       └── visualize.py

Specifically, file paths can be hard to manage, since it’s not great practice to use absolute paths in your code. In R, I’ve gotten used to using the here package, which automatically detects the root folder of your project, so that to import a csv file from the external folder, all I have to do is

library(readr)
read_csv(here("data", "external", "my_csv_file.csv"))

so long as a .here or .Rproj file is present in the project directory.

Previously, I used .. a lot, which references the directory above the current directory. However, if you move your files around, you’ll have to change your file paths too.

While looking for a Python alternative, I found a package rootpath which essentially does the same with the rootpath.detect() function.

import rootpath
import os 

ROOT_DIR = rootpath.detect()
os.chdir(ROOT_DIR)
file_path = "data/interim/..."