Purpose: Introduce participants to the Unix command line to increase efficiency and reproducibility.
Background: Scientific research, and especially scientific synthesis, requires extensive computing, which typically involves creating hundreds or thousands of data files, analytical processes, and many products such as graphs, model outputs, maps, images, and more. Managing this complexity is difficult, and researchers can easily lose track of just what they did, when, and why. Thus, traditional approaches to managing the research process often fail the reproducibility test, as even the original investigators can’t repeat their initial process.
- From PhDComics
Thus, the command line. The command-line provides two major advantages to the researcher:
These advantages come at the cost of:
Command line interpreters come in many flavors. We will focus on Unix shell syntax, but even within Unix there is tremendous diversity. Learning all of the nuances of one shell can take years, and you will still find many variants on the servers you encounter.
In this lesson, you will learn the power of the command line. But, you will undoubtedly also internalize that computers are exceedingly literal. You type, they do. Or, more commonly, you type, they give an error. My experience has been that I am far more productive when my attitude towards the computer is:
jones@powder:~$ ssh jones@aurora.nceas.ucsb.edu
which should give you a terminal window showing aurora as the host name:
In the following code examples, you need to type the command, but not include the command prompt (e.g., jones@aurora:~$) which just shows that the computer is ready to accept a command.
We’ll start by:
- making directories with mkdir (make directory)
- writing a line of text to a file with echo
- displaying the file with cat (concatenate)
jones@aurora:~$ mkdir oss
jones@aurora:~$ mkdir oss/data
jones@aurora:~$ echo "# Tutorial files related to OSS" > oss/README.md
jones@aurora:~$ cat oss/README.md
# Tutorial files related to OSS
jones@aurora:~$
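The > operator used above overwrites its target file each time it runs. The following is a minimal sketch, assuming a scratch directory created with mktemp (the directory and the second line of text are illustrative and not part of the lesson), of how >> appends instead and how mkdir -p creates nested directories in one step:

```shell
# Work in a throwaway directory so nothing in your home directory is touched.
workdir=$(mktemp -d)
cd "$workdir"

# -p creates oss and oss/data together and does not fail if they already exist.
mkdir -p oss/data

# > creates (or overwrites) the file; >> appends to it.
echo "# Tutorial files related to OSS" > oss/README.md
echo "Second line (illustrative)" >> oss/README.md

# Show the result: both lines should be present.
cat oss/README.md
```

Running the first echo line again would reset the file to a single line, which is why >> is the safer choice when adding to an existing file.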
Next, let’s copy another file and look around in the directories:
- cp (copy)
- cd (change directory)
- ls (list)
- pwd (print working directory)
- tree
jones@aurora:~$ cp /tmp/plotobs.csv oss/data
jones@aurora:~$ cd oss
jones@aurora:~/oss$ ls
data README.md
jones@aurora:~/oss$ pwd
/home/jones/oss
jones@aurora:~/oss$ tree
.
├── data
│ └── plotobs.csv
└── README.md
1 directory, 2 files
jones@aurora:~/oss$
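Now, let’s create two more directories containing a few files so that we can demonstrate renaming and removing files and directories.
- Create two directories, sites and lakes, using mkdir
- Create a few files in them using echo
- View the directory structure using tree
- Rename a file using mv (move)
- Change into the sites directory using cd and check it with pwd
- Remove two files using rm (remove) and list files with ls
- Go back up to the parent directory and check it with pwd
- Try to remove the sites directory using rmdir, which will produce an error because the directory contains files (rmdir can only remove empty directories)
- Use rm -r sites to recursively remove the directory and its files
jones@aurora:~/oss$ mkdir sites lakes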
jones@aurora:~/oss$ echo "Site 1 Info" > sites/site1.txt
jones@aurora:~/oss$ echo "Site 2 Info" > sites/site2.txt
jones@aurora:~/oss$ echo "Site 3 Info" > sites/site3.txt
jones@aurora:~/oss$ echo "Lake Mary" > lakes/LakeMary.md
jones@aurora:~/oss$ echo "Lake Hunter" > lakes/HunterLake.md
jones@aurora:~/oss$ tree .
.
├── data
│ └── plotobs.csv
├── lakes
│ ├── HunterLake.md
│ └── LakeMary.md
├── README.md
└── sites
├── site1.txt
├── site2.txt
└── site3.txt
3 directories, 7 files
jones@aurora:~/oss$ mv lakes/HunterLake.md lakes/LakeHunter.md
jones@aurora:~/oss$ ls lakes
LakeHunter.md LakeMary.md
jones@aurora:~/oss$ cd sites
jones@aurora:~/oss/sites$ pwd
/home/jones/oss/sites
jones@aurora:~/oss/sites$ rm site2.txt site3.txt
jones@aurora:~/oss/sites$ ls
site1.txt
jones@aurora:~/oss/sites$ cd ..
jones@aurora:~/oss$ pwd
/home/jones/oss
jones@aurora:~/oss$ rmdir sites
rmdir: failed to remove 'sites': Directory not empty
jones@aurora:~/oss$ rm -r sites
jones@aurora:~/oss$ ls
data lakes README.md
jones@aurora:~/oss$
Note the use of the single dot . and the double dot .. symbols in these commands. A single dot . represents the current directory, and a double dot .. represents the parent directory. One can also use the tilde symbol ~ to represent your home directory, which is the main directory where your files will reside.
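As a small illustration of these symbols, the sketch below builds a scratch copy of part of the lesson's directory layout (the mktemp location is an assumption made so the example can run anywhere) and resolves . and .. from inside the sites directory:

```shell
# Scratch area standing in for the lesson's oss directory (illustrative).
workdir=$(mktemp -d)
mkdir -p "$workdir/oss/sites" "$workdir/oss/lakes"
touch "$workdir/oss/lakes/LakeMary.md"
cd "$workdir/oss/sites"

here=$(pwd)                 # the current directory, which . refers to
parent=$(cd .. && pwd)      # .. is the parent directory (oss)
sibling=$(ls ../lakes)      # go up one level, then down into lakes

echo "current: $here"
echo "parent:  $parent"
echo "in ../lakes: $sibling"

# The shell expands ~ to your home directory before echo sees it.
echo ~
```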
- cat: print file(s)
- head: print first few lines of file(s)
- tail: print last few lines of file(s)
- grep: search for matching lines of file(s)
- less: “pager”, view file interactively
jones@aurora:~/oss$ head data/plotobs.csv
obsid,siteid,plot,date_sampled,sciname ,diameter,condition
1,1,A,6/13/11,Abies lasiocarpa,31.84,normal
2,1,A,6/13/11,Picea engelmannii,3.21,dry
3,1,A,6/13/11,Picea engelmannii,7.2,dry
4,1,A,6/13/11,Picea engelmannii,11.62,dry
5,1,A,6/13/11,Picea engelmannii,11.25,dry
6,1,A,6/13/11,Picea engelmannii,13.16,normal
7,1,A,6/13/11,Picea engelmannii,18.6,normal
8,1,A,6/13/11,Picea engelmannii,23.62,dry
9,1,A,6/13/11,Picea engelmannii,31.75,normal
jones@aurora:~/oss$ tail data/plotobs.csv
3287,32,B,6/10/12,Pseudotsuga menziesii,4.38,normal
3288,32,B,6/10/12,Pseudotsuga menziesii,3.09,dry
3289,32,B,6/10/12,Jamesia americana,7.98,dry
3290,32,B,6/10/12,Abies lasiocarpa,10.85,normal
3291,32,B,6/10/12,Abies lasiocarpa,13.55,dry
3292,32,B,6/10/12,Abies lasiocarpa,17.26,normal
3293,32,B,6/10/12,Abies lasiocarpa,21.65,dry
3294,32,B,6/10/12,Abies lasiocarpa,17.8,dry
3295,32,B,6/10/12,Abies lasiocarpa,23.4,normal
3296,32,B,6/10/12,Abies lasiocarpa,25.79,normal
jones@aurora:~/oss$ grep Sambucus data/plotobs.csv
13,1,A,6/13/11,Sambucus racemosa,3.83,dry
44,1,A,6/10/12,Sambucus racemosa,17.04,dry
75,1,B,6/13/11,Sambucus racemosa,3.39,dry
91,1,B,6/10/12,Sambucus racemosa,19.53,dry
116,2,A,6/13/11,Sambucus racemosa,1.4,dry
147,2,A,6/10/12,Sambucus racemosa,20.19,dry
...
jones@aurora:~/oss$
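These commands accept options that control how much they print. The sketch below uses a five-line sample built from rows shown above (the file name sample.csv and its scratch location are assumptions for illustration) to try head -n, tail -n, and two common grep options:

```shell
# Build a small sample file from rows that appear in the listings above.
workdir=$(mktemp -d)
sample="$workdir/sample.csv"
cat > "$sample" <<'EOF'
obsid,siteid,plot,date_sampled,sciname,diameter,condition
1,1,A,6/13/11,Abies lasiocarpa,31.84,normal
2,1,A,6/13/11,Picea engelmannii,3.21,dry
3,1,A,6/13/11,Picea engelmannii,7.2,dry
13,1,A,6/13/11,Sambucus racemosa,3.83,dry
EOF

head -n 2 "$sample"            # header plus the first data row
tail -n 1 "$sample"            # last row only
grep -c "Picea" "$sample"      # count the matching lines instead of printing them
grep -i "sambucus" "$sample"   # -i ignores upper/lower case when matching
```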
Working on synthesis projects means collaborating, which is easy on Unix because multiple people can use the same computer at the same time. Right now, we are all using aurora at the same time, running commands in our own part of the file storage system. But what if we want to share files? Unix lets each person control who can access their files through a set of permissions, which can be seen by doing a long listing with ls -l:
jones@aurora:~/oss$ ls -l README.md
-rw-rw-r-- 1 jones staff 32 Jul 10 01:10 README.md
jones@aurora:~/oss$
You can interpret the permissions as follows:
- Permissions are changed with chmod, using for example chmod o+r README.md
- Ownership is changed with chown
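In a long listing, the first character gives the file type (- for a regular file, d for a directory), and the next nine characters are three groups of rwx flags for the owning user, the group, and all other users. The sketch below, run in a scratch directory (an assumption so that no real files are affected), sets the mode shown in the listing above and then removes and restores read access for other users:

```shell
workdir=$(mktemp -d)
cd "$workdir"
echo "# Tutorial files related to OSS" > README.md

chmod 664 README.md        # rw-rw-r--: user and group read/write, others read only
ls -l README.md

chmod o-r README.md        # remove read permission for other users
ls -l README.md            # the listing now begins with -rw-rw----

chmod o+r README.md        # grant it back
perms=$(ls -l README.md | cut -c 1-10)
echo "$perms"
```

The letters u, g, and o in chmod stand for user, group, and other; + adds a permission and - removes it.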
To see someone else’s files, that person must first grant you the proper permissions. For example, you can examine the contents of one of my data files using the full path to the file, as long as you have the needed file permissions:
jones@aurora:~/oss$ cat ~jones/oss/lakes/LakeMary.md
Lake Mary
jones@aurora:~/oss$
- <command> -h, <command> --help
- man, info, apropos, whereis
$ command [options] [arguments]
- command must be an executable file on your PATH (view it with echo $PATH)
- options can usually take two forms: a short form such as -h and a long form such as --help
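To see how the shell finds a command, the sketch below prints the PATH one directory per line and asks the shell which file it would run for ls. It also combines two short options into one. The scratch directory and its file names are assumptions for illustration, and long options such as --help are not available for every command on every system.

```shell
# Print each directory on the PATH on its own line.
echo "$PATH" | tr ':' '\n'

# command -v reports what the shell would run for a given name.
ls_path=$(command -v ls)
echo "ls resolves to: $ls_path"

# Short options can be combined: -l (long listing) and -a (include hidden files).
workdir=$(mktemp -d)
touch "$workdir/.hidden" "$workdir/visible"
ls -la "$workdir"
```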
Linux/Unix Cheatsheet: http://cheatsheetworld.com/programming/unix-linux-cheat-sheet/
To make it easier to follow the remaining commands, let’s use git to make a copy of the lessons repository.
First, go to the GitHub repository page and copy the repository clone URL:
https://github.com/NCEAS/oss-lessons
Now, open a terminal, log in to aurora, and clone the repository:
jones@powder:~$ ssh aurora.nceas.ucsb.edu
jones@aurora:~$ cd oss
jones@aurora:~/oss$ ls
data lakes README.md
jones@aurora:~/oss$ git clone https://github.com/NCEAS/oss-lessons.git
Cloning into 'oss-lessons'...
remote: Counting objects: 1554, done.
remote: Compressing objects: 100% (796/796), done.
remote: Total 1554 (delta 673), reused 1548 (delta 667), pack-reused 0
Receiving objects: 100% (1554/1554), 90.17 MiB | 21.84 MiB/s, done.
Resolving deltas: 100% (673/673), done.
Checking connectivity... done.
jones@aurora:~/oss$ ls
data lakes oss-lessons README.md
jones@aurora:~/oss$ cd oss-lessons/servers-networks-command-line
jones@aurora:~/oss/oss-lessons/servers-networks-command-line$ ls
1-servers-net.html 3-bash-loops.Rmd paleo-mammals-v2.txt
1-servers-net.Rmd 4-regex.Rmd paleo-mammals-v3.txt
2-commandline-intro.html images plotobs.csv
2-commandline-intro.Rmd paleo-mammals.txt
jones@aurora:~/oss/oss-lessons/servers-networks-command-line$
There are many text editors available for Unix systems. nano is simple and fairly easy to learn, so we’ll use it when needed, but many experienced users prefer vim or emacs, which are much more powerful but also harder to learn. If you use the command line regularly, I recommend learning one of them thoroughly.
vim
emacs
nano
$ nano paleo-mammals.txt
- wc: count lines, words, and/or characters
- diff: compare two files for differences
- sort: sort lines in a file
- uniq: report or filter out repeated lines in a file
$ ls -1 ../../lakes | wc -l
2
$ ls -1 ../../lakes | wc -l > lakecount.txt
$ cat lakecount.txt
2
$ diff -u paleo-mammals.txt paleo-mammals-v2.txt
--- paleo-mammals.txt 2017-07-12 06:13:24.979291128 -0700
+++ paleo-mammals-v2.txt 2017-07-12 06:13:24.979291128 -0700
@@ -4,7 +4,7 @@
Homotherium serum,American Scimitar Cat,1m,1.5M,10K
Castoroides ohioensis,Giant Beaver,1m,1.5M,10K
Dasypus bellus,Beautiful Armadillo,1m,1M,10K
-Osteoborus cynoides,Bone-Crushing Dog,.9m,8M,1.5M
+Osteoborus cynoides,Bone-Crushing Dog,9m,8M,1.5M
Camelops hesternus,American Camel,3.6m,1M,10K
Aepycamelus,Giraffe Camel,3m,10M,5M
Megalocerous giganteus,Giant Irish Elk,2.1m,500k,10K
@@ -16,7 +16,7 @@
Doedicurus,Glyptodon,1.5m,1.5M,12K
Uintatherium robustum,,1.5m,50M,35M
Odobenocetops peruvianus,Walrus-Whale,2.1m,5M,1M
-Mammuthus primigenius,Woolly mammoth,2.75m,1.5M,8K
+Mammuthus primigenios,Woolly mammoth,2.75m,1.5M,8K
Coelodonta antiquitatis,Woolly rhinoceros,2m,500K,10K
Megatherium americanum,Ground sloth,6m,30M,8K
Megalonyx jeffersonii,Jefferson's Ground Sloth,3m,30M,8K
$ diff -u <(sort paleo-mammals.txt) <(sort paleo-mammals-v2.txt)
--- /dev/fd/63 2017-07-12 06:27:26.274193251 -0700
+++ /dev/fd/62 2017-07-12 06:27:26.270193219 -0700
@@ -12,12 +12,12 @@
Gomphotheres,Four-Tusked Elephant,2.4m,15M,5M
Homotherium serum,American Scimitar Cat,1m,1.5M,10K
Indricotherium transsouralicum,,4.7m,30M,25M
-Mammuthus primigenius,Woolly mammoth,2.75m,1.5M,8K
+Mammuthus primigenios,Woolly mammoth,2.75m,1.5M,8K
Megalocerous giganteus,Giant Irish Elk,2.1m,500k,10K
Megalonyx jeffersonii,Jefferson's Ground Sloth,3m,30M,8K
Megatherium americanum,Ground sloth,6m,30M,8K
Nothrotheriops shastensis,Shasta Ground Sloth,1.5m,30M,8K
Odobenocetops peruvianus,Walrus-Whale,2.1m,5M,1M
-Osteoborus cynoides,Bone-Crushing Dog,.9m,8M,1.5M
+Osteoborus cynoides,Bone-Crushing Dog,9m,8M,1.5M
Smilodon fatalis,Sabertooth Cat,1.2m,1.5M,10K
Uintatherium robustum,,1.5m,50M,35M
$ ls nofilehere.txt
ls: cannot access 'nofilehere.txt': No such file or directory
$ ls nofilehere.txt 2>/dev/null
$
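The examples above chain commands with | and redirect output with > and 2>. As a further sketch, the block below combines sort and uniq to count repeated lines, and sends normal output and error messages to separate files. The file names species.txt, nspecies.txt, found.txt, and errors.txt are made up for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' "Picea engelmannii" "Abies lasiocarpa" "Picea engelmannii" > species.txt

# uniq only collapses adjacent duplicates, so sort first; -c prefixes each line with its count.
sort species.txt | uniq -c

# Count the distinct species and save the number to a file.
sort species.txt | uniq | wc -l > nspecies.txt
cat nspecies.txt

# Standard output (>) and standard error (2>) are separate streams.
# ls exits with an error status because one file is missing, hence the "|| true".
ls species.txt nofilehere.txt > found.txt 2> errors.txt || true
cat found.txt
```

Keeping errors in their own file, rather than discarding them with 2>/dev/null, leaves a record of what went wrong.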
- grep: search files for text
- sed: filter and transform text
- find: advanced search for files/directories
- cut: extract parts of files like columns
- join: merge files using a common shared column
$ grep bug *.Rmd
2-commandline-intro.Rmd:### Show all lines containing "bug" in my R scripts
2-commandline-intro.Rmd:$ grep bug *.R
2-commandline-intro.Rmd:$ grep -c bug *.R
2-commandline-intro.Rmd:### Print the names of files that contain bug
2-commandline-intro.Rmd:$ grep -l bug *.R
2-commandline-intro.Rmd:### Print the lines of files that __don't__ contain bug
2-commandline-intro.Rmd:$ grep -v bug *.R
2-commandline-intro.Rmd:### Remove all lines containing "bug"!
2-commandline-intro.Rmd:$ sed '/bug/d' myscript.R
2-commandline-intro.Rmd:### Call them buglets, not bugs!
2-commandline-intro.Rmd:$ sed 's/bug/buglet/g' myscript.R
2-commandline-intro.Rmd:$ sed '/#/ s/bug/buglet/g' myscript.R
$ grep -c bug *.Rmd
1-servers-net.Rmd:0
2-commandline-intro.Rmd:12
3-bash-loops.Rmd:0
4-regex.Rmd:0
$ grep -l bug *.Rmd
2-commandline-intro.Rmd
$ grep -v bug *.Rmd |wc -l
841
$ sed '/bug/d' 2-commandline-intro.Rmd
$ sed 's/bug/buglet/g' 2-commandline-intro.Rmd
$ sed '/#/ s/bug/buglet/g' 2-commandline-intro.Rmd
$ find . -iname '*.Rmd'
$ find . -size +5k -ls
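Note that sed, as used here, prints the edited text to the terminal and leaves the input file untouched. The sketch below shows this with a three-line stand-in for myscript.R (its contents and the output name renamed.R are invented for illustration), followed by two find searches:

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' "x <- 1  # fix this bug" "bug <- 2" "y <- 3" > myscript.R

# sed writes its result to standard output; myscript.R itself is unchanged.
sed '/bug/d' myscript.R                  # delete lines containing "bug"
sed 's/bug/buglet/g' myscript.R          # replace "bug" on every line
sed '/#/ s/bug/buglet/g' myscript.R      # replace only on lines that contain "#"

# To keep a result, redirect it to a new file (never onto the input file itself).
sed '/#/ s/bug/buglet/g' myscript.R > renamed.R

# find by name, ignoring case, and by size (larger than 5 KiB; nothing here is that big).
find . -iname '*.r'
find . -type f -size +5k
```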
Cut is used to extract columns from a delimited text file, like a CSV file. It is fast and simple.
$ cut -d , -f 1,5 plotobs.csv
obsid,sciname
1,Abies lasiocarpa
2,Picea engelmannii
3,Picea engelmannii
4,Picea engelmannii
5,Picea engelmannii
...
$ cut -d , -f 1,5 plotobs.csv > plot-spp.csv
$ cut -d , -f 1,6 plotobs.csv > plot-diam.csv
$ head plot-spp.csv plot-diam.csv
==> plot-spp.csv <==
obsid,sciname
1,Abies lasiocarpa
2,Picea engelmannii
3,Picea engelmannii
4,Picea engelmannii
5,Picea engelmannii
6,Picea engelmannii
7,Picea engelmannii
8,Picea engelmannii
9,Picea engelmannii
==> plot-diam.csv <==
obsid,diameter
1,31.84
2,3.21
3,7.2
4,11.62
5,11.25
6,13.16
7,18.6
8,23.62
9,31.75
$ join -j 1 -t , <(sort plot-diam.csv) <(sort plot-spp.csv)
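join expects both inputs to be sorted on the join field, in the same order that sort produces. A minimal sketch follows, using three observations that appear earlier in the lesson and intermediate file names invented for illustration. It sorts each file on the first comma-separated field only and then joins them. With real files that include a header row, the header is sorted along with the data.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' "1,Abies lasiocarpa" "2,Picea engelmannii" "13,Sambucus racemosa" > spp.csv
printf '%s\n' "1,31.84" "2,3.21" "13,3.83" > diam.csv

# Sort on field 1 only, using the same delimiter that join will use.
# The order is textual, not numeric, so 13 sorts before 2.
sort -t , -k 1,1 spp.csv > spp.sorted
sort -t , -k 1,1 diam.csv > diam.sorted

# Join on the first field of each file: key, then the remaining fields of each input.
join -t , -1 1 -2 1 diam.sorted spp.sorted > joined.csv
cat joined.csv
```

Writing the sorted inputs to files, instead of using <( ) as above, makes it easier to inspect them when a join returns fewer rows than expected.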