
cat abuse with split

# cat gets used all the time just to pipe the contents of a
# file to stdout.
# But it is actually for concatenating files.
# It has a partner in crime called split.
# Working together they are very powerful for parallel processing.
# split - do work in parallel - cat

# split then cat will produce an output file which is identical to the input:

# Make a 1 gig file of random bytes on my external ssd.
time head -c $(( 1024 * 1024 * 1024 )) /dev/urandom > $(mktemp '/DataSwap/big.XXXXXXXX')

real    0m5.738s
user    0m0.057s
sys    0m5.680s

# Yey - that was fast!
[aturner@Alexanders-MBP ~]$ split -b $(( 1024 * 1024)) /DataSwap/big.* '/DataSwap/parts'
[aturner@Alexanders-MBP ~]$ ls /DataSwap/parts*
/DataSwap/partsaa  /DataSwap/partsgp  /DataSwap/partsne  /DataSwap/partstt    /DataSwap/partszabi  /DataSwap/partszahx
/DataSwap/partsab  /DataSwap/partsgq  /DataSwap/partsnf  /DataSwap/partstu    /DataSwap/partszabj  /DataSwap/partszahy
/DataSwap/partsac  /DataSwap/partsgr  /DataSwap/partsng…
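
# To close the loop, a minimal sketch of the cat half (the rebuilt
# filename is mine, not from the original): split's suffixes sort
# lexicographically, so a glob hands the parts to cat in order, and
# cmp checks the round trip is byte-identical.
cat /DataSwap/parts* > /DataSwap/rebuilt
cmp /DataSwap/big.* /DataSwap/rebuilt && echo 'identical'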

Mass Deleting With split and map

# First off - this is how I got into the mess in the first place.

# Step 1: make a really big random file:
head -c $(( 1024 * 1024 * 1024 )) /dev/urandom > $(mktemp '/DataSwap/big.XXXXXXXX')

# Step 2: Screw up and split it into a million (actually 1024*1024)
# separate 1024-byte files.
split -b 1024 /DataSwap/big.KFGgJs8S '/DataSwap/parts'

# Trying to delete them normally just fails because the command line is too long.
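
# (That is the kernel's ARG_MAX limit; the shell reports something like
#  "Argument list too long". A find-based delete, e.g.
#  find /DataSwap -name 'parts*' -delete
#  would also work, but the point here is to stay in bash.)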

# This seems to be about as fast as I can get, using a whole
# bunch of parallel deleters.

# First make separate files of 1000 entries to remove and put them in shared
# memory for speed (a Linuxism).
ls | split - '/dev/shm/lses'
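
# (This assumes ls is run from inside /DataSwap, since rm_block below
#  prepends that path, and it leans on split's default of 1000 lines
#  per output file.)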

# Now make a function which can read a block and delete all
# the files listed.
function rm_block { for f in $(cat $1); do rm "/DataSwap/$f"; done; }
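
# (NB: $(cat $1) word-splits the listing, which is safe here only because
#  split's generated names never contain whitespace.)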

# A wrapper to easily put that in the background.
function rm_block_bg { rm_block $1 & }

# Now kick off the delete:
ls /dev/shm/ls* | map rm_block_bg

# Go make a cup of tea whils…
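
# If you would rather block than brew, a sketch (my addition): run wait
# inside the same subshell that launched the jobs - the right-hand side
# of a pipe runs in a subshell, so a bare wait in the parent shell would
# not see them.
ls /dev/shm/ls* | { map rm_block_bg; wait; }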

bash 'header' files

# So you want to load a set of library functions but not reload them
# every time the file is sourced?
#
# For example:

if [[ -z $__UTILS_LOADED__ ]]
then

function print { local line="$@"; printf "%s\n" "$line"; }

function map { local l; while read -r l; do $1 $l; done; }

print '*** UTILS LOADED ***'
__UTILS_LOADED__=TRUE
fi

# Now you can put...
source /some/path/to/lib/utils.sh
# ...wherever you want.
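
# A quick hypothetical session showing the guard at work - the second
# source is a no-op:
$ source /some/path/to/lib/utils.sh
*** UTILS LOADED ***
$ source /some/path/to/lib/utils.sh
$ # (nothing - already loaded)
# And to force a reload while developing the library:
unset __UTILS_LOADED__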

Parsing Columns From Files WITHOUT awk

# awk is cool - but sometimes jumping from bash to awk to bash gets clunky.
# We don't actually have to use awk - we can just leverage bash's internal parsing.

# This is a classic awk example, but now we have the really useful function map
# which lets us do the same thing in bash.

function size_name {
    print $5 $9
}


function map {
    local l;
    while read -r l; do
        $1 $l;
    done
}

ls -l SonicField/src/cpp/lib/ | map size_name | column -t
168       build.sh
2108      stream.hpp
15676048  stream.hpp_out

# Let's try this as a one liner:
_tmp () { print $5 $9; }; ls -l SonicField/src/cpp/lib/ | map _tmp  | column -t

# awk is simpler but not by much if we assume map is part of your utils
ls -l SonicField/src/cpp/lib/ | awk '{print $5, $9}'  | column -t
168       build.sh
2108      stream.hpp
15676048  stream.hpp_out

# But remember that awk is a separate process space so you do not 
# have access to the bash state in the same way. For example:
function add_size { [[ $5 != '' ]] &&…
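
# A guess at where that example was heading (variable names are mine):
# keep a running total in a bash variable, which a separate awk process
# could not share. The right-hand side of a pipe is a subshell, so read
# the total inside it:
total=0
function add_size { [[ $5 != '' ]] && total=$(( total + $5 )); }
ls -l SonicField/src/cpp/lib/ | { map add_size; print "total: $total"; }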