Text processing

Convert file encoding

Get available encodings.

iconv -l

Convert text from the ISO 8859-15 character encoding to UTF-8.

iconv -f ISO-8859-15 -t UTF-8 < input.txt > output.txt
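The conversion can be checked on a small sample, sketched here under some assumptions: the temporary directory, the sample text, and the variable names are illustrative only, and iconv is assumed to support ISO-8859-15 on the system.

```shell
# Work in a throwaway directory so nothing real is overwritten.
workdir=$(mktemp -d)

# "café €" in ISO 8859-15: é is byte 0xE9 (octal 351), € is byte 0xA4 (octal 244).
printf 'caf\351 \244\n' > "$workdir/input.txt"

iconv -f ISO-8859-15 -t UTF-8 < "$workdir/input.txt" > "$workdir/output.txt"

# Dump the result as hex to inspect the multi-byte UTF-8 sequences.
converted_hex=$(od -An -tx1 "$workdir/output.txt" | tr -d ' \n')
echo "$converted_hex"
```

In the hex dump, é should appear as c3 a9 and € as e2 82 ac, the UTF-8 encodings of those characters.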

Get number of lines in targetFile

wc -l < targetFile
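A minimal sketch of capturing the count in a variable for use in a script. The sample file and the variable name line_count are made up for illustration.

```shell
workdir=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$workdir/targetFile"

# Reading from stdin means wc has no file name to print.
# tr strips the padding that some wc implementations add before the number.
line_count=$(wc -l < "$workdir/targetFile" | tr -d ' ')
echo "$line_count"
```

Note that wc -l counts newline characters, so a final line without a trailing newline is not counted.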

Read a CSV file and extract selected columns in a chosen order

awk -F ',' '{print $3 "," $1}' a1.csv > b2.csv
note
  • Saves output to file b2.csv.
  • Columns are referenced using $1, $2, $x.
  • If -F is not set, awk splits fields on runs of whitespace (spaces and tabs).
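The column reordering can be tried on a small file, as in this sketch. The sample data is invented, and setting OFS is just one possible way to avoid repeating the "," literal between fields.

```shell
workdir=$(mktemp -d)
printf 'id,name,city\n1,Ada,London\n2,Linus,Helsinki\n' > "$workdir/a1.csv"

# -F sets the input separator; OFS is what print puts between comma-separated arguments.
awk -F ',' -v OFS=',' '{print $3, $1}' "$workdir/a1.csv" > "$workdir/b2.csv"

reordered=$(cat "$workdir/b2.csv")
echo "$reordered"
```

This splits on every comma, so it is only suitable for simple CSV files without quoted fields that contain commas.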

Remove duplicate lines

Preserving line order in output

awk '!seen[$0]++' target.csv

Not maintaining line order

sort -u target.csv
tip

-f - Ignore case when comparing, so lines that differ only in case are treated as duplicates.
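The difference between the two approaches shows up on a small input, sketched below. The sample lines and variable names are illustrative, and LC_ALL=C is set only to make the sort order predictable across systems.

```shell
workdir=$(mktemp -d)
printf 'b\na\nb\nA\na\n' > "$workdir/target.csv"

# Keeps the first occurrence of each line, in input order.
ordered=$(awk '!seen[$0]++' "$workdir/target.csv")

# Sorts first, then drops duplicates; the original order is lost.
sorted=$(LC_ALL=C sort -u "$workdir/target.csv")

# With -f, "a" and "A" compare equal, so only one of them survives.
folded_count=$(LC_ALL=C sort -u -f "$workdir/target.csv" | wc -l | tr -d ' ')

echo "$ordered"
echo "$sorted"
echo "$folded_count"
```

The awk version keeps every distinct line in memory, which may matter for very large files; sort -u can spill to temporary files instead.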

Sed

Delete lines matching regex and print the result (the file itself is unchanged)

sed '/regex/d' file

Replace all occurrences of regex in target file with string

sed 's/regex/string/g' file

Replace only on lines matching the line pattern

sed '/lineregex/s/regex/string/g' file
  • Remove g from either substitution above to replace only the first occurrence on each line.
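The three sed forms above, plus the variant without g, can be compared on one sample file. This is a sketch with made-up content and patterns; none of the commands modify the file.

```shell
workdir=$(mktemp -d)
printf 'foo foo\nbar foo\n# foo\n' > "$workdir/file"

# Delete lines starting with '#'.
no_comments=$(sed '/^#/d' "$workdir/file")

# Replace every foo on every line.
all_replaced=$(sed 's/foo/baz/g' "$workdir/file")

# Replace foo only on lines starting with bar.
bar_only=$(sed '/^bar/s/foo/baz/g' "$workdir/file")

# Without g, only the first foo on each line is replaced.
first_only=$(sed 's/foo/baz/' "$workdir/file")

printf '%s\n---\n%s\n---\n%s\n---\n%s\n' \
  "$no_comments" "$all_replaced" "$bar_only" "$first_only"
```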

Useful options

  • -i - Edit the file in place, overwriting it (BSD/macOS sed requires a suffix argument, e.g. -i '').
  • --in-place=.bkp - Also edit in place, but keep a backup of the original file with a .bkp extension (short form: -i.bkp).
  • -e - Apply multiple expressions (e.g. sed -e 's/regex0/string0/' -e 's/regex1/string1/' file).
  • -r - Use extended regular expressions (GNU sed; -E is the more portable spelling).
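The options can be combined in one call, as in the sketch below. The file content and patterns are invented for illustration, and the short form -i.bkp is used on the assumption that it behaves like --in-place=.bkp while also being accepted by BSD sed.

```shell
workdir=$(mktemp -d)
printf 'colour grey\n' > "$workdir/file"

# -i.bkp edits in place and keeps the original as file.bkp.
# -E enables extended regex, so ? and (e|a) need no backslashes.
# Each -e adds one expression; they run in order on every line.
sed -i.bkp -E -e 's/colou?r/color/' -e 's/gr(e|a)y/gray/' "$workdir/file"

edited=$(cat "$workdir/file")
backup=$(cat "$workdir/file.bkp")
echo "$edited"
echo "$backup"
```

Keeping a backup suffix is a cheap safeguard when trying out a new expression on a file that matters.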

Sort by first column alphabetically, then by second column numerically (descending)

sort -k1,1 -k2,2nr
note
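  • The trailing r applies only to the second key, so the numbers sort in descending order.
  • By default, fields are separated by blanks (spaces or tabs); use -t to set another separator.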
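A small sketch of the two-key sort on inline sample data, once with the default separator and once with -t for comma-separated input. The data and variable names are illustrative, and LC_ALL=C is set only to keep the alphabetical ordering predictable.

```shell
# Space-separated input: first key alphabetical, second key numeric descending.
by_name_then_score=$(printf 'b 2\na 10\na 9\nb 30\n' | LC_ALL=C sort -k1,1 -k2,2nr)

# The same keys on comma-separated input, using -t to set the separator.
csv_sorted=$(printf 'b,2\na,10\na,9\nb,30\n' | LC_ALL=C sort -t ',' -k1,1 -k2,2nr)

echo "$by_name_then_score"
echo "$csv_sorted"
```

Writing the keys as -k1,1 and -k2,2 limits each key to a single field; a bare -k1 would extend the key to the end of the line.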