Optimizing Grep, Sort, Uniq For Speed


This snippet of code pulls an array of hostnames from some log files. It has to parse around 3GB of log files, so I’m keen on making it as efficient as possible. Can you think of any way to optimize this to run faster?

    for host in $(grep -h -o "[-\.0-9a-z][-\.0-9a-z]*.com" ${TMPDIR}/* | sort | uniq); do

9 thoughts on - Optimizing Grep, Sort, Uniq For Speed

  • Sean Carolan wrote:

    For one, do the sort in one step: sort -u. For another, are the hostnames always in the same field? For example, if they’re all from /var/log/messages, I’d do awk '{print $4;}' | sort -u
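    As a quick sanity check that sort -u gives the same result as sort | uniq (the sample hostnames are made up):

```shell
# sort -u deduplicates in a single pass, equivalent to sort | uniq
printf 'b.example.com\na.example.com\nb.example.com\n' | sort -u
# a.example.com
# b.example.com
```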


  • You have two major performance problems in this script. First, UTF-8
    processing is slow. Second, wildcards are EXTREMELY SLOW!

    You’ll get a small performance improvement by using a C locale, *if* you know that all of your text will be ASCII (hostnames will be). You can set LANG either for the whole script or just for grep/sort:
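    A minimal sketch of both approaches (shown on a small made-up sample rather than the real logs):

```shell
# Option 1: set the C locale for the whole script
export LANG=C LC_ALL=C

# Option 2: set it for a single command only, by prefixing the assignment.
# In the C locale, sort compares raw bytes, so uppercase sorts first:
printf 'mail.example.com\nB.example.com\n' | LC_ALL=C sort
# B.example.com
# mail.example.com
```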

  • Naturally, you should test both on your own data. I’m amused to admit that I tested my own advice against my mail log and got more improvement from the LANG setting than the string prefix. The combination of the two reduced the amount of time to run your pattern against my mail logs by about 90%.

  • Thank you Mark and Gordon. Since the hostnames I needed to collect are in the same field, at least in the lines of the file that matter, I ended up using suggestions from both of you; the code looks like this now. The egrep is there to make sure whatever is in the 9th field looks like a domain name.

    for host in $(awk '{ print $9 }' ${TMPDIR}/* | egrep "[-\.0-9a-z][-\.0-9a-z]*.com" | sort -u); do
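    One caveat worth noting: in the pattern above, the dot before "com" is unescaped, so it matches any character (something like "examplexcom" would slip through). A sketch of the same loop with the dot escaped, run here against made-up sample files (the field layout and file names are assumptions for the demo):

```shell
# Made-up sample logs with the hostname in field 9
TMPDIR=$(mktemp -d)
printf 'f1 f2 f3 f4 f5 f6 f7 f8 web-01.example.com rest\n' > "$TMPDIR/log1"
printf 'f1 f2 f3 f4 f5 f6 f7 f8 web-01.example.com rest\n' > "$TMPDIR/log2"

# Same structure as the code above, but with \.com instead of .com
for host in $(awk '{ print $9 }' "$TMPDIR"/* | egrep '[-.0-9a-z]+\.com' | sort -u); do
    echo "$host"    # placeholder for the real loop body
done
# web-01.example.com
```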

    Original script:
    real 28m11.488s user 26m57.043s sys 0m30.634s

    Using awk instead of grepping the entire batch:
    real 6m14.949s user 5m0.629s sys 0m26.914s

    Using awk and with export LANG=C
    real 2m50.611s user 1m20.849s sys 0m27.366s

    Awesome, thanks for the tips!

  • Sean Carolan wrote:
    awk is not “cut”. What you want is awk '{if (/[-\.0-9a-z][-\.0-9a-z]*.com/) { print $9;}}' | sort -u

    No grep needed; awk looks for what you want *first* this way.

    mark, who learned awk in the very early nineties, writing
    100-200 line awk scripts….

  • Thanks, Mark. This is cleaner code, but it benchmarked slower than awk followed by egrep.

    real 3m35.550s user 2m7.186s sys 0m27.793s

    I’ll run it a few more times to make sure that it wasn’t some other process slowing it down.

    I really need to brush up some more on my awk skills!

  • If the key phrase is *as efficient as possible*, then I would say you want a compiled pattern search. Lex is the tool for this, and this is not a hard job for it. Lex will generate a specific scanner(*)
    in C or C++ (depending on which flavor of lex you use). It will probably be table-based. Grep and awk, in contrast, generate scanners on the fly, and specifying complicated regular expressions is somewhat clumsier in grep and awk.

    (*) strictly speaking, you are *scanning* not *parsing*. Parsing involves a grammar, and there’s no grammar here. If it develops that these domain names are context sensitive, then you will need a grammar.

    The suggestions of others (setting LANG, cutting a specific field,
    and so on) are all very valuable, and may be *practically* more valuable than writing a scanner with lex, or could be used in conjunction with a “proper” scanner.

    Note that lex will allow you to use a much better definition for
    “domain name” — such as more than one suffix, names of arbitrary complexity, names that may violate RFC, numeric type names, case sensitivity, names that match certain special templates, like
    “*.cn” or “goog*.*” and so on.

    If you are unfamiliar with lex, note that it is the front end for many a compiler.

    BTW, you could easily incorporate a sorting function in lex that would eliminate the need for an external sort. This might be done in awk,
    too, but in lex it would be more natural. You simply would not enter duplicates in the tree. When the run is over, traverse the tree and out come the unique hostnames. I’m assuming you’ll have many collisions. (You could even keep a count of collisions, if you’re interested in which hosts are “popular”.) Consider btree(3) for this or hash(3).


  • Woodchuck wrote:

    That, to me, would be a Big Deal.

    Hello, mark, wake up.

    Of course, there’s an even easier way, just using awk:

    awk '{if (/[-\.0-9a-z][-\.0-9a-z]*.com/) { hostarray[$9] = 1;}} END { for (i in hostarray) { print i;}}'

    This dumps it into an associative array, that is, one whose indices are strings. Note that awk does not guarantee any particular iteration order for "for (i in array)", so pipe the output through sort if the order matters.
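    A self-contained run of that technique on made-up input (field 9 holds the hostname, as elsewhere in the thread; sort is appended so the demo output is deterministic):

```shell
# Collect unique hostnames from field 9 via an awk associative array
printf 'a b c d e f g h two.example.com x\na b c d e f g h one.example.com x\na b c d e f g h two.example.com x\n' |
awk '{ if (/[-.0-9a-z][-.0-9a-z]*\.com/) { hostarray[$9] = 1 } }
     END { for (i in hostarray) print i }' | sort
# one.example.com
# two.example.com
```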


  • I ended up using this construct in my code; this one fetches out servers that are having issues checking in with puppet:

    awk '{if (/Could not find default node or by name with/) { print substr($15, 2, length($15)-2);}}' ${TMPDIR}/* | sort -u
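    For reference, substr($15, 2, length($15)-2) strips one character from each end of field 15, e.g. surrounding quotes. A tiny made-up illustration using field 1 instead:

```shell
# substr(s, 2, length(s)-2) drops the first and last characters of s
echo "'node01.example.com'" | awk '{ print substr($1, 2, length($1)-2) }'
# node01.example.com
```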

    Thanks again, your knowledge and helpfulness are much appreciated.