Thursday, July 4, 2013

Grep too slow? Use git-grep

If you are inside a git repo, use `git grep`. It takes similar options and is much, much faster.

Example, on a directory with 11K+ files:

grep -r . 'float: none' -exclude-dir=.git
56.96s user 0.70s system 99% cpu 58.127 total

git grep 'float: none'
0.42s user 1.05s system 335% cpu 0.438 total

130x+ improvement :D

Update:
Just installed grep 2.14, as recommended on this Hacker News thread.
I thought I would have gotten better results :\

$ time grep -r icon-similarities
grep -r icon-similarities  0.11s user 0.75s system 12% cpu 6.646 total

$ time git grep icon-similarities
git grep icon-similarities  0.02s user 0.03s system 24% cpu 0.201 total
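
For a like-for-like rerun, keep in mind that GNU grep expects the pattern before the path and spells the long option `--exclude-dir` with two dashes, and that `git grep` only looks at tracked files. Here is a minimal sketch, assuming GNU grep and git are installed. The throwaway repository and its file names are made up for illustration, not taken from the project measured above:

```shell
# Build a throwaway repository so the sketch is self-contained (hypothetical file names).
repo=$(mktemp -d)
cd "$repo"
git init -q .
printf 'a { float: none; }\n' > tracked.css
mkdir logs
printf 'float: none\n' > logs/untracked.log
git add tracked.css

# GNU grep: options first, then the pattern, then the path to search.
# --exclude-dir takes two dashes; -I skips binary files.
grep_hits=$(grep -rI --exclude-dir=.git 'float: none' . | wc -l)

# git grep only searches tracked files, so logs/untracked.log is never read.
git_hits=$(git --no-pager grep 'float: none' | wc -l)

echo "grep: $grep_hits, git grep: $git_hits"
```

In this toy repo grep finds two matching lines and `git grep` finds one, because the untracked log file is only visible to grep. If the counts differ, the timings are not measuring the same work.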

26 comments:

  1. Have you tried:

    LC_ALL="C" grep -r ... ?

    1. http://stackoverflow.com/questions/8138124/implications-of-lc-all-c-to-speedup-grep

  2. Did you run another grep test after the git-grep one?

    When I grep large folders, the first grep takes ages but the next ones are incredibly fast. There must be some kind of cache. Or it's just zsh doing it for me, who knows?

    1. Yup, I did several of them, they all looked the same.

    2. The second run will always be faster because the *filesystem* keeps an in-memory cache of recently read files.

    3. I'll try it again tomorrow and get some more numbers and screenshots, I'm pretty certain the numbers didn't change between runs.

    4. Yup, you were right:

      ─$ time grep -r . 'float: none' -exclude-dir=.git > /dev/null
      grep -r . 'float: none' -exclude-dir=.git > /dev/null 58.77s user 1.82s system 72% cpu 1:23.44 total
      ─$ time grep -r . 'float: none' -exclude-dir=.git > /dev/null
      grep -r . 'float: none' -exclude-dir=.git > /dev/null 58.65s user 0.68s system 99% cpu 59.613 total
      ─$ time grep -r . 'float: none' -exclude-dir=.git > /dev/null
      grep -r . 'float: none' -exclude-dir=.git > /dev/null 59.42s user 0.74s system 99% cpu 1:00.47 total
      ─$ time grep -r . 'float: none' -exclude-dir=.git > /dev/null
      grep -r . 'float: none' -exclude-dir=.git > /dev/null 57.31s user 0.88s system 98% cpu 58.938 total

      There is some variation between runs.

  3. What about Ack?

    1. Maybe you can provide us some numbers for comparison :)

    2. How do you get the time counter?

    3. time [command]

      Example:

      time grep -r . 'float: none' -exclude-dir=.git

    4. Ack was 10 times slower than grep, even with --css.

      Couldn't run git grep properly with the time command, as the output gets piped to a pager and I have to type q to quit (adding my finger typing to the time result).

      Typing q as fast as I can, git grep comes out maybe 0.020s faster than grep. Maybe my project is too small. Well, git grep is faster to type anyway, saving a lot of time.

  4. This isn't apples to apples. The two commands are not equivalent, so of course, one is faster.

    $ grep --exclude-dir=.git -r foo . | wc -l
    5829
    $ git grep foo | wc -l
    500

    In this case, I have a huge untracked logs directory in the root of this repository, and git-grep ignores anything not tracked in the repo. grep is much, much faster once it also ignores that logs directory and binary files.

    $ grep --exclude-dir=.git --exclude-dir=logs -I -r foo . | wc -l
    337

    But since the counts don't match, it is still not apples to apples.

    Anyway, I wasn't aware of git-grep, so thank you for the tip. Definitely useful if you're not already using an aliased version of grep.

  5. Maybe you're using a slow grep.

    http://jlebar.com/2012/11/28/GNU_grep_is_10x_faster_than_Mac_grep.html

  6. echo 3 | sudo tee /proc/vm/sys/drop_caches

  7. Mistake above, the correct path is: echo 3 | sudo tee /proc/sys/vm/drop_caches

  8. I use ack-grep http://beyondgrep.com/

  9. The time difference seems to be solely due to what files are searched. 'git grep' only greps through the files that are tracked, so skips all binary files, while 'grep' by default also searches through compiled executables, .o files, etc.

    Including only .cpp and .h files (which are the largest part of what is in there):

    $ time git grep "\(TODO\|FIXME\)" > /dev/null

    real 0m0.202s
    user 0m0.276s
    sys 0m0.060s

    $ time grep --exclude-dir=.git --include=\*.{cpp,h} -r "\(TODO\|FIXME\)" . > /dev/null

    real 0m0.210s
    user 0m0.156s
    sys 0m0.052s

    $ time LC_ALL="UTF-8" grep --exclude-dir=.git --include=\*.{cpp,h} -r "\(TODO\|FIXME\)" . > /dev/null

    real 0m0.252s
    user 0m0.172s
    sys 0m0.040s

    1. that second one is with LC_ALL="C", btw :)

  10. The Silver Searcher, anyone? https://github.com/ggreer/the_silver_searcher

  11. Off topic, but if I click the "Internet Defense League" banner on the right, it opens the website in a tiny iFrame. (Safari)

  12. Been using `ag` for a while https://github.com/ggreer/the_silver_searcher. Please do give it a try and compare with these results.

  13. Did you see the comments on HN?

    Don't use the BSD grep included with OS X; use GNU grep.

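
Several of the comments come down to measuring both tools the same way: keep git's pager out of the timing, restrict grep to the same kind of files, and try `LC_ALL=C`. Here is a rough sketch of that setup, assuming GNU grep and git. The throwaway repository, `main.cpp` and its contents are made up for illustration:

```shell
# Throwaway repository so the sketch is self-contained (file name and contents are made up).
repo=$(mktemp -d)
cd "$repo"
git init -q .
printf 'TODO: tidy up\nFIXME: off by one\nint main() {}\n' > main.cpp
git add main.cpp

# --no-pager (or redirecting stdout) keeps git from opening a pager,
# so a `time` prefix would measure only the search itself.
git_hits=$(git --no-pager grep -E 'TODO|FIXME' | wc -l)

# LC_ALL=C makes grep match bytes instead of decoding multibyte characters.
# --include limits grep to the file types being compared.
grep_hits=$(LC_ALL=C grep -rI -E --exclude-dir=.git --include='*.cpp' --include='*.h' 'TODO|FIXME' . | wc -l)

# Compare timings only once both commands report the same number of matches.
echo "git grep: $git_hits, grep: $grep_hits"
```

To time either search, prefix the command with `time` and send its output to `/dev/null`, as in the examples in the thread.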