Grep Across Multiple Lines

Last Updated at 2023-01-18

[#quick-reference]Quick Reference[#quick-reference]

Command Explanation
 $ grep -Pzo '(?s)from.*to' <file_name>
grep -Pzo with a pattern made of (?s), the start word, .* and the end word, followed by the file name, is the simplest way to use grep to match across multiple lines in a file.
 $ ggrep -Pzo '(?s)from.*to' <file_name>
Where grep does not support -P (for example, the BSD grep that ships with macOS), install GNU grep with brew install grep and use ggrep instead.
 $ pcre2grep -M 'from(\n|.)*to' <file_name>
The pcre2grep utility shortens the command further: it needs only the -M flag instead of -Pzo.

[#multiple-matches-with-grep]Multiple matches with grep[#multiple-matches-with-grep]

To match from one word to another on the same line, the grep command takes the form:

 $ grep 'from.*to' <file_name>

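This uses regular expression syntax to match lines that contain [.inline-code]from[.inline-code] followed, later on the same line, by [.inline-code]to[.inline-code]. For example, with [.inline-code]complete[.inline-code] as both the start and the end word, the match runs from the first [.inline-code]complete[.inline-code] to the last [.inline-code]complete[.inline-code] on the same line. This is because [.inline-code].[.inline-code] matches any single character while [.inline-code]*[.inline-code] repeats it as many times as possible.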
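
As a minimal sketch of the single-line case, the snippet below builds a small stand-in file (its contents are hypothetical and exist only for this illustration) and searches it with [.inline-code]complete[.inline-code] as both the start and the end word. The variable names are also just for illustration.

```shell
# Hypothetical sample file, created only for this illustration.
sample=$(mktemp)
printf '%s\n' \
  'job scheduled' \
  'backup complete, report complete today' \
  'cleanup complete' > "$sample"

# Prints whole lines that contain "complete" followed later by "complete".
line_match=$(grep 'complete.*complete' "$sample")
echo "$line_match"

# -o prints only the matched text; the greedy .* runs to the last "complete".
matched_text=$(grep -o 'complete.*complete' "$sample")
echo "$matched_text"

rm -f "$sample"
```

Only the second line is reported, because it is the only line on which both words occur.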

[#using-grep-p-or-ggrep-P-to-grep-multiple-lines]Using [.inline-code]grep -P[.inline-code] or [.inline-code]ggrep -P[.inline-code] to grep multiple lines[#using-grep-p-or-ggrep-P-to-grep-multiple-lines]

To multiline match with grep, the command becomes much more complicated:

 # if your machine supports grep -P
 $ grep -Pzo '(?s)from.*to' <file_name>
 # using ggrep instead
 $ ggrep -Pzo '(?s)from.*to' <file_name>

If your machine does not support [.inline-code]grep -P[.inline-code] (the BSD [.inline-code]grep[.inline-code] that ships with macOS does not), you can install GNU [.inline-code]grep[.inline-code] from [.inline-code]homebrew-core[.inline-code] using [.inline-code]brew[.inline-code]:

 $ brew install grep

GNU [.inline-code]grep[.inline-code] is then available as [.inline-code]ggrep[.inline-code].

The parameters for this are:

  • [.inline-code]-P[.inline-code] enables Perl-compatible regular expressions (PCRE).
  • [.inline-code]-z[.inline-code] treats the input as a set of lines, each terminated by a zero byte instead of a newline. Because text files rarely contain zero bytes, this lets grep treat the whole file as a single line rather than as many lines.
  • [.inline-code]-o[.inline-code] prints only the matched text; without it the entire file would be printed. The complication is that, combined with [.inline-code]-z[.inline-code], each match is followed by a trailing zero byte, which can cause problems further down a pipeline.
  • [.inline-code](?s)[.inline-code] activates PCRE_DOTALL, which makes “.” match any character, including a newline.
  • [.inline-code].*[.inline-code] matches everything, including newlines, up to the last [.inline-code]to[.inline-code], because [.inline-code](?s)[.inline-code] is part of the regular expression.
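
The flags above can be tried out on a small stand-in file. This is a sketch under assumptions: the file contents and variable names are hypothetical, and the [.inline-code]tr[.inline-code] step is one possible way to deal with the trailing zero byte, not something the command requires.

```shell
# Hypothetical sample file, created only for this illustration.
sample=$(mktemp)
printf '%s\n' 'start' 'from here' 'middle' 'to there' 'end' > "$sample"

# Prefer ggrep where it is installed, otherwise fall back to grep.
GREP=$(command -v ggrep || command -v grep)

# tr turns the zero byte that follows each match into a newline.
multi=$("$GREP" -Pzo '(?s)from.*to' "$sample" | tr '\0' '\n')
echo "$multi"

rm -f "$sample"
```

The match starts at [.inline-code]from[.inline-code] on the second line and ends at [.inline-code]to[.inline-code] on the fourth, so the rest of that line is not printed.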

If you only want to print the names of files that contain a match for the regular expression, replace the [.inline-code]-o[.inline-code] flag with [.inline-code]-l[.inline-code], which lists the names of all matching files.
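
A short sketch of the [.inline-code]-l[.inline-code] variant, assuming two hypothetical files in a temporary directory, only one of which contains a multiline match:

```shell
# Hypothetical files, created only for this illustration.
dir=$(mktemp -d)
printf '%s\n' 'from here' 'middle' 'to there' > "$dir/a.txt"
printf '%s\n' 'nothing relevant' > "$dir/b.txt"

# Prefer ggrep where it is installed, otherwise fall back to grep.
GREP=$(command -v ggrep || command -v grep)

# -l prints the name of each file with a match instead of the matched text.
files=$("$GREP" -Pzl '(?s)from.*to' "$dir"/*.txt)
echo "$files"

rm -rf "$dir"
```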

[#grep-for-single-line-to-final-word-in-another-line]Grep for single line to the final word in another line[#grep-for-single-line-to-final-word-in-another-line]

 $ grep -Pzo '(?s)success.*failure' process_output.txt
 # or
 $ ggrep -Pzo '(?s)success.*failure' process_output.txt

[#grep-for-start-to-end-of-line-containing-multiple-instances]Grep for start of line containing multiple instances of the same word to the end of a line containing multiple instances of the same word[#grep-for-start-to-end-of-line-containing-multiple-instances]

 $ grep -Pzo '(?s)scheduled.*complete' process_output.txt
 # or
 $ ggrep -Pzo '(?s)scheduled.*complete' process_output.txt

[#grep-word-at-end-of-one-line-to-final-word-in-end-of-other-line]Grep for word at the end of one line to the final word in another line[#grep-word-at-end-of-one-line-to-final-word-in-end-of-other-line]

 $ grep -Pzo '(?s)failure.*complete' process_output.txt
 # or
 $ ggrep -Pzo '(?s)failure.*complete' process_output.txt
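
The three searches in these sections can be tried against a stand-in file. The contents of [.inline-code]process_output.txt[.inline-code] used here are hypothetical, chosen only so that each pattern has something to match; the helper function [.inline-code]span[.inline-code] is likewise just for this sketch.

```shell
# Hypothetical stand-in for process_output.txt.
workdir=$(mktemp -d)
printf '%s\n' \
  'task scheduled, retry scheduled' \
  'task success' \
  'cleanup failure' \
  'run complete, log complete' > "$workdir/process_output.txt"

# Prefer ggrep where it is installed, otherwise fall back to grep.
GREP=$(command -v ggrep || command -v grep)

# Illustrative helper: run one multiline search and make the output printable.
span() {
  "$GREP" -Pzo "$1" "$workdir/process_output.txt" | tr '\0' '\n'
}

success_to_failure=$(span '(?s)success.*failure')
scheduled_to_complete=$(span '(?s)scheduled.*complete')
failure_to_complete=$(span '(?s)failure.*complete')

echo "$success_to_failure"
echo "$scheduled_to_complete"
echo "$failure_to_complete"

rm -rf "$workdir"
```

In each case the match begins at the first occurrence of the start word and ends at the last occurrence of the end word, even when a line contains the word more than once.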

[#using-pcre2grep-to-grep-multiple-lines]Using [.inline-code]pcre2grep[.inline-code] to grep multiple lines[#using-pcre2grep-to-grep-multiple-lines]

An alternative is the [.inline-code]pcre2grep[.inline-code] utility, which simplifies the command to a single [.inline-code]-M[.inline-code] flag:

 $ pcre2grep -M 'from(\n|.)*to' <file_name>

The [.inline-code]-M[.inline-code] (or [.inline-code]--multiline[.inline-code]) flag allows patterns to match more than one line. [.inline-code]pcre2grep[.inline-code] has built-in support for Perl-compatible regular expressions and may already be installed on your system alongside grep. Otherwise, it can be installed using your package manager.

Alternatively, you can use the [.inline-code](?s)[.inline-code] trick from before to turn on PCRE_DOTALL and make the dot character match newlines as well, which simplifies the command to:

 $ pcre2grep -M '(?s)from.*to' <file_name>

[#common-gotchas-when-using-grep-multiple-lines]Common “gotchas” when using [.inline-code]grep[.inline-code] across multiple lines[#common-gotchas-when-using-grep-multiple-lines]

[#grep-will-use-first-and-last-instances-of-words][.inline-code]grep[.inline-code] will use the first and last instances of the words[#grep-will-use-first-and-last-instances-of-words]

When using [.inline-code]grep[.inline-code] across multiple lines, be aware that the match starts at the first instance of the [.inline-code]from[.inline-code] word and, because [.inline-code].*[.inline-code] is greedy, runs all the way to the last instance of the [.inline-code]to[.inline-code] word. This will likely differ from the output you expected when [.inline-code]from[.inline-code] or [.inline-code]to[.inline-code] occurs more than once in your document. By contrast, tools such as [.inline-code]awk[.inline-code] or [.inline-code]sed[.inline-code] start at the first line containing [.inline-code]from[.inline-code] and stop as soon as a line containing [.inline-code]to[.inline-code] is reached.
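
The sketch below contrasts the default greedy match with PCRE's lazy quantifier [.inline-code].*?[.inline-code]. The lazy form is not part of the commands above; it is shown here as one possible way to stop at the first [.inline-code]to[.inline-code]. The sample file and variable names are hypothetical.

```shell
# Hypothetical sample file with two from/to pairs.
sample=$(mktemp)
printf '%s\n' 'from A' 'to B' 'noise' 'from C' 'to D' > "$sample"

# Prefer ggrep where it is installed, otherwise fall back to grep.
GREP=$(command -v ggrep || command -v grep)

# Greedy: one match, from the first "from" to the last "to".
greedy=$("$GREP" -Pzo '(?s)from.*to' "$sample" | tr '\0' '\n')

# Lazy: each match stops at the nearest "to", so there are two matches.
lazy=$("$GREP" -Pzo '(?s)from.*?to' "$sample" | tr '\0' '\n')

printf 'greedy:\n%s\n\nlazy:\n%s\n' "$greedy" "$lazy"

rm -f "$sample"
```

The greedy result includes the [.inline-code]noise[.inline-code] line between the two pairs, while the lazy result does not.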

[#grep-uses-regex-standards][.inline-code]grep[.inline-code] uses regex standards[#grep-uses-regex-standards]

It is important to know that the “strings” following the [.inline-code]grep[.inline-code] command are matched against the document as regular expressions, not as whole words. This means that simply typing in [.inline-code]fail[.inline-code] will also match [.inline-code]failure[.inline-code]. To match only whole words across multiple lines, you can use the word-boundary anchor [.inline-code]\b[.inline-code]. For example:

 $ grep -Pzo '(?s)\bfail\b.*\n.*\bsuccess\b' <file_name>
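
To see the effect of the word boundaries, the following sketch runs the pattern with and without [.inline-code]\b[.inline-code] on a hypothetical file in which [.inline-code]failure[.inline-code] appears before [.inline-code]fail[.inline-code]:

```shell
# Hypothetical sample file, created only for this illustration.
sample=$(mktemp)
printf '%s\n' 'step one: failure' 'step two: fail' 'step three: success' > "$sample"

# Prefer ggrep where it is installed, otherwise fall back to grep.
GREP=$(command -v ggrep || command -v grep)

# Without \b the match starts inside "failure" on the first line.
loose=$("$GREP" -Pzo '(?s)fail.*success' "$sample" | tr '\0' '\n')

# With \b only the standalone word "fail" on the second line can start a match.
strict=$("$GREP" -Pzo '(?s)\bfail\b.*\n.*\bsuccess\b' "$sample" | tr '\0' '\n')

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"

rm -f "$sample"
```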

[#grep-is-case-sensitive][.inline-code]grep[.inline-code] is case sensitive[#grep-is-case-sensitive]

[.inline-code]grep[.inline-code] matching is case sensitive by default, but you can add the [.inline-code]-i[.inline-code] flag to ignore case.
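
A brief sketch of the difference, assuming a hypothetical file in which the start and end words are capitalised differently from the pattern:

```shell
# Hypothetical sample file, created only for this illustration.
sample=$(mktemp)
printf '%s\n' 'FROM here' 'middle' 'To there' > "$sample"

# Prefer ggrep where it is installed, otherwise fall back to grep.
GREP=$(command -v ggrep || command -v grep)

# Case sensitive: nothing matches, so the result is empty.
sensitive=$("$GREP" -Pzo '(?s)from.*to' "$sample" | tr '\0' '\n') || true

# Adding -i ignores case for both the start and the end word.
insensitive=$("$GREP" -Pzoi '(?s)from.*to' "$sample" | tr '\0' '\n')

printf 'sensitive: [%s]\ninsensitive:\n%s\n' "$sensitive" "$insensitive"

rm -f "$sample"
```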

[#find-out-more]Find out more about [.inline-code]grep[.inline-code][#find-out-more]

As always, if you want to find out more about how to use the grep tool, you can use:

 $ man grep

Which will print out all the options with explanations. Or:

 $ grep --help

Which will print out a short page of all the available options.

[#alternative-tools]Alternative tools[#alternative-tools]

Alternatively, tools such as awk and sed can make this command much simpler to implement. For awk the command would be:

 $ awk '/from/,/to/' <file_name>

where [.inline-code]from[.inline-code] is the first word or regular expression you are searching for and [.inline-code]to[.inline-code] is the final word or regular expression you are looking for.

In sed the command is similar and takes the form:

 $ sed -n '/from/,/to/p' <file_name>

As with the prior example, [.inline-code]from[.inline-code] is the first word or regular expression and [.inline-code]to[.inline-code] is the final word or regular expression you are looking for. Both tools work line by line, so they print whole lines rather than only the matched text.
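
Both range commands can be compared on a small stand-in file. The file contents and variable names below are hypothetical and serve only as an illustration.

```shell
# Hypothetical sample file, created only for this illustration.
sample=$(mktemp)
printf '%s\n' 'header' 'from here' 'middle' 'to there' 'footer' > "$sample"

# awk prints every line from the first /from/ line through the next /to/ line.
awk_out=$(awk '/from/,/to/' "$sample")

# sed -n suppresses default output; p prints only the lines in the range.
sed_out=$(sed -n '/from/,/to/p' "$sample")

printf 'awk:\n%s\n\nsed:\n%s\n' "$awk_out" "$sed_out"

rm -f "$sample"
```

Both commands print the same three whole lines here, including the text after [.inline-code]to[.inline-code] on the last line of the range.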