
How to extract lines with a specific character count from a text file in the Linux terminal?


I needed to extract the lines with a specific character count from a huge text file. I searched online for possible answers and tried them all. Here is what I found.

Using grep

grep is a command-line utility for searching plain-text datasets for lines that match a regular expression. Its name comes from the ed command g/re/p (global regular expression search and print), which has the same effect.

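To use grep, we pass the -E flag (extended regular expressions) with a pattern that matches only lines of a specific character count: ^ anchors the start of the line, .{N} matches exactly N characters, and $ anchors the end.

Here is the command to get the lines that are exactly 1 character long from a plain-text file called pt.txt: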

grep -E '^.{1}$' pt.txt

And here is a command to extract lines that have only 2 characters from a plain text file called pt.txt:

grep -E '^.{2}$' pt.txt

And here is a command to extract text lines that are only 8 characters long from a plain text file called pt.txt:

grep -E '^.{8}$' pt.txt
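
The three commands above differ only in the number inside the braces, so the count can be turned into a parameter. Here is a minimal sketch, assuming a helper called lines_of_length (a made-up name, not a standard command) and a throwaway sample file. It also uses the {min,max} interval form of extended regular expressions to accept a range of lengths.

```shell
# lines_of_length FILE MIN [MAX]
# Hypothetical helper: prints the lines of FILE whose length is between MIN and MAX.
lines_of_length() {
    file=$1
    min=$2
    max=${3:-$2}    # MAX defaults to MIN, so a single number means "exactly MIN"
    # Builds a pattern such as ^.{2,3}$ and hands it to grep -E.
    grep -E "^.{${min},${max}}\$" "$file"
}

# Throwaway sample file so the real pt.txt is left untouched.
sample=$(mktemp)
printf 'a\nab\nabc\nabcdefgh\n' > "$sample"

lines_of_length "$sample" 2      # lines that are exactly 2 characters long
lines_of_length "$sample" 2 3    # lines that are 2 or 3 characters long

rm -f "$sample"
```

With only one number the helper behaves like the commands above; the second number is optional.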

Using egrep

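egrep is just a wrapper script that runs grep -E. GNU grep 3.8 and later print a warning that egrep is obsolescent, so grep -E is the safer spelling in new scripts.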

egrep is grep -E

So, you can use the command like this.

egrep '^.{1}$' pt.txt

The command above extracts every line that is exactly one character long.

Here is the command to get all lines that contain only 10 characters:

egrep '^.{10}$' pt.txt

Using awk

AWK (/ɔːk/) is a domain-specific language designed for text processing and typically used as a data extraction and reporting tool. Like sed and grep, it is a filter, and it is a standard feature of most Unix-like operating systems.

To get the length of the current line, we use awk’s length() function. We compare that length with == against the number of characters we want, which checks whether the current line is exactly that long. If the condition is true, awk prints the line by default. That’s great! Let’s do that!

awk 'length() == 4' pt.txt

In the above command, awk prints to standard output (stdout) every line that is exactly 4 characters long.

If you want to extract all lines that contain 12 characters, use this awk command.

awk 'length() == 12' pt.txt
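
Because the condition is an ordinary awk expression, the length does not have to be hard-coded. The following is a sketch under a few assumptions: the helper name awk_lines_of_length and the variable names min and max are made up for illustration, and the sample file is a throwaway. The -v option copies a shell value into an awk variable before the program runs.

```shell
# awk_lines_of_length FILE MIN [MAX]
# Hypothetical helper: prints the lines of FILE whose length is between MIN and MAX.
awk_lines_of_length() {
    # MAX defaults to MIN, so a single number means "exactly MIN".
    awk -v min="$2" -v max="${3:-$2}" \
        'length($0) >= min && length($0) <= max' "$1"
}

# Throwaway sample file so the real pt.txt is left untouched.
sample=$(mktemp)
printf 'a\nab\nabc\nabcd\nabcdefghijkl\n' > "$sample"

awk_lines_of_length "$sample" 4      # lines that are exactly 4 characters long
awk_lines_of_length "$sample" 2 4    # lines that are 2 to 4 characters long

rm -f "$sample"
```

length($0) is the same as length() with no argument; spelling out $0 only makes it explicit that the whole line is measured.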

Use cases

Extracting lines of a specific length comes up in several programming domains, such as data science and defensive or offensive security.

Offensive security experts may use it to extract passwords of a specific length from plain-text datasets (called wordlists or password lists).

Notes

Accuracy of results

I ran into a strange situation when using the three commands: awk consistently returned more lines than grep or egrep, and the extra lines it returned really did have the requested length.

Figure: accuracy of grep, egrep, and awk

So, why didn’t grep grep them all? Unfortunately, I DON’T KNOW.
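
One possible explanation, not verified against the file used in this post, is character encoding. In a UTF-8 locale, GNU grep's . matches only validly encoded characters, so a line that contains bytes that are not valid UTF-8 can be skipped, while awk may still measure it. Here is a minimal sketch of a check, using a made-up sample file and forcing the C locale so that both tools treat every byte as one character:

```shell
# Made-up sample: one ASCII line and one line of four bytes that are not valid UTF-8.
sample=$(mktemp)
printf 'abcd\n\377\376\375\374\n' > "$sample"

# LC_ALL=C makes both tools work byte by byte, so neither one rejects the odd bytes.
grep_count=$(LC_ALL=C grep -c -E '^.{4}$' "$sample")
awk_count=$(LC_ALL=C awk 'length() == 4 { n++ } END { print n + 0 }' "$sample")

echo "grep: $grep_count, awk: $awk_count"

rm -f "$sample"
```

If the two counts agree on your own file once LC_ALL=C is set, encoding is a likely culprit. Keep in mind that in the C locale the length is counted in bytes, not characters, so lines with multi-byte characters are measured differently.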

Performance and execution speed

I noticed another thing: awk is slower than grep (or egrep). The grep command ran in 47 milliseconds and the egrep command in 48 milliseconds, but the awk command took more than 12 WHOLE seconds. Awk is significantly slower than grep.

Figure: benchmarking grep, egrep, and awk
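
To run this kind of comparison yourself, the bash time keyword is enough. The sketch below assumes bash, uses made-up function names, and generates its own throwaway input with seq, so its timings will not match the numbers above. The C-locale variant is included only as something to try, since locale handling can affect how fast awk measures lines; it is not a measured result.

```shell
# Throwaway input: the numbers 1 to 200000, one per line (9000 of them have 4 digits).
sample=$(mktemp)
seq 1 200000 > "$sample"

# Hypothetical helpers: each prints how many lines are exactly 4 characters long.
count_with_grep()  { grep -c -E '^.{4}$' "$1"; }
count_with_awk()   { awk 'length() == 4 { n++ } END { print n + 0 }' "$1"; }
count_with_awk_c() { LC_ALL=C awk 'length() == 4 { n++ } END { print n + 0 }' "$1"; }

# "time" is a bash keyword; it reports real, user, and sys time on stderr.
time count_with_grep "$sample"
time count_with_awk "$sample"
time count_with_awk_c "$sample"

rm -f "$sample"
```

Running each command a few times and comparing the "real" values gives a fairer picture than a single run, because the first run also pays for reading the file from disk.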

My opinion and preference

I prefer awk over grep in this case because of the accuracy I saw in my experiments. Its slower execution does not matter much to me.

I hope you enjoyed reading this post as much as I enjoyed writing it. If you know someone who could benefit from this information, send them a link to this post. If you want to be notified about new posts, follow me on YouTube, Twitter (X), LinkedIn, and GitHub.
