How to count the number of words in a PDF file from the Linux CLI
Using pdftotext:
Installation:
- If it’s not installed, you’ll need to install the poppler-utils package, which includes pdftotext:
    sudo apt install poppler-utils
  or
    yum install poppler-utils
  depending on your distribution.
Usage:
- Once installed, you can convert a PDF to text and then count the words as follows:
    pdftotext input.pdf - | wc -w
  Here, input.pdf is your source PDF file, and wc -w counts the number of words. The - in the pdftotext command specifies that the output should be sent to stdout, which is then piped into wc.
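If you want to call the same pipeline from a script, here is a minimal Python sketch, assuming the pdftotext binary is on your PATH. The function names count_words and pdf_word_count are illustrative and not part of any library. It counts whitespace-separated tokens, which is roughly what wc -w does.

```python
import subprocess


def count_words(text: str) -> int:
    """Count whitespace-separated tokens, roughly like `wc -w`."""
    return len(text.split())


def pdf_word_count(pdf_path: str) -> int:
    """Run `pdftotext <pdf_path> -` and count the words it writes to stdout.

    Assumes the pdftotext binary (from poppler-utils) is installed.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    # Decode leniently so unexpected bytes do not abort the count.
    text = result.stdout.decode("utf-8", errors="replace")
    return count_words(text)
```

Calling pdf_word_count("input.pdf") should then give a number close to the shell pipeline's result. Small differences are possible because Python and wc do not treat every whitespace character identically.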
Using pdfgrep:
Installation:
- Install pdfgrep using your package manager:
    sudo apt install pdfgrep
  or
    yum install pdfgrep
Usage:
- pdfgrep is generally used for pattern matching, but you can use it to match runs of word characters and pipe them to wc -w like this:
    pdfgrep -o '\w+' input.pdf | wc -w
  This might be slower and is generally more useful when you’re looking for specific words.
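A `\w+` pattern and wc -w define "word" differently, so the two approaches can report different totals for the same text. The sketch below illustrates the difference with Python's re module on a made-up sample sentence. It assumes pdfgrep's regex engine treats `\w` similarly for plain ASCII text, which may not hold in every detail.

```python
import re

# Hypothetical sample text, chosen to contain an apostrophe and hyphens.
sample = "Don't use state-of-the-art tools blindly."

# Whitespace-delimited tokens, which is what `wc -w` counts.
whitespace_tokens = sample.split()

# Runs of word characters, which is what a `\w+` pattern matches.
regex_tokens = re.findall(r"\w+", sample)

print(len(whitespace_tokens))  # 5
print(len(regex_tokens))       # 9
```

The apostrophe and hyphens split "Don't" and "state-of-the-art" into several `\w+` matches, so the regex count comes out higher.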
Using Python with PyPDF2:
You can also create a small Python script to do the job using the PyPDF2 library.
Installation:
- Install PyPDF2 using pip:
    pip install PyPDF2
Usage:
- Here’s a simple Python script you could use:
    import PyPDF2

    def count_words_in_pdf(file_path):
        """Return the number of whitespace-separated words in the PDF."""
        with open(file_path, 'rb') as f:
            # PdfReader replaces the older PdfFileReader, which no longer works in PyPDF2 3.0.
            reader = PyPDF2.PdfReader(f)
            total_words = 0
            for page in reader.pages:
                # Fall back to an empty string for pages with no extractable text.
                text = page.extract_text() or ""
                total_words += len(text.split())
        return total_words

    if __name__ == "__main__":
        file_path = "input.pdf"
        print(count_words_in_pdf(file_path))
  Save this script and run it with Python. It will read input.pdf and print out the number of words.
Using pdf2txt.py from the pdfminer suite:
Installation:
- You can install pdfminer like this:
    pip install pdfminer.six
Usage:
    pdf2txt.py input.pdf | wc -w
This command will convert the PDF to text and pipe it to wc to count the words.
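pdfminer.six can also be used as a library instead of through pdf2txt.py. The following is a minimal sketch using its high-level extract_text function. The wrapper name count_words_with_pdfminer is illustrative, and counting by whitespace split is an assumption chosen to mirror wc -w.

```python
def count_words_with_pdfminer(pdf_path: str) -> int:
    """Extract text with pdfminer.six and count whitespace-separated words."""
    # Imported here so this module still loads where pdfminer.six is absent.
    from pdfminer.high_level import extract_text

    text = extract_text(pdf_path)
    return len(text.split())
```

Calling count_words_with_pdfminer("input.pdf") avoids starting a separate process, which can be convenient when the count is only one step in a larger script.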
Performance Considerations:
Accuracy: Not all methods have the same level of accuracy. Text layout in PDFs can be complicated, and the above methods might not capture all the nuances.
Speed: Native CLI tools like pdftotext and pdfgrep are generally faster compared to Python-based solutions, which have to spin up a Python interpreter.
Complexity: pdftotext and pdfgrep are easier to use for simple tasks, but Python-based solutions offer more flexibility and control.
Portability: The CLI tools depend on certain packages that need to be installed, but a Python script could be more portable, especially if you’re going to run it on different systems.
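Speed differences depend heavily on the PDF, so it can be worth measuring on your own files. Here is a small timing helper as a sketch. The name timed is hypothetical, and it measures only the call itself, not Python interpreter startup.

```python
import time
from typing import Any, Callable, Tuple


def timed(func: Callable[..., Any], *args: Any) -> Tuple[Any, float]:
    """Call func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

For example, timed(count_words_in_pdf, "input.pdf") would return the word count from the PyPDF2 script together with the time the call took. To include interpreter startup, or to time the CLI pipelines, the shell's time command is the more direct tool.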
The method you choose will likely depend on your specific requirements. If you just need a quick and dirty solution, pdftotext piped into wc is easy and effective. For more complex requirements, such as handling multiple PDFs, incorporating additional logic, or even using more advanced text analysis techniques (like natural language processing), you might want to look at Python-based solutions. These provide the building blocks to craft a tailored solution that could evolve with your needs.

The elegance of the Linux command line is that it offers a wide range of tools that can be combined in an almost infinite number of ways to solve problems both big and small. This toolbox gets even more powerful when you integrate it with scripting languages like Python, enabling you to tackle not just text-processing tasks but a multitude of other challenges as well.
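As one illustration of handling multiple PDFs, the sketch below applies a word-counting function to every *.pdf file in a directory. The name count_words_in_directory is hypothetical. The counting function is passed in as a parameter, so any of the approaches above could be plugged in.

```python
from pathlib import Path
from typing import Callable, Dict


def count_words_in_directory(
    directory: str,
    count_words_in_pdf: Callable[[str], int],
) -> Dict[str, int]:
    """Map each *.pdf file name in `directory` to its word count."""
    counts: Dict[str, int] = {}
    # sorted() gives a stable, alphabetical order; the glob is not recursive.
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        counts[pdf_path.name] = count_words_in_pdf(str(pdf_path))
    return counts
```

Passing the counter as an argument keeps the directory walk independent of the PDF library, so switching extraction methods later does not require changing this function.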