Grep Word docx files

Posted on Wednesday, April 20, 2016



I have tons of Word documents, and recently I wanted to grep through them.  Poking around, I could not find a good tool to get the job done.  Then I found a post that suggested converting the Word documents to text files and grepping those instead.  That seemed like a good idea, so I went that route.

One tool I found that converts docx files to text is docx2txt, http://docx2txt.sourceforge.net/ [1]






Ubuntu Install


Installing it on Ubuntu is pretty easy.


 > sudo apt-get install docx2txt







Cygwin Install


Installing it on Cygwin takes a few more steps.  Here are the commands I ran to get it installed.


 > curl -L "http://downloads.sourceforge.net/docx2txt/docx2txt-1.4.tgz?download" -o docx2txt-1.4.tgz
 > tar -xvzf docx2txt-1.4.tgz
 > cp docx2txt-1.4/docx2txt.pl /usr/bin/docx2txt
 > chmod +x /usr/bin/docx2txt
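Before moving on, it can help to confirm that everything is actually on the PATH. Here is a small sketch with a hypothetical helper called check_tools (my own name, not part of docx2txt). docx2txt.pl is a Perl script, and I am assuming it also needs unzip to open the .docx archive, so check the docx2txt README if in doubt.

```shell
#!/bin/bash
# check_tools: hypothetical helper that reports which commands are missing.
# Returns 0 when every command is found, 1 otherwise.
check_tools() {
  local tool
  local missing=0
  for tool in "$@"; do
    # command -v succeeds only if the shell can resolve the name
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_tools perl unzip docx2txt` should print nothing if the install worked.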




Convert them all in place


If you simply want to batch convert a ton of .docx files and create each .txt file in the same folder as its original .docx file, run this simple command.  The -print0 and -0 options keep file names with spaces or quotes intact.


 > find "$(pwd)" -iname "*.docx" -print0 | xargs -0 -I{} docx2txt {}





Now test
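A quick way to check the conversion is to grep the generated .txt files. This is only a minimal sketch: grep_txt is a hypothetical helper name, and it assumes GNU grep (for the --include option), which is what Ubuntu and Cygwin ship.

```shell
#!/bin/bash
# grep_txt: hypothetical helper that searches only the converted .txt files.
# Usage: grep_txt PATTERN [DIRECTORY]   (DIRECTORY defaults to the current one)
grep_txt() {
  local pattern="$1"
  local dir="${2:-.}"
  # -r recurse, -i ignore case, -n show line numbers;
  # --include limits the search to files ending in .txt
  grep -rin --include='*.txt' -- "$pattern" "$dir"
}
```

For example, `grep_txt "budget"` prints each matching line as file:line:text.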

 








Convert them all in another directory


What if you did not want the .txt files sitting next to the .docx files, but wanted to mirror the folder structure in a new folder instead?

Well, I spent a while trying to make a neat one-liner, but eventually gave up on that idea and wrote this script.


#!/bin/bash
#
# Simple script to convert docx files to txt
# files and put them in a new folder.
# To change the folder, edit the NEW_FOLDER variable.
#
##################################################

NEW_FOLDER="txtFiles"
BASE_FOLDER=$(pwd)

find "$BASE_FOLDER" -iname "*.docx" |
while IFS= read -r docxfile
do
  # Path relative to BASE_FOLDER, so it can be rebuilt under NEW_FOLDER
  REL_PATH="${docxfile#"$BASE_FOLDER"/}"
  # Swap the extension for .txt (also covers .DOCX, which -iname matches)
  TXT_FILE="$BASE_FOLDER/$NEW_FOLDER/${REL_PATH%.*}.txt"
  DIR=$(dirname "$TXT_FILE")

  echo "$TXT_FILE"

  mkdir -p "$DIR"
  docx2txt "$docxfile" "$TXT_FILE"
done


This script places all the converted files into a folder called txtFiles, mirroring the original folder layout.  Now you can just grep there.
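Once the grep hits come back, you usually want to know which Word document to open. Here is a rough sketch that maps matching .txt files back to their .docx names. find_docx is a hypothetical helper of my own, and it assumes GNU grep, that you run it from the folder where the script was run, and that the originals use a lowercase .docx extension.

```shell
#!/bin/bash
# find_docx: hypothetical helper that lists the original .docx paths
# (relative to the base folder) whose mirrored .txt file matches PATTERN.
# Usage: find_docx PATTERN [MIRROR_FOLDER]   (defaults to txtFiles)
find_docx() {
  local pattern="$1"
  local mirror="${2:-txtFiles}"
  local txt rel
  mirror="${mirror%/}"   # drop a trailing slash so the prefix strip works
  # -r recurse, -l print file names only, -i ignore case
  grep -rli --include='*.txt' -- "$pattern" "$mirror" |
  while IFS= read -r txt
  do
    rel="${txt#"$mirror"/}"             # strip the mirror folder prefix
    printf '%s\n' "${rel%.txt}.docx"    # swap .txt back to .docx
  done
}
```

For example, `find_docx "budget"` prints one relative .docx path per matching document.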

References

[1]        docx2txt home page, http://docx2txt.sourceforge.net/ (accessed 04/2016)



