Use httrack to download an entire website

Posted on Thursday, October 1, 2015






The other day I discovered this awesome command line tool that lets you download an entire website and save it locally.  I used it to download a WordPress site of mine and convert it to a static backup of the site (images and all).





Download and install


The main website for httrack is https://www.httrack.com/ [1]

Ubuntu


Installing on Ubuntu is cake


     > sudo apt-get install webhttrack
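
That package is actually the browser based front end; it pulls in the command line tool as a dependency.  If you only want the command line tool, I believe the plain httrack package is enough (that's my read of the Ubuntu repos, so double check):


     > sudo apt-get install httrack   # CLI only, my assumption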








Cygwin


When looking into how to do this I found this page http://sourceforge.net/projects/cygwin-ports/ [3],

which led me to this page http://cygwinports.org/ [4].





From Cygwin run setup.exe, but use the -K flag to point at the Cygwin Ports project's key.


     > cygstart -- /cygdrive/c/cygwin64/setup-x86_64.exe -K http://cygwinports.org/ports.gpg




When you get to "Choose a Download Site"


Add ftp://ftp.cygwinports.org/pub/cygwinports  


Click Add and Next





Doh!  Error

unable to get setup.ini from ftp://ftp/cygwinports.org/pub/cygwinports

… what did I do wrong?

Oh, I had a space in the URL.  Once I removed that, it worked.




Cool, now I can just search for httrack and it shows up :)



Start a new Cygwin window and check if it's installed.


     > which httrack
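
If it installed correctly you should get a path back.  Something like this (the exact path is my guess, yours may differ):


     /usr/bin/httrack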







How does it work?


Well, for a more advanced explanation check out https://www.httrack.com/html/filters.html [2]

Basically you give it a start location.  For example, look at this command.


     > httrack -v "http://www.whiteboardcoder.com/2015/08/"  -O whiteboardcoder


It will then go to http://www.whiteboardcoder.com/2015/08/index.html, copy the page, and start copying images and other pages that the first starter page links to.  It's smart: it won't copy pages outside the URL you gave it.  In fact, unless you tell it otherwise, it will only drill down, not up.  So… even if a page has a link to http://www.whiteboardcoder.com it won't copy http://www.whiteboardcoder.com/index.html.  It won't even copy other subdomains… ex http://other-subdomain.whiteboardcoder.com/
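
That said, if you did want it to crawl up and grab the whole www site, my read of the filter docs [2] is that you can allow it with a filter.  Something like this (a sketch, I have not tested this exact command):


 > httrack -v "http://www.whiteboardcoder.com/2015/08/"  -O whiteboardcoder "+http://www.whiteboardcoder.com/*"   # untested sketch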



You can define how you want to filter and even what you want to filter out.

For me, I want to copy all of *.whiteboardcoder.com.  Here is the command that would do that.



 > httrack -v "http://www.whiteboardcoder.com"  -O whiteboardcoder "+*.whiteboardcoder.com/*"


This is the filter


 "+*.whiteboardcoder.com/*"


All subdomains and every file (that the site links to… as long as they are part of the same domain).

If you want to get detailed about what you skip and what you grab, see https://www.httrack.com/html/filters.html [2].  It may be of some value to you… for example, if you want to skip all .zip files you could do it with
"-*.zip"




Let me run this with the time command to see how long it takes to download my entire blog (which is hosted at blogger.com)


 > time httrack -v "http://www.whiteboardcoder.com"  -O whiteboardcoder "+*.whiteboardcoder.com/*"







Took almost 1.5 hrs to download my site.
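
The nice part is you don't have to start over from scratch if something breaks.  If I am reading the httrack man page right, it keeps a cache in the project folder, so from there --continue picks up an interrupted mirror and --update refreshes a finished one (a sketch, check the man page for your version):


 > cd whiteboardcoder
 > httrack --update   # my read of the man page, double check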





Now I have a nice self-contained version of my website on my desktop.
External links will still take me to other URLs, but clicking from page to page just loads files locally from my hard drive.
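
As far as I can tell httrack drops a top-level index.html into the project folder, so opening that in a browser is the easy way in (xdg-open is just the stock Ubuntu way to launch the default browser):


 > cd whiteboardcoder
 > xdg-open index.html   # assumes the default output layout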





And the original URL is preserved in the folder structure.  Nice.

References

[1]        httrack main site
                        https://www.httrack.com/
                Accessed 10/2015
[2]        Filters
                        https://www.httrack.com/html/filters.html
                Accessed 10/2015
[3]        cygwin-ports
                        http://sourceforge.net/projects/cygwin-ports/
                Accessed 10/2015
[4]        cygwin-ports how to run
                        http://cygwinports.org/
                Accessed 10/2015

