gitignore and calculating tools

Posted on Wednesday, August 6, 2014


I finally decided to take the plunge, drink the Kool-Aid, or whatever metaphor fits, and begin using git on some of my personal files.

I ended up learning a lot more about git in the process.  As a result this document is a little long; if you want, skip ahead and read the summary, which may be good enough for most people.

To do this right you need a server that can act as a remote repository (so you have off-computer, if not also offsite, storage).  I covered how to set one up in this article: http://www.whiteboardcoder.com/2012/08/installing-git-server-on-ubuntu-1204.html



If I had a simple situation I would just go into the folder I want in a repository and run the following commands:


> git init
> git add .
> git commit -m "initial commit"




Then, on your git server, init a bare repo that you can push your repository to.  In this example I am sudo'ing to the git user (who does not have a normal shell, so you have to designate one):


> sudo su git -s /bin/bash
> cd /git/repos
> mkdir project.git
> cd project.git
> git --bare init



Finally, push it up to the remote server (adjust the URL and path to match your server):


> git remote add origin git@example.com:/git/repos/project.git
> git push origin master



But… if you are like me, you have a mix of files and folders that you do not want to add to git, but that you also do not want to move to a different folder on your machine.

Some of these folders just make no sense to store in git, and some files, like large videos, are too big to bother tracking.

First I had to find a few command-line tools to help me calculate how much space I would save by skipping certain folders or file types.  Then, after figuring out what to skip and how much it saves me, how do I properly set up my .gitignore file to make it work?



Calculating sizes

First I will go over the easy, no-brainer command-line tools you can use.

du (disk usage)


The size of a folder and all its files can be found using the du command.

From within a folder run the following command.


> du -hs .




Or, to check a subfolder, replace the "." with the folder name:


> du -hs archive







find & awk


I found this nice little command line snippet at http://stackoverflow.com/questions/599027/calculate-size-of-files-in-shell [1] for calculating the size of all files of a certain type.


First, it's important to know that this command depends on how your system formats output when you run ls from find.  As a test, run the following command:


> find . -iname "*" -ls




In my case the 7th column has the size of the file in bytes; we can use that to calculate the cumulative size of a file type.

Here is an example that finds the total size, in MiB, of all .jpg files in this folder and all its subfolders (-iname makes the search case-insensitive, so it matches .jpg and .JPG):


> find . -iname "*.jpg" -ls | awk '{total +=$7} END {print total/(1024*1024)}'


This command will return the size in MiB.

Run a few of these on different file types to see how much space a particular file type is taking up.
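
If you want to check several file types in one go, a small loop saves some typing.  This is just a sketch; swap in whatever extensions you care about:


#Sum up the size of each file type, in MiB, under the current folder
for ext in jpg png pdf mp4
do
  printf '%s: ' "$ext"
  find . -iname "*.$ext" -ls | awk '{total += $7} END {print total/(1024*1024) " MiB"}'
done
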




find -size


I almost forgot to add this one: find -size.


> find . -size +100M


This will list all files that are at least 100 MiB in size.

Or this one, which will also output the size of each file it finds over 128 MiB:


> find "$PWD" -size +128M | xargs -I {} ls -alh "{}"






Find size of folders


Find the size of every folder and order them by size:


> find . -type d | xargs -I {} du -s {} | sort -V


The size is listed in 1 KiB blocks (du's default unit).
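
If you only care about the immediate subfolders, a human-readable variant (this relies on GNU du and GNU sort) is:


> du -h --max-depth=1 . | sort -h
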



wc and tar


A file count (which wc can give you) can be a very important thing to know as well.

I happen to have a few old systems with 10,000+ files that I really don't need to keep as loose files; I can archive them in a tar file.  But first I need to find them.  (Why do this?  Backing up 10,000 tiny files is far more time consuming than backing up one file of the same size.)


> find . -type f | wc -l


This will list the number of files within this folder and all its subdirectories.

I tried to find a one-liner that would list the number of files within each folder (and its subfolders) but I was unsuccessful.  If you know of one please post it.
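
In the meantime, this short loop gets close; it counts the files under each immediate subfolder (subfolders included) and sorts by count:


#Count files under each immediate subfolder, smallest counts first
for d in */
do
  printf '%8d %s\n' "$(find "$d" -type f | wc -l)" "$d"
done | sort -n
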




Here is a simple example of tarring a folder up:


> tar cvzf myfolder.tar.gz myfolder


I had one particular folder that contained 80,000 small files.  Tarring this folder saved me about 800 MiB, and I was able to rsync all the directories in 2 min 30 sec vs 8 min.
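
For completeness, getting at the archive later is just as easy: list its contents without unpacking, or extract it back out.


> tar tzf myfolder.tar.gz
> tar xvzf myfolder.tar.gz
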




gitignore


Hopefully, now you have a list of files and folders you want to ignore and not add to your git repository.

To ignore files you need to use the .gitignore file.

In the base directory create a .gitignore file.


> vi .gitignore


But, before I get into my examples, how can you be sure git is really ignoring a file?

Here is how I confirmed my .gitignore when creating a new repository (this assumes you have run "git init ." and nothing else yet).

After running


> git init .


Run this


> git status





Here you can see the files and folders in the current folder that are untracked but would be added if you ran git add .

To check a subfolder just run something like this




> git status img-folder




This will list the same thing, but for the folder you designate.

If I edit the .gitignore file


> vi .gitignore


And place the following in it.


*.jpg





Now run


> git status img-folder



Now you can see that the .jpg files are being ignored.

I found using status very helpful for making sure my .gitignore file was correct.
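
As an aside, git 1.8.2 and newer (and I end up on git 1.9 below anyway) also have a check-ignore command that reports exactly which .gitignore rule matches a given path.  The file name here is just a made-up example:


> git check-ignore -v img-folder/photo.jpg
.gitignore:1:*.jpg      img-folder/photo.jpg
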




gitignore examples


Here are some of my .gitignore one-liners.



#Ignore all directories and their contents
*/*


If this were the only entry you had, you would only get the files in the top directory; everything else would be ignored.



#Ignore anything in the top-level directory whose name starts with Logo
/Logo*





#Ignore those pesky .DS_Store OS X files
.DS_Store





*.png
*.JPG
*.pdf


Ignore all .png, .JPG, and .pdf files.



Double asterisk.  In theory I should be able to do something like this


selling/**/*.JPG


The ** matches zero or more directories, so this would cause any JPG file within the selling directory, or any of its subdirectories, to be ignored.  However, this feature was introduced in git 1.8.2; a quick git --version shows cygwin only has 1.7.9, so I can't use this on my box.
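
If you are stuck on an older git, a clunky workaround is to spell the depths out by hand; this sketch assumes your tree only goes a few levels deep:


selling/*.JPG
selling/*/*.JPG
selling/*/*/*.JPG
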

I decided to update git on cygwin to 1.9 just to make sure I don't get myself in trouble.







Installing git v1.9 on cygwin and ubuntu server


To install a newer version of git on cygwin follow this procedure (I am going to version 1.9):


     > git clone https://github.com/git/git.git
     > cd git
     > git checkout v1.9.0



Now that you have v1.9 checked out, do the following:


     > make configure
     > ./configure --prefix=/usr/local
     > make
     > make -i install



After doing this I opened a new cygwin window and ran the following to confirm the update



     > git --version


 

I also had to install git v1.9.0 on my Ubuntu 10.04 server.  To do this I ran the following commands:


     > git clone https://github.com/git/git.git
     > cd git
     > git checkout v1.9.0



Now that you have v1.9 checked out, do the following:


     > make configure
     > ./configure --prefix=/usr
     > make
     > make -i install


After doing this I opened a new shell on the server and ran the following to confirm the update:


     > git --version












Issues I ran into


There were a couple of interesting issues I ran into while trying to put my "normal" files into a git repository and push it to a few remote repositories.  Here are the issues and the fixes I came up with.


fatal: out of memory, malloc failed


In one particular repository I have a couple of large VMs: one has a 2900 MiB virtual hard drive and the other a 3008 MiB virtual hard drive.

Now I admit this type of file is not a really good candidate for adding to a git repo.  But in this case it's a historical VM I am never going to change, and I would like the effective, simple backup and transfer mechanism that git provides.  (It's an old VM I used for my Master's thesis.)

I am using a 32-bit version of cygwin on my windows box and when I run


> git commit -m "Initial Commit"




I get the following error


fatal: Out of memory, malloc failed (tried to allocate 3041787905 bytes)




Running this quick find command I find


> find . -size +1000M -ls


My two largest files are

3041787904 bytes (2.8 GiB)
3154509824 bytes (2.9 GiB)

The failed allocation, 3041787905 bytes, is exactly the size of my first file plus one byte.

One web site I found talking about this issue was https://github.com/hbons/SparkleShare/issues/519 [2]



They suggested updating the .git/config file.

Here is what I did (after wiping my .git folder and running git init again)


> vi .git/config


And add


[pack]
        deltaCacheSize = 3072m
        packSizeLimit = 3072m
        windowMemory = 3072m
[core]
        packedGitLimit = 128m
        packedGitWindowSize = 128m
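

If you would rather not edit the file by hand, git config can write the same settings into .git/config for you:


> git config pack.deltaCacheSize 3072m
> git config pack.packSizeLimit 3072m
> git config pack.windowMemory 3072m
> git config core.packedGitLimit 128m
> git config core.packedGitWindowSize 128m
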


Then I re-ran


> git add .
> git commit -m "Initial commit"


And I got the same error… What the heck?

Maybe I need a little bit more overhead room on my memory?

So I updated it to:




[pack]
        deltaCacheSize = 3600m
        packSizeLimit = 3600m
        windowMemory = 3600m
[core]
        packedGitLimit = 128m
        packedGitWindowSize = 128m


And tried again.

I still got the malloc failed error…..


OK, I am going to update my cygwin to a 64-bit version and see if that fixes it.




After installing the 64-bit version of cygwin I edited the .git/config to the following.


[core]
        repositoryformatversion = 0
        filemode = true
        bare = false
        logallrefupdates = true
        ignorecase = true
[core]
        packedGitLimit = 128m
        packedGitWindowSize = 128m

[pack]
        deltaCacheSize = 128m
        packSizeLimit = 128m
        windowMemory = 128m




This only has everything set to 128m.

And it worked!

Looks like the 64-bit version fixed my problem.



I did notice that my .gitconfig file, located in my home directory (the one that contains my [user] information), also contained



[pack]
        windowMemory = 64m


Maybe this created the issue I saw on the 32-bit side?  I would guess the local config file would override this, but I am not sure.

But I thought I would mention it in case someone sees the same issue.
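
One way to check which value actually wins is to ask git for the effective setting from inside the repository; it prints the value with the highest precedence:


> git config pack.windowMemory
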


error: pack-objects died of signal 13


When trying to push this repo up to a remote repo I got the following errors:

Counting objects: 3231, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (3081/3081), done.
error: inflate: data stream error (incorrect data check)/s
fatal: pack has bad object at offset 337748223: inflate returned -3
error: pack-objects died of signal 13

error: failed to push some refs to 'git@git.



Now how do I fix this?

After several attempts I really was getting nowhere…   I am sure I could get it fixed and working, but I had to admit to myself that git was not meant to do this (store very large binary files in a repo) and I should not force a square peg into a round hole.




My Solution


So I gave up, threw in the towel, surrendered… and found another way to deal with it.

Here is my idea…

Within each potential repo I run the following command.



> find "$PWD" -size +128M | xargs -I {} ls -alh "{}"


This will list all the files over 128 MiB and show their location and actual size.

In addition, I also ran du -hs {folder} on a few folders to determine how large they are.  I have a few filled with images that I won't put in a git repo either.

Then I gathered up all the files/folders that I do not want to add to my git repo and added them to my .gitignore file.  For example:


#not-git files and folders
/99_Thesis/VMs
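

If your git has the check-ignore command mentioned earlier, you can quickly confirm the rule matches:


> git check-ignore -v 99_Thesis/VMs
.gitignore:2:/99_Thesis/VMs     99_Thesis/VMs
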




rsync script


I still want to be able to get the VMs folder and its content easily so I created a script to download the folder using rsync.

On my ubuntu server (which I am using as a git remote repo) I created a not-git folder


> sudo mkdir -p /not-git/rsync


This is where the tricky part comes in.  Since we are using rsync, we need to give access to the /not-git/rsync folders to any user who needs it.  In my case it's simple: it's just me.  But if you had other users, they would need permission on this box, in some way, to read this folder.  (Or you could easily put these files somewhere else, even on an FTP server… just a thought.)
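
How you grant that access is up to you; as a minimal sketch for the single-user case (assuming, as earlier in this post, that you connect as the git user and that user should own these files):


> sudo chown -R git:git /not-git/rsync
> sudo chmod -R u+rwX /not-git/rsync
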

Create the script


> vi .rsync-not-git










Here is the script I created


#!/bin/bash

#Check for an override username (second argument)
name=""
if [ -n "$2" ]
then
  name="$2@"
fi

#=======================================
#
#Only spot you should be changing anything.
#One array holds the subfolders (each ending in a slash, or '' for the
#top level) and the other holds the files/folders to rsync.
#I don't think bash supports arrays of arrays so I did it this way
loc="/not-git/rsync/01_folder/"
folders=("Folder1/" "Folder2/")
files=("file1" "file2")
#
#===============================================


flags="-avzr"

if [ "$1" == 'push' ]
then
  echo "Push it"
  for i in "${!folders[@]}"
   do
     #Make the remote directory, then push the file/folder into it
     ssh "$name"git.example.com mkdir -p "$loc${folders[$i]}"
     rsync $flags "${folders[$i]}${files[$i]}" "$name"git.example.com:"$loc${folders[$i]}"
   done
else
  echo "Pull it"
  for i in "${!folders[@]}"
   do
     if [ "${folders[$i]}" == '' ]
     then
       #Top-level file: pull it into the current directory
       rsync $flags "$name"git.example.com:"$loc${folders[$i]}${files[$i]}" .
     else
       #Create the local folder if it is not present, then pull into it
       mkdir -p "${folders[$i]}"
       rsync $flags "$name"git.example.com:"$loc${folders[$i]}${files[$i]}" "${folders[$i]}"
     fi
   done
fi



Edit this to your needs.

Change the loc variable to the location on your remote server where you want to place these files.

Change the folders and files arrays to match what you are ignoring in git but want to rsync.  Note that each folder entry needs a trailing slash (or '' for a file at the top level), because the script joins the folder and file names together.

For example, if you want to rsync /Jeff/move.mp4 and a folder /work/images

You would change it to:



folders=("Jeff" "work")
files=("move.mp4" "images")



Then to use it run the script like this


> ./.rsync-not-git push




To pull


> ./.rsync-not-git pull


If you are like me and are on a system where your local username differs from the remote one, you can add a username to the push or pull.

For example


> ./.rsync-not-git push patman




Summary


To sum it all up….

1 - Upgrade git to at least version 1.9 on your local machine and on any remote git server you will use.  (If you are using cygwin, use 64-bit cygwin.)

2 - edit .git/config


> vi .git/config


And add


[core]
        packedGitLimit = 128m
        packedGitWindowSize = 128m

[pack]
        deltaCacheSize = 128m
        packSizeLimit = 128m
        windowMemory = 128m







3 - Ignore very large files/folders

Run the following command to find large files:


> find "$PWD" -size +128M | xargs -I {} ls -alh "{}"


Add any large files to the .gitignore file.  Also add any large folders, for example a folder of images, that don't make sense to put into a repository.


4 - rsync the ignored files

Write a script to rsync the files and folders you are ignoring in git but still want a simple way to download/upload.

The script should allow for a push and a pull (see my example script above).




That's it!  Use git for all but your large files/folders, and use git to store a script that rsyncs the files/folders you ignored.



References
[1]  Calculate size of files in shell - tpgould, http://stackoverflow.com/questions/599027/calculate-size-of-files-in-shell.  Visited 7/2014.
[2]  Better git memory usage settings for huge files #519, https://github.com/hbons/SparkleShare/issues/519.  Visited 7/2014.
[3]  Cygwin home page, https://www.cygwin.com/.  Visited 7/2014.
[4]  How to get Git 1.8 in Cygwin?  Visited 7/2014.