Sunday, February 26, 2012

How to back up your Google Docs documents




I am a fan of Google Docs: I often need to access and edit my documents while I am away, and Google Docs offers a great way to do that.
The problem is that I have a lot of large PDFs there, and they can take a while to load: I would love to have a local copy when I am in the office...
On top of that, I always like to have a local copy of stuff... just in case! Call me paranoid, but what happens if your account is hacked? Or if Google unilaterally closes your account because they consider that you don't respect the terms of use? Better safe than sorry...


I couldn't find an application anywhere that does what I want (make a local backup of my Google documents and update it regularly).
There is the Google "Takeout" application, but you cannot schedule regular downloads...
A project like google-docs-fs seems promising, but it only supports Google documents (and not the other files you may have uploaded if, like me, you have a premium account). Plus, my analysis is that there are too many possible points of failure if you rsync this file system... I need something more robust.


I decided to code what I need myself: a Java command line application that can be used to schedule regular downloads of all your Google Docs documents.


1. Presentation of gdocsdownload.jar

The features implemented:
- reuse data from a previous backup to avoid re-downloading files that haven't changed
- rotating backups (for example, a maximum of 7 backups, backup.zip being the most recent one and backup.7.zip being the oldest one)
- zip archive or just a folder archive (takes more space but easier to access)
- configurable document export mode (for example, export Google spreadsheets as xls or as csv)
- download documents that are in multiple folders only once (gdocsbackup.removeduplicateddocuments)
- archive without folder structure (all documents in a zip, like Google Takeout) or with folder structure (much easier to navigate)
- support for any type of file.
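To make the rotation naming concrete, here is a minimal shell sketch of what one rotation step could look like for zip archives. It only illustrates the naming scheme described above and is not the program's actual code: the function name rotate_backups and its arguments are made up for the example, and I assume here that the highest suffix is simply the last one kept.

```shell
#!/bin/sh
# Hypothetical illustration of the rotation naming scheme, not the tool's code.
# Usage: rotate_backups DIR ROOTNAME MAX
#   DIR      directory holding the archives
#   ROOTNAME archive root name (ROOTNAME.zip is the most recent archive)
#   MAX      highest suffix to keep (ROOTNAME.MAX.zip is the oldest archive)
rotate_backups() {
    dir=$1
    root=$2
    max=$3

    # The oldest archive is deleted to make room.
    rm -f "$dir/$root.$max.zip"

    # Shift ROOTNAME.i.zip to ROOTNAME.(i+1).zip, starting with the oldest.
    i=$((max - 1))
    while [ "$i" -ge 1 ]; do
        if [ -e "$dir/$root.$i.zip" ]; then
            mv "$dir/$root.$i.zip" "$dir/$root.$((i + 1)).zip"
        fi
        i=$((i - 1))
    done

    # The most recent archive becomes number 1, freeing ROOTNAME.zip
    # for the next download.
    if [ -e "$dir/$root.zip" ]; then
        mv "$dir/$root.zip" "$dir/$root.1.zip"
    fi
}
```

For example, rotate_backups /path/to/backups gdocs_backup 7 would leave room for a fresh gdocs_backup.zip and drop the old gdocs_backup.7.zip, which is why nothing else should be stored in these archives.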

TODO:
- use hard links on operating systems that support them (that would substantially reduce the amount of disk space needed for multiple backups with a lot of unchanged documents)
- fix the bug that forces you to use a temp directory on the same partition as the destination directory
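
Both TODO items depend on whether two directories sit on the same partition: hard links cannot cross filesystems, and the temp directory currently has to share a partition with the destination. A quick way to check this from the shell is sketched below. The function names are made up for the example, and the sketch assumes mount points without spaces in their names.

```shell
#!/bin/sh
# Print the mount point of the filesystem holding the given path.
# Relies on the POSIX output format of "df -P" (one header line, one data line).
# Assumption: the mount point contains no spaces.
mount_point_of() {
    df -P "$1" | awk 'NR == 2 { print $NF }'
}

# Succeed (exit status 0) if both paths are on the same mounted filesystem.
same_filesystem() {
    [ "$(mount_point_of "$1")" = "$(mount_point_of "$2")" ]
}
```

Something like same_filesystem /opt/tmp /path/to/backups && echo "same partition" tells you whether a given temp directory is a valid choice for your backup path.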

In my setup, I want to install gdocsdownload.jar as a daily cron job on my NAS, but you can install it anywhere.

The program is configured using the config file gdocsdownload.properties, which reads as follows:
#use system defined proxy
gdocsbackup.usesystemproxy=true
#google account username and password
gdocsbackup.username=xxxx
gdocsbackup.password=xxx
#the path where we want to backup
gdocsbackup.backuppath=C:\\Users\\xxx\\Documents\\Data\\
#the name of the backup archive. 
#the zip archives will be named: backuprootname.zip backuprootname.1.zip
#the folder archives will be named: backuprootname/ backuprootname.1/
gdocsbackup.backuprootname=gdocs_backup
#the number of backup files to keep
gdocsbackup.nbbackupfiles=7
#TRUE if you want to store the backup as a zip file.
gdocsbackup.usezip=FALSE
#zip compression level (0-9) with 9 being the most compressed (and most CPU intensive)
gdocsbackup.zipcompresslevel=6
#use hard links to link new data identical to older data. This does save a lot of space (you can't use this option with usezip)
#not supported yet!
gdocsbackup.usehardlinks=FALSE
#document export format: one of doc html odt pdf png rtf txt zip
gdocsbackup.documentexportformat=doc
#presentation export format: one of pdf png ppt txt
gdocsbackup.presentationexportformat=ppt
#spreadsheet export format: one of xls csv pdf ods tsv html (NB: first sheet export only for csv and tsv)
gdocsbackup.spreadsheetexportformat=xls
#try to replicate the directory structure in the zip
gdocsbackup.keepdirectorystructure=TRUE
#show documents that appear at different places in the folder tree only once (in the first folder where it is found)
gdocsbackup.removeduplicateddocuments=TRUE
#log file (on linux, good practice is to put it in /var/log/ or /opt/var/log and to make sure logrotate works correctly)
gdocsbackup.logfile=C:\\gdocsbackup.log

All options are self-explanatory. You can customize them as required by your setup.
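
If you want to double check a value from a script (for example before scheduling the job), a small helper like the one below can read a key from the properties file. This is only a sketch: get_prop is a name I made up, it handles simple key=value lines only, and it prints the raw text, so a Windows path will still show its doubled backslashes.

```shell
#!/bin/sh
# Print the value of a key from a simple key=value properties file.
# Usage: get_prop FILE KEY
# Comment lines (starting with #) never match because the key is compared
# exactly. If the key appears several times, the last value wins.
get_prop() {
    awk -F= -v key="$2" '
        $1 == key { sub(/^[^=]*=/, ""); val = $0 }
        END { print val }
    ' "$1"
}
```

For instance, get_prop gdocsdownload.properties gdocsbackup.backuppath prints the configured backup path, which you can then test with [ -d ... ] before the first run.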

Since the program is written in Java, it can run on any OS / architecture that supports Java.

The jar is available for download at http://dl.dropbox.com/u/50398581/gdocsbackup/gdocsdownload.jar
A sample properties file is available at http://dl.dropbox.com/u/50398581/gdocsbackup/gdocsdownload.properties
and source code is available at: http://dl.dropbox.com/u/50398581/gdocsbackup/gdocsdownload-src.zip


Please note that in order to "rotate" backups, the program will delete the oldest backup! Don't modify the backups or store anything else in them!
The program only reads information from the Google server: it does not update or delete anything there, so you are safe on that side!


To determine whether a file was already downloaded, the last_update tag given by Google is checked. I suggest you do a full backup from time to time to prevent an error from propagating from backup to backup (to do that, just add the option fulldownload after the properties file when launching the jar).
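
The idea behind this check can be sketched in a few lines of shell. This is not the program's code, just an illustration: I assume a hypothetical manifest file written during the previous backup, with one "document-id timestamp" pair per line, and the function name needs_download is made up.

```shell
#!/bin/sh
# Decide whether a document has to be fetched again.
# Usage: needs_download MANIFEST DOC_ID UPDATED
#   MANIFEST  hypothetical file from the previous backup, one
#             "doc-id timestamp" pair per line
#   DOC_ID    identifier of the document
#   UPDATED   last update timestamp currently reported by the server
# Succeeds (exit status 0) when the document is new or has changed.
needs_download() {
    # Look up the timestamp recorded last time; empty if the document
    # is unknown or the manifest does not exist yet.
    previous=$(awk -v id="$2" '$1 == id { print $2 }' "$1" 2>/dev/null)
    [ "$previous" != "$3" ]
}
```

A document whose timestamp never changes is always taken from the previous backup, so a copy that was damaged once would be carried forward forever. That is the reason for the occasional full download.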


2. Steps to install gdocsdownload on a linux based NAS
The setup is easy to adapt to any machine running linux. I didn't write a tutorial for Windows or Mac as I lack the knowledge to do it, but it can of course be done... feel free to adapt it and post your results and hints in the comments!
This tutorial assumes some vi and linux knowledge...

This is how I installed gdocsdownload.jar on my NAS (an Iomega Storcenter ix4-200d). Please note that this procedure is not supported by Iomega! Use at your own risk!

a. Download and setup of gdocsdownload
First, you need to ssh into your NAS (see my other post if you have an Iomega Storcenter)
Then:
mkdir -p /opt/usr/local/gdocsdownload/
cd /opt/usr/local/gdocsdownload/
wget http://dl.dropbox.com/u/50398581/gdocsbackup/gdocsdownload.jar
wget http://dl.dropbox.com/u/50398581/gdocsbackup/gdocsdownload.properties
Don't forget to change the properties file to make it work for your setup (you at least need to change account information and paths):
vi gdocsdownload.properties

If you are concerned about security, you should put the properties file in your home folder (it contains your password in clear text)...
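
Wherever you keep the file, it is worth making it readable by its owner only. A minimal sketch, with a made-up function name lock_down:

```shell
#!/bin/sh
# Restrict a file to its owner (read and write) and confirm the result.
# Usage: lock_down FILE
lock_down() {
    chmod 600 "$1" || return 1
    # "ls -ld" prints the mode string first, e.g. "-rw-------".
    mode=$(ls -ld "$1" | cut -c1-10)
    [ "$mode" = "-rw-------" ]
}
```

You would call it as lock_down gdocsdownload.properties. Keep in mind that root can still read the file, which is fine here since the cron job below runs as root.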

If you haven't already done so, you need to install java on your NAS. See the java section of my previous post How to install Crashplan on an Iomega Storcenter to find out how to do it for an Iomega storcenter.

If you followed the java installation procedure of my other post, link java to a more usual location:
ln -s /mnt/pools/A/A0/NAS_Extension/ejre1.7.0/bin/java /opt/bin/java
The setup can already be tested by running the command:
/opt/bin/java -jar /opt/usr/local/gdocsdownload/gdocsdownload.jar /opt/usr/local/gdocsdownload/gdocsdownload.properties
Press Ctrl-C to stop the run.


The program needs to be started from a script so that we can set the correct folder permissions and TMP folder.
You need to make sure there is enough space in your temp folder (my /tmp/ folder is way too small, that's why I use /opt/tmp/):
vi gdocsdownloader

and then type:
#!/bin/sh

#this is to have a backup that's readable by everybody
#but only writeable by the owner.
#change it to suit your needs
umask 022
#use a tmp directory with enough space to fit all your docs
#NB: it seems like there is a bug somewhere and the tmp directory has to
#be on the same partition as the destination directory....
#please choose a tmp directory respecting these conditions
#/opt/bin/java -Djava.io.tmpdir=/opt/tmp/ -jar /opt/usr/local/gdocsdownload/gdocsdownload.jar "$@"
/opt/bin/java -Djava.io.tmpdir=/mnt/pools/A/A0/data/perso/gdocs/ -jar /opt/usr/local/gdocsdownload/gdocsdownload.jar "$@"
Make it executable:
chmod a+x gdocsdownloader
And test with:
./gdocsdownloader /opt/usr/local/gdocsdownload/gdocsdownload.properties


b. Set up a cron job to back up Google Docs data
Create the gdocsdownloader cron (I don't use /etc/cron.daily/ because I want a full download once a week):
vi /etc/cron.d/gdocsdownload
and add:
# download google docs files at 3:45 AM

#full download on sunday
45 3    * * 0   root    /opt/usr/local/gdocsdownload/gdocsdownloader /opt/usr/local/gdocsdownload/gdocsdownload.properties fulldownload > /dev/null 2>&1
#regular download the other days
45 3    * * 1,2,3,4,5,6   root    /opt/usr/local/gdocsdownload/gdocsdownloader /opt/usr/local/gdocsdownload/gdocsdownload.properties > /dev/null 2>&1

The cron job will run every day!
You may want to run the first batch right away with:
/opt/usr/local/gdocsdownload/gdocsdownloader /opt/usr/local/gdocsdownload/gdocsdownload.properties

c. Start the cron daemon

The cron daemon is not started at boot by default...

You can start it manually:
/etc/init.d/cron start

But to have it start up every time at boot, we need to add the line:
/etc/init.d/cron start >> /opt/init-opt.log
to our /opt/init-opt.sh script.

See my other post How to run a program at boot on the Iomega Storcenter NAS to see how it works!

d. Set up logrotate
Logrotate is the process that compresses and deletes old logs so that your logs don't eat all your disk space!
vi /etc/logrotate.d/gdocsdownload
and add:
/opt/var/log/gdocsdownload.log {
    rotate 4
    weekly
    compress
    delaycompress
    missingok
    notifempty
    prerotate
      while ps aux | grep 'gdocsdownload\.jar' | grep -v grep > /dev/null
        do
          sleep 10
        done
    endscript
}


This will rotate your gdocsdownload logs once a week and keep the last 4 rotated logs (about 4 weeks' worth). It is easy to modify these parameters in the config file above.

I try to make sure the gdocsdownload is done before rotating the logs to avoid conflict...

Don't forget to change the path if your log is somewhere else!
