wiki:Processing/proj_tidy

Context Navigation

Version 14 (modified by dap, 11 years ago) (diff)
--

proj_tidy.sh

Intro

proj_tidy (project tidy) is a tool to keep the arsf flight directories in a neat and standard data structure with correct layout and filenames. It also runs a multitude of checks to flag any problems with the data.

It also serves as our "unpacking script", it prints a list of commands that you should use to convert it from the format we get it from ARSF-OPS to our desired standard.

proj_tidy should be in harmony with the project structure page
If they disagree, go on what the majority of flights have done for that year.

Location

It lives here:
~arsf/usr/bin/proj_tidy.sh

It is modular, and it's modules live here:
~arsf/usr/bin/proj_tidy/

Basic guide to most important modules

Some functions are taken care of in the script, some things which are more complicated or are easier in python are called as separate modules.

Some things, like checking the webcam images take time and only need to be done once, so I've put these under a '-e' (extended) option, which only need to be ran once, when the flight arrives and is being checked for the first time. Fixing permissions takes time too, and this is done nightly on a cron, so I also moved this to it's own option '-m'.

It decides if any directories are missing using check_dirs_present module which uses the folder_structure library that anch originally wrote. This maintains the project structure (somewhere) in a central place, so in theory we only need to update one place when we change structure (e.g. add or move folders around). Note you'll still have to update proj_tidy's regex file for that year if things change, as files will now be in a different place. For any missing directories it will generate a 'mkdir' command, which will appear in "Suggested Commands" at the very bottom.

There are a number of modules/procedures which need to see all the files in the directory. proj_tidy therefore creates a list of all files at the start and writes to a temp file: /tmp/proj_tidy_$timeStamp. This file can then be parsed by any modules that require it.

check_mandatory_files takes the list of all files, and uses proj_tidy's yearly regex list to make sure all required files are present.

check_unrecognised_files takes the list of all files in the flight, filters out any expected ones using the regex file, and prints the rest as "unrecognised files".

move_commands.sh generates the suggested mv commands (which will be printed below) and is one of the most complex modules. It's designed to convert projects to the current structre from the one we get from Ops (and also historically, the pre-2011 file structures). This therefore has to be kept manually updated to match the structure that ops supply to us. It's broken down into one section for each specific directory and is in the form: Check if directory exists -> Check if directory has contents -> Move contents to correct place -> rmdir the directory. It's best start from the lowest level and move upwards (e.g. do leica/ipas/raw/ then leica/ipas/ then leica/) so you're not moving anything you then want to deal with further down the script. It also does some other more complicated things to do with azgcorr style stuff - this was important when converting the projects from the pre-2011 style to the current style and making sure old and new things were kept separated, but is not important now.
It actually looks more complicated than it is because I set it up so you could either execute the commands or just print them. In the end it was decided it was safer just to print them, so proj_tidy doesn't call this module with the 'ex' argument that is necessary to quietly execute them.

At the very end, it prints suggested commands, which are intended to be checked then copy and pasted. These are the ones that are used when unpacking, which convert the project to the current format. The mkdir commands from the missing directories checker will be printed first, and then the mv commands. Don't try and copy and paste all at once because BASH doesn't like too many lines pasted straight into the terminal :P

Changes between years

Also present in the module directory, are regex/templates. These are designed to hold filename/folder structure information for each year. This is because we use proj_tidy to archive flights in previous years. We want to make sure these are standard for a given year but we don't care if they change in-between years (which things do). We therefore need a record of past conventions. The regex files are used by proj_tidy, the templates are the same thing but in a form where things like flightDay can be readily inserted to build correct paths/names. I think this was so proj_tidy can give suggested changes so you can copy and paste without having to correct it manually, but this has not yet been implemented. For pre-2011 an older version of proj_tidy is ran (proj_tidy_old.sh), this is because the project structure/filename conventions were originally hardcoded into the script. When we changed to APL in 2011, I decided to rewrite proj_tidy in the new, flexible year format, and kept the old one around for 2010 stuff. If you run the regular proj_tidy on something pre-2011, it will call proj_tidy_old instead.

Improvements/fixes

SERIOUS BUG - although it's set up to check pre-2011 delivies (as outlined above), it currently can't do this if that flight has been converted to the layout. Since ALL flights have now been converted to the new layout, this needs fixing before we can archive older flights. (dap, ask someone what about the change in 2011, reprocessing, and conversion of structures).
Currently it determines if raw data for a sensor exists by looking if there is e.g. hyperspectral/fenix/ dir. This should be changed so it actually looks for data files e.g. FENIX*.raw
Check if we have .hdr file and a .log file for every .raw file, and do something similar for other sensors. So basically check, if we do have a sensor as determined above, that all expected files are present
Check the database to get number of lines for a sensor, and check there are indeed this number
If can't find raw data, don't search for files pertaining to those sensors
Check that the bandset in the hyperspectral header files is what we expect
Some updates are needed to the regex files so it recognises some newer files
The use folder_structure to decide what dirs are missing is creating folders we don't need in flights, like eage and hawk dirs when there is no data for those. Should either remove these from folder_structure or (better) integrate this bit with the part that determines if sensors are present so it doesn't create these if they're not needed. It also currently puts owl in hyperspectral/owl when it should be in thermal/owl.
There are common errors that ops do that we should integrate into proj_tidy to make unpacking easier. For example, the RCD raws are often not put into the rcd/raw_images/ dir, so we could do this automatically.

Download in other formats:

Plain Text