Version 7 (modified by knpa, 9 years ago)

proj_tidy.sh

Intro

proj_tidy (project tidy) is a tool to keep the arsf flight directories in a neat and standard data structure with correct layout and filenames. It also runs a multitude of checks to flag any problems with the data.

It also serves as our "unpacking script": it prints a list of commands that you should use to convert a flight from the format we receive from ARSF-OPS to our desired standard.

proj_tidy should agree with the project structure page. If they disagree, go with what the majority of flights have done for that year.

Location

It lives here:
~arsf/usr/bin/proj_tidy.sh

It is modular, and its modules live here:
~arsf/usr/bin/proj_tidy/

Basic guide to functions

Some functions are handled in the script itself; things that are more complicated, or easier in Python, are called as separate modules.

It decides whether directories are missing using the folder_structure module that anch originally wrote. This keeps the project structure in a central place (somewhere), so in theory we only need to update one place when we change the structure (e.g. move folders around). Note that you'll still have to update proj_tidy's regex file, as files will then be in a different place.
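As a rough illustration of the missing-directory check, here is a minimal Python sketch. It is not the actual folder_structure module: the EXPECTED_DIRS list and the missing_dirs function are hypothetical names, and the real module may store and expose the structure quite differently.

```python
import os
import tempfile

# Hypothetical central list of expected directories, relative to the flight root.
# The real folder_structure module may hold this in another form.
EXPECTED_DIRS = [
    "hyperspectral/fenix",
    "thermal/owl",
    "rcd/raw_images",
]


def missing_dirs(flight_root, expected=EXPECTED_DIRS):
    """Return the expected directories that do not exist under flight_root."""
    return [d for d in expected
            if not os.path.isdir(os.path.join(flight_root, d))]


if __name__ == "__main__":
    # Build a throwaway flight directory containing only one expected dir.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "hyperspectral", "fenix"))
        for d in missing_dirs(root):
            print("missing:", d)
```

Keeping the list in one place is what lets a structure change be made once, although the regex file still has to be updated separately.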

It makes a list of all files in the flight, filters out any expected ones using the regex file, and prints the rest as "unrecognised files".
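The filtering step could look something like the Python sketch below. The patterns are made up for illustration (the real regex file for a given year will differ), and list_files and unrecognised_files are hypothetical names rather than functions from proj_tidy.

```python
import os
import re

# Hypothetical patterns; the real per-year regex file will contain different ones.
EXPECTED_PATTERNS = [
    r"hyperspectral/fenix/FENIX.*\.(raw|hdr|log)",
    r"rcd/raw_images/.*\.raw",
]


def list_files(flight_root):
    """Return every file under flight_root as a sorted, '/'-separated relative path."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(flight_root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), flight_root)
            found.append(rel.replace(os.sep, "/"))
    return sorted(found)


def unrecognised_files(paths, patterns=EXPECTED_PATTERNS):
    """Return the paths that match none of the expected patterns."""
    compiled = [re.compile(p) for p in patterns]
    return [p for p in paths if not any(c.fullmatch(p) for c in compiled)]


if __name__ == "__main__":
    # Swap "." for the path of a flight directory.
    for path in unrecognised_files(list_files(".")):
        print("unrecognised file:", path)
```

The sketch uses fullmatch so that a pattern has to account for the whole relative path, not just a prefix of it.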

Changes between years

The module directory also contains regex files and templates. These hold filename and folder structure information for each year. We need them because we use proj_tidy to archive flights from previous years: we want flights to be standard for a given year, but we don't mind if conventions change between years (which they do), so we need a record of past conventions.

The regex files are used by proj_tidy. The templates hold the same information in a form where things like flightDay can be readily inserted to build correct paths/names. I think this was so proj_tidy could suggest changes that you can copy and paste without correcting them manually, but this has not yet been implemented.

For pre-2011 flights an older version of proj_tidy is run (proj_tidy_old.sh), because the project structure/filename conventions were originally hardcoded into the script. When we changed to APL in 2011, I decided to rewrite proj_tidy in the new, flexible per-year format, and kept the old one around for pre-2011 flights. If you run the regular proj_tidy on something pre-2011, it will call proj_tidy_old instead.
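To make the template idea concrete, here is a small Python sketch of per-year templates with a flightDay placeholder, plus the pre-2011 hand-off. The template strings, the TEMPLATES_BY_YEAR table, and both function names are hypothetical; the real template files may use a different syntax and different paths.

```python
from string import Template

# Hypothetical templates keyed by year; paths and names are illustrative only.
TEMPLATES_BY_YEAR = {
    2011: Template("logsheet/log-${year}${flightDay}.pdf"),
    2012: Template("admin/logsheet-${year}-${flightDay}.pdf"),
}


def build_path(year, flight_day):
    """Fill in the template for the given year to get an expected path."""
    return TEMPLATES_BY_YEAR[year].substitute(year=year, flightDay=flight_day)


def script_for_year(year):
    """Pick the script for a flight: pre-2011 flights go to the old version."""
    return "proj_tidy_old.sh" if year < 2011 else "proj_tidy.sh"


if __name__ == "__main__":
    print(build_path(2011, "123"))
    print(script_for_year(2010))
```

With paths built this way, a suggested rename command could in principle be printed ready to copy and paste, which is the unimplemented feature described above.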

Improvements/fixes

  • SERIOUS BUG - although it's set up to check pre-2011 deliveries (as outlined above), it currently can't do this if the flight has been converted to the new layout. Since ALL flights have now been converted to the new layout, this needs fixing before we can archive older flights. (dap, ask someone about the change in 2011, reprocessing, and conversion of structures.)
  • Currently it determines whether raw data for a sensor exists by checking for a directory, e.g. hyperspectral/fenix/. This should be changed so it actually looks for data files, e.g. FENIX*.raw
  • Check that we have a .hdr file and a .log file for every .raw file, and do something similar for other sensors. In other words, if a sensor is present (as determined above), check that all of its expected files are present
  • Check the database to get the number of lines for a sensor, and check that this many lines are indeed present
  • If it can't find raw data for a sensor, don't search for files pertaining to that sensor
  • Check that the bandset in the hyperspectral header files is what we expect
  • Some updates are needed to the regex files so it recognises some newer files
  • Using folder_structure to decide which dirs are missing creates folders we don't need in flights, like eagle and hawk dirs when there is no data for those sensors. We should either remove these from folder_structure or (better) integrate this with the part that determines whether sensors are present, so it doesn't create them when they're not needed. It also currently puts owl in hyperspectral/owl when it should be in thermal/owl.
  • There are common mistakes that ops make which we should handle in proj_tidy to make unpacking easier. For example, the RCD raws are often not put into the rcd/raw_images/ dir, so we could do this automatically.
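Two of the proposed checks above (detecting a sensor from its data files rather than its directory, and checking that each .raw file has a .hdr and a .log) could be sketched in Python roughly as follows. This is only an illustration: the function names are hypothetical, and it assumes companion files share the raw file's base name (e.g. FENIX001.raw with FENIX001.hdr), which may not match the real naming.

```python
import glob
import os
import tempfile


def has_raw_data(sensor_dir, pattern="FENIX*.raw"):
    """True if at least one raw data file exists, not merely the directory."""
    return bool(glob.glob(os.path.join(sensor_dir, pattern)))


def missing_companions(sensor_dir, pattern="FENIX*.raw", extensions=(".hdr", ".log")):
    """Return the names of companion files that are absent for each raw file."""
    missing = []
    for raw in sorted(glob.glob(os.path.join(sensor_dir, pattern))):
        # Assumption: companions share the raw file's base name.
        base = os.path.splitext(raw)[0]
        for ext in extensions:
            if not os.path.isfile(base + ext):
                missing.append(os.path.basename(base + ext))
    return missing


if __name__ == "__main__":
    # Build a throwaway sensor directory with one incomplete set of files.
    with tempfile.TemporaryDirectory() as d:
        for name in ("FENIX001.raw", "FENIX001.hdr"):
            open(os.path.join(d, name), "w").close()
        if has_raw_data(d):
            for name in missing_companions(d):
                print("missing companion file:", name)
```

Running the companion check only when has_raw_data is true also covers the point about not searching for a sensor's files when it has no raw data.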