Version 23 (modified by dap, 10 years ago)

proj_tidy.sh

Intro

proj_tidy (project tidy) is a tool to keep the arsf flight directories in a neat and standard data structure with correct layout and filenames. It also runs a multitude of checks to flag any problems with the data.

It also serves as our "unpacking script": it prints a list of commands that you should use to convert a flight from the format in which we receive it from ARSF-OPS to our desired standard.

proj_tidy should be in harmony with the project structure page.
If they disagree, go with what the majority of flights have done for that year.

Location

It lives here:
~arsf/usr/bin/proj_tidy.sh

It is modular, and its modules live here:
~arsf/usr/bin/proj_tidy/

Basic guide to most important modules

Some functions are handled directly in the script; tasks that are more complicated, or easier to write in Python, are called as separate modules.

Some things, like checking the webcam images, take time and only need to be done once, so I've put these under a '-e' (extended) option, which only needs to be run once, when the flight arrives and is being checked for the first time. Fixing permissions takes time too, and this is done nightly by cron, so I also moved this to its own option, '-m'.

It decides if any directories are missing using check_dirs_present module which uses the folder_structure library that anch originally wrote. This maintains the project structure (somewhere) in a central place, so in theory we only need to update one place when we change structure (e.g. add or move folders around). Note you'll still have to update proj_tidy's regex file for that year if things change, as files will now be in a different place. For any missing directories it will generate a 'mkdir' command, which will appear in "Suggested Commands" at the very bottom.
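
As a rough illustration of the missing-directory step, the sketch below compares a list of expected directories against what exists on disk and returns mkdir commands for the gaps. The names EXPECTED_DIRS and suggest_mkdir_commands are hypothetical, and the hard-coded list is only a stand-in: the real list comes from the folder_structure library, whose API is not described on this page.

```python
import os
import shlex

# Hypothetical subset of the expected layout; in proj_tidy the real list is
# supplied by the folder_structure library.
EXPECTED_DIRS = ["hyperspectral/fenix", "thermal/owl", "camera/rcd/raw_images"]


def suggest_mkdir_commands(project_dir, expected_dirs=EXPECTED_DIRS):
    """Return one 'mkdir -p' command per expected directory that is missing."""
    commands = []
    for rel_path in expected_dirs:
        full_path = os.path.join(project_dir, rel_path)
        if not os.path.isdir(full_path):
            # Quote the path so the printed command is safe to paste.
            commands.append("mkdir -p " + shlex.quote(full_path))
    return commands
```

The commands are returned as strings rather than executed, which matches the "print, check, then paste" approach used for the suggested commands.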

There are a number of modules/procedures which need to see all the files in the directory. proj_tidy therefore creates a list of all files at the start and writes to a temp file: /tmp/proj_tidy_$timeStamp. This file can then be parsed by any modules that require it.
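
A minimal sketch of the shared file list, assuming it is one path per line. The function names and the timestamp format are assumptions (the page only says $timeStamp), and the real script may build the list differently.

```python
import os
import tempfile
import time


def write_file_list(project_dir, tmp_dir=None):
    """Walk project_dir and write every file path, one per line, to a temp file.

    Returns the path of the list file.
    """
    tmp_dir = tmp_dir or tempfile.gettempdir()
    # Assumed timestamp format; the source only names the variable $timeStamp.
    time_stamp = time.strftime("%Y%m%d%H%M%S")
    list_path = os.path.join(tmp_dir, "proj_tidy_" + time_stamp)
    with open(list_path, "w") as list_file:
        for dir_path, _dir_names, file_names in os.walk(project_dir):
            for name in sorted(file_names):
                list_file.write(os.path.join(dir_path, name) + "\n")
    return list_path


def read_file_list(list_path):
    """Return the paths stored in a list file written by write_file_list."""
    with open(list_path) as list_file:
        return [line.rstrip("\n") for line in list_file]
```

Building the list once means each module reads a small text file instead of walking the whole flight directory again.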

check_mandatory_files takes the list of all files, and uses proj_tidy's yearly regex list to make sure all required files are present.
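
The idea can be sketched as follows, assuming the yearly regex list is a set of patterns that must each match at least one file. MANDATORY_PATTERNS and missing_mandatory are hypothetical names, and the patterns are made up for illustration; the real ones live in the per-year regex file.

```python
import re

# Hypothetical patterns for illustration only.
MANDATORY_PATTERNS = [
    r"hyperspectral/fenix/.*\.raw$",
    r"hyperspectral/fenix/.*\.hdr$",
]


def missing_mandatory(file_list, patterns=MANDATORY_PATTERNS):
    """Return the patterns that no path in file_list matches."""
    return [
        pattern for pattern in patterns
        if not any(re.search(pattern, path) for path in file_list)
    ]
```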

check_unrecognised_files takes the list of all files in the flight, filters out any expected ones using the regex file, and prints the rest as "unrecognised files".
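
This is the inverse of the mandatory check: instead of asking whether each pattern has a file, it asks whether each file has a pattern. A minimal sketch, with the hypothetical function name unrecognised_files and the expected patterns passed in as a plain list:

```python
import re


def unrecognised_files(file_list, expected_patterns):
    """Return the paths that match none of the expected patterns."""
    compiled = [re.compile(pattern) for pattern in expected_patterns]
    return [
        path for path in file_list
        if not any(regex.search(path) for regex in compiled)
    ]
```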

move_commands.sh generates the suggested mv commands (which will be printed below) and is one of the most complex modules. It's designed to convert projects to the current structure from the one we get from Ops (and also, historically, from the pre-2011 file structures). It therefore has to be kept manually updated to match the structure that Ops supply to us. It's broken down into one section for each specific directory, each in the form: Check if directory exists -> Check if directory has contents -> Move contents to correct place -> rmdir the directory. It's best to start from the lowest level and move upwards (e.g. do leica/ipas/raw/ then leica/ipas/ then leica/) so you're not moving anything you then want to deal with further down the script. It also does some other more complicated things to do with azgcorr-style stuff - this was important when converting the projects from the pre-2011 style to the current style and making sure old and new things were kept separated, but is not important now.
It actually looks more complicated than it is because I set it up so you could either execute the commands or just print them. In the end it was decided it was safer just to print them, so proj_tidy doesn't call this module with the 'ex' argument that is necessary to quietly execute them.
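
The per-directory pattern and the deepest-first ordering can be sketched as below. This is a Python illustration, not the contents of move_commands.sh (which is a shell script); the function name suggest_moves and the source-to-destination mapping are hypothetical.

```python
import os
import shlex


def suggest_moves(mapping):
    """Return suggested mv/rmdir commands for a {source_dir: dest_dir} mapping.

    Keys must not end in a path separator. Nothing is executed.
    """
    # Deepest source directories first, so nothing is moved before the
    # section that deals with it has run.
    ordered = sorted(mapping, key=lambda path: path.count(os.sep), reverse=True)
    commands = []
    for src_dir in ordered:
        if not os.path.isdir(src_dir):  # check the directory exists
            continue
        entries = [
            name for name in sorted(os.listdir(src_dir))
            # Subdirectories with their own section were handled earlier.
            if os.path.join(src_dir, name) not in mapping
        ]
        dest = shlex.quote(mapping[src_dir] + os.sep)
        for name in entries:  # move any contents to the correct place
            src = shlex.quote(os.path.join(src_dir, name))
            commands.append("mv {} {}".format(src, dest))
        commands.append("rmdir " + shlex.quote(src_dir))  # then remove it
    return commands
```

For example, a mapping with both leica/ipas/raw and leica/ipas as keys yields the commands for leica/ipas/raw first, then those for leica/ipas.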

At the very end, it prints suggested commands, which are intended to be checked and then copied and pasted. These are the ones that are used when unpacking, which convert the project to the current format. The mkdir commands from the missing directories checker will be printed first, and then the mv commands. Don't try to copy and paste them all at once, because BASH doesn't like too many lines pasted straight into the terminal :P

Changes between years

Also present in the module directory are regex/templates. These are designed to hold filename/folder structure information for each year. This is because we use proj_tidy to archive flights from previous years. We want to make sure these are standard for a given year, but we don't care if they change between years (which things do). We therefore need a record of past conventions. The regex files are used by proj_tidy; the templates are the same thing but in a form where things like flightDay can be readily inserted to build correct paths/names. I think this was so proj_tidy can give suggested changes that you can copy and paste without having to correct them manually, but this has not yet been implemented.

For pre-2011 flights an older version of proj_tidy is run (proj_tidy_old.sh). This is because the project structure/filename conventions were originally hardcoded into the script. When we changed to APL in 2011, I decided to rewrite proj_tidy in the new, flexible year format, and kept the old one around for 2010 stuff. If you run the regular proj_tidy on something pre-2011, it will call proj_tidy_old instead.

Improvements/fixes

  • SERIOUS BUG - although it's set up to check pre-2011 deliveries (as outlined above), it currently can't do this if that flight has been converted to the new layout. Since ALL flights have now been converted to the new layout, this needs fixing before we can archive older flights. (dap: ask someone about the change in 2011, reprocessing, and conversion of structures.)
  • Currently it determines if raw data for a sensor exists by checking whether a directory such as hyperspectral/fenix/ exists. This should be changed so that it actually looks for data files, e.g. FENIX*.raw
  • Check that we have a .hdr file and a .log file for every .raw file, and do something similar for other sensors. So basically check, if we do have a sensor as determined above, that all expected files are present.
    • For the hawk and fenix hyperspectral sensors, there is a check to ensure that we have a .log and a .nav file for each .raw file in the raw directory.
    • For the eagle sensor, there is a check to ensure that we have a .nav file for each .raw file but there is no need to check that we have a log file as these are kept in the hawk directory.
    • For the owl sensor, there is a check to ensure we have a .nav file and a .hdr file for each .raw file but since there isn't always a .log file for each flight line (it's not compulsory), a check isn't necessary.
    • For the camera and LiDAR sensor, there is no need to implement such a check as there is only one type of file for each of these sensors (.raw and .scn respectively).
  • [FUTURE] Check the database to get number of lines for a sensor, and check there are indeed this number
    • This can't be done yet, as there are often test lines flown by ops, which are not always entered into the database.
  • If raw data can't be found for a sensor, don't search for files pertaining to that sensor
    • In the Check for missing directories section, hyperspectral/owl is looked for. This should be thermal/owl. This ties in with the point listed below about how the project layout is built. It is due to an incorrect key list in the folder_structure module.
    • Currently, proj_tidy.missing_directories calls folder_structure.FolderStructure's constructor, which returns a full list of the directories that are expected to be in the project directory. This could be changed by either adding another function to proj_tidy.py that looks for regex instead or changing the FolderStructure constructor so that it can just return the relevant directories if necessary.
      • This was achieved by updating the missing_directories function. I changed it so that it now takes arguments that determine which missing directories it should look for. For example, it has the check_hyper argument, to which false is passed if there is no hyperspectral data. The Python function checks this argument and, if it is false, removes any path that contains "/hyperspectral". The other arguments work in a similar way.
  • Check that the bandset in the hyperspectral header files is what we expect
    • This is done by checking if there are any .wls files in the hyperspectral sensor calibration directory (~arsf/calibration/<year>/<sensor>/) for that particular year. If it's there, it will count the number of lines in each file and add it to a Python list. It will then read in the header files and count the number of wavelengths in each one using the Python data_handler library. It will then check that the length of the header file is in the Python list of numbers of lines. If it's not, it will print a warning.
  • Some updates are needed to the regex files so it recognises some newer files
    • I believe these are up to date. The only things that will need adding are the owl delivery directories and any owl files that are required in the main project directory. The folder_structure library will also need to be updated to include the owl directories so that they are built when unpacking. Delivery libraries will need to be updated so that delivery structure is created when needed.
    • Where proj_tidy checks that all the compulsory files are present, checking for mandatory Owl files needs to be added.
  • Using folder_structure to decide which dirs are missing creates folders we don't need in flights, like eagle and hawk dirs when there is no data for those. We should either remove these from folder_structure or (better) integrate this bit with the part that determines if sensors are present, so it doesn't create these if they're not needed. It also currently puts owl in hyperspectral/owl when it should be in thermal/owl.
  • There are common errors that ops make that we should integrate into proj_tidy to make unpacking easier. For example, the RCD raws are often not put into the rcd/raw_images/ dir, so we could do this automatically.
    • When proj_tidy checks which sensors are present, for the RCD, it looks in $projpath/camera/rcd/raw_images and if none are found in there, looks in $projpath/camera/rcd.
    • If they are then found in there AND the script is NOT being executed in just_checking mode, it will print an mv command to move the files to the correct place.
    • If the script IS being run in just_checking mode, it will warn the user that they are in the wrong place but won't suggest any mv commands.
    • If the .raw files aren't found in either $projpath/camera/rcd or $projpath/camera/rcd/raw_images, it will assume there are no camera raw files present.
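
The RCD logic in the last point can be sketched as follows. This is only an illustration of the decision flow described above: the function name check_rcd_raws, its return shape and the message wording are assumptions, and it treats misplaced raws as "camera present", which the page implies but does not state outright.

```python
import os
import shlex


def check_rcd_raws(proj_path, just_checking=False):
    """Return (camera_present, messages) for the RCD raw files."""
    rcd_dir = os.path.join(proj_path, "camera", "rcd")
    raw_dir = os.path.join(rcd_dir, "raw_images")

    def raws_in(directory):
        """List the .raw files directly inside directory (empty if absent)."""
        if not os.path.isdir(directory):
            return []
        return sorted(
            name for name in os.listdir(directory)
            if name.endswith(".raw")
            and os.path.isfile(os.path.join(directory, name))
        )

    if raws_in(raw_dir):
        return True, []  # raws are already in the correct place
    if not raws_in(rcd_dir):
        return False, []  # no camera raw files in either location
    if just_checking:
        # Warn only; no commands are suggested in just_checking mode.
        return True, [
            "WARNING: RCD raw files are in {} rather than {}".format(rcd_dir, raw_dir)
        ]
    # Suggest (but do not run) a command to move the files.
    return True, ["mv {}/*.raw {}/".format(shlex.quote(rcd_dir), shlex.quote(raw_dir))]
```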