Data Assimilation Facility

The data assimilation facility allows the user to specify background winds and thermodynamic fields that vary in x, y, z, and time. As discussed in Data Assimilation, forcing terms are added to the equations of motion in order to steer the large-scale components of the solution to the prescribed background distributions. This page does not discuss the theory further, but rather focuses on the data assimilation implementation details.

The data assimilation feature is activated via the user input i_read_w=2. It is also necessary to supply a data file containing the space and time dependent background fields. Construction of this file can be a complex task and much of the remainder of this document is devoted to describing the steps necessary to build the data file. The same steps can be followed if the user only needs a single profile at a fixed latitude, longitude, and time.

Wind File Overview

Historically users could specify one-dimensional mean wind and temperature profiles (U(z), V(z), T(z)) via the file background.wind. This concept has been generalized for the data assimilation facility by allowing the user to specify background winds and temperatures that vary in all three spatial directions and in time (e.g. U(x,y,z,t)). In what follows we shall refer to this as 4D winds as it encompasses three spatial dimensions plus time. The file name background.wind is still used, but the file format is different when data assimilation is used. Since the 4D wind file typically contains much more data that the 1D variant, most of it is written in binary format. The exception is the inclusion of a few ascii header records at the beginning of the file that provide useful meta data in an easily accessible format. The downside of the hybrid data structure is that the background.wind file must be constructed following strict formatting rules in order to be read properly by CGCAM. While the programs described below will generate a correctly formatted background.wind file, care must be taken if the user uses his own procedures to create a wind file.

4D Wind File Structure

In a nutshell, the 4D background.wind file consists of three ascii header lines listing things like the number of points in each spatial direction and in time, the time interval, the case name, etc (described in detail below). Next comes the grid coordinates for any direction where the sampling is non-uniform, written in single-precision binary format. The main data portion then follows and it is a single-precision pure binary dump of the array wind(1:Nxwind,1:Nywind,1:Nzwind,1:n_var,1:Ntwind) (in column major order), where n_var=(u,v,T,[p]). If the pressure is not supplied, it will be computed from the hydrostatic balance. Double precision data can be also be used by specifying the parameter n_prec_wind=8 in the CGCAM source file globals.f.

Wind File Header Records

A 4D background.wind file begins with a 240 character ascii header record, organized into three 80-character lines (see the example below). Although white space between fields can be used arbitrarily, it is important to add sufficient white space at the end of each line so that they are each 80 characters in length. A good way to ensure this is to add a terminator character such as | in column 79 and then the newline character will bring the line to 80 characters. The first header line contains the case name, the wind grid type, a pressure data flag, and the time range.

The wind grid type is specified by the variable i_grid, which has a somewhat different meaning in this context as opposed when used as an input to the CGCAM program itself. The value of i_grid signifies the vertical grid point distribution strategy:

1:Absolute altitude for z points.
2:Above ground level for z points, using the local terrain elevation.
3:Above ground level using the underlying Cartesian grid.

The i_grid input can be either positive or negative with positive values indicating x-y space and negative values indicating latitude-longitude space for the horizontal point distributions. Note: This is relative to the grid.dat references SEE: Grid File Section. Grid File

The i_pressure input is 1 if the pressure is included in the data file. If the pressure is omitted, it will be computed from the Temperature via the hydrostatic balance. The time range for the wind data is given by the Time input and is in decimal hours (i.e. 10.50 is 10 hours, 30 minutes).

The second line indicates the number of data points in each direction and in time, and the third line gives the coordinate ranges for the directions where the sampling points are uniform. Directions with non-uniform sampling will have no data following the respective keyword and this will signify that the non-uniform point distribution is contained in the binary section of the wind file. The data ranges are specified either in kilometers or degrees depending on the sign of the i_grid input. The use of kilometers is for convenience and CGCAM will convert these to meters upon input. A slight inconsistency is that non-uniform sampling distributions (part of the binary data section) must be specified in meters.

Example

# Case = SAMPLE          i_grid = -2  i_pressure = 1   Time[   0.00:  47.00]  |
# Nxwind =    21   Nywind =    26   Nzwind =   205   Ntwind =    48           |
# x_range[   160.:   180.] y_range[  -57.5:  -32.5] z_range                   |

Here grid type -2 indicates an above ground reference from surface elevation in latitude-longitude space. The Pressure field is supplied (i_pressure=1), and the time range is 0 to 47 hours, inclusive of both end points. With Ntwind=48, this gives a sampling interval of one hour. Uniform latitude-longitude distributions are signified by the numerical inputs for x_range and y_range, whereas the missing inputs for z_range signifies that this direction is non-uniform. The z-point distribution is then specified immediately following the header in single-precision binary.

A complete background.wind file used for the example case (cgcam/examples/4Dwind) can be downloaded here https://boulder.gats-inc.com/background.wind

Generating the 4D wind Data

The main body of the 4D background.wind file contains the actual space and time dependent wind, temperature, and optionally pressure data, as the array wind(1:Nxwind,1:Nywind,1:Nzwind,1:n_var,1:Ntwind) (in column major order), where n_var=(u,v,T,[p]). If the pressure is not supplied, it will be computed from the hydrostatic balance. There are several possible sources for this data and these are described in detail below. The sources that we recommend are either reanalysis or climatology based, and they have different characteristics, samplings, and altitude ranges. In many cases it will be necessary to use multiple sources for a particular case in order to cover the entire altitude range of interest. Python scripts are used to read the native reanalysis data output files and then the fortran program merge_wind is used to join multiple data sets if required. The basic steps used to prepare the wind data can be summarized as follows:

  1. Download data from appropriate source/sources
  2. Edit the file input_convert.py to specify the data sources as well as the latitude, longitude, and time ranges.
  3. Run convert_assim_data.py to translate to binary data
  4. Specify source names inside merge_wind.f (dat_names and n_files) and recompile
  5. Specify options in merge_wind.inp
  6. Run merge_wind for interpolation and blending to create the final background.wind file
  7. Run stabilize_wind to increase the Richardson number by smoothing locally in the vertical

An encompassed script get_assim_data.sh is provided to execute the above procedure to generate the final wind file background.wind_stable. To select sources (DEFAULT: ERA5 NAVGEM JAWARA) and modify configurations, edit the file assim.config.

get_assim_data.sh -l/--list [input_base/input_base_list.txt]
   OPTIONS:
      -d          Download Only
      -p          Profile at lat/lon and make animations in time
      -f/--force  Force rerun of all data conversions
      -s/--skip   Skip all data conversions. Also skip [plot/movie] when profiling is active

Where the date/times are listed in input file input_base_list.txt:

# YYYY-MM-DD   UTC   dt   Latitude/Longitude  Orbit(Optional)
  2024-05-30   18    12   example             # 2953

The UTC time associated is the final time with dt previous hours. The orbit number can be used as the naming convention instead of the default date by setting output_base to orbit in the configuration. The data range in then specified in the file indicated above in the list as example_lat_lon.txt:

# LAT  LON
  -60  150
  -30  200

Note that the longitude crosses 180 degrees, therefore the range [0:360] is used. Generally, theses procedures use the convention [-180:180] degrees in longitude.

Profiling can also be enacted by specifying a single latitude and longitude

# YYYY-MM-DD   UTC   dt   Latitude/Longitude
  2024-05-30   18    12   -60      150

Requirements:

  1. ERA5 API account setup and install: https://cds.climate.copernicus.eu/how-to-api
  2. JAWARA (or other sources) data downloaded and path pointed to in get_assim_data.sh DEFAULT: Data storage on AWE /dataST
  3. Python Dependencies (below)
  4. CGCAM Directory DEFAULT: $HOME

A collection of wind profiling for AWE swaths can be found here: https://boulder.gats-inc.com/~adam/era_timeser/

Installing Python Dependencies

It will be necessary to install a few publicly available python packages before proceeding. In most cases it will be best to install these packages on a per user basis as opposed to system-wide global basis. To do this issue:

pip3 install --user numpy xarray[complete] cfgrib ambiance pyvista

If the system uses a virtual environment (i.e.. programming environment module on HPC systems) it is required to create another virtual layer to install packages on top of the existing software. This is a user based environment, therefore, --user should not be used.

python3 -m venv /path/to/my_env

This will create the directory my_env where local packages are installed. Note that it is possible to create separate virtual environments for different projects. To activate an environment, use the command:

source /path/to/my_env/bin/activate

If one environment is always desired, this command can be placed in bashrc/zshrc startup file. Once in the new virtual environment, packages can be installed normally. To check the installation list use:

pip3 list

To leave the virtual environment, use:

deactivate

Data Sources

Table 1 Source Overview
Source Link Temporal Lat/Lon Approx. Vertical Relevant Fields
ERA5 1 hr 0.25° [0:40] km U,V,T
JAWARA 1 hr 2.8125° [0:140] km U,V,T
MERRA2 3 hr 0.5/0.625° [0:70] km U,V,T
GFS 3 hr 0.25° [0:70] km U,V,T
NRLMSIS2.0 15 min 5/20° [0:1000] km T
HWM14 15 min [0:1000] km U,V
ICON 30 sec [80:300] km U,V,T

The left column in the table above contains web links to the respective data source websites. Each of these sites contains instructions on how to download their archived data files. Follow these instructions in order to download the files for the date, time, and region of interest.

The ERA5, MERRA2, and ICON data sources provide API methods (command line environment amenable to scripting) to download data files. The other sources have more primitive download methods but these are described in some detail on the respective websites.

Note that the various data sources have rather different spatial and temporal resolution as well as altitude ranges. Also note that the NRLMSIS2.0 source does not provide winds and that the HWM source does not provide temperature. The disparate altitudes as well as the absence of certain fields often require that multiple data sources be used and the results from these be merged together to create the desired wind file. Merging is accomplished via the merge_wind program and its use is described in some detail below.

While all of the data sources except ICON provide data from the surface to at least 40 km in altitude, there are important differences that will dictate a good choice for the winds at lower altitudes. As listed in Table 1, the ERA5 dataset has the best temporal and spacial resolution, and thus we recommend this source for cases where high fidelity for the low altitude winds are requires. We have also found that the ERA5 data appears to more realistic low level winds in areas of high terrain. This is an important feature for mountain wave simulations where accurate depiction of the near surface winds is critical for proper wave generation. While ERA5 may be the best choice for low level winds it only extends to an altitude of approximately 40 km, which will be much lower than the desired altitude range in most cases. The solution to this problem is to merge the ERA5 solution with one of the other data sources that extends higher in order to create a composite data file that covers the entire altitude range. If the desired upper boundary of the wind file is to be above 70 km, then Table 1 shows that only JAWARA, HWM, NRLMSIS, and ICON extend high enough. The HWM source is a spatially-coarse climatology that does not include temperature and thus must be augmented with one of the other models (Like NRLMSIS2.0, which only supplies temperature). Similarly, the ICON data set is inconvenient since it starts at 80 km, again requiring a third data set to fill in the altitude range between 40 and 80 km. This leaves the JAWARA data set, which contains both winds and temperature and extends from the surface to 140 km. The horizontal resolution is too coarse to be used near the surface, but it is probably adequate for heights above 40 km. Thus we recommend using the JAWARA source for the upper altitudes.

Converting from Native to CGCAM Data Format

To translate from native data formats (usually NetCDF or grib2) to the simple binary format that CGCAM reads, the python script, convert_assim_data.py is supplied. This script reads parameter selections such as the data source type and space and time data ranges from the input file input_convert.py. If multiple sources are desired with the same subset, they can be listed in the same input file, otherwise use separate files. A template input file can be found in the directory cgcam/input and an example input file is shown below.

import numpy as np

# Date used for file names and time subset
year = '2024'
month = '06'
day = '08'
date = np.datetime64('%s-%s-%s' % (year, month, day))

# Initialize dataset specification arrays
NAME = []
SOURCE = []
FILE_NAME = []
PATH = []

# Specify the datasets to be processed.  Four different datasets are
# specified in this example.  The required inputs are the dataset name,
# a pattern for the data file(s) and the path to the data files.  The
# data file path can be either relative or absolute.
name='jawara'
NAME.append(name)
SOURCE.append(name)
FILE_NAME.append(str(year)[2:4] + str(month)  + '.nc')
PATH.append('../data/'+name)

# Here multiple files are required so we define a base file name for
# all files on a certain day/month.
name='gfs'
NAME.append(name)
SOURCE.append(name)
#FILE_NAME.append('gfs.0p25.%s%s%s' % (year, month, day))
FILE_NAME.append('gfs.0p25.%s%s' % (year, month))
PATH.append('../data/'+name)

name='merra'
NAME.append(name)
SOURCE.append(name)
FILE_NAME.append('MERRA2_400.inst3_3d_asm_Nv.%s%s%s.nc4' % (year, month, day))
PATH.append('../data/'+name)

name='era'
NAME.append(name)
SOURCE.append(name)
FILE_NAME.append('era.%s%s%s.grib' % (year, month, day))
PATH.append('../data/'+name)

# Optionally define variables to be written
#VARIABLES=['u','v','t']

# Define subsets in space and time. None indicates that all available
# data be used (i.e. no subset or interpolation is preformed for that
# field) Time should be specified in numpy datetime64 format unless the
# data file contains no date and time information.  Here the time starts
# at 11:00 UTC on 06/08/2024 and extends until 23:00 UTC.
t0 = date + np.timedelta64(11,'h')
t1 = date + np.timedelta64(23,'h')

lat0 = -56.0
lat1 = -44.0

lon0 = 60.00
lon1 = 80.00

# Use all available vertical data levels.
z0 = None
z1 = None

# Write data files.
write_data = True

The convert_assim_data.py script also accepts a few additional optional arguments, but these should generally only be used for testing and diagnostic purposes. Similar options with better support are available when running the merge_wind program (described below). For completeness the convert_assim_data.py optional arguments are listed below

# Interpolate onto subset with spacings
interp_method = 'linear'
dt = np.timedelta64(1,'h')
dlat = 0.25
dlon = 0.25
dz = None

# Write vertical profile
write_prof = True
t_prof = date + np.timedelta64(18,'h')
lat_prof = -49.3
lon_prof = 69.2

# Write contour. If z_contour_mean_top is specified, a vertical average will
# be preformed using levels of data rather than z coordinates for each point
write_contour = True
z_contour = 0.5
#z_contour_mean_top = 1.5

# Find the profile with minimum Richardson Number
write_min_stability = True

Creating a logical directory structure will help keep everything organized when building a 4D wind file. An example is provided at https://boulder.gats-inc.com/wind, and a graphical illustration of the directory tree structure is shown below.

wind
├── 1D
│   ├── convert_assim_data.py
│   ├── input_convert.py
├── 4D
│   ├── convert_assim_data.py
│   ├── input_convert.py
└── data
    ├── era
    │   ├── era.20250602.grib
    └── jawara
        ├── T2506.nc
        ├── U2506.nc
        ├── V2506.nc
        └── Z2506.nc

Here the wind directory contains the 1D and 4D subdirectories where 1D and 4D background.wind files will be generated. It also contains the data directory, which will contain data files from both ERA5 and JAWARA sources. We start by populating the 1D and 4D directories with the python script files and loading the ERA5 and JAWARA data files into their respective data directories. Next we edit the input_convert.py to specify the data sources, their file name structures, paths, and also specify the space and time subsets. Now with the data files present and the input file configured, we can move into either the 1D or 4D directory and run the convert_assim_data.py script by issuing

python3 convert_assim_data.py

The convert_assim_data.py script will write several files, but most importantly it will produce files named like background.datasetName, which contain the extracted subset data sampled at the native resolution of the respective dataset. In this example the files background.era and background.jawara will be produced.

The Merge_wind Program

The merge_wind program is used to create a composite wind file from the individual background files produced by the convert_assim_data.py script. It will interpolate the data in space and time as necessary and apply a weighted average blending in the vertical direction in order to stitch together data with disparate altitude ranges. The merge_wind program uses a preprocessor directive, n_files, to set the number of data sources that will be merged together. This variable along with the hard-coded dat_names will need to be modify inside the source file merge_wind.f and recompiled to change sources. Example merge_wind.inp input files are contained in the 1D and 4D example directories. The user supplied inputs for the 4D case are listed below

# Data sources names are defined in merge_wind.f(dat_names/n_files)
# Recompile merge_wind to modify these inputs
n_fields      3            Number of fields in data files to process
zo0           0            Altitude of bottom of domain
zo1           130k         Altitude of top of domain
dz0           1k           Mesh spacing at zo0
sf_max        1.008        Maximum stretching factor for z distribution
i_hydrostatic 1            1-Get pressure from hydrostatic balance 0-do not
i_atmos_vary  0            1-Read background.gas for atmos state 0-Assume constant
ref_p         55000        Reference pressure at ref_k
ref_k         15           Reference index in z direction
ref_lon       68.0         Reference Longitude
ref_lat      -49.5         Reference Latitude
ref_itime     2            Reference time index(triggers profile)
z_blend1      30k          Center of blending region(+- dz_blend)
dz_blend1     6k           Blending distance from center z_blend
n_smooth_x    1            Smoothing passes in x
n_smooth_y    1            Smoothing passes in y
n_smooth_z    1            Smoothing passes in z
z_xy_vtk1     70k          Altitude of xy vtk plane
z_xy_vtk2      2k          Altitude of xy vtk plane
z_xy_avg2     10k          Altitude for top of arithmetic mean starting at z_xy_vtk
i_project     1            1-Cartesian projection for vtks using ref_lat/lon as tangent 0-lat/lon

First set the number of input fields contained in the data, n_fields: 1-T, 2-u/v, 3-u/v/T, 4-u/v/T/p. For data sets where temperature is gathered from a different source(MSIS), use the union of fields (ie. n_fields=3 when u/v are provided). The i_hydrostatic variable controls whether or not the hydrostatic balance is used to compute the pressure field. If the pressure is to be computed then it is necessary to set a reference pressure. The data is assumed to be computed on constant pressure levels, therefore the inputs ref_k and ref_p set the pressure to constant ref_p for the plane of data at k=ref_k. The z distribution is defined by limits zo0/zo1 with spacing dz0 at zo0. To enable stretching, sf_max can be specified, otherwise a uniform vertical spacing is assumed. References ref_lon/ref_lat/ref_itime specify the space and time for a 1D output profile. Profiles are blended over the interval z_blend +/- dz_blend. The number of blending regions should be specified for each source to be joined (n_files-1). A three point smoothing operated can be applied independently in each n_smooth direction. Vtk outputs can be specified in the xy direction for sources/output in the altitude range. A vertical mean, starting at z_xy_vtk, can be specified via upper altitude limit z_xy_avg. If i_project=1, then the longitude-latitude points will be visualized in x-y Cartesian space using reference ref_lon and ref_lat.

The merge_wind program should be run from the directory where the convert_assim_data.py script was run.

Generating 1D profiles

There will be situations where only a 1D (function of z) wind profile is required. One way to do this is to use the program probe_wind to extract a 1D profile from a 4D wind file at a specified horizontal location and time. This approach is ideal for cases where several 1D profiles are desired. A second method is to specify fixed longitude, latitude, and time values in the convert_assim_data.py script and proceed as described above. While the resulting background.wind file will still be written in binary format, but the merge_wind program also writes the ascii file background1D.wind and this file will contain the same data in a more convenient format.

CIRA Climatology

A climatology data base is available for users but should probably only be used for preliminary analysis or profile approximation due to its limitations. The data set contains zonal winds and temperature at 1 degree lines of latitude. The latitude range is [-80:80] degrees and the altitude range range is approximately [0:120] km.

Data generated from Committee on Space Research (COSPAR) International Reference Atmosphere (CIRA) 1986 model and precursor to:
Fleming, E. L., S. Chandra, M. R. Shoeberl, and J. J. Barnett. Monthly Mean Global Climatology of Temperature, Wind, Geopotential Height and Pressure for 0-120 km. NASA TM100697, February 1988.

To generate profiles from the CIRA climatology, build and run the program climatology.

CGCAM Input Options

i_read_w:2-read 4D, 1-read 1D, 0-generate internally
t_adjust_wind:Time to start data time variation (Use: ramping with fixed winds)
n_frame0_w:Starting wind time frame index
i_t_interp_wind:
 Temporal interpolation order 1-Linear, 3-Cubic(requires background.spline)
n_skip_b:Skip factor for background output (diagnostic)

To reduce transient solutions, it is recommended to use a thrid-order temporal interpolation. Due to the high memory requirements to calculate a cubic spline for the entire wind duration, the spline coefficients are written by the program spline_fit_wind to a seperate file background.spline. The following bash can be added to a batch script to produce the file:

#!/bin/bash
program='spline_fit_wind'

input='cgcam.inp'
cgcam_dir="$HOME/cgcam/src"

# Check for i_read_w=2
read_w=($(grep i_read_w $input))
if [[ ${read_w[1]} == 2 ]]; then
   # Check for thrid order interpolation i_t_interp_wind=3
   spline=($(grep i_t_interp_wind $input))
   if [[ $spline != '#'* ]]; then
      if [[ ${spline[1]} == 3 ]]; then
         # Check that spline_fit_wind exists
         if [[ ! -x ./$program ]]; then
            # Build program if DNE
            if [[ ! -x $cgcam_dir/$program ]]; then
               wrk_dir=$(pwd)
               cd $cgcam_dir
               make $program
               cd $wrk_dir
            fi
            ln -s $cgcam_dir/$program .
         fi
         # Run spline_fit_wind to produce background.spline
         echo "Running ./$program"
         ./$program
      fi
   fi
fi

#mpiexec ./cgcam

Input options for boundary conditions, diffusion, hydrostatic balance, and ramping function (hyperbolic tangent) are strict for terrain cases.

Grid File

Similar to the background.wind file, the format of the grid.dat file has been altered for data assimilation cases. The alteration amounts to the addition of meta data-containing ascii header records at the beginning of the file, which allow for easy consistency checking between the simulation grid and the wind-data grid. The grid file header records will be included for a terrain grid if the input i_header=1 is added to the terrain_mesh.inp file:

i_header:1-write header, 0-no header (default)

Older grid files without the header records can still be used for data assimilation cases, but full consistency may not be guaranteed.