Commit fbb50345 authored by malenovsky's avatar malenovsky

Merge branch '83-provide-osba-item-generation-script' into 'main'

Add OMASA/OSBA item generation scripts

See merge request !171
parents 1bd00fe3 e575d8cb

.flake8

0 → 100644
+4 −0
[flake8]
max-line-length = 88
ignore = E203,E402,E501,E741,W503,W504
exclude = .git,__pycache__,build,dist
 No newline at end of file
+1 −1
@@ -121,7 +121,7 @@ lint:
    - linux
  allow_failure: true
  script:
    - flake8 --max-line-length 88 --extend-ignore=E203,E402,E501,E741
    - flake8 --config .flake8

format:
  stage: analyze
+22 −9
@@ -55,21 +55,34 @@ In the following sections the only purpose of the curly brackets is to mark the
## P800

The setup for a P800 test from the experiments folder consists of two steps:
item generation and item processing. The two steps can be applied independent of each other.
item generation and item processing. The two steps can be applied independently of each other.

### Item generation

To set up the P800-{X} listening test (X = 1, 2, ...9) copy your mono input files to `experiments/selection/P800-{X}/gen_input/items_mono`.
These files have to follow the naming scheme `{l}{LL}p0{X}{name_of_item}` where 'l' stands for the listening lab designator: a (Force Technology),
b (HEAD acoustics), c (MQ University), d (Mesaqin.com), and 'LL' stands for the language: EN, GE, JP, MA, DK, FR.
To facilitate the preparation of items for P800-{X} listening tests, it is possible to generate samples of complex formats (STEREO, SBA, ISMn, OMASA, OSBA) from mono samples. To generate items, run the following command from the root of the repository:

The impulse responses have to be copied to `experiments/selection/P800-{X}/gen_input/IRs`.
```bash
python generate_items.py --config path/to/scene_description_config_file.yml
```

The YAML configuration file (`scene_description_config_file.yml`) defines how individual mono files should be spatially positioned and combined into the target format. For advanced formats like OMASA or OSBA, note that additional SBA items may be required. Refer to the `examples/` folder for template `.yml` files demonstrating the expected structure and usage.

Relative paths are resolved from the working directory, not from the location of the YAML file; use absolute paths if you are unsure. Avoid dots `.` in file names (e.g., use `item_xxa3s1.wav`, not `item.xx.a3s1.wav`). On Windows, use double backslashes `\\` in paths and add `.exe` to executable definitions if needed. Input and output files follow structured naming conventions that encode metadata such as lab, language, and speaker ID; these are explained in detail under *Filename conventions*.
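As an illustration of the filename convention (not part of the scripts themselves), the fields of an input mono filename `lLLeeettszz.wav` can be pulled apart with a small parser; the function name `parse_input_name` is hypothetical:

```python
import re

# Field widths per the "Filename conventions" section:
# l=1 (lab), LL=2 (language), eee=3 (experiment), tt=2 (talker), s + zz=2 (sample).
INPUT_NAME_RE = re.compile(
    r"^(?P<lab>[abcd])"          # listening lab designator: a, b, c, d
    r"(?P<language>[A-Z]{2})"    # language code, e.g. EN, GE, JP
    r"(?P<experiment>p0[1-9])"   # experiment designator, e.g. p01
    r"(?P<talker>[fm][1-9])"     # talker ID, e.g. f1, m3
    r"s(?P<sample>\d{2})"        # 's' plus sample number, e.g. s01
    r"\.wav$"
)

def parse_input_name(filename: str) -> dict:
    """Return the metadata fields encoded in an input mono filename."""
    match = INPUT_NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"does not follow the lLLeeettszz.wav convention: {filename}")
    return match.groupdict()
```

For example, `parse_input_name("aENp01f1s01.wav")` yields the lab, language, experiment, talker, and sample fields as a dict.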

Each entry under `scenes:` describes one test item, specifying:

* `output`: output file name
* `description`: human-readable description
* `input`: list of mono `.wav` files
* `azimuth` / `elevation`: spatial placement (°)
* `level`: loudness in dB
* `shift`: timing offsets in seconds
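The azimuth/elevation convention (right-handed, azimuth 0 at the front, positive azimuth to the left, positive elevation up) can be made concrete with a small sketch that maps a placement to a unit direction vector; this helper is illustrative only, not part of the scripts:

```python
import math

def direction_vector(azimuth_deg: float, elevation_deg: float) -> tuple:
    """Convert (azimuth, elevation) in degrees to a unit direction vector
    in a right-handed frame: x = front, y = left, z = up."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = math.cos(el) * math.cos(az)  # front component
    y = math.cos(el) * math.sin(az)  # left component (positive azimuth -> left)
    z = math.sin(el)                 # up component (positive elevation -> up)
    return (x, y, z)
```

So azimuth 0, elevation 0 points straight ahead, azimuth 90 points left, and elevation 90 points straight up.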

Dynamic positioning (e.g., `"-20:1.0:360"`) means the source will move over time, stepping every 20 ms.
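A `"start:step:stop"` specification can be read as one position per 20 ms frame. The sketch below (the real parsing lives in the generation scripts; the clamping at the end point is an assumption here) shows how such a trajectory expands:

```python
FRAME_MS = 20  # the position steps once per 20 ms frame

def expand_trajectory(spec: str, duration_s: float) -> list:
    """Expand a 'start:step:stop' spec into per-frame positions,
    clamping at the stop value (assumed end-point behaviour)."""
    start, step, stop = (float(x) for x in spec.split(":"))
    n_frames = int(duration_s * 1000 / FRAME_MS)
    values = []
    pos = start
    for _ in range(n_frames):
        values.append(pos)
        pos += step
        # clamp at the stop value once reached (assumption)
        pos = min(pos, stop) if step > 0 else max(pos, stop)
    return values
```

For example, `"0:1:360"` advances one degree per frame, i.e. 50 degrees per second.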

To generate the items run `python -m ivas_processing_scripts.generation experiments/selection/P800-{X}/config/item_gen_P800-{X}_{l}.yml` from the root folder of the repository.
The resulting files can be found in `experiments/selection/P800-{X}/proc_input_{l}` sorted by category.
The total duration of the output signal can be controlled using the `duration` field. The output signal may optionally be rendered to the BINAURAL format by specifying the `binaural_output` field.
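Fixing the total duration amounts to trimming or padding the signal to `duration * fs` samples. A minimal sketch, assuming a hypothetical helper name and plain sample lists:

```python
def fit_to_duration(samples: list, fs: int, duration_s: float) -> list:
    """Trim or zero-pad a sample list to exactly duration_s seconds at rate fs."""
    target = int(round(duration_s * fs))
    if len(samples) >= target:
        return samples[:target]          # trim the excess
    return samples + [0.0] * (target - len(samples))  # pad with silence
```

With `duration: 8` and `fs: 48000`, every output ends up exactly 384000 samples long.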

For P800-3, the input files for the processing are already provided by the listening lab, so this step can be skipped.
For tests with ISM input format (P800-6 and P800-7), no IRs are needed, only mono sentences.
Start by running a single scene to verify settings. Output includes both audio and optional metadata files. You can enable multiprocessing by setting `multiprocessing: true`.

### Item processing

+177 −0
---
################################################
# Item generation - General configuration
################################################

### Any relative paths will be interpreted relative to the working directory the script is called from!
### Usage of absolute paths is recommended.
### Do not use file names with dots "." in them! This is not supported, use "_" instead
### For Windows users: please use double back slash '\\' in paths and add '.exe' to executable definitions

### Output format
format: "ISM3"
# masa_tc: 2        # applicable only to OMASA format
# masa_dirs: 2      # applicable only to OMASA format
# sba_order: 2      # applicable only to OSBA format

### Output sampling rate in Hz
fs: 48000

### Generate BINAURAL output (_BINAURAL will be appended to the output filename)
binaural_output: true

### Normalize target loudness to X LKFS 
# loudness: -26

### Apply pre-amble and post-amble in X seconds 
preamble: 0.0
postamble: 0.0

### Apply fade-in and fade-out of X seconds 
fade_in_out: 0.5

### Trim the output such that the total duration is X seconds
duration: 8

### Add low-level random background noise (amplitude ±4) instead of silence; default = false (silence)
add_low_level_random_noise: true

### Process with parallel streams
multiprocessing: False

################################################
### Item generation - Filename conventions
################################################

### Naming convention for the input mono files
### The input filenames are represented by:
###   lLLeeettszz.wav
### where: 
###   l stands for the listening lab designator: a (Force Technology), b (HEAD acoustics), c (MQ University), d (Mesaqin.com) 
###   LL stands for the language: JP, FR, GE, MA, DA, EN
###   eee stands for the experiment designator: p01, p02, p04, p05, p06, p07, p08, p09
###   tt stands for the talker ID: f1, f2, f3, m1, m2, m3
###   s stands for 'sample' and zz is the sample number; 01, ..., 14

### Naming convention for the generated output files
### The output filenames are represented by:
###   leeeayszz.wav
### The filenames of the accompanying output metadata files (applicable to metadata-assisted spatial audio, object-based audio) are represented by:
###   leeeayszz.met for metadata-assisted spatial audio
###   leeeayszz.wav.o.csv for object-based audio
### where: 
###   l stands for the listening lab designator: a (Force Technology), b (HEAD acoustics), c (MQ University), d (Mesaqin.com) 
###   eee stands for the experiment designator: p01, p02, p04, p05, p06, p07, p08, p09
###   a stands for 'audio'
###   y is the per-experiment category according to IVAS-8a: 01, 02, 03, 04, 05, 06
###   s stands for sample and zz is the sample number; 01, 02, 03, 04, 05, 06, 07 (07 is the preliminary sample)
###   o stands for the object number; 0, 1, 2, 3

### File designators, default is "l" for listening lab, "EN" for language, "p07" for experiment and "g" for company
listening_lab: "l"
language: "EN"
exp: "p01"
provider: "va"

### Insert prefix for all input filenames (default: "")
### l stands for the 'listening_lab' designator, L stands for the 'language', e stands for the 'experiment' 
### the number of consecutive letters defines the length of each field
# use_input_prefix: "lLLeee"

### Insert prefix for all output filenames (default: "")
### l stands for the 'listening_lab' designator, L stands for the 'language', e stands for the 'experiment' 
### the number of consecutive letters defines the length of each field
# use_output_prefix: "leee"

################################################
### Item generation - Scene description
################################################

### Each scene shall be described using the following parameters/properties:
###   output:      output filename
###   description: textual description of the scene
###   input:       input filename(s)
###   azimuth:     azimuth in the range [-180,180]; positive values point to the left
###   elevation:   elevation in the range [-90,90]; positive values indicate up
###   shift:       time adjustment of the input signal (negative value delays the signal)
###
### Note 0: you can use relative paths in filenames (the program assumes that the root directory is the parent directory of the ivas_processing_scripts subfolder)
### Note 1: use brackets [val1, val2, ...] when specifying multiple values 
### Note 2: use the "start:step:stop" notation for moving sources, where step will be applied in 20ms frames
### Note 3: we're using a right-handed coordinate system with azimuth = 0 pointing from the nose to the screen


scenes:

    "01": 
        output: "out/VA_3obj_2tlks_music1.wav"
        description: "Two talkers sitting at a table, at different azimuth angles with respect to the microphone, ~30% overlapping utterances."
        input: ["items_mono/untrimmed/f2s1a_Talker1.wav", "items_mono/untrimmed/m2s10a_Talker2.wav", "items_mono/music/Sc01.wav"]
        azimuth: [20, -40, 45]
        elevation: [0, 0, 70]
        level: [-26, -26, -41]
        shift: [-1.0, -2.0, 2.0]
        
    "02":
        output: "out/VA_3obj_2tlks_music2.wav"
        description: "One talker sitting at a table, second talker walking around the table, ~30% overlapping utterances."
        input: ["items_mono/untrimmed/f5s10b_Talker1.wav", "items_mono/untrimmed/m3s2b_Talker2.wav", "items_mono/music/Guitar1.wav"]
        azimuth: [50, "180:1:120 + 360", -120]
        elevation: [0, 45, 70]
        level: [-26, -26, -41]
        shift: [1.0, -2.0, -1.0] 
        
    "03":
        output: "out/VA_3obj_2tlks_music3.wav"
        description: "Two talkers walking side-by-side around the table, ~30% overlapping utterances."
        input: ["items_mono/untrimmed/m1s2b_Talker1.wav", "items_mono/untrimmed/f3s5a_Talker2.wav", "items_mono/music/Track066.wav"]
        azimuth: ["80:1:20 + 360", "80:1:20 + 360", -30]
        elevation: [10, 60, 70]
        level: [-26, -26, -41]
        shift: [0.0, 0.0, 0.0] 

    "04":
        output: "out/VA_3obj_2tlks_music4.wav"
        description: "Two talkers walking around the table in opposite directions, ~30% overlapping utterances."
        input: ["items_mono/untrimmed/m4s12b_Talker1.wav", "items_mono/untrimmed/f1s12b_Talker2.wav", "items_mono/music/Sample02.wav"]
        azimuth: ["60:1:0 + 360", "60:-1:120 - 360", 100]
        elevation: [20, 50, 70]
        level: [-26, -26, -41]
        shift: [0.0, 0.0, 0.0] 
        
    "05":
        output: "out/VA_3obj_3tlks_1.wav"
        description: "Three static talkers, partially overlapping utterances."
        input: ["items_mono/untrimmed/m4s12b_Talker1.wav", "items_mono/untrimmed/f1s12b_Talker2.wav", "items_mono/untrimmed/m3s1a_Talker2.wav"]
        azimuth: [30, -45, 100]
        elevation: [20, 20, 30]
        level: [-26, -26, -26]
        shift: [0.0, 0.0, -2.5] 
        
    "06":
        output: "out/VA_3obj_3tlks_2.wav"
        description: "One walking talker, two static talkers, non-overlapping utterances."
        input: ["items_mono/untrimmed/f2s5a_Talker1.wav", "items_mono/untrimmed/m2s16b_Talker2.wav", "items_mono/untrimmed/m3s8b_Talker2.wav"]
        azimuth: ["-20:0.5:360", 60, -45]
        elevation: [10, 10, 10]
        level: [-26, -26, -26]
        shift: [0.0, 0.0, -3.0] 
        
    "07":
        output: "out/VA_3obj_3tlks_3.wav"
        description: "Two moving talkers, one static talker, partially overlapping utterances."
        input: ["items_mono/untrimmed/f1s16b_Talker2.wav", "items_mono/untrimmed/m4s16a_Talker1.wav", "items_mono/untrimmed/f3s10b_Talker2.wav"]
        azimuth: [-90, "0:1:360", "0:-1:-360"]
        elevation: [0, 30, 30]
        level: [-26, -26, -26]
        shift: [0.0, 0.0, -3.0] 

    "08":
        output: "out/VA_3obj_3tlks_4.wav"
        description: "Three walking talkers, partially overlapping utterances."
        input: ["items_mono/untrimmed/f5s15b_Talker1.wav", "items_mono/untrimmed/m3s1a_Talker2.wav", "items_mono/untrimmed/m2s17b_Talker2.wav"]
        azimuth: ["-90:-1:-360", "-10:1.5:360", "70:1:360"]
        elevation: [0, 20, 0]
        level: [-26, -26, -26]
        shift: [0.0, 0.0, -3.5] 
+160 −0
---
################################################
# Item generation - General configuration
################################################

### Any relative paths will be interpreted relative to the working directory the script is called from!
### Usage of absolute paths is recommended.
### Do not use file names with dots "." in them! This is not supported, use "_" instead
### For Windows users: please use double back slash '\\' in paths and add '.exe' to executable definitions

### Output format
format: "FOA"
# masa_tc: 2        # applicable only to OMASA format
# masa_dirs: 2      # applicable only to OMASA format
# sba_order: 2      # applicable only to OSBA format

### Output sampling rate in Hz
fs: 48000

### Generate BINAURAL output (_BINAURAL will be appended to the output filename)
binaural_output: true

### Normalize target loudness to X LKFS 
loudness: -26

### Apply pre-amble and post-amble in X seconds 
preamble: 0.5
postamble: 1.0

### Apply fade-in and fade-out of X seconds 
fade_in_out: 0.5

### Trim the output such that the total duration is X seconds
duration: 8

### Add low-level random background noise (amplitude ±4) instead of silence; default = false (silence)
add_low_level_random_noise: False

### Process with parallel streams
multiprocessing: False

################################################
### Item generation - Filename conventions
################################################

### Naming convention for the input mono files
### The input filenames are represented by:
###   lLLeeettszz.wav
### where: 
###   l stands for the listening lab designator: a (Force Technology), b (HEAD acoustics), c (MQ University), d (Mesaqin.com) 
###   LL stands for the language: JP, FR, GE, MA, DA, EN
###   eee stands for the experiment designator: p01, p02, p04, p05, p06, p07, p08, p09
###   tt stands for the talker ID: f1, f2, f3, m1, m2, m3
###   s stands for 'sample' and zz is the sample number; 01, ..., 14

### Naming convention for the generated output files
### The output filenames are represented by:
###   leeeayszz.wav
### The filenames of the accompanying output metadata files (applicable to metadata-assisted spatial audio, object-based audio) are represented by:
###   leeeayszz.met for metadata-assisted spatial audio
###   leeeayszz.wav.o.csv for object-based audio
### where: 
###   l stands for the listening lab designator: a (Force Technology), b (HEAD acoustics), c (MQ University), d (Mesaqin.com) 
###   eee stands for the experiment designator: p01, p02, p04, p05, p06, p07, p08, p09
###   a stands for 'audio'
###   y is the per-experiment category according to IVAS-8a: 01, 02, 03, 04, 05, 06
###   s stands for sample and zz is the sample number; 01, 02, 03, 04, 05, 06, 07 (07 is the preliminary sample)
###   o stands for the object number; 0, 1, 2, 3

### File designators, default is "l" for listening lab, "EN" for language, "p07" for experiment and "g" for company
listening_lab: "b"
language: "GE"
exp: "p02"
provider: "g"

### Insert prefix for all input filenames (default: "")
### l stands for the 'listening_lab' designator, L stands for the 'language', e stands for the 'experiment' 
### the number of consecutive letters defines the length of each field
# use_input_prefix: "lLLeee"

### Insert prefix for all output filenames (default: "")
### l stands for the 'listening_lab' designator, L stands for the 'language', e stands for the 'experiment' 
### the number of consecutive letters defines the length of each field
use_output_prefix: "leee"

################################################
### Item generation - Scene description
################################################

### Each scene shall be described using the following parameters/properties:
###   output:      output filename
###   description: textual description of the scene
###   input:       input filename(s)
###   IR:          filenames(s) of the input IRs 
###   azimuth:     azimuth in the range [-180,180]; positive values point to the left
###   elevation:   elevation in the range [-90,90]; positive values indicate up
###   shift:       time adjustment of the input signal (negative value delays the signal)
###
### Note 0: you can use relative paths in filenames (the program assumes that the root directory is the parent directory of the ivas_processing_scripts subfolder)
### Note 1: use brackets [val1, val2, ...] when specifying multiple values 
### Note 2: use the "start:step:stop" notation for moving sources, where step will be applied in 20ms frames
### Note 3: we're using a right-handed coordinate system with azimuth = 0 pointing from the nose to the screen


scenes:
    "01": 
        output: "out/s01.wav"
        description: "Car with AB microphone pickup, no overlap between the talkers, car noise."
        input: ["items_mono/untrimmed/f1s4b_Talker2.wav", "items_mono/untrimmed/f2s1a_Talker1.wav"]
        IR: ["IRs/IR_do_p04_e_01_01_FOA.wav", "IRs/IR_do_p04_e_02_01_FOA.wav"]
        shift: [0.0, -1.0]
        
    "02": 
        output: "out/s02.wav"
        description: "Car with AB microphone pickup, overlap between the talkers, car noise."
        input: ["items_mono/untrimmed/f1s6a_Talker2.wav", "items_mono/untrimmed/f2s3b_Talker1.wav"]
        IR: ["IRs/IR_do_p04_e_03_01_FOA.wav", "IRs/IR_do_p04_e_04_01_FOA.wav"]
        shift: [0.0, +1.0]
        
    "03": 
        output: "out/s03.wav"
        description: "Car with AB microphone pickup, no overlap between the talkers, car noise."
        input: ["items_mono/untrimmed/f3s3a_Talker2.wav", "items_mono/untrimmed/f3s10b_Talker2.wav"]
        IR: ["IRs/IR_do_p04_e_05_01_FOA.wav", "IRs/IR_do_p04_e_06_01_FOA.wav"]
        shift: [0.0, -1.0]
        
    "04": 
        output: "out/s04.wav"
        description: "Car with AB microphone pickup, no overlap between the talkers, car noise."
        input: ["items_mono/untrimmed/f2s7b_Talker1.wav", "items_mono/untrimmed/f5s15a_Talker1.wav"]
        IR: ["IRs/IR_do_p04_e_07_01_FOA.wav", "IRs/IR_do_p04_e_08_01_FOA.wav"]
        shift: [0.0, -1.0]
        
    "05": 
        output: "out/s05.wav"
        description: "Car with AB microphone pickup, no overlap between the talkers, car noise."
        input: ["items_mono/untrimmed/m2s15a_Talker2.wav", "items_mono/untrimmed/m1s4a_Talker1.wav"]
        IR: ["IRs/IR_do_p04_e_07_01_FOA.wav", "IRs/IR_do_p04_e_01_01_FOA.wav"]
        shift: [0.0, -1.0]
        
    "06": 
        output: "out/s06.wav"
        description: "Car with AB microphone pickup, no overlap between the talkers."
        input: ["items_mono/untrimmed/m3s8a_Talker2.wav", "items_mono/untrimmed/m4s13a_Talker1.wav"]
        IR: ["IRs/IR_do_p04_e_03_01_FOA.wav", "IRs/IR_do_p04_e_01_01_FOA.wav"]
        shift: [0.0, -1.0]
         
    "07": 
        output: "out/s07.wav"
        description: "Preliminary: Car with AB microphone pickup, no overlap between the talkers."
        input: ["items_mono/untrimmed/f1s20a_Talker2.wav", "items_mono/untrimmed/f5s15b_Talker1.wav"]
        IR: ["IRs/IR_do_p04_e_02_01_FOA.wav", "IRs/IR_do_p04_e_07_01_FOA.wav"]
        shift: [0.0, -1.0]
         
    "08": 
        output: "out/s08.wav"
        description: "Car with AB microphone pickup, overlap between the talkers."
        input: ["items_mono/untrimmed/m2s6b_Talker2.wav", "items_mono/untrimmed/f5s14a_Talker1.wav"]
        IR: ["IRs/IR_do_p04_e_08_01_FOA.wav", "IRs/IR_do_p04_e_04_01_FOA.wav"]
        shift: [0.0, +1.0]