Commit e51ef837 authored by Archit Tamarapu's avatar Archit Tamarapu


Merge branch 'main' of ssh://forge.3gpp.org:29419/ivas-codec-pc/ivas-processing-scripts into update-configs-20250910
parents 2d157343 6281ceb1
+7 −2
@@ -118,9 +118,14 @@ After the processing is finished, the outputs will be present in the respective

  - These scripts collect items from each experiment's `proc_output*` folder(s) and put the needed files for the listening test into a `proc_final` folder. This folder needs to be uploaded for the dry run and for the final delivery of the listening items to the labs.

-### Hash generation
+### Hash generation and checking for duplicates

-The hashes for the `proc_final` can be generated using the [get_md5.py](other/get_md5.py) script:
+The hashes for the `proc_final` can be generated using the [get_md5.py](other/get_md5.py) script.
+This script also checks for identical hashes and thus identifies duplicates in the output files, which are reported in a printout.
+When generating hashes, check whether any duplicates are reported and, if so, which files are identical. Note that duplicates between the actual test and the preliminaries/training can occur and are OK.
+If three or more items are identical, or if two items are identical within the test or within the preliminaries, check the input files for duplicates.
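As a rough illustration of this check (the function and file names below are made up for the example, not the script's actual API), duplicates can be found by grouping files by their MD5 digest:

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of(path: Path) -> str:
    # Digest of the raw file bytes, as md5sum would compute it.
    return hashlib.md5(path.read_bytes()).hexdigest()


def find_duplicates(files):
    # Group files by digest; any group with more than one member
    # consists of byte-identical files.
    by_hash = defaultdict(list)
    for f in files:
        by_hash[md5_of(f)].append(f)
    return {h: fs for h, fs in by_hash.items() if len(fs) > 1}
```

Two byte-identical items land under the same digest and are reported together; a single such pair shared between the test and the preliminaries/training would be the acceptable case described above.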

Script usage:

```shell
> python other/get_md5.py --help
+19 −5
@@ -44,17 +44,31 @@ def get_hash_line_for_file(file: Path, output_dir: Path):
    return hashline


+def get_duplicates(hashlines: list) -> dict:
+    counts = Counter([line.split()[-1] for line in hashlines])
+    duplicates = {}
+    for hash, count in counts.items():
+        if count == 1:
+            continue
+
+        files = [line.replace(hash, "").strip() for line in hashlines if hash in line]
+        duplicates[hash] = files
+
+    return duplicates
+
+
def main(output_dir, out_file):
    wav_files = sorted(output_dir.glob("*/**/*c[0-9][0-9].wav"))

    hashlines = [get_hash_line_for_file(f, output_dir) for f in wav_files]
-    count = Counter([line.split()[-1] for line in hashlines])
-    duplicates = [line for line in hashlines if count[line.split()[-1]] != 1]
+    duplicates = get_duplicates(hashlines)

    if len(duplicates) != 0:
-        print("Found duplicate hashes in these lines:")
-        for dup in duplicates:
-            print(dup)
+        print(
+            "Found duplicate hashes! The following hashes were found in multiple files:"
+        )
+        for hash, files in duplicates.items():
+            print(f"{hash} - {', '.join(files)}")

    with open(out_file, "w") as f:
        f.writelines(hashlines)
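The new `get_duplicates` helper can be sanity-checked with a couple of hand-made hash lines; judging by `line.split()[-1]`, the hash is the last whitespace-separated token of each line, so the sketch below assumes a `<path> <hash>` layout (the paths and hashes are made up):

```python
from collections import Counter


def get_duplicates(hashlines: list) -> dict:
    # Same logic as in the commit, with the loop variables renamed:
    # count hash occurrences, then collect the files for every hash
    # that appears more than once.
    counts = Counter([line.split()[-1] for line in hashlines])
    duplicates = {}
    for hash_, n in counts.items():
        if n == 1:
            continue
        files = [line.replace(hash_, "").strip() for line in hashlines if hash_ in line]
        duplicates[hash_] = files
    return duplicates


# Made-up hash lines: two files share the digest "deadbeef".
hashlines = [
    "proc_final/item01_c01.wav  deadbeef",
    "proc_final/item02_c01.wav  deadbeef",
    "proc_final/item03_c01.wav  cafef00d",
]
print(get_duplicates(hashlines))
# → {'deadbeef': ['proc_final/item01_c01.wav', 'proc_final/item02_c01.wav']}
```

The unique hash `cafef00d` is skipped, so only genuinely duplicated items reach the printout.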