I am trying to quickly find all folders named in a yyyymmdd_hhmmss
format between two dates and times. These dates and times are variables set on user input.
E.g., all folders between
20221231_120000
20230101_235920
The dates/times being searched for do not all have to be valid.
Note that the 'age' of the folders does not match their names.
I have looked at regex but it seems like a complex solution for variable dates/times.
I have looked at the Ansible find module's patterns, but this approach is incredibly slow because it runs the find command once for every sequential number, taking about 1 second per checked number.
For example:
- name: Find folders matching dates and times
  vars:
    startdate: "20230209"
    enddate: "20230209"
    starttime: "120000"
    endtime: "130000"
  ansible.builtin.find:
    paths:
      - "/folderstocheck/"
    file_type: directory
    patterns: "{{ item[0:8] }}_{{ item[8:] }}"
  with_sequence: start={{ startdate + starttime }} end={{ enddate + endtime }}
  register: found_files
This takes approximately 167 minutes to run.
Regarding "Note that the 'age' of the folders does not match their names.":
I would recommend aligning the folders' access and modification times with their names, so that simple OS functions or Ansible modules like stat could be used. That would make any processing a lot easier.
How to do that? I have a somewhat similar use case, "Change creation time of files (RPM) from download time to build time", which shows the idea and how one could achieve it.
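As a rough sketch of that idea, the file module's modification_time and access_time parameters, together with their *_format companions, could be used to set the times from the directory name. The list dir_names and the base path are assumptions made for illustration only. Names that are not valid dates (such as 20221232_000000) cannot be parsed by the time format, so the task would be expected to fail for them and they would need to be filtered out first.

```yaml
# Sketch only: dir_names is a hypothetical list of directory names to process.
- name: Align modification and access time with the directory name
  ansible.builtin.file:
    path: "/home/{{ ansible_user }}/test/{{ item }}"
    state: directory
    # The name itself is parsed as the timestamp, using a matching strptime format
    modification_time: "{{ item }}"
    modification_time_format: "%Y%m%d_%H%M%S"
    access_time: "{{ item }}"
    access_time_format: "%Y%m%d_%H%M%S"
  loop: "{{ dir_names }}"
  vars:
    dir_names:
      - "20221231_120000"
      - "20230101_120000"
```

Once the times match the names, age-based options of find or the values reported by stat could be used for the selection instead of name matching.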
Given some test directories as input
:~/test$ tree 202*
20221231_110000
20221231_120000
20221231_130000
20221232_000000
20230000_000000
20230101_000000
20230101_010000
20230101_020000
20230101_030000
20230101_120000
20230101_130000
a minimal example playbook
---
- hosts: localhost
  become: false
  gather_facts: false
  vars:
    FROM: "20221231_120000"
    TO: "20230101_120000"
  tasks:
  - name: Get an unordered list of directories with pattern 'yyyymmdd_hhmmss'
    find:
      path: "/home/{{ ansible_user }}/test/"
      file_type: directory
      use_regex: true
      patterns: "^[1-2]{1}[0-9]{7}_[0-9]{6}" # could be made more specific
    register: result
  - name: Order list
    set_fact:
      dir_list: "{{ result.files | map(attribute='path') | map('basename') | community.general.version_sort }}"
  - name: Show directories between
    debug:
      msg: "{{ item }}"
    when: item is version(FROM, '>=') and item is version(TO, '<=') # means between
    loop: "{{ dir_list }}"
results in the following output:
TASK [Get an unordered list of directories with pattern 'yyyymmdd_hhmmss'] *****
ok: [localhost]
TASK [Order list] **************************
ok: [localhost]
TASK [Show directories between] ************
ok: [localhost] => (item=20221231_120000) =>
  msg: '20221231_120000'
ok: [localhost] => (item=20221231_130000) =>
  msg: '20221231_130000'
ok: [localhost] => (item=20221232_000000) =>
  msg: '20221232_000000'
ok: [localhost] => (item=20230000_000000) =>
  msg: '20230000_000000'
ok: [localhost] => (item=20230101_000000) =>
  msg: '20230101_000000'
ok: [localhost] => (item=20230101_010000) =>
  msg: '20230101_010000'
ok: [localhost] => (item=20230101_020000) =>
  msg: '20230101_020000'
ok: [localhost] => (item=20230101_030000) =>
  msg: '20230101_030000'
ok: [localhost] => (item=20230101_120000) =>
  msg: '20230101_120000'
Some measurements:
Get an unordered list of directories with pattern 'yyyymmdd_hhmmss' -- 0.50s
Show directories between --------------------------------------------- 0.24s
Order list ----------------------------------------------------------- 0.09s
According to the initial description, no time zone or daylight saving time is involved. This works because the given pattern is effectively just an incrementing number, even if a human may interpret it as a date. It could be simplified further if more were known about the time part: if it were always 1200, that insignificant part could be dropped, leaving a simple integer. The same holds for the delimiter _.
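To illustrate the "just a number" view, the delimiter could be stripped and the names compared as integers. This is a minimal sketch that assumes the dir_list, FROM and TO variables from the playbook above and that every name contains only digits and a single underscore.

```yaml
# Sketch only: treats 'yyyymmdd_hhmmss' as the integer yyyymmddhhmmss.
- name: Show directories between, comparing the names as plain integers
  debug:
    msg: "{{ item }}"
  when:
    - (item | replace('_', '') | int) >= (FROM | replace('_', '') | int)
    - (item | replace('_', '') | int) <= (TO | replace('_', '') | int)
  loop: "{{ dir_list }}"
```

Both conditions in the when list must hold, so this expresses the same "between" check without relying on version comparison.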
Regarding "... they are incredibly slow, because it runs the find command for every sequential number ... with_sequence ...":
That is not necessary, and it seems to me like a case of "How do I optimize performance of Ansible playbook with regards to SSH connections?".
Looping over a command and passing it one parameter per run results in a lot of overhead and in multiple SSH connections as well. Providing the whole list to the command at once may be possible, and would increase performance while reducing runtime and resource consumption.
Further processing can be done just afterwards.
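As one possible form of such further processing, the range could be selected in a single expression instead of a task loop. This is a sketch under a few assumptions: a list dir_list of directory names has already been gathered by one find call, FROM and TO hold the boundaries, all names have the same fixed width (so plain string comparison orders them like numbers), and the installed Jinja2 provides the ge and le tests (version 2.10 or later). The fact name dirs_between is hypothetical.

```yaml
# Sketch only: filters the whole list at once, without looping over tasks.
- name: Collect all directories between FROM and TO in one expression
  set_fact:
    dirs_between: "{{ dir_list | select('ge', FROM) | select('le', TO) | list }}"
```

The resulting list can then be handed to later tasks as a whole.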