Cleaning up your Artifacts and Docker Images in Artifactory
Cleaning up Artifacts can be approached in different ways; you can, for instance, use the Artifactory REST API, Artifactory Query Language, or CLI tools to find artifacts that have not been used in X days, or that were created before a certain date.
I won’t get into the specifics of cleaning up Artifacts in general, but rather focus on the edge cases, as JFrog already has some articles on cleanup practices. Take a look at the following if you are new to this approach to cleanup.
- Artifactory Cleanup Methods and How Do I Delete Old Artifacts?
- Advanced Cleanup Using Artifactory Query Language (AQL)
There are also built-in maintenance tools that can help you control the size and organization of your artifacts. From the JFrog docs I recommend the following:
- Managing Disk Space Usage: How to use Artifactory’s basic tools to limit the number of artifacts you keep and clean up unnecessary files.
- Regular Maintenance Operations: Goes along with the first one; here you learn to manage settings relating to scheduled cleanup and maintenance tasks.
This will be enough for most people, and the Artifactory REST API will help further by providing features such as version/GAVC search, artifact stats, and more. The JFrog CLI’s del command can also take a File Spec, which uses patterns or AQL, making the whole delete process faster. To summarize, you can do the following and more (see the sketch after this list):
- Use the Artifacts Not Downloaded Since API and parse through the results, deleting them.
- Use the Artifacts Created In Date Range API and parse through the results, deleting them.
- Use the Artifactory Query Language (AQL) along with either the REST API (and parsing) or the JFrog CLI (deleting directly by passing a File Spec).
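As a minimal sketch of the first option, using Python and the requests library, and assuming a placeholder Artifactory URL, repository name, and API key (swap in your own), it could look something like this:

```python
import time
import requests

# Placeholder values -- adjust to your own instance, repository, and credentials.
ARTIFACTORY_URL = "http://localhost:8081/artifactory"
HEADERS = {"X-JFrog-Art-Api": "<YOUR_API_KEY>"}
REPO = "generic-local"

# The Artifacts Not Downloaded Since API takes a Java epoch timestamp in milliseconds.
three_months_ago = int((time.time() - 90 * 24 * 3600) * 1000)

resp = requests.get(
    "{}/api/search/usage".format(ARTIFACTORY_URL),
    params={"notUsedSince": three_months_ago, "repos": REPO},
    headers=HEADERS,
)
resp.raise_for_status()

for result in resp.json().get("results", []):
    # The search returns api/storage URIs; the artifact itself lives at the
    # same path without the /api/storage prefix.
    artifact_url = result["uri"].replace("/api/storage/", "/", 1)
    print("Would delete: {}".format(artifact_url))
    # Uncomment once you are happy with the dry run:
    # requests.delete(artifact_url, headers=HEADERS)
```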
However, if you are storing Docker images, the story is a bit different.
Cleaning Up Docker Images
Last year JFrog released a User Plugin for Docker Cleanup to address this concern. This plugin works by looking for properties on certain images, and then removing them accordingly.
Docker labels in Artifactory are stored as properties, so the approach of this plugin is to have users mark their images as they push, to define a retention policy. This is a common use case and it might work for you as well, but those who already have a large number of images, or who prefer the flexibility that other search/cleanup methods offer, will need to heavily rework the plugin for their needs.
But Why?
Docker images are stored in layers, and each layer has its own checksum. Just like with any other artifact, Artifactory stores the layers based on this value, causing layers to be shared by different deployments: not only between different tags, but also between different images. The Docker implementation goes a bit further and makes each Artifactory repository pool these layers, sharing the statistics attached to them as well.
That means deleting layers based on their last download date might cause issues when cleaning up. Let’s say you are using the REST API or AQL to find the least-used Docker images, so you run a query to find all artifacts not downloaded in the last 3 months. If you then delete those artifacts, you might still have images that have not been used in a long time, and that are now incomplete.
This is because some of the layers might still be in use by other tags or images, so those layers did not get deleted. Along the same lines, we also want to make it clear that if you delete a layer from one image, it will not be fully deleted as long as other images are referencing it, so what we have to focus on is deleting the image as a whole.
Ok, kinda get it, but How?
We search based on the manifest.json file, whose statistics change only when that specific image/tag is downloaded/used. Each image has its own unique manifest, and it is the most reliable way to find information about the image as a whole.
For example, the following Python script would look for all manifest.json files that are 4 weeks old or more and delete the entire image.
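A sketch along those lines, using AQL over the REST API, with a placeholder Artifactory URL, API key, and Docker repository name (swap in your own; you can also search on stat.downloaded instead of created if you prefer to go by last download date):

```python
import requests

# Placeholder values -- adjust to your own instance, Docker repository, and credentials.
ARTIFACTORY_URL = "http://localhost:8081/artifactory"
HEADERS = {"X-JFrog-Art-Api": "<YOUR_API_KEY>"}

# Find every manifest.json in the Docker repo that is 4 weeks old or more.
query = ('items.find({"repo":"docker-local","name":"manifest.json",'
         '"type":"file","created":{"$before":"4w"}})')

resp = requests.post(
    "{}/api/search/aql".format(ARTIFACTORY_URL),
    data=query,
    headers=dict(HEADERS, **{"Content-Type": "text/plain"}),
)
resp.raise_for_status()

for item in resp.json()["results"]:
    # The path of the manifest is the image/tag folder, e.g. my-image/latest.
    # Deleting that folder removes the whole image, not individual layers.
    image_url = "{}/{}/{}".format(ARTIFACTORY_URL, item["repo"], item["path"])
    print("Would delete image: {}".format(image_url))

    ###########################################################
    # WARNING: the next line DELETES the entire image folder. #
    # Run the script as-is first and review the output before #
    # uncommenting it.                                        #
    ###########################################################
    # requests.delete(image_url, headers=HEADERS)
```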
The delete call itself is not validated; the big warning around it is there to show that the line will delete images. I did this so that if you delete anything, it’s on you, not on me!
But seriously, do a dry run first by printing out the results before you delete something you might not have wanted to delete.
Special Use Cases
Let’s talk about how to put some of this knowledge into practice.
Let’s say you want a way to clean up artifacts that are older and unused, but the organization of the packages isn’t uniform across repositories, or even within a single repository. Also, let’s say that you are using generic repositories with your own made-up versioning convention.
When you find an older artifact, you want to delete the whole version of the package (the entire subfolder that corresponds to that version); however, the folder depth and naming convention can change, which makes those folders very difficult to find. Also, you always want to leave at least one package available, regardless of its age.
Quite a specific request; how do we approach it?
First, look for all the packages older than a certain time and gather their paths as usual:
query = 'items.find({"name":{"$match":"*"},"type":"file","stat.downloaded":{"$before":"4w"},"repo":"test-repo"}).include("stat.downloaded")'
From there we get information such as repo, path, and name to form the full path that you may want to use later.
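Running that query through the AQL endpoint and gathering those fields might look like this (continuing with the query variable above; the Artifactory URL and API key are placeholders):

```python
import requests

ARTIFACTORY_URL = "http://localhost:8081/artifactory"  # placeholder
HEADERS = {"X-JFrog-Art-Api": "<YOUR_API_KEY>"}         # placeholder

# POST the AQL query defined above as plain text.
resp = requests.post(
    "{}/api/search/aql".format(ARTIFACTORY_URL),
    data=query,
    headers=dict(HEADERS, **{"Content-Type": "text/plain"}),
)
resp.raise_for_status()
results = resp.json()["results"]

# Each result carries repo, path, and name; the full path is handy to keep
# around for the eventual delete call.
full_paths = ["{}/{}/{}".format(i["repo"], i["path"], i["name"]) for i in results]
```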
Then we will parse over the paths (without name and repo), split them, and find the ones that have a folder matching one or more expressions describing the many ways you represent your versions. In Python/pseudocode this would look like:
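Continuing from the results gathered above, here is a sketch of that step (the version patterns are made up, and choose_oldest and get_folder_count are the two helpers described right after this snippet):

```python
import re
from collections import defaultdict

# Made-up examples of version naming conventions; adjust these to your own.
VERSION_PATTERNS = [
    re.compile(r"^\d+\.\d+\.\d+$"),   # e.g. 1.2.3
    re.compile(r"^v\d+(\.\d+)*$"),    # e.g. v7 or v7.1
    re.compile(r"^\d{8}-\w+$"),       # e.g. 20170512-beta
]

# Group the old version folders by their parent (the package root folder).
candidates = defaultdict(set)

for item in results:
    # item["path"] is the folder path inside the repo, without repo or file name.
    folders = item["path"].split("/")
    for depth, folder in enumerate(folders):
        if any(p.match(folder) for p in VERSION_PATTERNS):
            parent = "/".join(folders[:depth])
            version_folder = "/".join(folders[:depth + 1])
            candidates[parent].add(version_folder)
            break

for parent, old_versions in candidates.items():
    # Never wipe out the whole package: if every version in the folder is old,
    # let choose_oldest pick which ones go and which stay.
    if get_folder_count(parent) <= len(old_versions):
        old_versions = choose_oldest(old_versions)
    for version_folder in old_versions:
        print("Would delete test-repo/{}".format(version_folder))
```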
This should do it; all you’d have to do from here is create two functions: choose_oldest and get_folder_count.
choose_oldest will be the function that decides what to delete in case of conflict. For example, if there are 7 packages to delete and only 7 in the root folder, you’ll want to keep one or more, and you can decide based on age, download count, or original version.
get_folder_count will have to use the REST API to tell you how many packages there are in the folder, so you can delete accordingly.
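As a rough sketch of those two helpers, using the Folder Info storage API, with the same placeholder URL and credentials as above (the keep parameter and sorting by creation date are just one way to make that decision):

```python
import requests

ARTIFACTORY_URL = "http://localhost:8081/artifactory"  # placeholder
HEADERS = {"X-JFrog-Art-Api": "<YOUR_API_KEY>"}         # placeholder
REPO = "test-repo"                                      # placeholder


def get_folder_count(parent):
    """Return how many packages (sub-folders) live under the parent folder."""
    resp = requests.get(
        "{}/api/storage/{}/{}".format(ARTIFACTORY_URL, REPO, parent),
        headers=HEADERS,
    )
    resp.raise_for_status()
    return sum(1 for child in resp.json().get("children", []) if child["folder"])


def choose_oldest(version_folders, keep=1):
    """Decide what to delete when every version is old: keep the newest ones."""
    def created(folder):
        resp = requests.get(
            "{}/api/storage/{}/{}".format(ARTIFACTORY_URL, REPO, folder),
            headers=HEADERS,
        )
        resp.raise_for_status()
        return resp.json()["created"]

    # Sort oldest first and drop the newest `keep` folders from the delete list.
    ordered = sorted(version_folders, key=created)
    return ordered[:-keep] if keep else ordered
```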
Conclusion
There are many ways to approach artifact cleanup, and it is not a trivial task. The reason Artifactory doesn’t give you an option to just go and remove packages in bulk based on a property is the same reason we have been discussing all these points: not all artifacts are treated equally, and they tend to be part of a bigger picture.
Good luck with your cleanup endeavors, and let me know if you have any feedback.
Keep scripting!