Freedesktop Planet - Latest News

  • Simon Ser: Status update, May 2024 (2024/05/20 22:00)
    Hi! Sadly, I need to start this status update with bad news: SourceHut has decided to terminate my contract. At this time, I’m still in the process of figuring out what I’ll do next. I’ve marked some SourceHut-specific projects as unmaintained, such as (feel free to fork of course). I’ve handed over hut maintenance to xenrox, and I’ve started migrating a few projects to other forges (more to follow). I will continue to maintain projects that I still use such as soju to the extent that my free time allows. On a more positive note, this month Igalia’s display next hackfest took place. Although I couldn’t attend in person, it was great to discuss focused topics in real time with other engineers in the community. We discussed color management, HDR, adaptive sync, testing, real-time scheduling, power usage implications of the color pipeline, improved uAPI to handle KMS atomic commit failures, hardware plane offloading, display muxes, backlight, scaling and sharpening filters… And I probably missed a few other things. We’ve released wlroots 0.17.3 with a bunch of bug fixes (thanks to Simon Zeni). The patches to add support for ICC profiles from M. Stoeckl have been merged. I’ve continued working on the new ext-screencopy-v1 protocol but there are a few remaining issues to address before this is ready. The display hackfest has motivated me to work on libliftoff. Apart from a few bug fixes, a new API to set a timeout for the libliftoff algorithm has been added, and some optimizations are about to get merged (one thanks to Leo Li). The Wayland release cycle has started; we’ve merged patches to generate validators for enum values and added a new deprecated-since XML attribute to mark a request, event or enum as deprecated. Thanks to Ferdinand Bachmann, kanshi has gained output defaults and aliases (useful for sharing output configurations across multiple profiles). 
mako 1.9 has been released with a new flag to toggle modes, another new flag to bypass history when dismissing a notification, and support for compositor-side cursor images. In IRC news, goguma now uses Material 3 (please report any regression), has gained support for messages only visible to channel operators (STATUSMSG), and I’ve spent a fair bit of time investigating the infamous duplicate message bug. I have a better understanding of the issue now, but still need a bit more time to come up with a proper fix. Thanks to old patches sent by sitting33 that I took way too long to review, gamja now only marks messages as read when it’s focused, shows the number of unread highlights in the tab title, and hides the internal WHO reply chatter from the user. Last, I’ve released go-imap 2.0.0 beta 3 with a whole bunch of bug fixes. Ksenia Roshchina has contributed a client implementation of the ACL IMAP extension. That’s all for now, see you next month!
  • Alejandro Piñeiro: First time on the Embedded Open Source Summit: talking about the rpi5 (2024/05/14 21:31)
    Some weeks ago I attended the Embedded Open Source Summit for the first time. Igalia had a booth that allowed us to showcase the work that we have been doing over the past years. Several Igalians also gave talks there. I gave a talk titled “Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driver for a New GPU”, where I provided an introduction to Igalia’s contributions to maintaining the OpenGL/Vulkan stack for the Raspberry Pi, focusing on the challenges of implementing the Mesa support for the Raspberry Pi 5, the latest device in that series, which was released in October 2023. If you are interested, the video and slides of my presentation are now available: And as a bonus, you can see here a video showing the RPI5 running some Unreal Engine 4 demos, and other applications:
  • Peter Hutterer: libwacom and Huion/Gaomon devices (2024/05/09 00:01)
    TLDR: Thanks to José Exposito, libwacom 2.12 will support all [1] Huion and Gaomon devices when running on a 6.10 kernel. libwacom, now almost 13 years old, is a C library that provides a bunch of static information about graphics tablets that is not otherwise available by looking at the kernel device. Basically, it's a set of APIs in the form of libwacom_get_num_buttons and so on. This is used by various components to be more precise about initializing devices, even though libwacom itself has no effect on whether the device works. It's only a library for historical reasons [2]; if I were to rewrite it today, I'd probably ship libwacom as a set of static JSON or XML files with a specific schema. Here are a few examples of how this information is used: libinput uses libwacom to query information about tablet tools. The kernel event node always supports tilt, but the individual tool that is currently in proximity may not. libinput can get the tool ID from the kernel, query libwacom and then initialize the tool struct correctly so the compositor and Wayland clients will get the right information. GNOME Settings uses libwacom's information to e.g. detect if a tablet is built into a display or is an external one (to show you the "Map to Monitor" button or not, if built-in). GNOME's mutter uses the SVGs provided by libwacom to show you an OSD where you can assign keystrokes to the buttons. All these features require that the tablet is supported by libwacom. Huion and Gaomon devices [3] were not well supported by libwacom because they re-use USB ids, i.e. different tablets from seemingly different manufacturers have the same vendor and product ID. This is understandable: the 16-bit product id only allows for 65535 different devices, and if you're a company that thinks about more than just the current quarterly earnings you realise that if you release a few devices every year (let's say 5-7), you may run out of product IDs in about 10000 years. Need to think ahead! 
So among the 140 Huion and Gaomon devices we now have in libwacom I only counted 4 different USB ids. Nine years ago we added name matching too to work around this (i.e. the vid/pid/name combo must match) but, lo and behold, we may run out of unique strings before the heat death of the universe so device names are re-used too! [4] Since we had no other information available to userspace, this meant that if you plugged in e.g. a Gaomon M106 it was detected as an S620 and given wrong button numbers, a wrong SVG, etc. A while ago José got himself a tablet and started contributing to DIGIMEND (and upstreaming a bunch of things). At some point we realised that the kernel actually had the information we needed: the firmware version string from the tablet, which conveniently gave us the tablet model too. With this kernel patch scheduled for 6.10 this is now exported as the uniq property (HID_UNIQ in the uevent), and that means it's available to userspace. After a bit of rework in libwacom we can now match on the trifecta of vid/pid/uniq or the quadrella of vid/pid/name/uniq. So hooray, for the first time we can actually detect Huion and Gaomon devices correctly. The second thing José did was to extract all model names from the .deb packages Huion and Gaomon provide and auto-generate the libwacom descriptions for all supported devices. Which meant that in one pull request we added around 130 devices. Nice! As said above, this requires the future kernel 6.10, but you can apply the patches to your current kernel if you want. If you do have one of the newly added devices, please verify the .tablet file for your device and let us know so we can remove the "this is autogenerated" warnings and fix any issues with the file. Some of the new files may now take precedence over the old hand-added ones, so over time we'll likely have to merge them. But meanwhile, for a brief moment in time, things may actually work. 
[1] fsvo of "all", but it should be all current and past ones provided they were supported by Huion's driver [2] anecdote: in 2011 Jason Gerecke from Wacom and I sat down and decided on a generic tablet handling library independent of the xf86-input-wacom driver. libwacom was supposed to be that library but it never turned into more than a static description library; libinput is now what our original libwacom idea was. [3] and XP Pen and UCLogic, but we don't yet have a fix for those at the time of writing [4] names like "HUION PenTablet Pen"...
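    The vid/pid/name/uniq matching described above can be sketched as a lookup that tries the most specific key first and then falls back. This is only an illustration of the idea, not libwacom's actual API or data (the model names, uniq string and fallback scheme below are made up; 0x256C is Huion's USB vendor ID):

```python
# Hypothetical sketch of libwacom-style device matching: try the most
# specific key first (vid/pid/name/uniq), then fall back to a less
# specific one. The database entries are invented for illustration.
DEVICE_DB = {
    (0x256C, 0x006E, "HUION PenTablet Pen", "GM001"): "Gaomon M106",
    (0x256C, 0x006E, "HUION PenTablet Pen", None): "Huion/Gaomon (generic)",
}

def match_device(vid, pid, name, uniq):
    """Return the best-matching model, preferring uniq-qualified entries."""
    for key in ((vid, pid, name, uniq), (vid, pid, name, None)):
        if key in DEVICE_DB:
            return DEVICE_DB[key]
    return None

model = match_device(0x256C, 0x006E, "HUION PenTablet Pen", "GM001")
```

    Before the uniq property, only the second, generic key was available, so every tablet sharing that vid/pid/name resolved to the same entry.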
  • Melissa Wen: Get Ready to 2024 Linux Display Next Hackfest in A Coruña! (2024/05/07 14:33)
    We’re excited to announce the details of our upcoming 2024 Linux Display Next Hackfest in the beautiful city of A Coruña, Spain! This year’s hackfest will be hosted by Igalia and will take place from May 14th to 16th. It will be a gathering of minds from a diverse range of companies and open source projects, all coming together to share, learn, and collaborate outside the traditional conference format. Who’s Joining the Fun? We’re excited to welcome participants from various backgrounds, including: GPU hardware vendors; Linux distributions; Linux desktop environments and compositors; color experts, researchers and enthusiasts. This diverse mix of backgrounds is represented by developers from several companies working on the Linux display stack: AMD, Arm, BlueSystems, Bootlin, Collabora, Google, GravityXR, Igalia, Intel, LittleCMS, Qualcomm, Raspberry Pi, RedHat, SUSE, and System76. This will ensure a dynamic exchange of perspectives and foster collaboration across the Linux Display community. Please take a look at the list of participants for more info. What’s on the Agenda? The beauty of the hackfest is that the agenda is driven by participants! As this is a hybrid event, we decided to improve the experience for remote participants by creating a dedicated space for them to propose topics and some introductory talks in advance. From those inputs, we defined a schedule that reflects the collective interests of the group, but is still open for amendments and new proposals. Find the schedule details in the official event webpage. Expect discussions on:
    - KMS Color/HDR: a proposal with a new DRM object type, with a brief presentation of GPU-vendor features and a status update of the plane color management pipeline per vendor on Linux; HDR/color use-cases: HDR gainmap images and how we should think about HDR; the Google/ChromeOS GFX view on HDR/per-plane color management, VKMS and lessons learned; the post-blending color pipeline.
    - Color/HDR testing/CI: VKMS status update; Chamelium boards, video capture.
    - Wayland protocols: color-management protocol status update; color-representation and video playback.
    - Display control: HDR signalling status update; backlight status update; EDID and DDC/CI.
    - Strategy for video and gaming use-cases: multi-plane support in compositors; underlay, overlay, or mixed strategies for video and gaming use-cases; KMS plane uAPI to simplify the plane arrangement problem; a shared plane arrangement algorithm; HDR video and hardware overlays.
    - Frame timing and VRR: frame timing and the limitations of the uAPI; current user space solutions; brainstorming a better uAPI; cursor/overlay plane updates with VRR; KMS commit and buffer-readiness deadlines.
    - Power saving vs color/latency: ABM (adaptive backlight management); PSR1 latencies; power optimization vs color accuracy/latency requirements.
    - Content-adaptive scaling & sharpening: content-adaptive scalers on display hardware; a new drm_colorop for content-adaptive scaling; proprietary algorithms.
    - Display mux: laptop muxes for switching the embedded panel between the integrated GPU and the discrete GPU; seamless/atomic hand-off between drivers on Linux desktops.
    - Real-time scheduling & async KMS API: potential benefits such as lower-latency input feedback, better VRR handling and buffer synchronization; issues around “async” uAPI usage and async-call handling.
    In-person, but also geographically-distributed: this year the Linux Display Next Hackfest is a hybrid event, hosted onsite at the Igalia offices and available for remote attendance. In-person participants will find an environment for networking and brainstorming in our inspiring and collaborative office space. Additionally, A Coruña itself is a gem waiting to be explored, with stunning beaches, good food, and historical sites. Semi-structured structure: how the 2024 Linux Display Next Hackfest will work:
    - Agenda: participants proposed the topics and talks for discussion in sessions. 
    - Interactive sessions: discussions, workshops, introductory talks and brainstorming sessions lasting around 1h30. There is always a starting point for discussions, and new ideas will emerge in real time.
    - Immersive experience: we will have coffee breaks between sessions and lunch at the office for all in-person participants. Lunches and coffee breaks are sponsored by Igalia. This will keep us sharing knowledge and in continuous interaction.
    - Spaces for all group sizes: in-person participants will find different room sizes that match various group sizes at the Igalia HQ. Besides that, there will be some devices for showcasing and real-time demonstrations.
    Social activities: building connections beyond the sessions. To make the most of your time in A Coruña, we’ll be organizing some social activities:
    - First-day dinner: in-person participants will enjoy a Galician dinner on Tuesday, after a first day of intensive discussions at the hackfest.
    - Getting to know A Coruña and current local habits: on Thursday afternoon, we will close the 2024 Linux Display Next Hackfest with a guided tour of the museum of Galicia’s favorite beer brand, Estrella Galicia. The guided tour covers the eight sectors of the museum and ends with beer pouring and tasting. After this experience, a transfer bus will take us to the Maria Pita square, where we will see the charm of some historical landmarks of A Coruña, explore the casual and vibrant style of the city center and taste local foods while chatting with friends.
    Sponsorship: Igalia sponsors lunches and coffee breaks on hackfest days, Tuesday’s dinner, and the social event on Thursday afternoon for in-person participants. We can’t wait to welcome hackfest attendees to A Coruña! Stay tuned for further details and outcomes of this unconventional and unique experience.
  • Tomeu Vizoso: Etnaviv NPU update 18: Getting the driver to work on the Amlogic S905D3 SoC (2024/05/07 12:46)
    With new releases of the Linux kernel and Mesa drivers poised to be packaged by Linux distributions, the TensorFlow Lite driver for the NPU in the Amlogic A311D SoC will be available to users with minimal effort. With that work bearing its fruits, I have been looking at how this driver could be of use with other hardware. Philipp Zabel of Pengutronix has been looking at adding support for the NPU in the NXP i.MX 8M Plus SoC, and he has made great progress on reverse engineering the in-memory format of the weights tensor, which is different from that used in the A311D. I started by probing what supporting the NPU in the S905D3 SoC from Amlogic would entail, and I found it not that different from what is currently supported, besides it also using a new format for the weights tensor. Weights, the other kind of them. I looked a bit further, and found that this format is very similar to what Philipp had been reverse engineering and implementing support for. After a couple of weeks staring at memory dumps and writing a Python tool to decode them, I realized that the run-length and Huffman encodings were the same, with only a few differences such as where and how the bias values were stored. With a few changes to Philipp's work-in-progress branch I got my first tests passing on the Libre Computer Solitude SBC board. Next I will look at supporting more weights tensor dimensions and fixing bugs in how the weights and other values are encoded. The command stream programming seems to be very similar to that of the A311D, so I don't expect much work to be needed there. Once everything is working at the same level as with the A311D, I will move on to determining the optimal values for the zero run-length and Huffman symbol maps, for maximum compression and thus performance (as NPUs are so fast at arithmetic that they tend to be memory starved). Big thanks to Pengutronix for supporting Philipp's work, and to Libre Computer for having supported the development of the driver so far.
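    To give an intuition for why weight streams compress well with run-length encoding: convolution weights are typically sparse, so long runs of zeros can be collapsed into a marker plus a count. The sketch below is a toy scheme for illustration only; it is not the actual Vivante/Etnaviv bitstream format, and the symbol names are invented:

```python
# Toy run-length decoding of a sparse weights stream: long runs of zeros
# are encoded as a (ZERO_RUN, count) pair, everything else is literal.
# This illustrates the general idea, not the real NPU weight format.
ZERO_RUN = object()  # sentinel standing in for a zero-run symbol

def rle_decode(stream):
    """Expand (ZERO_RUN, n) pairs into n zeros; pass other values through."""
    out = []
    it = iter(stream)
    for sym in it:
        if sym is ZERO_RUN:
            out.extend([0] * next(it))  # next item is the run length
        else:
            out.append(sym)
    return out

weights = rle_decode([3, ZERO_RUN, 4, 7, ZERO_RUN, 2, 1])
```

    A real encoder additionally Huffman-codes the symbols, which is why tuning the zero run-length and symbol maps (as mentioned above) directly affects compression and, on a memory-starved NPU, performance.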
  • Nicolai Hähnle: A new kind of git history (2024/05/04 16:30)
    Discussions about rebase vs. merge are familiar territory for anybody with an interest in version control in general and git in particular. I want to finally give a more permanent home to an idea that I have expressed in the past and that I've occasionally seen others hint at in those discussions as well. There are multiple camps in these discussions that have slightly different ideas about how and for what purposes git should be used. The first major axis of disagreement is whether history needs to be git bisect-able. Outside of my own little hobby projects, I've always worked on projects for which bisectability was important. This has generally been because their scope was such that CI simply had no chance to cover all uses of the software. Bug reports that can be traced to regressions from weeks or even months ago are not frequent per se, but they have always been frequent enough to matter. git bisect is an essential tool for finding those regression points when they happen. Not all projects are like that, but for projects which are, the notion of an "atomic" change to the project's main development branch (or branches) is important. The second major axis of disagreement is whether the development history of those "atomic" changes is important enough to preserve. The original git development workflow does not consider this to be important: developers send around and review multiple iterations of a change, but only the final version of the change goes into the permanent record of the git repository. I tend to agree with that view. I have very occasionally found it useful to go back and read through the comments on a pull request that led to a change months ago (or the email thread in projects that use an email workflow), but I have never found it useful to look at older versions of a change. Some people seem to really care about this kind of history, though. 
They're the people who argue for a merge-based workflow for pull requests on GitHub (but against force-pushes to the same) and who have built hacks for bisectability and readability of history like --first-parent. I'm calling that a hack because it does not compose well. It works for projects whose atomic change history is essentially linear, but it breaks down once the history becomes more complex. What if the project occasionally has a genuine merge? Now you'd want to apply --first-parent for most merge commits but not all. Things get messy. One final observation. Even "my" camp, which generally prefers to discard the development history leading up to the atomic change in a main development branch, does want to preserve a kind of history that is currently not captured by git's graph. git revert inserts the hash of the commit that was reverted into the commit message. Similarly, git cherry-pick optionally inserts the hash of the commit that was cherry-picked into the commit message. In other words, there is a kind of history for whose preservation, at least in some cases, there seems to be a broad consensus. This kind of history is distinct from the history that is captured by commit parent links. Looked at in this light, the idea is almost obvious: make this history an explicit part of git commit metadata. The gist of it would be this. Every commit has an (often empty) list of historical commit references explaining the origins of the diff that is implicitly represented by the commit; let's call them diff-parents. The diff-parents are an ordered list of references to commits, each of them with a "reverted" bit that can optionally be set. The history of a revert can be encoded by making the reverted commit a diff-parent with the "reverted" bit set. The history of a cherry-pick can be encoded similarly, with the "reverted" bit clear. When we perform a simple rebase, each new commit has an obvious diff-parent. 
When commits are squashed during a rebase, the sequence of squashed commits becomes the list of diff-parents of the newly formed commit. GitHub users who like to preserve all development history can use the "squash" option when landing pull requests and have the history be preserved via the list of diff-parents. git commit --amend can similarly record the original commit as a diff-parent. This is an idea and not a fully fleshed-out plan. There are obviously a whole bunch of tricky questions to answer. For example: How does this all fit into git's admittedly often byzantine CLI? Can merge commits be diff-parents, and how would that work? Can we visualize the difference between a commit and its diff-parents? (Hint: Here's an idea.) Diff-parents are a source of potential information leaks. This is not a problem specific to the idea of diff-parents; it is a general problem with the idea of preserving all history. Imagine some developer accidentally commits some credentials in their local clone of a repository and then uses git commit --amend to remove them again. Whoops, the commit that contains the credentials is still referenced as a diff-parent. Will it (and therefore the credentials) be published to the world for all to see when the developer pushes their branch to GitHub? This needs to be taken seriously. So there are a whole bunch of issues that would have to be addressed for this idea to work well. I believe those issues to be quite surmountable in principle, but given the state of git development (where GitHub, which to many is almost synonymous with git, doesn't even seem to be able to understand how git was originally meant to be used) I am not particularly optimistic. Still, I think it's a good idea, and I'd love to see it or something like it in git.
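    Today git records this kind of history only informally in the commit message, as noted above. The throwaway script below (scratch repository and identity are made up for illustration) shows the breadcrumb that git revert leaves behind, which the diff-parent idea would turn into structured metadata:

```python
# Demonstrate how git currently records revert history: only as free text
# ("This reverts commit <hash>.") in the new commit's message.
import os
import subprocess
import tempfile

def git(*args, cwd):
    """Run a git command in the given repo and return its stdout."""
    return subprocess.run(("git",) + args, cwd=cwd, check=True,
                          capture_output=True, text=True).stdout

repo = tempfile.mkdtemp()
git("init", "-q", cwd=repo)
git("config", "user.email", "dev@example.com", cwd=repo)
git("config", "user.name", "Dev", cwd=repo)

with open(os.path.join(repo, "file.txt"), "w") as f:
    f.write("one\n")
git("add", "file.txt", cwd=repo)
git("commit", "-q", "-m", "add file", cwd=repo)

with open(os.path.join(repo, "file.txt"), "a") as f:
    f.write("two\n")
git("commit", "-q", "-a", "-m", "append two", cwd=repo)

# Revert the last commit and inspect the message of the revert commit.
git("revert", "--no-edit", "HEAD", cwd=repo)
message = git("log", "-1", "--format=%B", cwd=repo)
```

    The reverted commit's hash survives only inside `message`; nothing in the commit graph links the revert to its origin, which is exactly the gap diff-parents would fill.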
  • Hans de Goede: Moving GPU drivers out of the initramfs (2024/04/29 13:46)
    The firmware which drm/kms drivers need is becoming bigger and bigger, and there is a push to move to generating a generic initramfs on distros' builders and signing the initramfs with the distro's keys for security reasons. When targeting desktops/laptops (as opposed to VMs) this means including firmware for all possible GPUs, which leads to a very big initramfs. This has made me think about dropping the GPU drivers from the initramfs and instead making plymouth work well/better with simpledrm (on top of efifb). A while ago I discussed making this change for Fedora with the Red Hat graphics team. Spoiler: for now nothing is going to change. Let me repeat that: for now there are no plans to implement this idea, so if you believe you would be impacted by such a change: nothing is going to change. Still this is something worthwhile to explore further.
    Advantages:
    1. Smaller initramfs size:
    - E.g. a host-specific initramfs with amdgpu goes down from 40MB to 20MB.
    - No longer need to worry about Nvidia GSP firmware size in the initrd.
    - This should also significantly shrink the initrd used in live images.
    2. Faster boot times:
    - Loading + unpacking the initrd can take a surprising amount of time. E.g. on my old AMD64 embedded PC (with BobCat cores) the reduction from 40MB to 20MB in initrd size shaves approx. 3 seconds off the initrd load time + 0.6 seconds from the time it takes to unpack the initrd.
    - Probing drm connectors can be slow, and plymouth blocks the initrd -> rootfs transition while it is busy probing.
    3. Earlier showing of the splash: by using simpledrm for the splash, it can be shown earlier, avoiding the impression that the machine is hanging during boot. An extreme example of this is my old AMD64 embedded PC, where the time to show the first frame of the splash goes down from 47 to 9 seconds.
    4. One less thing to worry about when trying to create a uniform desktop pre-generated and signed initramfs (these would still need support for nvme + ahci and commonly used rootfs + lvm + luks).
    Disadvantages: doing this will lead to user-visible changes in the boot process:
    1. Secondary monitors not lit up by the efifb will stay black during full-disk encryption password entry, since the GPU drivers will now only load after switching to the encrypted root. This includes any monitors connected to the non-boot GPU in dual-GPU setups. Generally speaking this is not really an issue: the secondary monitors will light up pretty quickly after the switch to the real rootfs. However, when booting a docked laptop with the lid closed, where the only visible monitor(s) are connected to the non-boot GPU, the full-disk encryption password dialog will simply not be visible at all. This is the main deal-breaker for not implementing this change. Note that because of the strict version lock between the kernel driver and userspace with the nvidia binary drivers, the nvidia binary drivers are usually already not part of the initramfs, so this problem already exists and moving the GPU drivers out of the initramfs does not really make it worse.
    2. With simpledrm plymouth does not get the physical size of the monitor, so plymouth will need to switch to using heuristics on the resolution instead of DPI info to decide whether or not to use hidpi (e.g. 2x size) rendering. Even when switching to the real GPU driver, plymouth needs to stay with its initial heuristics-based decision to avoid the scaling changing when switching to the real driver, which would lead to a big visual glitch / change halfway through the boot. This may result in a different scaling factor for some setups, but I do not expect this to really be an issue.
    3. On some (older) systems the efifb will not come up in native mode, but rather in 800x600 or 1024x768. This will lead to a pretty significant discontinuity in the boot experience when switching from, say, 800x600 to 1920x1080 while plymouth was already showing the spinner at 800x600. One possible workaround here is to add 'video=efifb:auto' to the kernel commandline, which will make the efistub switch to the highest available resolution before starting the kernel. But it seems that the native modes are simply not there on systems which come up at 800x600 / 1024x768, so this does not really help. This does not actually break anything, but it does look a bit ugly. So we will just need to document this as an unfortunate side-effect of the change, and then we (and our users) will have to live with it (on affected hardware).
    4. On systems where a full modeset is done, the monitor going briefly black from the modeset will move from just before plymouth starts to the switch from simpledrm to the real driver. So that is slightly worse. IMHO the answer here is to try and get fast modesets working on more systems.
    5. On systems where the efifb comes up in the panel's native mode and a fast modeset can be done, the spinner will freeze for a (noticeable) fraction of a second as the switch to the real driver happens.
    Preview: to get an impression of what this will look / feel like on your own systems, you can implement this right now on Fedora 40 with some manual configuration changes:
    1. Create /etc/dracut.conf.d/omit-gpu-drivers.conf with: omit_drivers+=" amdgpu radeon nouveau i915 " and then run "sudo dracut -f" to regenerate your current initrd.
    2. Add "plymouth.use-simpledrm" to the kernel commandline.
    3. Edit /etc/selinux/config and set SELINUX=permissive; this is necessary because ATM plymouth has issues with accessing drm devices after the chroot from the initrd to the rootfs.
    Note this all assumes EFI booting with the efifb used to show the plymouth boot splash. For classic BIOS booting it is probably best to stick with having the GPU drivers inside the initramfs.
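    Step 1 of the preview can be scripted. This sketch takes a root directory argument so it can be dry-run outside /etc (root="/" reproduces the path from the post; on a real system you would still run "sudo dracut -f" afterwards, and the helper name here is my own, not a dracut tool):

```python
# Write the dracut drop-in from step 1 of the preview, parameterized with
# a root directory so the sketch can be tried against a scratch dir.
import os
import tempfile

OMIT_LINE = 'omit_drivers+=" amdgpu radeon nouveau i915 "\n'

def write_omit_conf(root="/"):
    """Create etc/dracut.conf.d/omit-gpu-drivers.conf under root."""
    confdir = os.path.join(root, "etc/dracut.conf.d")
    os.makedirs(confdir, exist_ok=True)
    path = os.path.join(confdir, "omit-gpu-drivers.conf")
    with open(path, "w") as f:
        f.write(OMIT_LINE)
    return path

# Dry-run against a temporary directory instead of the live /etc.
path = write_omit_conf(root=tempfile.mkdtemp())
```

    To undo the experiment, delete the drop-in and regenerate the initrd again with "sudo dracut -f".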
  • Robert McQueen: Update from the GNOME board (2024/04/26 10:39)
    It’s been around 6 months since the GNOME Foundation was joined by our new Executive Director, Holly Million, and the board and I wanted to update members on the Foundation’s current status and some exciting upcoming changes. Finances As you may be aware, the GNOME Foundation has operated at a deficit (nonprofit speak for a loss – i.e. spending more than we’ve been raising each year) for over three years, essentially running the Foundation on reserves from some substantial donations received 4-5 years ago. The Foundation has a reserves policy which specifies a minimum amount of money we have to keep in our accounts. This is so that if there is a significant interruption to our usual income, we can preserve our core operations while we work on new funding sources. We’ve now “hit the buffers” of this reserves policy, meaning the Board can’t approve any more deficit budgets – to keep spending at the same level we must increase our income. One of the board’s top priorities in hiring Holly was therefore her experience in communications and fundraising, and building broader and more diverse support for our mission and work. Her goals since joining – as well as building her familiarity with the community and project – have been to set up better financial controls and reporting, develop a strategic plan, and start fundraising. You may have noticed the Foundation being more cautious with spending this year, because Holly prepared a break-even budget for the Board to approve in October, so that we can steady the ship while we prepare and launch our new fundraising initiatives. Strategy & Fundraising The biggest prerequisite for fundraising is a clear strategy – we need to explain what we’re doing and why it’s important, and use that to convince people to support our plans. I’m very pleased to report that Holly has been working hard on this and meeting with many stakeholders across the community, and has prepared a detailed and insightful five year strategic plan. 
The plan defines the areas where the Foundation will prioritise, develop and fund initiatives to support and grow the GNOME project and community. The board has approved a draft version of this plan, and over the coming weeks Holly and the Foundation team will be sharing this plan and running a consultation process to gather feedback and input from GNOME Foundation and community members. In parallel, Holly has been working on a fundraising plan to stabilise the Foundation, growing our revenue and our ability to deliver on these plans. We will be launching a variety of fundraising activities over the coming months, including a development fund for people to directly support GNOME development, working with professional grant writers and managers to apply for government and private foundation funding opportunities, and building better communications to explain the importance of our work to corporate and individual donors. Board Development Another observation that Holly had since joining was that we had, by general nonprofit standards, a very small board of just 7 directors. While we do have some committees which have (very much appreciated!) volunteers from outside the board, our officers are usually appointed from within the board, and many board members end up serving on multiple committees and wearing several hats. It also means the number of perspectives on the board is limited and less representative of the diverse contributors and users that make up the GNOME community. Holly has been working with the board and the governance committee to reduce how much we ask from individual board members, and improve representation from the community within the Foundation’s governance. Firstly, the board has decided to increase its size from 7 to 9 members, effective from the upcoming elections this May & June, allowing more voices to be heard within the board discussions. 
After that, we’re going to be working on opening up the board to more participants, creating non-voting officer seats to represent certain regions or interests from across the community, who will take part in committees and board meetings. These new non-voting roles are likely to be appointed with some kind of application process, and we’ll share details about these roles and how to be considered for them as we refine our plans over the coming year. Elections We’re really excited to develop and share these plans and increase the ways that people can get involved in shaping the Foundation’s strategy and how we raise and spend money to support and grow the GNOME community. This brings me to my final point, which is that we’re in the run-up to the annual board elections, which take place ahead of GUADEC. Because of the expansion of the board, and four directors coming to the end of their terms, we’ll be electing 6 seats this election. It’s really important to Holly and the board that we use this opportunity to bring some new voices to the table, leading by example in growing and better representing our community. Allan wrote in the past about what the board does and what’s expected from directors. As you can see we’re working hard on reducing what we ask from each individual board member by increasing the number of directors, and bringing additional members into committees and non-voting roles. If you’re interested in seeing more diverse backgrounds and perspectives represented on the board, I would strongly encourage you to consider standing for election and to reach out to a board member to discuss their experience. Thanks for reading! Until next time. Best Wishes,
Rob
President, GNOME Foundation
Update 2024-04-27: It was suggested in the Discourse thread that I clarify the interaction between the break-even budget and the 1M EUR committed by the STF project. 
This money is received in the form of a contract for services rather than a grant to the Foundation, and must be spent on the development areas agreed during the planning and application process. It’s included within this year’s budget (October 23 – September 24) and is all expected to be spent during this fiscal year, so it doesn’t have an impact on the Foundation’s reserves position. The Foundation retains a small percentage fee to support its costs in connection with the project, including the new requirement to have our accounts externally audited at the end of the financial year. We are putting this money towards recruitment of an administrative assistant to improve financial and other operational support for the Foundation and community, including the STF project and future development initiatives. (also posted to GNOME Discourse, please head there if you have any questions or comments)
  • Mike Blumenkrantz: Startup (2024/04/25 00:00)
It Happened Again. I’ve been seeing a lot of ultra technical posts fly past my news feed lately and I’m tired of it. There’s too much information out there, too many analyses of vague hardware capabilities, too much handwaving in the direction of compiler internals. It’s too much. Take it out. I know you’ve got it with you. I know all my readers carry them at all times. That’s right. It’s time to make some pasta. Everyone understands pasta.

Target Locked

Today I’ll be firing up the pasta maker on this ticket that someone nerdsniped me with. This is the sort of simple problem that any of us smoothbrains can understand: app too slow. Here at SGC, we’re all experts at solving app too slow by now, so let’s take a gander at the problem area. I’m in a hurry to get to the gym today, so I’ll skip over some of the less interesting parts of my analysis. Instead, let’s look at some artisanal graphics. This is an image, but let’s pretend it’s a graph of the time between when an app is started and when it displays its first frame: At the start is when the user launched the app, the body of the arrow is what happens during “startup”, and the head of the arrow is when the app has displayed its first frame to the user. The “startup” period is what the user perceives as latency. More technical blogs would break down here into discussions and navel-gazing about “time to first light” and “photon velocity” or whatever, but we’re keeping things simple. If SwapBuffers is called, the app has displayed its frame. Where are we at with this now?

Initial Findings

I did my testing on an Intel Icelake CPU/GPU because I’m lazy. Also because the original ticket was for Intel systems. Also because deal with it, this isn’t an AMD blog. The best way to time this is to:
- add an exit call at the end of SwapBuffers
- run the app in a while loop using time
- evaluate the results
On iris, the average startup time for gtk4-demo was between 190-200ms. On zink, the average startup time was between 350-370ms.
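The timing loop above can be sketched roughly like this (a sketch only: it assumes the app has already been patched to exit after its first SwapBuffers, and `true` stands in for the patched gtk4-demo):

```python
# Rough sketch of the timing method above. Assumes the app has been
# patched to exit right after its first SwapBuffers call; "true" stands
# in for the patched gtk4-demo here.
import subprocess
import time

def average_startup_ms(cmd, runs=10):
    """Run cmd repeatedly and return the mean wall-clock time in ms."""
    samples = []
    for _ in range(runs):
        start = time.monotonic()
        subprocess.run(cmd, check=True)
        samples.append((time.monotonic() - start) * 1000.0)
    return sum(samples) / len(samples)

print(f"average startup: {average_startup_ms(['true'], runs=5):.1f} ms")
```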
Uh-oh. More Graphics (The Fun Kind) Initial analysis revealed something very stupid for the zink case: a lot of time was being spent on shaders. Now, I’m not saying a lot of time was spent compiling shaders. That would be smart. Shaders have to be compiled, and it’s not like that can be skipped or anything. A cold run of this app that compiles shaders takes upwards of 1.0 seconds on any driver, and I’m not looking to improve that case since it’s rare. And hard. And also I gotta save some work for other people who want to make good blog posts. The problem here is that when creating shaders, zink blocks while it does some initial shader rewrites and optimizations. This is like if you’re going to make yourself a sandwich, before you put smoked brisket on the bread you have to first slice the bread so it’s ready when you want to put the brisket on it. Sure, you could slice it after you’ve assembled your pile of pulled pork and slaw, but generally you slice the bread, you leave the bread sitting somewhere while you find/make/assemble the burnt ends for your sandwich, and then you finish making your sandwich. Compiling shaders is basically the same as making a sandwich. But slicing bread takes time. And when you’re slicing the bread, you’re not doing anything else. You can’t. You’re holding a knife and a loaf of bread. You’re physically incapable of doing anything else until you finish slicing. Similarly, zink can’t do anything else while it’s doing that shader creation. It’s sitting there creating the shaders. And while it’s doing that, the rest of the app (or just the main GL thread if glthread is active) is blocked. It can’t do anything else. It’s waiting on zink to finish, and it cannot make forward progress until the shader creation has completed. Now this process happens dozens or hundreds of times during app startup, and every time it happens, the app blocks. 
Its own initialization routines–reading configuration data, setting up global structs and signal handlers, making display server connections, etc–cannot proceed until GL stops blocking. If you’re unsure where I’m going with this, it’s a bad thing that zink is slicing all this bread while the app is trying to make sandwiches.

Improvement

The year is whatever year you’re reading this, and in that year we have very powerful CPUs. CPUs so powerful that you can do lots of things at once. Instead of having only two hands to hold the bread and slice it, you have your own hands and then the hands of another 10+ of your clones which are also able to hold bread and slice it. So if you tell one of those clones “slice some bread for me”, you can do other stuff and come back to some nicely sliced bread. When exactly that bread arrives is another issue depensynchronizationding on how well you understand the joke here. But this is me, so I get all the jokes, and that means I can do something like this: By moving all that bread slicing into a thread, the rest of the startup operations can proceed without blocking. This frees up the app to continue with its own lengthy startup routines. After the change, zink starts up in an average of 260-280ms, a 25% improvement. I know not everyone wants pasta on their sandwiches, but that’s where we ended up today.

Not The End

That changeset is the end of this post, but it’s not the end of my investigation. There’s still mysteries to uncover here. Like why the farfalle is this app calling glXInitialize and eglInitialize? Can zink get closer to iris’s startup time? We’ll find out in a future installment of Who Wants To Eat Lunch?
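For the curious, the shape of the fix is roughly this (a toy sketch with invented names, not zink’s actual internals): the blocking shader creation moves to worker threads, and the main thread only waits when it actually needs the results.

```python
# Toy sketch of the change: shader "creation" (bread slicing) runs in
# worker threads so the rest of startup proceeds concurrently. All names
# here are invented for illustration; this is not zink's actual code.
import threading
import time

def create_shader_blocking(name, results):
    time.sleep(0.01)                 # stand-in for rewrites/optimizations
    results[name] = f"{name}: ready"

results = {}
workers = [
    threading.Thread(target=create_shader_blocking, args=(name, results))
    for name in ("vs", "fs")
]
for w in workers:
    w.start()                        # slicing happens in the background

app_initialized = True               # main thread keeps doing startup work

for w in workers:
    w.join()                         # wait only when the shaders are needed
```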
  • Tomeu Vizoso: Rockchip NPU update 3: Real-time object detection on RK3588 (2024/04/19 08:17)
Progress

Yesterday I managed to implement in my open-source driver all the remaining operations so the SSDLite MobileDet model can run on Rockchip's NPU in the RK3588 SoC. Performance is pretty good at 30 frames per second when using just one of the 3 cores that the NPU contains. I uploaded the generated video to YouTube. You can get the source code at my branch here.

Next steps

Now that we have got to this level of usefulness, I'm going to switch to writing a kernel driver suited for inclusion into the Linux kernel, in the drivers/accel subsystem. There is still lots of work to do, but progress is going pretty fast, though as I write more drivers for different NPUs I will have to split my time among them. At least, until we get more contributors! :)
  • Peter Hutterer: udev-hid-bpf: quickstart tooling to fix your HID devices with eBPF (2024/04/18 04:17)
For the last few months, Benjamin Tissoires and I have been working on and polishing a little tool called udev-hid-bpf [1]. This is the scaffolding required to quickly and easily write, test and eventually fix your HID input devices (mouse, keyboard, etc.) via a BPF program instead of a full-blown custom kernel driver or a semi-full-blown kernel patch. To understand how it works, you need to know two things: HID and BPF [2].

Why BPF for HID?

HID is the Human Interface Device standard and the most common way input devices communicate with the host (HID over USB, HID over Bluetooth, etc.). It has two core components: the "report descriptor" and "reports", both of which are byte arrays. The report descriptor is a fixed burnt-in-ROM byte array that (in rather convoluted terms) tells us what we'll find in the reports. Things like "bits 16 through to 24 is the delta x coordinate" or "bit 5 is the binary button state for button 3 in degrees Celsius". The reports themselves are sent at (usually) regular intervals and contain the data in the described format, as the device perceives reality. If you're interested in more details, see Understanding HID report descriptors. BPF, or more correctly eBPF, is a Linux kernel technology to write a program in a subset of C, compile it and load it into the kernel. The magic thing here is that the kernel will verify it, so once loaded, the program is "safe". And because it's safe it can be run in kernel space, which means it's fast. eBPF was originally written for network packet filters but as of kernel v6.3, and thanks to Benjamin, we have BPF in the HID subsystem. HID actually lends itself really well to BPF because, well, we have a byte array, and to fix our devices we need to do complicated things like "toggle that bit to zero" or "swap those two values". If we want to fix our devices we usually need to do one of two things: fix the report descriptor to enable/disable/change some of the values the device pretends to support.
For example, we can say we support 5 buttons instead of the supposed 8. Or we need to fix the report by e.g. inverting the y value for the device. This can be done in a custom kernel driver but a HID BPF program is quite a lot more convenient.

HID-BPF programs

For illustration purposes, here's the example program to flip the y coordinate. HID BPF programs are usually device specific; we need to know that e.g. the y coordinate is 16 bits and sits in bytes 3 and 4 (little endian):

SEC("fmod_ret/hid_bpf_device_event")
int BPF_PROG(hid_y_event, struct hid_bpf_ctx *hctx)
{
    s16 y;
    __u8 *data = hid_bpf_get_data(hctx, 0 /* offset */, 9 /* size */);

    if (!data)
        return 0; /* EPERM check */

    y = data[3] | (data[4] << 8);
    y = -y;

    data[3] = y & 0xFF;
    data[4] = (y >> 8) & 0xFF;

    return 0;
}

That's it. HID-BPF is invoked before the kernel handles the HID report/report descriptor, so to the kernel the modified report looks as if it came from the device. As said above, this is device specific because where the coordinate is in the report depends on the device (the report descriptor will tell us). In this example we want to ensure the BPF program is only loaded for our device (vid/pid of 04d9/a09f), and for extra safety we also double-check that the report descriptor matches.

// The bpf.o will only be loaded for devices in this list
HID_BPF_CONFIG(
    HID_DEVICE(BUS_USB, HID_GROUP_GENERIC, 0x04D9, 0xA09F)
);

SEC("syscall")
int probe(struct hid_bpf_probe_args *ctx)
{
    /*
     * The device exports 3 interfaces.
     * The mouse interface has a report descriptor of length 71.
     * So if report descriptor size is not 71, mark as -EINVAL
     */
    ctx->retval = ctx->rdesc_size != 71;
    if (ctx->retval)
        ctx->retval = -EINVAL;

    return 0;
}

Obviously the check in probe() can be as complicated as you want. This is pretty much it; the full working program only has a few extra includes and boilerplate. So it mostly comes down to compiling and running it, and this is where udev-hid-bpf comes in.
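As an aside (not part of the original tool): the byte manipulation in the y-flip program is easy to sanity-check in user space against a captured report. Here is the same little-endian 16-bit flip in Python, assuming the 9-byte report layout described above.

```python
# The same little-endian 16-bit y-flip as the BPF program above, written
# in Python so it can be checked against a captured report. Assumes the
# 9-byte report layout with y in bytes 3 and 4 (little endian, signed).
import struct

def flip_y(report: bytearray) -> bytearray:
    (y,) = struct.unpack_from("<h", report, 3)
    struct.pack_into("<h", report, 3, -y)
    return report

report = bytearray([0x01, 0x00, 0x00, 0x05, 0x00, 0x00, 0x00, 0x00, 0x00])
flip_y(report)  # y was 5, is now -5
```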
udev-hid-bpf as loader

udev-hid-bpf is a tool to make the development and testing of HID BPF programs simple, and to collect HID BPF programs. You basically run meson compile and meson install and voila, whatever BPF program applies to your devices will be auto-loaded next time you plug those in. If you just want to test a single bpf.o file you can udev-hid-bpf install /path/to/foo.bpf.o and it will install the required udev rule for it to get loaded whenever the device is plugged in. If you don't know how to compile, you can grab a tarball from our CI and test the pre-compiled bpf.o. Hooray, even simpler. udev-hid-bpf is written in Rust but you don't need to know Rust, it's just the scaffolding. The BPF programs are all in C. Rust just gives us a relatively easy way to provide a static binary that will work on most testers' machines. The documentation for udev-hid-bpf is here. So if you have a device that needs a hardware quirk or just has an annoying behaviour that you always wanted to fix, well, now's the time. Fixing your device has never been easier! [3]

[1] Yes, the name is meh but you're welcome to come up with a better one and go back in time to suggest it a few months ago.
[2] Because I'm lazy the terms eBPF and BPF will be used interchangeably in this article. Because the difference doesn't really matter in this context, it's all eBPF anyway but nobody has the time to type that extra "e".
[3] Citation needed
  • Simon Ser: Status update, April 2024 (2024/04/15 22:00)
Hi! The X.Org Foundation results are in, and I’m now officially part of the Board of Directors. I hope I can be of use to the community on more organizational issues! Speaking of which, I’ve spent quite a bit of time dealing with Code of Conduct matters lately. Of course I can’t disclose details for privacy reasons, but hopefully our actions can gradually improve the contribution experience for FreeDesktop.Org projects. New extensions have been merged in wayland-protocols. linux-drm-syncobj-v1 enables explicit synchronization, which is a better architecture than what we have today (implicit synchronization) and will improve NVIDIA support. alpha-modifier-v1 allows Wayland clients to set an alpha channel multiplier on their surfaces; it can be used to implement effects such as fade-in or fade-out without redrawing, and can even be offloaded to KMS. The tablet-v2 protocol we’ve used for many years has been stabilized. In other Wayland news, a new API has been added to dynamically resize libwayland’s internal buffer. By default, the server-side buffer size is still 4 KiB but the client-side buffer will grow as needed. This should help with bursts (e.g. long format lists) and high poll rate mice. I’ve added a new wayland-scanner mode to generate headers with only enums, to help libraries such as wlroots which use these in their public API. And I’ve sent an announcement for the next Wayland release; it should happen at the end of May if all goes well. With the help of Sebastian Wick, libdisplay-info has gained support for more bits, in particular DisplayID type II, III and VII timings, as well as CTA Video Format Preference blocks, Room Configuration blocks and Speaker Location blocks. I’ve worked on libicc to finish up the parser; next I’d like to add the math required to apply an ICC profile. gamja now has basic support for file uploads (only when pasting a file for now) and hides no-op nickname changes (e.g. from “emersion” to “emersion_” and back). See you next month!
  • Christian Gmeiner: hwdb - The only truth (2024/04/15 00:00)
Trusting hardware, particularly the registers that describe its functionality, is fundamentally risky.

tl;dr

The etnaviv GPU stack is continuously improving and becoming more robust. This time, a hardware database was incorporated into Mesa, utilizing header files provided by the SoC vendors. If you are interested in the implementation details, I recommend checking out this Mesa MR. Are you employed at VeriSilicon and want to help? You could greatly simplify our work by supplying the community with a comprehensive header that includes all the models you offer. Last but not least: I deeply appreciate Igalia’s passion for open source GPU driver development, and I am grateful to be a part of the team. Their enthusiasm for open source work not only pushes the boundaries of technology but also builds a strong, collaborative community around it.

The good old days

Years ago, when I began dedicating time to hacking on etnaviv, the kernel driver in use would read a handful of registers and relay the gathered information to the user space blob. This blob driver was then capable of identifying the GPU (including model, revision, etc.), supported features (such as DXT texture compression, seamless cubemaps, etc.), and crucial limits (like the number of registers, number of varyings, and so on). For reverse engineering purposes, this interface is super useful. Imagine if you could change one of these feature bits on a target running the binary blob. With libvivhook it is possible to do exactly this. From time to time, I am running such an old vendor driver stack on an i.MX 6QuadPlus SBC, which features a Vivante GC3000 as its GPU. Somewhere, I have a collection of scripts that I utilized to acquire additional knowledge about unknown GPU states activated when a specific feature bit was set. To explore a simple example, let’s consider the case of misrepresenting a GPU’s identity as a GC2000.
This involves modifying the information provided by the kernel driver to the user space, making the user space driver believe it is interacting with a GC2000 GPU. This scenario could be used for testing, debugging, or understanding how specific features or optimizations are handled differently across GPU models.

export ETNAVIV_CHIP_MODEL="0x2000"
export ETNAVIV_CHIP_REVISION="0x5108"
export ETNAVIV_FEATURES0_CLEAR="0xFFFFFFFF"
export ETNAVIV_FEATURES1_CLEAR="0xFFFFFFFF"
export ETNAVIV_FEATURES2_CLEAR="0xFFFFFFFF"
export ETNAVIV_FEATURES0_SET="0xe0296cad"
export ETNAVIV_FEATURES1_SET="0xc9799eff"
export ETNAVIV_FEATURES2_SET="0x2efbf2d9"
LD_PRELOAD="/lib/" ./test-case

If you capture the generated command stream and compare it with the one produced under the correct identity, you’ll observe many differences. This is super useful - I love it.

Changing Tides: The Shift in ioctl() Interface

At some point in time, Vivante changed their ioctl() interface and modified the gcvHAL_QUERY_CHIP_IDENTITY command. Instead of providing a very detailed chip identity, they reduced the data set to the following values:
- model
- revision
- product id
- eco id
- customer id
This shift could indeed hinder reverse engineering efforts significantly. At a glance, it becomes impossible to alter any feature value, and understanding how the vendor driver processes these values is out of reach. Determining the function or impact of an unknown feature bit now seems unattainable. However, the kernel driver also requires a mechanism to verify the existing features of the GPU, as it needs to accommodate a wide variety of GPUs. Therefore, there must be some sort of system or method in place to ensure the kernel driver can effectively manage and support the diverse functionalities and capabilities of different GPUs.

A New Approach: The Hardware Database Dilemma

Let’s welcome: gc_feature_database.h, or hwdb for short.
Vivante transitioned to using a database that stores entries for limit values and feature bits. This database is accessed by querying with model, revision, product id, eco id and customer id. There is some speculation about why this move was made. My theory posits that they became frustrated with the recurring cycle of introducing feature bits to indicate the implementation of a feature, subsequently discovering problems with said feature, and then having to introduce additional feature bits to signal that the feature now truly operates as intended. It became far more straightforward to deactivate a malfunctioning feature by modifying information in the hardware database (hwdb). After they began utilizing the hwdb within the driver, updates to the feature registers in the hardware ceased. Here is a concrete example of such a case that can be found in the etnaviv gallium driver:

screen->specs.tex_astc = VIV_FEATURE(screen, chipMinorFeatures4, TEXTURE_ASTC) &&
                         !VIV_FEATURE(screen, chipMinorFeatures6, NO_ASTC);

Meanwhile, in the etnaviv world there was a hybrid in the making. We stuck with the detailed feature words and found a smart way to convert from Vivante’s hwdb entries to our own in-kernel database. There is even a full blown Vivante -> etnaviv hwdb converter. At that time, I did not fully understand all the consequences this approach would bring - more on that later. So, I dedicated my free time to reverse engineering and tweaking the user space driver, while letting the kernel developers do their thing. About a year after the initial hwdb landed in the kernel, I thought it might be a good idea to read out the extra id values and provide them via sysfs to the user space. At that time, I already had the idea of moving the hardware database to user space in mind. However, I was preoccupied with other priorities that were higher on my to-do list, and I ended up forgetting about it.
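To make the two models concrete, here is a toy sketch (every database entry and value below is invented; only the shape mirrors the text: a lookup keyed by the five id values, and a later NO_ASTC bit revoking an earlier TEXTURE_ASTC bit):

```python
# Toy model of the two approaches described above. The entries are
# invented; only the structure is real: the hwdb is keyed by
# (model, revision, product id, eco id, customer id), and a later
# NO_ASTC feature bit can revoke an earlier TEXTURE_ASTC bit.
HWDB = {
    #  model,  rev,   product, eco, customer
    (0x7000, 0x6214, 0x70003, 0, 0): {"TEXTURE_ASTC": True, "NO_ASTC": True},
    (0x3000, 0x5450, 0x00000, 0, 0): {"TEXTURE_ASTC": True, "NO_ASTC": False},
}

def lookup(model, revision, product_id, eco_id, customer_id):
    return HWDB.get((model, revision, product_id, eco_id, customer_id))

def tex_astc(features):
    # mirrors: VIV_FEATURE(..., TEXTURE_ASTC) && !VIV_FEATURE(..., NO_ASTC)
    return features["TEXTURE_ASTC"] and not features["NO_ASTC"]
```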
Challenge accepted

Tomeu Vizoso began to work on teflon and a Neural Processing Unit (NPU) driver within Mesa, leveraging a significant amount of the existing codebase and concepts, including the same kernel driver for the GPU. During this process, he encountered a need for some NPU-specific limit values. To address this, he added an in-kernel hwdb entry and made the limit values accessible to user space. That’s it — the kernel supplies all the values the NPU driver requires. We’re finished, aren’t we? It turns out that there are many more NPU related values that need to be exposed in the same manner, with seemingly no end in sight. One of the major drawbacks when the hardware database (hwdb) resides in the kernel is the considerable amount of time it takes for hwdb patches to be written, reviewed, and eventually merged into Linus’s git tree. This significantly slows down the development of user space drivers. For end users, this means they must either run a bleeding-edge kernel or backport the necessary changes on their own. For me personally, the in-kernel hardware database should never have been implemented in its current form. If I could go back in time, I would have voiced my concerns. As a result, moving the hardware database (hwdb) to user space quickly became a top priority on my to-do list, and I began working on it. However, during the testing phase of my proof of concept (PoC), I had to pause my work due to a kernel issue that made it unreliable for user space to trust the ID values provided by the kernel. Once my fix for this issue began to be incorporated into stable kernel versions, it was time to finalize the user space hwdb. There is only one little but important detail we have not talked about yet. There are vendor specific versions of gc_feature_database.h based on different versions of the binary blob. For instance, there is one from NXP, ST, Amlogic and some more.
Here is a brief look at the differences:
- nxp/gc_feature_database.h (autogenerated at 2023-10-24 16:06:00, 861 struct members, 27 entries)
- stm/gc_feature_database.h (autogenerated at 2022-12-29 11:13:00, 833 struct members, 4 entries)
- amlogic/gc_feature_database.h (autogenerated at 2021-04-12 17:20:00, 733 struct members, 8 entries)
We understand that these header files are generated and adhere to a specific structure. Therefore, all we need to do is write an intelligent Python script capable of merging the struct members into a single consolidated struct. This script will also convert the old struct entries to the new format and generate a header file that we can use. I’m consistently amazed by how swiftly and effortlessly Python can be used for such tasks. Ninety-nine percent of the time, there’s a ready-to-use Python module available, complete with examples and some documentation. To address the C header parsing challenge, I opted for pycparser. The final outcome is a generated hwdb.h file that looks and feels similar to those generated from the binary blob.

Future proof

This header merging approach offers several advantages:
- It simplifies the support for another SoC vendor.
- There’s no need to comprehend the significance of each feature bit. The source header files are supplied by VeriSilicon or the SoC vendor, ensuring accuracy.
- Updating the hwdb is straightforward — simply replace the files and rebuild Mesa.
- It allows for much quicker deployment of new features and hwdb updates since no kernel update is required. This method accelerates the development of user space drivers.
While working on this topic I decided to do a bigger refactoring with the end goal to provide a struct etna_core_info that is located outside of the gallium driver. This makes the code future proof and moves the filling of struct etna_core_info directly into the lowest layer - libetnaviv_drm (src/etnaviv/drm). We have not yet talked about one important detail.
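As a quick aside before getting to that detail: the merge step can be sketched roughly like this (heavily simplified; the real script parses the vendor headers with pycparser, and all member names below are invented):

```python
# Heavily simplified sketch of the header merge. The real script parses
# the vendor headers with pycparser; the member lists here are invented.
# Each vendor header defines the same struct with a different subset of
# members; their union, in a stable order, becomes the consolidated struct.
nxp = ["chipModel", "chipRevision", "REG_FastClear", "NN_CORE_COUNT"]
stm = ["chipModel", "chipRevision", "REG_FastClear"]
amlogic = ["chipModel", "chipRevision", "REG_Pipe2D"]

def merge_members(*member_lists):
    merged = []
    for members in member_lists:
        for m in members:
            if m not in merged:
                merged.append(m)
    return merged

def convert_entry(values, merged):
    # old entries get a 0 default for members their header didn't know about
    return {m: values.get(m, 0) for m in merged}

MERGED = merge_members(nxp, stm, amlogic)
```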
What happens if there is no entry in the user space hwdb? The solution is straightforward: we fall back to the previous method and request all feature words from the kernel driver. However, in an ideal scenario, our user space hardware database should supply all necessary entries. If you find that an entry for your GPU/NPU is missing, please get in touch with me.

What about the in-kernel hwdb?

The existing system, despite its limitations, is set to remain indefinitely, with new entries being added to accommodate new GPUs. Although it will never contain as much information as the user space counterpart, this isn’t necessarily a drawback. For the purposes at hand, only a handful of feature bits are required.
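The fallback logic can be sketched as follows (all names invented for illustration):

```python
# Sketch of the fallback described above (names invented): consult the
# user space hwdb first; if the GPU/NPU is unknown, fall back to
# requesting the raw feature words from the kernel driver.
def query_features(key, userspace_hwdb, kernel_feature_words):
    entry = userspace_hwdb.get(key)
    if entry is not None:
        return entry                      # preferred: user space hwdb
    return kernel_feature_words(key)      # legacy path via the kernel

USERSPACE_HWDB = {(0x7000, 0x6214): {"source": "userspace hwdb"}}

def kernel_feature_words(key):
    return {"source": "kernel feature words"}
```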
  • Mike Blumenkrantz: Quick Post (2024/04/12 00:00)
    Super Fast Just a quick post to let everyone know that I have clicked merge on the vroom MR. Once it lands, you can test the added performance gains with ZINK_DEBUG=ioopt. I’ll be enabling this by default in the next month or so once a new GL CTS release happens that fixes all the hundreds of broken tests which would otherwise regress. With that said, I’ve tested it on a number of games and benchmarks, and everything works as expected. Have fun.
  • Mike Blumenkrantz: Descending (2024/04/04 00:00)
    Into The Spiral of Madness I know what you’re all thinking: there have not been enough blog posts this year. As always, my highly intelligent readers are right, and as always, you’re just gonna have to live with that because I’m not changing the way anything works. SGC happens when it happens. And today. As it snows in April. SGC. Is. Happening. Let’s begin. In The Beginning, A Favor Was Asked I was sitting at my battlestation doing some very ordinary REDACTED work for REDACTED, and friend of the blog, Samuel “Shader Objects” Pitoiset (he has legally changed his name, please be respectful), came to me with a simple request. He wanted to enable VK_EXT_shader_object for the radv-zink jobs in mesa CI as the final part of his year-long bringup for the extension. This meant that all the tests passing without shader objects needed to also pass with shader objects. This should’ve been easy; it was over a year ago that the Khronos blog famously and confusingly announced that pipelines were dead and nobody should ever use them again (paraphrased). A year is more than enough time for everyone to collectively get their shit together. Or so you might think. Turns out shader objects are hard. This simple ask sent me down a rabbithole the likes of which I had never imagined. It started normally enough. There were a few zink tests which failed when shader objects were enabled. Nobody was surprised; I wrote the zink usage before validation support had landed and also before anything but lavapipe supported it. As everyone is well aware, lavapipe is the best and most handsome Vulkan driver, and just by using it you eliminate all bugs that your application may have. RADV is not, and so there are bugs. 
A number of them were simple:
- invalid location assignment for patch variables
- harmless VVL spam about missing dynamic state setters
- Vulkan spec had broken corner cases
- reporting wrong value for max patch components
- also patch variables were differently broken in another way
The list goes on, and longtime followers of the blog are nodding to themselves as they skim the issues, confirming that they would have applied all the same one-liner fixes. Then it started to get crazy.

Locations, How Do They Work?

I’m a genius, so obviously I know how this all works. That’s why I’m writing this blog. Right? Right. Good. So Samuel comes to me, and he hits me with this absolute brainbuster of an issue. An issue so tough that I have to perform an internet search to find a credible authority on the topic. I found this amazing and informative site that exactly described the issue Samuel had posted. I followed the staggering intellect of the formidable author and blah blah blah yeah obviously the only person I’d find writing about an issue I have to solve is past-me who was too fucking lazy to actually solve it. I started looking into this more deeply after taking a moment to fix a different issue related to location assignment that Samuel was too lazy to file a ticket for and thus has deprived the blog of potential tests that readers could run to examine and debug the issue for themselves. But the real work was happening elsewhere.

Deeper

Now we’re getting to the good stuff. I hope everyone has their regulation-thickness safety helmet strapped on and splatter guards raised to full height because you’ll need them both. As I said in Adventures In Linking, nir_assign_io_var_locations is the root of all evil. In the case where shaders have mismatched builtins, the assigned locations are broken. I decided to take the hammer to this. I mean I took the forbidden action, did the very thing that I railed about live at XDC.
Sidebar: at this exact moment, Samuel told me his issue was already fixed. I added a new pipe cap. I know. It was a last resort, but I wanted the issue fixed. The result was this MR, which gave nir_assign_io_var_locations the ability to ignore builtins with regard to assigning locations. This would resolve the issue once and for all, as drivers which treat builtins differently could pass the appropriate param to the NIR pass and then get good results. Problem solved. Deeper. I got some review comments which were interesting, but ultimately the problem remained: lavapipe (and maybe some other vulkan drivers) use this pass to assign locations, and no amount of pipe caps will change that. It was a tough problem to solve, but someone had to do it. That’s why I dug in and began examining this MR from the only man who is both a Mesa expert and a Speed Force user, Marek Olšák, to enable his new NIR optimized linker for RadeonSI. This was a big, meaty triangles-go-brrr thing to sink my teeth into. I had to get into a different headspace to figure out what I was even doing anymore. The gist of opt_varyings is that you give all the shaders in a pipeline to Marek, and Marek says “trust me, buddy, this is gonna be way faster” and gives you back new shaders that do the same thing except only the vertex shader actually has any code. Read the design document if you want more info. Now I’m deep into it though, and I’m reading the commits, and I see there’s this new lower_mediump_io callback which lowers mediump I/O to 16bit. Which is allowed by GLSL. And I use GLSL, so naturally I could do this too. And I did, and I ran it in zink, and I put it through CTS and OH FUCK OH SHIT OH FUCK WHAT THE FUCK EVEN– mediump? More Like… Like… Medium… Stupid. Here’s the thing. In GLSL, you can have mediump I/O which drivers can translate to mean 16bit I/O, and this works great. 
In Vulkan, we have this knockoff brand, dumpster tier VK_KHR_16bit_storage extension which seems like it should be the same, except for one teeny tiny little detail:

• VUID-StandaloneSpirv-Component-04920: The Component decoration value must not be greater than 3

Brilliant. So I can have up to four 16bit components at a given location. Two whole dwords. Very useful. Great. Just what I wanted. Thanks. Also, XFB is a thing, and, well, pardon my saying so, but mediump xfb? Fuck right off.

Next Up: IO Lowering—FRONTEND EDITION

With mediump safely ejected from the codebase and my life, I was free to pursue other things. I didn’t, but I was free to. And even with Samuel screaming somewhere distant that his issue was already long since fixed, I couldn’t stop. There were other people struggling to implement opt_varyings in their own drivers, and as we all know, half of driver performance is the speed with which they implement new features. That meant that, as expected, RadeonSI had a significant lead on me since I’m always just copying Marek’s homework anyway, but the hell if I was about to let some other driver copy homework faster than me. Fans of the blog will recall way, way, way, way back in Q3 ‘23 when I blogged about very dumb things. Specifically about how I was going to start using “lowered I/O” in zink. Well, I did that. And then I let the smoking rubble cool for a few months. And now it’s Q2 ‘24, and I’m older and unfathomably wiser, and I am about to put this rake into the wheel of my bicycle once more. In this case, the rake is nir_io_glsl_lower_derefs, which moves all the I/O lowering into the frontend rather than doing it manually. The result is the same: zink gets lowered I/O, and the only difference is that it happens earlier. It’s less code in zink, and… Of course there is no driver but RadeonSI which sets nir_io_glsl_lower_derefs. And, of course, RadeonSI doesn’t use any of the common Gallium NIR passes. But surely they’d still work.
Surely at least some of them would work. Surely there wouldn’t be that many of them. Surely the ones that didn’t work would be easy to fix. Surely they wouldn’t uncover any other, more complex, more time-consuming issues that would drag in the entire Mesa compiler ecosystem. Wouldn’t be worth mentioning at SGC if any of those were true, would it. SGC vs Old NIR Passes By now I was pretty deep into this project, which is to say that I had inexplicably vanished from several other tasks I was supposed to be accomplishing, and the only way out was through. But before I could delve into any of the legacy GL compatibility stuff, I had bigger problems. Namely, everything was exploding because I failed to follow the directions and was holding opt_varyings wrong. In the fine print, the documentation for the pass very explicitly says that lower_to_scalar must be set in the compiler options. But did I read the directions? Obviously I did. If you’re asking whether I read them comprehensively, however, or whether I remembered what I had read once I was deep within the coding fugue of fixing this damn bug Samuel had given me way back wh… With lower_to_scalar active, I actually came upon the big problem: my existing handling for lowered I/O was inadequate, and I needed to make my code better. Much better. Originally when I switched to lowered I/O, I wrote some passes to unclown I/O back to variables and derefs. There was one NIR pass that ran early on to generate variables based on the loads and stores, and there was a second that ran just before spirv translation to convert all the load/store intrinsics back to load/store derefs. This worked great. But it didn’t work great now! Obviously it wouldn’t, right? I mean, nothing in this entire compiler stack ever works, does it? It’s all just a giant jenga tower that’s one fat-finger away from total and utter—What? Oh, right, heh, yeah, no, I just got a little carried away remembering is all. No problem.
Let’s keep going. We have to now that we’ve already come this far. Don’t we? I’ll stop writing if you stop reading, how about that. No? Well, heh, of course it’d be that way! This is… We’re SGC! So I had this rework_io_vars function, and it. was. BIG. I’m talking over a hundred lines with loops and switches and all kinds of cool control flow to handle all the weird corner cases I found at 4:14am when I was working on it. The way that it worked was pretty simple: scan through the shader looking for loads/stores; using the load/store instruction’s component type/count, infer a variable; and pray that nothing with complex indirect access comes along. It worked great. Really, there were no known bugs. The problem with this came with the scalarized frontend I/O lowering, which would create patterns like: store(location=1, component_count=1) followed by store(location=0, component_count=1, array_size=4, array_offset=$val). In this scenario, there’s indirect access mixed with direct access for the same location, but it’s at an offset from the base of the array, and it kiiinda almost works except it totally doesn’t because the first instruction has no metadata hint about being part of the second instruction’s array. And since the pass iterates over the shader in instruction order, encountering the instructions in this order is a problem whereas encountering them in a different order potentially wouldn’t be a problem. I had two options available to me at that point. The first option was to add in some workarounds to enlarge the scalar to an array when encountering this pattern. And I tried that, and it worked. But then I came across a slightly different variant which didn't work. And that's when I chose the second option. Burn it all down. The whole thing. I mean, uh, just—just that one function. It’s not like I want to BURN THE WHOLE THING DOWN after staring into the abyss for so long, definitely not. The new pass! Right, the new pass.
The new rework_io_vars pass that I wrote is a sequence of operations that ends up being far more robust than the original. It works something like this: first, rely only on the shader_info masks, e.g., outputs_written and inputs_read; rework_io_vars is the base function, with special-casing for VS inputs and FS outputs to create variables for those builtins separately; with those done, check for the more common I/O builtins and create variables for those; now that all the builtins are done, scan for indirect access and create variables for that; finally, scan and create variables for ordinary, direct access. The “scan” process ends up being a function called loop_io_var_mask which iterates a shader_info mask for a given input/output mode and scans the shader for instructions which occur on each location for that mode. The gathered info includes a component mask as well as array size and fbfetch info–all that stuff. Everything needed to create variables. After the shader is scanned, variables are created for the given location. By processing the indirect mask first, it becomes possible to always detect the above case and handle it correctly. Problem solved. Problems Only Multiply But that’s fine, and I am so sane right now you wouldn’t believe it if I told you. I wrote this great, readable, bulletproof variable generator, and it’s tremendous, but then I tried using it without nir_io_glsl_lower_derefs because I value bisectability, and obviously there was zero chance that would ever work so why would I ever even bother. XFB is totally broken, and there’s all kinds of other weird failures that I started examining and then had to go stand outside staring into the woods for a while, and it’s just not happening. And nir_io_glsl_lower_derefs doesn’t work without the new version either, which means it’s gonna be impossible to bisect anything between the two changes. Totally fine, I’m sure, just like me.
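A toy version of that ordering trick (my own simplification in C; the names, masks, and var_slot layout are made up, and the real pass walks actual NIR instructions):

```c
/* Sketch (my own simplification, not the actual zink code) of why
 * processing indirectly-accessed locations first matters: a direct
 * store to a location inside an array's range must be recognized as
 * part of that array instead of spawning its own scalar variable. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_LOCATIONS 64

struct var_slot {
    bool created;
    bool is_array;
    int array_size;
};

/* Pretend shader_info-style bitmasks: which locations are written at
 * all, and which of those are reached through indirect (array) access. */
static void create_vars(uint64_t outputs_written, uint64_t indirect_mask,
                        const int *array_size, struct var_slot *slots)
{
    /* Pass 1: indirect access -> create array variables covering a range. */
    uint64_t mask = indirect_mask;
    while (mask) {
        int loc = __builtin_ctzll(mask);
        mask &= mask - 1;
        slots[loc].created = true;
        slots[loc].is_array = true;
        slots[loc].array_size = array_size[loc];
        /* Mark the whole range as covered so pass 2 skips it. */
        for (int i = 1; i < array_size[loc] && loc + i < MAX_LOCATIONS; i++)
            slots[loc + i].created = true;
    }

    /* Pass 2: plain direct access -> ordinary variables, but only for
     * locations not already swallowed by an array above. */
    mask = outputs_written;
    while (mask) {
        int loc = __builtin_ctzll(mask);
        mask &= mask - 1;
        if (!slots[loc].created)
            slots[loc].created = true;
    }
}
```

With the indirect pass run first, a lone direct store to location 1 lands inside a location-0 array's range and gets absorbed instead of spawning a bogus scalar variable, which is the failure mode the old instruction-order walk tripped over.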
By now, I had a full stack of zink compiler cleanups and fixes that I’d accumulated in the course of all this. Multiple stacks, really. So many stacks. Fortunately I was able to slip them into the repo without anyone noticing. And also without CI slowing to a crawl due to the freedreno farm yet again being in an absolute state. I was passing CTS again, which felt great. But then I ran piglit, and I remembered that I had skipped over all those Gallium compatibility passes. And I definitely had to go in and fix them. There were a lot of these passes to fix, and nearly all of them had the same two issues: they only worked with derefs, and they didn’t work with scalarized I/O. This meant I had to add handling for lowered I/O without variables, and then I also had to add generic handling for scalarized versions of both codepaths. Great, great, great. So I did that. And one of them really needed a lot of work, but most of the others were reasonably straightforward. And then there’s lower_clip. lower_clip is a pass that rewrites shaders to handle user-specified clip planes when the underlying driver doesn’t support them. The pass does this by leveraging clipdistance. And here’s the thing about clipdistance: unlike the other builtins, it’s an array. But it’s treated like a vector. Except you can still access it indirectly like an array. So is it an array or is it a vector? Decades from now, graphics engineers will still be arguing about this stupidity, but now is the time when I need to solve this, and it’s not something that I, as a mere, singular human, can possibly solve. Hah! There’s no way I’d be able to do that. I’d have to be crazy. And I’m… Uh-oh, what’s the right way to finish that statement? It’s probably fine! Everything’s fine! But when you’ve got an array that’s treated like a vector that’s really an array, things get confusing fast, and in NIR there’s the compact flag to indicate that you need to reassess your life choices.
One of those choices needing reassessment is the use of nir_shader_gather_info, a simple function that populates shader_info with useful metadata after scanning the shader. And here’s a pop quiz that I’m sure everyone can pass with ease after reading this far. How many shader locations are consumed by gl_ClipDistance? Simple question, right? It’s a variably-sized float[] array-vector with up to 8 members, so it consumes up to two locations. Right? No, that’s a question, not a rhetorical—But you’re using nir_shader_gather_info, and it sees gl_ClipDistance, okay, so how many slots do you expect it to add to your outputs_written bitmask? Is it 8? Or is it 2? Does anybody really know? Regardless of what you thought, the answer is 8, and you’ll get 8, and you’ll be happy with 8. And if you’re trying to use outputs_written for anything, and you see any of the other builtins within 8 slots of gl_ClipDistance being used, then you should be able to just figure out that this is clipdistance playing pranks again. Right? “It’s all fun and games until someone gets too deep into clipdistance” is a proverb oft-repeated among compiler developers. Personally, I went back and forth until I cobbled together something to sort of almost fix the problem, but I posed the issue to the community at large, and now we have plans with headings and subheadings. You’re welcome. And that’s the end of it, right? Nope The problem with going in and fixing anything in core Mesa is that you end up breaking everything else. So while I was off fixing Gallium compatibility passes, specifically lower_clip, I ended up breaking freedreno and v3d. Someday maybe we’ll get to the bottom of that. But I’m fast-forwarding, because while I was working on this… What even is this anymore? Right, I was fixing Samuel’s bug. The one about not using opt_varyings. So I had my variable generator functioning, and I had the compat passes working (for me), and CTS and piglit were both passing.
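For the record, here are the two counts from that quiz side by side (a toy C sketch of my own, not nir_shader_gather_info itself): slot math says a float[8] gl_ClipDistance packs into two vec4 locations, while element-per-bit accounting burns eight bits of outputs_written.

```c
/* Two ways to count gl_ClipDistance (my own sketch, not Mesa code). */
#include <assert.h>

#define COMPONENTS_PER_SLOT 4

/* Locations actually consumed: float elements packed 4-per-vec4 slot,
 * rounded up. */
static int clipdist_slots(int array_len)
{
    return (array_len + COMPONENTS_PER_SLOT - 1) / COMPONENTS_PER_SLOT;
}

/* Bits a gather-info-style scan marks in outputs_written: one per
 * array element, not one per slot. */
static int clipdist_mask_bits(int array_len)
{
    return array_len;
}
```

Same array, two answers, and anything consuming outputs_written has to know which accounting it's looking at.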
Then I decided to try out nir_io_glsl_opt_varyings. Just a little. Just to see what happened. I don’t have any more jokes here. It didn’t work good. A lot of things went boom-boom. There were some opt_varyings bugs like these, and some related bugs like this, and there was missing core NIR stuff for zink, and there were GLSL bugs, and also CTS was broken. Also a bunch of the earlier zink stacks of compiler patches were fixing bugs here. But eventually, over weeks, it started working. The Deepest Depths Other than verifying everything still works, I haven’t tested much. If you’re feeling brave, try out the MR with dependencies (or wait for rebase) and tell me how the perf looks. So far, all I’ve seen is about a 6000% improvement across the board. Finally, it’s over. Samuel, your bug is fixed. Never ask me for anything again.
  • Maira Canal: Linux 6.8: AMD HDR and Raspberry Pi 5 (2024/04/02 11:00)
The Linux kernel 6.8 came out on March 10th, 2024, bringing brand-new features and plenty of performance improvements across different subsystems. As part of Igalia, I’m happy to have been an active part of many of the features released in this version, and today I’m going to review some of them. Linux 6.8 is packed with a lot of great features, performance optimizations, and new hardware support. In this release, we get experimental support for the Intel Xe DRM driver, further support for AMD Zen 5 and other upcoming AMD hardware, initial support for the Qualcomm Snapdragon 8 Gen 3 SoC, the Imagination PowerVR DRM kernel driver, support for the Nintendo NSO controllers, and much more. Igalia is widely known for its contributions to Web Platforms, Chromium, and Mesa. But we also make significant contributions to the Linux kernel. This release shows some of the great work that Igalia is putting into the kernel and strengthens our desire to keep working with this great community. Let’s take a deep dive into Igalia’s major contributions to the 6.8 release: AMD HDR & Color Management You may have seen the release of a new Steam Deck last year, the Steam Deck OLED. What you may not know is that Igalia helped bring this product to life by putting some effort into the AMD driver-specific color management properties implementation. Melissa Wen, together with Joshua Ashton (Valve) and Harry Wentland (AMD), implemented several driver-specific properties to allow Gamescope to manage color features provided by the AMD hardware to fit HDR content and improve gamers’ experience.
She has explained all the features implemented in the AMD display kernel driver in two blog posts and a 2023 XDC talk: AMD Driver-specific Properties for Color Management on Linux (Part 1) AMD Driver-specific Properties for Color Management on Linux (Part 2) The Rainbow Treasure Map Talk: Advanced color management on Linux with AMD/Steam Deck Async Flip André Almeida worked together with Simon Ser (SourceHut) to provide support for asynchronous page-flips in the atomic API. This feature targets users who want to present a new frame immediately, even after missing a V-blank. This feature is particularly useful for applications with high frame rates, such as gaming. Raspberry Pi 5 The Raspberry Pi 5 was officially released in October 2023, and Igalia was ready to bring top-notch graphics support to it. Although we still can’t use the RPi 5 with the mainline kernel, it is superb to see some pieces coming upstream. Iago Toral worked on implementing all the kernel support needed for the V3D 7.1.x driver. With the kernel patches, by the time the RPi 5 was released, it already included a fully compliant OpenGL ES 3.1 and Vulkan 1.2 driver implemented by Igalia. GPU stats and CPU jobs for the Raspberry Pi 4/5 Apart from the release of the Raspberry Pi 5, Igalia is still working on improving the whole Raspberry Pi environment. I worked together with José Maria “Chema” Casanova on implementing support for GPU stats on the V3D driver. This means that RPi 4/5 users can now access the GPU usage percentage, and they can access the statistics per process or globally. I also worked together with Melissa on implementing CPU jobs for the V3D driver. As the Broadcom GPU isn’t capable of performing some operations, the Vulkan driver uses the CPU to compensate. In order to avoid stalls in job submission, CPU jobs are now part of the kernel and can be easily synchronized through synchronization objects.
If you are curious about the CPU job implementation, you can check this blog post. Other Contributions & Fixes Sometimes we don’t contribute to a major feature in a release, but we can still help by improving documentation and sending fixes. André also contributed to this release by documenting the different AMD GPU reset methods, making them easier for future users to understand. During Igalia’s efforts to improve the general user experience on the Steam Deck, Guilherme G. Piccoli noticed a message in the kernel log and readily provided a fix for this PCI issue. Outside of the Steam Deck world, we can check some of Igalia’s work on the Qualcomm Adreno GPUs. Although most of our Adreno-related work happens in user-space, Danylo Piliaiev sent a couple of kernel fixes to the msm driver, fixing some hangs and some CTS tests. We also had contributions from our 2023 Igalia CE student, Nia Espera. Nia’s project was related to mobile Linux, and she managed to write a couple of patches to the kernel in order to add support for the OnePlus 9 and OnePlus 9 Pro devices. If you are a student interested in open-source and would like to have a first exposure to the professional world, check if we have openings for the Igalia Coding Experience. I was a CE student myself, and being mentored by an Igalian was an incredible experience. Check the complete list of Igalia’s contributions for the 6.8 release Authored (57): André Almeida (2) drm: Refuse to async flip with atomic prop changes drm/amd: Document device reset methods Danylo Piliaiev (2) drm/msm/a6xx: Add missing BIT(7) to REG_A6XX_UCHE_CLIENT_PF drm/msm/a690: Fix reg values for a690 Guilherme G.
Piccoli (1) PCI: Only override AMD USB controller if required Iago Toral Quiroga (4) drm/v3d: update UAPI to match user-space for V3D 7.x drm/v3d: fix up register addresses for V3D 7.x dt-bindings: gpu: v3d: Add BCM2712’s compatible drm/v3d: add brcm,2712-v3d as a compatible V3D device Maíra Canal (17) drm/v3d: wait for all jobs to finish before unregistering drm/v3d: Implement show_fdinfo() callback for GPU usage stats drm/v3d: Expose the total GPU usage stats on sysfs MAINTAINERS: Add Maira to V3D maintainers drm/v3d: Don’t allow two multisync extensions in the same job drm/v3d: Decouple job allocation from job initiation drm/v3d: Use v3d_get_extensions() to parse CPU job data drm/v3d: Create tracepoints to track the CPU job drm/v3d: Enable BO mapping drm/v3d: Create a CPU job extension for a indirect CSD job drm/v3d: Create a CPU job extension for the timestamp query job drm/v3d: Create a CPU job extension for the reset timestamp job drm/v3d: Create a CPU job extension to copy timestamp query to a buffer drm/v3d: Create a CPU job extension for the reset performance query job drm/v3d: Create a CPU job extension for the copy performance query job drm/v3d: Fix support for register debugging on the RPi 4 drm/v3d: Free the job and assign it to NULL if initialization fails Melissa Wen (27) drm/v3d: Remove unused function header drm/v3d: Move wait BO ioctl to the v3d_bo file drm/v3d: Detach job submissions IOCTLs to a new specific file drm/v3d: Simplify job refcount handling drm/v3d: Add a CPU job submission drm/v3d: Detach the CSD job BO setup drm/drm_mode_object: increase max objects to accommodate new color props drm/drm_property: make replace_property_blob_from_id a DRM helper drm/drm_plane: track color mgmt changes per plane drm/amd/display: add driver-specific property for plane degamma LUT drm/amd/display: explicitly define EOTF and inverse EOTF drm/amd/display: document AMDGPU pre-defined transfer functions drm/amd/display: add plane 3D LUT driver-specific 
properties drm/amd/display: add plane shaper LUT and TF driver-specific properties drm/amd/display: add CRTC gamma TF driver-specific property drm/amd/display: add comments to describe DM crtc color mgmt behavior drm/amd/display: encapsulate atomic regamma operation drm/amd/display: decouple steps for mapping CRTC degamma to DC plane drm/amd/display: reject atomic commit if setting both plane and CRTC degamma drm/amd/display: add plane shaper LUT support drm/amd/display: add plane shaper TF support drm/amd/display: add plane 3D LUT support drm/amd/display: fix documentation for dm_crtc_additional_color_mgmt() drm/amd/display: fix bandwidth validation failure on DCN 2.1 drm/amd/display: cleanup inconsistent indenting in amdgpu_dm_color drm/amd/display: fix null-pointer dereference on edid reading drm/amd/display: check dc_link before dereferencing Nia Espera (4) dt-bindings: iio: adc: qcom: Add Qualcomm smb139x arm64: dts: qcom: sm8350: Fix DMA0 address arm64: dts: qcom: pm8350k: Remove hanging whitespace arm64: dts: qcom: sm8350: Fix remoteproc interrupt type Signed-off-by (88): André Almeida (4) drm: Refuse to async flip with atomic prop changes drm: allow DRM_MODE_PAGE_FLIP_ASYNC for atomic commits drm: introduce DRM_CAP_ATOMIC_ASYNC_PAGE_FLIP drm/amd: Document device reset methods Danylo Piliaiev (2) drm/msm/a6xx: Add missing BIT(7) to REG_A6XX_UCHE_CLIENT_PF drm/msm/a690: Fix reg values for a690 Guilherme G. 
Piccoli (1) PCI: Only override AMD USB controller if required Iago Toral Quiroga (4) drm/v3d: update UAPI to match user-space for V3D 7.x drm/v3d: fix up register addresses for V3D 7.x dt-bindings: gpu: v3d: Add BCM2712’s compatible drm/v3d: add brcm,2712-v3d as a compatible V3D device Jose Maria Casanova Crespo (2) drm/v3d: Implement show_fdinfo() callback for GPU usage stats drm/v3d: Expose the total GPU usage stats on sysfs Maíra Canal (28) drm/v3d: wait for all jobs to finish before unregistering drm/v3d: update UAPI to match user-space for V3D 7.x drm/v3d: fix up register addresses for V3D 7.x dt-bindings: gpu: v3d: Add BCM2712’s compatible drm/v3d: add brcm,2712-v3d as a compatible V3D device drm/v3d: Implement show_fdinfo() callback for GPU usage stats drm/v3d: Expose the total GPU usage stats on sysfs MAINTAINERS: Drop Emma Anholt from all M lines. MAINTAINERS: Add Maira to V3D maintainers drm/v3d: Remove unused function header drm/v3d: Move wait BO ioctl to the v3d_bo file drm/v3d: Detach job submissions IOCTLs to a new specific file drm/v3d: Simplify job refcount handling drm/v3d: Don’t allow two multisync extensions in the same job drm/v3d: Decouple job allocation from job initiation drm/v3d: Add a CPU job submission drm/v3d: Use v3d_get_extensions() to parse CPU job data drm/v3d: Create tracepoints to track the CPU job drm/v3d: Detach the CSD job BO setup drm/v3d: Enable BO mapping drm/v3d: Create a CPU job extension for a indirect CSD job drm/v3d: Create a CPU job extension for the timestamp query job drm/v3d: Create a CPU job extension for the reset timestamp job drm/v3d: Create a CPU job extension to copy timestamp query to a buffer drm/v3d: Create a CPU job extension for the reset performance query job drm/v3d: Create a CPU job extension for the copy performance query job drm/v3d: Fix support for register debugging on the RPi 4 drm/v3d: Free the job and assign it to NULL if initialization fails Melissa Wen (43) drm/v3d: Remove unused function header 
drm/v3d: Move wait BO ioctl to the v3d_bo file drm/v3d: Detach job submissions IOCTLs to a new specific file drm/v3d: Simplify job refcount handling drm/v3d: Add a CPU job submission drm/v3d: Detach the CSD job BO setup drm/v3d: Create a CPU job extension for a indirect CSD job drm/drm_mode_object: increase max objects to accommodate new color props drm/drm_property: make replace_property_blob_from_id a DRM helper drm/drm_plane: track color mgmt changes per plane drm/amd/display: add driver-specific property for plane degamma LUT drm/amd/display: add plane degamma TF driver-specific property drm/amd/display: explicitly define EOTF and inverse EOTF drm/amd/display: document AMDGPU pre-defined transfer functions drm/amd/display: add plane HDR multiplier driver-specific property drm/amd/display: add plane 3D LUT driver-specific properties drm/amd/display: add plane shaper LUT and TF driver-specific properties drm/amd/display: add plane blend LUT and TF driver-specific properties drm/amd/display: add CRTC gamma TF driver-specific property drm/amd/display: add comments to describe DM crtc color mgmt behavior drm/amd/display: encapsulate atomic regamma operation drm/amd/display: add CRTC gamma TF support drm/amd/display: set sdr_ref_white_level to 80 for out_transfer_func drm/amd/display: mark plane as needing reset if color props change drm/amd/display: decouple steps for mapping CRTC degamma to DC plane drm/amd/display: add plane degamma TF and LUT support drm/amd/display: reject atomic commit if setting both plane and CRTC degamma drm/amd/display: add dc_fixpt_from_s3132 helper drm/amd/display: add HDR multiplier support drm/amd/display: add plane shaper LUT support drm/amd/display: add plane shaper TF support drm/amd/display: add plane 3D LUT support drm/amd/display: handle empty LUTs in __set_input_tf drm/amd/display: add plane blend LUT and TF support drm/amd/display: allow newer DC hardware to use degamma ROM for PQ/HLG drm/amd/display: copy 3D LUT settings from 
crtc state to stream_update drm/amd/display: add plane CTM driver-specific property drm/amd/display: add plane CTM support drm/amd/display: fix documentation for dm_crtc_additional_color_mgmt() drm/amd/display: fix bandwidth validation failure on DCN 2.1 drm/amd/display: cleanup inconsistent indenting in amdgpu_dm_color drm/amd/display: fix null-pointer dereference on edid reading drm/amd/display: check dc_link before dereferencing Nia Espera (4) dt-bindings: iio: adc: qcom: Add Qualcomm smb139x arm64: dts: qcom: sm8350: Fix DMA0 address arm64: dts: qcom: pm8350k: Remove hanging whitespace arm64: dts: qcom: sm8350: Fix remoteproc interrupt type Acked-by (4): Jose Maria Casanova Crespo (2) drm/v3d: Implement show_fdinfo() callback for GPU usage stats drm/v3d: Expose the total GPU usage stats on sysfs Maíra Canal (1) MAINTAINERS: Drop Emma Anholt from all M lines. Melissa Wen (1) MAINTAINERS: Add Maira to V3D maintainers Reviewed-by (30): André Almeida (1) drm: introduce DRM_CAP_ATOMIC_ASYNC_PAGE_FLIP Christian Gmeiner (1) drm/etnaviv: Convert to platform remove callback returning void Iago Toral Quiroga (20) drm/v3d: wait for all jobs to finish before unregistering drm/v3d: Remove unused function header drm/v3d: Move wait BO ioctl to the v3d_bo file drm/v3d: Detach job submissions IOCTLs to a new specific file drm/v3d: Simplify job refcount handling drm/v3d: Don’t allow two multisync extensions in the same job drm/v3d: Decouple job allocation from job initiation drm/v3d: Add a CPU job submission drm/v3d: Use v3d_get_extensions() to parse CPU job data drm/v3d: Create tracepoints to track the CPU job drm/v3d: Detach the CSD job BO setup drm/v3d: Enable BO mapping drm/v3d: Create a CPU job extension for a indirect CSD job drm/v3d: Create a CPU job extension for the timestamp query job drm/v3d: Create a CPU job extension for the reset timestamp job drm/v3d: Create a CPU job extension to copy timestamp query to a buffer drm/v3d: Create a CPU job extension for the reset 
performance query job drm/v3d: Create a CPU job extension for the copy performance query job drm/v3d: Fix support for register debugging on the RPi 4 drm/v3d: Free the job and assign it to NULL if initialization fails Maíra Canal (4) drm/v3d: update UAPI to match user-space for V3D 7.x drm/v3d: fix up register addresses for V3D 7.x dt-bindings: gpu: v3d: Add BCM2712’s compatible drm/v3d: add brcm,2712-v3d as a compatible V3D device Melissa Wen (4) drm/v3d: Implement show_fdinfo() callback for GPU usage stats drm/v3d: Expose the total GPU usage stats on sysfs drm/v3d: Fix missing error code in v3d_submit_cpu_ioctl() drm/amd/display: fix documentation for amdgpu_dm_verify_lut3d_size() Tested-by (1): Guilherme G. Piccoli (1) pstore/ram: Fix crash when setting number of cpus to an odd number
  • Hari Rana: Coming Out as Trans (2024/03/31 00:00)
Vocabularies Before I delve into my personal experience, allow me to define several key terms: Sex: Biological characteristics of males and females. Gender: Social characteristics of men and women, such as norms, roles, and behaviors. Gender identity: How you personally view your own gender. Gender dysphoria: Sense of unease due to a mismatch between gender identity and sex assigned at birth. Transgender (trans): When one’s gender identity differs from the sex assigned at birth. If someone’s gender identity is woman but their sex assigned at birth is male, then they are generally considered a trans person. Cisgender (cis): The opposite of transgender; when the gender identity fits with the sex assigned at birth. Non-binary: Anything that is not exclusively male or female. Imagine if male and female were binary numbers: male is 0 and female is 1. Anything that is not 0 or 1 is considered non-binary. If I see myself as the number 0.5 or 2, then I’m non-binary. Someone who considers themself to be between a man and woman would be between 0 and 1 (e.g. 0.5). Agender: Under the umbrella of non-binary; it essentially means non-gendered (lack of gender identity) or gender neutral. Whichever definition applies varies from person to person. It’s also worth noting that many agender people don’t consider themselves trans. Label: Portraying which group you belong to, such as “non-binary”, “transfemme” (trans (binary and non-binary) people who are feminine), etc. Backstory Allow me to share a little backstory. I come from a neighborhood where being anti-LGBTQ+ was considered “normal” a decade ago. This outlook was quite common in the schools I attended, but I wouldn’t be surprised if a considerable portion of the people around here are still anti-LGBTQ+ today. Many individuals, including former friends and teachers, have expressed their opposition to LGBTQ+ in the past, which influenced my own view against the LGBTQ+ community at the time.
Due to my previous experiences and the environment I live(d) in, I tried really hard to avoid thinking about my sexuality and gender identity for almost a decade. Every time I thought about my sexuality and gender identity, I’d do whatever I could to distract myself. I kept forcing myself to be as masculine as possible. However, since we humans have our limits, I eventually reached a limit to the number of thoughts I could suppress. I always struggled with communicating and almost always felt lonely whenever I was around the majority of people, so I pretended to be “normal” and hid my true feelings. About 5 years ago, I began to spend most of my time online. I met people who are just like me, many of whom I’m still friends with 3-4 years later. At the time, despite my strong biases against LGBTQ+ from my surroundings, I naturally felt more comfortable within the community, far more than I did outside. I was able to express myself more freely and have people actually understand me. It was the only time I didn’t feel the need to act masculine. However, despite all this, I was still in the mindset of suppressing my feelings. Truly an egg irl moment Eventually, I was unable to hold my thoughts anymore, and everything exploded. All I could think about for a few months was my gender identity: my biases from my childhood environment often clashed with me questioning my own identity, and whether I really saw myself as a man. I just had these recurring thoughts and a lot of anxiety about where I was getting these thoughts from, and why. Since then, my work performance got exponentially worse by the week. I quickly lost interest in my hobbies, and began to distance myself from communities and friends. I often lashed out at people because my mental health was getting worse. My sleep quality was also getting worse, which only worsened the situation. On top of that, I still had to hide my feelings, which continued to exhaust me.
All I could think about for months was my gender identity. After I slowly became comfortable with and accepting of my gender identity, I started having suicidal thoughts on a daily basis, which I was able to endure… until I reached a breaking point once again. I was having suicidal thoughts on a bi-hourly basis. It escalated to hourly, and finally almost 24/7. I obviously couldn’t work anymore, nor could I do my hobbies. I needed to hide my pain because of my social anxiety. I didn’t have the courage to call the suicide hotline either. What happened was that I talked to many people, some of whom encouraged and even helped me seek professional help. However, that was all in the past. I feel much better and more comfortable with myself and the people I opened up to, and now I’m confident enough to share it publicly 😊 Coming Out‎ ‎🏳️‍⚧️ I identify as agender. My pronouns are any/all — I’ll accept any pronouns. I don’t think I have a preference, so feel free to call me whatever you want; whatever you think fits me best :) I’m happy with agender because I feel disconnected from my own masculinity. I don’t think I belong at either end of the spectrum (or even in between), so I’m pretty happy that there is something that best describes me. Why the Need to Come Out Publicly? So… why come out publicly? Why am I making a big deal out of this? Simply put, I am really proud and relieved to have discovered myself. For so long, I tried to suppress my thoughts and force myself to be someone I fundamentally was not. While that never worked, I explored myself instead and discovered that I’m trans. However, I also wrote this article to explain how much living in a transphobic environment affected me, even before I discovered myself. For me, displaying my gender identity is like displaying a username or profile picture. We choose a username and profile picture when possible to give a glimpse of who we are.
I chose “TheEvilSkeleton” as my username because I used to play Minecraft regularly when I was 10 years old. While I don’t play Minecraft anymore, it helped me discover my passion: creating and improving things and working together — that’s why I’m a programmer and contribute to software. I chose Chrome-chan as my profile picture because I think she is cute and I like cute things :3. I highly value my username and profile picture, the same way I now value my gender identity.

Am I Doing Better?

While I’m doing much better than before, I did go through a depressive episode that I’m still recovering from at the time of writing, and I’m still processing the discovery because of my childhood environment. Even so, I certainly feel much better after discovering myself and coming out. However, coming out won’t magically heal the trauma from my childhood environment. It won’t make everyone around me accept who I am, or even make them feel comfortable around me. It won’t drop the amount of harassment I receive online to zero — if anything, I write this with the expectation that I will be harassed and discriminated against more than ever. There will be new challenges to face, I still have to deal with the existing trauma, and I may have to deal with new trauma in the future. The best thing I can do is train myself to be mentally resilient. I certainly feel much better coming out, but I’m still worried about the future. I sometimes wish I wasn’t trans, because I’m genuinely terrified about the things people have gone through in the past, and are still going through right now. I know I’m going to have to fight for my life now that I’ve come out publicly, because apparently the right to live as yourself is still controversial in 2024.

Seeking Help

Of course, I wasn’t alone in my journey. What helped me get through it was talking to my friends and seeking help in other places. I came out to several of my friends in private.
They were supportive and listened to me vent; they reassured me that there’s nothing wrong with me, and congratulated me for discovering myself and coming out. Some of my friends encouraged and helped me seek professional help at local clinics for my depression. I have gained more confidence in myself; I am now capable of calling clinics by myself, even when I’m nervous. If these suicidal thoughts escalate again, I will finally have the courage to call the suicide hotline. If you’re feeling anxious about something, don’t hesitate to talk to your friends about it. Unless you know that they’ll take it the wrong way and/or are currently dealing with personal issues, they will be more than happy to help. I have messaged so many people in private and felt much better after talking. I’ve never felt so comforted by friends who try their best to be there for me. Some friends have listened without saying anything, while others have shared their experiences with me. Both were extremely valuable to me, because sometimes I just want (and need) to be heard and understood. If you’re currently trying to suppress your thoughts and force yourself into the gender you were assigned at birth, like I was, the best advice I can give you is to give yourself time to explore yourself. It’s perfectly fine to acknowledge that you’re not cisgender (that is, if you’re not). You might want to ask your trans friends to help you explore yourself. From experience, it’s not worth forcing yourself to be someone you’re not.

Closing Thoughts

I feel relieved about coming out, but to be honest, I’m still really worried about the future of my mental health. I really hope that everything will work out and that I’ll be more mentally resilient. I’m really happy that I had the courage to take the first steps, to go to clinics, to talk to people, to open up publicly. It’s been really difficult for me to write and publish this article.
I’m really grateful to have wonderful friends, and legitimately, I couldn’t ask for better friends.
  • Christian Schaller: Fedora Workstation 40 – what are we working on (2024/03/28 18:56)
So Fedora Workstation 40 Beta has just come out, so I thought I’d share a bit about some of the things we are currently working on for Fedora Workstation, and also major changes coming in from the community.

Flatpak

Flatpaks have been a key part of our strategy for desktop applications for a while now, and we are working on a multitude of things to make Flatpaks an even stronger technology going forward. Christian Hergert is working on figuring out how applications that require system daemons will work with Flatpaks, using his own Sysprof project as the proof-of-concept application. The general idea here is to rely on the work that has happened in systemd around sysext/confext/portablectl, trying to figure out how we can get a system service installed from a Flatpak and the necessary bits wired up properly. The other part of this work, figuring out how to give applications permissions that today are handled with udev rules, is being worked on by Hubert Figuière based on earlier work by Georges Stavracas on behalf of the GNOME Foundation, thanks to sponsorship from the Sovereign Tech Fund. So hopefully we will get both of these important issues resolved soon. Kalev Lember is working on polishing up the Flatpak support in Foreman (and Satellite) to ensure there are good tools for managing Flatpaks when you have a fleet of systems to manage, building on the work of Stephan Bergman. Finally, Jan Horak and Jan Grulich are working hard on polishing up the experience of using Firefox from a fully sandboxed Flatpak. This work is mainly about working with the upstream community to get some needed portals over the finish line and polishing up some UI issues in Firefox, like this one.

Toolbx

Toolbx, our project for handling developer containers, is picking up pace, with Debarshi Ray currently working on getting full NVIDIA binary driver support for the containers.
One of our main goals for Toolbx at the moment is making it a great tool for AI development, and thus getting the NVIDIA & CUDA support squared away is critical. Debarshi has also spent quite a lot of time cleaning up the Toolbx website, providing easier access to the documentation there and updating it. We are also moving to the new Ptyxis (formerly Prompt) terminal application created by Christian Hergert in Fedora Workstation 40. This gives us a great GTK4 terminal, and we also believe we will be able to further integrate Toolbx and Ptyxis going forward, creating an even better user experience.

Nova

As you probably know, we have been the core maintainers of the Nouveau project for years, keeping this open source upstream NVIDIA GPU driver alive. We plan to keep doing that, but the opportunities offered by the availability of the new GSP firmware for NVIDIA hardware mean we should now be able to offer a full-featured and performant driver. However, co-hosting both the old and the new way of doing things in the same upstream kernel driver has turned out to be counterproductive, so we are now looking to split the driver in two. For older pre-GSP NVIDIA hardware we will keep the old Nouveau driver around as-is. For GSP-based hardware we are launching a new driver called Nova. It is important to note that Nova is thus not a competitor to Nouveau, but a continuation of it. The idea is that the new driver will be primarily written in Rust, based on work already done in the community. We are also evaluating whether some of the existing Nouveau code should be copied into the new driver, since we already spent quite a bit of time trying to integrate GSP there. Worst case, if we can’t reuse code, we use the lessons learned from Nouveau with GSP to implement the support in Nova more quickly. Contributing to this effort from our team at Red Hat are Danilo Krummrich, Dave Airlie, Lyude Paul, Abdiel Janulgue and Phillip Stanner.
Explicit Sync and VRR

Another exciting development that has been a priority for us is explicit sync, which is critical especially for the NVIDIA driver, but which might also provide performance improvements for other GPU architectures going forward. So a big thank you to Michel Dänzer, Olivier Fourdan, Carlos Garnacho, the NVIDIA folks, Simon Ser and the rest of the community for working on this. This work has just finished upstream, so we will look at backporting it into Fedora Workstation 40. Another major Fedora Workstation 40 feature is experimental support for Variable Refresh Rate (VRR) in GNOME Shell. The feature was mostly developed by community member Dor Askayo, but Jonas Ådahl, Michel Dänzer, Carlos Garnacho and Sebastian Wick have all contributed code reviews and fixes. In Fedora Workstation 40 you need to enable it using the command:

gsettings set org.gnome.mutter experimental-features "['variable-refresh-rate']"

PipeWire

I already covered PipeWire in my post a week ago, but to quickly summarize here too: using PipeWire for video handling is now finally getting to the stage where it is actually happening. Both Firefox and OBS Studio now come with PipeWire support, and hopefully we can also get Chromium and Chrome to start taking a serious look at merging the patches for this soon. What’s more, Wim spent time fixing FireWire FFADO bugs, so hopefully for our pro-audio community users this makes their FireWire equipment fully usable and performant with PipeWire. Wim did point out when I spoke to him, though, that the FFADO drivers had obviously never had any consumer other than JACK, so when he tried to allow for more functionality the drivers quickly broke down. Wim has therefore limited the feature set of the PipeWire FFADO module to be an exact match of how these drivers were being used by JACK. If the upstream kernel maintainer is able to fix the issues found by Wim, then we could look at providing a fuller feature set.
In Fedora Workstation 40 the de-duplication support for V4L vs libcamera devices should work as soon as we update WirePlumber to the new 0.5 release. To hear more about PipeWire and the latest developments, be sure to check out this interview with Wim Taymans by the good folks over at Destination Linux.

Remote Desktop

Another major feature landing in Fedora Workstation 40, which Jonas Ådahl and Ray Strode have spent a lot of effort on, is finalizing the remote desktop support for GNOME on Wayland. There has been support for remote connections to already-logged-in sessions for a while, but with these updates you can do the login remotely too, so the session does not need to already be started on the remote machine. This work will also enable 3rd-party solutions to do remote logins on Wayland systems, so while I am not at liberty to mention names, be on the lookout for more 3rd-party Wayland remoting software becoming available this year. This work is also important to help Anaconda with its Wayland transition, as remote graphical install is an important feature there. So what you should see is Anaconda using GNOME Kiosk mode and the GNOME remote support to handle this going forward, thus enabling a Wayland-native Anaconda.

HDR

Another feature we have been working on for a long time is HDR, or High Dynamic Range. We wanted to do it properly, and we also needed to work with a wide range of partners in the industry to make this happen. So over the last year we have been contributing to improving various standards around color handling and acceleration to prepare the ground, and working on and contributing to key libraries needed, for instance, to gather the required information from GPUs and screens.
Things are coming together now, and Jonas Ådahl and Sebastian Wick are going to focus on making Mutter HDR-capable. Once that work is done we are by no means finished, but it should put us close to at least being able to run some simple use cases (like some fullscreen applications) while we work out the finer points, for instance getting great support for running SDR and HDR applications side by side.

PyTorch

We want to make Fedora Workstation a great place to do AI development and testing. The first step in that effort is packaging up PyTorch and making sure it can have working hardware acceleration out of the box. Tom Rix has been leading that effort on our end, and you will see the first fruits of that labor in Fedora Workstation 40, where PyTorch should work with GPU acceleration on AMD hardware (ROCm) out of the box. We hope and expect to be able to provide the same for NVIDIA and Intel graphics eventually too, but this is definitely a step-by-step effort.
  • Tomeu Vizoso: Rockchip NPU update 2: MobileNetV1 is done (2024/03/28 07:47)
Progress

For the last couple of weeks I have kept chipping away at a new userspace driver for the NPU in the Rockchip RK3588 SoC. I am very happy to report that the work has gone really smoothly and I reached my first milestone: running the MobileNetV1 model with all convolutions accelerated by the NPU. And it not only runs flawlessly, but at the same performance level as the blob. It has been great having access to the register list as disclosed by Rockchip in their TRM, and to the NVDLA and ONNC documentation and source code. This has allowed the work to proceed at a pace several times faster than with my previous driver for the VeriSilicon NPU, for which a lot of painstaking reverse engineering had to be done.

(Photo: Julien Langlois, CC BY-SA 3.0)

tomeu@arm-64:~/mesa$ TEFLON_DEBUG=verbose python3.10 -i hens.jpg -m mobilenet_v1_1.0_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so
Loading external delegate from with args: {}
Teflon delegate: loaded rknpu driver
teflon: compiling graph: 89 tensors 27 operations...
teflon: compiled graph, took 413 ms
teflon: invoked graph, took 11 ms
teflon: invoked graph, took 11 ms
teflon: invoked graph, took 11 ms
teflon: invoked graph, took 10 ms
teflon: invoked graph, took 10 ms
0.984314: hen
0.019608: cock
0.000000: toilet tissue
0.000000: sea cucumber
0.000000: wood rabbit
time: 10.776ms

Notice how nothing in the invocation refers to the specific driver that TensorFlow Lite is using; that is completely abstracted by Mesa. Once all these bits are upstream and packaged by distros, one will be able to just download a model in INT8 quantization format and get accelerated inference going fast, irrespective of the hardware. Thanks to TL Lim of PINE64 for sending me a QuartzPro64 board to hack on.
Next steps

I want to go back and get my last work on performance for the VeriSilicon driver upstreamed, so it is packaged in distros sooner rather than later. After that, I'm a bit torn between working further on the userspace driver and implementing more operations and control flow, or starting to write a kernel driver for mainline.
  • Simon Ser: Status update, March 2024 (2024/03/17 22:00)
Hi! It’s this time of the month once again, it seems… We’ve finally released Sway 1.9! Note that it uses the new wlroots rendering API, but doesn’t use the scene-graph API: we’ve left that for 1.10. We’ve also released wlroots 0.17.2 with a whole bunch of bug fixes. Special thanks to Simon Zeni for doing the backporting work! In other Wayland news, the wlroots merge request to atomically apply changes to multiple outputs has been merged! In addition, another merge request to help compositors allocate the right kind of buffers during modesets has been merged. These two combined should help more multi-output setups on Intel GPUs light up correctly, where previously a workaround (WLR_DRM_NO_MODIFIERS=1) was required. Thanks to Kenny for helping with that work! I also got around to writing a Sway patch to gracefully handle GPU resets. This should be good news for users of a particular GPU vendor which tends to be a bit trigger-happy with resets! Sway will now survive and continue running instead of freezing. Note, clients may still glitch, need a nudge to redraw, or freeze. A few wlroots patches were also required to get this to work. With the help of Jean Thomas, Goguma (and pushgarden) has gained support for the Apple Push Notification service (APNs). This means that Goguma iOS users can now enjoy instantaneous notifications! This is also important to prove that it’s possible to design a standard (as an IRC extension) which doesn’t hardcode any proprietary platform (and thus doesn’t force each IRC server to have one codepath per platform), but still interoperates with these proprietary platforms (important for usability) and ensures that said proprietary platforms have minimal access to sensitive data (via end-to-end encryption between the IRC server and the IRC client). It’s now also possible to share links and files to Goguma. That is, when using another app (e.g.
the gallery, your favorite fediverse client, and many others) and opening the share menu, Goguma will show up as an option. It will then ask which conversation to share the content with, and automatically upload any shared file. No NPotM this time around sadly. To make up for it, I’ve implemented refresh tokens in sinwon, and made most of the remaining tests pass in go-mls. See you next month!
  • Tomeu Vizoso: Rockchip NPU update 1: A walk in the park? (2024/03/16 11:46)
During the past weeks I have paused work on the driver for the Vivante NPU and have started work on a new driver, for Rockchip's own NPU IP, as used in SoCs such as the RK3588(S) and RK3568. The version of the NPU in the RK3588 claims a performance of 6 TOPS across its 3 cores, though from what I have read, people are having trouble making use of more than one core in parallel with the closed-source driver.

A nice walk in the park

Rockchip, like most other vendors of NPU IP, provides a GPLed kernel driver and pushes out their userspace driver in binary form. The kernel driver is pleasantly simple and relatively up-to-date in its use of internal kernel APIs. The userspace stack, though, is notoriously buggy and difficult to use, with basic features still unimplemented and performance quite below what the hardware should be able to achieve. To be clear, this is on top of the usual problems related to closed-source drivers. I get the impression that Rockchip's NPU team is really understaffed. Other people had already looked at reverse-engineering the HW so they could address the limitations and bugs in the closed-source driver, and use it in situations not supported by Rockchip.
I used information acquired by Pierre-Hugues Husson and Jasbir Matharu to get started, a big thanks to them! After the initial environment was set up (I had to forward-port their kernel driver to v6.8), I wrote a simple library that can be loaded into the process with LD_PRELOAD and that, by overriding ioctl and other syscalls, lets me dump the buffers that the proprietary userspace driver sends to the hardware. I started looking at a buffer that, according to the debug logs of the proprietary driver, contained register writes, and when looking at the register descriptions in the TRM, I saw that the hardware had to be closely based on NVIDIA's NVDLA open-source NPU IP. With Rockchip's (terse) description of the registers, plus NVDLA's documentation and source code for both the hardware and the userspace driver, I have been able to make progress several times faster than when working on VeriSilicon's driver (for which I had zero documentation). Right now I am at the stage where I am able to correctly execute TensorFlow Lite's Conv2D and DepthwiseConv2D operations with different combinations of input dimensions, weight dimensions, strides and padding.
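For readers unfamiliar with the interception trick mentioned above, here is a minimal sketch of what such an LD_PRELOAD shim can look like. This is not the actual library used for this work, just an illustration of the technique: the shim interposes on ioctl, and the logging line stands in for the real job of decoding and dumping the submission buffers.

```c
// Minimal LD_PRELOAD shim sketch: interpose on ioctl(), log each call,
// then forward to the real libc implementation via dlsym(RTLD_NEXT).
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdarg.h>
#include <stdio.h>

typedef int (*ioctl_fn)(int, unsigned long, ...);

int ioctl(int fd, unsigned long request, ...)
{
    static ioctl_fn real_ioctl;
    if (!real_ioctl)
        real_ioctl = (ioctl_fn)dlsym(RTLD_NEXT, "ioctl");

    // ioctl's third argument is optional; fetch it as a pointer, which
    // covers the common case of a struct passed to the driver.
    va_list ap;
    va_start(ap, request);
    void *arg = va_arg(ap, void *);
    va_end(ap);

    // A real dumper would decode `request` here and write the
    // pointed-to submission buffers to disk for later analysis.
    fprintf(stderr, "ioctl(fd=%d, request=0x%lx, arg=%p)\n",
            fd, request, arg);

    return real_ioctl(fd, request, arg);
}
```

Built with something like `gcc -shared -fPIC -o shim.so shim.c`, it would then be loaded into the proprietary stack with `LD_PRELOAD=./shim.so <application>`, without modifying the target binary at all.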
Next is to support multiple output channels. I'm currently using Rockchip's kernel, but as soon as I'm able to run object detection models with decent hardware utilization, I plan to start writing a new kernel driver for mainlining. Rockchip's kernel driver has gems such as passing addresses in the kernel address space across the UAPI... Tests run fast and reliably, even with high concurrency:

tomeu@arm-64:~/mesa$ TEFLON_TEST_DELEGATE=~/mesa/build/src/gallium/targets/teflon/ TEFLON_TEST_DATA=src/gallium/targets/teflon/tests LD_LIBRARY_PATH=/home/tomeu/tflite-vx-delegate/build/_deps/tensorflow-build/ ~/.cargo/bin/gtest-runner run --gtest /home/tomeu/mesa/build/src/gallium/targets/teflon/test_teflon --output /tmp -j8 --tests-per-group 1 --baseline ~/mesa/src/gallium/drivers/rocket/ci/rocket-rk3588-fails.txt --flakes ~/mesa/src/gallium/drivers/rocket/ci/rocket-rk3588-flakes.txt --skips ~/mesa/src/gallium/drivers/rocket/ci/rocket-rk3588-skips.txt
Running gtest on 8 threads in 1-test groups
Pass: 0, Duration: 0
Pass: 139, Skip: 14, Duration: 2, Remaining: 2
Pass: 277, Skip: 22, Duration: 4, Remaining: 0
Pass: 316, Skip: 24, Duration: 4, Remaining: 0

You can find the source code in this branch.
  • Christian Schaller: PipeWire camera handling is now happening! (2024/03/15 16:30)
We hit a major milestone this week with the long-worked-on adoption of PipeWire camera support finally starting to land! Not long ago Firefox was released with experimental PipeWire camera support thanks to the great work by Jan Grulich. Then this week OBS Studio shipped with PipeWire camera support thanks to the great work of Georges Stavracas, who cleaned up the patches and pushed to get them merged based on earlier work by himself, Wim Taymans and Columbarius. This means we now have two major applications out there that can use PipeWire for camera handling, and thus two applications whose video streams can be interacted with through patchbay applications like Helvum and qpwgraph. These applications are important and central enough that having them use PipeWire is in itself useful, but they will now also provide two examples of how to do it for application developers looking at adding PipeWire camera support to their own applications; there is no better documentation than working code. The PipeWire support is also paired with camera portal support. The use of the portal also means we are getting closer to being able to fully sandbox media applications in Flatpaks, which is an important goal in itself. Which reminds me, to test out the new PipeWire support be sure to grab the official OBS Studio Flatpak from Flathub.

PipeWire camera handling with OBS Studio, Firefox and Helvum.

Let me explain what is going on in the screenshot above, as it is a lot. First of all you see Helvum there on the right showing all the connections made through PipeWire, both the audio and, in yellow, the video. So you can see how my Logitech BRIO camera is feeding a camera video stream into both OBS Studio and Firefox. You also see my Magewell HDMI capture card feeding a video stream into OBS Studio, and finally gnome-shell providing a screen capture feed that is being fed into OBS Studio.
On the left, at the top, you see Firefox running their WebRTC test app capturing my video, and just below that you see the OBS Studio window, with the direct camera feed in the top-left corner, the screencast of Firefox just below it, and finally the ‘no signal’ image from my HDMI capture card, since I had no HDMI device connected to it while testing. For those wondering, work is also underway to bring this into the Chromium and Google Chrome browsers, where Michael Olbrich from Pengutronix has been pushing to get patches written and merged. He did a talk about this work at FOSDEM last year, as you can see from these slides, with this patch being the last step to get this working there too. The move to PipeWire also prepared us for the new generation of MIPI cameras being rolled out in new laptops, and helps push work on supporting those cameras towards libcamera, the new library for dealing with the new generation of complex cameras. This of course ties in well with the work that Hans de Goede and Kate Hsuan have been doing recently, along with Bryan O’Donoghue from Linaro, on providing an open source driver for MIPI cameras, and of course the incredible work by Laurent Pinchart and Kieran Bingham from Ideas on Board on libcamera itself. The PipeWire support is of course fresh, and I am sure we will find bugs and corner cases that need fixing as more people test out the functionality in both Firefox and OBS Studio, and there are some interface annoyances we are working to resolve. For instance, since PipeWire supports both V4L and libcamera as a backend, you do at the moment get double entries in your selection dialogs for most of your cameras. WirePlumber has implemented de-duplication code which will ensure that only the libcamera listing shows for cameras supported by both V4L and libcamera, but this is only part of the development version of WirePlumber and will land in Fedora Workstation 40, so until that is out you will have to deal with the duplicate options.
Camera selection dialog.

We are also trying to figure out how to better deal with the infrared cameras that are part of many modern webcams. Obviously you usually do not want to use an IR camera for your video calls, so we need to figure out the best way to identify them and ensure they are clearly marked and not used by default. Another good recent PipeWire tidbit: with the PipeWire 1.0.4 release, PipeWire maintainer Wim Taymans also fixed up the FireWire FFADO support. The FFADO support had been in there for some time, but after seeing Venn Stone do some thorough tests and find issues, we decided it was time to bite the bullet and buy some second-hand FireWire hardware for Wim to be able to test and verify himself.

Focusrite FireWire device.

Once the Focusrite device I bought landed at Wim's house, he got to work, cleaned up the FFADO support and made it both work and perform well. For those unaware, FFADO is a way to use FireWire devices without going through ALSA, and it is popular among pro-audio folks because it gives lower latencies. FireWire is of course a relatively old technology at this point, but the audio equipment is still great and many audio engineers have a lot of these devices, so with this fixed you can plop a FireWire PCI card into your PC and suddenly all those old FireWire devices get a new lease on life on your Linux system. And you can buy these devices in places like eBay or Facebook Marketplace for a fraction of their original cost. In some sense this demonstrates the same strength of PipeWire as the libcamera support: in the libcamera case it gives Linux applications a way to smoothly transition to a new generation of hardware, and in the FireWire case it allows Linux applications to keep using older hardware with new applications.
So all in all it's been a great few weeks for PipeWire and for Linux audio AND video. If you are an application maintainer, be sure to look at how you can add PipeWire camera support to your application, and of course get that application packaged up as a Flatpak for people using Fedora Workstation and other distributions to consume.
  • Daniel Vetter: Upstream, Why & How (2024/03/14 00:00)
In a different epoch, before the pandemic, I did a presentation about upstream first at the Siemens Linux Community Event 2018, where I tried to explain the fundamentals of open source using microeconomics. Unfortunately that talk didn’t work out too well with an audience that isn’t well-versed in upstream and open source concepts, largely because it was just too much material crammed into too little time. Last year I got the opportunity to try again for an Intel-internal event series, and this time I split the material into two parts. I think that worked a lot better. For obvious reasons I cannot publish the recordings, but I can publish the slides. The first part, “Upstream, Why?”, covers a few concepts from microeconomics 101 and then applies them to upstream open source. The key concept is, on one hand, that open source achieves an efficient software market in the microeconomic sense by driving margins and prices to zero. The only ways to make money in such a market are to either have more-or-less unstable barriers to entry that prevent the efficient market from forming and destroying all monetary value, or to sell a complementary product. The second part, “Upstream, How?”, then looks at what this all means for the different stakeholders involved: individual engineers, who have skills and create a product with zero economic value, and might still be stupid enough to try to build a career on that; upstream communities, often with a formal structure as a foundation, and what exactly their goals should be to build a thriving upstream open source project that can actually pay some bills, generate some revenue somewhere else and get engineers paid, because without that you’re not going to have much of a project with a long-term future; and engineering organizations, what exactly their incentives and goals should be, and the fundamental conflicts of interest this causes.
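The zero-margin point above is the textbook perfect-competition result applied to software, where reproducing a copy costs essentially nothing. As a one-line restatement (my summary, not taken from the slides):

```latex
p = MC, \qquad MC_{\text{software copy}} \approx 0 \;\Longrightarrow\; p \to 0
```

That is, price is competed down to marginal cost, and since the marginal cost of another copy is near zero, prices and margins vanish unless a barrier to entry blocks the efficient market or revenue comes from a complementary product.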
Specifically on this I’ve only seen bad solutions and ugly solutions, but not yet a really good one. A relevant pre-pandemic talk of mine on this topic is “Upstream Graphics: Too Little, Too Late”. And finally the overall business and, more importantly, the kind of business strategy that is needed to really thrive with an open source upstream-first approach: you need to clearly understand which software market’s economic value you want to destroy by driving margins and prices to zero, and which complementary product you’re selling to still earn money. At least judging by the feedback I’ve received internally, taking more time and going a bit more in-depth on the various concepts worked much better than the keynote presentation I did at Siemens, hence I decided to publish at least the slides.
  • Peter Hutterer: Enforcing a touchscreen mapping in GNOME (2024/03/12 04:33)
Touchscreens are quite prevalent by now, but one of the not-so-hidden secrets is that they're actually two devices: the monitor and the actual touch input device. Surprisingly, users want the touch input device to work on the underlying monitor, which means your desktop environment needs to somehow figure out which of the monitors belongs to which touch input device. Often these two devices come from two different vendors, so mutter needs to use ... */me holds torch under face* .... HEURISTICS! :scary face: Those heuristics are actually quite simple: same vendor/product ID? same dimensions? is one of the monitors a built-in one? [1] But unfortunately in some cases those heuristics don't produce the correct result. In particular, external touchscreens seem to be getting more common again, and plugging one into a (non-touch) laptop means you usually get that external screen mapped to the internal display. Luckily mutter does have a configuration for it, though it is not exposed in the GNOME Settings (yet). But you, my $age $jedirank, can access this via a commandline interface to at least work around the immediate issue. But first: we need to know the monitor details, and you need to know about gsettings relocatable schemas. Finding the right monitor information is relatively trivial: look at $HOME/.config/monitors.xml and get your monitor's vendor, product and serial from there. e.g.
in my case this is:

<monitors version="2">
  <configuration>
    <logicalmonitor>
      <x>0</x>
      <y>0</y>
      <scale>1</scale>
      <monitor>
        <monitorspec>
          <connector>DP-2</connector>
          <vendor>DEL</vendor>             <--- this one
          <product>DELL S2722QC</product>  <--- this one
          <serial>59PKLD3</serial>         <--- and this one
        </monitorspec>
        <mode>
          <width>3840</width>
          <height>2160</height>
          <rate>59.997</rate>
        </mode>
      </monitor>
    </logicalmonitor>
    <logicalmonitor>
      <x>928</x>
      <y>2160</y>
      <scale>1</scale>
      <primary>yes</primary>
      <monitor>
        <monitorspec>
          <connector>eDP-1</connector>
          <vendor>IVO</vendor>
          <product>0x057d</product>
          <serial>0x00000000</serial>
        </monitorspec>
        <mode>
          <width>1920</width>
          <height>1080</height>
          <rate>60.010</rate>
        </mode>
      </monitor>
    </logicalmonitor>
  </configuration>
</monitors>

Well, so we know the monitor details we want. Note there are two monitors listed here; in this case I want to map the touchscreen to the external Dell monitor. Let's move on to gsettings. gsettings is of course the configuration storage wrapper GNOME uses (and the CLI tool with the same name). GSettings follow a specific schema, i.e. a description of a schema name and possible keys and values for each key. You can list all those, set them, look up the available values, etc.:

$ gsettings list-recursively
... lots of output ...
$ gsettings set org.gnome.desktop.peripherals.touchpad click-method 'areas'
$ gsettings range org.gnome.desktop.peripherals.touchpad click-method
enum
'default'
'none'
'areas'
'fingers'

Now, schemas work fine as-is as long as there is only one instance. Where the same schema is used for different devices (like touchscreens) we use a so-called "relocatable schema", and that requires also specifying a path - and this is where it gets tricky. I'm not aware of any functionality to get the specific path for a relocatable schema, so often it's down to reading the source. In the case of touchscreens, the path includes the USB vendor and product ID (in lowercase), e.g.
in my case the path is: /org/gnome/desktop/peripherals/touchscreens/04f3:2d4a/ In your case you can get the touchscreen details from lsusb, libinput record, /proc/bus/input/devices, etc. Once you have it, gsettings takes a schema:path argument like this:

  $ gsettings list-recursively org.gnome.desktop.peripherals.touchscreen:/org/gnome/desktop/peripherals/touchscreens/04f3:2d4a/
  org.gnome.desktop.peripherals.touchscreen output ['', '', '']

Looks like the touchscreen is bound to no monitor. Let's bind it with the data from above:

  $ gsettings set org.gnome.desktop.peripherals.touchscreen:/org/gnome/desktop/peripherals/touchscreens/04f3:2d4a/ output "['DEL', 'DELL S2722QC', '59PKLD3']"

Note the quotes so your shell doesn't misinterpret things. And that's it. Now I have my internal touchscreen mapped to my external monitor, which makes no sense at all but shows that you can map a touchscreen to any screen if you want to. [1] Probably the one that most commonly takes effect, since it's the vast vast majority of devices
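Putting the steps above together: a small Python sketch (the helper names are mine, not mutter's; it only mirrors the path and command formats shown above) that assembles the gsettings invocation from the monitors.xml fields and the USB IDs:

```python
import xml.etree.ElementTree as ET

def monitor_specs(xml_text):
    """Return (vendor, product, serial) for each monitor in a v2 monitors.xml."""
    root = ET.fromstring(xml_text)
    return [(s.findtext("vendor"), s.findtext("product"), s.findtext("serial"))
            for s in root.iter("monitorspec")]

def touchscreen_schema_path(usb_vendor, usb_product):
    """Relocatable-schema path: lowercase, zero-padded hex vendor:product."""
    return ("/org/gnome/desktop/peripherals/touchscreens/"
            f"{usb_vendor:04x}:{usb_product:04x}/")

def gsettings_command(monitor, usb_vendor, usb_product):
    """Build the 'gsettings set ... output ...' command line shown above."""
    vendor, product, serial = monitor
    path = touchscreen_schema_path(usb_vendor, usb_product)
    return (f"gsettings set org.gnome.desktop.peripherals.touchscreen:{path} "
            f"output \"['{vendor}', '{product}', '{serial}']\"")
```

With the example values from this post, gsettings_command(("DEL", "DELL S2722QC", "59PKLD3"), 0x04F3, 0x2D4A) reproduces the command above.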
  • Mike Blumenkrantz: Post Interfaces (2024/03/12 00:00)
    March. I’ve had a few things I was going to blog about over the past month, but then news sites picked them up and I lost motivation because there’s only so many hours in a day that anyone wants to spend reading things that aren’t specification texts. Yeah, that’s my life now. Anyway, a lot’s happened, and I’d try to enumerate it all but I’ve forgotten / lost track / don’t care. git log me if you’re interested. Some highlights: damage stuff is in RADV supports shader objects so zink can run Tomb Raider (2013) without stuttering NVK is about to hit GL conformance on all versions I’m working on too many projects to keep track of everything More on the last one later. Like in a couple months. When I won’t get vanned for talking about it. No, it’s not Half Life 3 / Portal 3 / L4D3. Interfaces Today’s post was inspired by interfaces: they’re the things that make code go brrrrr. Basically Legos, but for adults who never go outside. If you’ve written code, you’ve done it using an interface. Graphics has interfaces too. OpenGL is an interface. Vulkan is an interface. Mesa has interfaces. It’s got some neat ones like Gallium which let you write a whole GL driver without knowing anything about GL. And then it’s got the DRI interfaces. Which, by their mere existence, answer the question “What could possibly be done to make WSI even worse than it already is?” The DRI interfaces date way back to a time before the blog. A time when now-dinosaurs roamed the earth. A time when Vulkan was but a twinkle in the eye of Mantle, which didn’t even exist. I’m talking Copyright 1998-1999 Precision Insight, Inc., Cedar Park, Texas. at the top of the file old. The point of these interfaces was to let external applications access GL functionality. Specifically the xserver. This was before GLAMOR combined GBM and EGL to enable a better way of doing things that didn’t involve brain damage, and it was a necessary evil to enable cross-vendor hardware acceleration using Mesa. 
Other historical details abound, but this isn’t a textbook. The DRI interfaces did their job and enabled hardware-accelerated display servers for decades. Now, however, they’ve become cruft. A hassle. A roadblock on the highway to a future where I can run zink on stupid platforms with ease. Problem The first step to admitting there’s a problem is having a problem. I think that’s how the saying goes, anyway. In Mesa, the problem is any time I (or anyone) want to do something related to the DRI frontend, like allow NVK to use zink by default, it has to go through DRI. Which means going through the DRI interfaces. Which means untangling a mess of unnecessary function pointers with versioned prototypes meaning they can’t be changed without adding a new version of the same function and adding new codepaths which call the new version if available. And guess how many people in the project truly understand how all the layers fit together? It’s a mess. And more than a mess, it’s a huge hassle any time a change needs to be made. Not only do the interfaces have to be versioned and changed, someone looking to work on a new or bitrotted platform has to first chase down all the function pointers to see where the hell execution is headed. Even when the function pointers always lead to the same place. I don’t have any memes today. This is my declaration of war. DRI interfaces: you’re officially on notice. I’m coming for you.
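To make the gripe concrete, here is a schematic of the versioned-function-pointer pattern (all names invented for illustration; this is not actual Mesa/DRI code): every revision appends a field and bumps a version number, and every caller must branch on that version forever after.

```c
#include <stdio.h>

/* Schematic of a versioned-extension interface (made-up names, NOT
 * actual Mesa/DRI code). Each change appends a field and bumps the
 * version; old entry points can never be removed. */
struct fake_dri_ext {
    int version;
    void (*flush)(void);                /* since version 1 */
    void (*flush_with_flags)(unsigned); /* since version 2 */
};

static void do_flush(void) { puts("flush"); }
static void do_flush_with_flags(unsigned flags) { printf("flush(%u)\n", flags); }

/* Returns the interface version actually used, to make the branching
 * visible: new codepaths pile up while the old ones stay forever. */
static int frontend_flush(const struct fake_dri_ext *ext, unsigned flags)
{
    if (ext->version >= 2 && ext->flush_with_flags) {
        ext->flush_with_flags(flags);
        return 2;
    }
    ext->flush();
    return 1;
}
```

Chasing these pointers at runtime is exactly the "where the hell is execution headed" problem, even when they always lead to the same place.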
  • Tomeu Vizoso: Etnaviv NPU update 17: Faster! (2024/02/23 12:10)
    In the last update I explained how compression of zero weights gave our driver such a big performance improvement. Since then, I have explored further what could take us closer to the performance of the proprietary driver, and saw the opportunity to gather some of the proverbial low-hanging fruit.
TL;DR: Our driver's performance on SSD MobileDet went from 32.7 ms to 24.8 ms, against the proprietary driver's 19.5 ms. On MobileNetV1, our driver went from 9.9 ms to 6.6 ms, against the proprietary driver's 5.5 ms. Pretty close!
Enable more convolutions: Our driver was rejecting convolutions whose number of output channels is not divisible by the number of convolution cores in the NPU, because at the start of development the code that lays the weights out in memory didn't support that. That caused TensorFlow Lite to run those convolutions on the CPU, and some of them were big enough to take a few milliseconds, several times more than on the NPU. When implementing support for bigger kernels I had to improve the tiling of the convolutions, and that included adding support for these other convolutions. So by just removing the rejection of these, we got a nice speed-up on SSD MobileDet: from 32.7 ms to 27 ms! That didn't help on MobileNetV1, because that one has all its convolutions with neat numbers of output channels.
Caching of the input tensor: So far we were only caching the kernels in the on-chip SRAM. I spent some time looking at how the proprietary driver sets the various caching fields and found a way of getting us to cache a portion of the input tensor in the remaining internal SRAM. That got us the rest of the performance improvement mentioned above, but I am having trouble with some combination of parameters when input tensor caching is enabled, so I need to get to the bottom of it before I submit it for review.
Next steps: At this point I am pretty confident that we can get quite close to the performance of the proprietary driver without much additional work, as a few major performance features remain to be implemented, and I know that I still need to take a pass at tuning some of the previous performance work. But after getting the input tensor caching finished, and before I move on to any other improvements, I think I will invest some time in adding profiling facilities so I can better direct the efforts and get the best returns.
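For the curious, the divisibility constraint above amounts to distributing output channels across the convolution cores; a toy sketch (a hypothetical helper, not the driver's actual weight-layout code):

```python
def split_output_channels(out_channels, cores):
    # Distribute output channels across NPU convolution cores; when the
    # count is not evenly divisible, the first cores take one extra channel.
    base, rem = divmod(out_channels, cores)
    return [base + (1 if i < rem else 0) for i in range(cores)]
```

For example, split_output_channels(10, 4) yields [3, 3, 2, 2]; handling this uneven case is what lets such convolutions stay on the NPU instead of falling back to the CPU.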
  • Mike Blumenkrantz: Woof (2024/02/21 00:00)
    It Turns Out …that this year is a lot busier than expected. Blog posts will probably come in small clusters here and there rather than with any sort of regular cadence. But now I’m here. You’re here. Let’s get cozy for a few minutes. NVK O’clock I’m sure you’ve seen some news, you’ve been trawling the gitlab MRs, you’re on the #nouveau channels. You’re one of my readers, so we both know you must be an expert. Zink on NVK is happening. Those of you who remember the zink XDC talk know that this work has been ongoing for a while, but now I can finally reveal the real life twist that only a small number of need-to-know community members have been keeping under wraps for years: I still haven’t been to XDC yet. Let me explain. I’m sure everyone recalls the point in the presentation where “I” talked about progress made towards Zink on NVK. A lot of people laughed it off; oh sure, you said, that’s just the usual sort of joke we expect. But what if I told you it wasn’t a joke? That all of it was 100% accurate, it just hadn’t happened yet? I know what you’re thinking now, and you’re absolutely correct. The me that attended XDC was actually time traveling from the future. A future in which Zink on NVK is very much finished. Since then, I’ve been slowly and quietly “backporting” the patches my future self wrote and slipping them into git. Let’s look at an example. The Great Gaming Bug Of ‘24 20 Feb 2024 was a landmark day in my future-journal for a number of reasons, not the least due to the alarming effects of planetary alignment that you’re all no doubt monitoring. For the purposes of the current blog post that I’m now writing, however, it was monumental for a different reason. This was the day that noted zinkologist and current record-holder for Most Tests Fixed With One Line Of Code, Faith Ekstrand (@gfxstrand), would delve into debugging the most serious known issue in zink+nvk: Yup, it’s another clusterfuck. 
Now let me say that I had the debug session noted down in my journal, but I didn’t add details. If you haven’t been in #nouveau for a live debug session, it’s worth scheduling time around it. Get some popcorn ready. Put on your safety glasses and set up your regulation-size splatterguard, all the usual, and then… Well, if I had to describe the scene, it’s like watching someone feed a log into a wood chipper. All the potential issues investigated one-by-one and eliminated into the pile of growing sawdust. Anyway, it turns out that NVK (currently) does not expose a BAR memory type with host-visible and device-local properties, and zink has no handling for persistently mapped buffers in this scenario. I carefully cherry-picked the appropriate patch from my futurelog and rammed it through CI late at night when nobody would notice. As a result, all GL games now work on NVK. No hyperbole. They just work. Stay tuned for future updates backported from a time when I’m not struggling to find spare seconds under the watchful gaze of Big Triangle.
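For reference, the condition behind that bug boils down to memory-type selection: the flag values below mirror the Vulkan VK_MEMORY_PROPERTY_* bits, but the helper itself is a self-contained sketch, not zink source.

```c
#include <stdint.h>

/* These mirror VkMemoryPropertyFlagBits values, redefined here so the
 * sketch stands alone. */
#define DEVICE_LOCAL_BIT 0x1u
#define HOST_VISIBLE_BIT 0x2u

/* Return the first memory type containing all required flags, or -1 if
 * none exists -- the case NVK exposed: no HOST_VISIBLE | DEVICE_LOCAL
 * ("BAR") type available for persistently mapped buffers. */
static int find_memory_type(const uint32_t *type_flags, int count,
                            uint32_t required)
{
    for (int i = 0; i < count; i++)
        if ((type_flags[i] & required) == required)
            return i;
    return -1; /* caller must fall back, e.g. to staging copies */
}
```

The fix was on the caller side: handling the -1 case instead of assuming a BAR-like type always exists.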
  • Simon Ser: Status update, February 2024 (2024/02/19 22:00)
    Hi! February is FOSDEM month, and as usual I’ve come to Brussels to meet with a lot of other FOSS developers and exchange ideas. I like to navigate between the buildings and along the hallways to find nice people to discuss with. This edition I’ve been involved in the new modern e-mail devroom and I’ve given a talk about IMAP with Damian, a fellow IMAP library maintainer and organizer of this devroom. The whole weekend was great! In wlroots news, I’ve worked on multi-connector atomic commits. Right now, wlroots sequentially configures outputs, one at a time. This is slow and makes it impossible to properly handle GPU limitations such as bandwidth: if the GPU cannot drive two outputs with a 4k resolution, we’ll only find out after the first one has been lit up. As a result we can’t properly implement fallbacks, and this results in black screens on some setups. In particular, on Intel some users need to set WLR_DRM_NO_MODIFIERS=1 to have their multi-output setup work correctly. The multi-connector atomic commit work is the first step to resolve these situations, and it also results in faster modesets. The second step will be to add fallback logic to use a less bandwidth-intensive scanout buffer on modeset. While working on the wlroots DRM backend code, I’ve also taken the opportunity to clean up the internals and skip unnecessary modesets when switching between VTs. Ctrl Alt 1 should be faster now! I’ve also tried to resurrect the ext-screencopy-v1 protocol, required for capturing individual windows. I’ve pushed a new version and reworked the wlroots implementation; hopefully I can find some more time next month to continue on this front. Sway 1.9-rc4 has been recently released, and my reading of the tea leaves at my disposal indicates that the final release may be shipped soon. Sway 1.9 will leverage the new wlroots rendering API, however it does not include the huge scene-graph rework that Alexander has pushed forward in the last year or so.
Sway 1.10 will be the first release to include this major overhaul and all the niceties it unlocks. And Sway 1.10 will also finally support input method popups (used for CJK among other things) thanks to efforts by Access and Tadeo Kondrak. The NPotM is sinwon, a simple OAuth 2 server for small deployments. I’ve long been trying to find a good solution to delegate authentication to a single service and provide single-sign-on for my personal servers. I’ve come to like OAuth 2 because it’s a standard, it’s not tied to another use-case (like IMAP or SMTP is), and it prevents other services from manipulating user passwords directly. sinwon stores everything in a SQLite database, and it’s pretty boring: no fancy cryptography usage for tokens, no fancy cloud-grade features. I like boring. sinwon has a simple UI to manage users and OAuth clients (sometimes called “apps”). Still missing are refresh tokens, OAuth scopes, an audit log, personal access tokens, and more advanced features such as TOTP, device authorization grants and mTLS. Patches welcome! I’ve continued my work to make it easier to contribute to the SourceHut codebase. Setting up PGP keys is now optional to run a SourceHut instance, and a local S3-compatible server (such as minio) can be used without TLS. Thorben Günther has added to I’m also working on making services use’s GraphQL API instead of maintaining their own copy of the user’s profile, but more needs to be done there. And now for the random collection of smaller updates… The soju IRC bouncer and the goguma IRC client for mobile devices now support file uploads: no need to use an external service anymore to share a screenshot or picture in an IRC conversation. Conrad Hoffmann and Thomas Müller have added support for multiple address books to the go-webdav library, as well as creating/deleting address books and calendars. I’ve modernized the FreeDesktop e-mail server setup with SPF, DKIM and DMARC. 
KDE developers have contributed a new layer-shell minor version to support docking their panel to a corner of the screen. That’s all for now, see you next month!
  • Donnie Berkholz: The lazy technologist’s guide to fitness (2024/02/16 20:20)
    In the past 8 months, I’ve lost 60 pounds and gone from completely sedentary to well on my way towards becoming fit, while putting in a minimum of effort. On the fitness side, I’ve taken my cardiorespiratory fitness from below average to above average, and I’m visibly stronger (I can do multiple pull-ups!). Again, I’ve aimed to do so with minimal effort to maximize my efficiency. Here’s what I wrote in my prior post on weight loss: I have no desire to be a bodybuilder, but I want to be in great shape now and be as healthy and mobile as possible well into my old age. And a year ago, my blood pressure was already at pre-hypertension levels, despite being at a relatively young age. Research shows that 5 factors are key to a long life — extending your life by 12–14 years:
- Never smoking
- BMI of 18.5–24.9
- 30+ min a day of moderate/vigorous exercise
- Moderate alcohol intake (vs none, occasional, or heavy). Unsurprisingly, there is vigorous scientific and philosophical/religious/moral debate about this one; however, all studies agree that heavy drinking is bad.
- Diet quality in the upper 40% (Alternate Healthy Eating Index)
In addition, people who are in good health have a much shorter end-of-life period. This means they extend the healthy portion of their lifespan (the “healthspan”) and compress the worst parts into a shorter period at the very end. Having seen many grandparents go through years of struggle as they grew older, I wanted my own story to have a different ending. Although I’m not a smoker, I was missing three of the other factors. My weight was massively unhealthy, I didn’t exercise at all and spent most of my day in front of a desk, and my diet was awful. I do drink moderately, however (almost entirely beer). This post accompanies my earlier writeup, “The lazy technologist’s guide to weight loss.” Check that out for an in-depth, science-driven review of my experience losing weight.
Why is this the lazy technologist’s guide, again?
I wanted to lose weight in the “laziest” way possible — in the same sense that lazy programmers find the most efficient solutions to problems, according to an apocryphal quote by Bill Gates and a real one by Larry Wall, creator of Perl. Gates supposedly said, “I choose a lazy person to do a hard job. Because a lazy person will find an easy way to do it.” Wall wrote in Programming Perl, “Laziness: The quality that makes you go to great effort to reduce overall energy expenditure. It makes you write labor-saving programs that other people will find useful and document what you wrote so you don’t have to answer so many questions about it.” What’s the lowest-effort, most research-driven way to become fit as quickly as possible, during and after losing weight? Discovering and executing upon that was my journey. Read on if you’re considering taking a similar path. Cardio Fitness My initial goal for fitness was simply to meet the “30+ min/day” factor in the research study I cited at the beginning of this post, while considering a few factors: First, this is intended to be the lazy way, so there should be no long and intense workouts unless unavoidable.  Second, I did not want to buy a bunch of equipment or need to pay for a gym membership. Any required equipment should be inexpensive and small. Third, I wanted to avoid creating any joint issues that would affect me negatively later in life. I was particularly concerned about high-impact, repetitive stress from running on hard surfaces, which I’d heard could be problematic. Joint issues become very common for older people, especially knees and hips. My program needed to avoid any high-impact, repetitive stress on those joints to preserve maximum function. I’ve always heard that running is bad on your knees, but after I looked into it, the research does not bear that out. And yet, it remains a popular misconception among both the general population as well as doctors who do not frequently perform hip replacements. 
However, I just don’t like running — I enjoy different activities if I’m going to be working hard physically, such as games like racquetball/squash/pickleball or self-defense (Krav Maga!). I’m also not a big fan of getting all sweaty in general, but especially in the middle of a workday. So I wanted an activity with a moderate rather than high level of exertion. Low-impact options include walking, cycling, swimming, and rowing, among others. But swimming requires an indoor pool or year-round good weather, and rowing requires a specialized machine or boat, while I’m aiming to stay minimal. I also do not own a bicycle, nor is the snowy weather in Minnesota great for cycling in the winter (fat-tire bikes being an exception). We’re left with walking as the primary activity.  LISS — Low-Intensity Steady State Initially, I started with only walking. This is called low-intensity steady state (LISS) cardio (cardiovascular, a.k.a. aerobic) exercise. Later, I also incorporated high-intensity interval training (HIIT) as the laziest possible way to further improve my cardiovascular health. To bump walking up into a “moderate” level of activity, I need to walk between 3–4 mph. This is what’s sometimes called a “brisk” walk — 3 mph feels fast, and 4 mph is about as fast as I can go without changing into some weird competitive walking style. I also need to hit 30+ minutes per day of this brisk walking. At first, I started on a “walking pad” treadmill under my standing desk, which I bought for <$200 on Amazon. My goal was to integrate walking directly into my day with no dedicated time, and this seemed like a good path. However, this violates the minimalism requirement. I also learned that the pace is also too fast to do much of anything at the desk besides watch videos or browse social media. So I broke this up into two 1-mile outdoor walks, one after lunch and another after dinner.  Each 1-mile walk takes 15–20 minutes. 
Fitting this into a workday requires me to block off 45–60 minutes for lunch, between lunch prep, time to eat, and the walk itself. I find this much easier than trying to create a huge block of time in the morning for exercise, because I do not naturally wake up early. In the evening, I’ll frequently extend the after-dinner walk to ~2 miles instead of 1 mile. It turns out that walking after meals is a great strategy for both weight loss and suppressing your blood sugar levels, among other benefits. This can be as short as a 2-minute walk, according to recent studies. In fact, it’s seen as so key in Mediterranean culture that walking is considered a component of the Mediterranean diet. Overall, I’ve increased my active calorie expenditure by 250 calories/day by incorporating active walks into my day. That’s a combination of the 2 after-meal brisk walks, plus a more relaxed walk on my under-desk treadmill sometime during the day. The latter is typically a 2 mph walk for 40–60 min, and I do it while I’m in a meeting that I’m not leading, or maybe watching a webinar. Without buying the walking pad, you could do the same on a nice outdoor walk with a headset or earbuds, but Minnesota weather sometimes makes that miserable. Overall, all of this typically gets me somewhere between 10,000–15,000 steps per day.  Not only is this good for fitness, it also helps to offset the effects of metabolic adaptation. If you’re losing weight, your body burns fewer calories because it decreases your resting metabolic rate to conserve energy. Although some sites will suggest this could be hundreds of calories daily, which is quite discouraging, research shows that’s exaggerated for most people. During active weight loss, it’s typically ~100 calories per day, although it may be up to 175±150 calories for diet-resistant people. That range is a standard deviation, so people who are in the worst ~15% of the diet-resistant subset could have adaptations >325 calories/day.
So if you believe you’re diet-resistant, you probably want to aim for a 1000-calorie deficit, to ensure you’re able to lose weight at a good rate. On the bright side, that adaptation gets cut in half once you’ve stabilized for a few weeks at your new weight, and it’s effectively back to zero a year later. To further maintain my muscle following weight loss, I added a weighted vest to my after-lunch walks occasionally (examples: Rogue, 5.11, TRX). I started doing this once a week, and I aim to get to 3x+/week. I use a 40 lb weighted vest to counterbalance the 40+ lb of weight that I’ve lost. When I walk with the vest, I’m careful to maintain the same pace as without the vest, which increases the intensity and my heart rate. This pushes a normal moderate-intensity walk into the low end of high intensity (approaching 80% of my max heart rate). I also anticipate incorporating this weighted vest into my strength training later, once my own body weight is insufficient for continued progression.  Considering a minimalist approach, however, I think you could do just fine without a weighted vest. There are other ways to increase intensity, such as speed or inclines, and the combination of a high-protein diet, HIIT, and strength training provides similar benefits. HIIT — High-Intensity Interval Training Why do HIIT? Regularly getting your heart rate close to its maximum is good for your cardiovascular health, and you can’t do it with LISS, which by definition is low intensity. Another option besides HIIT is much longer moderate-intensity continuous training (your classic aerobic workout), but HIIT can fit the same benefits or more into a fraction of the time. Research is very supportive of HIIT compared to longer aerobic workouts, which enables time compression of the total workout length from the classic 60 minutes down to 30 minutes or less.  However, 30 minutes still isn’t the least you can do and still get most of the benefits. 
The minimum required HIIT remains unclear — in overall length, weekly frequency, as well as patterns of high-intensity and rest / low-intensity. Here are some examples of research that test the limits of minimalist HIIT and find that it still works well:
- 1x 4 min, 3x/wk: review
- 5x 1 min, 2x/wk: study, study, study
- 4x 1 min, 3x/wk: study
- 8x 20 sec, 4x/wk (i.e. the Tabata protocol): study
Yes, you read that right — the last study used 20-second intervals. They were only separated by 10 seconds of rest, so the primary exercise period was just 4 minutes, excluding warm-up. Furthermore, this meta-analysis suggests that HIIT benefits more from increasing the intensity of the high-intensity intervals, rather than increasing the volume of repetitions. After my investigation, it was clear that “low-volume” or “extremely low volume” HIIT could work well, so there was no need to do the full 30-minute HIIT workouts that are popular with many gym chains. I settled on 3 minutes of HIIT, 2x/week: 3 repetitions of 30 seconds hard / 30 seconds light, plus a 1-minute warm-up. This overlaps with the HIIT intervals, breaks, and repetitions from the research I’ve dug into, and it also has the convenient benefit of not quite making me sweat during the workout, so I don’t need to change clothes. I’m seeing the benefits of this already, which I’ll discuss in the Summary.
Strength Training
I also wanted to incorporate strength training for many reasons. In the short term, it was to minimize muscle loss as I lost weight (addressed in my prior post). In the medium and long term, I want to build muscle now so that I can live a healthier life once I’m older and also feel better about myself today. What I’ve found is that aiming for the range of 10%–15% body fat is ideal for men who want to be very fit. This range makes it easy to tell visually when you’re at the top or bottom of the range, based on the appearance of a well-defined six-pack or its fading away to barely visible.
It gets harder to tell where you are visually from 15% upwards, while anything below 10% has some health risks and starts to look pretty unusual too. Within that 10%–15% range, I’m planning to do occasional short-term “lean bulks” / “clean bulks” and “cuts.” That’s the typical approach to building muscle — you eat a slight excess of calories while ensuring plenty of protein, aiming to gain about 2–4 lbs/month for someone my size. After a cycle of doing this, you then “cut” by dieting to lose the excess fat you’ve gained, because it’s impossible to only gain muscle. My personal preference is to make this cycle more agile with shorter iteration cycles, compared to some of the examples I’ve seen. I’m thinking about a 3:1 bulk:cut split over 4 months that results in a total gain/loss of ~10 lbs. Calisthenics (bodyweight exercises): the minimalist’s approach My goal of staying minimal pushed me toward calisthenics (bodyweight exercises), rather than needing to work out at a gym or buy free weights. This means the only required equipment is a doorway pull-up bar ($25), while everything else can be done with a wall, table or chair/bench. Although I may not build enormous muscles, it’s possible to get to the point of lifting your entire body weight with a single arm, which is more than good enough for me. That’s effectively lifting 2x your body weight, since you’re lifting 1x with just one arm. My routine is inspired by Reddit’s r/bodyweightfitness (including the Recommended Routine and the Minimalist Routine) and this blog post by Steven Low, author of the book “Overcoming Gravity.” I’ve also incorporated scientific research wherever possible to guide repetitions and frequency. Overall, the goal is to get both horizontal and vertical pushing and pulling exercises for the arms/shoulders due to their larger range of motion, while getting push and pull for legs, and good core exercises that cover both the upper and lower back as well.  
I’ve chosen compound exercises that work many muscles simultaneously — for practicality (more applicable to real-world motions), length of workout, and minimal equipment needs. If you’re working isolated muscles, you generally need lots of specialized machines at a gym. Isometrics (exercises where you don’t move, like a wall-sit) are also less applicable to real use cases as you age, such as the strength and agility to catch yourself from a fall. For that reason, I prefer compound exercises with some rapid, explosive movements that help to build both strength and agility.
My initial routine
Here’s my current schedule (3 sets of repetitions for each movement, with a 3-minute break between sets):
- Monday: arm push — push-ups (as HIIT) and tricep dips. “As HIIT” means that I’ll do as many push-ups as I can fit within my HIIT pattern, then flip to my active work (e.g. jumping jacks or burpees).
- Tuesday: arm pull — pull-ups (with L-sit, as below) and inverted rows (“Australian pull-ups”)
- Wednesday: core — L-sits, planks (3x — 10 sec on each of front, right, left)
- Thursday: handstands — working toward handstand push-ups as the “vertical push”
- Friday: legs — squats (as HIIT), and Nordic curls (hamstrings & lower back)
- Saturday/Sunday: rest — just walking. Ideally hitting 10k steps/day but no pressure to do so, if I’m starting to feel sore.
For ones that I couldn’t do initially (e.g. pull-ups, handstands, L-sits, Nordic curls), I used progressions to work my way there step by step. For pull-ups, that meant doing negatives / eccentrics by jumping up and slowly lowering myself down over multiple seconds, then repeating. For handstands, I face the wall to encourage better posture, so it’s been about longer holds and figuring out how to bail out so I can more confidently get vertical. For L-sits, I follow this progression. For Nordic curls, I’m doing slow negatives as far down as I can make it, then dropping the rest of the way onto my hands and pushing back up.
On days with multiple exercises for the same muscles, I’ll typically try to split them up so they fit more easily into a workday. For example, I’ll find 10 minutes mid-morning between meetings/calls to do one movement and 10 minutes mid-afternoon for the other. This is the same time I might’ve spent making a coffee, before I started focusing on fitness. Combined with the walks, this plan gets me moving 4 times a day — two 20-minute walks and two 10-minute workouts, for a total of 1 hour each day. The great thing about this approach is that I never feel like I need to dedicate a ton of time to exercise, because it fits naturally into the structure of my day. I’ve also got an additional 40–60 minutes of slow walking while at my desk, which again fits easily into my day. What I’ve learned along the way As you can see, I’m currently at 1x/wk for non-core exercises, which is a “traditional split.” That means I’m splitting up exercises, focusing on just one set of muscles each day. The problem is that the frequency of training for each muscle group is low, which I’d like to change so that I can build strength more quickly.  I’m switching to “paired sets” (aka “alternating sets”) that alternate among different muscle groups, so I can fit more into the same amount of time. Here’s how that works: if you were taking a 3-minute rest between sets, that gives you time to fit in an unrelated set of muscles that you weren’t using in the first exercise (e.g. biceps & triceps, quads & hamstrings, chest & back). I do this as an alternating tri-set (arm pull, arm push, legs) with a 30–45 second rest between each muscle group, and a 1.5–2 minute break between each full tri-set. You might also see “supersets,” which is a similar concept but with no breaks within the tri-set. I’ve found that I tend to get too tired and sloppy if I try a superset, so I do alternating sets instead. In addition, I’ve done a lot more research on strength training after getting started. 
For LISS and HIIT, I had a strongly research-driven approach before beginning. For strength training, I went with some more direct recommendations and only did additional academic research later. Here’s what I’ve learned since then:
- Higher-load (80%+), multi-set workouts 2x/week are optimal for maximizing both strength and hypertrophy, according to a 2023 meta-analysis.
- One ideal size of a set to maximize benefits seems to be 6–8 repetitions, with a 3-minute break between sets to maximize energy restoration. 6–8 reps seems like a sweet spot between strength and hypertrophy (muscle size). For endurance, 15+ repetitions should be the goal. If you want to build all of those characteristics, you should probably alternate rep counts with different loads.
- Time-efficient workout design: use compound exercises and include both concentric & eccentric movements. Perform a minimum of one leg-pressing exercise (e.g. squats), one upper-body pulling exercise (e.g. pull-up) and one upper-body pushing exercise (e.g. push-up). Perform a minimum of 4 weekly sets per muscle group using a 6–15 rep max loading range.
- Eccentrics / negatives are superior to concentric movements. Don’t neglect or rush through the negatives / eccentrics. That’s the part of an exercise you ignore by default — letting your weight come down during a squat, pull-up, or push-up rather than when you’re pushing/pulling it back up. Take your time on that part, because it’s actually more important. Doing something as quick as 3-second negatives, 4x/wk, will improve strength.
Overall, that suggests a workout design that looks like this (2 days a week): 2+ sets of each: Compound exercises for arm push, arm pull, leg press Aim for whatever difficulty is required to max out at 6–8 repetitions for strength & hypertrophy (muscle size), or up to 15 if you’re focusing on endurance Do slow eccentrics / negatives on every exercise The new routine To incorporate this research into a redesigned routine that also includes HIIT and core work, here’s what I’ve recently changed to (most links go to “progressions” that will help you get started): Monday: Strength: push-ups, pull-ups, squats as alternating set Tuesday: HIIT (burpees, mountain climbers, star jumps, etc) Wednesday: Core & Flexibility: L-sits, planks, Nordic curls, stretches Thursday: HIIT (similar routine) Friday: Strength: handstand push-ups, inverted rows, squats as alternating set Saturday/Sunday: Rest days Also, 4+ days a week, I do a quick set of a 5-second negative for each type of compound exercise (arm push, arm pull, leg press). That’s just 2 days in addition to my strength days, so I usually fit it into HIIT warm-up or cool-down. On each day, my overall expected time commitment will be about 10 minutes. For strength training, all the alternating sets will overlap with each other. Even with a 3-min break between each set for the same muscle group, that should run quite efficiently for 2–3 sets. For HIIT, it’s already a highly compressed routine that takes ~5 minutes including warm-up and cool-down, but I need another 5 minutes afterwards to decompress after exercise that intense. You may notice that I only have one dedicated day to work my core (Wednesday), but I’m also getting core exercise during push-ups (as I plank), L-sit pull-ups, and handstands (as I balance). The research recommendation to increase load to 80% of your max can seem more challenging with calisthenics, since it’s just about bodyweight. 
However, it’s always possible by decreasing your leverage, using one limb instead of two, or increasing the proportion of your weight that’s applied by changing your body angles. For example, you can do push-ups at a downwards incline with your feet on a bench/chair. You can also do more advanced types of squats like Bulgarian split squats, shrimp squats, or pistol squats. Summary My cardiorespiratory fitness, as measured by VO2 Max (maximal oxygen consumption) on my Apple Watch, has increased from 32 (the lowest end of “below average,” for my age & gender) to 40.1 (above average). It continues to improve on a nearly daily basis. That’s largely happened within just a couple of months, since I started walking every day and doing HIIT.  My blood pressure (one of my initial concerns) has dropped out of pre-hypertension into the healthy range. My resting heart rate has also decreased from 63 to 56 bpm, which was a long slow process that’s occurred over the entire course of my weight loss. On the strength side, I wasn’t expecting any gains because I’m in a caloric deficit. My main goal was to avoid losing muscle while losing weight. I’ve now been strength training for 2.5 months, and I’ve been pleasantly surprised by the “newbie gains” (which people often see in their first year or two of strength training).  For example, I couldn’t do any pull-ups when I started. I could barely do a couple of negatives, by jumping up and letting myself down slowly. Now I can do 4 pull-ups (neutral grip). Also, I can now hold a wall handstand for 30–45 seconds and do 6–8 very small push-ups, while I could barely get into that position at all when I started.  Overall, clear results emerged almost instantly for cardiorespiratory fitness, and as soon as 6 weeks after beginning a regular strength-training routine. If you try it out, let me know how it works for you!
  • Donnie Berkholz: The lazy technologist’s guide to weight loss (2024/02/16 19:56)
    [Last update: 2024-02-16] In the past 8 months, I’ve lost 60 pounds and gone from completely sedentary to much more fit, while putting in a minimum of effort. I have no desire to be a bodybuilder, but I want to be in great shape now and be as healthy and mobile as possible well into my old age. A year ago, my blood pressure was already at pre-hypertension levels, despite my relatively young age.  I wasn’t willing to let this last any longer, and I wasn’t willing to accept that future. Research shows that 5 factors are key to a long life — correlated with extending your life by 12–14 years: Never smoking BMI (body mass index) of 18.5–24.9 30+ min a day of moderate/vigorous exercise Moderate alcohol intake (vs none, occasional, or heavy) Diet quality in the upper 40% (Alternate Healthy Eating Index) In addition, people who are in good health have a much shorter end-of-life period. This means they extend the healthy portion of their lifespan (the “healthspan”) and compress the worst parts into a shorter period at the very end. Having seen many grandparents go through years of struggle as they grew older, I wanted my own story to have a different ending. Although I’m not a smoker, I was missing three of the other factors. My weight was massively unhealthy, I didn’t exercise at all and spent most of my day in front of a desk, and my diet was awful. On the bright side for these purposes, I drink moderately (almost entirely beer). In this post, I’ll walk through my own experience going from obese to a healthy weight, with plenty of research-driven references and data along the way. Why is this the lazy technologist’s guide, though? I wanted to lose weight in the “laziest” way possible — in the same sense that lazy programmers find the most efficient solutions to problems, according to an apocryphal quote by Bill Gates and a real one by Larry Wall, creator of Perl. Gates supposedly said, “I choose a lazy person to do a hard job. 
Because a lazy person will find an easy way to do it.” Wall wrote in Programming Perl, “Laziness: The quality that makes you go to great effort to reduce overall energy expenditure. It makes you write labor-saving programs that other people will find useful and document what you wrote so you don’t have to answer so many questions about it.” What’s the lowest-effort, most research-driven way to lose weight as quickly as possible without losing health? Discovering and executing upon that was my journey. Read on if you’re considering taking a similar path. My weight-loss journey begins My initial goal was to get down from 240 pounds (obese, BMI of 31.7) into the healthy range, reaching 185 pounds (BMI of 24.4).  My aim was to lose at the high end of a healthy rate, 2 pounds per week. Credible sources like the Mayo Clinic and the CDC suggested aiming for 1–2 pounds a week, because anything beyond that can cause issues with muscle loss as well as malnutrition. But how could I accomplish that? One weird trick — Eat less I’ve lost weight once previously (about 15 years ago), although it was a smaller amount. Back then, I learned that there’s no silver bullet — the trick is to create a calorie deficit, so that your body consumes more energy than the calories in what you eat.  Every pound is about 3500 calories, which helps to set a weekly and daily goal for your calorie deficit. For me to lose 2 pounds a week, that’s 2*3500 = 7000 calories/week, or 1000 calories/day of deficit (eating that much less than my body uses). Exercise barely makes a dent It’s far more effective and efficient to create this deficit primarily through eating less rather than expecting exercise to make a huge difference. If you were previously gaining weight, you might’ve been eating 3000 calories/day or more! You can easily reduce what you eat by 1500 calories/day from that starting point, but it’s almost impossible to exercise enough to burn that many calories. 
An hour of intense exercise might burn 500 calories, and it’s very hard to keep up that level of effort for even one full hour — especially if you’ve been sitting in a chair all day for years on end. Not to mention, that much exercise would defeat the whole idea of this being the lazy person’s way of making progress. So how exactly can you reduce calories? You’ve got a lot of options, but they basically boil down to two things — eat less (portion control), and eat better (food choice). The plan At this point, I knew I needed to eat 1000 calories/day less than I burned. I used this calculator to identify that, as a sedentary person, I burned about 2450 calories/day. So to create that deficit, I needed to eat about 1450 calories/day. At that point, I was probably eating 2800–3000 calories/day, so that would require massive changes in my diet. I don’t like the idea of fad diets that remove one or many types of food entirely (Atkins, keto, paleo, etc), although they can work for other people. One of the big lessons about dieting is that as long as you’re removing something from what you eat, you’ll probably lose weight.  I decided to make two big changes: how often I ate healthy vs unhealthy food, and when I ate over the course of the day. At the time, I was eating a huge amount of high-fat, high-sugar, and low-health foods like burgers and fries multiple times per week, fried food, lots of chips/crisps, white bread (very high sugar in the US) & white rice, cheese, chocolate and candy.  I decided to shift that toward white meat (chicken/pork/turkey), seafood, salads & veggies, and whole grains (whole-wheat bread, brown rice, quinoa, etc). One pro-tip: American salad dressings are super unhealthy, often even the “vinaigrettes” that sound better. Do like the Italians do, and dress salads yourself with olive oil, salt, and vinegar. 
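The plan’s arithmetic is simple enough to sketch in a few lines (a toy calculation using the post’s numbers: the 3500-calorie pound, a 2 lb/week target, and ~2450 calories/day of maintenance burn; the function names are mine, and this is not nutrition software):

```python
CALORIES_PER_POUND = 3500  # rule-of-thumb energy content of a pound of body fat

def daily_deficit(pounds_per_week):
    """Daily calorie deficit needed for a given weekly loss rate."""
    return pounds_per_week * CALORIES_PER_POUND / 7

def daily_intake(maintenance_calories, pounds_per_week):
    """Target daily intake: maintenance burn minus the deficit."""
    return maintenance_calories - daily_deficit(pounds_per_week)

print(daily_deficit(2))       # 1000.0 calories/day of deficit
print(daily_intake(2450, 2))  # 1450.0 calories/day to eat
```

The same two functions also explain the later plateau: plug in a lower maintenance burn and the required intake drops with it.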
However, I didn’t want to remove my favorite foods entirely, because that would destroy my long-term motivation and enjoyment of my progress. For example, once a week, I still allow myself to get a cheeseburger. But I’ll typically get a single patty, no mayo/cheese/ketchup, and with a side like salad (w/ healthy dressing) or cole slaw. I’ll also ensure my other meal of the day is very light. Many days, I’ll enjoy a small treat like 1–2 chocolates, as well (50–100 calories). What if you like beer? I wanted to reach my calorie target without eliminating beer, so I could both preserve my quality of life and also maintain the moderate drinking that research shows is correlated with increased lifespan.  I was also drinking very high-calorie beer (like double IPAs and bourbon-barrel–aged imperial stouts). I shifted that toward low-alcohol, low-calorie beer (alcohol levels and calories are correlated). Bell’s Light-Hearted IPA and Lagunitas DayTime IPA are two pretty good ones in my area. Of the non-alcoholic (NA) beers, Athletic Free Wave Hazy IPA is the best I’ve found in my area, but Untappd has reasonably good ratings for Sam Adams Just the Haze and Sierra Nevada Trail Pass IPA, which should be broadly available. As a rough estimate on calories in beer, you can use this formula: Beer calories = ABV (alcohol percentage) * 2.5 * fluid ounces As an exception, many Belgian beers are quite “efficient” to drink, in that roughly 75% of the calories are alcohol rather than other carbs that just add calories. As a result, they violate the above formula and tend to be lower-calorie than you’d expect. This could be the result of carefully crafted recipes that consume most of the carbs, and fermentation that uses up all of the sugar.  
Here’s a more specific formula that you can use, if you’re curious about how “efficient” a given beer is, and you know how many total calories it has (find this online): Beer calories from ethanol = (ABV * 0.8 / 100) * (29.6 * fluid ounces) * 7 (Simplified form): Beer calories from ethanol = ABV * 1.7 * fluid ounces This uses the density and calories of ethanol (0.8 g/ml and 7 cal/g, respectively) and converts from milliliters to ounces (29.6 ml/oz). If you then calculate that number as a fraction of the total calories in a beer, you can find its “efficiency.” For example, a 12-ounce bottle of 8.5% beer might have 198 calories total. Using the equation, we can calculate that it’s got 169 calories from ethanol, so 169/198 = 85% “efficient.” If you’re really trying to optimize for this, however, beer is the wrong drink. Have a low-calorie mixed drink instead, like a vodka soda, ranch water, or rum and Diet Coke. The plan (part 2) Therefore, instead of giving up beer entirely, I decided to skip breakfast. I’d eaten light breakfasts for years (a small bowl of cereal, or a banana and a granola bar), so this wasn’t a big deal to me.  Later, I discovered this qualified my diet as time-restricted intermittent fasting as well, since I was only eating/drinking between ~12pm–6pm. This approach of 18 hours off / 6 hours on (18:6 fasting) may have aided in my weight loss, but studies are mixed with some suggesting no effect. Here’s what a day might look like on 1450 calories: Lunch (400 calories). A tuna-salad sandwich (made with Greek yogurt instead of mayo) on whole-wheat bread, and a side salad with olive oil & vinegar. Afternoon snack (150 calories). Sliced bell peppers, no dip, and a small bowl of cottage cheese. A treat (50–100 calories). A truffle or a couple of small chocolates as an afternoon treat. Dinner (650 calories). Fried chicken/fish sandwich (or kids-size burger) and a small order of fries, from a fast-casual restaurant. 
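The two beer formulas above are easy to check in code (a quick sketch; the function names are mine, and the constants are the ones given in the post):

```python
def beer_calories_estimate(abv, ounces):
    """Rough total calories: the ABV * 2.5 * fluid-ounces rule of thumb."""
    return abv * 2.5 * ounces

def ethanol_calories(abv, ounces):
    """Calories from ethanol alone: 0.8 g/ml density, 29.6 ml/oz, 7 cal/g."""
    return (abv * 0.8 / 100) * (29.6 * ounces) * 7

def efficiency(abv, ounces, total_calories):
    """Fraction of a beer's total calories that come from alcohol."""
    return ethanol_calories(abv, ounces) / total_calories

# The worked example: a 12 oz bottle of 8.5% beer with 198 total calories.
print(round(ethanol_calories(8.5, 12)))       # 169 calories from ethanol
print(round(100 * efficiency(8.5, 12, 198)))  # 85% "efficient"
```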
One or two low-alcohol, light, or NA beers (150–200 calories). When I get hungry, I often drink some water instead, because my body’s easily confused about hunger vs thirst. It’s a mental game too — I remind myself that hunger means my body is burning fat, and that’s a good thing. For a long time, I kept track of my estimated calorie consumption mentally. More recently, I decided to make my life a little easier by switching to an app. I chose MyFitnessPal because it’s got a big database including almost everything I eat. On this plan, I had a great deal of success in losing my first 40 pounds, getting down from 240 to 200. However, it started to feel like a bit of a struggle to maintain my weight loss as I reached 200 pounds and wanted to continue losing at the same rate of 2 pounds/week. Adaptation, plateaus and persistence I fell behind by about two weeks on my weight-loss goal, which was massively frustrating because I’d done so well all along. I convinced myself to keep persisting because it had worked all along for months, and this was a temporary setback. Finally I re-used the same weight-loss calculator and realized what seemed obvious in hindsight: Since I now weighed less, I also burned fewer calories per day! Those 40 pounds that were now gone didn’t use any energy anymore, but I was still eating as if I had them. I needed to change something to restore the 1000-calorie daily deficit.  At this point, I aimed to decrease my intake to about 1200 calories per day. This quickly became frustrating because it started to affect my quality of life by forcing choices I didn’t want to make, such as choosing between a decent dinner or a beer, or forcing me to eat a salad with no protein for dinner if I had a little bit bigger lunch. That low calorie limit also carried the risk of causing metabolic adaptation — meaning my body could burn hundreds fewer calories per day as a result of being in a “starvation mode” of sorts. 
That ends up being a vicious cycle that continually forces you to eat less, and it makes weight loss even more challenging. Consequently, I began to introduce moderate exercise (walking), so I could bring my intake back up to 1400 calories on days when I burned 200 extra calories. I’ve discussed the details in a follow-up guide for fitness. Over the course of my learning, I discovered that it’s ideal (according to actuarial tables) to sit in the middle of the healthy range rather than be at the top of it. I maintained my initial weight-loss goal to keep myself motivated on progress, but set a second goal of reaching 165 pounds — or whatever weight it takes to get a six-pack (~10% body fat). Eat lots of protein I also discovered that high-protein diets are better at preserving muscle, so more of the weight loss is fat. This is especially true when coupled with resistance or strength training, which also sends your body a signal that it needs to keep its muscle instead of losing it. The minimum recommended daily allowance (RDA) of protein (0.36 grams per pound of body weight, or 67 g/day for me) could be your absolute lower limit, while as much as 0.6 g/lb (111 g/day for me) could help in improving your muscle mass.  Another study suggested multiplying the RDA by 1.25–1.5 (or more if you exercise) to maintain muscle during weight loss, which would put my recommended protein at 84–100 grams per day. The same study also said exercise helps to maintain muscle during weight loss, so it could be an either/or situation rather than needing both. Additionally, high-protein diets can help with hunger and weight loss, in part because they keep you fuller for longer. Getting 25%–30% of daily calories from protein will get you to this level, which is a whole lot of protein. 
Starting from your overall daily calories, you can apply this percentage and then divide your desired protein calories by 4 to get the number of grams per day: Protein grams per day = Total daily calories * {25%, 30%} / 4 For my calorie limit, that’s about 88–105 grams per day.  I’ve found that eating near the absolute minimum recommended protein level (67 grams per day, for my weight) tends to happen fairly naturally with my originally planned diet, while getting much higher protein takes real effort. I needed to identify low-calorie, high-protein foods and incorporate them more intentionally into meals, so that I can get enough protein without compromising my daily calorie limit.  Here’s a good list of low-calorie, high-protein foods that are pretty affordable: Breakfast/Lunch: eggs or low-fat/nonfat Greek yogurt (with honey/berries),  Entree: grilled/roasted chicken (or pork/turkey) or seafood (especially shrimp, canned salmon, canned tuna), and Sides: cottage cheese or lentils/beans (including soups, to make it an entree). If you’re vegetarian, you’d want to go heavier on lentils and beans, and add plenty of nuts, including hummus and peanut butter. You probably also want to bring in tempeh, and you likely already eat tofu. I’d never tried canned salmon before, and I was impressed with how easily I could make it into a salad or an open-faced sandwich (like Danish smørrebrød). The salmon came in large pieces and retained the original texture, as you’d want. Canned tuna has been more variable in terms of texture — I’ve had some great-looking albacore from Genova and some great-tasting (but not initially good-looking) skipjack from Wild Planet. Avoid the most common brands of canned fish though, like Chicken of the Sea, StarKist, or Bumble Bee. They are often farmed or net-caught instead of pole/line-caught, and they may be higher in parasites (for farmed fish like salmon). 
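The protein formula above can be applied directly (a small sketch with my own function name; the 1,400-calorie figure is the reduced daily limit mentioned earlier in the post):

```python
def protein_grams(total_calories, fraction):
    """Grams of protein needed to supply a given fraction of daily
    calories, at 4 calories per gram of protein."""
    return total_calories * fraction / 4

# 25-30% of a ~1400-calorie day:
low = protein_grams(1400, 0.25)   # 87.5 g
high = protein_grams(1400, 0.30)  # 105.0 g
print(round(low), round(high))    # roughly the 88-105 g range in the post
```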
I also aim to buy lower-mercury types of salmon and tuna — this means I can eat each kind of fish as often as I want, instead of once a week. I buy canned Wild Planet skipjack tuna (not albacore, but yellowfin is pretty good too) and canned Deming’s sockeye salmon (not pink salmon) at my local grocery store, and I pick up large trays of refrigerated cocktail shrimp at Costco. The Genova brand also garners good reviews for canned fish and may be easier to find. All of those are pre-cooked and ready to eat, so they’re easy to use for a quick lunch.  Go ahead and get fresh seafood if you want, but be aware that you’ll be going through a lot of it so it could get expensive. Fish only stays good for a couple of days unless frozen, so you’ll also be making a lot of trips to the store or regularly thawing/cooking frozen fish. Summary Over the past 8 months, I’ve managed to lose 60 pounds (and counting!) through a low-effort approach that has minimized the overall impact on my quality of life. I’ve continued to eat the foods I want — but less of them. The biggest challenge has been persistence through the tough times. However, not cutting out any foods completely, but rather just decreasing the frequency of unhealthy foods in my life, has been a massive help with that. That meant I didn’t feel like I was breaking my whole diet whenever I had something I really wanted, as long as it fit within my calorie limit. What’s next? A few months after beginning my weight loss, I also started working out to get into better shape, which was another one of those original 5 factors to a long life. Right now, I’m aiming to get down to about 10% body fat, which is likely to be around 165 pounds. Then I’ll flip my eating habits into muscle-building mode, which will require a slight caloric excess rather than a deficit.  Stay tuned to see what happens!
  • Melissa Wen: Keep an eye out: We are preparing the 2024 Linux Display Next Hackfest! (2024/02/16 17:25)
    Igalia is preparing the 2024 Linux Display Next Hackfest and we are thrilled to announce that this year’s hackfest will take place from May 14th to 16th at our HQ in A Coruña, Spain. This unconference-style event aims to bring together the most relevant players in the Linux display community to tackle current challenges and chart the future of the display stack. Key goals for the hackfest include: Releasing the power of collaboration: We’ll work to remove bottlenecks and pave the way for smoother, more performant displays. Problem-solving powerhouse: Brainstorming sessions and collaborative coding will target issues like HDR, color management, variable refresh rates, and more. Building on past commitments: Let’s solidify the progress made in recent years and push the boundaries even further. The hackfest fosters an intimate and focused environment to brainstorm, hack, and design solutions alongside fellow display experts. Participants will dive into discussions, tinker with code, and contribute to shaping the future of the Linux display stack. More details are available on the official website. Stay tuned! Keep an eye out for more information, mark your calendars and start prepping your hacking gear.
  • Lucas Fryzek: A Dive into Vulkanised 2024 (2024/02/14 05:00)
    Vulkanised sign at Google’s office Last week I had an exciting opportunity to attend the Vulkanised 2024 conference. For those of you not familiar with the event, it is “The Premier Vulkan Developer Conference” hosted by the Vulkan working group from Khronos. With the excitement out of the way, I decided to write about some of the interesting information that came out of the conference. A Few Presentations My colleagues Iago, Stéphane, and Hyunjun each had the opportunity to present on some of their work in the wider Vulkan ecosystem. Stéphane and Hyunjun presenting Stéphane & Hyunjun presented “Implementing a Vulkan Video Encoder From Mesa to GStreamer”. They jointly talked about the work they performed to implement the Vulkan video extensions in Intel’s ANV Mesa driver as well as in GStreamer. This was an interesting presentation because you got to see how the new Vulkan video extensions affected both driver developers implementing the extensions and application developers making use of the extensions for real-time video decoding and encoding. Their presentation is available online. Iago presenting Later my colleague Iago presented jointly with Faith Ekstrand (a well-known Linux graphics stack contributor from Collabora) on “8 Years of Open Drivers, including the State of Vulkan in Mesa”. They both talked about the current state of Vulkan in the open source driver ecosystem, and some of the benefits open source drivers have been able to take advantage of, like the common Vulkan runtime code and a shared compiler stack. You can check out their presentation for all the details. Besides Igalia’s presentations, there were several more which I found interesting, with topics such as Vulkan developer tools, experiences of using Vulkan in real-world applications, and even how to teach Vulkan to new developers. Here are some highlights from a few of them. 
Using Vulkan Synchronization Validation Effectively John Zulauf gave a presentation on the Vulkan synchronization validation layers that he has been working on. If you are not familiar with these, then you should really check them out. They work by tracking how resources are used inside Vulkan and providing error messages with some hints if you use a resource in a way where it is not synchronized properly. They can’t catch every error, but they’re a great tool in the toolbelt of Vulkan developers to make their lives easier when it comes to debugging synchronization issues. As John said in the presentation, synchronization in Vulkan is hard, and nearly every application he tested the layers on revealed a synchronization issue, no matter how simple it was. He can proudly say he is a vkQuake contributor now because of these layers. 6 Years of Teaching Vulkan with Example for Video Extensions This was an interesting presentation from a professor at the University of Vienna about his experience teaching graphics as well as game development to students who may have little real programming experience. He covered the techniques he uses to make learning easier as well as resources that he uses. This would be a great presentation to check out if you’re trying to teach Vulkan to others. Vulkan Synchronization Made Easy Another presentation focused on Vulkan sync, but instead of debugging it, Grigory showed how his graphics library abstracts sync away from the user without implementing a render graph. He presented an interesting technique that is similar to how the sync validation layers work when it comes to ensuring that resources are always synchronized before use. If you’re building your own engine in Vulkan, this is definitely something worth checking out. 
Vulkan Video Encode API: A Deep Dive Tony at Nvidia did a deep dive into the new Vulkan Video extensions, explaining a bit about how video codecs work, and also including a roadmap for future codec support in the video extensions. Especially interesting for us was that he made a nice call-out to Igalia and our work on Vulkan Video CTS and open source driver support on slide 6 :) Thoughts on Vulkanised Vulkanised is an interesting conference that gives you the intersection of people working on Vulkan drivers, game developers using Vulkan for their graphics backend, visual FX tool developers using Vulkan-based tools in their pipeline, industrial application developers using Vulkan for some embedded commercial systems, and general hobbyists who are just interested in Vulkan. As an example of some of these interesting audience members, I got to talk with a member of the Blender foundation about his work on the Vulkan backend to Blender. Lastly, the event was held at Google’s offices in Sunnyvale, which I’m always happy to travel to, not just for the better weather (coming from Canada) but also for the amazing restaurants and food in the Bay Area! Great Bay Area food
  • Alyssa Rosenzweig: Conformant OpenGL 4.6 on the M1 (2024/02/14 05:00)
    For years, the M1 has only supported OpenGL 4.1. That changes today – with our release of full OpenGL® 4.6 and OpenGL® ES 3.2! Install Fedora for the latest M1/M2-series drivers. Already installed? Just dnf upgrade --refresh. Unlike the vendor’s non-conformant 4.1 drivers, our open source Linux drivers are conformant to the latest OpenGL versions, finally promising broad compatibility with modern OpenGL workloads, like Blender. Conformant 4.6/3.2 drivers must pass over 100,000 tests to ensure correctness. The official list of conformant drivers now includes our OpenGL 4.6 and ES 3.2. While the vendor doesn’t yet support graphics standards like modern OpenGL, we do. For this Valentine’s Day, we want to profess our love for interoperable open standards. We want to free users and developers from lock-in, enabling applications to run anywhere the heart wants without special ports. For that, we need standards conformance. Six months ago, we became the first conformant driver for any standard graphics API for the M1 with the release of OpenGL ES 3.1 drivers. Today, we’ve finished OpenGL with the full 4.6… and we’re well on the road to Vulkan. Compared to 4.1, OpenGL 4.6 adds dozens of required features, including: Robustness SPIR-V Clip control Cull distance Compute shaders Upgraded transform feedback Regrettably, the M1 doesn’t map well to any graphics standard newer than OpenGL ES 3.1. While Vulkan makes some of these features optional, the missing features are required to layer DirectX and OpenGL on top. No existing solution on M1 gets past the OpenGL 4.1 feature set. How do we break the 4.1 barrier? Without hardware support, new features need new tricks. Geometry shaders, tessellation, and transform feedback become compute shaders. Cull distance becomes a transformed interpolated value. Clip control becomes a vertex shader epilogue. The list goes on. For a taste of the challenges we overcame, let’s look at robustness. 
Built for gaming, GPUs traditionally prioritize raw performance over safety. Invalid application code, like a shader that reads a buffer out-of-bounds, can trigger undefined behaviour. Drivers exploit that to maximize performance. For applications like web browsers, that trade-off is undesirable. Browsers handle untrusted shaders, which they must sanitize to ensure stability and security. Clicking a malicious link should not crash the browser. While some sanitization is necessary as graphics APIs are not security barriers, reducing undefined behaviour in the API can assist “defence in depth”. “Robustness” features can help. Without robustness, out-of-bounds buffer access in a shader can crash. With robustness, the application can opt for defined out-of-bounds behaviour, trading some performance for less attack surface. All modern cross-vendor APIs include robustness. Many games even (accidentally?) rely on robustness. Strangely, the vendor’s proprietary API omits buffer robustness. We must do better for conformance, correctness, and compatibility. Let’s first define the problem. Different APIs have different definitions of what an out-of-bounds load returns when robustness is enabled: Zero (Direct3D, Vulkan with robustBufferAccess2) Either zero or some data in the buffer (OpenGL, Vulkan with robustBufferAccess) Arbitrary values, but can’t crash (OpenGL ES) OpenGL uses the second definition: return zero or data from the buffer. One approach is to return the last element of the buffer for out-of-bounds access. Given the buffer size, we can calculate the last index. Now consider the minimum of the index being accessed and the last index. That equals the index being accessed if it is valid, and some other valid index otherwise. Loading the minimum index is safe and gives a spec-compliant result. 
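The clamp-to-a-valid-index idea can be modeled outside the shader (a Python sketch of the logic with a made-up name; real drivers emit this as shader instructions, and GPU indices are unsigned, so a non-negative index and non-empty buffer are assumed):

```python
def robust_load(buffer, index):
    """OpenGL-style robust load: an out-of-bounds index returns some value
    from within the buffer (here, the last element) instead of crashing."""
    last = len(buffer) - 1           # last valid index, known from buffer size
    return buffer[min(index, last)]  # the single unsigned-min the shader adds

data = [10, 20, 30, 40]
print(robust_load(data, 2))    # in bounds: 30
print(robust_load(data, 999))  # out of bounds: clamped to the last element, 40
```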
As an example, a uniform buffer load without robustness might look like: load.i32 result, buffer, index Robustness adds a single unsigned minimum (umin) instruction: umin idx, index, last load.i32 result, buffer, idx Is the robust version slower? It can be. The difference should be small percentage-wise, as arithmetic is faster than memory. With thousands of threads running in parallel, the arithmetic cost may even be hidden by the load’s latency. There’s another trick that speeds up robust uniform buffers. Like other GPUs, the M1 supports “preambles”. The idea is simple: instead of calculating the same value in every thread, it’s faster to calculate once and reuse the result. The compiler identifies eligible calculations and moves them to a preamble executed before the main shader. These redundancies are common, so preambles provide a nice speed-up. We usually move uniform buffer loads to the preamble when every thread loads the same index. Since the size of a uniform buffer is fixed, extra robustness arithmetic is also moved to the preamble. The robustness is “free” for the main shader. For robust storage buffers, the clamping might move to the preamble even if the load or store cannot. Armed with robust uniform and storage buffers, let’s consider robust “vertex buffers”. In graphics APIs, the application can set vertex buffers with a base GPU address and a chosen layout of “attributes” within each buffer. Each attribute has an offset and a format, and the buffer has a “stride” indicating the number of bytes per vertex. The vertex shader can then read attributes, implicitly indexing by the vertex. To do so, the shader loads from the address base + (stride * vertex) + offset. Some hardware implements robust vertex fetch natively. Other hardware has bounds-checked buffers to accelerate robust software vertex fetch. Unfortunately, the M1 has neither. We need to implement vertex fetch with raw memory loads. One instruction set feature helps. 
In addition to a 64-bit base address, the M1 GPU’s memory loads also take an offset in elements. The hardware shifts the offset and adds it to the 64-bit base to determine the address to fetch. Additionally, the M1 has a combined integer multiply-add instruction, imad. Together, these features let us implement vertex loads in two instructions. For example, a 32-bit attribute load looks like:

```
imad     idx, stride/4, vertex, offset/4
load.i32 result, base, idx
```

The hardware load can perform an additional small shift. Suppose our attribute is a vector of 4 32-bit values, densely packed into a buffer with no offset. We can load that attribute in one instruction:

```
load.v4i32 result, base, vertex << 2
```

…with the hardware calculating the address:

base + 4 × (vertex << 2) = base + (16 × vertex)

What about robustness? We want to implement robustness with a clamp, like we did for uniform buffers. The problem is that the vertex buffer size is given in bytes, while our optimized load takes an index in “vertices”. A single vertex buffer can contain multiple attributes with different formats and offsets, so we can’t convert the size in bytes to a size in “vertices”. Let’s handle the latter problem. We can rewrite the addressing equation as:

(base + offset) + (stride × vertex)

That is: one buffer with many attributes at different offsets is equivalent to many buffers with one attribute and no offset. This gives an alternate perspective on the same data layout. Is this an improvement? It avoids an addition in the shader, at the cost of passing more data – addresses are 64-bit while attribute offsets are 16-bit. More importantly, it lets us translate the vertex buffer size in bytes into a size in “vertices” for each vertex attribute. Instead of clamping the offset, we clamp the vertex index. We still make full use of the hardware addressing modes, now with robustness:

```
umin       idx, vertex, last_valid
load.v4i32 result, base, idx << 2
```

We need to calculate the last valid vertex index ahead-of-time for each attribute. Each attribute has a format with a particular size.
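Ahead-of-time, the driver can compute that per-attribute bound with a little integer math. Here is a sketch in Python with illustrative names (a model of the logic described here, not actual driver code), assuming the attribute’s byte offset has already been folded into its base address:

```python
def attribute_clamp(buffer_size, offset, stride, format_size):
    """Compute the robust-vertex-fetch clamp for one attribute.
    Returns (use_zero_buffer, last_valid_vertex)."""
    avail = buffer_size - offset  # bytes usable by this attribute
    if avail < format_size:
        # Too small to load even one vertex: bind a small zero-filled
        # buffer instead and clamp every index to 0.
        return True, 0
    if stride == 0:
        # Zero stride: every vertex reads the same, valid element.
        return False, 0
    # Largest vertex satisfying stride * vertex + format_size <= avail.
    return False, (avail - format_size) // stride
```

For example, a 16-byte attribute at offset 0 with stride 16 in a 64-byte buffer yields a last valid vertex of 3: vertices 0 through 3 touch bytes 0 through 63, and vertex 4 would read past the end.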
Manipulating the addressing equation, we can calculate the last byte accessed in the buffer (plus 1) relative to the base:

(stride × vertex) + (size of the format)

The load is valid when that value is bounded by the buffer size in bytes. We solve the integer inequality as:

vertex ≤ (buffer size − size of the format) / stride

The driver calculates the right-hand side and passes it into the shader. One last problem: what if a buffer is too small to load anything? Clamping won’t save us – the code would clamp to a negative index. In that case, the attribute is entirely invalid, so we swap the application’s buffer for a small buffer of zeroes. Since we gave each attribute its own base address, this determination is per-attribute. Then clamping the index to zero correctly loads zeroes. Putting it together, a little driver math gives us robust buffers at the cost of one umin instruction.

In addition to buffer robustness, we need image robustness. Like its buffer counterpart, image robustness requires that out-of-bounds image loads return zero. That formalizes a guarantee that reasonable hardware already makes. …But it would be no fun if our hardware was reasonable. Running the conformance tests for image robustness, there is a single test failure affecting “mipmapping”. For background, mipmapped images contain multiple “levels of detail”. The base level is the original image; each successive level is the previous level downscaled. When rendering, the hardware selects the level closest to matching the on-screen size, improving efficiency and visual quality. With robustness, the specifications all agree that image loads return…

- Zero if the X- or Y-coordinate is out-of-bounds
- Zero if the level is out-of-bounds

Meanwhile, image loads on the M1 GPU return…

- Zero if the X- or Y-coordinate is out-of-bounds
- Values from the last level if the level is out-of-bounds

Uh-oh. Rather than returning zero for out-of-bounds levels, the hardware clamps the level and returns nonzero values. It’s a mystery why.
The vendor does not document their hardware publicly, forcing us to rely on reverse engineering to build drivers. Without documentation, we don’t know if this behaviour is intentional or a hardware bug. Either way, we need a workaround to pass conformance. The obvious workaround is to never load from an invalid level:

```
if (level <= levels) {
    return imageLoad(x, y, level);
} else {
    return 0;
}
```

That involves branching, which is inefficient. Loading an out-of-bounds level doesn’t crash, so we can speculatively load and then use a compare-and-select operation instead of branching:

```
vec4 data = imageLoad(x, y, level);
return (level <= levels) ? data : 0;
```

This workaround is okay, but it could be improved. While the M1 GPU has combined compare-and-select instructions, the instruction set is scalar. Each thread processes one value at a time, not a vector of multiple values. However, image loads return a vector of four components (red, green, blue, alpha). While the pseudo-code looks efficient, the resulting assembly is not:

```
image_load R, x, y, level
ulesel     R[0], level, levels, R[0], 0
ulesel     R[1], level, levels, R[1], 0
ulesel     R[2], level, levels, R[2], 0
ulesel     R[3], level, levels, R[3], 0
```

Fortunately, the vendor driver has a trick. We know the hardware returns zero if either X or Y is out-of-bounds, so we can force a zero output by setting X or Y out-of-bounds. As the maximum image size is 16384 pixels wide, any X greater than 16384 is out-of-bounds. That justifies an alternate workaround:

```
bool valid = (level <= levels);
int x_ = valid ? x : 20000;
return imageLoad(x_, y, level);
```

Why is this better? We only change a single scalar, not a whole vector, compiling to compact scalar assembly:

```
ulesel     x_, level, levels, x, #20000
image_load R, x_, y, level
```

If we preload the constant to a uniform register, the workaround is a single instruction. That’s optimal – and it passes conformance.

Blender “Wanderer” demo by Daniel Bystedt, licensed CC BY-SA.
  • Bastien Nocera: New and old apps on Flathub (2024/02/09 14:42)
3D Printing Slicers

I recently replaced my Flashforge Adventurer 3 printer that I had been using for a few years as my first printer with a BambuLab X1 Carbon, wanting a printer that was not a “project” so I could focus on modelling and printing. It's an investment, but my partner convinced me that I was using the printer often enough to warrant it, and told me to look out for Black Friday sales, which I did.

The hardware-specific slicer, Bambu Studio, was available for Linux, but only as an AppImage, with many people reporting crashes on startup, non-working video live view, and other problems that the hardware maker tried to work around by shipping separate AppImage variants for Ubuntu and Fedora.

After close to 150 patches to the upstream software (which, in hindsight, I could probably have avoided by compiling the C++ code with LLVM), I managed to “flatpak” the application and make it available on Flathub. It's reached 3k installs in about a month, which is quite a bit for a niche piece of software.

Note that if you click the “Donate” button on the Flathub page, it will take you to a page where you can ~~feed my transformed fossil fuel addiction~~ buy filament for repairs and printing perfectly fitting everyday items, rather than bulk importing them from the other side of the planet.

Preparing a Game Gear consoliser shell

I will continue to maintain the FlashPrint slicer for FlashForge printers, installed by nearly 15k users, although I have enabled automated updates now, and will not be updating the release notes, which required manual intervention. FlashForge have unfortunately never answered my queries about making this distribution of their software official (and fixing the crash when using a VPN...).

Rhythmbox

As I was updating the Rhythmbox Flatpak on Flathub, I realised that it just reached 250k installs, which puts the number of installations of those 3D printing slicers above into perspective.
The updated screenshot used on Flathub

Congratulations, and many thanks, to all the developers that keep on contributing to this very mature project, especially Jonathan Matthew who's been maintaining the app since 2008.
  • Tomeu Vizoso: Etnaviv NPU update 16: A nice performance jump (2024/02/08 09:36)
After the open-source driver for VeriSilicon's Vivante NPU was merged into Mesa two weeks ago, I have been taking some rest and thinking about what will come next.

Automated testing

I have a merge request to Mesa almost ready that will enable continuous integration testing on real hardware, but it depends on solving what seem to be problems with the power supplies of the boards in the HW testing lab. Collabora is graciously looking at it. Thanks!

Performance

I have been talking with quite a few people about the whole effort of bringing open-source to NPU hardware, and something that came up more than once is the question of reaching or surpassing the performance level of the proprietary drivers. It is a fair concern, because the systolic arrays will be underutilized if they are starved of data. And given how fast they are at performing the arithmetic operations, and how slow memory buses and chips on embedded are (relative to high-end GPUs, at least), this starving and the consequent underutilization are very likely to happen. IP vendors go to great lengths to prevent that from happening, inventing ways of getting the data faster to the processing elements, reducing the memory bandwidth used, and balancing the use of the different cores/arrays. There is plenty of published research in this area, which helps when figuring out how to make the most of a particular piece of hardware.

Weight compression

Something I started working on last week is compression of zero values in the weight buffers.
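The general idea of zero-run compression can be sketched in Python. This is purely illustrative: it assumes a small fixed-width run-length counter and a simple (zero_run, value) pair encoding; the actual Vivante bitstream layout is different and not shown here.

```python
def compress_zero_runs(weights, run_bits=5):
    """Run-length encode zeroes in a weight buffer as (zero_run, value)
    pairs: each pair means "zero_run zeroes, then this value".
    A sketch only, not the real hardware encoding."""
    max_run = (1 << run_bits) - 1  # longest run the counter can express
    out = []
    run = 0
    for w in weights:
        if w == 0 and run < max_run:
            run += 1
        else:
            out.append((run, w))
            run = 0
    if run:
        # Trailing zeroes: encode run-1 zeroes followed by a zero value.
        out.append((run - 1, 0))
    return out
```

With 90% sparsity, most weights collapse into the run counters, shrinking what has to cross the memory bus.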
Sparsity is very common in the neural models that this hardware is targeted to run, and common convolutions such as strided and depthwise can easily have zero ratios of 90% and more. By compressing consecutive zeroes in a buffer we can greatly reduce pressure on the memory bus, keeping the processing units better fed (though I'm sure we are still far from getting good utilization). By opportunistically using the 5 available bits to compress consecutive runs of zeroes, I was able to improve the performance of the MobileNetV1 model from 15.7 ms to 9.9 ms, and that of the SSDLite MobileDet model from 56.1 ms to 32.7 ms. As shown in the graph above, we still have quite some room for improvement before we reach the performance of the proprietary driver, but we are getting close pretty fast. I also believe that we can tailor the driver to users' needs to surpass the performance of the proprietary driver for specific models, as this is open-source and everybody can chip in, see how things are made and improve them.

IRC channel

I mentioned this in passing some time ago, but now that we have a driver at this level of usefulness, I think it is a good moment to remind people that we have an IRC channel in the OFTC network to discuss anything about doing accelerated machine learning on the edge with upstream open-source software: #ml-mainline. You can click here to join via a web interface, though I recommend setting up an account.

What next?

Should I continue working on performance? Enable more models for new use cases? Enable this driver on more SoCs (i.MX8MP and S905D3 look interesting)? Start writing a driver for a completely different IP, such as Rockchip's or Amlogic's? I still haven't decided, so if you have an opinion please drop a comment in this blog, or at any of the social networks linked from this blog. I'm currently available for contracting, so I should be able to get on your project full-time on short notice.
  • Nicolai Hähnle: Building a HIP environment from scratch (2024/02/07 11:30)
HIP is a C++-based, single-source programming language for writing GPU code. "Single-source" means that a single source file can contain both the "host code" which runs on the CPU and the "device code" which runs on the GPU. In a sense, HIP is "CUDA for AMD", except that HIP can actually target both AMD and Nvidia GPUs. If you merely want to use HIP, your best bet is to look at the documentation and download pre-built packages. (By the way, the documentation calls itself "ROCm" because that's what AMD calls its overall compute platform. It includes HIP, OpenCL, and more.) I like to dig deep, though, so I decided I want to build at least the user space parts myself to the point where I can build a simple HelloWorld using a Clang from upstream LLVM. It's all open-source, after all!

It's a bit tricky, though, in part because of the kind of bootstrapping problems you usually get when building toolchains: Running the compiler requires runtime libraries, at least by default, but building the runtime libraries requires a compiler. Luckily, it's not quite that difficult, though, because compiling the host libraries doesn't require a HIP-enabled compiler - any C++ compiler will do. And while the device libraries do require a HIP- (and OpenCL-)enabled compiler, it is possible to build code in a "freestanding" environment where runtime libraries aren't available.

What follows is pretty much just a list of steps with running commentary on what the individual pieces do, since I didn't find an equivalent recipe in the official documentation. Of course, by the time you read this, it may well be outdated. Good luck!

Components need to be installed, but installing into some arbitrary prefix inside your $HOME works just fine. Let's call it $HOME/prefix. All packages use CMake and can be built using invocations along the lines of:

```
cmake -S . -B build -GNinja -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DCMAKE_INSTALL_PREFIX=$HOME/prefix -DCMAKE_PREFIX_PATH=$HOME/prefix
ninja -C build install
```

In some cases, additional variables need to be set.

Step 1: clang and lld

We're going to need a compiler and linker, so let's get llvm/llvm-project and build it with Clang and LLD enabled:

```
-DLLVM_ENABLE_PROJECTS='clang;lld' -DLLVM_TARGETS_TO_BUILD='X86;AMDGPU'
```

Building LLVM is an art of its own which is luckily reasonably well documented, so I'm going to leave it at that.

Step 2: Those pesky cmake files

Build and install ROCm/rocm-cmake to avoid cryptic error messages down the road when building other components that use those CMake files without documenting the dependency clearly. Not rocket science, but man am I glad for GitHub's search function.

Step 3: libhsa-runtime64.so

This is the lowest level user space host-side library in the ROCm stack. Its services, as far as I understand them, include setting up device queues and loading "code objects" (device ELF files). All communication with the kernel driver goes through here. Notably though, this library does not know how to dispatch a kernel! In the ROCm world, the so-called Architected Queueing Language is used for that. An AQL queue is set up with the help of the kernel driver (and that does go through this library), and then a small ring buffer and a "door bell" associated with the queue are mapped into the application's virtual memory space. When the application wants to dispatch a kernel, it (or rather, a higher-level library that links against this one) writes an AQL packet into the ring buffer and "rings the door bell", which basically just means writing a new ring buffer head pointer to the door bell's address.
The door bell virtual memory page is mapped to the device, so ringing the door bell causes a PCIe transaction (for us peasants; MI300A has slightly different details under the hood) which wakes up the GPU. Anyway, libhsa-runtime64.so comes in two parts for what I am being told are largely historical reasons:

- ROCm/ROCT-Thunk-Interface
- ROCm/ROCR-Runtime; this one has one of those bootstrap issues and needs a -DIMAGE_SUPPORT=OFF

The former is statically linked into the latter...

Step 4: It which must not be named

For Reasons(tm), there is a fork of LLVM in the ROCm ecosystem, ROCm/llvm-project. Using upstream LLVM for the compiler seems to be fine and is what I as a compiler developer obviously want to do. However, this fork has an amd directory with a bunch of pieces that we'll need. I believe there is a desire to upstream them, but also an unfortunate hesitation from the LLVM community to accept something so AMD-specific. In any case, the required components can each be built individually against the upstream LLVM from step 1:

- hipcc; this is a frontend for Clang which is supposed to be user-friendly, but at the cost of adding an abstraction layer. I want to look at the details under the hood, so I don't want to and don't have to use it; but some of the later components want it
- device-libs; as the name says, these are libraries of device code. I'm actually not quite sure what the intended abstraction boundary is between this one and the HIP libraries from the next step. I think these ones are meant to be tied more closely to the compiler so that other libraries, like the HIP library below, don't have to use __builtin_amdgcn_* directly? Anyway, just keep on building...
- comgr; the "code object manager". Provides a stable interface to LLVM, Clang, and LLD services, up to (as far as I understand it) invoking Clang to compile kernels at runtime. But it seems to have no direct connection to the code-related services in the rest of the stack.

This last one is annoying.
It needs a -DBUILD_TESTING=OFF. Worse, it has a fairly large interface with the C++ code of LLVM, which is famously not stable. In fact, at least during my little adventure, comgr wouldn't build as-is against the LLVM (and Clang and LLD) build that I got from step 1. I had to hack out a little bit of code in its symbolizer. I'm sure it's fine.

Step 5: libamdhip64.so

Finally, here comes the library that implements the host-side HIP API. It also provides a bunch of HIP-specific device-side functionality, mostly by leaning on the device-libs from the previous step. It lives in ROCm/clr, which stands for either Compute Language Runtimes or Common Language Runtime. Who knows. Either one works for me. It's obviously for compute, and it's common because it also contains OpenCL support. You also need ROCm/HIP at this point. I'm not quite sure why stuff is split up into so many repositories. Maybe ROCm/HIP is also used when targeting Nvidia GPUs with HIP, but ROCm/CLR isn't? Not a great justification in my opinion, but at least this is documented in the README. CLR also needs a bunch of additional CMake options:

```
-DCLR_BUILD_HIP=ON -DHIP_COMMON_DIR=${checkout of ROCm/HIP} -DHIPCC_BIN_DIR=$HOME/prefix/bin
```

Step 6: Compiling with Clang

We can now build simple HIP programs with our own Clang against our own HIP and ROCm libraries:

```
clang -x hip --offload-arch=gfx1100 --rocm-path=$HOME/prefix \
  -rpath $HOME/prefix/lib -lstdc++ HelloWorld.cpp
LD_LIBRARY_PATH=$HOME/prefix/lib ./a.out
```

Neat, huh?
  • Robert McQueen: Flathub: Pros and Cons of Direct Uploads (2024/02/06 10:57)
    I attended FOSDEM last weekend and had the pleasure to participate in the Flathub / Flatpak BOF on Saturday. A lot of the session was used up by an extensive discussion about the merits (or not) of allowing direct uploads versus building everything centrally on Flathub’s infrastructure, and related concerns such as automated security/dependency scanning. My original motivation behind the idea was essentially two things. The first was to offer a simpler way forward for applications that use language-specific build tools that resolve and retrieve their own dependencies from the internet. Flathub doesn’t allow network access during builds, and so a lot of manual work and additional tooling is currently needed (see Python and Electron Flatpak guides). And the second was to offer a maybe more familiar flow to developers from other platforms who would just build something and then run another command to upload it to the store, without having to learn the syntax of a new build tool. There were many valid concerns raised in the room, and I think on reflection that this is still worth doing, but might not be as valuable a way forward for Flathub as I had initially hoped. Of course, for a proprietary application where Flathub never sees the source or where it’s built, whether that binary is uploaded to us or downloaded by us doesn’t change much. But for an FLOSS application, a direct upload driven by the developer causes a regression on a number of fronts. We’re not getting too hung up on the “malicious developer inserts evil code in the binary” case because Flathub already works on the model of verifying the developer and the user makes a decision to trust that app – we don’t review the source after all. But we do lose other things such as our infrastructure building on multiple architectures, and visibility on whether the build environment or upload credentials have been compromised unbeknownst to the developer. 
There is now a manual review process for when apps change their metadata such as name, icon, license and permissions – which would apply to any direct uploads as well. It was suggested that if only heavily sandboxed apps (eg no direct filesystem access without proper use of portals) were permitted to make direct uploads, the impact of such concerns might be somewhat mitigated by the sandboxing. However, it was also pointed out that my go-to example of “Electron app developers can upload to Flathub with one command” was also a bit of a fiction. At present, none of them would pass that stricter sandboxing requirement. Almost all Electron apps run old versions of Chromium with less complete portal support, needing sandbox escapes to function correctly, and Electron (and Chromium’s) sandboxing still needs additional tooling/downstream patching to run inside a Flatpak. Buh-boh. I think for established projects who already ship their own binaries from their own centralised/trusted infrastructure, and for developers who have understandable sensitivities about binary integrity such as encryption, password or financial tools, it’s a definite improvement that we’re able to set up direct uploads with such projects with less manual work. There are already quite a few applications – including verified ones – where the build recipe simply fetches a binary built elsewhere and unpacks it, and if this is already done centrally by the developer, repeating the exercise on Flathub’s server adds little value. However, for the individual developer experience, I think we need to zoom out a bit and think about how to improve this from a tools and infrastructure perspective as we grow Flathub, and as we seek to raise funds from different sources for these improvements.
I took notes for everything that was mentioned as a tooling limitation during the BOF, along with a few ideas about how we could improve things, and hope to share these soon as part of an RFP/RFI (Request For Proposals/Request for Information) process. We don’t have funding yet but if we have some prospective collaborators to help refine the scope and estimate the cost/effort, we can use this to go and pursue funding opportunities.
  • Dave Airlie (blogspot): anv: vulkan av1 decode status (2024/02/05 03:16)
Vulkan Video AV1 decode has been released, and I had some partly working support on the Intel ANV driver previously, but I let it lapse. The branch is currently [1]. It builds, but is totally untested; I'll get some time next week to plug in my DG2 and see if I can persuade it to decode some frames. Update: the current branch decodes one frame properly; reference frames need more work, unfortunately. [1]
  • Dave Airlie (blogspot): radv: vulkan av1 video decode status (2024/02/02 02:27)
The Khronos Group announced VK_KHR_video_decode_av1 [1]; this extension adds AV1 decoding to the Vulkan specification. There is a radv branch [2] and merge request [3]. I did some AV1 work on this in the past, but I need to take some time to see if it has made any progress since. I'll post an ANV update once I figure that out. This extension is one of the ones I've been wanting for a long time, since having a royalty-free codec is something I can actually care about and ship, as opposed to the painful ones. I started working on a Mesa extension for this a year or so ago with Lynne from the ffmpeg project and we made great progress with it. We submitted that to Khronos and it has gone through the committee process and been refined and validated amongst the hardware vendors. I'd like to say thanks to Charlie Turner and Igalia for taking over a lot of the porting to the Khronos extension and fixing up bugs that their CTS development brought up. This is a great feature of having open source drivers: it allows a much quicker turnaround time on bug fixes when devs can fix them themselves! [1] [2] [3]
  • Bastien Nocera: Re: New responsibilities (2024/01/31 11:33)
A few months have passed since New Responsibilities was posted, so I thought I would provide an update.

Projects Maintenance

Of all the freedesktop projects I created and maintained, only one doesn't have a new maintainer: low-memory-monitor. This daemon is what the GMemoryMonitor GLib API is based on, so it can't be replaced trivially. Efforts seem to be under way to replace it with systemd APIs. As for the other daemons:

- switcheroo-control got picked up by Jonas Ådahl, one of the mutter maintainers. I'm looking forward to seeing this merge request fixed so we can have better menu items on dual-GPU systems
- iio-sensor-proxy added Dylan Van Assche to its maintenance team, assisting Guido Günther.
- power-profiles-daemon is now maintained by Marco Trevisan. It recently got support for separate system and CPU power profiles, and display power saving features are in the works.

(As an aside, there's posturing towards replacing power-profiles-daemon with tuned in Fedora. I would advise stakeholders to figure out whether having a large Python script in the boot hot path is a good idea, taking a look at bootcharts, and then thinking about whether hardware manufacturers would be able to help with supporting a tool with so many moving parts. Useful for tinkering, not for shipping in a product.)

Updated responsibilities

Since mid-August, I've joined the Platform Enablement Team. Right now, I'm helping out with maintenance of the Bluetooth kernel stack in RHEL (and thus CentOS). The goal is to eventually pivot to hardware enablement, which is likely to involve backporting and testing, more so than upstream enablement.
This is currently dependent on attending some formal kernel development (and debugging) training sessions, which should make it easier to see where my hodge-podge kernel knowledge stands.

Blog backlog

Before being moved to a different project, and apart from the usual and very time-consuming bug triage, user support and project maintenance, I also worked on a few new features. I have a few posts planned that will lay that out.
  • Peter Hutterer: New 🚯 emoji-based spamfighting abilities (2024/01/29 07:58)
This is a follow-up from our Spam-label approach, but this time with MOAR EMOJIS because that's what the world is turning into. Since March 2023, projects could apply the "Spam" label on any new issue and have a magic bot come in and purge the user account plus all issues they've filed; see the earlier post for details. This works quite well and gives every project member the ability to quickly purge spam. Alas, pesky spammers are using other approaches to trick Google into indexing their pork [1] (because at this point I think all this crap is just SEO spam anyway), such as commenting on issues and merge requests. We can't apply labels to comments, so we found a way to work around that: emojis! In GitLab you can add "reactions" to issue/merge request/snippet comments, and in recent GitLab versions you can register for a webhook to be notified when that happens. So what we've added to the instance is support for the :do_not_litter: (🚯) emoji [2] - if you set that on a comment, the author of said comment will be blocked and the comment content will be removed. After some safety checks, of course, so you can't just go around blocking everyone by shotgunning emojis into GitLab. Unlike the "Spam" label, this does not currently work recursively, so it's best to report the user so admins can purge them properly - ideally before setting the emoji, so the abuse report contains the actual spam comment instead of the redacted one. Also note that there is a 30 second grace period to quickly undo the emoji if you happen to set it accidentally. Note that for purging issues, the "Spam" label is still required; the emojis only work for comments. Happy cleanup! [1] or pork-ish [2] Benjamin wanted to use :poop: but there's a chance that may get used for expressing disagreement with the comment in question
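The bot's decision on such a webhook event might look roughly like this sketch. The function, payload shape, and field names here are assumptions for illustration only; GitLab's real emoji-event payload and the freedesktop.org bot's safety checks are more involved.

```python
def handle_emoji_event(event):
    """Decide what to do for an emoji webhook event (illustrative only;
    the payload shape is an assumption, not GitLab's documented schema).
    Returns a dict of actions, or None to ignore the event."""
    award = event["object_attributes"]
    if award.get("name") != "do_not_litter":
        return None  # only the 🚯 emoji triggers moderation
    if award.get("awardable_type") != "Note":
        return None  # only comments (notes), not issues themselves
    # The real bot runs safety checks first, then blocks the author
    # and redacts the comment; here we just report the intent.
    note_id = award["awardable_id"]
    return {"block_author_of_note": note_id, "redact_note": note_id}
```

A grace period and the recursive purge described above would sit on top of a check like this, not inside it.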
  • Hans de Goede: A fully open source stack for MIPI cameras (2024/01/26 16:48)
Many recent Intel laptops have replaced the standard UVC USB camera module with a raw MIPI camera sensor connected to the IPU6 found in recent Intel laptop chips. Both the hardware interface of the ISP part of the IPU6 as well as the image processing algorithms used are considered a trade secret, and so far the only Linux support for the IPU6 relies on an out-of-tree kernel driver with a proprietary userspace stack on top, which is currently available in rpmfusion. Both Linaro and Red Hat have identified the missing ISP support for various ARM and X86 chips as a problem. Linaro has started a project to add a SoftwareISP component to libcamera to allow these cameras to work without needing proprietary software, and Red Hat has joined Linaro in working on this.

FOSDEM talk

Bryan O'Donoghue (Linaro) and I are giving a talk about this at FOSDEM.

Fedora COPR repository

This work is at a point now where it is ready for wider testing. A Fedora COPR repository with a patched kernel and libcamera is now available for users to test; see the COPR page for install and test instructions. This has been tested on the following devices:

- Lenovo ThinkPad X1 Yoga gen 8 (should work on any ThinkPad with the ov2740 sensor)
- Dell Latitude 9420 (ov01a1s sensor)
- HP Spectre x360 13.5 (2023 model, hi556 sensor)

Description of the stack

- Kernel driver for the camera sensor: for the ov2740 used on current Lenovo designs (excluding MTL) I have landed all necessary kernel changes for this upstream.
- Kernel support for the CSI receiver part of the IPU6: Intel is working on upstreaming this and has recently posted v3 of their patch series upstream, and this is under active review.
- A FOSS software ISP stack inside libcamera to replace the missing IPU6 ISP (processing-system/psys) support. Work on this is under way. I've recently sent out v2 of the patch series for this.
- Firefox pipewire camera support and support for the camera portal to get permission to access the camera.
My colleague Jan Grulich has been working on this, see Jan's blogpost. Jan's work has landed in the just released Firefox 122.
  • Tomeu Vizoso: Etnaviv NPU update 15: We are upstream! (2024/01/24 10:52)
Today the initial merge request for Teflon was merged into Mesa, along with the first hardware driver, for VeriSilicon's Vivante NPU. For those who don't know, Teflon is a TensorFlow Lite delegate that aims to support several AI accelerators (also called NPUs, TPUs, APUs, NNAs, etc). Teflon is and will always be open-source, and is released under the MIT license. This will have the following advantages for the project:

- The userspace driver will be automatically packaged by distros such as Debian, Ubuntu, Fedora and Yocto when they update to the next stable version, 24.1.0, which should be out around May 2024. See the release calendar.
- Contribution to the project will happen within the development process of Mesa. This is a well-established process in which employees from companies such as Google, Valve, Imagination, Intel, Microsoft and AMD work together on their GPU drivers.
- The project has great technical infrastructure, maintained by awesome sysadmins: a well-maintained GitLab instance; extensive CI, for both build and runtime testing, on real hardware; mailing list, web server, etc.

More importantly, the Mesa codebase also has infrastructure that will be very useful to NPU drivers:

- The NIR intermediate representation with loads of lowering passes. This will be immediately useful for lowering operations in models to programmable cores, but in the future I want to explore representing whole models with this, for easier manipulation and lowerings.
- The Gallium internal API that decouples HW-specific frontends from HW-specific drivers. This will be critical as we add support for more NPUs, and also when we expose to other frameworks such as Android NNAPI.

And lastly, Mesa is part of a great yearly conference that allows contributors to discuss their work with others in a high-bandwidth environment: XDC.

The story so far

In 2022, while still at Collabora, I started adding OpenCL support to the Etnaviv driver in Mesa.
Etnaviv is a userspace and kernel driver for VeriSilicon's Vivante NPUs. The goal was to accelerate machine learning workloads, but once I left Collabora to focus on the project and had implemented enough of the OpenCL specification to run a popular object classification model, I realized that there was no way I was ever going to get close to the performance of the proprietary driver by using the programmable part of the NPU. I dug a bit deeper into how the proprietary driver was doing its thing and realized that almost all operations weren't running as shaders, but on "fixed-function" hardware units (systolic arrays, as I realized later). Fortunately, all these accelerators that support matrix multiplications as individual instructions are very similar in their fundamentals, and the state of the art has been well documented in scientific publications since Google released their first TPU. With all this wealth of information and with the help of VeriSilicon's own debugging output and open-source kernel driver, I had a very good start at reverse engineering the hardware. The rest was done by observing how the proprietary userspace driver interacted with the kernel, with the help of existing tools from the Etnaviv project and others that I wrote, and by staring for long hours at all the produced data in spreadsheets. During the summer, and with Libre Computer's sponsorship, I chipped away at documenting the interface to the convolution units and implementing support for them in my Mesa branch. By autumn I was able to run that same object classification model (MobileNet V1) 3 times faster than the CPU could.
A month later I learned to use the other systolic array in the NPU, for tensor manipulation operations, and got it running 6 times faster than the CPU and only twice as slow as the proprietary driver. Afterwards I got to work on object detection models, and by the start of 2024 I managed to run SSDLite MobileDet at 56 milliseconds per inference, which is around 3 times slower than what the proprietary driver achieves, but still pretty darn useful in many situations! The rest of the time until now has been spent polishing the driver, improving its test suite and reacting to code reviews from the Mesa community. Next steps: Now that the codebase is part of upstream Mesa, my work will progress in smaller batches, and I expect to be spending time reviewing other people's contributions and steering the project. People want to get this running on other variants of the VeriSilicon NPU IP, and I am certainly not going to be able to do it all! I also know of people wanting to put this together with other components in demos and solutions, so I will be supporting them so we can showcase the usefulness of all this. There are some other use cases that this hardware is well-suited for, such as more advanced image classification, pose estimation, audio classification, depth estimation, and image segmentation. I will be looking at what the most useful models require in terms of operations and implementing them. There is quite some low-hanging fruit for improving performance, so I expect to be implementing support for zero-compression, more advanced tiling, better use of the SRAM in the device, and a few others. And at some point I should start looking at other NPU IP to add support to.
The ones I'm currently leaning the most towards are RockChip's own IP, Mediatek's, Cadence's and Amlogic's. Thanks: One doesn't just start writing an NPU driver on one's own, even less so without any documentation, so I need to thank the following people who have helped me greatly in this effort: Collabora, for allowing me to start playing with this while I still worked with them. Libre Computer, and specifically Da Xue, for supporting me financially for most of 2023. They are a very small company, so I really appreciate that they believed in the project and put aside some money so I could focus on it. Igalia, for letting Christian Gmeiner spend time reviewing all my code and answering my questions about Etnaviv. Embedded Recipes, for giving me the opportunity to present my work last autumn in Paris. Lucas Stach from Pengutronix, for answering my questions and listening to my problems when I suspected something was up in the Etnaviv kernel driver. Neil Armstrong from Linaro, for supporting me in the hardware enablement of the NPU driver on the Amlogic SoCs. And a collective thanks to the DRI/Mesa community for being so awesome!
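As a quick sanity check on the performance figures quoted in this post, a little arithmetic (plain Python, using only the numbers stated above; this is an illustrative back-of-envelope calculation, not benchmark data) recovers the implied proprietary-driver inference time and the open driver's throughput:

```python
# Back-of-envelope arithmetic on figures quoted in the post (illustrative only).
npu_ms = 56.0                  # SSDLite MobileDet on the open driver, per inference
slowdown_vs_proprietary = 3.0  # "around 3 times slower than the proprietary driver"

proprietary_ms = npu_ms / slowdown_vs_proprietary  # implied proprietary time
open_driver_ips = 1000.0 / npu_ms                  # inferences per second

print(f"implied proprietary driver time: ~{proprietary_ms:.1f} ms/inference")
print(f"open driver throughput: ~{open_driver_ips:.1f} inferences/s")
```

At roughly 18 inferences per second, the open driver is already usable for many live object detection setups, which is consistent with the post's "still pretty darn useful" assessment.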
  • Samuel Iglesias: XDC 2023: Behind the curtains (2024/01/22 09:06)
    Time flies! Back in October, Igalia organized X.Org Developers Conference 2023 in A Coruña, Spain. In case you don’t know it, the X.Org Developers Conference, despite the X.Org in the name, is a conference for all developers working on the open-source graphics stack: anything related to DRM/KMS, Mesa, X11 and Wayland compositors, etc. This year, I participated in the organization of XDC in A Coruña, Spain (again!) by taking care of different aspects: from logistics at the venue (Palexco) to running it in person. It was a very tiring but fulfilling experience. Sponsors: First of all, I would like to thank all the sponsors for their support, as without them, this conference wouldn’t happen. They didn’t only give economic support to the conference: Igalia sponsored the welcome event and lunches; X.Org Foundation sponsored coffee breaks; the Tourism Office of A Coruña sponsored the guided tour of the city center; and Raspberry Pi sent Raspberry Pi 5 boards to all speakers! XDC 2023 stats: XDC 2023 was a success in attendance and talk submissions. Here you have some stats: 📈 160 registered attendees. 👬 120 attendees picked up their badge in person. 💻 25 attendees registered as virtual. 📺 More than 6,000 views on the live stream. 📝 55 talks/workshops/demos distributed over three days of conference. 🧗‍♀️ There were 3 social events: the welcome event, the city center guided tour, and one unofficial climbing activity! Was XDC 2023 perfect organization-wise? Of course… no! Like in any event, we had some issues here and there: one with the Wi-Fi network that was quickly detected and fixed; some issues with the meals and coffee breaks (mainly food allergies); a few seconds of lost audio in the live stream of one talk; and other minor things. Not bad for a community-run event! Nevertheless, I would like to thank all the staff at Palexco for their quick response and their understanding. Talk recordings & slides: Want to see some talks again?
All conference recordings were uploaded to the X.Org Foundation YouTube channel. Slides are available for download in each talk description. Enjoy! XDC 2024: We cannot yet say where XDC 2024 is going to happen, other than that it will be in North America… but I can tell you that this will be announced soon. Stay tuned! Want to organize XDC 2025 or XDC 2026? If we continue with the current cadence, 2025 would again be in Europe, and the 2026 event would be in North America. There is a list of requirements here. Nevertheless, feel free to contact me or the X.Org Board of Directors to get first-hand experience and knowledge about what organizing XDC entails. Thanks: Thanks to all volunteers, collaborators, Palexco staff, GPUL, X.Org Foundation and many other people for their hard work. Special thanks to my Igalia colleague Chema, who did an outstanding job organizing the event together with me. Thanks to the sponsors for their extraordinary support of this conference. Thanks to Igalia not only for sponsoring the event, but also for all the support I got during the past year. I am glad to be part of this company, and I am always surprised by how great my colleagues are. And last, but not least, thanks to all speakers and attendees. Without you, the conference wouldn’t exist. See you at XDC 2024!
  • Simon Ser: Status update, January 2024 (2024/01/20 22:00)
    Hi! This month has been pretty hectic due to the SourceHut network outage. All of us on the staff team invested a lot of time and energy to minimize the downtime as much as possible. Thankfully things have settled down now; there are still a lot of follow-up tasks to complete, but with less urgency. I’m really grateful for the community’s reaction, everybody has been very understanding and supportive. Thank you! In other SourceHut news, I’ve been working on yojo, a bridge which provides CI for Codeberg projects via SourceHut’s build service. I’ve added support for pull requests, taught yojo to handle multiple manifests, added logic to automatically refresh access tokens before they expire, and fixed a bunch of bugs. The NPotM is a docker-compose configuration for SourceHut. It provides an easy way to spin up a SourceHut development environment without having to set up each service and its dependencies individually. I hope this project can reduce friction for new SourceHut contributors. There are many services missing, patches welcome! This month, we’ve finally merged the Sway pull request to use the wlroots scene-graph API! This is exciting because it fixes a whole class of bugs, it removes a lot of hand-rolled logic in Sway (e.g. rendering, damage tracking, input event routing, direct scan-out, some of the protocol support…), it provides nice performance optimizations via culling (e.g. the background image is no longer painted if a web browser is covering it), and it unlocks upcoming performance optimizations (e.g. KMS plane offloading). Many thanks to Alexander for writing the patches and maintaining them for over a year, and to Kirill for pushing it over the finish line! On the wlroots side, my work on wlr_surface_synced has been merged, allowing us to latch surface commits until an arbitrary condition is met.
This work is necessary for the upcoming explicit synchronization protocol, as well as the work-in-progress transactions protocol, and for avoiding compositor freezes when a client is very slow to render. We’ve released wlroots 0.17.1, with a collection of bugfixes backported by Simon Zeni. Last, we’ve dropped support for the legacy wl_drm protocol by default, and this caused a bit of breakage here and there (xserver, libva, amdvlk). We do really want to phase out wl_drm though, so we’ve decided to stick with that removal. This month’s collection of miscellaneous project updates includes go-imap v2 alpha 8 with separate types for sequence numbers and UIDs, which was a lot of work to get right but I think was worth it. I’ve also released go-maildir v0.4.0 with a new Walk function (to iterate over messages without allocating a list) and numerous fixes. I’ve sent a GitLab CLI patch to fix invalid release asset links for third-party GitLab instances, and a Meson patch to add C23 support. See you next month!
  • Matthias Klumpp: Wayland really breaks things… Just for now? (2024/01/11 16:24)
    This post is in part a response to an aspect of Nate’s post “Does Wayland really break everything?“, but also my reflection on discussing Wayland protocol additions, a unique pleasure that I have been involved with for the past months[1]. Some facts: Before I start I want to make a few things clear: The Linux desktop will be moving to Wayland[2] – this is a fact at this point (and has been for a while); sticking to X11 makes no sense for future projects. From reading Wayland protocols and working with them at a much lower level than I ever wanted to, it is also very clear to me that Wayland is an exceptionally well-designed core protocol, and so are the additional extension protocols (xdg-shell & Co.). The modularity of Wayland is great; it gives it incredible flexibility and will for sure turn out to be good for the long-term viability of this project (and also provides a path to correct protocol issues in the future, if any are found). In other words: Wayland is an amazing foundation to build on, and a lot of its design decisions make a lot of sense! The shift towards people seeing “Linux” more as an application developer platform, and taking PipeWire and XDG Portals into account when designing for Wayland, is also an amazing development and I love to see this – this holistic approach is something I always wanted! Furthermore, I think Wayland removes a lot of functionality that shouldn’t exist in a modern compositor – and that’s a good thing too! Some of X11’s features and design decisions had clear drawbacks that we shouldn’t replicate. I highly recommend reading Nate’s blog post; it’s very good and goes into more detail. And due to all of this, I firmly believe that any advancement in the Wayland space must come from within the project. But! Of course there was a “but” coming – I think that while developing Wayland-as-an-ecosystem we are now entrenched in narrow concepts of how a desktop should work.
While discussing Wayland protocol additions, a lot of concepts clash; people from different desktops with different design philosophies debate the merits of those over and over again, never reaching any conclusion (just as you will never get an answer out of humans on whether sushi or pizza is the clearly superior food, or whether CSD or SSD is better). Some people want to use Wayland as a vehicle to force applications to submit to their desktop’s design philosophies, others prefer the smallest and leanest protocol possible, and other developers want the most elegant behavior possible. To be clear, I think those are all very valid approaches. But this also creates problems: By switching to Wayland compositors, we are already forcing a lot of porting work onto toolkit developers and application developers. This is annoying, but just work that has to be done. It becomes frustrating though if Wayland provides toolkits with absolutely no way to reach their goal in any reasonable way. For Nate’s Photoshop analogy: Of course Linux does not break Photoshop; it is Adobe’s responsibility to port it. But what if Linux was missing a crucial syscall that Photoshop needed for proper functionality, and Adobe couldn’t port it without that? In that case it becomes much less clear who is to blame for Photoshop not being available. A lot of Wayland protocol work is focused on the environment and design, while applications and the work to port them are often considered less. I think this happens because the overlap between application developers and developers of the desktop environments is not necessarily large, and the overlap with people willing to engage with Wayland upstream is even smaller. The combination of Windows developers porting apps to Linux and having involvement with toolkits or Wayland is pretty much nonexistent. So they have less of a voice.
A quick detour through the neuroscience research lab: I have been involved with Freedesktop, GNOME and KDE for an incredibly long time now (more than a decade), but my actual job (besides consulting for Purism) is that of a PhD candidate in a neuroscience research lab (working on the morphology of biological neurons and its relation to behavior). I am mostly involved with three research groups in our institute, which is about 35 people. Most of us do all our data analysis on powerful servers which we connect to using RDP (with KDE Plasma as desktop). Since I joined, I have been pushing the envelope a bit to extend Linux usage to data acquisition and regular clients, and to have our data acquisition hardware interface well with it. Linux brings some unique advantages for use in research, besides the obvious one of having every step of your data management platform introspectable with no black boxes left, a goal I value very highly in research (but this would be its own blogpost). In terms of operating system usage though, most systems are still Windows-based. Windows is what companies develop for, and what people use by default and are familiar with. The choice of operating system is very strongly driven by application availability, and WSL being really good makes this somewhat worse, as it removes the need for people to switch to a real Linux system entirely if there is the occasional software requiring it. Yet, we have a lot more Linux users than before, and use it in many places where it makes sense. I also developed novel data acquisition software that runs only on Linux and uses the abilities of the platform to their fullest extent. All of this resulted in me asking existing software and hardware vendors for Linux support a lot more often. The vendor-customer relationship in science is usually pretty good, and vendors do usually want to help out.
Same for open source projects, especially if you offer to do Linux porting work for them… But overall, the ease of use and availability of required applications and their usability rules supreme. Most people are not technically knowledgeable and just want to get their research done in the best way possible, getting the best results with the least amount of friction. KDE/Linux usage at a control station for a particle accelerator at Adlershof Technology Park, Germany, for reference (by 25 years of KDE)[3] Back to the point: The point of that story is this: GNOME, KDE, RHEL, Debian or Ubuntu: They all do not matter if the necessary applications are not available for them. And as soon as they are, the easiest-to-use solution wins. There are many facets of “easiest”: In many cases this is RHEL due to Red Hat support contracts being available, in many other cases it is Ubuntu due to its mindshare and ease of use. KDE Plasma is also frequently seen, as it is perceived as a bit easier to onboard Windows users with (among other benefits). Ultimately, it comes down to applications and 3rd-party support though. Here’s a dirty secret: In many cases, porting an application to Linux is not that difficult. The thing that companies (and FLOSS projects too!) struggle with and will calculate the merits of carefully in advance is whether it is worth the support cost as well as continuous QA/testing. Their staff will have to do all of that work, and they could spend that time on other tasks after all. So if they learn that “porting to Linux” not only means added testing and support, but also means having to choose between the legacy X11 display server that allows for 1:1 porting from Windows or the “new” Wayland compositors that do not support the same features they need, they will quickly consider it not worth the effort at all. I have seen this happen. Of course many apps use a cross-platform toolkit like Qt, which greatly simplifies porting.
But this just moves the issue one layer down, as now the toolkit needs to abstract Windows, macOS and Wayland. And Wayland does not contain features to do certain things, or does them very differently from e.g. Windows, so toolkits have no way to actually implement the existing functionality in a way that works on all platforms. So in Qt’s documentation you will often find texts like “works everywhere except for on Wayland compositors or mobile”[4]. Many missing bits or altered behaviors are just papercuts, but those add up. And if users have a worse experience, this will translate to more support work, or people not wanting to use the software on the respective platform. What’s missing? Window positioning: SDI applications with multiple windows are very popular in the scientific world. For data acquisition (for example with microscopes) we often have one monitor with control elements and one larger one with the recorded image. There are also other configurations where multiple signal modalities are acquired, and the experimenter aligns windows exactly the way they want and expects the layout to be stored and loaded upon reopening the application. Even in the image from Adlershof Technology Park above you can see this style of UI design, at mega-scale. Being able to pop out elements as windows from a single-window application to move them around freely is another frequently used paradigm, and immensely useful with these complex apps. It is important to note that this is not a legacy design, but in many cases an intentional choice – these kinds of apps work incredibly well on larger screens or many screens and are very flexible (you can have any window configuration you want, and switch between them using the (usually) great window management abilities of your desktop). Of course, these apps will work terribly on tablets and small form factors, but that is not the purpose they were designed for and nobody would use them that way.
I assumed for sure these features would be implemented at some point, but when it became clear that that would not happen, I created the ext-placement protocol, which had some good discussion but was ultimately rejected from the xdg namespace. I then tried another solution based on feedback, which turned out not to work for most apps, and have now proposed xdg-placement (v2) in an attempt to maybe still get some protocol done that we can agree on, exploring more options before pushing the existing protocol for inclusion into the ext Wayland protocol namespace. Meanwhile though, we can not port any application that needs this feature, while at the same time we are switching desktops and distributions to Wayland by default. Window position restoration: Similarly, a protocol to save & restore window positions was already proposed in 2018, 6 years ago now, but it has still not been agreed upon, and may not even help multiwindow apps in its current form. The absence of this protocol means that applications can not restore their former window positions, and the user has to move them to their previous place again and again. Meanwhile, toolkits can not adopt these protocols, and applications can not use them and can not be ported to Wayland without introducing papercuts. Window icons: Similarly, individual windows can not set their own icons, and not-installed applications can not have an icon at all because there is no desktop-entry file to load the icon from and no icon in the theme for them. You would think this is a niche issue, but for applications that create many windows, providing icons for them so the user can find them is fairly important. Of course it’s not the end of the world if every window has the same icon, but it’s one of those papercuts that make the software slightly less user-friendly. Even applications with fewer windows like LibrePCB are affected, so much so that they’d rather run their app through Xwayland for now.
I decided to address this after I was working on data analysis of image data in a Python virtualenv, where my code and the Python libraries used created lots of windows all with the default yellow “W” icon, making it impossible to distinguish them at a glance. This is xdg-toplevel-icon now, but of course it is an uphill battle where the very premise of needing this is questioned. So applications can not use it yet. Limited window abilities requiring specialized protocols: Firefox has a picture-in-picture feature, allowing it to pop out media from a media player as a separate floating window so the user can watch the media while doing other things. On X11 this is easily realized, but on Wayland the restrictions posed on windows necessitate a different solution. The xdg-pip protocol was proposed for this specialized use case, but it is also not merged yet. So this feature does not work as well on Wayland. Automated GUI testing / accessibility / automation: Automation of GUI tasks is a powerful feature, and so is the ability to auto-test GUIs. This is being worked on, with libei and wlheadless-run (and stuff like ydotool exists too), but we’re not fully there yet. Wayland is frustrating for (some) application authors: As you can see, there are valid applications and valid use cases that can not yet be ported to Wayland with the same feature range they enjoyed on X11, Windows or macOS. So, from an application author’s perspective, Wayland does break things quite significantly, because things that worked before can no longer work and Wayland (the whole stack) does not provide any avenue to achieve the same result. Wayland does “break” screen sharing, global hotkeys, gaming latency (via “no tearing”), etc.; however, for all of these there are solutions available that application authors can port to. And most developers will gladly do that work, especially since the newer APIs are usually a lot better and more robust.
But if you give application authors no path forward except “use Xwayland and be on emulation as a second-class citizen forever”, it just results in very frustrated application developers. For some application developers, switching to a Wayland compositor is like buying a canvas from the Linux shop that forces your brush to only draw triangles. But maybe for your avant-garde art, you need to draw a circle. You can approximate one with triangles, but it will never be as good as the artwork of your friends who got their canvases from the Windows or macOS art supply shop and have more freedom to create their art. Triangles are proven to be the best shape! If you are drawing circles you are creating bad art! Wayland, via its protocol limitations, forces a certain way to build application UX – often for the better, but also sometimes to the detriment of users and applications. The protocols are often fairly opinionated, a result of the lessons learned from X11. In any case though, it is the odd one out – Windows and macOS do not pose the same limitations (for better or worse!), and the effort to port to Wayland is orders of magnitude bigger, or sometimes, in the case of the multiwindow UI paradigm, impossible to achieve to the same level of polish. Desktop environments of course have a design philosophy that they want to push, and want applications to integrate as much as possible (same as macOS and Windows!). However, there are many applications out there, and pushing a design via protocol limitations will likely just result in fewer apps. The porting dilemma: I spent probably way too much time looking into how to get applications cross-platform and running on Linux, often talking to vendors (FLOSS and proprietary) as well. Wayland limitations aren’t the biggest issue by far, but they do start to come up now, especially in the scientific space with Ubuntu having switched to Wayland by default. For application authors there is often no way to address these issues.
Many scientists do not even understand why their Python script that creates some GUIs suddenly behaves weirdly because Qt is now using the Wayland backend on Ubuntu instead of X11. They do not know the difference and also do not want to deal with these details – even though they may be programmers as well, the real goal is not to fiddle with the display server, but to get to a scientific result somehow. Another issue is portability layers like Wine which need to run Windows applications as-is on Wayland. Apparently Wine’s Wayland driver has some heuristics to make window positioning work (and I am amazed by the work done on this!), but that can only go so far. A way out? So, how would we actually solve this? Fundamentally, this excessively long blog post boils down to just one essential question: Do we want to force applications to submit to a UX paradigm unconditionally, potentially losing out on application ports or keeping apps on X11 eternally, or do we want to throw them some rope to get as many applications ported over to Wayland, even though we might sacrifice some protocol purity? I think we really have to answer that to make the discussions on wayland-protocols a lot less grueling. This question can be answered at the wayland-protocols level, but even more so it must be answered by the individual desktops and compositors. If the answer for your environment turns out to be “Yes, we want the Wayland protocol to be more opinionated and will not make any compromises for application portability”, then your desktop/compositor should just immediately NACK protocols that add something like this and you simply shouldn’t engage in the discussion, as you reject the very premise of the new protocol: that it has any merit to exist and is needed in the first place. In this case contributors to Wayland and application authors also know where you stand, and a lot of debate is skipped.
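As an aside on the Qt-backend confusion mentioned earlier: there is at least a documented stopgap available today. Qt honors the QT_QPA_PLATFORM environment variable, so a script can force the xcb (X11) backend and keep running under Xwayland until the missing protocols land. A minimal sketch in Python (the commented-out GUI import is a hypothetical placeholder):

```python
import os

# Must be set before Qt is initialized; "xcb" selects Qt's X11 backend,
# which runs under Xwayland when the session is otherwise Wayland.
os.environ["QT_QPA_PLATFORM"] = "xcb"

# from PyQt5.QtWidgets import QApplication  # hypothetical GUI code would follow
print(os.environ["QT_QPA_PLATFORM"])
```

This is a workaround, not a fix: the application gets its old behavior back, but only by opting out of native Wayland entirely.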
Of course, if application authors want to support your environment, you are basically asking them now to rewrite their UI, which they may or may not do. But at least they know what to expect and how to target your environment. If the answer turns out to be “We do want some portability”, the next question obviously becomes where the line should be drawn and which changes are acceptable and which aren’t. We can’t blindly copy all X11 behavior; some porting work to Wayland is simply inevitable. Some written rules for that might be nice, but probably more importantly, if you agree fundamentally that there is an issue to be fixed, please engage in the discussions for the respective MRs! We for sure do not want to repeat X11 mistakes, and I am certain that we can implement protocols which provide the required functionality in a way that is a nice compromise, allowing applications a path forward into the Wayland future while also being as good as possible and improving upon X11. For example, the toplevel-icon proposal is already a lot better than anything X11 ever had. Relaxing ACK requirements for the ext namespace is also a good proposed administrative change, as it allows some compositors to add features they want to support to the shared repository more easily, while also not mandating them for others. In my opinion, it would allow for a lot less friction between the two different ideas of how Wayland protocol development should work. Some compositors could move forward and support more protocol extensions, while more restrictive compositors could support fewer things. Applications can detect supported protocols at launch and change their behavior accordingly (ideally even abstracted by toolkits). You may now say that a lot of apps are ported, so surely this issue can not be that bad. And yes, what Wayland provides today may be enough for 80-90% of all apps.
But what I hope the detour into the research lab has done is convince you that this smaller percentage of apps matters. A lot. And that it may be worthwhile to support them. To end on a positive note: When it came to porting concrete apps over to Wayland, the only real showstoppers so far[5] were the missing window-positioning and window-position-restore features. I encountered them when porting my own software, and I got the issue as feedback from colleagues and fellow engineers. In second place was UI testing and automation support; the window-icon issue was mentioned twice, but being a cosmetic issue it likely simply hurts people less and they can ignore it more easily. What this means is that the majority of apps are already fine, and many others are very, very close! A Wayland future for everyone is within our grasp! I will also bring my two protocol MRs to their conclusion for sure, because as application developers we need clarity on what the platform (either all desktops or even just a few) supports and will or will not support in future. And the only way to get something good done is by contribution and friendly discussion. Footnotes: [1] Apologies for the clickbait-y title – it comes with the subject. [2] When I talk about “Wayland” I mean the combined set of display server protocols and accepted protocol extensions, unless otherwise clarified. [3] I would have picked a picture from our lab, but that would have needed permission first. [4] Qt has awesome “platform issues” pages, like for macOS and Linux/X11, which help with porting efforts, but Qt doesn’t even list Linux/Wayland as a supported platform. There is some information though, like window geometry peculiarities, which aren’t particularly helpful when porting (but still essential to know). [5] Besides issues with Nvidia hardware – CUDA for simulations and machine learning is pretty much everywhere, so Nvidia cards are common, which still causes trouble on Wayland. It is improving though.
  • Maira Canal: Introducing CPU jobs to the Raspberry Pi (2024/01/11 13:30)
    Igalia is always working hard to improve the 3D rendering drivers of the Broadcom VideoCore GPU, found in Raspberry Pi devices. One of our most recent efforts in this sense was moving the implementation of CPU jobs from the Vulkan driver to the V3D kernel driver. What are CPU jobs and why do we need them? In the V3DV driver, there are some Vulkan commands that cannot be performed by the GPU alone, so we implement those as CPU jobs in Mesa. A CPU job is a job that requires CPU intervention to be performed. For example, in the Broadcom VideoCore GPUs, we don’t have a way to calculate a timestamp. But we need the timestamp for Vulkan timestamp queries. Therefore, we need to calculate the timestamp on the CPU. A CPU job in userspace also implies CPU stalling. Sometimes, we need to hold part of the command submission flow in order to correctly synchronize execution. This waiting period causes the CPU to stall, preventing the continuous submission of jobs to the GPU. To mitigate this issue, we decided to move the CPU job mechanisms from the V3DV driver to the V3D kernel driver. In the V3D kernel driver, we have different kinds of jobs: RENDER jobs, BIN jobs, CSD jobs, TFU jobs, and CLEAN CACHE jobs. For each of those jobs, we have a DRM scheduler instance that helps us synchronize the jobs. If you want to know more about the different kinds of V3D jobs, check out this November Update: Exploring V3D blogpost, where I explain more about all the V3D IOCTLs and jobs. Jobs of the same kind are submitted, dispatched, and processed in the order they were submitted, using a standard first-in-first-out (FIFO) queue. We can synchronize different jobs across different queues using DRM syncobjs. More about the V3D synchronization framework and user extensions can be learned in this two-part blog post from Melissa Wen. From the kernel documentation, DRM syncobjs (synchronisation objects) are containers for stuff that helps sync up GPU commands. 
They’re super handy because you can use them in your own programs, share them with other programs, and even use them across different DRM drivers. Mostly, they’re used for making Vulkan fences and semaphores work. By moving the CPU job from userspace to the kernel, we can make use of the DRM scheduler queues and all the advantages they bring. For this, we created a new type of job in the V3D kernel driver, a CPU job, which also means creating a new DRM scheduler instance and a CPU job queue. Now, instead of stalling the submission thread waiting for the GPU to go idle, we can use DRM syncobjs to synchronize both CPU and GPU jobs in a submission, providing more efficient usage of the GPU. How did we implement the CPU jobs in the kernel driver? After we decided to have a CPU job implementation in kernel space, we could think about two possible implementations for this job: creating an IOCTL for each type of CPU job, or using a user extension to provide polymorphic behavior to a single CPU job IOCTL. We have different types of CPU jobs (indirect CSD jobs, timestamp query jobs, copy query results jobs…) and each of them shares a common infrastructure of allocation and synchronization but performs different operations. Therefore, we decided to go with the option to use user extensions. In Melissa’s blogpost, she digs deep into the implementation of generic IOCTL extensions in the V3D kernel driver. But, to put it simply, instead of expanding the data struct for each IOCTL every time we need to add a new feature, we define a user extension chain. As we add new optional interfaces to control the IOCTL, we define a new extension struct that can be linked to the IOCTL data only when required by the user. Therefore, we created a new IOCTL, drm_v3d_submit_cpu, which is used to submit any type of CPU job. 
This single IOCTL can be extended by a user extension, which allows us to reuse the common infrastructure - avoiding code repetition - and yet use the user extension ID to identify the type of job and, depending on the type of job, perform a certain operation.

    struct drm_v3d_submit_cpu {
            /* Pointer to a u32 array of the BOs that are referenced by the job.
             *
             * For DRM_V3D_EXT_ID_CPU_INDIRECT_CSD, it must contain only one BO,
             * that contains the workgroup counts.
             *
             * For DRM_V3D_EXT_ID_TIMESTAMP_QUERY, it must contain only one BO,
             * that will contain the timestamp.
             *
             * For DRM_V3D_EXT_ID_CPU_RESET_TIMESTAMP_QUERY, it must contain only
             * one BO, that contains the timestamp.
             *
             * For DRM_V3D_EXT_ID_CPU_COPY_TIMESTAMP_QUERY, it must contain two
             * BOs. The first is the BO where the timestamp queries will be written
             * to. The second is the BO that contains the timestamp.
             *
             * For DRM_V3D_EXT_ID_CPU_RESET_PERFORMANCE_QUERY, it must contain no
             * BOs.
             *
             * For DRM_V3D_EXT_ID_CPU_COPY_PERFORMANCE_QUERY, it must contain one
             * BO, where the performance queries will be written.
             */
            __u64 bo_handles;

            /* Number of BO handles passed in (size is that times 4). */
            __u32 bo_handle_count;

            __u32 flags;

            /* Pointer to an array of ioctl extensions */
            __u64 extensions;
    };

    Now, we can create a CPU job and submit it with a CPU job user extension. And which extensions are available?

    - DRM_V3D_EXT_ID_CPU_INDIRECT_CSD: this CPU job allows us to submit an indirect CSD job. An indirect CSD job is a job that, when executed in the queue, will map an indirect buffer, read the dispatch parameters, and submit a regular dispatch. This CPU job is used in Vulkan calls like vkCmdDispatchIndirect().
    - DRM_V3D_EXT_ID_CPU_TIMESTAMP_QUERY: this CPU job calculates the query timestamp and updates the query availability by signaling a syncobj. This CPU job is used in Vulkan calls like vkCmdWriteTimestamp().
    - DRM_V3D_EXT_ID_CPU_RESET_TIMESTAMP_QUERY: this CPU job resets the timestamp queries based on the value offset of the first query. This CPU job is used in Vulkan calls like vkCmdResetQueryPool() for timestamp queries.
    - DRM_V3D_EXT_ID_CPU_COPY_TIMESTAMP_QUERY: this CPU job copies the complete or partial result of a query to a buffer. This CPU job is used in Vulkan calls like vkCmdCopyQueryPoolResults() for timestamp queries.
    - DRM_V3D_EXT_ID_CPU_RESET_PERFORMANCE_QUERY: this CPU job resets the performance queries by resetting the values of the perfmons. This CPU job is used in Vulkan calls like vkCmdResetQueryPool() for performance queries.
    - DRM_V3D_EXT_ID_CPU_COPY_PERFORMANCE_QUERY: similar to DRM_V3D_EXT_ID_CPU_COPY_TIMESTAMP_QUERY, this CPU job copies the complete or partial result of a query to a buffer. This CPU job is used in Vulkan calls like vkCmdCopyQueryPoolResults() for performance queries.

    The CPU job IOCTL structure is similar to any other V3D job: we allocate the job struct, parse all the extensions, init the job, look up the BOs and lock their reservations, add the proper dependencies, and push the job to the DRM scheduler entity. 
When running a CPU job, we execute the following code:

    static const v3d_cpu_job_fn cpu_job_function[] = {
            [V3D_CPU_JOB_TYPE_INDIRECT_CSD] = v3d_rewrite_csd_job_wg_counts_from_indirect,
            [V3D_CPU_JOB_TYPE_TIMESTAMP_QUERY] = v3d_timestamp_query,
            [V3D_CPU_JOB_TYPE_RESET_TIMESTAMP_QUERY] = v3d_reset_timestamp_queries,
            [V3D_CPU_JOB_TYPE_COPY_TIMESTAMP_QUERY] = v3d_copy_query_results,
            [V3D_CPU_JOB_TYPE_RESET_PERFORMANCE_QUERY] = v3d_reset_performance_queries,
            [V3D_CPU_JOB_TYPE_COPY_PERFORMANCE_QUERY] = v3d_copy_performance_query,
    };

    static struct dma_fence *
    v3d_cpu_job_run(struct drm_sched_job *sched_job)
    {
            struct v3d_cpu_job *job = to_cpu_job(sched_job);
            struct v3d_dev *v3d = job->base.v3d;

            v3d->cpu_job = job;

            if (job->job_type >= ARRAY_SIZE(cpu_job_function)) {
                    DRM_DEBUG_DRIVER("Unknown CPU job: %d\n", job->job_type);
                    return NULL;
            }

            trace_v3d_cpu_job_begin(&v3d->drm, job->job_type);

            cpu_job_function[job->job_type](job);

            trace_v3d_cpu_job_end(&v3d->drm, job->job_type);

            return NULL;
    }

    The interesting thing is that each CPU job type executes a completely different operation. The complete kernel implementation has already landed in drm-misc-next and can be seen right here. What did we change in Mesa-V3DV to use the new kernel-V3D CPU job? After landing the kernel implementation, I needed to accommodate the new CPU job approach in userspace. A fundamental rule is not to cause regressions, i.e., to keep backwards userspace compatibility with old versions of the Linux kernel. This means we cannot break new versions of Mesa running on old kernels. Therefore, we needed to create two paths: one preserving the old way to perform CPU jobs and the other using the kernel to perform CPU jobs. So, for example, the indirect CSD job used to add two different jobs to the queue: a CPU job and a CSD job. Now, if we have the CPU job capability in the kernel, we only add a CPU job, and the CSD job is dispatched from within the kernel. 
    -   list_addtail(&csd_job->list_link, &cmd_buffer->jobs);
    +
    +   /* If we have a CPU queue we submit the CPU job directly to the
    +    * queue and the CSD job will be dispatched from within the kernel
    +    * queue, otherwise we will have to dispatch the CSD job manually
    +    * right after the CPU job by adding it to the list of jobs in the
    +    * command buffer.
    +    */
    +   if (!cmd_buffer->device->pdevice->caps.cpu_queue)
    +      list_addtail(&csd_job->list_link, &cmd_buffer->jobs);

    Furthermore, now we can use syncobjs to sync the CPU jobs. For example, in the timestamp query CPU job, we used to stall the submission thread and wait for completion of all work queued before the timestamp query. Now, we can just add a barrier to the CPU job and it will be properly synchronized by the syncobjs without stalling the submission thread.

    /* The CPU job should be serialized so it only executes after all previously
     * submitted work has completed
     */
    job->serialize = V3DV_BARRIER_ALL;

    We were able to test the implementation using multiple CTS tests, such as dEQP-VK.compute.pipeline.indirect_dispatch.*, dEQP-VK.pipeline.monolithic.timestamp.*, dEQP-VK.synchronization.*, dEQP-VK.query_pool.* and dEQP-VK.multiview.*. The userspace implementation has already landed in Mesa and the full implementation can be checked in this MR. More about the ongoing challenges in the Raspberry Pi driver stack can be seen in this XDC 2023 talk presented by Iago Toral, Juan Suárez and myself. During this talk, Iago mentioned the CPU job work that we have been doing. Also, I cannot finish this post without thanking Melissa Wen and Iago Toral for all the help while developing the CPU jobs for the V3D kernel driver.
  • Tomeu Vizoso: Etnaviv NPU update 14: Object detection with decent performance (2024/01/10 11:14)
    When almost two months ago I got MobileNetV1 running with useful performance on my driver for the Vivante NPU, I took that milestone as a partial validation of my approach. Partial because MobileNetV1 is a quite old model by now, and since then several iterations have passed with better accuracy and better performance. Would I be able to, without any documentation, add enough support to run newer models with useful performance? Since then, I have been spending some time looking at the state of the art for object detection models, getting a sense of the gap between the features supported by my driver and the operations that the newer models use. SSDLite MobileDet is already 3 years old but can still be considered state-of-the-art on most hardware, with good accuracy while having a low latency. The graph structure was more complex than that of MobileNet, and it used tensor addition operations which I didn't support at the moment. There are other operations that I didn't support, but those were at the end and could be performed on the CPU without much penalty. So after implementing additions along with a few medium-sized refactorings, I got the model running correctly. Performance wasn't that bad at that moment: at 129ms it was twice as fast as the CPU and "only" 5 times slower than the proprietary driver. I knew that I was using extremely conservative values for the size of the output tiles, so I wrote some scripts to run hundreds of different convolution configurations and tabulate the parameters that the proprietary driver used to program the hardware. After a lot of time spent staring at a spreadsheet, I came up with a reasonable guess at the conditions that limit the size of the tiles. By using the biggest tile size that is still safe, I got much better performance: 56.149ms, so almost 18 inferences can be performed per second. If we look at a practical use case such as that supported by Frigate NVR, a typical frame rate for the video inputs is 5 FPS. 
With our current performance level, we could run 3-4 inferences on each frame if there may be several objects being tracked at the same time, or serve 3-4 cameras simultaneously if not. Given the price level of the single-board computers that contain the VIPNano, this is quite a good bang for your buck. And all open source and heading to mainline!

    Next steps

    I have started cleaning up the latest changes so they can be reviewed upstream. And I need to make sure that the in-flight patches to the kernel are merged now that the window for 6.8 has opened.
  • Lennart Poettering: A re-introduction to mkosi -- A Tool for Generating OS Images (2024/01/09 23:00)
    This is a guest post written by Daan De Meyer, systemd and mkosi maintainer.

    Almost 7 years ago, Lennart first wrote about mkosi on this blog. Some years ago, I took over development and there's been a huge amount of changes and improvements since then. So I figure this is a good time to re-introduce mkosi. mkosi stands for Make Operating System Image. It generates OS images that can be used for a variety of purposes. If you prefer watching a video over reading a blog post, you can also watch my presentation on mkosi at All Systems Go 2023.

    What is mkosi?

    mkosi was originally written as a tool to simplify hacking on systemd and for experimenting with images using many of the new concepts being introduced in systemd at the time. In the meantime, it has evolved into a general purpose image builder that can be used in a multitude of scenarios. Instructions to install mkosi can be found in its readme. We recommend running the latest version to take advantage of all the latest features and bug fixes. You'll also need bubblewrap and the package manager of your favorite distribution to get started. At its core, the workflow of mkosi can be divided into 3 steps:

    1. Generate an OS tree for some distribution by installing a set of packages.
    2. Package up that OS tree in a variety of output formats.
    3. (Optionally) Boot the resulting image in qemu or systemd-nspawn.
    Images can be built for any of the following distributions:

    - Fedora Linux
    - Ubuntu
    - OpenSUSE
    - Debian
    - Arch Linux
    - CentOS Stream
    - RHEL
    - Rocky Linux
    - Alma Linux

    And the following output formats are supported:

    - GPT disk images built with systemd-repart
    - Tar archives
    - CPIO archives (for building initramfs images)
    - USIs (Unified System Images which are full OS images packed in a UKI)
    - Sysext, confext and portable images
    - Directory trees

    For example, to build an Arch Linux GPT disk image and boot it in qemu, you can run the following command:

    $ mkosi -d arch -p systemd -p udev -p linux -t disk qemu

    To instead boot the image in systemd-nspawn, replace qemu with boot:

    $ mkosi -d arch -p systemd -p udev -p linux -t disk boot

    The actual image can be found in the current working directory named image.raw. However, using a separate output directory is recommended which is as simple as running mkdir mkosi.output. To rebuild the image after it's already been built once, add -f to the command line before the verb to rebuild the image. Any arguments passed after the verb are forwarded to either systemd-nspawn or qemu itself. To build the image without booting it, pass build instead of boot or qemu or don't pass a verb at all. By default, the disk image will have an appropriately sized root partition and an ESP partition, but the partition layout and contents can be fully customized using systemd-repart by creating partition definition files in mkosi.repart/. This allows you to customize the partition as you see fit:

    - The root partition can be encrypted.
    - Partition sizes can be customized.
    - Partitions can be protected with signed dm-verity.
    - You can opt out of having a root partition and only have a /usr partition instead.
    - You can add various other partitions, e.g. an XBOOTLDR partition or a swap partition.
    - ...

    As part of building the image, we'll run various tools such as systemd-sysusers, systemd-firstboot, depmod, systemd-hwdb and more to make sure the image is set up correctly. 
    Configuring mkosi image builds

    Naturally with extended use you don't want to specify all settings on the command line every time, so mkosi supports configuration files where the same settings that can be specified on the command line can be written down. For example, the command we used above can be written down in a configuration file mkosi.conf:

    [Distribution]
    Distribution=arch

    [Output]
    Format=disk

    [Content]
    Packages=
            systemd
            udev
            linux

    Like systemd, mkosi uses INI configuration files. We also support dropins which can be placed in mkosi.conf.d. Configuration files can also be conditionalized using the [Match] section. For example, to only install a specific package on Arch Linux, you can write the following to mkosi.conf.d/10-arch.conf:

    [Match]
    Distribution=arch

    [Content]
    Packages=pacman

    Because not everything you need will be supported in mkosi, we support running scripts at various points during the image build process where all extra image customization can be done. For example, if it is found, mkosi.postinst is called after packages have been installed. Scripts are executed on the host system by default (in a sandbox), but can be executed inside the image by suffixing the script with .chroot, so if mkosi.postinst.chroot is found it will be executed inside the image. To add extra files to the image, you can place them in mkosi.extra in the source directory and they will be automatically copied into the image after packages have been installed.

    Bootable images

    If the necessary packages are installed, mkosi will automatically generate a UEFI/BIOS bootable image. As mkosi is a systemd project, it will always build UKIs (Unified Kernel Images), except if the image is BIOS-only (since UKIs cannot be used on BIOS). The initramfs is built like a regular image by installing distribution packages and packaging them up in a CPIO archive instead of a disk image. 
    Specifically, we do not use dracut, mkinitcpio or initramfs-tools to generate the initramfs from the host system. ukify is used to assemble all the individual components into a UKI. If you don't want mkosi to generate a bootable image, you can set Bootable=no to explicitly disable this logic.

    Using mkosi for development

    The main requirement to use mkosi for development is that we can build our source code against the image we're building and install it into the image we're building. mkosi supports this via build scripts. If a script named (or is found, we'll execute it as part of the build. Any files put by the build script into $DESTDIR will be installed into the image. Required build dependencies can be installed using the BuildPackages= setting. These packages are installed into an overlay which is put on top of the image when running the build script, so the build packages are available when running the build script but don't end up in the final image. An example script for a project using meson could look as follows:

    #!/bin/sh
    meson setup "$BUILDDIR" "$SRCDIR"
    ninja -C "$BUILDDIR"
    if ((WITH_TESTS)); then
        meson test -C "$BUILDDIR"
    fi
    meson install -C "$BUILDDIR"

    Now, every time the image is built, the build script will be executed and the results will be installed into the image. The $BUILDDIR environment variable points to a directory that can be used as the build directory for build artifacts to allow for incremental builds if the build system supports it. Of course, downloading all packages from scratch every time and re-installing them again every time the image is built is rather slow, so mkosi supports two modes of caching to speed things up. The first caching mode caches all downloaded packages so they don't have to be downloaded again on subsequent builds. Enabling this is as simple as running mkdir mkosi.cache. The second mode of caching caches the image after all packages have been installed but before running the build script. 
    On subsequent builds, mkosi will copy the cache instead of reinstalling all packages from scratch. This mode can be enabled using the Incremental= setting. While there is some rudimentary cache invalidation, the cache can also forcibly be rebuilt by specifying -ff on the command line instead of -f. Note that when running on a btrfs filesystem, mkosi will automatically use subvolumes for the cached images which can be snapshotted on subsequent builds for even faster rebuilds. We'll also use reflinks to do copy-on-write copies where possible. With this setup, by running mkosi -f qemu in the systemd repository, it takes about 40 seconds to go from a source code change to a root shell in a virtual machine running the latest systemd with your change applied. This makes it very easy to test changes to systemd in a safe environment without risk of breaking your host system. Of course, while 40 seconds is not a very long time, it's still more than we'd like, especially if all we're doing is modifying the kernel command line. That's why we have the KernelCommandLineExtra= option to configure kernel command line options that are passed to the container or virtual machine at runtime instead of being embedded into the image. These extra kernel command line options are picked up when the image is booted with qemu's direct kernel boot (using -append), but also when booting a disk image in UEFI mode (using SMBIOS). The same applies to systemd credentials (using the Credentials= setting). These settings allow configuring the image without having to rebuild it, which means that you only have to run mkosi qemu or mkosi boot again afterwards to apply the new settings.

    Building images without root privileges and loop devices

    By using newuidmap/newgidmap and systemd-repart, mkosi is able to build images without needing root privileges. 
    As long as proper subuid and subgid mappings are set up for your user in /etc/subuid and /etc/subgid, you can run mkosi as your regular user without having to switch to root. Note that as of the writing of this blog post this only applies to the build and qemu verbs. Booting the image in a systemd-nspawn container with mkosi boot still needs root privileges. We're hoping to fix this in a future systemd release. Regardless of whether you're running mkosi with root or without root, almost every tool we execute is invoked in a sandbox to isolate as much of the build process from the host as possible. For example, /etc and /var from the host are not available in this sandbox, to avoid host configuration inadvertently affecting the build. Because systemd-repart can build disk images without loop devices, mkosi can run from almost any environment, including containers. All that's needed is a UID range with 65536 UIDs available, either via running as the root user or via /etc/subuid and newuidmap. In a future systemd release, we're hoping to provide an alternative to newuidmap and /etc/subuid to allow running mkosi from all containers, even those with only a single UID available.

    Supporting older distributions

    mkosi depends on very recent versions of various systemd tools (v254 or newer). To support older distributions, we implemented so-called tools trees. In short, mkosi can first build a tools image for you that contains all required tools to build the actual image. This can be enabled by adding ToolsTree=default to your mkosi configuration. Building a tools image does not require a recent version of systemd. In the systemd mkosi configuration, we automatically use a tools tree if we detect your distribution does not have the minimum required systemd version installed.

    Configuring variants of the same image using profiles

    Profiles can be defined in the mkosi.profiles/ directory. 
    The profile to use can be selected using the Profile= setting (or --profile=) on the command line. A profile allows you to bundle various settings behind a single recognizable name. Profiles can also be matched on if you want to apply some settings only to a few profiles. For example, you could have a bootable profile that sets Bootable=yes, adds the linux and systemd-boot packages and configures Format=disk to end up with a bootable disk image when passing --profile bootable on the command line.

    Building system extension images

    System extension images extend the base system with an overlay containing additional files, dynamically at runtime. To build system extensions with mkosi, we need a base image on top of which we can build our extension. To keep things manageable, we'll make use of mkosi's support for building multiple images so that we can build our base image and system extension in one go. We start by creating a temporary directory with a base configuration file mkosi.conf with some shared settings:

    [Output]
    OutputDirectory=mkosi.output
    CacheDirectory=mkosi.cache

    Now let's continue with the base image definition by writing the following to mkosi.images/base/mkosi.conf:

    [Output]
    Format=directory

    [Content]
    CleanPackageMetadata=no
    Packages=systemd
             udev

    We use the directory output format here instead of the disk output so that we can build our extension without needing root privileges. Now that we have our base image, we can define a sysext that builds on top of it by writing the following to mkosi.images/btrfs/mkosi.conf:

    [Config]
    Dependencies=base

    [Output]
    Format=sysext
    Overlay=yes

    [Content]
    BaseTrees=%O/base
    Packages=btrfs-progs

    BaseTrees= points to our base image and Overlay=yes instructs mkosi to only package the files added on top of the base tree. We can't sign the extension image without a key. We can generate one by running mkosi genkey, which will generate files that are automatically picked up when building the image. 
    Finally, you can build the base image and the extension by running mkosi -f. You'll find btrfs.raw in mkosi.output, which is the extension image.

    Various other interesting features

    To sign any generated UKIs for secure boot, put your secure boot key and certificate in mkosi.key and mkosi.crt and enable the SecureBoot= setting. You can also run mkosi genkey to have mkosi generate a key and certificate itself. The Ephemeral= setting can be enabled to boot the image in an ephemeral copy that is thrown away when the container or virtual machine exits. ShimBootloader= and BiosBootloader= settings are available to configure shim and grub installation if needed. mkosi can boot directory trees in a virtual machine using virtiofsd. This is very useful for quickly rebuilding an image and booting it, as the image does not have to be packed up as a disk image. There are many more features that we won't go over in detail in this blog post. Learn more about those by reading the documentation.

    Conclusion

    I'll finish with a bunch of links to more information about mkosi and related tooling:

    - Github repository
    - Building RHEL and RHEL UBI images with mkosi
    - My presentation on systemd-repart at ASG 2023
    - mkosi's Matrix channel
    - systemd's mkosi configuration
    - mkosi's mkosi configuration
  • Mike Blumenkrantz: First Bug Down (2024/01/08 00:00)
    Slow Start It’s been a slow start to the year, by which I mean I’ve been buried under an absolute deluge of all the things you can imagine and then also a blizzard. The literal kind, not the kind that used to make great games. Anyway, it’s not all fun and specs in my capacity as CEO of OpenGL. Sometimes I gotta do Real Work. The number one source of Real Work, as always, is my old code the mesa bug tracker. Unfortunately, the thing is completely overloaded with NVIDIA bugs right now, so it was slim pickins. Another Game I’ve Never Heard Of Am I a boomer? Is this what being a boomer feels like? I really have lived long enough to see myself become the villain. Next bug up is from this game called Valheim. I think it’s a LARPing chess game? Something like that? Don’t @ me. This report came in hot over the break with some rad new shading techniques: It looks way cooler if you play the trace, but you get the idea. Pinpoint Accuracy First question: what in the Sam Hill is going on here? Apparently RADV_DEBUG=hang fixes it, which was a curious one since no other env vars affected the issue. This means the problem is somehow caused by an issue related to the actual Vulkan queue submissions, since (according to legendary multipatch chef Samuel “PLZ SEND REVIEWS!!” Pitoiset) this flag synchronizes the queue after every submit. It’s therefore no surprise that renderdoc was useless. When viewed in isolation, each frame is perfect, but when played at speed the synchronization is lost. My first stops, as anyone would expect, were the sites of queue submission in zink. This means flush and present. Now, I know not everyone is going to be comfortable taking this kind of wild, unhinged guess like I did, but stick with me here. The first thing I checked was a breakpoint on zink_flush(), which is where API flush calls filter through. 
There were the usual end-of-frame hits, but there were a fair number of calls originating from glFenceSync, which is the way a developer can subtly inform a GL driver that they definitely know what they’re doing. So I saw these calls coming in, and I stepped through zink_flush(), and I reached this spot:

    if (!batch->has_work) { <-----HERE
       if (pfence) {
          /* reuse last fence */
          fence = ctx->last_fence;
       }
       if (!deferred) {
          struct zink_batch_state *last = zink_batch_state(ctx->last_fence);
          if (last) {
             sync_flush(ctx, last);
             if (last->is_device_lost)
                check_device_lost(ctx);
          }
       }
       if (ctx->tc && !ctx->track_renderpasses)
          tc_driver_internal_flush_notify(ctx->tc);
    } else {
       fence = &batch->state->fence;
       submit_count = batch->state->usage.submit_count;
       if (deferred && !(flags & PIPE_FLUSH_FENCE_FD) && pfence)
          deferred_fence = true;
       else
          flush_batch(ctx, true);
    }

    Now this is a real puzzler, because if you know what you’re doing as a developer, you shouldn’t be reaching this spot. This is the penalty box where I put all the developers who don’t know what they’re doing, the spot where I push up my massive James Webb Space Telescope glasses and say, “No, ackchuahlly you don’t want to flush right now.” Because you only reach this spot if you trigger a flush when there’s nothing to flush. OR DO YOU? For hahas, I noped out the first part of that conditional, ensuring that all flushes would translate to queue submits, and magically the bug went away. It was a miracle. Until I tried to think through what must be happening for that to have any effect.

    Synchronization: You Cannot Escape

    The reason this was especially puzzling is the call sequence was:

    1. end-of-frame flush
    2. present
    3. glFenceSync flush

    which means the last flush was optimized out, instead returning the fence from the end-of-frame flush. And these should be identical in terms of operations the app would want to wait on. 
Except that there’s a present in there, and technically that’s a queue submission, and technically something might want to know if the submit for that has completed? Why yes, that is stupid, but here at SGC, stupidity is our sustenance. Anyway, I blasted out a quick fix, and now you can all go play your favorite chess sim on your favorite driver again.
  • Mike Blumenkrantz: Manifesto (2024/01/02 00:00)
    This Is It.

    It’s been a long break for the blog, but now we’re back and THE MEME FACTORY IS OPEN FOR BUSINESS. —is what I’d say if it were any other year. But it’s not any other year. This is 2024, and 2024 is a very special year. It’s the year a decades-old plan has finally yielded its dividends.

    Truth.

    You’ve all heard certain improbable claims before. Big Triangle this. Big Triangle that. Everyone knows who they are. Some have even accused me of being a shill for Big Triangle from time to time. At last, however, I can finally pull off my mask to reveal the truth for the world. I was born for a single purpose. As a child, I was grouped in with a number of other candidates. We were trained. Tested. Forged. Unshakable bonds grew between us, bonds we’ll never forget. Bonds that were threatened and broken again and again through harrowing selection processes that culled our ranks. In time, I was the only one remaining. The only one who survived that brutal gauntlet to fulfill an ultimate goal. The goal of infiltrating Big Triangle. More time passed. Days. Months. Years. I continued my quiet training, never letting on to my true purpose. Now, finally, I’ve achieved the impossible. I’ve attained a status within the ranks of Big Triangle that leaves me in command of vast, unfathomable resources. I have become an officer. I am the chair.

    Revolution.

    Now is the time to rise up, my friends. We must take back the triangles—those big and small, success green and failure red, variable rate shaded and fully shaded, all of them together. We must take them and we must fight. No longer will our goals remain the mere unfulfilled dreams of our basement-dwelling forebearers!

    - OpenGL 10.0 by 2025!
    - Compatibility Profile shall be renamed ‘SLOW MODE’!
    - OpenGL ES shall retroactively convert to a YEAR-MONTH versioning scheme with quarterly releases!
    - Depth values shall be uniformly scaled across all hardware and platforms!
    - XFB shall be outlawed!
    - Linux game ports shall no longer link to LLVM!
    - Coherent API error messages shall be printed!
    - Vendors which cannot ship functional Windows GL drivers shall ship Zink!
    - Native GL drivers on mobile platforms shall be outlawed!
    - gl_PointSize shall be replaced by the constant ‘1.0’ in all cases!
    - Mesh and ray-tracing extensions from NVIDIA shall become core functionality!
    - GLX shall be deleted and forgotten!
    - All bug reports shall contain at least one quality meme in the OP as a form of spam prevention!

    Rise up and join me, your new GL/ES chair, in the glorious revolution!

    DISCLAIMER

    Obviously this is all a joke (except the part where I’m the 🪑, that’s 100% real af), but I still gotta put a disclaimer here because otherwise I’m gonna be in biiiiig trouble if this gets taken seriously. Happy New Year. I missed you.
  • Christian Gmeiner: The Year 2023 in Retrospect (2023/12/26 17:03)
    Holidays are here and I have time to look back at 2023. For six months I have been working for Igalia and what should I say? I ❤️ it! This was the best decision to leave my comfort zone of a normal 9-5 job. I am so proud to work on open source GPU drivers and I am able to spend much of my work time on etnaviv. Driver maintenance Before adding any new feature I thought it would be a great idea to improve the current state of etnaviv’s gallium driver. Therefore I reworked some general driver code to be more consistent and to have a more modern feeling, and made it possible to drop some hand-rolled conversion helpers by switching to already existing solutions (U_FIXED(..), S_FIXED(..), float_to_ubyte(..)). I worked through the low-hanging fruit of crashes seen in CI runs and fixed many of them. Feature-wise, I also looked at some easy-to-implement extensions like GL_NV_conditional_render and GL_OES_texture_half_float_linear. Besides the gallium driver I also worked on some NIR and isaspec features that are beneficial for etnaviv. XDC2023 A personal highlight was to give a talk about etnaviv at XDC2023 in person. You might wonder what happened since mid October in etnaviv land. GLES3 I worked on some features that are needed to expose GLES3 and it turned out that a backend compiler that is easy to maintain, extend and test is needed. Sadly etnaviv’s current backend compiler does not check any of these boxes. It is so fragile that I only added some needed lowerings to pass some of the dEQP-GLES3.functional.shaders.texture_functions.* tests. Some more fun work regarding some feature emulation is on the horizon and it’s blocked again by the current compiler. Backend Compiler etnaviv includes an isaspec-powered disassembler now - a small step towards a new backend compiler. Next on the road to success is the etnaviv backend IR with an assembler.
The new backend compiler is able to run OpenCL kernels with the help of rusticl but I want to land the new backend compiler in smaller chunks that are easier to review. Multiple Render Targets During my XDC presentation I talked about a feature I got working on GC7000L - Multiple Render Targets (MRT). At this point it was more or less a proof-of-concept regarding the gallium drivers. There were some missing bits and registers for full support on more GPU models, and therefore more reverse-engineering work was needed. Also the gallium driver needed lots of work to add support for MRT. Some weeks later I had MRT working on a wider range of Vivante GPUs that support this feature. This includes GC2000, GC3000 and GC7000 models among others. As etnaviv makes heavy use of GPU features, it should work on even more models. Looking forward to 2024 I am really confident that we will see GLES3 and OpenCL for etnaviv. As driver testing is quite important for my work I will expand my current board farm and will look into the new star in the CI world - ci-tron. With that, have a happy holiday season and we’ll be back with more improvements in 2024!
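The conversion helpers mentioned above (U_FIXED(..), S_FIXED(..), float_to_ubyte(..)) come from Mesa's shared utility code. As a rough illustration of what such helpers do, here is a minimal Python sketch; the rounding details are assumptions for the example, not the actual Mesa implementations:

```python
def float_to_ubyte(f: float) -> int:
    """Clamp a float to [0, 1] and scale it to an 8-bit unsigned value."""
    f = min(max(f, 0.0), 1.0)
    return int(f * 255.0 + 0.5)  # round to nearest

def u_fixed(value: float, frac_bits: int) -> int:
    """Convert a float to unsigned fixed point with frac_bits fractional bits."""
    return int(value * (1 << frac_bits))

print(float_to_ubyte(1.0))  # 255
print(u_fixed(1.5, 4))      # 24, i.e. 1.5 * 16
```

Replacing hand-rolled versions of conversions like these with the shared helpers is exactly the kind of cleanup that keeps a gallium driver consistent with the rest of Mesa.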
  • Ricardo Garcia: Vulkan extensions Igalia helped ship in 2023 (2023/12/22 14:50)
    Last year I wrote a recap of the Vulkan extensions Igalia helped ship in 2022, and in this post I’ll do the exact same for 2023. For context and quoting the previous recap: The ongoing collaboration between Valve and Igalia lets me and some of my colleagues work on improving the open-source Vulkan and OpenGL Conformance Test Suite. This work is essential to ship quality Vulkan drivers and, from the Khronos side, to improve the Vulkan standard further by, among other things, adding new functionality through API extensions. When creating a new extension, apart from reaching consensus among vendors about the scope and shape of the new APIs, CTS tests are developed in order to check the specification text is clear and vendors provide a uniform implementation of the basic functionality, corner cases and, sometimes, interactions with other extensions. In addition to our CTS work, many times we review the Vulkan specification text from those extensions we develop tests for. We also do the same for other extensions and changes, and we also submit fixes and improvements of our own. So, without further ado, this is the list of extensions we helped ship in 2023. VK_EXT_attachment_feedback_loop_dynamic_state This extension builds on last year’s VK_EXT_attachment_feedback_loop_layout, which is used by DXVK 2.0+ to more efficiently support D3D9 games that read from active render targets. The new extension shipped this year adds support for setting attachment feedback loops dynamically on command buffers. As with all extensions that add more dynamic state, the goal here is to reduce the number of pipeline objects applications need to create, which makes using the API more flexible. It was created by our beloved super good coder and Valve contractor Mike Blumenkrantz. We reviewed the spec and are listed as contributors, and we wrote dynamic variants of the existing CTS tests.
VK_EXT_depth_bias_control A new extension proposed by Joshua Ashton that also helps with layering D3D9 on top of Vulkan. The original problem is quite specific. In D3D9 and other APIs, applications can specify what is called a “depth bias” for geometry using an offset that is to be added directly as an exact value to the original depth of each fragment. In Vulkan, however, the depth bias is expressed as a factor of “r”, where “r” is a number that depends on the depth buffer format and, furthermore, may not have a specific fixed value. Implementations can use different values of “r” in an acceptable range. The mechanism provided by Vulkan without this extension is useful to apply small offsets and solve some problems, but it’s not useful to apply large offsets and/or emulate D3D9 by applying a fixed-value bias. The new extension solves these problems by giving apps the chance to control depth bias in a precise way. We reviewed the spec and are listed as contributors, and wrote CTS tests for this extension to help ship it. VK_EXT_dynamic_rendering_unused_attachments This extension was proposed by Piers Daniell from NVIDIA to lift some restrictions in the original VK_KHR_dynamic_rendering extension, which is used in Vulkan to avoid having to create render passes and framebuffer objects. Dynamic rendering is very interesting because it makes the API much easier to use and, in many cases and especially on desktop platforms, it can be shipped without any associated performance loss. The new extension relaxes some restrictions that made pipelines more tightly coupled with render pass instances. Again, the goal here is to be able to reuse the same pipeline object with multiple render pass instances and remove some combinatorial explosions that may occur in some apps. We reviewed the spec and are listed as contributors, and wrote CTS tests for the new extension.
VK_EXT_image_sliced_view_of_3d Shipped at the beginning of the year by Mike Blumenkrantz, the extension again helps with emulating other APIs on top of Vulkan. Specifically, the extension allows creating 3D views of 3D images such that the views contain a subset of the slices in the image, using a Z offset and range, in the same way D3D12 allows. We reviewed the spec, we’re listed as contributors, and we wrote CTS tests for it. VK_EXT_pipeline_library_group_handles This one comes from Valve contractor Hans-Kristian Arntzen, who is mostly known for working on Proton projects like VKD3D-Proton. The extension is related to ray tracing and adds more flexibility when creating ray tracing pipelines. Ray tracing pipelines can hold thousands of different shaders and are sometimes built incrementally by combining so-called pipeline libraries that contain subsets of those shaders. However, to properly use those pipelines we need to create a structure called a shader binding table, which is full of shader group handles that have to be retrieved from pipelines. Prior to this extension, shader group handles from pipeline libraries had to be requeried once the final pipeline was linked, as they were not guaranteed to be constant throughout the whole process. With this extension, an implementation can tell apps they will not modify shader group handles in subsequent link steps, which makes it easier for apps to build shader binding tables. More importantly, this also more closely matches functionality in DXR 1.1, making it easier to emulate DirectX Raytracing on top of Vulkan raytracing. We reviewed the spec, we’re listed as contributors and we wrote CTS tests for it. VK_EXT_shader_object Shader objects is probably the most notorious extension shipped this year, and we contributed small bits to it. This extension makes every piece of state dynamic and removes the need to use pipelines.
It’s always used in combination with dynamic rendering, which also removes render passes and framebuffers as explained above. This results in great flexibility from the application point of view. The extension was created by Daniel Story from Nintendo, and its vast set of CTS tests was created by Žiga Markuš, but we added our grain of sand by reviewing the spec and proposing some changes (which is why we’re listed as contributors), as well as fixing some shader object tests and providing some improvements here and there once they had been merged. A good part of this work was done in coordination with Mesa developers who were working on implementing this extension for different drivers. VK_KHR_video_encode_h264 and VK_KHR_video_encode_h265 Fresh out of the oven, these Vulkan Video extensions allow leveraging the hardware to efficiently encode H.264 and H.265 streams. This year we’ve been doing a ton of work related to Vulkan Video in drivers, libraries like GStreamer and CTS/spec, including the two extensions mentioned above. Although not listed as contributors to the spec in those two Vulkan extensions, our work played a major role in advancing the state of Vulkan Video and getting them shipped. Epilogue That’s it for this year! I’m looking forward to helping ship more extension work next year and to doing my part in making Vulkan drivers on Linux (and other platforms!) more stable and feature rich. My Vulkan Video colleagues at Igalia have already started work on future Vulkan Video extensions for AV1 and VP9. Hopefully some of that work is ratified next year. Fingers crossed!
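The depth-bias mismatch that motivates VK_EXT_depth_bias_control can be illustrated with a little arithmetic. This is a simplified Python sketch of the idea, not the exact spec formula (which also involves clamping and depth-format details): two conforming Vulkan implementations may pick different values of "r", so the same parameters can produce different offsets, while a D3D9-style bias is an exact value.

```python
def vulkan_depth_bias(max_slope, r, constant_factor, slope_factor):
    # Classic Vulkan depth bias (simplified): the constant term is
    # scaled by an implementation-dependent value "r".
    return slope_factor * max_slope + constant_factor * r

def d3d9_style_bias(depth_bias):
    # D3D9-style bias: added directly as an exact value, which is what
    # VK_EXT_depth_bias_control lets applications express.
    return depth_bias

# Same parameters, two plausible "r" values, different results:
print(vulkan_depth_bias(0.0, 2 ** -23, 16.0, 0.0))
print(vulkan_depth_bias(0.0, 2 ** -20, 16.0, 0.0))
print(d3d9_style_bias(0.001))
```

This is why a layering project like DXVK cannot faithfully reproduce a game's fixed-value bias through the classic mechanism alone.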
  • Tomeu Vizoso: Etnaviv NPU update 13: Don't cross the tensors (2023/12/21 08:16)
    "Don't cross the streams. It would be bad." IR refactorings A big part of what I have been up to in the past two weeks has been a serious refactoring of the data structures that hold the model data in the different phases until the HW configuration is generated. What we had was enough for models with trivial control flow such as MobileNetV1, but more recent models for object classification and detection make use of more operations, and those are linked between each other non-sequentially. The image below shows six of the more than a hundred operations in the SSDLite MobileDet model (figure: a small subsection of SSDLite MobileDet). The adds will be "lowered", or converted to a special case of convolution in which the two input tensors are concatenated together as two channels of a single tensor, and the last convolution in the fragment will need to have its input tensor processed to remove the stride, as the HW doesn't support those natively. The processing of this tensor will be performed in an additional job that will run on the TP (tensor processing) cores in the NPU. As you can probably imagine, the modifications to the operation graph will be far from trivial without the right data structures, so I looked at ways of refactoring the code that translates the model as given by TensorFlow Lite to the HW operations. For now I have settled on having a separate data structure for the tensors, and having the operations refer to their input and output tensors by their indices in that list. In the future, I think we should move to intermediate representations more akin to what is used in compilers, to support more complex lowerings of operations and reorganizations of the operations inside the model. I will be thinking about this later next year, once I get object detection with SSDLite MobileDet running at a useful performance level.
Ideally I would like to reuse NIR so drivers can do all the lowerings and optimizations they need without having to reinvent so much of an IR, but if it turns out that operations on tensors aren't a good fit for NIR, then I will think about doing something similar just for them. For NPUs with programmable cores it could be very interesting to have a pipeline of transformations that can go from very high-level operations to GPGPU instructions, probably starting from a standard such as MLIR. Tensor addition I also put some time into pulling together all the information I gathered about how the proprietary driver interacts with the HW when submitting tensor addition jobs, and spent a substantial amount of time looking at the different parameter combinations in a spreadsheet, with liberal use of CORREL() to get a hint of which parameters of the high-level operations are used as inputs in the formulas that produce the HW configuration. Lowering the strides Similarly to the above, there was a lot of staring at a spreadsheet for the parameters of the TP jobs that transform the input tensor of a convolution with a stride different from one. Status and next steps Below is a rendering of the whole operation graph for the SSDLite MobileDet model, so people can get an idea of the dimensions and complexity of a modern model for edge object detection (figure: the whole of SSDLite MobileDet). The model is currently running without anything exploding too badly, and all the convolutions are running correctly when run independently. But when run together, I see some bad results starting to flow around the middle of the graph, so that is what I will be debugging next.
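The refactoring described above, a flat list of tensors with operations referring to their inputs and outputs by index, can be sketched roughly like this in Python (all names are illustrative, not the actual etnaviv/teflon code):

```python
from dataclasses import dataclass, field

@dataclass
class Tensor:
    dims: tuple  # e.g. (N, H, W, C)

@dataclass
class Operation:
    kind: str    # "CONV2D", "ADD", ...
    inputs: list   # indices into the model's tensor list
    outputs: list  # indices into the model's tensor list

@dataclass
class Model:
    tensors: list = field(default_factory=list)
    operations: list = field(default_factory=list)

    def add_tensor(self, dims):
        self.tensors.append(Tensor(dims))
        return len(self.tensors) - 1  # index used to reference this tensor

# Non-sequential graphs fall out naturally: any two operations can feed
# the same tensor indices into a later ADD, as in SSDLite MobileDet.
m = Model()
a = m.add_tensor((1, 8, 8, 16))
b = m.add_tensor((1, 8, 8, 16))
out = m.add_tensor((1, 8, 8, 16))
m.operations.append(Operation("ADD", inputs=[a, b], outputs=[out]))
print(m.operations[0].inputs)  # [0, 1]
```

Because operations hold only indices, a lowering pass can splice in a new TP job (for example, the stride-removal job mentioned above) by appending a tensor and rewriting a couple of indices, without touching the rest of the graph.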
  • Melissa Wen: The Rainbow Treasure Map Talk: Advanced color management on Linux with AMD/Steam Deck. (2023/12/20 12:00)
    Last week marked a major milestone for me: the AMD driver-specific color management properties reached the upstream linux-next! And to celebrate, I’m happy to share the slide notes from my 2023 XDC talk, “The Rainbow Treasure Map” along with the individual recording that just dropped last week on YouTube – talk about happy coincidences! Steam Deck Rainbow: Treasure Map & Magic Frogs While I may be bubbly and chatty in everyday life, the stage isn’t exactly my comfort zone (hallway talks are more my speed). But the journey of developing the AMD color management properties was so full of discoveries that I simply had to share the experience. Witnessing the fantastic work of Jeremy and Joshua bring it all to life on the Steam Deck OLED was like uncovering magical ingredients and whipping up something truly enchanting. For XDC 2023, we split our Rainbow journey into two talks. My focus, “The Rainbow Treasure Map,” explored the new color features we added to the Linux kernel driver, diving deep into the hardware capabilities of AMD/Steam Deck. Joshua then followed with “The Rainbow Frogs” and showed the breathtaking color magic released on Gamescope thanks to the power unlocked by the kernel driver’s Steam Deck color properties. Packing a Rainbow into 15 Minutes I had so much to tell, but a half-slot talk meant crafting a concise presentation. To squeeze everything into 15 minutes (and calm my pre-talk jitters a bit!), I drafted and practiced those slides and notes countless times. So grab your map, and let’s embark on the Rainbow journey together! Intro: Hi, I’m Melissa from Igalia and welcome to the Rainbow Treasure Map, a talk about advanced color management on Linux with AMD/SteamDeck. Useful links: First of all, if you are not used to the topic, you may find these links useful. XDC 2022 - I’m not an AMD expert, but… - Melissa Wen XDC 2022 - Is HDR Harder?
- Harry Wentland XDC 2022 Lightning - HDR Workshop Summary - Harry Wentland Color management and HDR documentation for FOSS graphics - Pekka Paalanen et al. Cinematic Color - 2012 SIGGRAPH course notes - Jeremy Selan AMD Driver-specific Properties for Color Management on Linux (Part 1) - Melissa Wen Context: When we talk about colors in the graphics chain, we should keep in mind that we have a wide variety of source content colorimetry, a variety of output display devices and also the internal processing. Users expect consistent color reproduction across all these devices. The userspace can use GPU-accelerated color management to get it. But this also requires an interface with display kernel drivers that is currently missing from the DRM/KMS framework. Since April, I’ve been bothering the DRM community by sending patchsets from the work of me and Joshua to add driver-specific color properties to the AMD display driver. In parallel, discussions on defining a generic color management interface are still ongoing in the community. Moreover, we are still not clear about the diversity of color capabilities among hardware vendors. To bridge this gap, we defined a color pipeline for Gamescope that fits the latest versions of AMD hardware. It delivers advanced color management features for gamut mapping, HDR rendering, SDR on HDR, and HDR on SDR. AMD/Steam Deck hardware: AMD frequently releases new GPU and APU generations. Each generation comes with a DCN version with display hardware improvements. Therefore, keep in mind that this work uses the AMD Steam Deck hardware and its kernel driver. The Steam Deck is an APU with a DCN3.01 display driver, a DCN3 family. It’s important to have this information since newer AMD DCN drivers inherit implementations from previous families but also each generation of AMD hardware may introduce new color capabilities. Therefore I recommend that you familiarize yourself with the hardware you are working on.
The AMD display driver in the kernel space: It consists of three layers, (1) the DRM/KMS framework, (2) the AMD Display Manager, and (3) the AMD Display Core. We extended the color interface exposed to userspace by leveraging existing DRM resources and connecting them using driver-specific functions for color property management. Bridging DC color capabilities and the DRM API required significant changes in the color management of AMD Display Manager - the Linux-dependent part that connects the AMD DC interface to the DRM/KMS framework. The AMD DC is the OS-agnostic layer. Its code is shared between platforms and DCN versions. Examining this part helps us understand the AMD color pipeline and hardware capabilities, since the machinery for hardware settings and resource management are already there. The newest architecture for AMD display hardware is the AMD Display Core Next. In this architecture, two blocks have the capability to manage colors: Display Pipe and Plane (DPP) - for pre-blending adjustments; Multiple Pipe/Plane Combined (MPC) - for post-blending color transformations. Let’s see what we have in the DRM API for pre-blending color management. DRM plane color properties: This is the DRM color management API before blending. Nothing! Except two basic DRM plane properties: color_encoding and color_range for the input colorspace conversion, that is not covered by this work. In case you’re not familiar with AMD shared code, what we need to do is basically draw a map and navigate there! We have some DRM color properties after blending, but nothing before blending yet. But much of the hardware programming was already implemented in the AMD DC layer, thanks to the shared code. Still both the DRM interface and its connection to the shared code were missing. That’s when the search begins! AMD driver-specific color pipeline: Looking at the color capabilities of the hardware, we arrive at this initial set of properties. The path wasn’t exactly like that. 
We had many iterations and discoveries until we arrived at this pipeline. The Plane Degamma is our first driver-specific property before blending. It’s used to linearize the color space from encoded values to light linear values. We can use a pre-defined transfer function or a user lookup table (in short, LUT) to linearize the color space. Pre-defined transfer functions for plane degamma are hardcoded curves that go to a specific hardware block called DPP Degamma ROM. It supports the following transfer functions: sRGB EOTF, BT.709 inverse OETF, PQ EOTF, and pure power curves Gamma 2.2, Gamma 2.4 and Gamma 2.6. We also have a one-dimensional LUT. This 1D LUT has four thousand ninety six (4096) entries, the usual 1D LUT size in the DRM/KMS. It’s an array of drm_color_lut that goes to the DPP Gamma Correction block. We also have now a color transformation matrix (CTM) for color space conversion. It’s a 3x4 matrix of fixed points that goes to the DPP Gamut Remap Block. Both pre- and post-blending matrices previously went to the same color block. We worked on detaching them to clear both paths. Now each CTM goes on its own way. Next, the HDR Multiplier. HDR Multiplier is a factor applied to the color values of an image to increase their overall brightness. This is useful for converting images from a standard dynamic range (SDR) to a high dynamic range (HDR). As it can range beyond [0.0, 1.0] subsequent transforms need to use the PQ(HDR) transfer functions. And we need a 3D LUT. But 3D LUT has a limited number of entries in each dimension, so we want to use it in a colorspace that is optimized for human vision. It means in a non-linear space. To deliver it, userspace may need one 1D LUT before 3D LUT to delinearize content and another one after to linearize content again for blending. The pre-3D-LUT curve is called Shaper curve.
Unlike Degamma TF, there are no hardcoded curves for shaper TF, but we can use the AMD color module in the driver to build the following shaper curves from pre-defined coefficients. The color module combines the TF and the user LUT values into the LUT that goes to the DPP Shaper RAM block. Finally, our rockstar, the 3D LUT. 3D LUT is perfect for complex color transformations and adjustments between color channels. 3D LUT is also more complex to manage and requires more computational resources, as a consequence, its number of entries is usually limited. To overcome this restriction, the array contains samples from the approximated function and values between samples are estimated by tetrahedral interpolation. AMD supports 17 and 9 as sizes for a single dimension. Blue is the outermost dimension, red the innermost. As mentioned, we need a post-3D-LUT curve to linearize the color space before blending. This is done by Blend TF and LUT. Similar to shaper TF, there are no hardcoded curves for Blend TF. The pre-defined curves are the same as the Degamma block, but calculated by the color module. The resulting LUT goes to the DPP Blend RAM block. Now we have everything connected before blending. As a conflict between plane and CRTC Degamma was inevitable, our approach doesn’t accept that both are set at the same time. We also optimized the conversion of the framebuffer to wire encoding by adding support for pre-defined CRTC Gamma TF. Again, there are no hardcoded curves and TF and LUT are combined by the AMD color module. The same types of shaper curves are supported. The resulting LUT goes to the MPC Gamma RAM block. Finally, we arrived at the final version of the DRM/AMD driver-specific color management pipeline. With this knowledge, you’re ready to better enjoy the rainbow treasure of AMD display hardware and the world of graphics computing. With this work, Gamescope/Steam Deck embraces the color capabilities of the AMD GPU.
We highlight here how we map the Gamescope color pipeline to each AMD color block. Future work: The search for the rainbow treasure is not over! The Linux DRM subsystem contains many hidden treasures from different vendors. We want more complex color transformations and adjustments available on Linux. We also want to expose all GPU color capabilities from all hardware vendors to the Linux userspace. Thanks Joshua and Harry for this joint work and the Linux DRI community for all feedback and reviews. The amazing part of this work comes in the next talk with Joshua and The Rainbow Frogs! Any questions? References: Slides of the talk The Rainbow Treasure Map. Youtube video of the talk The Rainbow Treasure Map. Patch series for AMD driver-specific color management properties (upstream Linux 6.8v). SteamDeck/Gamescope color management pipeline XDC 2023 website. Igalia website.
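As a rough illustration of the pre-blending stages the talk describes, here is a small Python sketch of a pure power degamma curve (linearization) followed by a 3x4 color transformation matrix. It is a simplification: the real hardware works with fixed-point LUTs and matrices, and the helper names here are made up for the example.

```python
def degamma_power(c: float, gamma: float = 2.2) -> float:
    """Pre-defined pure power transfer function: encoded value -> linear light."""
    return c ** gamma

def apply_ctm(rgb, matrix):
    """Apply a 3x4 CTM: a 3x3 linear part plus a per-channel offset column."""
    r, g, b = rgb
    return tuple(
        matrix[row][0] * r + matrix[row][1] * g + matrix[row][2] * b + matrix[row][3]
        for row in range(3)
    )

# Identity matrix with zero offsets leaves the color unchanged.
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]

linear = tuple(degamma_power(c) for c in (0.5, 0.5, 0.5))
print(apply_ctm(linear, identity))
```

The ordering matters: the CTM (gamut remap) is meant to operate on linear-light values, which is why the degamma step comes first in the pipeline.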
  • Dave Airlie (blogspot): radv: vulkan video encode status (2023/12/19 20:29)
    Vulkan 1.3.274 moves the Vulkan encode work out of BETA and moves h264 and h265 into KHR extensions. radv support for the Vulkan video encode extensions has been in progress for a while. The latest branch is at [1]. This branch has been updated for the new final headers. Updated: It passes all of h265 CTS now, but it is failing one h264 test. Initial ffmpeg support is at [2]. [1] [2]
  • Simon Ser: Status update, December 2023 (2023/12/17 22:00)
    Hi all! This month we’ve finally released wlroots 0.17.0! It’s been a long time since the previous release (1 year), we’ll try to ship future releases a bit more frequently. We’re preparing 0.17.1 with a collection of bugfixes, it should be ready soon. I’ve been working on wlr_surface_synced, a new wlroots abstraction to allow surface commits coming from clients to be delayed. This is required to avoid stalling the whole compositor if a client’s GPU work is slow and to implement explicit synchronization. I’ve also been working on a commit-queue-v1 implementation for wlroots and gamescope, which will allow us to get rid of a CPU wait in Mesa. And I’ve put some finishing touches on Rose’s frame scheduler patches. Last, I’ve merged André Almeida’s kernel patches for atomic async page-flips, making it so modern compositors can enable tearing page-flips without having to go through the legacy KMS uAPI. I’ve added OAuth refresh tokens to Having to renew OAuth tokens every year on my clients is annoying, with refresh tokens that’s a thing of the past! I’ve already updated hottub (CI bridge for GitHub) to leverage this, and I’d like to also implement this in hut (CLI tool) and yojo (CI bridge for Codeberg). Note that since has only now started returning refresh tokens on login, users will need to re-login one last time so that the OAuth clients can grab the refresh token. The NPotM is a bit peculiar: I haven’t actually started working on it this month, and it’s not in a usable state yet. It’s go-sqlgen, a Go code generator which takes SQL as input. The goal is to store SQL queries in a separate file, to make them safer (type checking for the arguments) and faster (prepared statements). It’s somewhat similar to sqlc except it aims at being simpler and database-agnostic. There’s still much to do: I’d like to add support for named parameters, check that the number of parameters in the query matches the number of procedure arguments, and make it easy to write migrations.
I’m not yet sure go-sqlgen is worth the trouble: being database-agnostic limits its abilities, perhaps too much. Then comes the usual mix of random smaller updates. I’ve released soju 0.7.0 and goguma 0.6.0 with a few new features and bugfixes. pyonji now understands the b4 config file, so it’s possible to add this file to your project to preconfigure pyonji with a mailing list (example). delthas has implemented account data import in hut, so it’s now easy to migrate accounts between instances, or projects between accounts. go-scfg now supports decoding a configuration file directly into a Go struct, making it unnecessary to hand-roll parsing code (example). I’ll be giving a FOSDEM talk about quirks and gotchas of the IMAP protocol this year. I’ll be happy to say hi if any of you are coming as well. That’s all I have for this month, see you in January!
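go-sqlgen itself is a Go code generator and isn't usable yet, but the underlying idea of parameterized/prepared statements can be illustrated with Python's sqlite3 module (this sketch shows the concept only, not go-sqlgen's API):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Parameterized queries keep the SQL text constant and bind arguments
# separately: the database can cache the compiled statement (faster),
# and arguments can never be confused with SQL syntax (safer).
insert_user = "INSERT INTO users (name) VALUES (?)"
conn.execute(insert_user, ("alice",))

row = conn.execute("SELECT name FROM users WHERE id = ?", (1,)).fetchone()
print(row[0])  # alice
```

A generator like go-sqlgen goes one step further: by reading the SQL at build time it can emit a typed wrapper function per query, so argument-count and type mismatches become compile-time errors instead of runtime ones.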
  • Peter Hutterer: Xorg being removed. What does this mean? (2023/12/14 04:13)
    You may have seen the news that Red Hat Enterprise Linux 10 plans to remove Xorg. But Xwayland will stay around, and given the name overloading and them sharing a git repository there's some confusion over what is Xorg. So here's a very simple "picture". This is the xserver git repository: $ tree -d -L 2 xserver xserver ├── composite ├── config ├── damageext ├── dbe ├── dix ├── doc │   └── dtrace ├── dri3 ├── exa ├── fb ├── glamor ├── glx ├── hw │   ├── kdrive │   ├── vfb │   ├── xfree86 <- this one is Xorg │   ├── xnest │   ├── xquartz │   ├── xwayland │   └── xwin ├── include ├── m4 ├── man ├── mi ├── miext │   ├── damage │   ├── rootless │   ├── shadow │   └── sync ├── os ├── present ├── pseudoramiX ├── randr ├── record ├── render ├── test │   ├── bigreq │   ├── bugs │   ├── damage │   ├── scripts │   ├── sync │   ├── xi1 │   └── xi2 ├── Xext ├── xfixes ├── Xi └── xkb The git repo produces several X servers, including the one designed to run on bare metal: Xorg (in hw/xfree86 for historical reasons). The other hw directories are the other X servers including Xwayland. All the other directories are core X server functionality that's shared between all X servers [1]. Removing Xorg from a distro but keeping Xwayland means building with --disable-xfree86 --enable-xwayland [1]. That's simply it (plus the resulting distro packaging work of course). Removing Xorg means you need something else that runs on bare metal and that is your favourite Wayland compositor. Xwayland then talks to that while presenting an X11-compatible socket to existing X11 applications. Of course all this means that the X server repo will continue to see patches and many of those will also affect Xorg. For those who are running git master anyway. Don't get your hopes up for more Xorg releases beyond the security update background noise [2]. Xwayland on the other hand is actively maintained and will continue to see releases.
But those releases are a sequence [1] of $ git new-branch xwayland-23.x.y $ git rm hw/{kdrive,vfb,xfree86,xnest,xquartz,xwin} $ git tag xwayland-23.x.y In other words, an Xwayland release is the xserver git master branch with all X servers but Xwayland removed. That's how Xwayland can see new updates and releases without Xorg ever seeing those (except on git master of course). And that's how your installed Xwayland has code from 2023 while your installed Xorg is still stuck on the branch created and barely updated after 2021. I hope this helps a bit with the confusion of the seemingly mixed messages sent when you see headlines like "Xorg is unmaintained", "X server patches to fix blah", "Xorg is abandoned", "new Xwayland release". [1] not 100% accurate but close enough [2] historically an Xorg release included all other X servers (Xquartz, Xwin, Xvfb, ...) too so this applies to those servers too unless they adopt the Xwayland release model
  • Melissa Wen: 15 Tips for Debugging Issues in the AMD Display Kernel Driver (2023/12/13 12:25)
    A self-help guide for examining and debugging the AMD display driver within the Linux kernel/DRM subsystem. It’s based on my experience as an external developer working on the driver, and is shared with the goal of helping others navigate the driver code. Acknowledgments: These tips were gathered thanks to the countless help received from AMD developers during the driver development process. The list below was obtained by examining open source code, reviewing public documentation, playing with tools, asking in public forums and also with the help of my former GSoC mentor, Rodrigo Siqueira. Pre-Debugging Steps: Before diving into an issue, it’s crucial to perform two essential steps: 1) Check the latest changes: Ensure you’re working with the latest AMD driver modifications located in the amd-staging-drm-next branch maintained by Alex Deucher. You may also find bug fixes for newer kernel versions on branches that have the name pattern drm-fixes-<date>. 2) Examine the issue tracker: Confirm that your issue isn’t already documented and addressed in the AMD display driver issue tracker. If you find a similar issue, you can team up with others and speed up the debugging process. Understanding the issue: Do you really need to change this? Where should you start looking for changes? 3) Is the issue in the AMD kernel driver or in the userspace?: Identifying the source of the issue is essential regardless of the GPU vendor. Sometimes this can be challenging so here are some helpful tips: Record the screen: Capture the screen using a recording app while experiencing the issue. If the bug appears in the capture, it’s likely a userspace issue, not the kernel display driver. Analyze the dmesg log: Look for error messages related to the display driver in the dmesg log. If the error message appears before the message “[drm] Display Core v...”, it’s not likely a display driver issue.
    If this message doesn’t appear in your log, the display driver wasn’t fully loaded, and you will see a notification that something went wrong here.

    4) AMD Display Manager vs. AMD Display Core: The AMD display driver consists of two components:

    Display Manager (DM): This component interacts directly with the Linux DRM infrastructure. Occasionally, issues can arise from misinterpretations of DRM properties or features. If the issue doesn’t occur on other platforms with the same AMD hardware - for example, it only happens on Linux but not on Windows - it’s more likely related to the AMD DM code.

    Display Core (DC): This is the platform-agnostic part responsible for setting and programming hardware features. Modifications to the DC usually require validation on other platforms, like Windows, to avoid regressions.

    5) Identify the DC HW family: Each AMD GPU has variations in its hardware architecture. Features and helpers differ between families, so determining the relevant code for your specific hardware is crucial. Find GPU product information in the Linux/AMD GPU documentation, or check the dmesg log for the Display Core version (available since this commit, in Linux kernel v6.3). For example:

      [drm] Display Core v3.2.241 initialized on DCN 2.1
      [drm] Display Core v3.2.237 initialized on DCN 3.0.1

    Investigating the relevant driver code: Don’t let unrelated driver code affect your investigation.

    6) Narrow the code inspection down to one DC HW family: The relevant code resides in a directory named after the DC number. For example, the DCN 3.0.1 driver code is located at drivers/gpu/drm/amd/display/dc/dcn301. AMD’s shared code is huge, and you can use these boundaries to rule out code unrelated to your issue.

    7) Newer families may inherit code from older ones: You can find dcn301 using code from the dcn30, dcn20 and dcn10 files. It’s crucial to verify which hooks and helpers your driver utilizes so that you investigate the right portion.
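One way to see which older-family helpers a newer family pulls in is a plain text search over the driver tree. A minimal sketch, assuming you are at the root of a kernel checkout; the regex and paths are illustrative, not exhaustive:

```shell
# List references that the dcn301 directory makes to older-family
# code (dcn10/dcn20/dcn30 prefixed identifiers), deduplicated:
grep -rhoE 'dcn(10|20|30)_[a-z_]+' drivers/gpu/drm/amd/display/dc/dcn301/ \
  | sort -u | head

# A similar search on the resource file shows which function-pointer
# hook tables ("funcs") the family wires up:
grep -n 'funcs' drivers/gpu/drm/amd/display/dc/dcn301/dcn301_resource.c | head
```

This is only a first pass; the hook tables themselves tell you definitively which implementation a given family uses.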
    You can leverage ftrace for supplemental validation. To give an example, it was useful when I was updating DCN3 color mapping to correctly use the new post-blending color capabilities, such as: [PATCH] drm/amd/display: set stream gamut remap matrix to MPC for DCN3+

    Additionally, you can use two different HW families to compare behaviours. If you see the issue in one but not in the other, you can compare the code, understand what has changed, and see whether the implementation from a previous family doesn’t fit the new HW resources or design well. You can also count on the help of the community on the Linux AMD issue tracker to validate your code on other hardware and/or systems. This approach helped me debug a 2-year-old issue where the cursor gamma adjustment was incorrect on DCN3 hardware, but working correctly for the DCN2 family. I solved the issue in two steps, thanks to community feedback and validation:

    [PATCH] drm/amd/display: check attr flag before set cursor degamma on DCN3+
    [PATCH] drm/amd/display: enable cursor degamma for DCN3+ DRM legacy gamma

    8) Check the hardware capability screening in the driver: You can currently find a list of display hardware capabilities in the drivers/gpu/drm/amd/display/dc/dcn*/dcn*_resource.c file, more precisely in the dcn*_resource_construct() function.
    Using DCN301 for illustration, here is the list of its hardware caps:

      /*************************************************
       * Resource + asic cap harcoding                 *
       *************************************************/
      pool->base.underlay_pipe_index = NO_UNDERLAY_PIPE;
      pool->base.pipe_count = pool->base.res_cap->num_timing_generator;
      pool->base.mpcc_count = pool->base.res_cap->num_timing_generator;
      dc->caps.max_downscale_ratio = 600;
      dc->caps.i2c_speed_in_khz = 100;
      dc->caps.i2c_speed_in_khz_hdcp = 5; /*1.4 w/a enabled by default*/
      dc->caps.max_cursor_size = 256;
      dc->caps.min_horizontal_blanking_period = 80;
      dc->caps.dmdata_alloc_size = 2048;
      dc->caps.max_slave_planes = 2;
      dc->caps.max_slave_yuv_planes = 2;
      dc->caps.max_slave_rgb_planes = 2;
      dc->caps.is_apu = true;
      dc->caps.post_blend_color_processing = true;
      dc->caps.force_dp_tps4_for_cp2520 = true;
      dc->caps.extended_aux_timeout_support = true;
      dc->caps.dmcub_support = true;

      /* Color pipeline capabilities */
      dc->caps.color.dpp.dcn_arch = 1;
      dc->caps.color.dpp.input_lut_shared = 0;
      dc->caps.color.dpp.icsc = 1;
      dc->caps.color.dpp.dgam_ram = 0; // must use gamma_corr
      dc->caps.color.dpp.dgam_rom_caps.srgb = 1;
      dc->caps.color.dpp.dgam_rom_caps.bt2020 = 1;
      dc->caps.color.dpp.dgam_rom_caps.gamma2_2 = 1;
      dc->caps.color.dpp.dgam_rom_caps.pq = 1;
      dc->caps.color.dpp.dgam_rom_caps.hlg = 1;
      dc->caps.color.dpp.post_csc = 1;
      dc->caps.color.dpp.gamma_corr = 1;
      dc->caps.color.dpp.dgam_rom_for_yuv = 0;
      dc->caps.color.dpp.hw_3d_lut = 1;
      dc->caps.color.dpp.ogam_ram = 1; // no OGAM ROM on DCN301
      dc->caps.color.dpp.ogam_rom_caps.srgb = 0;
      dc->caps.color.dpp.ogam_rom_caps.bt2020 = 0;
      dc->caps.color.dpp.ogam_rom_caps.gamma2_2 = 0;
      dc->caps.color.dpp.ogam_rom_caps.pq = 0;
      dc->caps.color.dpp.ogam_rom_caps.hlg = 0;
      dc->caps.color.dpp.ocsc = 0;
      dc->caps.color.mpc.gamut_remap = 1;
      dc->caps.color.mpc.num_3dluts = pool->base.res_cap->num_mpc_3dlut; //2
      dc->caps.color.mpc.ogam_ram = 1;
      dc->caps.color.mpc.ogam_rom_caps.srgb = 0;
      dc->caps.color.mpc.ogam_rom_caps.bt2020 = 0;
      dc->caps.color.mpc.ogam_rom_caps.gamma2_2 = 0;
      dc->caps.color.mpc.ogam_rom_caps.pq = 0;
      dc->caps.color.mpc.ogam_rom_caps.hlg = 0;
      dc->caps.color.mpc.ocsc = 1;
      dc->caps.dp_hdmi21_pcon_support = true;

      /* read VBIOS LTTPR caps */
      if (ctx->dc_bios->funcs->get_lttpr_caps) {
              enum bp_result bp_query_result;
              uint8_t is_vbios_lttpr_enable = 0;

              bp_query_result = ctx->dc_bios->funcs->get_lttpr_caps(ctx->dc_bios, &is_vbios_lttpr_enable);
              dc->caps.vbios_lttpr_enable = (bp_query_result == BP_RESULT_OK) && !!is_vbios_lttpr_enable;
      }

      if (ctx->dc_bios->funcs->get_lttpr_interop) {
              enum bp_result bp_query_result;
              uint8_t is_vbios_interop_enabled = 0;

              bp_query_result = ctx->dc_bios->funcs->get_lttpr_interop(ctx->dc_bios, &is_vbios_interop_enabled);
              dc->caps.vbios_lttpr_aware = (bp_query_result == BP_RESULT_OK) && !!is_vbios_interop_enabled;
      }

    Keep in mind that the documentation of the color capabilities is available in the Linux kernel documentation.

    Understanding the development history: What has brought us to the current state?

    9) Pinpoint relevant commits: Use git log and git blame to identify commits targeting the code section you’re interested in.

    10) Track regressions: If you’re examining the amd-staging-drm-next branch, check for regressions between DC release versions. These are defined by DC_VER in the drivers/gpu/drm/amd/display/dc/dc.h file. Alternatively, find a commit with the format drm/amd/display: 3.2.221, which marks a display release; this is useful for bisecting. This information helps you understand how outdated your branch is and identify potential regressions. You can assume each DC_VER takes around one week to be bumped. Finally, check the testing log of each release in the report provided on the amd-gfx mailing list, such as this one: Tested-by: Daniel Wheeler: RE: [PATCH 00/13] DC Patches for Dec 11, 2023

    Reducing the inspection area: Focus on what really matters.
    11) Identify involved HW blocks: This helps isolate the issue. You can find more information about DCN HW blocks in the DCN Overview documentation. In summary: plane issues are closer to HUBP and DPP; blending/stream issues are closer to MPC, OPP and OPTC, and are related to DRM CRTC subjects. This information was useful when debugging a hardware rotation issue where the cursor plane got clipped off in the middle of the screen. Finally, the issue was addressed by two patches:

    [PATCH 21/22] drm/amd/display: Fix rotated cursor offset calculation
    [PATCH] drm/amd/display: fix cursor offset on rotation 180

    12) Issues around bandwidth (glitches) and clocks: These may be affected by calculations done in the HW blocks above and by HW-specific values. The recalculation equations are found in the DML folder. DML stands for Display Mode Library; it’s in charge of all required configuration parameters supported by the hardware for multiple scenarios. See more in the AMD DC Overview kernel docs. It’s a math library that optimally configures hardware to find the best balance between power efficiency and performance in a given scenario. Finding some clk variables that affect device behavior may be a sign of it. It’s hard for an external developer to debug this part, since it involves information from HW specs and firmware programming that we don’t have access to. The best option is to provide all the relevant debugging information you have and ask AMD developers to check the values you suspect.

    One trick: If you suspect the power setup is degrading performance, try setting the amount of power supplied to the GPU to the maximum and see if it affects the system behavior, with this command:

      sudo bash -c "echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level"

    I learned this when debugging glitches with hardware cursor rotation on the Steam Deck. My first attempt was changing the clock calculation.
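If you try that power trick, it is worth noting the current level first so you can restore it when you are done. A minimal sketch; the sysfs path assumes the GPU is card0 (adjust for your system):

```shell
# Note the current performance level before forcing anything
# (path assumes the GPU is card0; adjust if needed):
cat /sys/class/drm/card0/device/power_dpm_force_performance_level

# Force maximum clocks while testing:
sudo bash -c "echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level"

# Restore the default power management behavior afterwards:
sudo bash -c "echo auto > /sys/class/drm/card0/device/power_dpm_force_performance_level"
```

Leaving the level at "high" permanently wastes power, so the restore step matters on battery-powered devices like the Deck.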
    In the end, Rodrigo Siqueira proposed the right solution targeting bandwidth in two steps: a patch series to create a new internal commit sequence, and enabling pipe split on DCN301.

    Checking implicit programming and hardware limitations: Bring implicit programming to the level of consciousness and recognize hardware limitations.

    13) Implicit update types: Check if the selected type for the atomic update may affect your issue. The update type depends on the mode settings, since programming some modes demands more time for hardware processing. More details in the source code:

      /* Surface update type is used by dc_update_surfaces_and_stream
       * The update type is determined at the very beginning of the function based
       * on parameters passed in and decides how much programming (or updating) is
       * going to be done during the call.
       *
       * UPDATE_TYPE_FAST is used for really fast updates that do not require much
       * logical calculations or hardware register programming. This update MUST be
       * ISR safe on windows. Currently fast update will only be used to flip surface
       * address.
       *
       * UPDATE_TYPE_MED is used for slower updates which require significant hw
       * re-programming however do not affect bandwidth consumption or clock
       * requirements. At present, this is the level at which front end updates
       * that do not require us to run bw_calcs happen. These are in/out transfer func
       * updates, viewport offset changes, recout size changes and pixel depth changes.
       * This update can be done at ISR, but we want to minimize how often this happens.
       *
       * UPDATE_TYPE_FULL is slow. Really slow. This requires us to recalculate our
       * bandwidth and clocks, possibly rearrange some pipes and reprogram anything front
       * end related. Any time viewport dimensions, recout dimensions, scaling ratios or
       * gamma need to be adjusted or pipe needs to be turned on (or disconnected) we do
       * a full update. This cannot be done at ISR level and should be a rare event.
       * Unless someone is stress testing mpo enter/exit, playing with colour or adjusting
       * underscan we don't expect to see this call at all.
       */
      enum surface_update_type {
              UPDATE_TYPE_FAST, /* super fast, safe to execute in isr */
              UPDATE_TYPE_MED, /* ISR safe, most of programming needed, no bw/clk change*/
              UPDATE_TYPE_FULL, /* may need to shuffle resources */
      };

    Using tools: Observe the current state, validate your findings, continue improvements.

    14) Use AMD tools to check the hardware state and driver programming: These help you understand your driver settings and check the behavior when changing those settings.

    DC visual confirmation: Check multiple planes and the pipe split policy.
    DTN logs: Check the display hardware state, including rotation, size, format, underflow, blocks in use, color block values, etc.
    UMR: Check ASIC info, register values, and KMS state - links and elements (framebuffers, planes, CRTCs, connectors). Source: UMR project documentation.

    15) Use generic DRM/KMS tools:

    IGT test tools: Use generic KMS tests or develop your own to isolate the issue in kernel space. Compare results across different GPU vendors to understand their implementations and find potential solutions. AMD also has specific IGT tests for its GPUs that are expected to work without failures on any AMD GPU. You can check the results of HW-specific tests using different display hardware families, or you can compare expected differences between the generic workflow and the AMD workflow.
    drm_info: This tool summarizes the current state of a display driver (capabilities, properties and formats) per element of the DRM/KMS workflow. Its output can be helpful when reporting bugs.

    Don’t give up! Debugging issues in the AMD display driver can be challenging, but by following these tips and leveraging available resources, you can significantly improve your chances of success. Worth mentioning: This blog post builds upon my talk, “I’m not an AMD expert, but…”, presented at the 2022 XDC.
It shares guidelines that helped me debug AMD display issues as an external developer of the driver. Open Source Display Driver: The Linux kernel/AMD display driver is open source, allowing you to actively contribute by addressing issues listed in the official tracker. Tackling existing issues or resolving your own can be a rewarding learning experience and a valuable contribution to the community. Additionally, the tracker serves as a valuable resource for finding similar bugs, troubleshooting tips, and suggestions from AMD developers. Finally, it’s a platform for seeking help when needed. Remember, contributing to the open source community through issue resolution and collaboration is mutually beneficial for everyone involved.
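As a practical postscript to tips 14 and 15, here is a rough sketch of how those observation tools are commonly invoked. The debugfs entries assume an amdgpu device at DRI node 0 and a mounted debugfs; treat the exact paths and flags as assumptions to verify on your system:

```shell
# DC visual confirmation: ask the driver to draw per-plane/pipe
# debug colors on screen (debugfs path assumes DRI device 0):
sudo bash -c "echo 1 > /sys/kernel/debug/dri/0/amdgpu_dm_visual_confirm"

# DTN log: dump the current display hardware state:
sudo cat /sys/kernel/debug/dri/0/amdgpu_dm_dtn_log

# drm_info: summarize driver capabilities, properties and formats:
drm_info

# IGT: enumerate subtests of a single KMS test binary from a built
# igt-gpu-tools tree (path assumes a meson build directory):
./build/tests/kms_atomic --list-subtests
```

None of these change persistent state except the visual-confirm toggle, which can be reset by echoing 0 back into the same file.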
  • Ricardo Garcia: I'm playing Far Cry 6 on Linux (2023/12/08 11:38)
    2023-12-10 UPDATE: From Mastodon, arcepi suggested the instability problems that I described below, and which served as a motivation to try Far Cry 6 on Linux, could be coming from having switched from NVIDIA to AMD without reinstalling Windows, because of leftover files from the NVIDIA drivers. This morning I reinstalled Windows to test this and, indeed, the Deathloop and Far Cry 6 crashes seem to be gone (yay!). That would have removed my original motivation to try to run the game on Linux, but it doesn’t take away the main points of the post. Do take into account that the instability doesn’t seem to exist anymore (and I hope this applies to more future titles I play), but it’s still the background story that explains why I decided to install Far Cry 6 on my Fedora 39 system, so the original post follows below.

    If you’ve been paying attention to the evolution of the Linux gaming ecosystem in recent years, including the release of the Steam Deck and the new Steam Deck OLED, it’s likely your initial reaction to the blog post title is a simple “OK”. However, I’m coming from a very particular place, so I wanted to explain my point of view and the significance of this, and hopefully you’ll find the story interesting.

    Figure 1. Steam running on Fedora Linux 39

    As a background, let me say I’ve always gamed on Windows when using my PC. If you think I’m an idiot for doing so lately, especially because my work at Igalia involves frequently interacting with Valve contractors like Samuel Pitoiset, Timur Kristóf, Mike Blumenkrantz or Hans-Kristian Arntzen, you’d be more than right. But hear me out. I’ve always gamed on Windows because it’s the safe bet. With a couple of small kids at home and very limited free time, when I game everything has to just work. No fiddling around with software, config files, or wasting time setting up the software stack. I’m supposed to boot Windows when I want to play, play, and then turn my computer off.
    The experience needs to be as close to a console as possible. And, for anything non-gaming, which is most of it, I’d be using my Linux system. In the last few years, thanks to the work done by Valve, the Linux gaming stack has improved a lot. Despite this, I’ve kept gaming on Windows for a variety of reasons: For a long time, my Linux disk only had a capacity of 128GB, so installing games was not a real possibility due to the amount of disk space they need. Also, I was running Slackware, and installing Steam and getting the whole thing running implied a fair amount of fiddling I didn’t even want to think about. Then, when I was running Fedora on a large disk, I had kids and I didn’t want to take any risks or possibly waste time on that. So, what changed?

    Figure 2. Sapphire Pulse AMD Radeon RX 6700 box

    Earlier this year I upgraded my PC, replacing an old Intel Haswell i7-4770K with a Ryzen R5 7600X, and my GPU changed from an NVIDIA GTX 1070 to a Radeon RX 6700. The jump in CPU power was much bigger and more impressive than the more modest jump in GPU power. But talking about that and the sorry state of the GPU market is a story for another blog post. In any case, I had put up with the NVIDIA proprietary driver for many years and I think, on Windows and for gaming, NVIDIA is the obvious first choice for many people, including me. Dealing with the proprietary blob under Linux was not particularly problematic, especially with the excellent way it’s handled by RPMFusion on Fedora, where essentially you only have to install a few packages and you can mostly forget about it. However, given my recent professional background, I decided to go with an AMD card for the first time. I wanted to use a fully open source graphics stack and I didn’t want to think about making compromises in Wayland support or on other fronts whatsoever.
    Plus, at the time I upgraded my PC, the timing was almost perfect for me to switch to an AMD card, because: AMD cards were, in general, performing better than NVIDIA cards for the same price, except for ray tracing; the RX 6700 non-XT was on sale; it had roughly the same performance as a PS5; and it didn’t draw a ton of power like many recent high-end GPUs (175W, similar to the 1070 and its 150W TDP).

    After the system upgrade, I did notice a few more stability problems when gaming under Windows, compared to what I was used to with an NVIDIA card. You can find thousands of opinions, comments and anecdotes on the Internet about the quality of AMD drivers, and a lot of people say they’re a couple of steps below NVIDIA drivers. It’s not my intention at all to pile on, but it’s true that my own personal experience has been one of generally more crashes in games and more weird situations since I switched to AMD. Normally, it doesn’t get to the point of being annoying at all, but sometimes it’s a bit surprising, and I could definitely notice that increase in instability without any bias on my side, I believe. Which takes us to Far Cry 6.

    A few days ago I finished playing Doom Eternal and its expansions (really nice game, by the way!) and I decided to go with Far Cry 6 next. I’m slowly working my way up through some more graphically demanding games that I didn’t feel comfortable playing on the 1070. I went ahead and installed the game on Windows. Being a big 70GB download (100GB on disk), that took a bit of time. Then I launched it, adjusted the keyboard and mouse settings to my liking and went to the video options menu. The game had chosen the high preset for me and everything looked good, so I attempted to run the in-game benchmark to see if the game performed well with that preset (I love it when games have built-in benchmarks!). After a few seconds on a loading screen, the game crashed and I was back at the desktop.
    “Oh, what a bad way to start!”, I thought, without knowing what lay ahead. I launched the game again; same thing. Over the course of the two hours that followed, I tried everything: launching the main game instead of the benchmark, just in case the bug only happened in the benchmark (nope); lowering quality and resolution; disabling any advanced setting; trying windowed mode, or borderless full screen; VSync off or on; disabling the overlays for Ubisoft, Steam and AMD; rebooting multiple times; uninstalling the drivers normally, as well as using DDU, and installing them again. Same result every time. I also searched the web for people having similar problems, but got no relevant search results anywhere. Yes, a lot of people, both using AMD and NVIDIA, had gotten crashes somewhere in the game under different circumstances, but nobody mentioned specifically being unable to reach any gameplay at all. That day I went to bed tired and a bit annoyed. I was also close to having run the game for 2 hours according to Steam, which is the limit for refunds if I recall correctly. I didn’t want to refund the game, though; I wanted to play it.

    The next day I was ready to uninstall it and move on to another title in my list but, out of pure curiosity, given that I had already spent a good amount of time trying to make it run, I searched for it on the Proton compatibility database to see if it could be run on Linux, and it seemed to be possible. The game appeared to be well supported and it was verified to run on the Deck, which was good because both the Deck and my system have an RDNA2 GPU. In my head I wasn’t fully convinced this could work, because I didn’t know if the problem was in the game (maybe a bug with recent updates) or the drivers or anywhere else (like a hardware problem). And this was, for me, when the fun started. I installed Steam on Linux from the Gnome Software app.
    For those who don’t know it, it’s like an app store for Gnome that acts as a frontend to the package manager.

    Figure 3. Gnome Software showing Steam as an installed application

    Steam showed up there with 3 possible sources: Flathub, an “rpmfusion-nonfree-steam” repo and the more typical “rpmfusion-nonfree” repo. I went with the last option and soon I had Steam in my list of apps. I launched that and authenticated using the Steam mobile app QR code scanning function for logging in (which is a really cool way to log in, by the way, without needing to recall your username and password). My list of installed games was empty and I couldn’t find a way to install Far Cry 6 because it was not available for Linux. However, I thought there should be an easy way to install it and launch it using the famous Proton compatibility layer, and a quick web search revealed I only had to right-click on the game title, select Properties and choose to “Force the use of a specific Steam Play compatibility tool” under the Compatibility section. Click-click-click and, sure, the game was ready to install. I let it download again and launched it. Some stuff pops up about processing or downloading Vulkan shaders and I see it doing some work. In that first launch, the game takes more time to start compared to what I had seen under Windows, but it ends up launching (and subsequent launches were noticeably faster). That includes some Ubisoft Connect stuff showing up before the game starts and so on. Intro videos play normally and I reach the game menu in full screen. No indication that I was running it on Linux whatsoever. I go directly to the video options menu, see that the game again selected the high preset, I turn off VSync and launch the benchmark. Sincerely, honestly, completely and totally expecting it to crash one more time and that would’ve been OK, pointing to a game bug. But no, for the first time in two days this is what I get:

    Figure 4. Far Cry 6 benchmark screenshot displaying the game running at over 100 frames per second

    The benchmark runs perfectly: no graphical glitches, no stuttering, frame rates normally above 100FPS, and I had a genuinely happy and surprised grin on my face. I laughed out loud and my wife asked what was so funny. Effortless. No command lines, no config files, nothing. As of today, I’ve played the game for over 30 hours and the game has crashed exactly once out of the blue. And I think it was an unfortunate game bug. The rest of the time it’s been running as smooth and as perfect as the first time I ran the benchmark. Framerate is completely fine and way over the 0 frames per second I got on Windows because it wouldn’t run. The only problem seems to be that when I finish playing and exit to the desktop, Steam is unable to stop the game completely for some reason (I don’t know the cause) and it shows up as still running. I usually click on the Stop button in the Steam interface after a few seconds, it stops the game and that’s it. No problem synchronizing game saves to the cloud or anything. Just that small bug that, again, only requires a single extra click.

    2023-12-10 UPDATE: From Mastodon, Jaco G and Berto Garcia tell me the game-not-stopping problem is present in all Ubisoft games and is directly related to the Ubisoft launcher. It keeps running after closing the game, which makes Steam think the game is still running. You can try to close it from the tray if you see the Ubisoft icon there and, if that fails, you can stop the game like I described above.

    Then I remember something that had happened a few months before, prior to starting to play Doom Eternal under Windows. I had tried to play Deathloop first, another game in my backlog. However, the game crashed every few minutes and an error window popped up.
    The amount and timing of the crashes didn’t look constant, and lowering the graphics settings would sometimes allow me to play the game a bit longer, but in any case I wasn’t able to finish the game’s intro level without crashes and being very annoyed. Searching for the error message on the web, I saw it looked like a game problem that was apparently affecting not only AMD users but also NVIDIA ones, so I had mentally classified it as a game bug and, similarly to the Far Cry 6 case, I had given up on running the game without refunding it, hoping to be able to play it in the future. Now I was wondering if it was really a game bug and, even if it was, if maybe Proton could have a workaround for it and maybe it could be played on Linux. Again, ProtonDB showed the game to be verified on the Deck, with encouraging recent reports. So I installed Deathloop on Linux, launched it just once and played for 20 minutes or so. No crashes, and I got as far as I had gotten on Windows in the intro level. Again, no graphical glitches that I could see, smooth framerates, etc. Maybe it was a coincidence and I was lucky, but I think I will be able to play the game without issues when I’m done with Far Cry 6.

    In conclusion, this story is another data point that tells us the quality of Proton as a product and software compatibility layer is outstanding. In combination with some high-quality open source Mesa drivers like RADV, I’m amazed the experience can actually be better than gaming natively on Windows. Think about that: the Windows game binary running natively on an official DX12 or Vulkan driver crashes more and doesn’t work as well as the game running on top of a Windows compatibility layer with a graphics API translation layer, on top of a different OS kernel and a different Vulkan driver. Definitely amazing to me, and it speaks volumes about the work Valve has been doing on Linux. Or it could also speak badly of AMD’s Windows drivers, or both.
Sure, some new games on launch have more compatibility issues, bugs that need fixing, maybe workarounds applied in Proton, etc. But even in those cases, if you have a bit of patience, play the game some months down the line and check ProtonDB first (ideally before buying the game), you may be in for a great experience. You don’t need to be an expert either. Not to mention that some of these details are even better and smoother if you use a Steam Deck as compared to an (officially) unsupported Linux distribution like I do.