Freedesktop Planet - Latest News

  • Simon Ser: Status update, May 2024 (2024/05/20 22:00)
    Hi! Sadly, I need to start this status update with bad news: SourceHut has decided to terminate my contract. At this time, I’m still in the process of figuring out what I’ll do next. I’ve marked some SourceHut-specific projects as unmaintained, such as (feel free to fork of course). I’ve handed over hut maintenance to xenrox, and I’ve started migrating a few projects to other forges (more to follow). I will continue to maintain projects that I still use such as soju to the extent that my free time allows. On a more positive note, this month Igalia’s display next hackfest took place. Although I couldn’t attend in person, it was great to discuss focused topics in real time with other engineers in the community. We discussed color management, HDR, adaptive sync, testing, real-time scheduling, power usage implications of the color pipeline, improved uAPI to handle KMS atomic commit failures, hardware plane offloading, display muxes, backlight, scaling and sharpening filters… And I probably missed a few other things. We’ve released wlroots 0.17.3 with a bunch of bug fixes (thanks to Simon Zeni). The patches to add support for ICC profiles from M. Stoeckl have been merged. I’ve continued working on the new ext-screencopy-v1 protocol but there are a few remaining issues to address before this is ready. The display hackfest has motivated me to work on libliftoff. Apart from a few bug fixes, a new API to set a timeout for the libliftoff algorithm has been added, and some optimizations are about to get merged (one thanks to Leo Li). The Wayland release cycle has started; we’ve merged patches to generate validators for enum values and added a new deprecated-since XML attribute to mark a request, event or enum as deprecated. Thanks to Ferdinand Bachmann, kanshi has gained output defaults and aliases (useful for sharing output configurations across multiple profiles). 
mako 1.9 has been released with a new flag to toggle modes, another new flag to bypass history when dismissing a notification, and support for compositor-side cursor images. In IRC news, goguma now uses Material 3 (please report any regression), has gained support for messages only visible to channel operators (STATUSMSG), and I’ve spent a fair bit of time investigating the infamous duplicate message bug. I have a better understanding of the issue now, but still need a bit more time to come up with a proper fix. Thanks to old patches sent by sitting33 that I took way too long to review, gamja now only marks messages as read when it’s focused, shows the number of unread highlights in the tab title, and hides the internal WHO reply chatter from the user. Last, I’ve released go-imap 2.0.0 beta 3 with a whole bunch of bug fixes. Ksenia Roshchina has contributed a client implementation of the ACL IMAP extension. That’s all for now, see you next month!
  • Alejandro Piñeiro: First time on the Embedded Open Source Summit: talking about the rpi5 (2024/05/14 21:31)
    Some weeks ago I attended the Embedded Open Source Summit for the first time. Igalia had a booth that allowed us to showcase the work that we have been doing over the past years. Several Igalians also gave talks there. I gave a talk titled “Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driver for a New GPU”, where I provided an introduction to Igalia’s contributions to maintaining the OpenGL/Vulkan stack for the Raspberry Pi, focusing on the challenges of implementing the Mesa support for the Raspberry Pi 5, the latest device in that series, which was released in October 2023. If you are interested, the video and slides of my presentation are now available: And as a bonus, you can see here a video showing the RPI5 running some Unreal Engine 4 demos, and other applications:
  • Peter Hutterer: libwacom and Huion/Gaomon devices (2024/05/09 00:01)
    TLDR: Thanks to José Exposito, libwacom 2.12 will support all [1] Huion and Gaomon devices when running on a 6.10 kernel. libwacom, now almost 13 years old, is a C library that provides a bunch of static information about graphics tablets that is not otherwise available by looking at the kernel device. Basically, it's a set of APIs in the form of libwacom_get_num_buttons and so on. This is used by various components to be more precise about initializing devices, even though libwacom itself has no effect on whether the device works. It's only a library for historical reasons [2]; if I were to rewrite it today, I'd probably ship libwacom as a set of static JSON or XML files with a specific schema. Here are a few examples of how this information is used: libinput uses libwacom to query information about tablet tools. The kernel event node always supports tilt, but the individual tool that is currently in proximity may not. libinput can get the tool ID from the kernel, query libwacom and then initialize the tool struct correctly so the compositor and Wayland clients will get the right information. GNOME Settings uses libwacom's information to e.g. detect if a tablet is built into a display or is an external one (to show you the "Map to Monitor" button or not, if built-in). GNOME's mutter uses the SVGs provided by libwacom to show you an OSD where you can assign keystrokes to the buttons. All these features require that the tablet is supported by libwacom. Huion and Gaomon devices [3] were not well supported by libwacom because they re-use USB ids, i.e. different tablets from seemingly different manufacturers have the same vendor and product ID. This is understandable: the 16-bit product id only allows for 65535 different devices, and if you're a company that thinks about more than just the current quarterly earnings you realise that if you release a few devices every year (let's say 5-7), you may run out of product IDs in about 10000 years. Need to think ahead! 
So among the 140 Huion and Gaomon devices we now have in libwacom I only counted 4 different USB ids. Nine years ago we added name matching too to work around this (i.e. the vid/pid/name combo must match) but, lo and behold, we may run out of unique strings before the heat death of the universe so device names are re-used too! [4] Since we had no other information available to userspace, this meant that if you plugged in e.g. a Gaomon M106 it was detected as an S620 and given wrong button numbers, a wrong SVG, etc. A while ago José got himself a tablet and started contributing to DIGIMEND (and upstreaming a bunch of things). At some point we realised that the kernel actually had the information we needed: the firmware version string from the tablet, which conveniently gave us the tablet model too. With this kernel patch scheduled for 6.10 this is now exported as the uniq property (HID_UNIQ in the uevent), and that means it's available to userspace. After a bit of rework in libwacom we can now match on the trifecta of vid/pid/uniq or the quadrella of vid/pid/name/uniq. So hooray, for the first time we can actually detect Huion and Gaomon devices correctly. The second thing José did was to extract all model names from the .deb packages Huion and Gaomon provide and auto-generate the libwacom descriptions for all supported devices. Which meant that in one pull request we added around 130 devices. Nice! As said above, this requires the future kernel 6.10, but you can apply the patches to your current kernel if you want. If you do have one of the newly added devices, please verify the .tablet file for your device and let us know so we can remove the "this is autogenerated" warnings and fix any issues with the file. Some of the new files may now take precedence over the old hand-added ones, so over time we'll likely have to merge them. But meanwhile, for a brief moment in time, things may actually work. 
[1] fsvo of "all", but it should be all current and past ones provided they were supported by Huion's driver [2] anecdote: in 2011 Jason Gerecke from Wacom and I sat down and decided on a generic tablet handling library independent of the xf86-input-wacom driver. libwacom was supposed to be that library but it never turned into more than a static description library; libinput is now what our original libwacom idea was. [3] and XP Pen and UCLogic, but we don't yet have a fix for those at the time of writing [4] names like "HUION PenTablet Pen"...
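    The vid/pid/name/uniq matching described above can be sketched as a lookup that tries the most specific key first and then falls back. This is only an illustration of the idea, not libwacom's actual API or data (the model names, uniq string and fallback scheme below are made up; 0x256C is Huion's USB vendor ID):

```python
# Hypothetical sketch of libwacom-style device matching: try the most
# specific key first (vid/pid/name/uniq), then fall back to a less
# specific one. The database entries are invented for illustration.
DEVICE_DB = {
    (0x256C, 0x006E, "HUION PenTablet Pen", "GM001"): "Gaomon M106",
    (0x256C, 0x006E, "HUION PenTablet Pen", None): "Huion/Gaomon (generic)",
}

def match_device(vid, pid, name, uniq):
    """Return the best-matching model, preferring uniq-qualified entries."""
    for key in ((vid, pid, name, uniq), (vid, pid, name, None)):
        if key in DEVICE_DB:
            return DEVICE_DB[key]
    return None

model = match_device(0x256C, 0x006E, "HUION PenTablet Pen", "GM001")
```

    Before the uniq property, only the second, generic key was available, so every tablet sharing that vid/pid/name resolved to the same entry.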
  • Melissa Wen: Get Ready to 2024 Linux Display Next Hackfest in A Coruña! (2024/05/07 14:33)
    We’re excited to announce the details of our upcoming 2024 Linux Display Next Hackfest in the beautiful city of A Coruña, Spain! This year’s hackfest will be hosted by Igalia and will take place from May 14th to 16th. It will be a gathering of minds from a diverse range of companies and open source projects, all coming together to share, learn, and collaborate outside the traditional conference format. Who’s Joining the Fun? We’re excited to welcome participants from various backgrounds, including: GPU hardware vendors; Linux distributions; Linux desktop environments and compositors; color experts, researchers and enthusiasts. This diverse mix of backgrounds is represented by developers from several companies working on the Linux display stack: AMD, Arm, BlueSystems, Bootlin, Collabora, Google, GravityXR, Igalia, Intel, LittleCMS, Qualcomm, Raspberry Pi, RedHat, SUSE, and System76. This will ensure a dynamic exchange of perspectives and foster collaboration across the Linux Display community. Please take a look at the list of participants for more info. What’s on the Agenda? The beauty of the hackfest is that the agenda is driven by participants! As this is a hybrid event, we decided to improve the experience for remote participants by creating a dedicated space for them to propose topics and some introductory talks in advance. From those inputs, we defined a schedule that reflects the collective interests of the group, but is still open for amendments and new proposals. Find the schedule details in the official event webpage. Expect discussions on:
    - KMS Color/HDR: a proposal with a new DRM object type, with a brief presentation of GPU-vendor features and a status update of the plane color management pipeline per vendor on Linux; HDR/color use-cases: HDR gainmap images and how we should think about HDR; the Google/ChromeOS GFX view on HDR/per-plane color management, VKMS and lessons learned; the post-blending color pipeline.
    - Color/HDR testing/CI: VKMS status update; Chamelium boards, video capture.
    - Wayland protocols: color-management protocol status update; color-representation and video playback.
    - Display control: HDR signalling status update; backlight status update; EDID and DDC/CI.
    - Strategy for video and gaming use-cases: multi-plane support in compositors; underlay, overlay, or mixed strategies for video and gaming use-cases; KMS plane uAPI to simplify the plane arrangement problem; a shared plane arrangement algorithm; HDR video and hardware overlays.
    - Frame timing and VRR: frame timing and the limitations of the uAPI; current user space solutions; brainstorming a better uAPI; cursor/overlay plane updates with VRR; KMS commit and buffer-readiness deadlines.
    - Power saving vs color/latency: ABM (adaptive backlight management); PSR1 latencies; power optimization vs color accuracy/latency requirements.
    - Content-adaptive scaling & sharpening: content-adaptive scalers on display hardware; a new drm_colorop for content-adaptive scaling; proprietary algorithms.
    - Display mux: laptop muxes for switching the embedded panel between the integrated GPU and the discrete GPU; seamless/atomic hand-off between drivers on Linux desktops.
    - Real-time scheduling & async KMS API: potential benefits such as lower-latency input feedback, better VRR handling and buffer synchronization; issues around “async” uAPI usage and async-call handling.
    In-person, but also geographically-distributed: this year the Linux Display Next Hackfest is a hybrid event, hosted onsite at the Igalia offices and available for remote attendance. In-person participants will find an environment for networking and brainstorming in our inspiring and collaborative office space. Additionally, A Coruña itself is a gem waiting to be explored, with stunning beaches, good food, and historical sites. Semi-structured structure: how the 2024 Linux Display Next Hackfest will work:
    - Agenda: participants proposed the topics and talks for discussion in sessions. 
    - Interactive sessions: discussions, workshops, introductory talks and brainstorming sessions lasting around 1h30. There is always a starting point for discussions, and new ideas will emerge in real time.
    - Immersive experience: we will have coffee breaks between sessions and lunch at the office for all in-person participants. Lunches and coffee breaks are sponsored by Igalia. This will keep us sharing knowledge and in continuous interaction.
    - Spaces for all group sizes: in-person participants will find different room sizes that match various group sizes at the Igalia HQ. Besides that, there will be some devices for showcasing and real-time demonstrations.
    Social activities: building connections beyond the sessions. To make the most of your time in A Coruña, we’ll be organizing some social activities:
    - First-day dinner: in-person participants will enjoy a Galician dinner on Tuesday, after a first day of intensive discussions at the hackfest.
    - Getting to know A Coruña and current local habits: on Thursday afternoon, we will close the 2024 Linux Display Next Hackfest with a guided tour of the museum of Galicia’s favorite beer brand, Estrella Galicia. The guided tour covers the eight sectors of the museum and ends with beer pouring and tasting. After this experience, a transfer bus will take us to the Maria Pita square, where we will see the charm of some historical landmarks of A Coruña, explore the casual and vibrant style of the city center and taste local foods while chatting with friends.
    Sponsorship: Igalia sponsors lunches and coffee breaks on hackfest days, Tuesday’s dinner, and the social event on Thursday afternoon for in-person participants. We can’t wait to welcome hackfest attendees to A Coruña! Stay tuned for further details and outcomes of this unconventional and unique experience.
  • Tomeu Vizoso: Etnaviv NPU update 18: Getting the driver to work on the Amlogic S905D3 SoC (2024/05/07 12:46)
    With new releases of the Linux kernel and Mesa drivers poised to be packaged by Linux distributions, the TensorFlow Lite driver for the NPU in the Amlogic A311D SoC will be available to users with minimal effort. With that work bearing its fruits, I have been looking at how this driver could be of use with other hardware. Philipp Zabel of Pengutronix has been looking at adding support for the NPU in the NXP i.MX 8M Plus SoC, and he has made great progress on reverse engineering the in-memory format of the weights tensor, which is different from that used in the A311D. I started by probing what supporting the NPU in the S905D3 SoC from Amlogic would entail, and I found it not that different from what is currently supported, besides it also using a new format for the weights tensor. Weights, the other kind of them. I looked a bit further, and found that this format is very similar to what Philipp had been reverse engineering and implementing support for. After a couple of weeks staring at memory dumps and writing a Python tool to decode them, I realized that the run-length and Huffman encodings were the same, with only a few differences such as where and how the bias values were stored. With a few changes to Philipp's work-in-progress branch I got my first tests passing on the Libre Computer Solitude SBC board. Next I will look at supporting more weights tensor dimensions and fixing bugs in how the weights and other values are encoded. The command stream programming seems to be very similar to that of the A311D, so I don't expect much work to be needed there. Once everything is working at the same level as with the A311D, I will move on to determining the optimal values for the zero run-length and Huffman symbol maps, for maximum compression and thus performance (as NPUs are so fast at arithmetic that they tend to be memory starved). Big thanks to Pengutronix for supporting Philipp's work, and to Libre Computer for having supported the development of the driver so far.
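    To give an intuition for why weight streams compress well with run-length encoding: convolution weights are typically sparse, so long runs of zeros can be collapsed into a marker plus a count. The sketch below is a toy scheme for illustration only; it is not the actual Vivante/Etnaviv bitstream format, and the symbol names are invented:

```python
# Toy run-length decoding of a sparse weights stream: long runs of zeros
# are encoded as a (ZERO_RUN, count) pair, everything else is literal.
# This illustrates the general idea, not the real NPU weight format.
ZERO_RUN = object()  # sentinel standing in for a zero-run symbol

def rle_decode(stream):
    """Expand (ZERO_RUN, n) pairs into n zeros; pass other values through."""
    out = []
    it = iter(stream)
    for sym in it:
        if sym is ZERO_RUN:
            out.extend([0] * next(it))  # next item is the run length
        else:
            out.append(sym)
    return out

weights = rle_decode([3, ZERO_RUN, 4, 7, ZERO_RUN, 2, 1])
```

    A real encoder additionally Huffman-codes the symbols, which is why tuning the zero run-length and symbol maps (as mentioned above) directly affects compression and, on a memory-starved NPU, performance.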
  • Nicolai Hähnle: A new kind of git history (2024/05/04 16:30)
    Discussions about rebase vs. merge are familiar territory for anybody with an interest in version control in general and git in particular. I want to finally give a more permanent home to an idea that I have expressed in the past and that I've occasionally seen others hint at in those discussions as well. There are multiple camps in these discussions that have slightly different ideas about how and for what purposes git should be used. The first major axis of disagreement is whether history needs to be git bisect-able. Outside of my own little hobby projects, I've always worked on projects for which bisectability was important. This has generally been because their scope was such that CI simply had no chance to cover all uses of the software. Bug reports that can be traced to regressions from weeks or even months ago are not frequent per se, but they have always been frequent enough to matter. git bisect is an essential tool for finding those regression points when they happen. Not all projects are like that, but for projects which are, the notion of an "atomic" change to the project's main development branch (or branches) is important. The second major axis of disagreement is whether the development history of those "atomic" changes is important enough to preserve. The original git development workflow does not consider this to be important: developers send around and review multiple iterations of a change, but only the final version of the change goes into the permanent record of the git repository. I tend to agree with that view. I have very occasionally found it useful to go back and read through the comments on a pull request that led to a change months ago (or the email thread in projects that use an email workflow), but I have never found it useful to look at older versions of a change. Some people seem to really care about this kind of history, though. 
They're the people who argue for a merge-based workflow for pull requests on GitHub (but against force-pushes to the same) and who have built hacks for bisectability and readability of history like --first-parent. I'm calling that a hack because it does not compose well. It works for projects whose atomic change history is essentially linear, but it breaks down once the history becomes more complex. What if the project occasionally has a genuine merge? Now you'd want to apply --first-parent for most merge commits but not all. Things get messy. One final observation. Even "my" camp, which generally prefers to discard the development history leading up to the atomic change in a main development branch, does want to preserve a kind of history that is currently not captured by git's graph. git revert inserts the hash of the commit that was reverted into the commit message. Similarly, git cherry-pick optionally inserts the hash of the commit that was cherry-picked into the commit message. In other words, there is a kind of history for whose preservation, at least in some cases, there seems to be a broad consensus. This kind of history is distinct from the history that is captured by commit parent links. Looked at in this light, the idea is almost obvious: make this history an explicit part of git commit metadata. The gist of it would be this. Every commit has an (often empty) list of historical commit references explaining the origins of the diff that is implicitly represented by the commit; let's call them diff-parents. The diff-parents are an ordered list of references to commits, each of them with a "reverted" bit that can optionally be set. The history of a revert can be encoded by making the reverted commit a diff-parent with the "reverted" bit set. The history of a cherry-pick can be encoded similarly, with the "reverted" bit clear. When we perform a simple rebase, each new commit has an obvious diff-parent. 
When commits are squashed during a rebase, the sequence of squashed commits becomes the list of diff-parents of the newly formed commit. GitHub users who like to preserve all development history can use the "squash" option when landing pull requests and have the history be preserved via the list of diff-parents. git commit --amend can similarly record the original commit as a diff-parent. This is an idea and not a fully fleshed-out plan. There are obviously a whole bunch of tricky questions to answer. For example: How does this all fit into git's admittedly often byzantine CLI? Can merge commits be diff-parents, and how would that work? Can we visualize the difference between a commit and its diff-parents? (Hint: Here's an idea.) Diff-parents are a source of potential information leaks. This is not a problem specific to the idea of diff-parents; it is a general problem with the idea of preserving all history. Imagine some developer accidentally commits some credentials in their local clone of a repository and then uses git commit --amend to remove them again. Whoops, the commit that contains the credentials is still referenced as a diff-parent. Will it (and therefore the credentials) be published to the world for all to see when the developer pushes their branch to GitHub? This needs to be taken seriously. So there are a whole bunch of issues that would have to be addressed for this idea to work well. I believe those issues to be quite surmountable in principle, but given the state of git development (where GitHub, which to many is almost synonymous with git, doesn't even seem to be able to understand how git was originally meant to be used) I am not particularly optimistic. Still, I think it's a good idea, and I'd love to see it or something like it in git.
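    Today git records this kind of history only informally in the commit message, as noted above. The throwaway script below (scratch repository and identity are made up for illustration) shows the breadcrumb that git revert leaves behind, which the diff-parent idea would turn into structured metadata:

```python
# Demonstrate how git currently records revert history: only as free text
# ("This reverts commit <hash>.") in the new commit's message.
import os
import subprocess
import tempfile

def git(*args, cwd):
    """Run a git command in the given repo and return its stdout."""
    return subprocess.run(("git",) + args, cwd=cwd, check=True,
                          capture_output=True, text=True).stdout

repo = tempfile.mkdtemp()
git("init", "-q", cwd=repo)
git("config", "user.email", "dev@example.com", cwd=repo)
git("config", "user.name", "Dev", cwd=repo)

with open(os.path.join(repo, "file.txt"), "w") as f:
    f.write("one\n")
git("add", "file.txt", cwd=repo)
git("commit", "-q", "-m", "add file", cwd=repo)

with open(os.path.join(repo, "file.txt"), "a") as f:
    f.write("two\n")
git("commit", "-q", "-a", "-m", "append two", cwd=repo)

# Revert the last commit and inspect the message of the revert commit.
git("revert", "--no-edit", "HEAD", cwd=repo)
message = git("log", "-1", "--format=%B", cwd=repo)
```

    The reverted commit's hash survives only inside `message`; nothing in the commit graph links the revert to its origin, which is exactly the gap diff-parents would fill.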
  • Hans de Goede: Moving GPU drivers out of the initramfs (2024/04/29 13:46)
    The firmware which drm/kms drivers need is becoming bigger and bigger, and there is a push to move to generating a generic initramfs on distros' builders and signing the initramfs with the distro's keys for security reasons. When targeting desktops/laptops (as opposed to VMs) this means including firmware for all possible GPUs, which leads to a very big initramfs. This has made me think about dropping the GPU drivers from the initramfs and instead making plymouth work well/better with simpledrm (on top of efifb). A while ago I discussed making this change for Fedora with the Red Hat graphics team. Spoiler: for now nothing is going to change. Let me repeat that: for now there are no plans to implement this idea, so if you believe you would be impacted by such a change: nothing is going to change. Still this is something worthwhile to explore further.
    Advantages:
    1. Smaller initramfs size:
    - E.g. a host-specific initramfs with amdgpu goes down from 40MB to 20MB.
    - No longer need to worry about Nvidia GSP firmware size in the initrd.
    - This should also significantly shrink the initrd used in live images.
    2. Faster boot times:
    - Loading + unpacking the initrd can take a surprising amount of time. E.g. on my old AMD64 embedded PC (with BobCat cores) the reduction from 40MB to 20MB in initrd size shaves approx. 3 seconds off the initrd load time + 0.6 seconds from the time it takes to unpack the initrd.
    - Probing drm connectors can be slow, and plymouth blocks the initrd -> rootfs transition while it is busy probing.
    3. Earlier showing of the splash: by using simpledrm for the splash, it can be shown earlier, avoiding the impression that the machine is hanging during boot. An extreme example of this is my old AMD64 embedded PC, where the time to show the first frame of the splash goes down from 47 to 9 seconds.
    4. One less thing to worry about when trying to create a uniform desktop pre-generated and signed initramfs (these would still need support for nvme + ahci and commonly used rootfs + lvm + luks).
    Disadvantages: doing this will lead to user-visible changes in the boot process:
    1. Secondary monitors not lit up by the efifb will stay black during full-disk encryption password entry, since the GPU drivers will now only load after switching to the encrypted root. This includes any monitors connected to the non-boot GPU in dual-GPU setups. Generally speaking this is not really an issue: the secondary monitors will light up pretty quickly after the switch to the real rootfs. However, when booting a docked laptop with the lid closed, where the only visible monitor(s) are connected to the non-boot GPU, the full-disk encryption password dialog will simply not be visible at all. This is the main deal-breaker for not implementing this change. Note that because of the strict version lock between the kernel driver and userspace with the nvidia binary drivers, the nvidia binary drivers are usually already not part of the initramfs, so this problem already exists and moving the GPU drivers out of the initramfs does not really make it worse.
    2. With simpledrm plymouth does not get the physical size of the monitor, so plymouth will need to switch to using heuristics on the resolution instead of DPI info to decide whether or not to use hidpi (e.g. 2x size) rendering. Even when switching to the real GPU driver, plymouth needs to stay with its initial heuristics-based decision to avoid the scaling changing when switching to the real driver, which would lead to a big visual glitch / change halfway through the boot. This may result in a different scaling factor for some setups, but I do not expect this to really be an issue.
    3. On some (older) systems the efifb will not come up in native mode, but rather in 800x600 or 1024x768. This will lead to a pretty significant discontinuity in the boot experience when switching from, say, 800x600 to 1920x1080 while plymouth was already showing the spinner at 800x600. One possible workaround here is to add 'video=efifb:auto' to the kernel commandline, which will make the efistub switch to the highest available resolution before starting the kernel. But it seems that the native modes are simply not there on systems which come up at 800x600 / 1024x768, so this does not really help. This does not actually break anything, but it does look a bit ugly. So we will just need to document this as an unfortunate side-effect of the change, and then we (and our users) will have to live with it (on affected hardware).
    4. On systems where a full modeset is done, the monitor going briefly black from the modeset will move from just before plymouth starts to the switch from simpledrm to the real driver. So that is slightly worse. IMHO the answer here is to try and get fast modesets working on more systems.
    5. On systems where the efifb comes up in the panel's native mode and a fast modeset can be done, the spinner will freeze for a (noticeable) fraction of a second as the switch to the real driver happens.
    Preview: to get an impression of what this will look / feel like on your own systems, you can implement this right now on Fedora 40 with some manual configuration changes:
    1. Create /etc/dracut.conf.d/omit-gpu-drivers.conf with: omit_drivers+=" amdgpu radeon nouveau i915 " and then run "sudo dracut -f" to regenerate your current initrd.
    2. Add "plymouth.use-simpledrm" to the kernel commandline.
    3. Edit /etc/selinux/config and set SELINUX=permissive; this is necessary because ATM plymouth has issues with accessing drm devices after the chroot from the initrd to the rootfs.
    Note this all assumes EFI booting with the efifb used to show the plymouth boot splash. For classic BIOS booting it is probably best to stick with having the GPU drivers inside the initramfs.
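    Step 1 of the preview can be scripted. This sketch takes a root directory argument so it can be dry-run outside /etc (root="/" reproduces the path from the post; on a real system you would still run "sudo dracut -f" afterwards, and the helper name here is my own, not a dracut tool):

```python
# Write the dracut drop-in from step 1 of the preview, parameterized with
# a root directory so the sketch can be tried against a scratch dir.
import os
import tempfile

OMIT_LINE = 'omit_drivers+=" amdgpu radeon nouveau i915 "\n'

def write_omit_conf(root="/"):
    """Create etc/dracut.conf.d/omit-gpu-drivers.conf under root."""
    confdir = os.path.join(root, "etc/dracut.conf.d")
    os.makedirs(confdir, exist_ok=True)
    path = os.path.join(confdir, "omit-gpu-drivers.conf")
    with open(path, "w") as f:
        f.write(OMIT_LINE)
    return path

# Dry-run against a temporary directory instead of the live /etc.
path = write_omit_conf(root=tempfile.mkdtemp())
```

    To undo the experiment, delete the drop-in and regenerate the initrd again with "sudo dracut -f".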
  • Robert McQueen: Update from the GNOME board (2024/04/26 10:39)
    It’s been around 6 months since the GNOME Foundation was joined by our new Executive Director, Holly Million, and the board and I wanted to update members on the Foundation’s current status and some exciting upcoming changes. Finances As you may be aware, the GNOME Foundation has operated at a deficit (nonprofit speak for a loss – i.e. spending more than we’ve been raising each year) for over three years, essentially running the Foundation on reserves from some substantial donations received 4-5 years ago. The Foundation has a reserves policy which specifies a minimum amount of money we have to keep in our accounts. This is so that if there is a significant interruption to our usual income, we can preserve our core operations while we work on new funding sources. We’ve now “hit the buffers” of this reserves policy, meaning the Board can’t approve any more deficit budgets – to keep spending at the same level we must increase our income. One of the board’s top priorities in hiring Holly was therefore her experience in communications and fundraising, and building broader and more diverse support for our mission and work. Her goals since joining – as well as building her familiarity with the community and project – have been to set up better financial controls and reporting, develop a strategic plan, and start fundraising. You may have noticed the Foundation being more cautious with spending this year, because Holly prepared a break-even budget for the Board to approve in October, so that we can steady the ship while we prepare and launch our new fundraising initiatives. Strategy & Fundraising The biggest prerequisite for fundraising is a clear strategy – we need to explain what we’re doing and why it’s important, and use that to convince people to support our plans. I’m very pleased to report that Holly has been working hard on this and meeting with many stakeholders across the community, and has prepared a detailed and insightful five year strategic plan. 
The plan defines the areas where the Foundation will prioritise, develop and fund initiatives to support and grow the GNOME project and community. The board has approved a draft version of this plan, and over the coming weeks Holly and the Foundation team will be sharing this plan and running a consultation process to gather feedback and input from GNOME Foundation and community members. In parallel, Holly has been working on a fundraising plan to stabilise the Foundation, growing our revenue and our ability to deliver on these plans. We will be launching a variety of fundraising activities over the coming months, including a development fund for people to directly support GNOME development, working with professional grant writers and managers to apply for government and private foundation funding opportunities, and building better communications to explain the importance of our work to corporate and individual donors. Board Development Another observation that Holly had since joining was that we had, by general nonprofit standards, a very small board of just 7 directors. While we do have some committees which have (very much appreciated!) volunteers from outside the board, our officers are usually appointed from within the board, and many board members end up serving on multiple committees and wearing several hats. It also means the number of perspectives on the board is limited and less representative of the diverse contributors and users that make up the GNOME community. Holly has been working with the board and the governance committee to reduce how much we ask from individual board members, and improve representation from the community within the Foundation’s governance. Firstly, the board has decided to increase its size from 7 to 9 members, effective from the upcoming elections this May & June, allowing more voices to be heard within the board discussions. 
After that, we’re going to be working on opening up the board to more participants, creating non-voting officer seats to represent certain regions or interests from across the community, who will take part in committees and board meetings. These new non-voting roles are likely to be appointed with some kind of application process, and we’ll share details about these roles and how to be considered for them as we refine our plans over the coming year. Elections We’re really excited to develop and share these plans and increase the ways that people can get involved in shaping the Foundation’s strategy and how we raise and spend money to support and grow the GNOME community. This brings me to my final point, which is that we’re in the run-up to the annual board elections, which take place ahead of GUADEC. Because of the expansion of the board, and four directors coming to the end of their terms, we’ll be electing 6 seats this election. It’s really important to Holly and the board that we use this opportunity to bring some new voices to the table, leading by example in growing and better representing our community. Allan wrote in the past about what the board does and what’s expected from directors. As you can see we’re working hard on reducing what we ask from each individual board member by increasing the number of directors, and bringing additional members into committees and non-voting roles. If you’re interested in seeing more diverse backgrounds and perspectives represented on the board, I would strongly encourage you to consider standing for election and to reach out to a board member to discuss their experience. Thanks for reading! Until next time. Best Wishes,
Rob
President, GNOME Foundation
Update 2024-04-27: It was suggested in the Discourse thread that I clarify the interaction between the break-even budget and the 1M EUR committed by the STF project. 
This money is received in the form of a contract for services rather than a grant to the Foundation, and must be spent on the development areas agreed during the planning and application process. It’s included within this year’s budget (October 23 – September 24) and is all expected to be spent during this fiscal year, so it doesn’t have an impact on the Foundation’s reserves position. The Foundation retains a small percentage fee to support its costs in connection with the project, including the new requirement to have our accounts externally audited at the end of the financial year. We are putting this money towards recruitment of an administrative assistant to improve financial and other operational support for the Foundation and community, including the STF project and future development initiatives. (also posted to GNOME Discourse, please head there if you have any questions or comments)
  • Mike Blumenkrantz: Startup (2024/04/25 00:00)
It Happened Again. I’ve been seeing a lot of ultra technical posts fly past my news feed lately and I’m tired of it. There’s too much information out there, too many analyses of vague hardware capabilities, too much handwaving in the direction of compiler internals. It’s too much. Take it out. I know you’ve got it with you. I know all my readers carry them at all times. That’s right. It’s time to make some pasta. Everyone understands pasta.

Target Locked

Today I’ll be firing up the pasta maker on this ticket that someone nerdsniped me with. This is the sort of simple problem that any of us smoothbrains can understand: app too slow. Here at SGC, we’re all experts at solving app too slow by now, so let’s take a gander at the problem area. I’m in a hurry to get to the gym today, so I’ll skip over some of the less interesting parts of my analysis. Instead, let’s look at some artisanal graphics. This is an image, but let’s pretend it’s a graph of the time between when an app is started and when it displays its first frame: At the start is when the user launched the app, the body of the arrow is what happens during “startup”, and the head of the arrow is when the app has displayed its first frame to the user. The “startup” period is what the user perceives as latency. More technical blogs would break down here into discussions and navel-gazing about “time to first light” and “photon velocity” or whatever, but we’re keeping things simple. If SwapBuffers is called, the app has displayed its frame. Where are we at with this now?

Initial Findings

I did my testing on an Intel Icelake CPU/GPU because I’m lazy. Also because the original ticket was for Intel systems. Also because deal with it, this isn’t an AMD blog. The best way to time this is to:
- add an exit call at the end of SwapBuffers
- run the app in a while loop using time
- evaluate the results
On iris, the average startup time for gtk4-demo was between 190-200ms. On zink, the average startup time was between 350-370ms.
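The timing loop above can be sketched roughly like this (a sketch only: it assumes the app has already been patched to exit after its first SwapBuffers, and `true` stands in for the patched gtk4-demo):

```python
# Rough sketch of the timing method above. Assumes the app has been
# patched to exit right after its first SwapBuffers call; "true" stands
# in for the patched gtk4-demo here.
import subprocess
import time

def average_startup_ms(cmd, runs=10):
    """Run cmd repeatedly and return the mean wall-clock time in ms."""
    samples = []
    for _ in range(runs):
        start = time.monotonic()
        subprocess.run(cmd, check=True)
        samples.append((time.monotonic() - start) * 1000.0)
    return sum(samples) / len(samples)

print(f"average startup: {average_startup_ms(['true'], runs=5):.1f} ms")
```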
Uh-oh. More Graphics (The Fun Kind) Initial analysis revealed something very stupid for the zink case: a lot of time was being spent on shaders. Now, I’m not saying a lot of time was spent compiling shaders. That would be smart. Shaders have to be compiled, and it’s not like that can be skipped or anything. A cold run of this app that compiles shaders takes upwards of 1.0 seconds on any driver, and I’m not looking to improve that case since it’s rare. And hard. And also I gotta save some work for other people who want to make good blog posts. The problem here is that when creating shaders, zink blocks while it does some initial shader rewrites and optimizations. This is like if you’re going to make yourself a sandwich, before you put smoked brisket on the bread you have to first slice the bread so it’s ready when you want to put the brisket on it. Sure, you could slice it after you’ve assembled your pile of pulled pork and slaw, but generally you slice the bread, you leave the bread sitting somewhere while you find/make/assemble the burnt ends for your sandwich, and then you finish making your sandwich. Compiling shaders is basically the same as making a sandwich. But slicing bread takes time. And when you’re slicing the bread, you’re not doing anything else. You can’t. You’re holding a knife and a loaf of bread. You’re physically incapable of doing anything else until you finish slicing. Similarly, zink can’t do anything else while it’s doing that shader creation. It’s sitting there creating the shaders. And while it’s doing that, the rest of the app (or just the main GL thread if glthread is active) is blocked. It can’t do anything else. It’s waiting on zink to finish, and it cannot make forward progress until the shader creation has completed. Now this process happens dozens or hundreds of times during app startup, and every time it happens, the app blocks. 
Its own initialization routines–reading configuration data, setting up global structs and signal handlers, making display server connections, etc–cannot proceed until GL stops blocking. If you’re unsure where I’m going with this, it’s a bad thing that zink is slicing all this bread while the app is trying to make sandwiches.

Improvement

The year is whatever year you’re reading this, and in that year we have very powerful CPUs. CPUs so powerful that you can do lots of things at once. Instead of having only two hands to hold the bread and slice it, you have your own hands and then the hands of another 10+ of your clones which are also able to hold bread and slice it. So if you tell one of those clones “slice some bread for me”, you can do other stuff and come back to some nicely sliced bread. When exactly that bread arrives is another issue depensynchronizationding on how well you understand the joke here. But this is me, so I get all the jokes, and that means I can do something like this: By moving all that bread slicing into a thread, the rest of the startup operations can proceed without blocking. This frees up the app to continue with its own lengthy startup routines. After the change, zink starts up in an average of 260-280ms, a 25% improvement. I know not everyone wants pasta on their sandwiches, but that’s where we ended up today.

Not The End

That changeset is the end of this post, but it’s not the end of my investigation. There’s still mysteries to uncover here. Like why the farfalle is this app calling glXInitialize and eglInitialize? Can zink get closer to iris’s startup time? We’ll find out in a future installment of Who Wants To Eat Lunch?
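For the curious, the shape of the fix is roughly this (a toy sketch with invented names, not zink’s actual internals): the blocking shader creation moves to worker threads, and the main thread only waits when it actually needs the results.

```python
# Toy sketch of the change: shader "creation" (bread slicing) runs in
# worker threads so the rest of startup proceeds concurrently. All names
# here are invented for illustration; this is not zink's actual code.
import threading
import time

def create_shader_blocking(name, results):
    time.sleep(0.01)                 # stand-in for rewrites/optimizations
    results[name] = f"{name}: ready"

results = {}
workers = [
    threading.Thread(target=create_shader_blocking, args=(name, results))
    for name in ("vs", "fs")
]
for w in workers:
    w.start()                        # slicing happens in the background

app_initialized = True               # main thread keeps doing startup work

for w in workers:
    w.join()                         # wait only when the shaders are needed
```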
  • Tomeu Vizoso: Rockchip NPU update 3: Real-time object detection on RK3588 (2024/04/19 08:17)
Progress

Yesterday I managed to implement in my open-source driver all the remaining operations so the SSDLite MobileDet model can run on Rockchip's NPU in the RK3588 SoC. Performance is pretty good at 30 frames per second when using just one of the 3 cores that the NPU contains. I uploaded the generated video to YouTube. You can get the source code at my branch here.

Next steps

Now that we have got to this level of usefulness, I'm going to switch to writing a kernel driver suited for inclusion into the Linux kernel, in the drivers/accel subsystem. There is still lots of work to do, but progress is going pretty fast, though as I write more drivers for different NPUs I will have to split my time among them. At least, until we get more contributors! :)
  • Peter Hutterer: udev-hid-bpf: quickstart tooling to fix your HID devices with eBPF (2024/04/18 04:17)
For the last few months, Benjamin Tissoires and I have been working on and polishing a little tool called udev-hid-bpf [1]. This is the scaffolding required to quickly and easily write, test and eventually fix your HID input devices (mouse, keyboard, etc.) via a BPF program instead of a full-blown custom kernel driver or a semi-full-blown kernel patch. To understand how it works, you need to know two things: HID and BPF [2].

Why BPF for HID?

HID is the Human Interface Device standard and the most common way input devices communicate with the host (HID over USB, HID over Bluetooth, etc.). It has two core components: the "report descriptor" and "reports", both of which are byte arrays. The report descriptor is a fixed burnt-in-ROM byte array that (in rather convoluted terms) tells us what we'll find in the reports. Things like "bits 16 through to 24 is the delta x coordinate" or "bit 5 is the binary button state for button 3 in degrees Celsius". The reports themselves are sent at (usually) regular intervals and contain the data in the described format, as the device perceives reality. If you're interested in more details, see Understanding HID report descriptors. BPF, or more correctly eBPF, is a Linux kernel technology to write a program in a subset of C, compile it and load it into the kernel. The magic thing here is that the kernel will verify it, so once loaded, the program is "safe". And because it's safe it can be run in kernel space, which means it's fast. eBPF was originally written for network packet filters but as of kernel v6.3, and thanks to Benjamin, we have BPF in the HID subsystem. HID actually lends itself really well to BPF because, well, we have a byte array, and to fix our devices we need to do complicated things like "toggle that bit to zero" or "swap those two values". If we want to fix our devices we usually need to do one of two things: fix the report descriptor to enable/disable/change some of the values the device pretends to support.
For example, we can say we support 5 buttons instead of the supposed 8. Or we need to fix the report by e.g. inverting the y value for the device. This can be done in a custom kernel driver but a HID BPF program is quite a lot more convenient.

HID-BPF programs

For illustration purposes, here's the example program to flip the y coordinate. HID BPF programs are usually device specific; we need to know that e.g. the y coordinate is 16 bits and sits in bytes 3 and 4 (little endian):

SEC("fmod_ret/hid_bpf_device_event")
int BPF_PROG(hid_y_event, struct hid_bpf_ctx *hctx)
{
    s16 y;
    __u8 *data = hid_bpf_get_data(hctx, 0 /* offset */, 9 /* size */);

    if (!data)
        return 0; /* EPERM check */

    y = data[3] | (data[4] << 8);
    y = -y;

    data[3] = y & 0xFF;
    data[4] = (y >> 8) & 0xFF;

    return 0;
}

That's it. HID-BPF is invoked before the kernel handles the HID report/report descriptor, so to the kernel the modified report looks as if it came from the device. As said above, this is device specific because where the coordinate is in the report depends on the device (the report descriptor will tell us). In this example we want to ensure the BPF program is only loaded for our device (vid/pid of 04d9/a09f), and for extra safety we also double-check that the report descriptor matches.

// The bpf.o will only be loaded for devices in this list
HID_BPF_CONFIG(
    HID_DEVICE(BUS_USB, HID_GROUP_GENERIC, 0x04D9, 0xA09F)
);

SEC("syscall")
int probe(struct hid_bpf_probe_args *ctx)
{
    /*
     * The device exports 3 interfaces.
     * The mouse interface has a report descriptor of length 71.
     * So if report descriptor size is not 71, mark as -EINVAL
     */
    ctx->retval = ctx->rdesc_size != 71;
    if (ctx->retval)
        ctx->retval = -EINVAL;

    return 0;
}

Obviously the check in probe() can be as complicated as you want. This is pretty much it; the full working program only has a few extra includes and boilerplate. So it mostly comes down to compiling and running it, and this is where udev-hid-bpf comes in.
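As an aside (not part of the original tool): the byte manipulation in the y-flip program is easy to sanity-check in user space against a captured report. Here is the same little-endian 16-bit flip in Python, assuming the 9-byte report layout described above.

```python
# The same little-endian 16-bit y-flip as the BPF program above, written
# in Python so it can be checked against a captured report. Assumes the
# 9-byte report layout with y in bytes 3 and 4 (little endian, signed).
import struct

def flip_y(report: bytearray) -> bytearray:
    (y,) = struct.unpack_from("<h", report, 3)
    struct.pack_into("<h", report, 3, -y)
    return report

report = bytearray([0x01, 0x00, 0x00, 0x05, 0x00, 0x00, 0x00, 0x00, 0x00])
flip_y(report)  # y was 5, is now -5
```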
udev-hid-bpf as loader

udev-hid-bpf is a tool to make the development and testing of HID BPF programs simple, and to collect HID BPF programs. You basically run meson compile and meson install and voila, whatever BPF program applies to your devices will be auto-loaded next time you plug those in. If you just want to test a single bpf.o file you can udev-hid-bpf install /path/to/foo.bpf.o and it will install the required udev rule for it to get loaded whenever the device is plugged in. If you don't know how to compile, you can grab a tarball from our CI and test the pre-compiled bpf.o. Hooray, even simpler. udev-hid-bpf is written in Rust but you don't need to know Rust, it's just the scaffolding. The BPF programs are all in C. Rust just gives us a relatively easy way to provide a static binary that will work on most testers' machines. The documentation for udev-hid-bpf is here. So if you have a device that needs a hardware quirk or just has an annoying behaviour that you always wanted to fix, well, now's the time. Fixing your device has never been easier! [3]

[1] Yes, the name is meh but you're welcome to come up with a better one and go back in time to suggest it a few months ago.
[2] Because I'm lazy the terms eBPF and BPF will be used interchangeably in this article. Because the difference doesn't really matter in this context, it's all eBPF anyway but nobody has the time to type that extra "e".
[3] Citation needed
  • Simon Ser: Status update, April 2024 (2024/04/15 22:00)
Hi! The X.Org Foundation results are in, and I’m now officially part of the Board of Directors. I hope I can be of use to the community on more organizational issues! Speaking of which, I’ve spent quite a bit of time dealing with Code of Conduct matters lately. Of course I can’t disclose details for privacy reasons, but hopefully our actions can gradually improve the contribution experience for FreeDesktop.Org projects. New extensions have been merged in wayland-protocols. linux-drm-syncobj-v1 enables explicit synchronization, which is a better architecture than what we have today (implicit synchronization) and will improve NVIDIA support. alpha-modifier-v1 allows Wayland clients to set an alpha channel multiplier on their surfaces; it can be used to implement effects such as fade-in or fade-out without redrawing, and can even be offloaded to KMS. The tablet-v2 protocol we’ve used for many years has been stabilized. In other Wayland news, a new API has been added to dynamically resize libwayland’s internal buffer. By default, the server-side buffer size is still 4 KiB but the client-side buffer will grow as needed. This should help with bursts (e.g. long format lists) and high poll rate mice. I’ve added a new wayland-scanner mode to generate headers with only enums, to help libraries such as wlroots which use these in their public API. And I’ve sent an announcement for the next Wayland release; it should happen at the end of May if all goes well. With the help of Sebastian Wick, libdisplay-info has gained support for more bits, in particular DisplayID type II, III and VII timings, as well as CTA Video Format Preference blocks, Room Configuration blocks and Speaker Location blocks. I’ve worked on libicc to finish up the parser; next I’d like to add the math required to apply an ICC profile. gamja now has basic support for file uploads (only when pasting a file for now) and hides no-op nickname changes (e.g. from “emersion” to “emersion_” and back). See you next month!
  • Christian Gmeiner: hwdb - The only truth (2024/04/15 00:00)
Trusting hardware, particularly the registers that describe its functionality, is fundamentally risky.

tl;dr

The etnaviv GPU stack is continuously improving and becoming more robust. This time, a hardware database was incorporated into Mesa, utilizing header files provided by the SoC vendors. If you are interested in the implementation details, I recommend checking out this Mesa MR. Are you employed at VeriSilicon and want to help? You could greatly simplify our work by supplying the community with a comprehensive header that includes all the models you offer. Last but not least: I deeply appreciate Igalia’s passion for open source GPU driver development, and I am grateful to be a part of the team. Their enthusiasm for open source work not only pushes the boundaries of technology but also builds a strong, collaborative community around it.

The good old days

Years ago, when I began dedicating time to hacking on etnaviv, the kernel driver in use would read a handful of registers and relay the gathered information to the user space blob. This blob driver was then capable of identifying the GPU (including model, revision, etc.), supported features (such as DXT texture compression, seamless cubemaps, etc.), and crucial limits (like the number of registers, number of varyings, and so on). For reverse engineering purposes, this interface is super useful. Imagine if you could change one of these feature bits on a target running the binary blob. With libvivhook it is possible to do exactly this. From time to time, I am running such an old vendor driver stack on an i.MX 6QuadPlus SBC, which features a Vivante GC3000 as its GPU. Somewhere, I have a collection of scripts that I utilized to acquire additional knowledge about unknown GPU states activated when a specific feature bit was set. To explore a simple example, let’s consider the case of misrepresenting a GPU’s identity as a GC2000.
This involves modifying the information provided by the kernel driver to the user space, making the user space driver believe it is interacting with a GC2000 GPU. This scenario could be used for testing, debugging, or understanding how specific features or optimizations are handled differently across GPU models.

export ETNAVIV_CHIP_MODEL="0x2000"
export ETNAVIV_CHIP_REVISION="0x5108"
export ETNAVIV_FEATURES0_CLEAR="0xFFFFFFFF"
export ETNAVIV_FEATURES1_CLEAR="0xFFFFFFFF"
export ETNAVIV_FEATURES2_CLEAR="0xFFFFFFFF"
export ETNAVIV_FEATURES0_SET="0xe0296cad"
export ETNAVIV_FEATURES1_SET="0xc9799eff"
export ETNAVIV_FEATURES2_SET="0x2efbf2d9"
LD_PRELOAD="/lib/" ./test-case

If you capture the generated command stream and compare it with the one produced under the correct identity, you’ll observe many differences. This is super useful - I love it.

Changing Tides: The Shift in ioctl() Interface

At some point in time, Vivante changed their ioctl() interface and modified the gcvHAL_QUERY_CHIP_IDENTITY command. Instead of providing a very detailed chip identity, they reduced the data set to the following values:
- model
- revision
- product id
- eco id
- customer id
This shift could indeed hinder reverse engineering efforts significantly. At a glance, it becomes impossible to alter any feature value, and understanding how the vendor driver processes these values is out of reach. Determining the function or impact of an unknown feature bit now seems unattainable. However, the kernel driver also requires a mechanism to verify the existing features of the GPU, as it needs to accommodate a wide variety of GPUs. Therefore, there must be some sort of system or method in place to ensure the kernel driver can effectively manage and support the diverse functionalities and capabilities of different GPUs.

A New Approach: The Hardware Database Dilemma

Let’s welcome: gc_feature_database.h, or hwdb for short.
Vivante transitioned to using a database that stores entries for limit values and feature bits. This database is accessed by querying with model, revision, product id, eco id and customer id. There is some speculation about why this move was made. My theory posits that they became frustrated with the recurring cycle of introducing feature bits to indicate the implementation of a feature, subsequently discovering problems with said feature, and then having to introduce additional feature bits to signal that the feature now truly operates as intended. It became far more straightforward to deactivate a malfunctioning feature by modifying information in the hardware database (hwdb). After they began utilizing the hwdb within the driver, updates to the feature registers in the hardware ceased. Here is a concrete example of such a case that can be found in the etnaviv gallium driver:

screen->specs.tex_astc = VIV_FEATURE(screen, chipMinorFeatures4, TEXTURE_ASTC) &&
                         !VIV_FEATURE(screen, chipMinorFeatures6, NO_ASTC);

Meanwhile, in the etnaviv world there was a hybrid in the making. We stuck with the detailed feature words and found a smart way to convert from Vivante’s hwdb entries to our own in-kernel database. There is even a full blown Vivante -> etnaviv hwdb converter. At that time, I did not fully understand all the consequences this approach would bring - more on that later. So, I dedicated my free time to reverse engineering and tweaking the user space driver, while letting the kernel developers do their thing. About a year after the initial hwdb landed in the kernel, I thought it might be a good idea to read out the extra id values and provide them via sysfs to the user space. At that time, I already had the idea of moving the hardware database to user space in mind. However, I was preoccupied with other priorities that were higher on my to-do list, and I ended up forgetting about it.
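To make the two models concrete, here is a toy sketch (every database entry and value below is invented; only the shape mirrors the text: a lookup keyed by the five id values, and a later NO_ASTC bit revoking an earlier TEXTURE_ASTC bit):

```python
# Toy model of the two approaches described above. The entries are
# invented; only the structure is real: the hwdb is keyed by
# (model, revision, product id, eco id, customer id), and a later
# NO_ASTC feature bit can revoke an earlier TEXTURE_ASTC bit.
HWDB = {
    #  model,  rev,   product, eco, customer
    (0x7000, 0x6214, 0x70003, 0, 0): {"TEXTURE_ASTC": True, "NO_ASTC": True},
    (0x3000, 0x5450, 0x00000, 0, 0): {"TEXTURE_ASTC": True, "NO_ASTC": False},
}

def lookup(model, revision, product_id, eco_id, customer_id):
    return HWDB.get((model, revision, product_id, eco_id, customer_id))

def tex_astc(features):
    # mirrors: VIV_FEATURE(..., TEXTURE_ASTC) && !VIV_FEATURE(..., NO_ASTC)
    return features["TEXTURE_ASTC"] and not features["NO_ASTC"]
```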
Challenge accepted

Tomeu Vizoso began to work on teflon and a Neural Processing Unit (NPU) driver within Mesa, leveraging a significant amount of the existing codebase and concepts, including the same kernel driver for the GPU. During this process, he encountered a need for some NPU-specific limit values. To address this, he added an in-kernel hwdb entry and made the limit values accessible to user space. That’s it — the kernel supplies all the values the NPU driver requires. We’re finished, aren’t we? It turns out that there are many more NPU related values that need to be exposed in the same manner, with seemingly no end in sight. One of the major drawbacks when the hardware database (hwdb) resides in the kernel is the considerable amount of time it takes for hwdb patches to be written, reviewed, and eventually merged into Linus’s git tree. This significantly slows down the development of user space drivers. For end users, this means they must either run a bleeding-edge kernel or backport the necessary changes on their own. For me personally, the in-kernel hardware database should never have been implemented in its current form. If I could go back in time, I would have voiced my concerns. As a result, moving the hardware database (hwdb) to user space quickly became a top priority on my to-do list, and I began working on it. However, during the testing phase of my proof of concept (PoC), I had to pause my work due to a kernel issue that made it unreliable for user space to trust the ID values provided by the kernel. Once my fix for this issue began to be incorporated into stable kernel versions, it was time to finalize the user space hwdb. There is only one little but important detail we have not talked about yet. There are vendor specific versions of gc_feature_database.h based on different versions of the binary blob. For instance, there is one from NXP, ST, Amlogic and some more.
Here is a brief look at the differences:
- nxp/gc_feature_database.h (autogenerated at 2023-10-24 16:06:00, 861 struct members, 27 entries)
- stm/gc_feature_database.h (autogenerated at 2022-12-29 11:13:00, 833 struct members, 4 entries)
- amlogic/gc_feature_database.h (autogenerated at 2021-04-12 17:20:00, 733 struct members, 8 entries)
We understand that these header files are generated and adhere to a specific structure. Therefore, all we need to do is write an intelligent Python script capable of merging the struct members into a single consolidated struct. This script will also convert the old struct entries to the new format and generate a header file that we can use. I’m consistently amazed by how swiftly and effortlessly Python can be used for such tasks. Ninety-nine percent of the time, there’s a ready-to-use Python module available, complete with examples and some documentation. To address the C header parsing challenge, I opted for pycparser. The final outcome is a generated hwdb.h file that looks and feels similar to those generated from the binary blob.

Future proof

This header merging approach offers several advantages:
- It simplifies the support for another SoC vendor.
- There’s no need to comprehend the significance of each feature bit. The source header files are supplied by VeriSilicon or the SoC vendor, ensuring accuracy.
- Updating the hwdb is straightforward — simply replace the files and rebuild Mesa.
- It allows for much quicker deployment of new features and hwdb updates since no kernel update is required. This method accelerates the development of user space drivers.
While working on this topic I decided to do a bigger refactoring with the end goal to provide a struct etna_core_info that is located outside of the gallium driver. This makes the code future proof and moves the filling of struct etna_core_info directly into the lowest layer - libetnaviv_drm (src/etnaviv/drm). We have not yet talked about one important detail.
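As a quick aside before getting to that detail: the merge step can be sketched roughly like this (heavily simplified; the real script parses the vendor headers with pycparser, and all member names below are invented):

```python
# Heavily simplified sketch of the header merge. The real script parses
# the vendor headers with pycparser; the member lists here are invented.
# Each vendor header defines the same struct with a different subset of
# members; their union, in a stable order, becomes the consolidated struct.
nxp = ["chipModel", "chipRevision", "REG_FastClear", "NN_CORE_COUNT"]
stm = ["chipModel", "chipRevision", "REG_FastClear"]
amlogic = ["chipModel", "chipRevision", "REG_Pipe2D"]

def merge_members(*member_lists):
    merged = []
    for members in member_lists:
        for m in members:
            if m not in merged:
                merged.append(m)
    return merged

def convert_entry(values, merged):
    # old entries get a 0 default for members their header didn't know about
    return {m: values.get(m, 0) for m in merged}

MERGED = merge_members(nxp, stm, amlogic)
```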
What happens if there is no entry in the user space hwdb? The solution is straightforward: we fall back to the previous method and request all feature words from the kernel driver. However, in an ideal scenario, our user space hardware database should supply all necessary entries. If you find that an entry for your GPU/NPU is missing, please get in touch with me.

What about the in-kernel hwdb?

The existing system, despite its limitations, is set to remain indefinitely, with new entries being added to accommodate new GPUs. Although it will never contain as much information as the user space counterpart, this isn’t necessarily a drawback. For the purposes at hand, only a handful of feature bits are required.
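The fallback logic can be sketched as follows (all names invented for illustration):

```python
# Sketch of the fallback described above (names invented): consult the
# user space hwdb first; if the GPU/NPU is unknown, fall back to
# requesting the raw feature words from the kernel driver.
def query_features(key, userspace_hwdb, kernel_feature_words):
    entry = userspace_hwdb.get(key)
    if entry is not None:
        return entry                      # preferred: user space hwdb
    return kernel_feature_words(key)      # legacy path via the kernel

USERSPACE_HWDB = {(0x7000, 0x6214): {"source": "userspace hwdb"}}

def kernel_feature_words(key):
    return {"source": "kernel feature words"}
```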
  • Mike Blumenkrantz: Quick Post (2024/04/12 00:00)
    Super Fast Just a quick post to let everyone know that I have clicked merge on the vroom MR. Once it lands, you can test the added performance gains with ZINK_DEBUG=ioopt. I’ll be enabling this by default in the next month or so once a new GL CTS release happens that fixes all the hundreds of broken tests which would otherwise regress. With that said, I’ve tested it on a number of games and benchmarks, and everything works as expected. Have fun.
  • Mike Blumenkrantz: Descending (2024/04/04 00:00)
    Into The Spiral of Madness I know what you’re all thinking: there have not been enough blog posts this year. As always, my highly intelligent readers are right, and as always, you’re just gonna have to live with that because I’m not changing the way anything works. SGC happens when it happens. And today. As it snows in April. SGC. Is. Happening. Let’s begin. In The Beginning, A Favor Was Asked I was sitting at my battlestation doing some very ordinary REDACTED work for REDACTED, and friend of the blog, Samuel “Shader Objects” Pitoiset (he has legally changed his name, please be respectful), came to me with a simple request. He wanted to enable VK_EXT_shader_object for the radv-zink jobs in mesa CI as the final part of his year-long bringup for the extension. This meant that all the tests passing without shader objects needed to also pass with shader objects. This should’ve been easy; it was over a year ago that the Khronos blog famously and confusingly announced that pipelines were dead and nobody should ever use them again (paraphrased). A year is more than enough time for everyone to collectively get their shit together. Or so you might think. Turns out shader objects are hard. This simple ask sent me down a rabbithole the likes of which I had never imagined. It started normally enough. There were a few zink tests which failed when shader objects were enabled. Nobody was surprised; I wrote the zink usage before validation support had landed and also before anything but lavapipe supported it. As everyone is well aware, lavapipe is the best and most handsome Vulkan driver, and just by using it you eliminate all bugs that your application may have. RADV is not, and so there are bugs. 
A number of them were simple:
- invalid location assignment for patch variables
- harmless VVL spam about missing dynamic state setters
- Vulkan spec had broken corner cases
- reporting wrong value for max patch components
- also patch variables were differently broken in another way
The list goes on, and longtime followers of the blog are nodding to themselves as they skim the issues, confirming that they would have applied all the same one-liner fixes. Then it started to get crazy.

Locations, How Do They Work?

I’m a genius, so obviously I know how this all works. That’s why I’m writing this blog. Right? Right. Good. So Samuel comes to me, and he hits me with this absolute brainbuster of an issue. An issue so tough that I have to perform an internet search to find a credible authority on the topic. I found this amazing and informative site that exactly described the issue Samuel had posted. I followed the staggering intellect of the formidable author and blah blah blah yeah obviously the only person I’d find writing about an issue I have to solve is past-me who was too fucking lazy to actually solve it. I started looking into this more deeply after taking a moment to fix a different issue related to location assignment that Samuel was too lazy to file a ticket for and thus has deprived the blog of potential tests that readers could run to examine and debug the issue for themselves. But the real work was happening elsewhere.

Deeper

Now we’re getting to the good stuff. I hope everyone has their regulation-thickness safety helmet strapped on and splatter guards raised to full height because you’ll need them both. As I said in Adventures In Linking, nir_assign_io_var_locations is the root of all evil. In the case where shaders have mismatched builtins, the assigned locations are broken. I decided to take the hammer to this. I mean I took the forbidden action, did the very thing that I railed about live at XDC.
Sidebar: at this exact moment, Samuel told me his issue was already fixed. I added a new pipe cap. I know. It was a last resort, but I wanted the issue fixed. The result was this MR, which gave nir_assign_io_var_locations the ability to ignore builtins with regard to assigning locations. This would resolve the issue once and for all, as drivers which treat builtins differently could pass the appropriate param to the NIR pass and then get good results. Problem solved. Deeper. I got some review comments which were interesting, but ultimately the problem remained: lavapipe (and maybe some other vulkan drivers) use this pass to assign locations, and no amount of pipe caps will change that. It was a tough problem to solve, but someone had to do it. That’s why I dug in and began examining this MR from the only man who is both a Mesa expert and a Speed Force user, Marek Olšák, to enable his new NIR optimized linker for RadeonSI. This was a big, meaty triangles-go-brrr thing to sink my teeth into. I had to get into a different headspace to figure out what I was even doing anymore. The gist of opt_varyings is that you give all the shaders in a pipeline to Marek, and Marek says “trust me, buddy, this is gonna be way faster” and gives you back new shaders that do the same thing except only the vertex shader actually has any code. Read the design document if you want more info. Now I’m deep into it though, and I’m reading the commits, and I see there’s this new lower_mediump_io callback which lowers mediump I/O to 16bit. Which is allowed by GLSL. And I use GLSL, so naturally I could do this too. And I did, and I ran it in zink, and I put it through CTS and OH FUCK OH SHIT OH FUCK WHAT THE FUCK EVEN– mediump? More Like… Like… Medium… Stupid. Here’s the thing. In GLSL, you can have mediump I/O which drivers can translate to mean 16bit I/O, and this works great. 
In Vulkan, we have this knockoff brand, dumpster tier VK_KHR_16bit_storage extension which seems like it should be the same, except for one teeny tiny little detail:

• VUID-StandaloneSpirv-Component-04920: The Component decoration value must not be greater than 3

Brilliant. So I can have up to four 16bit components at a given location. Two whole dwords. Very useful. Great. Just what I wanted. Thanks. Also, XFB is a thing, and, well, pardon my saying so, but mediump xfb? Fuck right off.

Next Up: IO Lowering—FRONTEND EDITION

With mediump safely ejected from the codebase and my life, I was free to pursue other things. I didn’t, but I was free to. And even with Samuel screaming somewhere distant that his issue was already long since fixed, I couldn’t stop. There were other people struggling to implement opt_varyings in their own drivers, and as we all know, half of driver performance is the speed with which they implement new features. That meant that, as expected, RadeonSI had a significant lead on me since I’m always just copying Marek’s homework anyway, but the hell if I was about to let some other driver copy homework faster than me. Fans of the blog will recall way, way, way, way back in Q3 ‘23 when I blogged about very dumb things. Specifically about how I was going to start using “lowered I/O” in zink. Well, I did that. And then I let the smoking rubble cool for a few months. And now it’s Q2 ‘24, and I’m older and unfathomably wiser, and I am about to put this rake into the wheel of my bicycle once more. In this case, the rake is nir_io_glsl_lower_derefs, which moves all the I/O lowering into the frontend rather than doing it manually. The result is the same: zink gets lowered I/O, and the only difference is that it happens earlier. It’s less code in zink, and… Of course there is no driver but RadeonSI which sets nir_io_glsl_lower_derefs. And, of course, RadeonSI doesn’t use any of the common Gallium NIR passes. But surely they’d still work.
Surely at least some of them would work. Surely there wouldn’t be that many of them. Surely the ones that didn’t work would be easy to fix. Surely they wouldn’t uncover any other, more complex, more time-consuming issues that would drag in the entire Mesa compiler ecosystem. Wouldn’t be worth mentioning at SGC if any of those were true, would it. SGC vs Old NIR Passes By now I was pretty deep into this project, which is to say that I had inexplicably vanished from several other tasks I was supposed to be accomplishing, and the only way out was through. But before I could delve into any of the legacy GL compatibility stuff, I had bigger problems. Namely, everything was exploding because I failed to follow the directions and was holding opt_varyings wrong. In the fine print, the documentation for the pass very explicitly says that lower_to_scalar must be set in the compiler options. But did I read the directions? Obviously I did. If you’re asking whether I read them comprehensively, however, or whether I remembered what I had read once I was deep within the coding fugue of fixing this damn bug Samuel had given me way back wh… With lower_to_scalar active, I actually came upon the big problem: my existing handling for lowered I/O was inadequate, and I needed to make my code better. Much better. Originally when I switched to lowered I/O, I wrote some passes to unclown I/O back to variables and derefs. There was one NIR pass that ran early on to generate variables based on the loads and stores, and there was a second that ran just before spirv translation to convert all the load/store intrinsics back to load/store derefs. This worked great. But it didn’t work great now! Obviously it wouldn’t, right? I mean, nothing in this entire compiler stack ever works, does it? It’s all just a giant jenga tower that’s one fat-finger away from total and utter—What? Oh, right, heh, yeah, no, I just got a little carried away remembering is all. No problem.
Let’s keep going. We have to now that we’ve already come this far. Don’t we? I’ll stop writing if you stop reading, how about that. No? Well, heh, of course it’d be that way! This is… We’re SGC! So I had this rework_io_vars function, and it. was. BIG. I’m talking over a hundred lines with loops and switches and all kinds of cool control flow to handle all the weird corner cases I found at 4:14am when I was working on it. The way that it worked was pretty simple: scan through the shader looking for loads/stores; using the load/store instruction’s component type/count, infer a variable; and pray that nothing with complex indirect access comes along. It worked great. Really, there were no known bugs. The problem with this came with the scalarized frontend I/O lowering, which would create patterns like: store(location=1, component_count=1) followed by store(location=0, component_count=1, array_size=4, array_offset=$val). In this scenario, there’s indirect access mixed with direct access for the same location, but it’s at an offset from the base of the array, and it kiiinda almost works except it totally doesn’t because the first instruction has no metadata hint about being part of the second instruction’s array. And since the pass iterates over the shader in instruction order, encountering the instructions in this order is a problem whereas encountering them in a different order potentially wouldn’t be a problem. I had two options available to me at that point. The first option was to add in some workarounds to enlarge the scalar to an array when encountering this pattern. And I tried that, and it worked. But then I came across a slightly different variant which didn't work. And that's when I chose the second option. Burn it all down. The whole thing. I mean, uh, just—just that one function. It’s not like I want to BURN THE WHOLE THING DOWN after staring into the abyss for so long, definitely not. The new pass! Right, the new pass.
The new rework_io_vars pass that I wrote is a sequence of operations that ends up being far more robust than the original. It works something like this: first, rely only on the shader_info masks, e.g., outputs_written and inputs_read; rework_io_vars is the base function, with special-casing for VS inputs and FS outputs to create variables for those builtins separately; with those done, check for the more common I/O builtins and create variables for those; now that all the builtins are done, scan for indirect access and create variables for that; finally, scan and create variables for ordinary, direct access. The “scan” process ends up being a function called loop_io_var_mask which iterates a shader_info mask for a given input/output mode and scans the shader for instructions which occur on each location for that mode. The gathered info includes a component mask as well as array size and fbfetch info–all that stuff. Everything needed to create variables. After the shader is scanned, variables are created for the given location. By processing the indirect mask first, it becomes possible to always detect the above case and handle it correctly. Problem solved. Problems Only Multiply But that’s fine, and I am so sane right now you wouldn’t believe it if I told you. I wrote this great, readable, bulletproof variable generator, and it’s tremendous, but then I tried using it without nir_io_glsl_lower_derefs because I value bisectability, and obviously there was zero chance that would ever work so why would I ever even bother. XFB is totally broken, and there’s all kinds of other weird failures that I started examining and then had to go stand outside staring into the woods for a while, and it’s just not happening. And nir_io_glsl_lower_derefs doesn’t work without the new version either, which means it’s gonna be impossible to bisect anything between the two changes. Totally fine, I’m sure, just like me.
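A toy version of that ordering trick (my own simplification in C; the names, masks, and var_slot layout are made up, and the real pass walks actual NIR instructions):

```c
/* Sketch (my own simplification, not the actual zink code) of why
 * processing indirectly-accessed locations first matters: a direct
 * store to a location inside an array's range must be recognized as
 * part of that array instead of spawning its own scalar variable. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_LOCATIONS 64

struct var_slot {
    bool created;
    bool is_array;
    int array_size;
};

/* Pretend shader_info-style bitmasks: which locations are written at
 * all, and which of those are reached through indirect (array) access. */
static void create_vars(uint64_t outputs_written, uint64_t indirect_mask,
                        const int *array_size, struct var_slot *slots)
{
    /* Pass 1: indirect access -> create array variables covering a range. */
    uint64_t mask = indirect_mask;
    while (mask) {
        int loc = __builtin_ctzll(mask);
        mask &= mask - 1;
        slots[loc].created = true;
        slots[loc].is_array = true;
        slots[loc].array_size = array_size[loc];
        /* Mark the whole range as covered so pass 2 skips it. */
        for (int i = 1; i < array_size[loc] && loc + i < MAX_LOCATIONS; i++)
            slots[loc + i].created = true;
    }

    /* Pass 2: plain direct access -> ordinary variables, but only for
     * locations not already swallowed by an array above. */
    mask = outputs_written;
    while (mask) {
        int loc = __builtin_ctzll(mask);
        mask &= mask - 1;
        if (!slots[loc].created)
            slots[loc].created = true;
    }
}
```

With the indirect pass run first, a lone direct store to location 1 lands inside a location-0 array's range and gets absorbed instead of spawning a bogus scalar variable, which is the failure mode the old instruction-order walk tripped over.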
By now, I had a full stack of zink compiler cleanups and fixes that I’d accumulated in the course of all this. Multiple stacks, really. So many stacks. Fortunately I was able to slip them into the repo without anyone noticing. And also without CI slowing to a crawl due to the freedreno farm yet again being in an absolute state. I was passing CTS again, which felt great. But then I ran piglit, and I remembered that I had skipped over all those Gallium compatibility passes. And I definitely had to go in and fix them. There were a lot of these passes to fix, and nearly all of them had the same two issues: they only worked with derefs, and they didn’t work with scalarized I/O. This meant I had to add handling for lowered I/O without variables, and then I also had to add generic handling for scalarized versions of both codepaths. Great, great, great. So I did that. And one of them really needed a lot of work, but most of the others were reasonably straightforward. And then there’s lower_clip. lower_clip is a pass that rewrites shaders to handle user-specified clip planes when the underlying driver doesn’t support them. The pass does this by leveraging clipdistance. And here’s the thing about clipdistance: unlike the other builtins, it’s an array. But it’s treated like a vector. Except you can still access it indirectly like an array. So is it an array or is it a vector? Decades from now, graphics engineers will still be arguing about this stupidity, but now is the time when I need to solve this, and it’s not something that I, as a mere, singular human, can possibly solve. Hah! There’s no way I’d be able to do that. I’d have to be crazy. And I’m… Uh-oh, what’s the right way to finish that statement? It’s probably fine! Everything’s fine! But when you’ve got an array that’s treated like a vector that’s really an array, things get confusing fast, and in NIR there’s the compact flag to indicate that you need to reassess your life choices.
One of those choices needing reassessment is the use of nir_shader_gather_info, a simple function that populates shader_info with useful metadata after scanning the shader. And here’s a pop quiz that I’m sure everyone can pass with ease after reading this far. How many shader locations are consumed by gl_ClipDistance? Simple question, right? It’s a variably-sized float[] array-vector with up to 8 members, so it consumes up to two locations. Right? No, that’s a question, not a rhetorical—But you’re using nir_shader_gather_info, and it sees gl_ClipDistance, okay, so how many slots do you expect it to add to your outputs_written bitmask? Is it 8? Or is it 2? Does anybody really know? Regardless of what you thought, the answer is 8, and you’ll get 8, and you’ll be happy with 8. And if you’re trying to use outputs_written for anything, and you see any of the other builtins within 8 slots of gl_ClipDistance being used, then you should be able to just figure out that this is clipdistance playing pranks again. Right? “It’s all fun and games until someone gets too deep into clipdistance” is a proverb oft-repeated among compiler developers. Personally, I went back and forth until I cobbled together something to sort of almost fix the problem, but I posed the issue to the community at large, and now we have plans with headings and subheadings. You’re welcome. And that’s the end of it, right? Nope The problem with going in and fixing anything in core Mesa is that you end up breaking everything else. So while I was off fixing Gallium compatibility passes, specifically lower_clip, I ended up breaking freedreno and v3d. Someday maybe we’ll get to the bottom of that. But I’m fast-forwarding, because while I was working on this… What even is this anymore? Right, I was fixing Samuel’s bug. The one about not using opt_varyings. So I had my variable generator functioning, and I had the compat passes working (for me), and CTS and piglit were both passing.
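For the record, here are the two counts from that quiz side by side (a toy C sketch of my own, not nir_shader_gather_info itself): slot math says a float[8] gl_ClipDistance packs into two vec4 locations, while element-per-bit accounting burns eight bits of outputs_written.

```c
/* Two ways to count gl_ClipDistance (my own sketch, not Mesa code). */
#include <assert.h>

#define COMPONENTS_PER_SLOT 4

/* Locations actually consumed: float elements packed 4-per-vec4 slot,
 * rounded up. */
static int clipdist_slots(int array_len)
{
    return (array_len + COMPONENTS_PER_SLOT - 1) / COMPONENTS_PER_SLOT;
}

/* Bits a gather-info-style scan marks in outputs_written: one per
 * array element, not one per slot. */
static int clipdist_mask_bits(int array_len)
{
    return array_len;
}
```

Same array, two answers, and anything consuming outputs_written has to know which accounting it's looking at.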
Then I decided to try out nir_io_glsl_opt_varyings. Just a little. Just to see what happened. I don’t have any more jokes here. It didn’t work good. A lot of things went boom-boom. There were some opt_varyings bugs like these, and some related bugs like this, and there was missing core NIR stuff for zink, and there were GLSL bugs, and also CTS was broken. Also a bunch of the earlier zink stacks of compiler patches were fixing bugs here. But eventually, over weeks, it started working. The Deepest Depths Other than verifying everything still works, I haven’t tested much. If you’re feeling brave, try out the MR with dependencies (or wait for rebase) and tell me how the perf looks. So far, all I’ve seen is about a 6000% improvement across the board. Finally, it’s over. Samuel, your bug is fixed. Never ask me for anything again.
  • Maira Canal: Linux 6.8: AMD HDR and Raspberry Pi 5 (2024/04/02 11:00)
The Linux kernel 6.8 came out on March 10th, 2024, bringing brand-new features and plenty of performance improvements across different subsystems. As part of Igalia, I’m happy to have been an active part of many of the features released in this version, and today I’m going to review some of them. Linux 6.8 is packed with a lot of great features, performance optimizations, and new hardware support. In this release, we get experimental support for the Intel Xe DRM driver, further support for AMD Zen 5 and other upcoming AMD hardware, initial support for the Qualcomm Snapdragon 8 Gen 3 SoC, the Imagination PowerVR DRM kernel driver, support for the Nintendo NSO controllers, and much more. Igalia is widely known for its contributions to Web Platforms, Chromium, and Mesa. But we also make significant contributions to the Linux kernel. This release shows some of the great work that Igalia is putting into the kernel and strengthens our desire to keep working with this great community. Let’s take a deep dive into Igalia’s major contributions to the 6.8 release: AMD HDR & Color Management You may have seen the release of a new Steam Deck last year, the Steam Deck OLED. What you may not know is that Igalia helped bring this product to life by putting some effort into the AMD driver-specific color management properties implementation. Melissa Wen, together with Joshua Ashton (Valve) and Harry Wentland (AMD), implemented several driver-specific properties to allow Gamescope to manage color features provided by the AMD hardware to fit HDR content and improve gamers’ experience.
She has explained all the features implemented in the AMD display kernel driver in two blog posts and a 2023 XDC talk: AMD Driver-specific Properties for Color Management on Linux (Part 1) AMD Driver-specific Properties for Color Management on Linux (Part 2) The Rainbow Treasure Map Talk: Advanced color management on Linux with AMD/Steam Deck Async Flip André Almeida worked together with Simon Ser (SourceHut) to provide support for asynchronous page-flips in the atomic API. This feature targets users who want to present a new frame immediately, even after missing a V-blank. This feature is particularly useful for applications with high frame rates, such as gaming. Raspberry Pi 5 The Raspberry Pi 5 was officially released in October 2023, and Igalia was ready to bring top-notch graphics support to it. Although we still can’t use the RPi 5 with the mainline kernel, it is superb to see some pieces coming upstream. Iago Toral worked on implementing all the kernel support needed for the V3D 7.1.x driver. With the kernel patches, by the time the RPi 5 was released, it already included a fully compliant OpenGL ES 3.1 and Vulkan 1.2 driver implemented by Igalia. GPU stats and CPU jobs for the Raspberry Pi 4/5 Apart from the release of the Raspberry Pi 5, Igalia is still working on improving the whole Raspberry Pi environment. I worked together with José Maria “Chema” Casanova on implementing support for GPU stats on the V3D driver. This means that RPi 4/5 users can now access the GPU usage percentage, and they can access the statistics per process or globally. I also worked together with Melissa on implementing CPU jobs for the V3D driver. As the Broadcom GPU isn’t capable of performing some operations, the Vulkan driver uses the CPU to compensate. In order to avoid stalls in job submission, CPU jobs are now part of the kernel and can be easily synchronized through synchronization objects.
If you are curious about the CPU job implementation, you can check this blog post. Other Contributions & Fixes Sometimes we don’t contribute to a major feature in a release, but we can still help by improving documentation and sending fixes. André also contributed to this release by documenting the different AMD GPU reset methods, making them easier for future users to understand. During Igalia’s efforts to improve the general user experience on the Steam Deck, Guilherme G. Piccoli noticed a message in the kernel log and readily provided a fix for this PCI issue. Outside of the Steam Deck world, we can check some of Igalia’s work on the Qualcomm Adreno GPUs. Although most of our Adreno-related work happens in user-space, Danylo Piliaiev sent a couple of kernel fixes to the msm driver, fixing some hangs and some CTS tests. We also had contributions from our 2023 Igalia CE student, Nia Espera. Nia’s project was related to mobile Linux, and she managed to write a couple of patches to the kernel in order to add support for the OnePlus 9 and OnePlus 9 Pro devices. If you are a student interested in open-source and would like to have a first exposure to the professional world, check if we have openings for the Igalia Coding Experience. I was a CE student myself, and being mentored by an Igalian was an incredible experience. Check the complete list of Igalia’s contributions for the 6.8 release Authored (57): André Almeida (2) drm: Refuse to async flip with atomic prop changes drm/amd: Document device reset methods Danylo Piliaiev (2) drm/msm/a6xx: Add missing BIT(7) to REG_A6XX_UCHE_CLIENT_PF drm/msm/a690: Fix reg values for a690 Guilherme G.
Piccoli (1) PCI: Only override AMD USB controller if required Iago Toral Quiroga (4) drm/v3d: update UAPI to match user-space for V3D 7.x drm/v3d: fix up register addresses for V3D 7.x dt-bindings: gpu: v3d: Add BCM2712’s compatible drm/v3d: add brcm,2712-v3d as a compatible V3D device Maíra Canal (17) drm/v3d: wait for all jobs to finish before unregistering drm/v3d: Implement show_fdinfo() callback for GPU usage stats drm/v3d: Expose the total GPU usage stats on sysfs MAINTAINERS: Add Maira to V3D maintainers drm/v3d: Don’t allow two multisync extensions in the same job drm/v3d: Decouple job allocation from job initiation drm/v3d: Use v3d_get_extensions() to parse CPU job data drm/v3d: Create tracepoints to track the CPU job drm/v3d: Enable BO mapping drm/v3d: Create a CPU job extension for a indirect CSD job drm/v3d: Create a CPU job extension for the timestamp query job drm/v3d: Create a CPU job extension for the reset timestamp job drm/v3d: Create a CPU job extension to copy timestamp query to a buffer drm/v3d: Create a CPU job extension for the reset performance query job drm/v3d: Create a CPU job extension for the copy performance query job drm/v3d: Fix support for register debugging on the RPi 4 drm/v3d: Free the job and assign it to NULL if initialization fails Melissa Wen (27) drm/v3d: Remove unused function header drm/v3d: Move wait BO ioctl to the v3d_bo file drm/v3d: Detach job submissions IOCTLs to a new specific file drm/v3d: Simplify job refcount handling drm/v3d: Add a CPU job submission drm/v3d: Detach the CSD job BO setup drm/drm_mode_object: increase max objects to accommodate new color props drm/drm_property: make replace_property_blob_from_id a DRM helper drm/drm_plane: track color mgmt changes per plane drm/amd/display: add driver-specific property for plane degamma LUT drm/amd/display: explicitly define EOTF and inverse EOTF drm/amd/display: document AMDGPU pre-defined transfer functions drm/amd/display: add plane 3D LUT driver-specific 
properties drm/amd/display: add plane shaper LUT and TF driver-specific properties drm/amd/display: add CRTC gamma TF driver-specific property drm/amd/display: add comments to describe DM crtc color mgmt behavior drm/amd/display: encapsulate atomic regamma operation drm/amd/display: decouple steps for mapping CRTC degamma to DC plane drm/amd/display: reject atomic commit if setting both plane and CRTC degamma drm/amd/display: add plane shaper LUT support drm/amd/display: add plane shaper TF support drm/amd/display: add plane 3D LUT support drm/amd/display: fix documentation for dm_crtc_additional_color_mgmt() drm/amd/display: fix bandwidth validation failure on DCN 2.1 drm/amd/display: cleanup inconsistent indenting in amdgpu_dm_color drm/amd/display: fix null-pointer dereference on edid reading drm/amd/display: check dc_link before dereferencing Nia Espera (4) dt-bindings: iio: adc: qcom: Add Qualcomm smb139x arm64: dts: qcom: sm8350: Fix DMA0 address arm64: dts: qcom: pm8350k: Remove hanging whitespace arm64: dts: qcom: sm8350: Fix remoteproc interrupt type Signed-off-by (88): André Almeida (4) drm: Refuse to async flip with atomic prop changes drm: allow DRM_MODE_PAGE_FLIP_ASYNC for atomic commits drm: introduce DRM_CAP_ATOMIC_ASYNC_PAGE_FLIP drm/amd: Document device reset methods Danylo Piliaiev (2) drm/msm/a6xx: Add missing BIT(7) to REG_A6XX_UCHE_CLIENT_PF drm/msm/a690: Fix reg values for a690 Guilherme G. 
Piccoli (1) PCI: Only override AMD USB controller if required Iago Toral Quiroga (4) drm/v3d: update UAPI to match user-space for V3D 7.x drm/v3d: fix up register addresses for V3D 7.x dt-bindings: gpu: v3d: Add BCM2712’s compatible drm/v3d: add brcm,2712-v3d as a compatible V3D device Jose Maria Casanova Crespo (2) drm/v3d: Implement show_fdinfo() callback for GPU usage stats drm/v3d: Expose the total GPU usage stats on sysfs Maíra Canal (28) drm/v3d: wait for all jobs to finish before unregistering drm/v3d: update UAPI to match user-space for V3D 7.x drm/v3d: fix up register addresses for V3D 7.x dt-bindings: gpu: v3d: Add BCM2712’s compatible drm/v3d: add brcm,2712-v3d as a compatible V3D device drm/v3d: Implement show_fdinfo() callback for GPU usage stats drm/v3d: Expose the total GPU usage stats on sysfs MAINTAINERS: Drop Emma Anholt from all M lines. MAINTAINERS: Add Maira to V3D maintainers drm/v3d: Remove unused function header drm/v3d: Move wait BO ioctl to the v3d_bo file drm/v3d: Detach job submissions IOCTLs to a new specific file drm/v3d: Simplify job refcount handling drm/v3d: Don’t allow two multisync extensions in the same job drm/v3d: Decouple job allocation from job initiation drm/v3d: Add a CPU job submission drm/v3d: Use v3d_get_extensions() to parse CPU job data drm/v3d: Create tracepoints to track the CPU job drm/v3d: Detach the CSD job BO setup drm/v3d: Enable BO mapping drm/v3d: Create a CPU job extension for a indirect CSD job drm/v3d: Create a CPU job extension for the timestamp query job drm/v3d: Create a CPU job extension for the reset timestamp job drm/v3d: Create a CPU job extension to copy timestamp query to a buffer drm/v3d: Create a CPU job extension for the reset performance query job drm/v3d: Create a CPU job extension for the copy performance query job drm/v3d: Fix support for register debugging on the RPi 4 drm/v3d: Free the job and assign it to NULL if initialization fails Melissa Wen (43) drm/v3d: Remove unused function header 
drm/v3d: Move wait BO ioctl to the v3d_bo file drm/v3d: Detach job submissions IOCTLs to a new specific file drm/v3d: Simplify job refcount handling drm/v3d: Add a CPU job submission drm/v3d: Detach the CSD job BO setup drm/v3d: Create a CPU job extension for a indirect CSD job drm/drm_mode_object: increase max objects to accommodate new color props drm/drm_property: make replace_property_blob_from_id a DRM helper drm/drm_plane: track color mgmt changes per plane drm/amd/display: add driver-specific property for plane degamma LUT drm/amd/display: add plane degamma TF driver-specific property drm/amd/display: explicitly define EOTF and inverse EOTF drm/amd/display: document AMDGPU pre-defined transfer functions drm/amd/display: add plane HDR multiplier driver-specific property drm/amd/display: add plane 3D LUT driver-specific properties drm/amd/display: add plane shaper LUT and TF driver-specific properties drm/amd/display: add plane blend LUT and TF driver-specific properties drm/amd/display: add CRTC gamma TF driver-specific property drm/amd/display: add comments to describe DM crtc color mgmt behavior drm/amd/display: encapsulate atomic regamma operation drm/amd/display: add CRTC gamma TF support drm/amd/display: set sdr_ref_white_level to 80 for out_transfer_func drm/amd/display: mark plane as needing reset if color props change drm/amd/display: decouple steps for mapping CRTC degamma to DC plane drm/amd/display: add plane degamma TF and LUT support drm/amd/display: reject atomic commit if setting both plane and CRTC degamma drm/amd/display: add dc_fixpt_from_s3132 helper drm/amd/display: add HDR multiplier support drm/amd/display: add plane shaper LUT support drm/amd/display: add plane shaper TF support drm/amd/display: add plane 3D LUT support drm/amd/display: handle empty LUTs in __set_input_tf drm/amd/display: add plane blend LUT and TF support drm/amd/display: allow newer DC hardware to use degamma ROM for PQ/HLG drm/amd/display: copy 3D LUT settings from 
crtc state to stream_update drm/amd/display: add plane CTM driver-specific property drm/amd/display: add plane CTM support drm/amd/display: fix documentation for dm_crtc_additional_color_mgmt() drm/amd/display: fix bandwidth validation failure on DCN 2.1 drm/amd/display: cleanup inconsistent indenting in amdgpu_dm_color drm/amd/display: fix null-pointer dereference on edid reading drm/amd/display: check dc_link before dereferencing Nia Espera (4) dt-bindings: iio: adc: qcom: Add Qualcomm smb139x arm64: dts: qcom: sm8350: Fix DMA0 address arm64: dts: qcom: pm8350k: Remove hanging whitespace arm64: dts: qcom: sm8350: Fix remoteproc interrupt type Acked-by (4): Jose Maria Casanova Crespo (2) drm/v3d: Implement show_fdinfo() callback for GPU usage stats drm/v3d: Expose the total GPU usage stats on sysfs Maíra Canal (1) MAINTAINERS: Drop Emma Anholt from all M lines. Melissa Wen (1) MAINTAINERS: Add Maira to V3D maintainers Reviewed-by (30): André Almeida (1) drm: introduce DRM_CAP_ATOMIC_ASYNC_PAGE_FLIP Christian Gmeiner (1) drm/etnaviv: Convert to platform remove callback returning void Iago Toral Quiroga (20) drm/v3d: wait for all jobs to finish before unregistering drm/v3d: Remove unused function header drm/v3d: Move wait BO ioctl to the v3d_bo file drm/v3d: Detach job submissions IOCTLs to a new specific file drm/v3d: Simplify job refcount handling drm/v3d: Don’t allow two multisync extensions in the same job drm/v3d: Decouple job allocation from job initiation drm/v3d: Add a CPU job submission drm/v3d: Use v3d_get_extensions() to parse CPU job data drm/v3d: Create tracepoints to track the CPU job drm/v3d: Detach the CSD job BO setup drm/v3d: Enable BO mapping drm/v3d: Create a CPU job extension for a indirect CSD job drm/v3d: Create a CPU job extension for the timestamp query job drm/v3d: Create a CPU job extension for the reset timestamp job drm/v3d: Create a CPU job extension to copy timestamp query to a buffer drm/v3d: Create a CPU job extension for the reset 
performance query job drm/v3d: Create a CPU job extension for the copy performance query job drm/v3d: Fix support for register debugging on the RPi 4 drm/v3d: Free the job and assign it to NULL if initialization fails Maíra Canal (4) drm/v3d: update UAPI to match user-space for V3D 7.x drm/v3d: fix up register addresses for V3D 7.x dt-bindings: gpu: v3d: Add BCM2712’s compatible drm/v3d: add brcm,2712-v3d as a compatible V3D device Melissa Wen (4) drm/v3d: Implement show_fdinfo() callback for GPU usage stats drm/v3d: Expose the total GPU usage stats on sysfs drm/v3d: Fix missing error code in v3d_submit_cpu_ioctl() drm/amd/display: fix documentation for amdgpu_dm_verify_lut3d_size() Tested-by (1): Guilherme G. Piccoli (1) pstore/ram: Fix crash when setting number of cpus to an odd number
  • Hari Rana: Coming Out as Trans (2024/03/31 00:00)
Vocabularies Before I delve into my personal experience, allow me to define several key terms: Sex: Biological characteristics of males and females. Gender: Social characteristics of men and women, such as norms, roles, and behaviors. Gender identity: How you personally view your own gender. Gender dysphoria: Sense of unease due to a mismatch between gender identity and sex assigned at birth. Transgender (trans): When one’s gender identity differs from the sex assigned at birth. If someone’s gender identity is woman but their sex assigned at birth is male, then they are generally considered a trans person. Cisgender (cis): The opposite of transgender; when the gender identity fits with the sex assigned at birth. Non-binary: Anything that is not exclusively male or female. Imagine if male and female were binary numbers: male is 0 and female is 1. Anything that is not 0 or 1 is considered non-binary. If I see myself as the number 0.5 or 2, then I’m non-binary. Someone who considers themself to be between a man and woman would be between 0 and 1 (e.g. 0.5). Agender: Under the umbrella of non-binary; it essentially means non-gendered (lack of gender identity) or gender neutral. Whichever definition applies varies from person to person. It’s also worth noting that many agender people don’t consider themselves trans. Label: Portraying which group you belong to, such as “non-binary”, “transfemme” (trans (binary and non-binary) people who are feminine), etc. Backstory Allow me to share a little backstory. I come from a neighborhood where being anti-LGBTQ+ was considered “normal” a decade ago. This outlook was quite common in the schools I attended, but I wouldn’t be surprised if a considerable portion of the people around here are still anti-LGBTQ+ today. Many individuals, including former friends and teachers, have expressed their opposition to LGBTQ+ in the past, which influenced my own view against the LGBTQ+ community at the time.
Due to my previous experiences and the environment I live(d) in, I tried really hard to avoid thinking about my sexuality and gender identity for almost a decade. Every time I thought about my sexuality and gender identity, I’d do whatever I could to distract myself. I kept forcing myself to be as masculine as possible. However, since we humans have our limits, I eventually reached a limit to the number of thoughts I could suppress. I always struggled with communicating and almost always felt lonely whenever I was around the majority of people, so I pretended to be “normal” and hid my true feelings. About 5 years ago, I began to spend most of my time online. I met people who are just like me, many of whom I’m still friends with 3-4 years later. At the time, despite my strong biases against LGBTQ+ from my surroundings, I naturally felt more comfortable within the community, far more than I did outside. I was able to express myself more freely and have people actually understand me. It was the only time I didn’t feel the need to act masculine. However, despite all this, I was still in the mindset of suppressing my feelings. Truly an egg irl moment Eventually, I was unable to hold my thoughts anymore, and everything exploded. All I could think about for a few months was my gender identity: my biases from my childhood environment often clashed with me questioning my own identity, and whether I really saw myself as a man. I just had these recurring thoughts and a lot of anxiety about where I was getting these thoughts from, and why. Since then, my work performance got exponentially worse by the week. I quickly lost interest in my hobbies, and began to distance myself from communities and friends. I often lashed out at people because my mental health was getting worse. My sleep quality was also getting worse, which only worsened the situation. On top of that, I still had to hide my feelings, which continued to exhaust me.
All I could think about for months was my gender identity. After I slowly became comfortable with and accepting of my gender identity, I started having suicidal thoughts on a daily basis, which I was able to endure… until I reached a breaking point once again. I was having suicidal thoughts on a bi-hourly basis. It escalated to hourly, and finally almost 24/7. I obviously couldn’t work anymore, nor could I do my hobbies. I needed to hide my pain because of my social anxiety. I didn’t have the courage to call the suicide hotline either. What happened was that I talked to many people, some of whom encouraged and even helped me seek professional help. However, that was all in the past. I feel much better and more comfortable with myself and the people I opened up to, and now I’m confident enough to share it publicly 😊 Coming Out‎ ‎🏳️‍⚧️ I identify as agender. My pronouns are any/all — I’ll accept any pronouns. I don’t think I have a preference, so feel free to call me whatever you want; whatever you think fits me best :) I’m happy with agender because I feel disconnected from my own masculinity. I don’t think I belong at either end of the spectrum (or even in between), so I’m pretty happy that there is something that best describes me. Why the Need to Come Out Publicly? So… why come out publicly? Why am I making a big deal out of this? Simply put, I am really proud and relieved to have discovered myself. For so long, I tried to suppress my thoughts and force myself to be someone I fundamentally was not. While that never worked, I explored myself instead and discovered that I’m trans. However, I also wrote this article to explain how much living in a transphobic environment affected me, even before I discovered myself. For me, displaying my gender identity is like displaying a username or profile picture. We choose a username and profile picture when possible to give a glimpse of who we are.
I chose “TheEvilSkeleton” as my username because I used to play Minecraft regularly when I was 10 years old. While I don’t play Minecraft anymore, it helped me discover my passion: creating and improving things and working together — that’s why I’m a programmer and contribute to software. I chose Chrome-chan as my profile picture because I think she is cute and I like cute things :3. I highly value my username and profile picture, the same way I now value my gender identity.

Am I Doing Better?

While I’m doing much better than before, I did go through a depressive episode that I’m still recovering from at the time of writing, and I’m still processing the discovery because of my childhood environment. Even so, I certainly feel much better after discovering myself and coming out. However, coming out won’t magically heal the trauma from my childhood environment. It won’t make everyone around me accept who I am, or even make them feel comfortable around me. It won’t drop the amount of harassment I receive online to zero — if anything, I write this with the expectation that I will be harassed and discriminated against more than ever. There will be new challenges to face, I still have to deal with the existing trauma, and I may have to deal with new trauma in the future. The best thing I can do is train myself to be mentally resilient. I certainly feel much better coming out, but I’m still worried about the future. I sometimes wish I wasn’t trans, because I’m genuinely terrified about the things people have gone through in the past, and are still going through right now. I know I’m going to have to fight for my life now that I’ve come out publicly, because apparently the right to live as yourself is still controversial in 2024.

Seeking Help

Of course, I wasn’t alone in my journey. What helped me get through it was talking to my friends and seeking help in other places. I came out to several of my friends in private.
They were supportive and listened to me vent; they reassured me that there’s nothing wrong with me, and congratulated me for discovering myself and coming out. Some of my friends encouraged and helped me seek professional help at local clinics for my depression. I have gained more confidence in myself; I am now capable of calling clinics by myself, even when I’m nervous. If these suicidal thoughts escalate again, I will finally have the courage to call the suicide hotline. If you’re feeling anxious about something, don’t hesitate to talk to your friends about it. Unless you know that they’ll take it the wrong way and/or are currently dealing with personal issues, they will be more than happy to help. I have messaged so many people in private and felt much better after talking. I’ve never felt so comforted by friends who try their best to be there for me. Some friends have listened without saying anything, while others have shared their experiences with me. Both were extremely valuable to me, because sometimes I just want (and need) to be heard and understood. If you’re currently trying to suppress your thoughts and force yourself into the gender you were assigned at birth, like I was, the best advice I can give you is to give yourself time to explore yourself. It’s perfectly fine to acknowledge that you’re not cisgender (that is, if you’re not). You might want to ask your trans friends to help you explore yourself. From experience, it’s not worth forcing yourself to be someone you’re not.

Closing Thoughts

I feel relieved about coming out, but to be honest, I’m still really worried about the future of my mental health. I really hope that everything will work out and that I’ll be more mentally resilient. I’m really happy that I had the courage to take the first steps, to go to clinics, to talk to people, to open up publicly. It’s been really difficult for me to write and publish this article.
I’m really grateful to have wonderful friends, and legitimately, I couldn’t ask for better friends.
  • Christian Schaller: Fedora Workstation 40 – what are we working on (2024/03/28 18:56)
So Fedora Workstation 40 Beta has just come out, so I thought I’d share a bit about some of the things we are currently working on for Fedora Workstation, and also major changes coming in from the community.

Flatpak

Flatpaks have been a key part of our strategy for desktop applications for a while now, and we are working on a multitude of things to make Flatpaks an even stronger technology going forward. Christian Hergert is working on figuring out how applications that require system daemons will work with Flatpaks, using his own Sysprof project as the proof-of-concept application. The general idea here is to rely on the work that has happened in systemd around sysext/confext/portablectl, trying to figure out how we can get a system service installed from a Flatpak and the necessary bits wired up properly. The other part of this work, figuring out how to give applications permissions that today are handled with udev rules, is being worked on by Hubert Figuière based on earlier work by Georges Stavracas on behalf of the GNOME Foundation, thanks to sponsorship from the Sovereign Tech Fund. So hopefully we will get both of these important issues resolved soon. Kalev Lember is working on polishing up the Flatpak support in Foreman (and Satellite) to ensure there are good tools for managing Flatpaks when you have a fleet of systems to manage, building on the work of Stephan Bergman. Finally, Jan Horak and Jan Grulich are working hard on polishing up the experience of using Firefox from a fully sandboxed Flatpak. This work is mainly about working with the upstream community to get some needed portals over the finish line and polishing up some UI issues in Firefox, like this one.

Toolbx

Toolbx, our project for handling developer containers, is picking up pace, with Debarshi Ray currently working on getting full NVIDIA binary driver support for the containers.
One of our main goals for Toolbx at the moment is making it a great tool for AI development, and thus getting the NVIDIA & CUDA support squared away is critical. Debarshi has also spent quite a lot of time cleaning up the Toolbx website, providing easier access to the documentation there and updating it. We are also moving to the new Ptyxis (formerly Prompt) terminal application created by Christian Hergert in Fedora Workstation 40. This gives us a great GTK4 terminal, and we also believe we will be able to further integrate Toolbx and Ptyxis going forward, creating an even better user experience.

Nova

As you probably know, we have been the core maintainers of the Nouveau project for years, keeping this open source upstream NVIDIA GPU driver alive. We plan to keep doing that, but the opportunities offered by the availability of the new GSP firmware for NVIDIA hardware mean we should now be able to offer a full-featured and performant driver. However, co-hosting both the old and the new way of doing things in the same upstream kernel driver has turned out to be counterproductive, so we are now looking to split the driver in two. For older pre-GSP NVIDIA hardware we will keep the old Nouveau driver around as-is. For GSP-based hardware we are launching a new driver called Nova. It is important to note that Nova is thus not a competitor to Nouveau, but a continuation of it. The idea is that the new driver will be primarily written in Rust, based on work already done in the community. We are also evaluating whether some of the existing Nouveau code should be copied into the new driver, since we already spent quite a bit of time trying to integrate GSP there. Worst case, if we can’t reuse code, we use the lessons learned from Nouveau with GSP to implement the support in Nova more quickly. Contributing to this effort from our team at Red Hat are Danilo Krummrich, Dave Airlie, Lyude Paul, Abdiel Janulgue and Phillip Stanner.
Explicit Sync and VRR

Another exciting development that has been a priority for us is explicit sync, which is critical especially for the NVIDIA driver, but which might also provide performance improvements for other GPU architectures going forward. So a big thank you to Michel Dänzer, Olivier Fourdan, Carlos Garnacho, the NVIDIA folks, Simon Ser and the rest of the community for working on this. This work has just finished upstream, so we will look at backporting it into Fedora Workstation 40. Another major Fedora Workstation 40 feature is experimental support for Variable Refresh Rate (VRR) in GNOME Shell. The feature was mostly developed by community member Dor Askayo, but Jonas Ådahl, Michel Dänzer, Carlos Garnacho and Sebastian Wick have all contributed code reviews and fixes. In Fedora Workstation 40 you need to enable it using the command:

gsettings set org.gnome.mutter experimental-features "['variable-refresh-rate']"

PipeWire

I already covered PipeWire in my post a week ago, but to quickly summarize here too: using PipeWire for video handling is now finally getting to the stage where it is actually happening. Both Firefox and OBS Studio now come with PipeWire support, and hopefully we can also get Chromium and Chrome to start taking a serious look at merging the patches for this soon. What’s more, Wim spent time fixing FireWire FFADO bugs, so hopefully for our pro-audio community users this makes their FireWire equipment fully usable and performant with PipeWire. Wim did point out when I spoke to him, though, that the FFADO drivers had obviously never had any consumer other than JACK, so when he tried to allow for more functionality the drivers quickly broke down. Wim has therefore limited the feature set of the PipeWire FFADO module to be an exact match of how these drivers were being used by JACK. If the upstream kernel maintainer is able to fix the issues found by Wim, then we could look at providing a fuller feature set.
In Fedora Workstation 40 the de-duplication support for V4L vs libcamera devices should work as soon as we update WirePlumber to the new 0.5 release. To hear more about PipeWire and the latest developments, be sure to check out this interview with Wim Taymans by the good folks over at Destination Linux.

Remote Desktop

Another major feature landing in Fedora Workstation 40, which Jonas Ådahl and Ray Strode have spent a lot of effort on, is finalizing the remote desktop support for GNOME on Wayland. There has been support for remote connections to already-logged-in sessions for a while, but with these updates you can do the login remotely too, so the session does not need to already be started on the remote machine. This work will also enable 3rd-party solutions to do remote logins on Wayland systems, so while I am not at liberty to mention names, be on the lookout for more 3rd-party Wayland remoting software becoming available this year. This work is also important to help Anaconda with its Wayland transition, as remote graphical install is an important feature there. So what you should see is Anaconda using GNOME Kiosk mode and the GNOME remote support to handle this going forward, thus enabling a Wayland-native Anaconda.

HDR

Another feature we have been working on for a long time is HDR, or High Dynamic Range. We wanted to do it properly, and we also needed to work with a wide range of partners in the industry to make this happen. So over the last year we have been contributing to improving various standards around color handling and acceleration to prepare the ground, and working on and contributing to key libraries needed, for instance, to gather the required information from GPUs and screens.
Things are coming together now, and Jonas Ådahl and Sebastian Wick are going to focus on making Mutter HDR-capable. Once that work is done we are by no means finished, but it should put us close to at least being able to run some simple use cases (like some fullscreen applications) while we work out the finer points, for instance getting great support for running SDR and HDR applications side by side.

PyTorch

We want to make Fedora Workstation a great place to do AI development and testing. The first step in that effort is packaging up PyTorch and making sure it can have working hardware acceleration out of the box. Tom Rix has been leading that effort on our end, and you will see the first fruits of that labor in Fedora Workstation 40, where PyTorch should work with GPU acceleration on AMD hardware (ROCm) out of the box. We hope and expect to be able to provide the same for NVIDIA and Intel graphics eventually too, but this is definitely a step-by-step effort.
  • Tomeu Vizoso: Rockchip NPU update 2: MobileNetV1 is done (2024/03/28 07:47)
Progress

For the last couple of weeks I have kept chipping away at a new userspace driver for the NPU in the Rockchip RK3588 SoC. I am very happy to report that the work has gone really smoothly and I reached my first milestone: running the MobileNetV1 model with all convolutions accelerated by the NPU. And it not only runs flawlessly, but at the same performance level as the blob. It has been great having access to the register list as disclosed by Rockchip in their TRM, and to the NVDLA and ONNC documentation and source code. This has allowed the work to proceed at a pace several times faster than with my previous driver for the VeriSilicon NPU, for which a lot of painstaking reverse engineering had to be done.

(Photo: Julien Langlois, CC BY-SA 3.0)

tomeu@arm-64:~/mesa$ TEFLON_DEBUG=verbose python3.10 -i hens.jpg -m mobilenet_v1_1.0_224_quant.tflite -l labels_mobilenet_quant_v1_224.txt -e libteflon.so
Loading external delegate from with args: {}
Teflon delegate: loaded rknpu driver
teflon: compiling graph: 89 tensors 27 operations...
teflon: compiled graph, took 413 ms
teflon: invoked graph, took 11 ms
teflon: invoked graph, took 11 ms
teflon: invoked graph, took 11 ms
teflon: invoked graph, took 10 ms
teflon: invoked graph, took 10 ms
0.984314: hen
0.019608: cock
0.000000: toilet tissue
0.000000: sea cucumber
0.000000: wood rabbit
time: 10.776ms

Notice how nothing in the invocation refers to the specific driver that TensorFlow Lite is using; that is completely abstracted by Mesa. Once all these bits are upstream and packaged by distros, one will be able to just download a model in INT8 quantization format and get accelerated inference going fast, irrespective of the hardware. Thanks to TL Lim of PINE64 for sending me a QuartzPro64 board to hack on.
Next steps

I want to go back and get my last work on performance for the VeriSilicon driver upstreamed, so it is packaged in distros sooner rather than later. After that, I'm a bit torn between working further on the userspace driver and implementing more operations and control flow, or starting to write a kernel driver for mainline.
  • Simon Ser: Status update, March 2024 (2024/03/17 22:00)
Hi! It’s this time of the month once again, it seems… We’ve finally released Sway 1.9! Note that it uses the new wlroots rendering API, but doesn’t use the scene-graph API: we’ve left that for 1.10. We’ve also released wlroots 0.17.2 with a whole bunch of bug fixes. Special thanks to Simon Zeni for doing the backporting work! In other Wayland news, the wlroots merge request to atomically apply changes to multiple outputs has been merged! In addition, another merge request to help compositors allocate the right kind of buffers during modesets has been merged. These two combined should help more multi-output setups on Intel GPUs light up correctly, where previously a workaround (WLR_DRM_NO_MODIFIERS=1) was required. Thanks to Kenny for helping with that work! I also got around to writing a Sway patch to gracefully handle GPU resets. This should be good news for users of a particular GPU vendor which tends to be a bit trigger-happy with resets! Sway will now survive and continue running instead of freezing. Note, clients may still glitch, need a nudge to redraw, or freeze. A few wlroots patches were also required to get this to work. With the help of Jean Thomas, Goguma (and pushgarden) has gained support for the Apple Push Notification service (APNs). This means that Goguma iOS users can now enjoy instantaneous notifications! This is also important to prove that it’s possible to design a standard (as an IRC extension) which doesn’t hardcode any proprietary platform (and thus doesn’t force each IRC server to have one codepath per platform), but still interoperates with these proprietary platforms (important for usability) and ensures that said proprietary platforms have minimal access to sensitive data (via end-to-end encryption between the IRC server and the IRC client). It’s now also possible to share links and files to Goguma. That is, when using another app (e.g.
the gallery, your favorite fediverse client, and many others) and opening the share menu, Goguma will show up as an option. It will then ask which conversation to share the content with, and automatically upload any shared file. No NPotM this time around sadly. To make up for it, I’ve implemented refresh tokens in sinwon, and made most of the remaining tests pass in go-mls. See you next month!
  • Tomeu Vizoso: Rockchip NPU update 1: A walk in the park? (2024/03/16 11:46)
During the past weeks I have paused work on the driver for the Vivante NPU and have started work on a new driver, for Rockchip's own NPU IP, as used in SoCs such as the RK3588(S) and RK3568. The version of the NPU in the RK3588 claims a performance of 6 TOPS across its 3 cores, though from what I have read, people are having trouble making use of more than one core in parallel with the closed-source driver.

A nice walk in the park

Rockchip, like most other vendors of NPU IP, provides a GPLed kernel driver and pushes out their userspace driver in binary form. The kernel driver is pleasantly simple and relatively up-to-date in its use of internal kernel APIs. The userspace stack, though, is notoriously buggy and difficult to use, with basic features still unimplemented and performance quite below what the hardware should be able to achieve. To be clear, this is on top of the usual problems related to closed-source drivers. I get the impression that Rockchip's NPU team is really understaffed. Other people had already looked at reverse-engineering the HW so they could address the limitations and bugs in the closed-source driver, and use it in situations not supported by Rockchip.
I used information acquired by Pierre-Hugues Husson and Jasbir Matharu to get started, a big thanks to them! After the initial environment was set up (I had to forward-port their kernel driver to v6.8), I wrote a simple library that can be loaded into the process with LD_PRELOAD and that, by overriding ioctl and other syscalls, lets me dump the buffers that the proprietary userspace driver sends to the hardware. I started looking at a buffer that, according to the debug logs of the proprietary driver, contained register writes, and when looking at the register descriptions in the TRM, I saw that the hardware had to be closely based on NVIDIA's NVDLA open-source NPU IP. With Rockchip's (terse) description of the registers, plus NVDLA's documentation and source code for both the hardware and the userspace driver, I have been able to make progress several times faster than when working on VeriSilicon's driver (for which I had zero documentation). Right now I am at the stage where I am able to correctly execute TensorFlow Lite's Conv2D and DepthwiseConv2D operations with different combinations of input dimensions, weight dimensions, strides and padding.
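For readers unfamiliar with the interception trick mentioned above, here is a minimal sketch of what such an LD_PRELOAD shim can look like. This is not the actual library used for this work, just an illustration of the technique: the shim interposes on ioctl, and the logging line stands in for the real job of decoding and dumping the submission buffers.

```c
// Minimal LD_PRELOAD shim sketch: interpose on ioctl(), log each call,
// then forward to the real libc implementation via dlsym(RTLD_NEXT).
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdarg.h>
#include <stdio.h>

typedef int (*ioctl_fn)(int, unsigned long, ...);

int ioctl(int fd, unsigned long request, ...)
{
    static ioctl_fn real_ioctl;
    if (!real_ioctl)
        real_ioctl = (ioctl_fn)dlsym(RTLD_NEXT, "ioctl");

    // ioctl's third argument is optional; fetch it as a pointer, which
    // covers the common case of a struct passed to the driver.
    va_list ap;
    va_start(ap, request);
    void *arg = va_arg(ap, void *);
    va_end(ap);

    // A real dumper would decode `request` here and write the
    // pointed-to submission buffers to disk for later analysis.
    fprintf(stderr, "ioctl(fd=%d, request=0x%lx, arg=%p)\n",
            fd, request, arg);

    return real_ioctl(fd, request, arg);
}
```

Built with something like `gcc -shared -fPIC -o shim.so shim.c`, it would then be loaded into the proprietary stack with `LD_PRELOAD=./shim.so <application>`, without modifying the target binary at all.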
Next is to support multiple output channels. I'm currently using Rockchip's kernel, but as soon as I'm able to run object detection models with decent hardware utilization, I plan to start writing a new kernel driver for mainlining. Rockchip's kernel driver has gems such as passing addresses in the kernel address space across the UAPI... Tests run fast and reliably, even with high concurrency:

tomeu@arm-64:~/mesa$ TEFLON_TEST_DELEGATE=~/mesa/build/src/gallium/targets/teflon/ TEFLON_TEST_DATA=src/gallium/targets/teflon/tests LD_LIBRARY_PATH=/home/tomeu/tflite-vx-delegate/build/_deps/tensorflow-build/ ~/.cargo/bin/gtest-runner run --gtest /home/tomeu/mesa/build/src/gallium/targets/teflon/test_teflon --output /tmp -j8 --tests-per-group 1 --baseline ~/mesa/src/gallium/drivers/rocket/ci/rocket-rk3588-fails.txt --flakes ~/mesa/src/gallium/drivers/rocket/ci/rocket-rk3588-flakes.txt --skips ~/mesa/src/gallium/drivers/rocket/ci/rocket-rk3588-skips.txt
Running gtest on 8 threads in 1-test groups
Pass: 0, Duration: 0
Pass: 139, Skip: 14, Duration: 2, Remaining: 2
Pass: 277, Skip: 22, Duration: 4, Remaining: 0
Pass: 316, Skip: 24, Duration: 4, Remaining: 0

You can find the source code in this branch.
  • Christian Schaller: PipeWire camera handling is now happening! (2024/03/15 16:30)
We hit a major milestone this week with the long-worked-on adoption of PipeWire camera support finally starting to land! Not long ago Firefox was released with experimental PipeWire camera support thanks to the great work by Jan Grulich. Then this week OBS Studio shipped with PipeWire camera support thanks to the great work of Georges Stavracas, who cleaned up the patches and pushed to get them merged based on earlier work by himself, Wim Taymans and Columbarius. This means we now have two major applications out there that can use PipeWire for camera handling, and thus two applications whose video streams can be interacted with through patchbay applications like Helvum and qpwgraph. These applications are important and central enough that having them use PipeWire is in itself useful, but they will now also provide two examples of how to do it for application developers looking at adding PipeWire camera support to their own applications; there is no better documentation than working code. The PipeWire support is also paired with camera portal support. The use of the portal also means we are getting closer to being able to fully sandbox media applications in Flatpaks, which is an important goal in itself. Which reminds me, to test out the new PipeWire support be sure to grab the official OBS Studio Flatpak from Flathub.

PipeWire camera handling with OBS Studio, Firefox and Helvum.

Let me explain what is going on in the screenshot above, as it is a lot. First of all you see Helvum there on the right showing all the connections made through PipeWire, both the audio and, in yellow, the video. So you can see how my Logitech BRIO camera is feeding a camera video stream into both OBS Studio and Firefox. You also see my Magewell HDMI capture card feeding a video stream into OBS Studio, and finally gnome-shell providing a screen capture feed that is being fed into OBS Studio.
On the left, at the top, you see Firefox running their WebRTC test app capturing my video, and just below that you see the OBS Studio window, with the direct camera feed in the top-left corner, the screencast of Firefox just below it, and finally the ‘no signal’ image from my HDMI capture card, since I had no HDMI device connected to it while testing. For those wondering, work is also underway to bring this into the Chromium and Google Chrome browsers, where Michael Olbrich from Pengutronix has been pushing to get patches written and merged. He did a talk about this work at FOSDEM last year, as you can see from these slides, with this patch being the last step to get this working there too. The move to PipeWire also prepared us for the new generation of MIPI cameras being rolled out in new laptops, and helps push work on supporting those cameras towards libcamera, the new library for dealing with the new generation of complex cameras. This of course ties in well with the work that Hans de Goede and Kate Hsuan have been doing recently, along with Bryan O’Donoghue from Linaro, on providing an open source driver for MIPI cameras, and of course the incredible work by Laurent Pinchart and Kieran Bingham from Ideas on Board on libcamera itself. The PipeWire support is of course fresh, and I am sure we will find bugs and corner cases that need fixing as more people test out the functionality in both Firefox and OBS Studio, and there are some interface annoyances we are working to resolve. For instance, since PipeWire supports both V4L and libcamera as a backend, you do at the moment get double entries in your selection dialogs for most of your cameras. WirePlumber has implemented de-duplication code which will ensure that only the libcamera listing shows for cameras supported by both V4L and libcamera, but this is only part of the development version of WirePlumber and will land in Fedora Workstation 40, so until that is out you will have to deal with the duplicate options.
Camera selection dialog.

We are also trying to figure out how to better deal with the infrared cameras that are part of many modern webcams. Obviously you usually do not want to use an IR camera for your video calls, so we need to figure out the best way to identify them and ensure they are clearly marked and not used by default. Another good recent PipeWire tidbit: with the PipeWire 1.0.4 release, PipeWire maintainer Wim Taymans also fixed up the FireWire FFADO support. The FFADO support had been in there for some time, but after seeing Venn Stone do some thorough tests and find issues, we decided it was time to bite the bullet and buy some second-hand FireWire hardware for Wim to be able to test and verify himself.

Focusrite FireWire device.

Once the Focusrite device I bought landed at Wim's house, he got to work, cleaned up the FFADO support and made it both work and perform well. For those unaware, FFADO is a way to use FireWire devices without going through ALSA, and it is popular among pro-audio folks because it gives lower latencies. FireWire is of course a relatively old technology at this point, but the audio equipment is still great and many audio engineers have a lot of these devices, so with this fixed you can plop a FireWire PCI card into your PC and suddenly all those old FireWire devices get a new lease on life on your Linux system. And you can buy these devices in places like eBay or Facebook Marketplace for a fraction of their original cost. In some sense this demonstrates the same strength of PipeWire as the libcamera support: in the libcamera case it gives Linux applications a way to smoothly transition to a new generation of hardware, and in the FireWire case it allows Linux applications to keep using older hardware with new applications.
So all in all it's been a great few weeks for PipeWire and for Linux audio AND video. If you are an application maintainer, be sure to look at how you can add PipeWire camera support to your application, and of course get that application packaged up as a Flatpak for people using Fedora Workstation and other distributions to consume.
  • Daniel Vetter: Upstream, Why & How (2024/03/14 00:00)
In a different epoch, before the pandemic, I did a presentation about upstream first at the Siemens Linux Community Event 2018, where I tried to explain the fundamentals of open source using microeconomics. Unfortunately that talk didn’t work out too well with an audience that isn’t well-versed in upstream and open source concepts, largely because it was just too much material crammed into too little time. Last year I got the opportunity to try again for an Intel-internal event series, and this time I split the material into two parts. I think that worked a lot better. For obvious reasons I cannot publish the recordings, but I can publish the slides. The first part, “Upstream, Why?”, covers a few concepts from microeconomics 101 and then applies them to upstream open source. The key concept is, on one hand, that open source achieves an efficient software market in the microeconomic sense by driving margins and prices to zero. The only ways to make money in such a market are to either have more-or-less unstable barriers to entry that prevent the efficient market from forming and destroying all monetary value, or to sell a complementary product. The second part, “Upstream, How?”, then looks at what this all means for the different stakeholders involved: individual engineers, who have skills and create a product with zero economic value, and might still be stupid enough to try to build a career on that; upstream communities, often with a formal structure as a foundation, and what exactly their goals should be to build a thriving upstream open source project that can actually pay some bills, generate some revenue somewhere else and get engineers paid, because without that you’re not going to have much of a project with a long-term future; and engineering organizations, what exactly their incentives and goals should be, and the fundamental conflicts of interest this causes.
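The zero-margin point above is the textbook perfect-competition result applied to software, where reproducing a copy costs essentially nothing. As a one-line restatement (my summary, not taken from the slides):

```latex
p = MC, \qquad MC_{\text{software copy}} \approx 0 \;\Longrightarrow\; p \to 0
```

That is, price is competed down to marginal cost, and since the marginal cost of another copy is near zero, prices and margins vanish unless a barrier to entry blocks the efficient market or revenue comes from a complementary product.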
Specifically on this I’ve only seen bad solutions and ugly solutions, but not yet a really good one. A relevant pre-pandemic talk of mine on this topic is “Upstream Graphics: Too Little, Too Late”. And finally the overall business and, more importantly, the kind of business strategy that is needed to really thrive with an open source upstream-first approach: you need to clearly understand which software market’s economic value you want to destroy by driving margins and prices to zero, and which complementary product you’re selling to still earn money. At least judging by the feedback I’ve received internally, taking more time and going a bit more in-depth on the various concepts worked much better than the keynote presentation I did at Siemens, hence I decided to publish at least the slides.
  • Peter Hutterer: Enforcing a touchscreen mapping in GNOME (2024/03/12 04:33)
Touchscreens are quite prevalent by now, but one of the not-so-hidden secrets is that they're actually two devices: the monitor and the actual touch input device. Surprisingly, users want the touch input device to work on the underlying monitor, which means your desktop environment needs to somehow figure out which of the monitors belongs to which touch input device. Often these two devices come from two different vendors, so mutter needs to use ... */me holds torch under face* .... HEURISTICS! :scary face: Those heuristics are actually quite simple: same vendor/product ID? same dimensions? is one of the monitors a built-in one? [1] But unfortunately in some cases those heuristics don't produce the correct result. In particular, external touchscreens seem to be getting more common again, and plugging one into a (non-touch) laptop means you usually get that external screen mapped to the internal display. Luckily mutter does have a configuration for it, though it is not exposed in the GNOME Settings (yet). But you, my $age $jedirank, can access this via a commandline interface to at least work around the immediate issue. But first: we need to know the monitor details, and you need to know about gsettings relocatable schemas. Finding the right monitor information is relatively trivial: look at $HOME/.config/monitors.xml and get your monitor's vendor, product and serial from there. e.g.
in my case this is:

<monitors version="2">
  <configuration>
    <logicalmonitor>
      <x>0</x>
      <y>0</y>
      <scale>1</scale>
      <monitor>
        <monitorspec>
          <connector>DP-2</connector>
          <vendor>DEL</vendor>             <--- this one
          <product>DELL S2722QC</product>  <--- this one
          <serial>59PKLD3</serial>         <--- and this one
        </monitorspec>
        <mode>
          <width>3840</width>
          <height>2160</height>
          <rate>59.997</rate>
        </mode>
      </monitor>
    </logicalmonitor>
    <logicalmonitor>
      <x>928</x>
      <y>2160</y>
      <scale>1</scale>
      <primary>yes</primary>
      <monitor>
        <monitorspec>
          <connector>eDP-1</connector>
          <vendor>IVO</vendor>
          <product>0x057d</product>
          <serial>0x00000000</serial>
        </monitorspec>
        <mode>
          <width>1920</width>
          <height>1080</height>
          <rate>60.010</rate>
        </mode>
      </monitor>
    </logicalmonitor>
  </configuration>
</monitors>

Well, so we know the monitor details we want. Note there are two monitors listed here; in this case I want to map the touchscreen to the external Dell monitor. Let's move on to gsettings. gsettings is of course the configuration storage wrapper GNOME uses (and the CLI tool with the same name). GSettings follow a specific schema, i.e. a description of a schema name and possible keys and values for each key. You can list all those, set them, look up the available values, etc.:

$ gsettings list-recursively
... lots of output ...
$ gsettings set org.gnome.desktop.peripherals.touchpad click-method 'areas'
$ gsettings range org.gnome.desktop.peripherals.touchpad click-method
enum
'default'
'none'
'areas'
'fingers'

Now, schemas work fine as-is as long as there is only one instance. Where the same schema is used for different devices (like touchscreens) we use a so-called "relocatable schema", and that requires also specifying a path - and this is where it gets tricky. I'm not aware of any functionality to get the specific path for a relocatable schema, so often it's down to reading the source. In the case of touchscreens, the path includes the USB vendor and product ID (in lowercase), e.g.
in my case the path is: /org/gnome/desktop/peripherals/touchscreens/04f3:2d4a/ In your case you can get the touchscreen details from lsusb, libinput record, /proc/bus/input/devices, etc. Once you have it, gsettings takes a schema:path argument like this:

  $ gsettings list-recursively org.gnome.desktop.peripherals.touchscreen:/org/gnome/desktop/peripherals/touchscreens/04f3:2d4a/
  org.gnome.desktop.peripherals.touchscreen output ['', '', '']

Looks like the touchscreen is bound to no monitor. Let's bind it with the data from above:

  $ gsettings set org.gnome.desktop.peripherals.touchscreen:/org/gnome/desktop/peripherals/touchscreens/04f3:2d4a/ output "['DEL', 'DELL S2722QC', '59PKLD3']"

Note the quotes so your shell doesn't misinterpret things. And that's it. Now I have my internal touchscreen mapped to my external monitor, which makes no sense at all but shows that you can map a touchscreen to any screen if you want to. [1] Probably the one that most commonly takes effect, since it's the vast vast majority of devices
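Putting the steps above together: a small Python sketch (the helper names are mine, not mutter's; it only mirrors the path and command formats shown above) that assembles the gsettings invocation from the monitors.xml fields and the USB IDs:

```python
import xml.etree.ElementTree as ET

def monitor_specs(xml_text):
    """Return (vendor, product, serial) for each monitor in a v2 monitors.xml."""
    root = ET.fromstring(xml_text)
    return [(s.findtext("vendor"), s.findtext("product"), s.findtext("serial"))
            for s in root.iter("monitorspec")]

def touchscreen_schema_path(usb_vendor, usb_product):
    """Relocatable-schema path: lowercase, zero-padded hex vendor:product."""
    return ("/org/gnome/desktop/peripherals/touchscreens/"
            f"{usb_vendor:04x}:{usb_product:04x}/")

def gsettings_command(monitor, usb_vendor, usb_product):
    """Build the 'gsettings set ... output ...' command line shown above."""
    vendor, product, serial = monitor
    path = touchscreen_schema_path(usb_vendor, usb_product)
    return (f"gsettings set org.gnome.desktop.peripherals.touchscreen:{path} "
            f"output \"['{vendor}', '{product}', '{serial}']\"")
```

With the example values from this post, gsettings_command(("DEL", "DELL S2722QC", "59PKLD3"), 0x04F3, 0x2D4A) reproduces the command above.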
  • Mike Blumenkrantz: Post Interfaces (2024/03/12 00:00)
    March. I’ve had a few things I was going to blog about over the past month, but then news sites picked them up and I lost motivation because there’s only so many hours in a day that anyone wants to spend reading things that aren’t specification texts. Yeah, that’s my life now. Anyway, a lot’s happened, and I’d try to enumerate it all but I’ve forgotten / lost track / don’t care. git log me if you’re interested. Some highlights: damage stuff is in RADV supports shader objects so zink can run Tomb Raider (2013) without stuttering NVK is about to hit GL conformance on all versions I’m working on too many projects to keep track of everything More on the last one later. Like in a couple months. When I won’t get vanned for talking about it. No, it’s not Half Life 3 / Portal 3 / L4D3. Interfaces Today’s post was inspired by interfaces: they’re the things that make code go brrrrr. Basically Legos, but for adults who never go outside. If you’ve written code, you’ve done it using an interface. Graphics has interfaces too. OpenGL is an interface. Vulkan is an interface. Mesa has interfaces. It’s got some neat ones like Gallium which let you write a whole GL driver without knowing anything about GL. And then it’s got the DRI interfaces. Which, by their mere existence, answer the question “What could possibly be done to make WSI even worse than it already is?” The DRI interfaces date way back to a time before the blog. A time when now-dinosaurs roamed the earth. A time when Vulkan was but a twinkle in the eye of Mantle, which didn’t even exist. I’m talking Copyright 1998-1999 Precision Insight, Inc., Cedar Park, Texas. at the top of the file old. The point of these interfaces was to let external applications access GL functionality. Specifically the xserver. This was before GLAMOR combined GBM and EGL to enable a better way of doing things that didn’t involve brain damage, and it was a necessary evil to enable cross-vendor hardware acceleration using Mesa. 
Other historical details abound, but this isn’t a textbook. The DRI interfaces did their job and enabled hardware-accelerated display servers for decades. Now, however, they’ve become cruft. A hassle. A roadblock on the highway to a future where I can run zink on stupid platforms with ease. Problem The first step to admitting there’s a problem is having a problem. I think that’s how the saying goes, anyway. In Mesa, the problem is any time I (or anyone) want to do something related to the DRI frontend, like allow NVK to use zink by default, it has to go through DRI. Which means going through the DRI interfaces. Which means untangling a mess of unnecessary function pointers with versioned prototypes meaning they can’t be changed without adding a new version of the same function and adding new codepaths which call the new version if available. And guess how many people in the project truly understand how all the layers fit together? It’s a mess. And more than a mess, it’s a huge hassle any time a change needs to be made. Not only do the interfaces have to be versioned and changed, someone looking to work on a new or bitrotted platform has to first chase down all the function pointers to see where the hell execution is headed. Even when the function pointers always lead to the same place. I don’t have any memes today. This is my declaration of war. DRI interfaces: you’re officially on notice. I’m coming for you.
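To make the gripe concrete, here is a schematic of the versioned-function-pointer pattern (all names invented for illustration; this is not actual Mesa/DRI code): every revision appends a field and bumps a version number, and every caller must branch on that version forever after.

```c
#include <stdio.h>

/* Schematic of a versioned-extension interface (made-up names, NOT
 * actual Mesa/DRI code). Each change appends a field and bumps the
 * version; old entry points can never be removed. */
struct fake_dri_ext {
    int version;
    void (*flush)(void);                /* since version 1 */
    void (*flush_with_flags)(unsigned); /* since version 2 */
};

static void do_flush(void) { puts("flush"); }
static void do_flush_with_flags(unsigned flags) { printf("flush(%u)\n", flags); }

/* Returns the interface version actually used, to make the branching
 * visible: new codepaths pile up while the old ones stay forever. */
static int frontend_flush(const struct fake_dri_ext *ext, unsigned flags)
{
    if (ext->version >= 2 && ext->flush_with_flags) {
        ext->flush_with_flags(flags);
        return 2;
    }
    ext->flush();
    return 1;
}
```

Chasing these pointers at runtime is exactly the "where the hell is execution headed" problem, even when they always lead to the same place.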
  • Tomeu Vizoso: Etnaviv NPU update 17: Faster! (2024/02/23 12:10)
    In the last update I explained how compression of zero weights gave our driver such a big performance improvement. Since then, I have explored further what could take us closer to the performance of the proprietary driver, and saw the opportunity to gather some of the proverbial low-hanging fruit.
TL;DR: Our driver's performance on SSD MobileDet went from 32.7 ms to 24.8 ms, against the proprietary driver's 19.5 ms. On MobileNetV1, our driver went from 9.9 ms to 6.6 ms, against the proprietary driver's 5.5 ms. Pretty close!
Enable more convolutions: Our driver was rejecting convolutions whose number of output channels is not divisible by the number of convolution cores in the NPU, because at the start of development the code that lays the weights out in memory didn't support that. That caused TensorFlow Lite to run those convolutions on the CPU, and some of them were big enough to take a few milliseconds, several times more than on the NPU. When implementing support for bigger kernels I had to improve the tiling of the convolutions, and that included adding support for these other convolutions. So by just removing the rejection of these, we got a nice speed-up on SSD MobileDet: from 32.7 ms to 27 ms! That didn't help on MobileNetV1, because that one has all its convolutions with neat numbers of output channels.
Caching of the input tensor: So far we were only caching the kernels in the on-chip SRAM. I spent some time looking at how the proprietary driver sets the various caching fields and found a way of getting us to cache a portion of the input tensor in the remaining internal SRAM. That got us the rest of the performance improvement mentioned above, but I am having trouble with some combination of parameters when input tensor caching is enabled, so I need to get to the bottom of it before I submit it for review.
Next steps: At this point I am pretty confident that we can get quite close to the performance of the proprietary driver without much additional work, as a few major performance features remain to be implemented, and I know that I still need to take a pass at tuning some of the previous performance work. But after getting the input tensor caching finished, and before I move on to any other improvements, I think I will invest some time in adding profiling facilities so I can better direct the efforts and get the best returns.
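For the curious, the divisibility constraint above amounts to distributing output channels across the convolution cores; a toy sketch (a hypothetical helper, not the driver's actual weight-layout code):

```python
def split_output_channels(out_channels, cores):
    # Distribute output channels across NPU convolution cores; when the
    # count is not evenly divisible, the first cores take one extra channel.
    base, rem = divmod(out_channels, cores)
    return [base + (1 if i < rem else 0) for i in range(cores)]
```

For example, split_output_channels(10, 4) yields [3, 3, 2, 2]; handling this uneven case is what lets such convolutions stay on the NPU instead of falling back to the CPU.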
  • Mike Blumenkrantz: Woof (2024/02/21 00:00)
    It Turns Out …that this year is a lot busier than expected. Blog posts will probably come in small clusters here and there rather than with any sort of regular cadence. But now I’m here. You’re here. Let’s get cozy for a few minutes. NVK O’clock I’m sure you’ve seen some news, you’ve been trawling the gitlab MRs, you’re on the #nouveau channels. You’re one of my readers, so we both know you must be an expert. Zink on NVK is happening. Those of you who remember the zink XDC talk know that this work has been ongoing for a while, but now I can finally reveal the real life twist that only a small number of need-to-know community members have been keeping under wraps for years: I still haven’t been to XDC yet. Let me explain. I’m sure everyone recalls the point in the presentation where “I” talked about progress made towards Zink on NVK. A lot of people laughed it off; oh sure, you said, that’s just the usual sort of joke we expect. But what if I told you it wasn’t a joke? That all of it was 100% accurate, it just hadn’t happened yet? I know what you’re thinking now, and you’re absolutely correct. The me that attended XDC was actually time traveling from the future. A future in which Zink on NVK is very much finished. Since then, I’ve been slowly and quietly “backporting” the patches my future self wrote and slipping them into git. Let’s look at an example. The Great Gaming Bug Of ‘24 20 Feb 2024 was a landmark day in my future-journal for a number of reasons, not the least due to the alarming effects of planetary alignment that you’re all no doubt monitoring. For the purposes of the current blog post that I’m now writing, however, it was monumental for a different reason. This was the day that noted zinkologist and current record-holder for Most Tests Fixed With One Line Of Code, Faith Ekstrand (@gfxstrand), would delve into debugging the most serious known issue in zink+nvk: Yup, it’s another clusterfuck. 
Now let me say that I had the debug session noted down in my journal, but I didn’t add details. If you haven’t been in #nouveau for a live debug session, it’s worth scheduling time around it. Get some popcorn ready. Put on your safety glasses and set up your regulation-size splatterguard, all the usual, and then… Well, if I had to describe the scene, it’s like watching someone feed a log into a wood chipper. All the potential issues investigated one-by-one and eliminated into the pile of growing sawdust. Anyway, it turns out that NVK (currently) does not expose a BAR memory type with host-visible and device-local properties, and zink has no handling for persistently mapped buffers in this scenario. I carefully cherry-picked the appropriate patch from my futurelog and rammed it through CI late at night when nobody would notice. As a result, all GL games now work on NVK. No hyperbole. They just work. Stay tuned for future updates backported from a time when I’m not struggling to find spare seconds under the watchful gaze of Big Triangle.
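For reference, the condition behind that bug boils down to memory-type selection: the flag values below mirror the Vulkan VK_MEMORY_PROPERTY_* bits, but the helper itself is a self-contained sketch, not zink source.

```c
#include <stdint.h>

/* These mirror VkMemoryPropertyFlagBits values, redefined here so the
 * sketch stands alone. */
#define DEVICE_LOCAL_BIT 0x1u
#define HOST_VISIBLE_BIT 0x2u

/* Return the first memory type containing all required flags, or -1 if
 * none exists -- the case NVK exposed: no HOST_VISIBLE | DEVICE_LOCAL
 * ("BAR") type available for persistently mapped buffers. */
static int find_memory_type(const uint32_t *type_flags, int count,
                            uint32_t required)
{
    for (int i = 0; i < count; i++)
        if ((type_flags[i] & required) == required)
            return i;
    return -1; /* caller must fall back, e.g. to staging copies */
}
```

The fix was on the caller side: handling the -1 case instead of assuming a BAR-like type always exists.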
  • Simon Ser: Status update, February 2024 (2024/02/19 22:00)
    Hi! February is FOSDEM month, and as usual I’ve come to Brussels to meet with a lot of other FOSS developers and exchange ideas. I like to navigate between the buildings and along the hallways to find nice people to discuss with. This edition I’ve been involved in the new modern e-mail devroom and I’ve given a talk about IMAP with Damian, a fellow IMAP library maintainer and organizer of this devroom. The whole weekend was great! In wlroots news, I’ve worked on multi-connector atomic commits. Right now, wlroots sequentially configures outputs, one at a time. This is slow and makes it impossible to properly handle GPU limitations such as bandwidth: if the GPU cannot drive two outputs with a 4k resolution, we’ll only find out after the first one has been lit up. As a result we can’t properly implement fallbacks, and this results in black screens on some setups. In particular, on Intel some users need to set WLR_DRM_NO_MODIFIERS=1 to have their multi-output setup work correctly. The multi-connector atomic commit work is the first step to resolve these situations, and it also results in faster modesets. The second step will be to add fallback logic to use a less bandwidth-intensive scanout buffer on modeset. While working on the wlroots DRM backend code, I’ve also taken the opportunity to clean up the internals and skip unnecessary modesets when switching between VTs. Ctrl Alt 1 should be faster now! I’ve also tried to resurrect the ext-screencopy-v1 protocol, required for capturing individual windows. I’ve pushed a new version and reworked the wlroots implementation; hopefully I can find some more time next month to continue on this front. Sway 1.9-rc4 has been recently released, and my reading of the tea leaves at my disposal indicates that the final release may be shipped soon. Sway 1.9 will leverage the new wlroots rendering API, however it does not include the huge scene-graph rework that Alexander has pushed forward in the last year or so.
Sway 1.10 will be the first release to include this major overhaul and all the niceties it unlocks. And Sway 1.10 will also finally support input method popups (used for CJK among other things) thanks to efforts by Access and Tadeo Kondrak. The NPotM is sinwon, a simple OAuth 2 server for small deployments. I’ve long been trying to find a good solution to delegate authentication to a single service and provide single-sign-on for my personal servers. I’ve come to like OAuth 2 because it’s a standard, it’s not tied to another use-case (like IMAP or SMTP is), and it prevents other services from manipulating user passwords directly. sinwon stores everything in a SQLite database, and it’s pretty boring: no fancy cryptography usage for tokens, no fancy cloud-grade features. I like boring. sinwon has a simple UI to manage users and OAuth clients (sometimes called “apps”). Still missing are refresh tokens, OAuth scopes, an audit log, personal access tokens, and more advanced features such as TOTP, device authorization grants and mTLS. Patches welcome! I’ve continued my work to make it easier to contribute to the SourceHut codebase. Setting up PGP keys is now optional to run a SourceHut instance, and a local S3-compatible server (such as minio) can be used without TLS. Thorben Günther has added to I’m also working on making services use’s GraphQL API instead of maintaining their own copy of the user’s profile, but more needs to be done there. And now for the random collection of smaller updates… The soju IRC bouncer and the goguma IRC client for mobile devices now support file uploads: no need to use an external service anymore to share a screenshot or picture in an IRC conversation. Conrad Hoffmann and Thomas Müller have added support for multiple address books to the go-webdav library, as well as creating/deleting address books and calendars. I’ve modernized the FreeDesktop e-mail server setup with SPF, DKIM and DMARC. 
KDE developers have contributed a new layer-shell minor version to support docking their panel to a corner of the screen. That’s all for now, see you next month!
  • Donnie Berkholz: The lazy technologist’s guide to fitness (2024/02/16 20:20)
    In the past 8 months, I’ve lost 60 pounds and gone from completely sedentary to well on my way towards becoming fit, while putting in a minimum of effort. On the fitness side, I’ve taken my cardiorespiratory fitness from below average to above average, and I’m visibly stronger (I can do multiple pull-ups!). Again, I’ve aimed to do so with minimal effort to maximize my efficiency. Here’s what I wrote in my prior post on weight loss: I have no desire to be a bodybuilder, but I want to be in great shape now and be as healthy and mobile as possible well into my old age. And a year ago, my blood pressure was already at pre-hypertension levels, despite being at a relatively young age. Research shows that 5 factors are key to a long life — extending your life by 12–14 years:
- Never smoking
- BMI of 18.5–24.9
- 30+ min a day of moderate/vigorous exercise
- Moderate alcohol intake (vs none, occasional, or heavy). Unsurprisingly, there is vigorous scientific and philosophical/religious/moral debate about this one; however, all studies agree that heavy drinking is bad.
- Diet quality in the upper 40% (Alternate Healthy Eating Index)
In addition, people who are in good health have a much shorter end-of-life period. This means they extend the healthy portion of their lifespan (the “healthspan”) and compress the worst parts into a shorter period at the very end. Having seen many grandparents go through years of struggle as they grew older, I wanted my own story to have a different ending. Although I’m not a smoker, I was missing three of the other factors. My weight was massively unhealthy, I didn’t exercise at all and spent most of my day in front of a desk, and my diet was awful. I do drink moderately, however (almost entirely beer). This post accompanies my earlier writeup, “The lazy technologist’s guide to weight loss.” Check that out for an in-depth, science-driven review of my experience losing weight.
Why is this the lazy technologist’s guide, again?
I wanted to lose weight in the “laziest” way possible — in the same sense that lazy programmers find the most efficient solutions to problems, according to an apocryphal quote by Bill Gates and a real one by Larry Wall, creator of Perl. Gates supposedly said, “I choose a lazy person to do a hard job. Because a lazy person will find an easy way to do it.” Wall wrote in Programming Perl, “Laziness: The quality that makes you go to great effort to reduce overall energy expenditure. It makes you write labor-saving programs that other people will find useful and document what you wrote so you don’t have to answer so many questions about it.” What’s the lowest-effort, most research-driven way to become fit as quickly as possible, during and after losing weight? Discovering and executing upon that was my journey. Read on if you’re considering taking a similar path. Cardio Fitness My initial goal for fitness was simply to meet the “30+ min/day” factor in the research study I cited at the beginning of this post, while considering a few factors: First, this is intended to be the lazy way, so there should be no long and intense workouts unless unavoidable.  Second, I did not want to buy a bunch of equipment or need to pay for a gym membership. Any required equipment should be inexpensive and small. Third, I wanted to avoid creating any joint issues that would affect me negatively later in life. I was particularly concerned about high-impact, repetitive stress from running on hard surfaces, which I’d heard could be problematic. Joint issues become very common for older people, especially knees and hips. My program needed to avoid any high-impact, repetitive stress on those joints to preserve maximum function. I’ve always heard that running is bad on your knees, but after I looked into it, the research does not bear that out. And yet, it remains a popular misconception among both the general population as well as doctors who do not frequently perform hip replacements. 
However, I just don’t like running — I enjoy different activities if I’m going to be working hard physically, such as games like racquetball/squash/pickleball or self-defense (Krav Maga!). I’m also not a big fan of getting all sweaty in general, but especially in the middle of a workday. So I wanted an activity with a moderate rather than high level of exertion. Low-impact options include walking, cycling, swimming, and rowing, among others. But swimming requires an indoor pool or year-round good weather, and rowing requires a specialized machine or boat, while I’m aiming to stay minimal. I also do not own a bicycle, nor is the snowy weather in Minnesota great for cycling in the winter (fat-tire bikes being an exception). We’re left with walking as the primary activity.  LISS — Low-Intensity Steady State Initially, I started with only walking. This is called low-intensity steady state (LISS) cardio (cardiovascular, a.k.a. aerobic) exercise. Later, I also incorporated high-intensity interval training (HIIT) as the laziest possible way to further improve my cardiovascular health. To bump walking up into a “moderate” level of activity, I need to walk between 3–4 mph. This is what’s sometimes called a “brisk” walk — 3 mph feels fast, and 4 mph is about as fast as I can go without changing into some weird competitive walking style. I also need to hit 30+ minutes per day of this brisk walking. At first, I started on a “walking pad” treadmill under my standing desk, which I bought for <$200 on Amazon. My goal was to integrate walking directly into my day with no dedicated time, and this seemed like a good path. However, this violates the minimalism requirement. I also learned that the pace is also too fast to do much of anything at the desk besides watch videos or browse social media. So I broke this up into two 1-mile outdoor walks, one after lunch and another after dinner.  Each 1-mile walk takes 15–20 minutes. 
Fitting this into a workday requires me to block off 45–60 minutes for lunch, between lunch prep, time to eat, and the walk itself. I find this much easier than trying to create a huge block of time in the morning for exercise, because I do not naturally wake up early. In the evening, I’ll frequently extend the after-dinner walk to ~2 miles instead of 1 mile. It turns out that walking after meals is a great strategy for both weight loss and suppressing your blood sugar levels, among other benefits. This can be as short as a 2-minute walk, according to recent studies. In fact, it’s seen as so key in Mediterranean culture that walking is considered a component of the Mediterranean diet. Overall, I’ve increased my active calorie expenditure by 250 calories/day by incorporating active walks into my day. That’s a combination of the 2 after-meal brisk walks, plus a more relaxed walk on my under-desk treadmill sometime during the day. The latter is typically a 2 mph walk for 40–60 min, and I do it while I’m in a meeting that I’m not leading, or maybe watching a webinar. Without buying the walking pad, you could do the same on a nice outdoor walk with a headset or earbuds, but Minnesota weather sometimes makes that miserable. Overall, all of this typically gets me somewhere between 10,000–15,000 steps per day.  Not only is this good for fitness, it also helps to offset the effects of metabolic adaptation. If you’re losing weight, your body burns fewer calories because it decreases your resting metabolic rate to conserve energy. Although some sites will suggest this could be hundreds of calories daily, which is quite discouraging, research shows that’s exaggerated for most people. During active weight loss, it’s typically ~100 calories per day, although it may be up to 175±150 calories for diet-resistant people. That range is a standard deviation, so people who are in the worst ~15% of the diet-resistant subset could have adaptations >325 calories/day.
So if you believe you’re diet-resistant, you probably want to aim for a 1000-calorie deficit, to ensure you’re able to lose weight at a good rate. On the bright side, that adaptation gets cut in half once you’ve stabilized for a few weeks at your new weight, and it’s effectively back to zero a year later. To further maintain my muscle following weight loss, I added a weighted vest to my after-lunch walks occasionally (examples: Rogue, 5.11, TRX). I started doing this once a week, and I aim to get to 3x+/week. I use a 40 lb weighted vest to counterbalance the 40+ lb of weight that I’ve lost. When I walk with the vest, I’m careful to maintain the same pace as without the vest, which increases the intensity and my heart rate. This pushes a normal moderate-intensity walk into the low end of high intensity (approaching 80% of my max heart rate). I also anticipate incorporating this weighted vest into my strength training later, once my own body weight is insufficient for continued progression.  Considering a minimalist approach, however, I think you could do just fine without a weighted vest. There are other ways to increase intensity, such as speed or inclines, and the combination of a high-protein diet, HIIT, and strength training provides similar benefits. HIIT — High-Intensity Interval Training Why do HIIT? Regularly getting your heart rate close to its maximum is good for your cardiovascular health, and you can’t do it with LISS, which by definition is low intensity. Another option besides HIIT is much longer moderate-intensity continuous training (your classic aerobic workout), but HIIT can fit the same benefits or more into a fraction of the time. Research is very supportive of HIIT compared to longer aerobic workouts, which enables time compression of the total workout length from the classic 60 minutes down to 30 minutes or less.  However, 30 minutes still isn’t the least you can do and still get most of the benefits. 
The minimum required HIIT remains unclear — in overall length, weekly frequency, as well as patterns of high-intensity and rest / low-intensity. Here are some examples of research that test the limits of minimalist HIIT and find that it still works well:
- 1x 4 min, 3x/wk: review
- 5x 1 min, 2x/wk: study, study, study
- 4x 1 min, 3x/wk: study
- 8x 20 sec, 4x/wk (i.e. the Tabata protocol): study
Yes, you read that right — the last study used 20-second intervals. They were only separated by 10 seconds of rest, so the primary exercise period was just 4 minutes, excluding warm-up. Furthermore, this meta-analysis suggests that HIIT benefits more from increasing the intensity of the high-intensity intervals, rather than increasing the volume of repetitions. After my investigation, it was clear that “low-volume” or “extremely low volume” HIIT could work well, so there was no need to do the full 30-minute HIIT workouts that are popular with many gym chains. I settled on 3 minutes of HIIT, 2x/week: 3 repetitions of 30 seconds hard / 30 seconds light, plus a 1-minute warm-up. This overlaps with the HIIT intervals, breaks, and repetitions from the research I’ve dug into, and it also has the convenient benefit of not quite making me sweat during the workout, so I don’t need to change clothes. I’m seeing the benefits of this already, which I’ll discuss in the Summary.
Strength Training
I also wanted to incorporate strength training for many reasons. In the short term, it was to minimize muscle loss as I lost weight (addressed in my prior post). In the medium and long term, I want to build muscle now so that I can live a healthier life once I’m older and also feel better about myself today. What I’ve found is that aiming for the range of 10%–15% body fat is ideal for men who want to be very fit. This range makes it easy to tell visually when you’re at the top or bottom of the range, based on the appearance of a well-defined six-pack or its fading away to barely visible.
It gets harder to tell where you are visually from 15% upwards, while anything below 10% has some health risks and starts to look pretty unusual too. Within that 10%–15% range, I’m planning to do occasional short-term “lean bulks” / “clean bulks” and “cuts.” That’s the typical approach to building muscle — you eat a slight excess of calories while ensuring plenty of protein, aiming to gain about 2–4 lbs/month for someone my size. After a cycle of doing this, you then “cut” by dieting to lose the excess fat you’ve gained, because it’s impossible to only gain muscle. My personal preference is to make this cycle more agile with shorter iteration cycles, compared to some of the examples I’ve seen. I’m thinking about a 3:1 bulk:cut split over 4 months that results in a total gain/loss of ~10 lbs. Calisthenics (bodyweight exercises): the minimalist’s approach My goal of staying minimal pushed me toward calisthenics (bodyweight exercises), rather than needing to work out at a gym or buy free weights. This means the only required equipment is a doorway pull-up bar ($25), while everything else can be done with a wall, table or chair/bench. Although I may not build enormous muscles, it’s possible to get to the point of lifting your entire body weight with a single arm, which is more than good enough for me. That’s effectively lifting 2x your body weight, since you’re lifting 1x with just one arm. My routine is inspired by Reddit’s r/bodyweightfitness (including the Recommended Routine and the Minimalist Routine) and this blog post by Steven Low, author of the book “Overcoming Gravity.” I’ve also incorporated scientific research wherever possible to guide repetitions and frequency. Overall, the goal is to get both horizontal and vertical pushing and pulling exercises for the arms/shoulders due to their larger range of motion, while getting push and pull for legs, and good core exercises that cover both the upper and lower back as well.  
I’ve chosen compound exercises that work many muscles simultaneously — for practicality (more applicable to real-world motions), length of workout, and minimal equipment needs. If you’re working isolated muscles, you generally need lots of specialized machines at a gym. Isometrics (exercises where you don’t move, like a wall-sit) are also less applicable to real use cases as you age, such as the strength and agility to catch yourself from a fall. For that reason, I prefer compound exercises with some rapid, explosive movements that help to build both strength and agility.
My initial routine
Here’s my current schedule (3 sets of repetitions for each movement, with a 3-minute break between sets):
- Monday: arm push — push-ups (as HIIT) and tricep dips. “As HIIT” means that I’ll do as many push-ups as I can fit within my HIIT pattern, then flip to my active work (e.g. jumping jacks or burpees).
- Tuesday: arm pull — pull-ups (with L-sit, as below) and inverted rows (“Australian pull-ups”)
- Wednesday: core — L-sits, planks (3x — 10 sec on each of front, right, left)
- Thursday: handstands — working toward handstand push-ups as the “vertical push”
- Friday: legs — squats (as HIIT), and Nordic curls (hamstrings & lower back)
- Saturday/Sunday: rest — just walking. Ideally hitting 10k steps/day but no pressure to do so, if I’m starting to feel sore.
For ones that I couldn’t do initially (e.g. pull-ups, handstands, L-sits, Nordic curls), I used progressions to work my way there step by step. For pull-ups, that meant doing negatives / eccentrics by jumping up and slowly lowering myself down over multiple seconds, then repeating. For handstands, I face the wall to encourage better posture, so it’s been about longer holds and figuring out how to bail out so I can more confidently get vertical. For L-sits, I follow this progression. For Nordic curls, I’m doing slow negatives as far down as I can make it, then dropping the rest of the way onto my hands and pushing back up.
On days with multiple exercises for the same muscles, I’ll typically try to split them up so they fit more easily into a workday. For example, I’ll find 10 minutes mid-morning between meetings/calls to do one movement and 10 minutes mid-afternoon for the other. This is the same time I might’ve spent making a coffee, before I started focusing on fitness. Combined with the walks, this plan gets me moving 4 times a day — two 20-minute walks and two 10-minute workouts, for a total of 1 hour each day. The great thing about this approach is that I never feel like I need to dedicate a ton of time to exercise, because it fits naturally into the structure of my day. I’ve also got an additional 40–60 minutes of slow walking while at my desk, which again fits easily into my day. What I’ve learned along the way As you can see, I’m currently at 1x/wk for non-core exercises, which is a “traditional split.” That means I’m splitting up exercises, focusing on just one set of muscles each day. The problem is that the frequency of training for each muscle group is low, which I’d like to change so that I can build strength more quickly.  I’m switching to “paired sets” (aka “alternating sets”) that alternate among different muscle groups, so I can fit more into the same amount of time. Here’s how that works: if you were taking a 3-minute rest between sets, that gives you time to fit in an unrelated set of muscles that you weren’t using in the first exercise (e.g. biceps & triceps, quads & hamstrings, chest & back). I do this as an alternating tri-set (arm pull, arm push, legs) with a 30–45 second rest between each muscle group, and a 1.5–2 minute break between each full tri-set. You might also see “supersets,” which is a similar concept but with no breaks within the tri-set. I’ve found that I tend to get too tired and sloppy if I try a superset, so I do alternating sets instead. In addition, I’ve done a lot more research on strength training after getting started. 
For LISS and HIIT, I had a strongly research-driven approach before beginning. For strength training, I went with some more direct recommendations and only did additional academic research later. Here’s what I’ve learned since then:
- Higher-load (80%+), multi-set workouts 2x/week are optimal for maximizing both strength and hypertrophy, according to a 2023 meta-analysis.
- One ideal size of a set to maximize benefits seems to be 6–8 repetitions, with a 3-minute break between sets to maximize energy restoration. 6–8 reps seems like a sweet spot between strength and hypertrophy (muscle size). For endurance, 15+ repetitions should be the goal. If you want to build all of those characteristics, you should probably alternate rep counts with different loads.
- Time-efficient workout design: use compound exercises and include both concentric & eccentric movements. Perform a minimum of one leg-pressing exercise (e.g. squats), one upper-body pulling exercise (e.g. pull-up) and one upper-body pushing exercise (e.g. push-up). Perform a minimum of 4 weekly sets per muscle group using a 6–15 rep max loading range.
- Eccentrics / negatives are superior to concentric movements. Don’t neglect or rush through the negatives / eccentrics. That’s the part of an exercise you ignore by default — letting your weight come down during a squat, pull-up, or push-up rather than when you’re pushing/pulling it back up. Take your time on that part, because it’s actually more important. Doing something as quick as 3-second negatives, 4x/wk, will improve strength.
Overall, that suggests a workout design that looks like this (2 days a week): 2+ sets of each: Compound exercises for arm push, arm pull, leg press Aim for whatever difficulty is required to max out at 6–8 repetitions for strength & hypertrophy (muscle size), or up to 15 if you’re focusing on endurance Do slow eccentrics / negatives on every exercise The new routine To incorporate this research into a redesigned routine that also includes HIIT and core work, here’s what I’ve recently changed to (most links go to “progressions” that will help you get started): Monday: Strength: push-ups, pull-ups, squats as alternating set Tuesday: HIIT (burpees, mountain climbers, star jumps, etc) Wednesday: Core & Flexibility: L-sits, planks, Nordic curls, stretches Thursday: HIIT (similar routine) Friday: Strength: handstand push-ups, inverted rows, squats as alternating set Saturday/Sunday: Rest days Also, 4+ days a week, I do a quick set of a 5-second negative for each type of compound exercise (arm push, arm pull, leg press). That’s just 2 days in addition to my strength days, so I usually fit it into HIIT warm-up or cool-down. On each day, my overall expected time commitment will be about 10 minutes. For strength training, all the alternating sets will overlap with each other. Even with a 3-min break between each set for the same muscle group, that should run quite efficiently for 2–3 sets. For HIIT, it’s already a highly compressed routine that takes ~5 minutes including warm-up and cool-down, but I need another 5 minutes afterwards to decompress after exercise that intense. You may notice that I only have one dedicated day to work my core (Wednesday), but I’m also getting core exercise during push-ups (as I plank), L-sit pull-ups, and handstands (as I balance). The research recommendation to increase load to 80% of your max can seem more challenging with calisthenics, since it’s just about bodyweight. 
However, it’s always possible by decreasing your leverage, using one limb instead of two, or increasing the proportion of your weight that’s applied by changing your body angles. For example, you can do push-ups at a downwards incline with your feet on a bench/chair. You can also do more advanced types of squats like Bulgarian split squats, shrimp squats, or pistol squats. Summary My cardiorespiratory fitness, as measured by VO2 Max (maximal oxygen consumption) on my Apple Watch, has increased from 32 (the lowest end of “below average,” for my age & gender) to 40.1 (above average). It continues to improve on a nearly daily basis. That’s largely happened within just a couple of months, since I started walking every day and doing HIIT.  My blood pressure (one of my initial concerns) has dropped out of pre-hypertension into the healthy range. My resting heart rate has also decreased from 63 to 56 bpm, which was a long slow process that’s occurred over the entire course of my weight loss. On the strength side, I wasn’t expecting any gains because I’m in a caloric deficit. My main goal was to avoid losing muscle while losing weight. I’ve now been strength training for 2.5 months, and I’ve been pleasantly surprised by the “newbie gains” (which people often see in their first year or two of strength training).  For example, I couldn’t do any pull-ups when I started. I could barely do a couple of negatives, by jumping up and letting myself down slowly. Now I can do 4 pull-ups (neutral grip). Also, I can now hold a wall handstand for 30–45 seconds and do 6–8 very small push-ups, while I could barely get into that position at all when I started.  Overall, clear results emerged almost instantly for cardiorespiratory fitness, and as soon as 6 weeks after beginning a regular strength-training routine. If you try it out, let me know how it works for you!
  • Donnie Berkholz: The lazy technologist’s guide to weight loss (2024/02/16 19:56)
    [Last update: 2024-02-16] In the past 8 months, I’ve lost 60 pounds and gone from completely sedentary to much more fit, while putting in a minimum of effort. I have no desire to be a bodybuilder, but I want to be in great shape now and be as healthy and mobile as possible well into my old age. A year ago, my blood pressure was already at pre-hypertension levels, despite my relatively young age.  I wasn’t willing to let this last any longer, and I wasn’t willing to accept that future. Research shows that 5 factors are key to a long life — correlated with extending your life by 12–14 years: Never smoking BMI (body mass index) of 18.5–24.9 30+ min a day of moderate/vigorous exercise Moderate alcohol intake (vs none, occasional, or heavy) Diet quality in the upper 40% (Alternate Healthy Eating Index) In addition, people who are in good health have a much shorter end-of-life period. This means they extend the healthy portion of their lifespan (the “healthspan”) and compress the worst parts into a shorter period at the very end. Having seen many grandparents go through years of struggle as they grew older, I wanted my own story to have a different ending. Although I’m not a smoker, I was missing three of the other factors. My weight was massively unhealthy, I didn’t exercise at all and spent most of my day in front of a desk, and my diet was awful. On the bright side for these purposes, I drink moderately (almost entirely beer). In this post, I’ll walk through my own experience going from obese to a healthy weight, with plenty of research-driven references and data along the way. Why is this the lazy technologist’s guide, though? I wanted to lose weight in the “laziest” way possible — in the same sense that lazy programmers find the most efficient solutions to problems, according to an apocryphal quote by Bill Gates and a real one by Larry Wall, creator of Perl. Gates supposedly said, “I choose a lazy person to do a hard job. 
Because a lazy person will find an easy way to do it.” Wall wrote in Programming Perl, “Laziness: The quality that makes you go to great effort to reduce overall energy expenditure. It makes you write labor-saving programs that other people will find useful and document what you wrote so you don’t have to answer so many questions about it.” What’s the lowest-effort, most research-driven way to lose weight as quickly as possible without losing health? Discovering and executing upon that was my journey. Read on if you’re considering taking a similar path. My weight-loss journey begins My initial goal was to get down from 240 pounds (obese, BMI of 31.7) into the healthy range, reaching 185 pounds (BMI of 24.4).  My aim was to lose at the high end of a healthy rate, 2 pounds per week. Credible sources like the Mayo Clinic and the CDC suggested aiming for 1–2 pounds a week, because anything beyond that can cause issues with muscle loss as well as malnutrition. But how could I accomplish that? One weird trick — Eat less I’ve lost weight once previously (about 15 years ago), although it was a smaller amount. Back then, I learned that there’s no silver bullet — the trick is to create a calorie deficit, so that your body consumes more energy than the calories in what you eat.  Every pound is about 3500 calories, which helps to set a weekly and daily goal for your calorie deficit. For me to lose 2 pounds a week, that’s 2*3500 = 7000 calories/week, or 1000 calories/day of deficit (eating that much less than my body uses). Exercise barely makes a dent It’s far more effective and efficient to create this deficit primarily through eating less rather than expecting exercise to make a huge difference. If you were previously gaining weight, you might’ve been eating 3000 calories/day or more! You can easily reduce what you eat by 1500 calories/day from that starting point, but it’s almost impossible to exercise enough to burn that many calories. 
An hour of intense exercise might burn 500 calories, and it’s very hard to keep up that level of effort for even one full hour — especially if you’ve been sitting in a chair all day for years on end. Not to mention, that much exercise would defeat the whole idea of this being the lazy person’s way of making progress. So how exactly can you reduce calories? You’ve got a lot of options, but they basically boil down to two things — eat less (portion control), and eat better (food choice). The plan At this point, I knew I needed to eat 1000 calories/day less than I burned. I used this calculator to identify that, as a sedentary person, I burned about 2450 calories/day. So to create that deficit, I needed to eat about 1450 calories/day. At that point, I was probably eating 2800–3000 calories/day, so that would require massive changes in my diet. I don’t like the idea of fad diets that remove one or many types of food entirely (Atkins, keto, paleo, etc), although they can work for other people. One of the big lessons about dieting is that as long as you’re removing something from what you eat, you’ll probably lose weight.  I decided to make two big changes: how often I ate healthy vs unhealthy food, and when I ate over the course of the day. At the time, I was eating a huge amount of high-fat, high-sugar, and low-health foods like burgers and fries multiple times per week, fried food, lots of chips/crisps, white bread (very high sugar in the US) & white rice, cheese, chocolate and candy.  I decided to shift that toward white meat (chicken/pork/turkey), seafood, salads & veggies, and whole grains (whole-wheat bread, brown rice, quinoa, etc). One pro-tip: American salad dressings are super unhealthy, often even the “vinaigrettes” that sound better. Do like the Italians do, and dress salads yourself with olive oil, salt, and vinegar. 
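The plan’s arithmetic is simple enough to sketch in a few lines (a toy calculation using the post’s numbers: the 3500-calorie pound, a 2 lb/week target, and ~2450 calories/day of maintenance burn; the function names are mine, and this is not nutrition software):

```python
CALORIES_PER_POUND = 3500  # rule-of-thumb energy content of a pound of body fat

def daily_deficit(pounds_per_week):
    """Daily calorie deficit needed for a given weekly loss rate."""
    return pounds_per_week * CALORIES_PER_POUND / 7

def daily_intake(maintenance_calories, pounds_per_week):
    """Target daily intake: maintenance burn minus the deficit."""
    return maintenance_calories - daily_deficit(pounds_per_week)

print(daily_deficit(2))       # 1000.0 calories/day of deficit
print(daily_intake(2450, 2))  # 1450.0 calories/day to eat
```

The same two functions also explain the later plateau: plug in a lower maintenance burn and the required intake drops with it.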
However, I didn’t want to remove my favorite foods entirely, because that would destroy my long-term motivation and enjoyment of my progress. For example, once a week, I still allow myself to get a cheeseburger. But I’ll typically get a single patty, no mayo/cheese/ketchup, and with a side like salad (w/ healthy dressing) or cole slaw. I’ll also ensure my other meal of the day is very light. Many days, I’ll enjoy a small treat like 1–2 chocolates, as well (50–100 calories). What if you like beer? I wanted to reach my calorie target without eliminating beer, so I could both preserve my quality of life and also maintain the moderate drinking that research shows is correlated with increased lifespan.  I was also drinking very high-calorie beer (like double IPAs and bourbon-barrel–aged imperial stouts). I shifted that toward low-alcohol, low-calorie beer (alcohol levels and calories are correlated). Bell’s Light-Hearted IPA and Lagunitas DayTime IPA are two pretty good ones in my area. Of the non-alcoholic (NA) beers, Athletic Free Wave Hazy IPA is the best I’ve found in my area, but Untappd has reasonably good ratings for Sam Adams Just the Haze and Sierra Nevada Trail Pass IPA, which should be broadly available. As a rough estimate on calories in beer, you can use this formula: Beer calories = ABV (alcohol percentage) * 2.5 * fluid ounces As an exception, many Belgian beers are quite “efficient” to drink, in that roughly 75% of the calories are alcohol rather than other carbs that just add calories. As a result, they violate the above formula and tend to be lower-calorie than you’d expect. This could be the result of carefully crafted recipes that consume most of the carbs, and fermentation that uses up all of the sugar.  
Here’s a more specific formula that you can use, if you’re curious about how “efficient” a given beer is, and you know how many total calories it has (find this online): Beer calories from ethanol = (ABV * 0.8 / 100) * (29.6 * fluid ounces) * 7 (Simplified form): Beer calories from ethanol = ABV * 1.7 * fluid ounces This uses the density and calories of ethanol (0.8 g/ml and 7 cal/g, respectively) and converts from milliliters to ounces (29.6 ml/oz). If you then calculate that number as a fraction of the total calories in a beer, you can find its “efficiency.” For example, a 12-ounce bottle of 8.5% beer might have 198 calories total. Using the equation, we can calculate that it’s got 169 calories from ethanol, so 169/198 = 85% “efficient.” If you’re really trying to optimize for this, however, beer is the wrong drink. Have a low-calorie mixed drink instead, like a vodka soda, ranch water, or rum and Diet Coke. The plan (part 2) Therefore, instead of giving up beer entirely, I decided to skip breakfast. I’d eaten light breakfasts for years (a small bowl of cereal, or a banana and a granola bar), so this wasn’t a big deal to me.  Later, I discovered this qualified my diet as time-restricted intermittent fasting as well, since I was only eating/drinking between ~12pm–6pm. This approach of 18 hours off / 6 hours on (18:6 fasting) may have aided in my weight loss, but studies are mixed with some suggesting no effect. Here’s what a day might look like on 1450 calories: Lunch (400 calories). A tuna-salad sandwich (made with Greek yogurt instead of mayo) on whole-wheat bread, and a side salad with olive oil & vinegar. Afternoon snack (150 calories). Sliced bell peppers, no dip, and a small bowl of cottage cheese. A treat (50–100 calories). A truffle or a couple of small chocolates as an afternoon treat. Dinner (650 calories). Fried chicken/fish sandwich (or kids-size burger) and a small order of fries, from a fast-casual restaurant. 
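The two beer formulas above are easy to check in code (a quick sketch; the function names are mine, and the constants are the ones given in the post):

```python
def beer_calories_estimate(abv, ounces):
    """Rough total calories: the ABV * 2.5 * fluid-ounces rule of thumb."""
    return abv * 2.5 * ounces

def ethanol_calories(abv, ounces):
    """Calories from ethanol alone: 0.8 g/ml density, 29.6 ml/oz, 7 cal/g."""
    return (abv * 0.8 / 100) * (29.6 * ounces) * 7

def efficiency(abv, ounces, total_calories):
    """Fraction of a beer's total calories that come from alcohol."""
    return ethanol_calories(abv, ounces) / total_calories

# The worked example: a 12 oz bottle of 8.5% beer with 198 total calories.
print(round(ethanol_calories(8.5, 12)))       # 169 calories from ethanol
print(round(100 * efficiency(8.5, 12, 198)))  # 85% "efficient"
```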
One or two low-alcohol, light, or NA beers (150–200 calories). When I get hungry, I often drink some water instead, because my body’s easily confused about hunger vs thirst. It’s a mental game too — I remind myself that hunger means my body is burning fat, and that’s a good thing. For a long time, I kept track of my estimated calorie consumption mentally. More recently, I decided to make my life a little easier by switching to an app. I chose MyFitnessPal because it’s got a big database including almost everything I eat. On this plan, I had a great deal of success in losing my first 40 pounds, getting down from 240 to 200. However, it started to feel like a bit of a struggle to maintain my weight loss as I reached 200 pounds and wanted to continue losing at the same rate of 2 pounds/week. Adaptation, plateaus and persistence I fell behind by about two weeks on my weight-loss goal, which was massively frustrating because I’d done so well all along. I convinced myself to keep persisting because it had worked all along for months, and this was a temporary setback. Finally I re-used the same weight-loss calculator and realized what seemed obvious in hindsight: Since I now weighed less, I also burned fewer calories per day! Those 40 pounds that were now gone didn’t use any energy anymore, but I was still eating as if I had them. I needed to change something to restore the 1000-calorie daily deficit.  At this point, I aimed to decrease my intake to about 1200 calories per day. This quickly became frustrating because it started to affect my quality of life by forcing choices I didn’t want to make, such as choosing between a decent dinner or a beer, or forcing me to eat a salad with no protein for dinner if I had a little bit bigger lunch. That low calorie limit also carried the risk of causing metabolic adaptation — meaning my body could burn hundreds fewer calories per day as a result of being in a “starvation mode” of sorts. 
That ends up being a vicious cycle that continually forces you to eat less, and it makes weight loss even more challenging. Consequently, I began to introduce moderate exercise (walking), so I could bring my intake back up to 1400 calories on days when I burned 200 extra calories. I’ve discussed the details in a follow-up guide for fitness. Over the course of my learning, I discovered that it’s ideal (according to actuarial tables) to sit in the middle of the healthy range rather than be at the top of it. I maintained my initial weight-loss goal to keep myself motivated on progress, but set a second goal of reaching 165 pounds — or whatever weight it takes to get a six-pack (~10% body fat). Eat lots of protein I also discovered that high-protein diets are better at preserving muscle, so more of the weight loss is fat. This is especially true when coupled with resistance or strength training, which also sends your body a signal that it needs to keep its muscle instead of losing it. The minimum recommended daily allowance (RDA) of protein (0.36 grams per pound of body weight, or 67 g/day for me) could be your absolute lower limit, while as much as 0.6 g/lb (111 g/day for me) could help in improving your muscle mass.  Another study suggested multiplying the RDA by 1.25–1.5 (or more if you exercise) to maintain muscle during weight loss, which would put my recommended protein at 84–100 grams per day. The same study also said exercise helps to maintain muscle during weight loss, so it could be an either/or situation rather than needing both. Additionally, high-protein diets can help with hunger and weight loss, in part because they keep you fuller for longer. Getting 25%–30% of daily calories from protein will get you to this level, which is a whole lot of protein. 
Starting from your overall daily calories, you can apply this percentage and then divide your desired protein calories by 4 to get the number of grams per day: Protein grams per day = Total daily calories * {25%, 30%} / 4 For my calorie limit, that’s about 88–105 grams per day.  I’ve found that eating near the absolute minimum recommended protein level (67 grams per day, for my weight) tends to happen fairly naturally with my originally planned diet, while getting much higher protein takes real effort. I needed to identify low-calorie, high-protein foods and incorporate them more intentionally into meals, so that I can get enough protein without compromising my daily calorie limit.  Here’s a good list of low-calorie, high-protein foods that are pretty affordable: Breakfast/Lunch: eggs or low-fat/nonfat Greek yogurt (with honey/berries),  Entree: grilled/roasted chicken (or pork/turkey) or seafood (especially shrimp, canned salmon, canned tuna), and Sides: cottage cheese or lentils/beans (including soups, to make it an entree). If you’re vegetarian, you’d want to go heavier on lentils and beans, and add plenty of nuts, including hummus and peanut butter. You probably also want to bring in tempeh, and you likely already eat tofu. I’d never tried canned salmon before, and I was impressed with how easily I could make it into a salad or an open-faced sandwich (like Danish smørrebrød). The salmon came in large pieces and retained the original texture, as you’d want. Canned tuna has been more variable in terms of texture — I’ve had some great-looking albacore from Genova and some great-tasting (but not initially good-looking) skipjack from Wild Planet. Avoid the most common brands of canned fish though, like Chicken of the Sea, StarKist, or Bumble Bee. They are often farmed or net-caught instead of pole/line-caught, and they may be higher in parasites (for farmed fish like salmon). 
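The protein formula above can be applied directly (a small sketch with my own function name; the 1,400-calorie figure is the reduced daily limit mentioned earlier in the post):

```python
def protein_grams(total_calories, fraction):
    """Grams of protein needed to supply a given fraction of daily
    calories, at 4 calories per gram of protein."""
    return total_calories * fraction / 4

# 25-30% of a ~1400-calorie day:
low = protein_grams(1400, 0.25)   # 87.5 g
high = protein_grams(1400, 0.30)  # 105.0 g
print(round(low), round(high))    # roughly the 88-105 g range in the post
```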
I also aim to buy lower-mercury types of salmon and tuna — this means I can eat each kind of fish as often as I want, instead of once a week. I buy canned Wild Planet skipjack tuna (not albacore, but yellowfin is pretty good too) and canned Deming’s sockeye salmon (not pink salmon) at my local grocery store, and I pick up large trays of refrigerated cocktail shrimp at Costco. The Genova brand also garners good reviews for canned fish and may be easier to find. All of those are pre-cooked and ready to eat, so they’re easy to use for a quick lunch.  Go ahead and get fresh seafood if you want, but be aware that you’ll be going through a lot of it so it could get expensive. Fish only stays good for a couple of days unless frozen, so you’ll also be making a lot of trips to the store or regularly thawing/cooking frozen fish. Summary Over the past 8 months, I’ve managed to lose 60 pounds (and counting!) through a low-effort approach that has minimized the overall impact on my quality of life. I’ve continued to eat the foods I want — but less of them. The biggest challenge has been persistence through the tough times. However, not cutting out any foods completely, but rather just decreasing the frequency of unhealthy foods in my life, has been a massive help with that. That meant I didn’t feel like I was breaking my whole diet whenever I had something I really wanted, as long as it fit within my calorie limit. What’s next? A few months after beginning my weight loss, I also started working out to get into better shape, which was another one of those original 5 factors to a long life. Right now, I’m aiming to get down to about 10% body fat, which is likely to be around 165 pounds. Then I’ll flip my eating habits into muscle-building mode, which will require a slight caloric excess rather than a deficit.  Stay tuned to see what happens!
  • Melissa Wen: Keep an eye out: We are preparing the 2024 Linux Display Next Hackfest! (2024/02/16 17:25)
    Igalia is preparing the 2024 Linux Display Next Hackfest and we are thrilled to announce that this year’s hackfest will take place from May 14th to 16th at our HQ in A Coruña, Spain. This unconference-style event aims to bring together the most relevant players in the Linux display community to tackle current challenges and chart the future of the display stack. Key goals for the hackfest include: Releasing the power of collaboration: We’ll work to remove bottlenecks and pave the way for smoother, more performant displays. Problem-solving powerhouse: Brainstorming sessions and collaborative coding will target issues like HDR, color management, variable refresh rates, and more. Building on past commitments: Let’s solidify the progress made in recent years and push the boundaries even further. The hackfest fosters an intimate and focused environment to brainstorm, hack, and design solutions alongside fellow display experts. Participants will dive into discussions, tinker with code, and contribute to shaping the future of the Linux display stack. More details are available on the official website. Stay tuned! Keep an eye out for more information, mark your calendars and start prepping your hacking gear.
  • Lucas Fryzek: A Dive into Vulkanised 2024 (2024/02/14 05:00)
    Vulkanised sign at Google’s office Last week I had an exciting opportunity to attend the Vulkanised 2024 conference. For those of you not familiar with the event, it is “The Premier Vulkan Developer Conference” hosted by the Vulkan working group from Khronos. With the excitement out of the way, I decided to write about some of the interesting information that came out of the conference. A Few Presentations My colleagues Iago, Stéphane, and Hyunjun each had the opportunity to present on some of their work in the wider Vulkan ecosystem. Stéphane and Hyunjun presenting Stéphane & Hyunjun presented “Implementing a Vulkan Video Encoder From Mesa to GStreamer”. They jointly talked about the work they performed to implement the Vulkan video extensions in Intel’s ANV Mesa driver as well as in GStreamer. This was an interesting presentation because you got to see how the new Vulkan video extensions affected both driver developers implementing the extensions and application developers making use of the extensions for real-time video decoding and encoding. Their presentation is available online. Iago presenting Later my colleague Iago presented jointly with Faith Ekstrand (a well-known Linux graphics stack contributor from Collabora) on “8 Years of Open Drivers, including the State of Vulkan in Mesa”. They both talked about the current state of Vulkan in the open source driver ecosystem, and some of the benefits open source drivers have been able to take advantage of, like the common Vulkan runtime code and a shared compiler stack. You can check out their presentation for all the details. Besides Igalia’s presentations, there were several more which I found interesting, with topics such as Vulkan developer tools, experiences of using Vulkan in real-world applications, and even how to teach Vulkan to new developers. Here are some highlights from a few of them. 
Using Vulkan Synchronization Validation Effectively John Zulauf gave a presentation on the Vulkan synchronization validation layers that he has been working on. If you are not familiar with these, then you should really check them out. They work by tracking how resources are used inside Vulkan and providing error messages with some hints if you use a resource in a way where it is not synchronized properly. They can’t catch every error, but they’re a great tool in the toolbelt of Vulkan developers to make their lives easier when it comes to debugging synchronization issues. As John said in the presentation, synchronization in Vulkan is hard, and nearly every application he tested the layers on revealed a synchronization issue, no matter how simple it was. He can proudly say he is a vkQuake contributor now because of these layers. 6 Years of Teaching Vulkan with Example for Video Extensions This was an interesting presentation from a professor at the University of Vienna about his experience teaching graphics as well as game development to students who may have little real programming experience. He covered the techniques he uses to make learning easier as well as resources that he uses. This would be a great presentation to check out if you’re trying to teach Vulkan to others. Vulkan Synchronization Made Easy Another presentation focused on Vulkan sync, but instead of debugging it, Grigory showed how his graphics library abstracts sync away from the user without implementing a render graph. He presented an interesting technique that is similar to how the sync validation layers work when it comes to ensuring that resources are always synchronized before use. If you’re building your own engine in Vulkan, this is definitely something worth checking out. 
Vulkan Video Encode API: A Deep Dive Tony at Nvidia did a deep dive into the new Vulkan Video extensions, explaining a bit about how video codecs work, and also including a roadmap for future codec support in the video extensions. Especially interesting for us was that he made a nice call-out to Igalia and our work on Vulkan Video CTS and open source driver support on slide 6 :) Thoughts on Vulkanised Vulkanised is an interesting conference that gives you the intersection of people working on Vulkan drivers, game developers using Vulkan for their graphics backend, visual FX tool developers using Vulkan-based tools in their pipeline, industrial application developers using Vulkan for some embedded commercial systems, and general hobbyists who are just interested in Vulkan. As an example of some of these interesting audience members, I got to talk with a member of the Blender foundation about his work on the Vulkan backend to Blender. Lastly, the event was held at Google’s offices in Sunnyvale, which I’m always happy to travel to, not just for the better weather (coming from Canada) but also for the amazing restaurants and food in the Bay Area! Great Bay Area food
  • Alyssa Rosenzweig: Conformant OpenGL 4.6 on the M1 (2024/02/14 05:00)
    For years, the M1 has only supported OpenGL 4.1. That changes today – with our release of full OpenGL® 4.6 and OpenGL® ES 3.2! Install Fedora for the latest M1/M2-series drivers. Already installed? Just dnf upgrade --refresh. Unlike the vendor’s non-conformant 4.1 drivers, our open source Linux drivers are conformant to the latest OpenGL versions, finally promising broad compatibility with modern OpenGL workloads, like Blender. Conformant 4.6/3.2 drivers must pass over 100,000 tests to ensure correctness. The official list of conformant drivers now includes our OpenGL 4.6 and ES 3.2. While the vendor doesn’t yet support graphics standards like modern OpenGL, we do. For this Valentine’s Day, we want to profess our love for interoperable open standards. We want to free users and developers from lock-in, enabling applications to run anywhere the heart wants without special ports. For that, we need standards conformance. Six months ago, we became the first conformant driver for any standard graphics API for the M1 with the release of OpenGL ES 3.1 drivers. Today, we’ve finished OpenGL with the full 4.6… and we’re well on the road to Vulkan. Compared to 4.1, OpenGL 4.6 adds dozens of required features, including: Robustness SPIR-V Clip control Cull distance Compute shaders Upgraded transform feedback Regrettably, the M1 doesn’t map well to any graphics standard newer than OpenGL ES 3.1. While Vulkan makes some of these features optional, the missing features are required to layer DirectX and OpenGL on top. No existing solution on M1 gets past the OpenGL 4.1 feature set. How do we break the 4.1 barrier? Without hardware support, new features need new tricks. Geometry shaders, tessellation, and transform feedback become compute shaders. Cull distance becomes a transformed interpolated value. Clip control becomes a vertex shader epilogue. The list goes on. For a taste of the challenges we overcame, let’s look at robustness. 
Built for gaming, GPUs traditionally prioritize raw performance over safety. Invalid application code, like a shader that reads a buffer out-of-bounds, can trigger undefined behaviour. Drivers exploit that to maximize performance. For applications like web browsers, that trade-off is undesirable. Browsers handle untrusted shaders, which they must sanitize to ensure stability and security. Clicking a malicious link should not crash the browser. While some sanitization is necessary as graphics APIs are not security barriers, reducing undefined behaviour in the API can assist “defence in depth”. “Robustness” features can help. Without robustness, out-of-bounds buffer access in a shader can crash. With robustness, the application can opt for defined out-of-bounds behaviour, trading some performance for less attack surface. All modern cross-vendor APIs include robustness. Many games even (accidentally?) rely on robustness. Strangely, the vendor’s proprietary API omits buffer robustness. We must do better for conformance, correctness, and compatibility. Let’s first define the problem. Different APIs have different definitions of what an out-of-bounds load returns when robustness is enabled: Zero (Direct3D, Vulkan with robustBufferAccess2) Either zero or some data in the buffer (OpenGL, Vulkan with robustBufferAccess) Arbitrary values, but can’t crash (OpenGL ES) OpenGL uses the second definition: return zero or data from the buffer. One approach is to return the last element of the buffer for out-of-bounds access. Given the buffer size, we can calculate the last index. Now consider the minimum of the index being accessed and the last index. That equals the index being accessed if it is valid, and some other valid index otherwise. Loading the minimum index is safe and gives a spec-compliant result. 
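The clamp-to-a-valid-index idea can be modeled outside the shader (a Python sketch of the logic with a made-up name; real drivers emit this as shader instructions, and GPU indices are unsigned, so a non-negative index and non-empty buffer are assumed):

```python
def robust_load(buffer, index):
    """OpenGL-style robust load: an out-of-bounds index returns some value
    from within the buffer (here, the last element) instead of crashing."""
    last = len(buffer) - 1           # last valid index, known from buffer size
    return buffer[min(index, last)]  # the single unsigned-min the shader adds

data = [10, 20, 30, 40]
print(robust_load(data, 2))    # in bounds: 30
print(robust_load(data, 999))  # out of bounds: clamped to the last element, 40
```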
As an example, a uniform buffer load without robustness might look like: load.i32 result, buffer, index Robustness adds a single unsigned minimum (umin) instruction: umin idx, index, last load.i32 result, buffer, idx Is the robust version slower? It can be. The difference should be small percentage-wise, as arithmetic is faster than memory. With thousands of threads running in parallel, the arithmetic cost may even be hidden by the load’s latency. There’s another trick that speeds up robust uniform buffers. Like other GPUs, the M1 supports “preambles”. The idea is simple: instead of calculating the same value in every thread, it’s faster to calculate once and reuse the result. The compiler identifies eligible calculations and moves them to a preamble executed before the main shader. These redundancies are common, so preambles provide a nice speed-up. We usually move uniform buffer loads to the preamble when every thread loads the same index. Since the size of a uniform buffer is fixed, extra robustness arithmetic is also moved to the preamble. The robustness is “free” for the main shader. For robust storage buffers, the clamping might move to the preamble even if the load or store cannot. Armed with robust uniform and storage buffers, let’s consider robust “vertex buffers”. In graphics APIs, the application can set vertex buffers with a base GPU address and a chosen layout of “attributes” within each buffer. Each attribute has an offset and a format, and the buffer has a “stride” indicating the number of bytes per vertex. The vertex shader can then read attributes, implicitly indexing by the vertex. To do so, the shader loads from the address base + (stride * vertex) + offset. Some hardware implements robust vertex fetch natively. Other hardware has bounds-checked buffers to accelerate robust software vertex fetch. Unfortunately, the M1 has neither. We need to implement vertex fetch with raw memory loads. One instruction set feature helps. 
In addition to a 64-bit base address, the M1 GPU’s memory loads also take an offset in elements. The hardware shifts the offset and adds it to the 64-bit base to determine the address to fetch. Additionally, the M1 has a combined integer multiply-add instruction, imad. Together, these features let us implement vertex loads in two instructions. For example, a 32-bit attribute load looks like:

```
imad     idx, stride/4, vertex, offset/4
load.i32 result, base, idx
```

The hardware load can perform an additional small shift. Suppose our attribute is a vector of 4 32-bit values, densely packed into a buffer with no offset. We can load that attribute in one instruction:

```
load.v4i32 result, base, vertex << 2
```

…with the hardware calculating the address:

base + 4 × (vertex << 2) = base + (16 × vertex)

What about robustness? We want to implement robustness with a clamp, like we did for uniform buffers. The problem is that the vertex buffer size is given in bytes, while our optimized load takes an index in “vertices”. A single vertex buffer can contain multiple attributes with different formats and offsets, so we can’t convert the size in bytes to a size in “vertices”. Let’s handle the latter problem. We can rewrite the addressing equation as:

(base + offset) + (stride × vertex)

That is: one buffer with many attributes at different offsets is equivalent to many buffers with one attribute and no offset. This gives an alternate perspective on the same data layout. Is this an improvement? It avoids an addition in the shader, at the cost of passing more data – addresses are 64-bit while attribute offsets are 16-bit. More importantly, it lets us translate the vertex buffer size in bytes into a size in “vertices” for each vertex attribute. Instead of clamping the offset, we clamp the vertex index. We still make full use of the hardware addressing modes, now with robustness:

```
umin       idx, vertex, last_valid
load.v4i32 result, base, idx << 2
```

We need to calculate the last valid vertex index ahead-of-time for each attribute. Each attribute has a format with a particular size.
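Ahead-of-time, the driver can compute that per-attribute bound with a little integer math. Here is a sketch in Python with illustrative names (a model of the logic described here, not actual driver code), assuming the attribute’s byte offset has already been folded into its base address:

```python
def attribute_clamp(buffer_size, offset, stride, format_size):
    """Compute the robust-vertex-fetch clamp for one attribute.
    Returns (use_zero_buffer, last_valid_vertex)."""
    avail = buffer_size - offset  # bytes usable by this attribute
    if avail < format_size:
        # Too small to load even one vertex: bind a small zero-filled
        # buffer instead and clamp every index to 0.
        return True, 0
    if stride == 0:
        # Zero stride: every vertex reads the same, valid element.
        return False, 0
    # Largest vertex satisfying stride * vertex + format_size <= avail.
    return False, (avail - format_size) // stride
```

For example, a 16-byte attribute at offset 0 with stride 16 in a 64-byte buffer yields a last valid vertex of 3: vertices 0 through 3 touch bytes 0 through 63, and vertex 4 would read past the end.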
Manipulating the addressing equation, we can calculate the last byte accessed in the buffer (plus 1) relative to the base:

(stride × vertex) + (size of the format)

The load is valid when that value is bounded by the buffer size in bytes. We solve the integer inequality as:

vertex ≤ (buffer size − size of the format) / stride

The driver calculates the right-hand side and passes it into the shader. One last problem: what if a buffer is too small to load anything? Clamping won’t save us – the code would clamp to a negative index. In that case, the attribute is entirely invalid, so we swap the application’s buffer for a small buffer of zeroes. Since we gave each attribute its own base address, this determination is per-attribute. Then clamping the index to zero correctly loads zeroes. Putting it together, a little driver math gives us robust buffers at the cost of one umin instruction.

In addition to buffer robustness, we need image robustness. Like its buffer counterpart, image robustness requires that out-of-bounds image loads return zero. That formalizes a guarantee that reasonable hardware already makes. …But it would be no fun if our hardware was reasonable. Running the conformance tests for image robustness, there is a single test failure affecting “mipmapping”. For background, mipmapped images contain multiple “levels of detail”. The base level is the original image; each successive level is the previous level downscaled. When rendering, the hardware selects the level closest to matching the on-screen size, improving efficiency and visual quality. With robustness, the specifications all agree that image loads return…

- Zero if the X- or Y-coordinate is out-of-bounds
- Zero if the level is out-of-bounds

Meanwhile, image loads on the M1 GPU return…

- Zero if the X- or Y-coordinate is out-of-bounds
- Values from the last level if the level is out-of-bounds

Uh-oh. Rather than returning zero for out-of-bounds levels, the hardware clamps the level and returns nonzero values. It’s a mystery why.
The vendor does not document their hardware publicly, forcing us to rely on reverse engineering to build drivers. Without documentation, we don’t know if this behaviour is intentional or a hardware bug. Either way, we need a workaround to pass conformance. The obvious workaround is to never load from an invalid level:

```
if (level <= levels) {
    return imageLoad(x, y, level);
} else {
    return 0;
}
```

That involves branching, which is inefficient. Loading an out-of-bounds level doesn’t crash, so we can speculatively load and then use a compare-and-select operation instead of branching:

```
vec4 data = imageLoad(x, y, level);
return (level <= levels) ? data : 0;
```

This workaround is okay, but it could be improved. While the M1 GPU has combined compare-and-select instructions, the instruction set is scalar. Each thread processes one value at a time, not a vector of multiple values. However, image loads return a vector of four components (red, green, blue, alpha). While the pseudo-code looks efficient, the resulting assembly is not:

```
image_load R, x, y, level
ulesel     R[0], level, levels, R[0], 0
ulesel     R[1], level, levels, R[1], 0
ulesel     R[2], level, levels, R[2], 0
ulesel     R[3], level, levels, R[3], 0
```

Fortunately, the vendor driver has a trick. We know the hardware returns zero if either X or Y is out-of-bounds, so we can force a zero output by setting X or Y out-of-bounds. As the maximum image size is 16384 pixels wide, any X greater than 16384 is out-of-bounds. That justifies an alternate workaround:

```
bool valid = (level <= levels);
int x_ = valid ? x : 20000;
return imageLoad(x_, y, level);
```

Why is this better? We only change a single scalar, not a whole vector, compiling to compact scalar assembly:

```
ulesel     x_, level, levels, x, #20000
image_load R, x_, y, level
```

If we preload the constant to a uniform register, the workaround is a single instruction. That’s optimal – and it passes conformance.

Blender “Wanderer” demo by Daniel Bystedt, licensed CC BY-SA.
  • Bastien Nocera: New and old apps on Flathub (2024/02/09 14:42)
3D Printing Slicers

I recently replaced my Flashforge Adventurer 3 printer that I had been using for a few years as my first printer with a BambuLab X1 Carbon, wanting a printer that was not a “project” so I could focus on modelling and printing. It's an investment, but my partner convinced me that I was using the printer often enough to warrant it, and told me to look out for Black Friday sales, which I did.

The hardware-specific slicer, Bambu Studio, was available for Linux, but only as an AppImage, with many people reporting crashes on startup, non-working video live view, and other problems that the hardware maker tried to work around by shipping separate AppImage variants for Ubuntu and Fedora.

After close to 150 patches to the upstream software (which, in hindsight, I could probably have avoided by compiling the C++ code with LLVM), I managed to “flatpak” the application and make it available on Flathub. It's reached 3k installs in about a month, which is quite a bit for a niche piece of software.

Note that if you click the “Donate” button on the Flathub page, it will take you to a page where you can ~~feed my transformed fossil fuel addiction~~ buy filament for repairs and printing perfectly fitting everyday items, rather than bulk importing them from the other side of the planet.

Preparing a Game Gear consoliser shell

I will continue to maintain the FlashPrint slicer for FlashForge printers, installed by nearly 15k users, although I have enabled automated updates now, and will not be updating the release notes, which required manual intervention. FlashForge have unfortunately never answered my queries about making this distribution of their software official (and fixing the crash when using a VPN...).

Rhythmbox

As I was updating the Rhythmbox Flatpak on Flathub, I realised that it just reached 250k installs, which puts the number of installations of those 3D printing slicers above into perspective.
The updated screenshot used on Flathub

Congratulations, and many thanks, to all the developers that keep on contributing to this very mature project, especially Jonathan Matthew who's been maintaining the app since 2008.
  • Tomeu Vizoso: Etnaviv NPU update 16: A nice performance jump (2024/02/08 09:36)
After the open-source driver for VeriSilicon's Vivante NPU was merged into Mesa two weeks ago, I have been taking some rest and thinking about what will come next.

Automated testing

I have a merge request to Mesa almost ready that will enable continuous integration testing on real hardware, but it depends on solving what seem to be problems with the power supplies of the boards in the HW testing lab. Collabora is graciously looking at it. Thanks!

Performance

I have been talking with quite a few people about the whole effort of bringing open-source to NPU hardware, and something that came up more than once is the question of reaching or surpassing the performance level of the proprietary drivers. It is a fair concern, because the systolic arrays will be underutilized if they are starved of data. And given how fast they are at performing the arithmetic operations, and how slow memory buses and chips on embedded are (relative to high-end GPUs, at least), this starving and the consequent underutilization are very likely to happen. IP vendors go to great lengths to prevent that from happening, inventing ways of getting the data faster to the processing elements, reducing the memory bandwidth used, and balancing the use of the different cores/arrays. There is plenty of published research in this area, which helps when figuring out how to make the most of a particular piece of hardware.

Weight compression

Something I started working on last week is compression of zero values in the weight buffers.
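The general idea of zero-run compression can be sketched in Python. This is purely illustrative: it assumes a small fixed-width run-length counter and a simple (zero_run, value) pair encoding; the actual Vivante bitstream layout is different and not shown here.

```python
def compress_zero_runs(weights, run_bits=5):
    """Run-length encode zeroes in a weight buffer as (zero_run, value)
    pairs: each pair means "zero_run zeroes, then this value".
    A sketch only, not the real hardware encoding."""
    max_run = (1 << run_bits) - 1  # longest run the counter can express
    out = []
    run = 0
    for w in weights:
        if w == 0 and run < max_run:
            run += 1
        else:
            out.append((run, w))
            run = 0
    if run:
        # Trailing zeroes: encode run-1 zeroes followed by a zero value.
        out.append((run - 1, 0))
    return out
```

With 90% sparsity, most weights collapse into the run counters, shrinking what has to cross the memory bus.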
Sparsity is very common in the neural models that this hardware is targeted to run, and common convolutions such as strided and depthwise can easily have zero ratios of 90% and more. By compressing consecutive zeroes in a buffer we can greatly reduce pressure on the memory bus, keeping the processing units better fed (though I'm sure we are still far from getting good utilization). By opportunistically using the 5 available bits to compress consecutive runs of zeroes, I was able to improve the performance of the MobileNetV1 model from 15.7 ms to 9.9 ms, and that of the SSDLite MobileDet model from 56.1 ms to 32.7 ms. As shown in the graph above, we still have quite some room for improvement before we reach the performance of the proprietary driver, but we are getting close pretty fast. I also believe that we can tailor the driver to users' needs to surpass the performance of the proprietary driver for specific models, as this is open-source and everybody can chip in, see how things are made and improve them.

IRC channel

I mentioned this in passing some time ago, but now that we have a driver at this level of usefulness, I think it is a good moment to remind people that we have an IRC channel in the OFTC network to discuss anything about doing accelerated machine learning on the edge with upstream open-source software: #ml-mainline. You can click here to join via a web interface, though I recommend setting up an account.

What next?

Should I continue working on performance? Enable more models for new use cases? Enable this driver on more SoCs (i.MX8MP and S905D3 look interesting)? Start writing a driver for a completely different IP, such as Rockchip's or Amlogic's? I still haven't decided, so if you have an opinion please drop a comment in this blog, or at any of the social networks linked from this blog. I'm currently available for contracting, so I should be able to get on your project full-time on short notice.
  • Nicolai Hähnle: Building a HIP environment from scratch (2024/02/07 11:30)
HIP is a C++-based, single-source programming language for writing GPU code. "Single-source" means that a single source file can contain both the "host code" which runs on the CPU and the "device code" which runs on the GPU. In a sense, HIP is "CUDA for AMD", except that HIP can actually target both AMD and Nvidia GPUs. If you merely want to use HIP, your best bet is to look at the documentation and download pre-built packages. (By the way, the documentation calls itself "ROCm" because that's what AMD calls its overall compute platform. It includes HIP, OpenCL, and more.) I like to dig deep, though, so I decided I want to build at least the user space parts myself to the point where I can build a simple HelloWorld using a Clang from upstream LLVM. It's all open-source, after all!

It's a bit tricky, though, in part because of the kind of bootstrapping problems you usually get when building toolchains: Running the compiler requires runtime libraries, at least by default, but building the runtime libraries requires a compiler. Luckily, it's not quite that difficult, though, because compiling the host libraries doesn't require a HIP-enabled compiler - any C++ compiler will do. And while the device libraries do require a HIP- (and OpenCL-)enabled compiler, it is possible to build code in a "freestanding" environment where runtime libraries aren't available.

What follows is pretty much just a list of steps with running commentary on what the individual pieces do, since I didn't find an equivalent recipe in the official documentation. Of course, by the time you read this, it may well be outdated. Good luck!

Components need to be installed, but installing into some arbitrary prefix inside your $HOME works just fine. Let's call it $HOME/prefix. All packages use CMake and can be built using invocations along the lines of:

```
cmake -S . -B build -GNinja -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DCMAKE_INSTALL_PREFIX=$HOME/prefix -DCMAKE_PREFIX_PATH=$HOME/prefix
ninja -C build install
```

In some cases, additional variables need to be set.

Step 1: clang and lld

We're going to need a compiler and linker, so let's get llvm/llvm-project and build it with Clang and LLD enabled:

```
-DLLVM_ENABLE_PROJECTS='clang;lld' -DLLVM_TARGETS_TO_BUILD='X86;AMDGPU'
```

Building LLVM is an art of its own which is luckily reasonably well documented, so I'm going to leave it at that.

Step 2: Those pesky cmake files

Build and install ROCm/rocm-cmake to avoid cryptic error messages down the road when building other components that use those CMake files without documenting the dependency clearly. Not rocket science, but man am I glad for GitHub's search function.

Step 3: libhsa-runtime64.so

This is the lowest level user space host-side library in the ROCm stack. Its services, as far as I understand them, include setting up device queues and loading "code objects" (device ELF files). All communication with the kernel driver goes through here. Notably though, this library does not know how to dispatch a kernel! In the ROCm world, the so-called Architected Queueing Language is used for that. An AQL queue is set up with the help of the kernel driver (and that does go through this library), and then a small ring buffer and a "door bell" associated with the queue are mapped into the application's virtual memory space. When the application wants to dispatch a kernel, it (or rather, a higher-level library that links against this one) writes an AQL packet into the ring buffer and "rings the door bell", which basically just means writing a new ring buffer head pointer to the door bell's address.
The door bell virtual memory page is mapped to the device, so ringing the door bell causes a PCIe transaction (for us peasants; MI300A has slightly different details under the hood) which wakes up the GPU. Anyway, libhsa-runtime64.so comes in two parts for what I am being told are largely historical reasons:

- ROCm/ROCT-Thunk-Interface
- ROCm/ROCR-Runtime; this one has one of those bootstrap issues and needs a -DIMAGE_SUPPORT=OFF

The former is statically linked into the latter...

Step 4: It which must not be named

For Reasons(tm), there is a fork of LLVM in the ROCm ecosystem, ROCm/llvm-project. Using upstream LLVM for the compiler seems to be fine and is what I as a compiler developer obviously want to do. However, this fork has an amd directory with a bunch of pieces that we'll need. I believe there is a desire to upstream them, but also an unfortunate hesitation from the LLVM community to accept something so AMD-specific. In any case, the required components can each be built individually against the upstream LLVM from step 1:

- hipcc; this is a frontend for Clang which is supposed to be user-friendly, but at the cost of adding an abstraction layer. I want to look at the details under the hood, so I don't want to and don't have to use it; but some of the later components want it
- device-libs; as the name says, these are libraries of device code. I'm actually not quite sure what the intended abstraction boundary is between this one and the HIP libraries from the next step. I think these ones are meant to be tied more closely to the compiler so that other libraries, like the HIP library below, don't have to use __builtin_amdgcn_* directly? Anyway, just keep on building...
- comgr; the "code object manager". Provides a stable interface to LLVM, Clang, and LLD services, up to (as far as I understand it) invoking Clang to compile kernels at runtime. But it seems to have no direct connection to the code-related services in the rest of the stack.

This last one is annoying.
It needs a -DBUILD_TESTING=OFF. Worse, it has a fairly large interface with the C++ code of LLVM, which is famously not stable. In fact, at least during my little adventure, comgr wouldn't build as-is against the LLVM (and Clang and LLD) build that I got from step 1. I had to hack out a little bit of code in its symbolizer. I'm sure it's fine.

Step 5: libamdhip64.so

Finally, here comes the library that implements the host-side HIP API. It also provides a bunch of HIP-specific device-side functionality, mostly by leaning on the device-libs from the previous step. It lives in ROCm/clr, which stands for either Compute Language Runtimes or Common Language Runtime. Who knows. Either one works for me. It's obviously for compute, and it's common because it also contains OpenCL support. You also need ROCm/HIP at this point. I'm not quite sure why stuff is split up into so many repositories. Maybe ROCm/HIP is also used when targeting Nvidia GPUs with HIP, but ROCm/CLR isn't? Not a great justification in my opinion, but at least this is documented in the README. CLR also needs a bunch of additional CMake options:

```
-DCLR_BUILD_HIP=ON -DHIP_COMMON_DIR=${checkout of ROCm/HIP} -DHIPCC_BIN_DIR=$HOME/prefix/bin
```

Step 6: Compiling with Clang

We can now build simple HIP programs with our own Clang against our own HIP and ROCm libraries:

```
clang -x hip --offload-arch=gfx1100 --rocm-path=$HOME/prefix \
  -rpath $HOME/prefix/lib -lstdc++ HelloWorld.cpp
LD_LIBRARY_PATH=$HOME/prefix/lib ./a.out
```

Neat, huh?
  • Robert McQueen: Flathub: Pros and Cons of Direct Uploads (2024/02/06 10:57)
    I attended FOSDEM last weekend and had the pleasure to participate in the Flathub / Flatpak BOF on Saturday. A lot of the session was used up by an extensive discussion about the merits (or not) of allowing direct uploads versus building everything centrally on Flathub’s infrastructure, and related concerns such as automated security/dependency scanning. My original motivation behind the idea was essentially two things. The first was to offer a simpler way forward for applications that use language-specific build tools that resolve and retrieve their own dependencies from the internet. Flathub doesn’t allow network access during builds, and so a lot of manual work and additional tooling is currently needed (see Python and Electron Flatpak guides). And the second was to offer a maybe more familiar flow to developers from other platforms who would just build something and then run another command to upload it to the store, without having to learn the syntax of a new build tool. There were many valid concerns raised in the room, and I think on reflection that this is still worth doing, but might not be as valuable a way forward for Flathub as I had initially hoped. Of course, for a proprietary application where Flathub never sees the source or where it’s built, whether that binary is uploaded to us or downloaded by us doesn’t change much. But for an FLOSS application, a direct upload driven by the developer causes a regression on a number of fronts. We’re not getting too hung up on the “malicious developer inserts evil code in the binary” case because Flathub already works on the model of verifying the developer and the user makes a decision to trust that app – we don’t review the source after all. But we do lose other things such as our infrastructure building on multiple architectures, and visibility on whether the build environment or upload credentials have been compromised unbeknownst to the developer. 
There is now a manual review process for when apps change their metadata such as name, icon, license and permissions – which would apply to any direct uploads as well. It was suggested that if only heavily sandboxed apps (eg no direct filesystem access without proper use of portals) were permitted to make direct uploads, the impact of such concerns might be somewhat mitigated by the sandboxing. However, it was also pointed out that my go-to example of “Electron app developers can upload to Flathub with one command” was also a bit of a fiction. At present, none of them would pass that stricter sandboxing requirement. Almost all Electron apps run old versions of Chromium with less complete portal support, needing sandbox escapes to function correctly, and Electron (and Chromium’s) sandboxing still needs additional tooling/downstream patching to run inside a Flatpak. Buh-boh. I think for established projects who already ship their own binaries from their own centralised/trusted infrastructure, and for developers who have understandable sensitivities about binary integrity such as encryption, password or financial tools, it’s a definite improvement that we’re able to set up direct uploads with such projects with less manual work. There are already quite a few applications – including verified ones – where the build recipe simply fetches a binary built elsewhere and unpacks it, and if this is already done centrally by the developer, repeating the exercise on Flathub’s server adds little value. However, for the individual developer experience, I think we need to zoom out a bit and think about how to improve this from a tools and infrastructure perspective as we grow Flathub, and as we seek to raise funds from different sources for these improvements.
I took notes for everything that was mentioned as a tooling limitation during the BOF, along with a few ideas about how we could improve things, and hope to share these soon as part of an RFP/RFI (Request For Proposals/Request for Information) process. We don’t have funding yet but if we have some prospective collaborators to help refine the scope and estimate the cost/effort, we can use this to go and pursue funding opportunities.
  • Dave Airlie (blogspot): anv: vulkan av1 decode status (2024/02/05 03:16)
Vulkan Video AV1 decode has been released, and I had some partly working support on the Intel ANV driver previously, but I let it lapse. The branch is currently [1]. It builds, but is totally untested; I'll get some time next week to plug in my DG2 and see if I can persuade it to decode some frames. Update: the current branch decodes one frame properly; reference frames need more work, unfortunately. [1]
  • Dave Airlie (blogspot): radv: vulkan av1 video decode status (2024/02/02 02:27)
The Khronos Group announced VK_KHR_video_decode_av1 [1]; this extension adds AV1 decoding to the Vulkan specification. There is a radv branch [2] and merge request [3]. I did some AV1 work on this in the past, but I need to take some time to see if it has made any progress since. I'll post an ANV update once I figure that out. This extension is one of the ones I've been wanting for a long time, since having a royalty-free codec is something I can actually care about and ship, as opposed to the painful ones. I started working on a Mesa extension for this a year or so ago with Lynne from the ffmpeg project and we made great progress with it. We submitted that to Khronos and it has gone through the committee process and been refined and validated amongst the hardware vendors. I'd like to say thanks to Charlie Turner and Igalia for taking over a lot of the porting to the Khronos extension and fixing up bugs that their CTS development brought up. This is a great feature of having open source drivers: it allows a much quicker turnaround time on bug fixes when devs can fix them themselves! [1] [2] [3]
  • Bastien Nocera: Re: New responsibilities (2024/01/31 11:33)
A few months have passed since New Responsibilities was posted, so I thought I would provide an update.

Projects Maintenance

Of all the freedesktop projects I created and maintained, only one doesn't have a new maintainer: low-memory-monitor. This daemon is what the GMemoryMonitor GLib API is based on, so it can't be replaced trivially. Efforts seem to be under way to replace it with systemd APIs. As for the other daemons:

- switcheroo-control got picked up by Jonas Ådahl, one of the mutter maintainers. I'm looking forward to seeing this merge request fixed so we can have better menu items on dual-GPU systems
- iio-sensor-proxy added Dylan Van Assche to its maintenance team, assisting Guido Günther.
- power-profiles-daemon is now maintained by Marco Trevisan. It recently got support for separate system and CPU power profiles, and display power saving features are in the works.

(As an aside, there's posturing towards replacing power-profiles-daemon with tuned in Fedora. I would advise stakeholders to figure out whether having a large Python script in the boot hot path is a good idea, taking a look at bootcharts, and then thinking about whether hardware manufacturers would be able to help with supporting a tool with so many moving parts. Useful for tinkering, not for shipping in a product.)

Updated responsibilities

Since mid-August, I've joined the Platform Enablement Team. Right now, I'm helping out with maintenance of the Bluetooth kernel stack in RHEL (and thus CentOS). The goal is to eventually pivot to hardware enablement, which is likely to involve backporting and testing, more so than upstream enablement.
This is currently dependent on attending some formal kernel development (and debugging) training sessions, which should make it easier to see where my hodge-podge kernel knowledge stands.

Blog backlog

Before being moved to a different project, and apart from the usual and very time-consuming bug triage, user support and project maintenance, I also worked on a few new features. I have a few posts planned that will lay that out.
  • Peter Hutterer: New 🚯 emoji-based spamfighting abilities (2024/01/29 07:58)
This is a follow-up from our Spam-label approach, but this time with MOAR EMOJIS because that's what the world is turning into. Since March 2023, projects could apply the "Spam" label on any new issue and have a magic bot come in and purge the user account plus all issues they've filed; see the earlier post for details. This works quite well and gives every project member the ability to quickly purge spam. Alas, pesky spammers are using other approaches to trick Google into indexing their pork [1] (because at this point I think all this crap is just SEO spam anyway), such as commenting on issues and merge requests. We can't apply labels to comments, so we found a way to work around that: emojis! In GitLab you can add "reactions" to issue/merge request/snippet comments, and in recent GitLab versions you can register for a webhook to be notified when that happens. So what we've added to the instance is support for the :do_not_litter: (🚯) emoji [2] - if you set that on a comment, the author of said comment will be blocked and the comment content will be removed. After some safety checks, of course, so you can't just go around blocking everyone by shotgunning emojis into GitLab. Unlike the "Spam" label, this does not currently work recursively, so it's best to report the user so admins can purge them properly - ideally before setting the emoji, so the abuse report contains the actual spam comment instead of the redacted one. Also note that there is a 30 second grace period to quickly undo the emoji if you happen to set it accidentally. Note that for purging issues, the "Spam" label is still required; the emojis only work for comments. Happy cleanup! [1] or pork-ish [2] Benjamin wanted to use :poop: but there's a chance that may get used for expressing disagreement with the comment in question
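The bot's decision on such a webhook event might look roughly like this sketch. The function, payload shape, and field names here are assumptions for illustration only; GitLab's real emoji-event payload and the freedesktop.org bot's safety checks are more involved.

```python
def handle_emoji_event(event):
    """Decide what to do for an emoji webhook event (illustrative only;
    the payload shape is an assumption, not GitLab's documented schema).
    Returns a dict of actions, or None to ignore the event."""
    award = event["object_attributes"]
    if award.get("name") != "do_not_litter":
        return None  # only the 🚯 emoji triggers moderation
    if award.get("awardable_type") != "Note":
        return None  # only comments (notes), not issues themselves
    # The real bot runs safety checks first, then blocks the author
    # and redacts the comment; here we just report the intent.
    note_id = award["awardable_id"]
    return {"block_author_of_note": note_id, "redact_note": note_id}
```

A grace period and the recursive purge described above would sit on top of a check like this, not inside it.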
  • Hans de Goede: A fully open source stack for MIPI cameras (2024/01/26 16:48)
Many recent Intel laptops have replaced the standard UVC USB camera module with a raw MIPI camera sensor connected to the IPU6 found in recent Intel laptop chips. Both the hardware interface of the ISP part of the IPU6 as well as the image processing algorithms used are considered a trade secret, and so far the only Linux support for the IPU6 relies on an out-of-tree kernel driver with a proprietary userspace stack on top, which is currently available in rpmfusion. Both Linaro and Red Hat have identified the missing ISP support for various ARM and X86 chips as a problem. Linaro has started a project to add a SoftwareISP component to libcamera to allow these cameras to work without needing proprietary software, and Red Hat has joined Linaro in working on this.

FOSDEM talk

Bryan O'Donoghue (Linaro) and I are giving a talk about this at FOSDEM.

Fedora COPR repository

This work is at a point now where it is ready for wider testing. A Fedora COPR repository with a patched kernel and libcamera is now available for users to test; see the COPR page for install and test instructions. This has been tested on the following devices:

- Lenovo ThinkPad X1 Yoga gen 8 (should work on any ThinkPad with the ov2740 sensor)
- Dell Latitude 9420 (ov01a1s sensor)
- HP Spectre x360 13.5 (2023 model, hi556 sensor)

Description of the stack

- Kernel driver for the camera sensor: for the ov2740 used on current Lenovo designs (excluding MTL) I have landed all necessary kernel changes for this upstream.
- Kernel support for the CSI receiver part of the IPU6: Intel is working on upstreaming this and has recently posted v3 of their patch series upstream, and this is under active review.
- A FOSS software ISP stack inside libcamera to replace the missing IPU6 ISP (processing-system/psys) support. Work on this is under way. I've recently sent out v2 of the patch series for this.
- Firefox pipewire camera support and support for the camera portal to get permission to access the camera.
My colleague Jan Grulich has been working on this, see Jan's blogpost. Jan's work has landed in the just released Firefox 122.
  • Tomeu Vizoso: Etnaviv NPU update 15: We are upstream! (2024/01/24 10:52)
Today the initial merge request for Teflon was merged into Mesa, along with the first hardware driver, for VeriSilicon's Vivante NPU. For those who don't know, Teflon is a TensorFlow Lite delegate that aims to support several AI accelerators (also called NPUs, TPUs, APUs, NNAs, etc). Teflon is and will always be open-source, and is released under the MIT license. This will have the following advantages for the project:

- The userspace driver will be automatically packaged by distros such as Debian, Ubuntu, Fedora and Yocto when they update to the next stable version, 24.1.0, which should be out around May 2024. See the release calendar.
- Contribution to the project will happen within the development process of Mesa. This is a well-established process in which employees from companies such as Google, Valve, Imagination, Intel, Microsoft and AMD work together on their GPU drivers.
- The project has great technical infrastructure, maintained by awesome sysadmins: a well-maintained GitLab instance; extensive CI, for both build and runtime testing, on real hardware; mailing list, web server, etc.

More importantly, the Mesa codebase also has infrastructure that will be very useful to NPU drivers:

- The NIR intermediate representation with loads of lowering passes. This will be immediately useful for lowering operations in models to programmable cores, but in the future I want to explore representing whole models with this, for easier manipulation and lowerings.
- The Gallium internal API that decouples HW-specific frontends from HW-specific drivers. This will be critical as we add support for more NPUs, and also when we expose to other frameworks such as Android NNAPI.

And lastly, Mesa is part of a great yearly conference that allows contributors to discuss their work with others in a high-bandwidth environment: XDC.

The story so far

In 2022, while still at Collabora, I started adding OpenCL support to the Etnaviv driver in Mesa.
Etnaviv is a userspace and kernel driver for VeriSilicon's Vivante NPUs. The goal was to accelerate machine learning workloads, but once I left Collabora to focus on the project and had implemented enough of the OpenCL specification to run a popular object classification model, I realized that there was no way I was ever going to get close to the performance of the proprietary driver by using the programmable part of the NPU. I dug a bit deeper into how the proprietary driver was doing its thing and realized that almost all operations weren't running as shaders, but on "fixed-function" hardware units (systolic arrays, as I realized later). Fortunately, all these accelerators that support matrix multiplications as individual instructions are very similar in their fundamentals, and the state of the art has been well documented in scientific publications since Google released their first TPU. With all this wealth of information and with the help of VeriSilicon's own debugging output and open-source kernel driver, I had a very good start at reverse engineering the hardware. The rest was done by observing how the proprietary userspace driver interacted with the kernel, with the help of existing tools from the Etnaviv project and others that I wrote, and by staring for long hours at all the produced data in spreadsheets. During the summer, and with Libre Computer's sponsorship, I chipped away at documenting the interface to the convolution units and implementing support for them in my Mesa branch. By autumn I was able to run that same object classification model (MobileNet V1) 3 times faster than the CPU could.
A month later I learned to use the other systolic array in the NPU, for tensor manipulation operations, and got it running 6 times faster than the CPU and only twice as slow as the proprietary driver. Afterwards I got to work on object detection models, and by the start of 2024 I managed to run SSDLite MobileDet at 56 milliseconds per inference, which is around 3 times slower than what the proprietary driver achieves, but still pretty darn useful in many situations! The rest of the time until now has been spent polishing the driver, improving its test suite and reacting to code reviews from the Mesa community. Next steps: Now that the codebase is part of upstream Mesa, my work will progress in smaller batches, and I expect to be spending time reviewing other people's contributions and steering the project. People want to get this running on other variants of the VeriSilicon NPU IP, and I am certainly not going to be able to do it all! I also know of people wanting to put this together with other components in demos and solutions, so I will be supporting them so we can showcase the usefulness of all this. There are some other use cases that this hardware is well-suited for, such as more advanced image classification, pose estimation, audio classification, depth estimation, and image segmentation. I will be looking at what the most useful models require in terms of operations and implementing them. There is quite some low-hanging fruit for improving performance, so I expect to be implementing support for zero-compression, more advanced tiling, better use of the SRAM in the device, and a few others. And at some point I should start looking at other NPU IP to add support to.
The ones I'm currently leaning the most towards are RockChip's own IP, Mediatek's, Cadence's and Amlogic's. Thanks: One doesn't just start writing an NPU driver on one's own, even less so without any documentation, so I need to thank the following people who have helped me greatly in this effort: Collabora, for allowing me to start playing with this while I still worked with them. Libre Computer, and specifically Da Xue, for supporting me financially for most of 2023. They are a very small company, so I really appreciate that they believed in the project and put aside some money so I could focus on it. Igalia, for letting Christian Gmeiner spend time reviewing all my code and answering my questions about Etnaviv. Embedded Recipes, for giving me the opportunity to present my work last autumn in Paris. Lucas Stach from Pengutronix, for answering my questions and listening to my problems when I suspected something was up in the Etnaviv kernel driver. Neil Armstrong from Linaro, for supporting me in the hardware enablement of the NPU driver on the Amlogic SoCs. And a collective thanks to the DRI/Mesa community for being so awesome!
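As a quick sanity check on the performance figures quoted in this post, a little arithmetic (plain Python, using only the numbers stated above; this is an illustrative back-of-envelope calculation, not benchmark data) recovers the implied proprietary-driver inference time and the open driver's throughput:

```python
# Back-of-envelope arithmetic on figures quoted in the post (illustrative only).
npu_ms = 56.0                  # SSDLite MobileDet on the open driver, per inference
slowdown_vs_proprietary = 3.0  # "around 3 times slower than the proprietary driver"

proprietary_ms = npu_ms / slowdown_vs_proprietary  # implied proprietary time
open_driver_ips = 1000.0 / npu_ms                  # inferences per second

print(f"implied proprietary driver time: ~{proprietary_ms:.1f} ms/inference")
print(f"open driver throughput: ~{open_driver_ips:.1f} inferences/s")
```

At roughly 18 inferences per second, the open driver is already usable for many live object detection setups, which is consistent with the post's "still pretty darn useful" assessment.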
  • Samuel Iglesias: XDC 2023: Behind the curtains (2024/01/22 09:06)
    Time flies! Back in October, Igalia organized X.Org Developers Conference 2023 in A Coruña, Spain. In case you don’t know it, the X.Org Developers Conference, despite the X.Org in the name, is a conference for all developers working on the open-source graphics stack: anything related to DRM/KMS, Mesa, X11 and Wayland compositors, etc. This year, I participated in the organization of XDC in A Coruña, Spain (again!) by taking care of different aspects: from logistics at the venue (Palexco) to running it in person. It was a very tiring but fulfilling experience. Sponsors: First of all, I would like to thank all the sponsors for their support, as without them, this conference wouldn’t happen. They didn’t only give economic support to the conference: Igalia sponsored the welcome event and lunches; X.Org Foundation sponsored coffee breaks; the Tourism Office of A Coruña sponsored the guided tour of the city center; and Raspberry Pi sent Raspberry Pi 5 boards to all speakers! XDC 2023 stats: XDC 2023 was a success in attendance and talk submissions. Here you have some stats: 📈 160 registered attendees. 👬 120 attendees picked up their badge in person. 💻 25 attendees registered as virtual. 📺 More than 6,000 views on the live stream. 📝 55 talks/workshops/demos distributed over three days of conference. 🧗‍♀️ There were 3 social events: the welcome event, the city center guided tour, and one unofficial climbing activity! Was XDC 2023 perfect organization-wise? Of course… no! Like in any event, we had some issues here and there: one with the Wi-Fi network that was quickly detected and fixed; some issues with the meals and coffee breaks (mainly food allergies); a few seconds of lost audio in the live stream of one talk; and other minor things. Not bad for a community-run event! Nevertheless, I would like to thank all the staff at Palexco for their quick response and their understanding. Talk recordings & slides: Want to see some talks again?
All conference recordings were uploaded to the X.Org Foundation YouTube channel. Slides are available for download in each talk description. Enjoy! XDC 2024: We cannot yet say where XDC 2024 is going to happen, other than that it will be in North America… but I can tell you that this will be announced soon. Stay tuned! Want to organize XDC 2025 or XDC 2026? If we continue with the current cadence, 2025 would again be in Europe, and the 2026 event would be in North America. There is a list of requirements here. Nevertheless, feel free to contact me or the X.Org Board of Directors to get first-hand experience and knowledge about what organizing XDC entails. Thanks: Thanks to all volunteers, collaborators, Palexco staff, GPUL, X.Org Foundation and many other people for their hard work. Special thanks to my Igalia colleague Chema, who did an outstanding job organizing the event together with me. Thanks to the sponsors for their extraordinary support of this conference. Thanks to Igalia not only for sponsoring the event, but also for all the support I got during the past year. I am glad to be part of this company, and I am always surprised by how great my colleagues are. And last, but not least, thanks to all speakers and attendees. Without you, the conference wouldn’t exist. See you at XDC 2024!
  • Simon Ser: Status update, January 2024 (2024/01/20 22:00)
    Hi! This month has been pretty hectic due to the SourceHut network outage. All of us on the staff team invested a lot of time and energy to minimize the downtime as much as possible. Thankfully things have settled down now; there are still a lot of follow-up tasks to complete, but with less urgency. I’m really grateful for the community’s reaction, everybody has been very understanding and supportive. Thank you! In other SourceHut news, I’ve been working on yojo, a bridge which provides CI for Codeberg projects via SourceHut’s build service. I’ve added support for pull requests, taught yojo to handle multiple manifests, added logic to automatically refresh access tokens before they expire, and fixed a bunch of bugs. The NPotM is a docker-compose configuration for SourceHut. It provides an easy way to spin up a SourceHut development environment without having to set up each service and its dependencies individually. I hope this project can reduce friction for new SourceHut contributors. There are many services missing, patches welcome! This month, we’ve finally merged the Sway pull request to use the wlroots scene-graph API! This is exciting because it fixes a whole class of bugs, it removes a lot of hand-rolled logic in Sway (e.g. rendering, damage tracking, input event routing, direct scan-out, some of the protocol support…), it provides nice performance optimizations via culling (e.g. the background image is no longer painted if a web browser is covering it), and it unlocks upcoming performance optimizations (e.g. KMS plane offloading). Many thanks to Alexander for writing the patches and maintaining them for over a year, and to Kirill for pushing it over the finish line! On the wlroots side, my work on wlr_surface_synced has been merged, allowing us to latch surface commits until an arbitrary condition is met.
This work is necessary for the upcoming explicit synchronization protocol, as well as the work-in-progress transactions protocol, and for avoiding compositor freezes when a client is very slow to render. We’ve released wlroots 0.17.1, with a collection of bugfixes backported by Simon Zeni. Last, we’ve dropped support for the legacy wl_drm protocol by default, and this caused a bit of breakage here and there (xserver, libva, amdvlk). We do really want to phase out wl_drm though, so we’ve decided to stick with that removal. This month’s collection of miscellaneous project updates includes go-imap v2 alpha 8 with separate types for sequence numbers and UIDs, which was a lot of work to get right but I think was worth it. I’ve also released go-maildir v0.4.0 with a new Walk function (to iterate over messages without allocating a list) and numerous fixes. I’ve sent a GitLab CLI patch to fix invalid release asset links for third-party GitLab instances, and a Meson patch to add C23 support. See you next month!
  • Matthias Klumpp: Wayland really breaks things… Just for now? (2024/01/11 16:24)
    This post is in part a response to an aspect of Nate’s post “Does Wayland really break everything?“, but also my reflection on discussing Wayland protocol additions, a unique pleasure that I have been involved with for the past months[1]. Some facts: Before I start I want to make a few things clear: The Linux desktop will be moving to Wayland[2] – this is a fact at this point (and has been for a while); sticking to X11 makes no sense for future projects. From reading Wayland protocols and working with them at a much lower level than I ever wanted to, it is also very clear to me that Wayland is an exceptionally well-designed core protocol, and so are the additional extension protocols (xdg-shell & Co.). The modularity of Wayland is great; it gives it incredible flexibility and will for sure turn out to be good for the long-term viability of this project (and also provides a path to correct protocol issues in the future, if any are found). In other words: Wayland is an amazing foundation to build on, and a lot of its design decisions make a lot of sense! The shift towards people seeing “Linux” more as an application developer platform, and taking PipeWire and XDG Portals into account when designing for Wayland, is also an amazing development and I love to see this – this holistic approach is something I always wanted! Furthermore, I think Wayland removes a lot of functionality that shouldn’t exist in a modern compositor – and that’s a good thing too! Some of X11’s features and design decisions had clear drawbacks that we shouldn’t replicate. I highly recommend reading Nate’s blog post; it’s very good and goes into more detail. And due to all of this, I firmly believe that any advancement in the Wayland space must come from within the project. But! Of course there was a “but” coming – I think that while developing Wayland-as-an-ecosystem we are now entrenched in narrow concepts of how a desktop should work.
While discussing Wayland protocol additions, a lot of concepts clash; people from different desktops with different design philosophies debate the merits of those over and over again, never reaching any conclusion (just as you will never get an answer out of humans on whether sushi or pizza is the clearly superior food, or whether CSD or SSD is better). Some people want to use Wayland as a vehicle to force applications to submit to their desktop’s design philosophies, others prefer the smallest and leanest protocol possible, and other developers want the most elegant behavior possible. To be clear, I think those are all very valid approaches. But this also creates problems: By switching to Wayland compositors, we are already forcing a lot of porting work onto toolkit developers and application developers. This is annoying, but just work that has to be done. It becomes frustrating though if Wayland provides toolkits with absolutely no way to reach their goal in any reasonable way. For Nate’s Photoshop analogy: Of course Linux does not break Photoshop; it is Adobe’s responsibility to port it. But what if Linux was missing a crucial syscall that Photoshop needed for proper functionality, and Adobe couldn’t port it without that? In that case it becomes much less clear who is to blame for Photoshop not being available. A lot of Wayland protocol work is focused on the environment and design, while applications and the work to port them are often considered less. I think this happens because the overlap between application developers and developers of the desktop environments is not necessarily large, and the overlap with people willing to engage with Wayland upstream is even smaller. The combination of Windows developers porting apps to Linux and having involvement with toolkits or Wayland is pretty much nonexistent. So they have less of a voice.
A quick detour through the neuroscience research lab: I have been involved with Freedesktop, GNOME and KDE for an incredibly long time now (more than a decade), but my actual job (besides consulting for Purism) is that of a PhD candidate in a neuroscience research lab (working on the morphology of biological neurons and its relation to behavior). I am mostly involved with three research groups in our institute, which is about 35 people. Most of us do all our data analysis on powerful servers which we connect to using RDP (with KDE Plasma as desktop). Since I joined, I have been pushing the envelope a bit to extend Linux usage to data acquisition and regular clients, and to have our data acquisition hardware interface well with it. Linux brings some unique advantages for use in research, besides the obvious one of having every step of your data management platform introspectable with no black boxes left, a goal I value very highly in research (but this would be its own blogpost). In terms of operating system usage though, most systems are still Windows-based. Windows is what companies develop for, and what people use by default and are familiar with. The choice of operating system is very strongly driven by application availability, and WSL being really good makes this somewhat worse, as it removes the need for people to switch to a real Linux system entirely if there is the occasional software requiring it. Yet, we have a lot more Linux users than before, and use it in many places where it makes sense. I also developed novel data acquisition software that runs only on Linux and uses the abilities of the platform to their fullest extent. All of this resulted in me asking existing software and hardware vendors for Linux support a lot more often. The vendor-customer relationship in science is usually pretty good, and vendors do usually want to help out.
Same for open source projects, especially if you offer to do Linux porting work for them… But overall, the ease of use and availability of required applications and their usability rules supreme. Most people are not technically knowledgeable and just want to get their research done in the best way possible, getting the best results with the least amount of friction. KDE/Linux usage at a control station for a particle accelerator at Adlershof Technology Park, Germany, for reference (by 25 years of KDE)[3] Back to the point: The point of that story is this: GNOME, KDE, RHEL, Debian or Ubuntu: They all do not matter if the necessary applications are not available for them. And as soon as they are, the easiest-to-use solution wins. There are many facets of “easiest”: In many cases this is RHEL due to Red Hat support contracts being available, in many other cases it is Ubuntu due to its mindshare and ease of use. KDE Plasma is also frequently seen, as it is perceived as a bit easier to onboard Windows users with (among other benefits). Ultimately, it comes down to applications and 3rd-party support though. Here’s a dirty secret: In many cases, porting an application to Linux is not that difficult. The thing that companies (and FLOSS projects too!) struggle with and will calculate the merits of carefully in advance is whether it is worth the support cost as well as continuous QA/testing. Their staff will have to do all of that work, and they could spend that time on other tasks after all. So if they learn that “porting to Linux” not only means added testing and support, but also means having to choose between the legacy X11 display server that allows for 1:1 porting from Windows or the “new” Wayland compositors that do not support the same features they need, they will quickly consider it not worth the effort at all. I have seen this happen. Of course many apps use a cross-platform toolkit like Qt, which greatly simplifies porting.
But this just moves the issue one layer down, as now the toolkit needs to abstract Windows, macOS and Wayland. And Wayland does not contain features to do certain things, or does them very differently from e.g. Windows, so toolkits have no way to actually implement the existing functionality in a way that works on all platforms. So in Qt’s documentation you will often find texts like “works everywhere except for on Wayland compositors or mobile”[4]. Many missing bits or altered behaviors are just papercuts, but those add up. And if users have a worse experience, this will translate to more support work, or people not wanting to use the software on the respective platform. What’s missing? Window positioning: SDI applications with multiple windows are very popular in the scientific world. For data acquisition (for example with microscopes) we often have one monitor with control elements and one larger one with the recorded image. There are also other configurations where multiple signal modalities are acquired, and the experimenter aligns windows exactly the way they want and expects the layout to be stored and loaded upon reopening the application. Even in the image from Adlershof Technology Park above you can see this style of UI design, at mega-scale. Being able to pop out elements as windows from a single-window application to move them around freely is another frequently used paradigm, and immensely useful with these complex apps. It is important to note that this is not a legacy design, but in many cases an intentional choice – these kinds of apps work incredibly well on larger screens or many screens and are very flexible (you can have any window configuration you want, and switch between them using the (usually) great window management abilities of your desktop). Of course, these apps will work terribly on tablets and small form factors, but that is not the purpose they were designed for and nobody would use them that way.
I assumed for sure these features would be implemented at some point, but when it became clear that that would not happen, I created the ext-placement protocol, which had some good discussion but was ultimately rejected from the xdg namespace. I then tried another solution based on feedback, which turned out not to work for most apps, and have now proposed xdg-placement (v2) in an attempt to maybe still get some protocol done that we can agree on, exploring more options before pushing the existing protocol for inclusion into the ext Wayland protocol namespace. Meanwhile though, we can not port any application that needs this feature, while at the same time we are switching desktops and distributions to Wayland by default. Window position restoration: Similarly, a protocol to save & restore window positions was already proposed in 2018, 6 years ago now, but it has still not been agreed upon, and may not even help multiwindow apps in its current form. The absence of this protocol means that applications can not restore their former window positions, and the user has to move them to their previous place again and again. Meanwhile, toolkits can not adopt these protocols, and applications can not use them and can not be ported to Wayland without introducing papercuts. Window icons: Similarly, individual windows can not set their own icons, and not-installed applications can not have an icon at all because there is no desktop-entry file to load the icon from and no icon in the theme for them. You would think this is a niche issue, but for applications that create many windows, providing icons for them so the user can find them is fairly important. Of course it’s not the end of the world if every window has the same icon, but it’s one of those papercuts that make the software slightly less user-friendly. Even applications with fewer windows like LibrePCB are affected, so much so that they’d rather run their app through Xwayland for now.
I decided to address this after I was working on data analysis of image data in a Python virtualenv, where my code and the Python libraries used created lots of windows all with the default yellow “W” icon, making it impossible to distinguish them at a glance. This is xdg-toplevel-icon now, but of course it is an uphill battle where the very premise of needing this is questioned. So applications can not use it yet. Limited window abilities requiring specialized protocols: Firefox has a picture-in-picture feature, allowing it to pop out media from a media player as a separate floating window so the user can watch the media while doing other things. On X11 this is easily realized, but on Wayland the restrictions posed on windows necessitate a different solution. The xdg-pip protocol was proposed for this specialized use case, but it is also not merged yet. So this feature does not work as well on Wayland. Automated GUI testing / accessibility / automation: Automation of GUI tasks is a powerful feature, and so is the ability to auto-test GUIs. This is being worked on, with libei and wlheadless-run (and stuff like ydotool exists too), but we’re not fully there yet. Wayland is frustrating for (some) application authors: As you can see, there are valid applications and valid use cases that can not yet be ported to Wayland with the same feature range they enjoyed on X11, Windows or macOS. So, from an application author’s perspective, Wayland does break things quite significantly, because things that worked before can no longer work and Wayland (the whole stack) does not provide any avenue to achieve the same result. Wayland does “break” screen sharing, global hotkeys, gaming latency (via “no tearing”), etc.; however, for all of these there are solutions available that application authors can port to. And most developers will gladly do that work, especially since the newer APIs are usually a lot better and more robust.
But if you give application authors no path forward except “use Xwayland and be on emulation as a second-class citizen forever”, it just results in very frustrated application developers. For some application developers, switching to a Wayland compositor is like buying a canvas from the Linux shop that forces your brush to only draw triangles. But maybe for your avant-garde art, you need to draw a circle. You can approximate one with triangles, but it will never be as good as the artwork of your friends who got their canvases from the Windows or macOS art supply shop and have more freedom to create their art. Triangles are proven to be the best shape! If you are drawing circles you are creating bad art! Wayland, via its protocol limitations, forces a certain way to build application UX – often for the better, but also sometimes to the detriment of users and applications. The protocols are often fairly opinionated, a result of the lessons learned from X11. In any case though, it is the odd one out – Windows and macOS do not pose the same limitations (for better or worse!), and the effort to port to Wayland is orders of magnitude bigger, or sometimes, in the case of the multiwindow UI paradigm, impossible to achieve to the same level of polish. Desktop environments of course have a design philosophy that they want to push, and want applications to integrate as much as possible (same as macOS and Windows!). However, there are many applications out there, and pushing a design via protocol limitations will likely just result in fewer apps. The porting dilemma: I spent probably way too much time looking into how to get applications cross-platform and running on Linux, often talking to vendors (FLOSS and proprietary) as well. Wayland limitations aren’t the biggest issue by far, but they do start to come up now, especially in the scientific space with Ubuntu having switched to Wayland by default. For application authors there is often no way to address these issues.
Many scientists do not even understand why their Python script that creates some GUIs suddenly behaves weirdly because Qt is now using the Wayland backend on Ubuntu instead of X11. They do not know the difference and also do not want to deal with these details – even though they may be programmers as well, the real goal is not to fiddle with the display server, but to get to a scientific result somehow. Another issue is portability layers like Wine which need to run Windows applications as-is on Wayland. Apparently Wine’s Wayland driver has some heuristics to make window positioning work (and I am amazed by the work done on this!), but that can only go so far. A way out? So, how would we actually solve this? Fundamentally, this excessively long blog post boils down to just one essential question: Do we want to force applications to submit to a UX paradigm unconditionally, potentially losing out on application ports or keeping apps on X11 eternally, or do we want to throw them some rope to get as many applications ported over to Wayland, even though we might sacrifice some protocol purity? I think we really have to answer that to make the discussions on wayland-protocols a lot less grueling. This question can be answered at the wayland-protocols level, but even more so it must be answered by the individual desktops and compositors. If the answer for your environment turns out to be “Yes, we want the Wayland protocol to be more opinionated and will not make any compromises for application portability”, then your desktop/compositor should just immediately NACK protocols that add something like this and you simply shouldn’t engage in the discussion, as you reject the very premise of the new protocol: that it has any merit to exist and is needed in the first place. In this case contributors to Wayland and application authors also know where you stand, and a lot of debate is skipped.
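As an aside on the Qt-backend confusion mentioned earlier: there is at least a documented stopgap available today. Qt honors the QT_QPA_PLATFORM environment variable, so a script can force the xcb (X11) backend and keep running under Xwayland until the missing protocols land. A minimal sketch in Python (the commented-out GUI import is a hypothetical placeholder):

```python
import os

# Must be set before Qt is initialized; "xcb" selects Qt's X11 backend,
# which runs under Xwayland when the session is otherwise Wayland.
os.environ["QT_QPA_PLATFORM"] = "xcb"

# from PyQt5.QtWidgets import QApplication  # hypothetical GUI code would follow
print(os.environ["QT_QPA_PLATFORM"])
```

This is a workaround, not a fix: the application gets its old behavior back, but only by opting out of native Wayland entirely.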
Of course, if application authors want to support your environment, you are basically asking them now to rewrite their UI, which they may or may not do. But at least they know what to expect and how to target your environment. If the answer turns out to be “We do want some portability”, the next question obviously becomes where the line should be drawn and which changes are acceptable and which aren’t. We can’t blindly copy all X11 behavior; some porting work to Wayland is simply inevitable. Some written rules for that might be nice, but probably more importantly, if you agree fundamentally that there is an issue to be fixed, please engage in the discussions for the respective MRs! We for sure do not want to repeat X11 mistakes, and I am certain that we can implement protocols which provide the required functionality in a way that is a nice compromise, allowing applications a path forward into the Wayland future while also being as good as possible and improving upon X11. For example, the toplevel-icon proposal is already a lot better than anything X11 ever had. Relaxing ACK requirements for the ext namespace is also a good proposed administrative change, as it allows some compositors to add features they want to support to the shared repository more easily, while also not mandating them for others. In my opinion, it would allow for a lot less friction between the two different ideas of how Wayland protocol development should work. Some compositors could move forward and support more protocol extensions, while more restrictive compositors could support fewer things. Applications can detect supported protocols at launch and change their behavior accordingly (ideally even abstracted by toolkits). You may now say that a lot of apps are ported, so surely this issue can not be that bad. And yes, what Wayland provides today may be enough for 80-90% of all apps.
But what I hope the detour into the research lab has done is convince you that this smaller percentage of apps matters. A lot. And that it may be worthwhile to support them. To end on a positive note: When it came to porting concrete apps over to Wayland, the only real showstoppers so far[5] were the missing window-positioning and window-position-restore features. I encountered them when porting my own software, and I got the issue as feedback from colleagues and fellow engineers. In second place was UI testing and automation support; the window-icon issue was mentioned twice, but being a cosmetic issue it likely simply hurts people less and they can ignore it more easily. What this means is that the majority of apps are already fine, and many others are very, very close! A Wayland future for everyone is within our grasp! I will also bring my two protocol MRs to their conclusion for sure, because as application developers we need clarity on what the platform (either all desktops or even just a few) supports and will or will not support in future. And the only way to get something good done is by contribution and friendly discussion. Footnotes: [1] Apologies for the clickbait-y title – it comes with the subject. [2] When I talk about “Wayland” I mean the combined set of display server protocols and accepted protocol extensions, unless otherwise clarified. [3] I would have picked a picture from our lab, but that would have needed permission first. [4] Qt has awesome “platform issues” pages, like for macOS and Linux/X11, which help with porting efforts, but Qt doesn’t even list Linux/Wayland as a supported platform. There is some information though, like window geometry peculiarities, which aren’t particularly helpful when porting (but still essential to know). [5] Besides issues with Nvidia hardware – CUDA for simulations and machine learning is pretty much everywhere, so Nvidia cards are common, which still causes trouble on Wayland. It is improving though.
  • Maira Canal: Introducing CPU jobs to the Raspberry Pi (2024/01/11 13:30)
    Igalia is always working hard to improve the 3D rendering drivers of the Broadcom VideoCore GPU, found in Raspberry Pi devices. One of our most recent efforts in this sense was moving the implementation of CPU jobs from the Vulkan driver to the V3D kernel driver. What are CPU jobs and why do we need them? In the V3DV driver, there are some Vulkan commands that cannot be performed by the GPU alone, so we implement those as CPU jobs in Mesa. A CPU job is a job that requires CPU intervention to be performed. For example, in the Broadcom VideoCore GPUs, we don’t have a way to calculate a timestamp. But we need the timestamp for Vulkan timestamp queries. Therefore, we need to calculate the timestamp on the CPU. A CPU job in userspace also implies CPU stalling. Sometimes, we need to hold part of the command submission flow in order to correctly synchronize execution. This waiting period causes the CPU to stall, preventing the continuous submission of jobs to the GPU. To mitigate this issue, we decided to move the CPU job mechanisms from the V3DV driver to the V3D kernel driver. In the V3D kernel driver, we have different kinds of jobs: RENDER jobs, BIN jobs, CSD jobs, TFU jobs, and CLEAN CACHE jobs. For each of those jobs, we have a DRM scheduler instance that helps us synchronize the jobs. If you want to know more about the different kinds of V3D jobs, check out this November Update: Exploring V3D blogpost, where I explain more about all the V3D IOCTLs and jobs. Jobs of the same kind are submitted, dispatched, and processed in the order they were submitted, using a standard first-in-first-out (FIFO) queue. We can synchronize different jobs across different queues using DRM syncobjs. More about the V3D synchronization framework and user extensions can be learned in this two-part blog post from Melissa Wen. From the kernel documentation, DRM syncobjs (synchronisation objects) are containers for stuff that helps sync up GPU commands. 
They’re super handy because you can use them in your own programs, share them with other programs, and even use them across different DRM drivers. Mostly, they’re used for making Vulkan fences and semaphores work. By moving the CPU job from userspace to the kernel, we can make use of the DRM scheduler queues and all the advantages they bring. For this, we created a new type of job in the V3D kernel driver, a CPU job, which also means creating a new DRM scheduler instance and a CPU job queue. Now, instead of stalling the submission thread waiting for the GPU to go idle, we can use DRM syncobjs to synchronize both CPU and GPU jobs in a submission, providing more efficient usage of the GPU. How did we implement the CPU jobs in the kernel driver? After we decided to have a CPU job implementation in kernel space, we could think about two possible implementations for this job: creating an IOCTL for each type of CPU job, or using a user extension to provide polymorphic behavior to a single CPU job IOCTL. We have different types of CPU jobs (indirect CSD jobs, timestamp query jobs, copy query results jobs…) and each of them shares a common infrastructure of allocation and synchronization but performs different operations. Therefore, we decided to go with the option to use user extensions. In Melissa’s blogpost, she digs deep into the implementation of generic IOCTL extensions in the V3D kernel driver. But, to put it simply, instead of expanding the data struct for each IOCTL every time we need to add a new feature, we define a user extension chain. As we add new optional interfaces to control the IOCTL, we define a new extension struct that can be linked to the IOCTL data only when required by the user. Therefore, we created a new IOCTL, drm_v3d_submit_cpu, which is used to submit any type of CPU job. 
This single IOCTL can be extended by a user extension, which allows us to reuse the common infrastructure - avoiding code repetition - and yet use the user extension ID to identify the type of job and, depending on the type of job, perform a certain operation.

    struct drm_v3d_submit_cpu {
            /* Pointer to a u32 array of the BOs that are referenced by the job.
             *
             * For DRM_V3D_EXT_ID_CPU_INDIRECT_CSD, it must contain only one BO,
             * that contains the workgroup counts.
             *
             * For DRM_V3D_EXT_ID_TIMESTAMP_QUERY, it must contain only one BO,
             * that will contain the timestamp.
             *
             * For DRM_V3D_EXT_ID_CPU_RESET_TIMESTAMP_QUERY, it must contain only
             * one BO, that contains the timestamp.
             *
             * For DRM_V3D_EXT_ID_CPU_COPY_TIMESTAMP_QUERY, it must contain two
             * BOs. The first is the BO where the timestamp queries will be written
             * to. The second is the BO that contains the timestamp.
             *
             * For DRM_V3D_EXT_ID_CPU_RESET_PERFORMANCE_QUERY, it must contain no
             * BOs.
             *
             * For DRM_V3D_EXT_ID_CPU_COPY_PERFORMANCE_QUERY, it must contain one
             * BO, where the performance queries will be written.
             */
            __u64 bo_handles;

            /* Number of BO handles passed in (size is that times 4). */
            __u32 bo_handle_count;

            __u32 flags;

            /* Pointer to an array of ioctl extensions */
            __u64 extensions;
    };

    Now, we can create a CPU job and submit it with a CPU job user extension. And which extensions are available?

    - DRM_V3D_EXT_ID_CPU_INDIRECT_CSD: this CPU job allows us to submit an indirect CSD job. An indirect CSD job is a job that, when executed in the queue, will map an indirect buffer, read the dispatch parameters, and submit a regular dispatch. This CPU job is used in Vulkan calls like vkCmdDispatchIndirect().
    - DRM_V3D_EXT_ID_CPU_TIMESTAMP_QUERY: this CPU job calculates the query timestamp and updates the query availability by signaling a syncobj. This CPU job is used in Vulkan calls like vkCmdWriteTimestamp().
    - DRM_V3D_EXT_ID_CPU_RESET_TIMESTAMP_QUERY: this CPU job resets the timestamp queries based on the value offset of the first query. This CPU job is used in Vulkan calls like vkCmdResetQueryPool() for timestamp queries.
    - DRM_V3D_EXT_ID_CPU_COPY_TIMESTAMP_QUERY: this CPU job copies the complete or partial result of a query to a buffer. This CPU job is used in Vulkan calls like vkCmdCopyQueryPoolResults() for timestamp queries.
    - DRM_V3D_EXT_ID_CPU_RESET_PERFORMANCE_QUERY: this CPU job resets the performance queries by resetting the values of the perfmons. This CPU job is used in Vulkan calls like vkCmdResetQueryPool() for performance queries.
    - DRM_V3D_EXT_ID_CPU_COPY_PERFORMANCE_QUERY: similar to DRM_V3D_EXT_ID_CPU_COPY_TIMESTAMP_QUERY, this CPU job copies the complete or partial result of a query to a buffer. This CPU job is used in Vulkan calls like vkCmdCopyQueryPoolResults() for performance queries.

    The CPU job IOCTL structure is similar to any other V3D job: we allocate the job struct, parse all the extensions, init the job, look up the BOs and lock their reservations, add the proper dependencies, and push the job to the DRM scheduler entity. 
When running a CPU job, we execute the following code:

    static const v3d_cpu_job_fn cpu_job_function[] = {
            [V3D_CPU_JOB_TYPE_INDIRECT_CSD] = v3d_rewrite_csd_job_wg_counts_from_indirect,
            [V3D_CPU_JOB_TYPE_TIMESTAMP_QUERY] = v3d_timestamp_query,
            [V3D_CPU_JOB_TYPE_RESET_TIMESTAMP_QUERY] = v3d_reset_timestamp_queries,
            [V3D_CPU_JOB_TYPE_COPY_TIMESTAMP_QUERY] = v3d_copy_query_results,
            [V3D_CPU_JOB_TYPE_RESET_PERFORMANCE_QUERY] = v3d_reset_performance_queries,
            [V3D_CPU_JOB_TYPE_COPY_PERFORMANCE_QUERY] = v3d_copy_performance_query,
    };

    static struct dma_fence *
    v3d_cpu_job_run(struct drm_sched_job *sched_job)
    {
            struct v3d_cpu_job *job = to_cpu_job(sched_job);
            struct v3d_dev *v3d = job->base.v3d;

            v3d->cpu_job = job;

            if (job->job_type >= ARRAY_SIZE(cpu_job_function)) {
                    DRM_DEBUG_DRIVER("Unknown CPU job: %d\n", job->job_type);
                    return NULL;
            }

            trace_v3d_cpu_job_begin(&v3d->drm, job->job_type);

            cpu_job_function[job->job_type](job);

            trace_v3d_cpu_job_end(&v3d->drm, job->job_type);

            return NULL;
    }

    The interesting thing is that each CPU job type executes a completely different operation. The complete kernel implementation has already landed in drm-misc-next and can be seen right here. What did we change in Mesa-V3DV to use the new kernel-V3D CPU job? After landing the kernel implementation, I needed to accommodate the new CPU job approach in userspace. A fundamental rule is not to cause regressions, i.e., to keep backwards userspace compatibility with old versions of the Linux kernel. This means we cannot break new versions of Mesa running on old kernels. Therefore, we needed to create two paths: one preserving the old way to perform CPU jobs and the other using the kernel to perform CPU jobs. So, for example, the indirect CSD job used to add two different jobs to the queue: a CPU job and a CSD job. Now, if we have the CPU job capability in the kernel, we only add a CPU job, and the CSD job is dispatched from within the kernel. 
    -   list_addtail(&csd_job->list_link, &cmd_buffer->jobs);
    +
    +   /* If we have a CPU queue we submit the CPU job directly to the
    +    * queue and the CSD job will be dispatched from within the kernel
    +    * queue, otherwise we will have to dispatch the CSD job manually
    +    * right after the CPU job by adding it to the list of jobs in the
    +    * command buffer.
    +    */
    +   if (!cmd_buffer->device->pdevice->caps.cpu_queue)
    +      list_addtail(&csd_job->list_link, &cmd_buffer->jobs);

    Furthermore, now we can use syncobjs to sync the CPU jobs. For example, in the timestamp query CPU job, we used to stall the submission thread and wait for completion of all work queued before the timestamp query. Now, we can just add a barrier to the CPU job and it will be properly synchronized by the syncobjs without stalling the submission thread.

    /* The CPU job should be serialized so it only executes after all previously
     * submitted work has completed
     */
    job->serialize = V3DV_BARRIER_ALL;

    We were able to test the implementation using multiple CTS tests, such as dEQP-VK.compute.pipeline.indirect_dispatch.*, dEQP-VK.pipeline.monolithic.timestamp.*, dEQP-VK.synchronization.*, dEQP-VK.query_pool.* and dEQP-VK.multiview.*. The userspace implementation has already landed in Mesa and the full implementation can be checked in this MR. More about the ongoing challenges in the Raspberry Pi driver stack can be seen in this XDC 2023 talk presented by Iago Toral, Juan Suárez and myself. During this talk, Iago mentioned the CPU job work that we have been doing. Also, I cannot finish this post without thanking Melissa Wen and Iago Toral for all the help while developing the CPU jobs for the V3D kernel driver.
  • Tomeu Vizoso: Etnaviv NPU update 14: Object detection with decent performance (2024/01/10 11:14)
    When almost two months ago I got MobileNetV1 running with useful performance on my driver for the Vivante NPU, I took that milestone as a partial validation of my approach. Partial because MobileNetV1 is a quite old model by now, and since then several iterations have passed with better accuracy and better performance. Would I be able to, without any documentation, add enough support to run newer models with useful performance? Since then, I have been spending some time looking at the state of the art for object detection models, getting a sense of the gap between the features supported by my driver and the operations that the newer models use. SSDLite MobileDet is already 3 years old but can still be considered state-of-the-art on most hardware, with good accuracy while having a low latency. The graph structure was more complex than that of MobileNet, and it used tensor addition operations which I didn't support at the moment. There are other operations that I didn't support, but those were at the end and could be performed on the CPU without much penalty. So after implementing additions along with a few medium-sized refactorings, I got the model running correctly. Performance wasn't that bad at that moment: at 129ms it was twice as fast as the CPU and "only" 5 times slower than the proprietary driver. I knew that I was using extremely conservative values for the size of the output tiles, so I wrote some scripts to run hundreds of different convolution configurations and tabulate the parameters that the proprietary driver used to program the hardware. After a lot of time spent staring at a spreadsheet, I came up with a reasonable guess at the conditions that limit the size of the tiles. By using the biggest tile size that is still safe, I got much better performance: 56.149ms, so almost 18 inferences can be performed per second. If we look at a practical use case such as that supported by Frigate NVR, a typical frame rate for the video inputs is 5 FPS. 
With our current performance level, we could run 3-4 inferences on each frame if there may be several objects being tracked at the same time, or serve 3-4 cameras simultaneously if not. Given the price level of the single-board computers that contain the VIPNano, this is quite a good bang for your buck. And all open source and heading to mainline!

    Next steps

    I have started cleaning up the latest changes so they can be reviewed upstream. And I need to make sure that the in-flight patches to the kernel are merged now that the window for 6.8 has opened.
  • Lennart Poettering: A re-introduction to mkosi -- A Tool for Generating OS Images (2024/01/09 23:00)
    This is a guest post written by Daan De Meyer, systemd and mkosi maintainer.

    Almost 7 years ago, Lennart first wrote about mkosi on this blog. Some years ago, I took over development and there's been a huge amount of changes and improvements since then. So I figure this is a good time to re-introduce mkosi. mkosi stands for Make Operating System Image. It generates OS images that can be used for a variety of purposes. If you prefer watching a video over reading a blog post, you can also watch my presentation on mkosi at All Systems Go 2023.

    What is mkosi?

    mkosi was originally written as a tool to simplify hacking on systemd and for experimenting with images using many of the new concepts being introduced in systemd at the time. In the meantime, it has evolved into a general purpose image builder that can be used in a multitude of scenarios. Instructions to install mkosi can be found in its readme. We recommend running the latest version to take advantage of all the latest features and bug fixes. You'll also need bubblewrap and the package manager of your favorite distribution to get started. At its core, the workflow of mkosi can be divided into 3 steps:

    1. Generate an OS tree for some distribution by installing a set of packages.
    2. Package up that OS tree in a variety of output formats.
    3. (Optionally) Boot the resulting image in qemu or systemd-nspawn.
    Images can be built for any of the following distributions:

    - Fedora Linux
    - Ubuntu
    - OpenSUSE
    - Debian
    - Arch Linux
    - CentOS Stream
    - RHEL
    - Rocky Linux
    - Alma Linux

    And the following output formats are supported:

    - GPT disk images built with systemd-repart
    - Tar archives
    - CPIO archives (for building initramfs images)
    - USIs (Unified System Images which are full OS images packed in a UKI)
    - Sysext, confext and portable images
    - Directory trees

    For example, to build an Arch Linux GPT disk image and boot it in qemu, you can run the following command:

    $ mkosi -d arch -p systemd -p udev -p linux -t disk qemu

    To instead boot the image in systemd-nspawn, replace qemu with boot:

    $ mkosi -d arch -p systemd -p udev -p linux -t disk boot

    The actual image can be found in the current working directory named image.raw. However, using a separate output directory is recommended which is as simple as running mkdir mkosi.output. To rebuild the image after it's already been built once, add -f to the command line before the verb to rebuild the image. Any arguments passed after the verb are forwarded to either systemd-nspawn or qemu itself. To build the image without booting it, pass build instead of boot or qemu or don't pass a verb at all. By default, the disk image will have an appropriately sized root partition and an ESP partition, but the partition layout and contents can be fully customized using systemd-repart by creating partition definition files in mkosi.repart/. This allows you to customize the partition as you see fit:

    - The root partition can be encrypted.
    - Partition sizes can be customized.
    - Partitions can be protected with signed dm-verity.
    - You can opt out of having a root partition and only have a /usr partition instead.
    - You can add various other partitions, e.g. an XBOOTLDR partition or a swap partition.
    - ...

    As part of building the image, we'll run various tools such as systemd-sysusers, systemd-firstboot, depmod, systemd-hwdb and more to make sure the image is set up correctly. 
    Configuring mkosi image builds

    Naturally with extended use you don't want to specify all settings on the command line every time, so mkosi supports configuration files where the same settings that can be specified on the command line can be written down. For example, the command we used above can be written down in a configuration file mkosi.conf:

    [Distribution]
    Distribution=arch

    [Output]
    Format=disk

    [Content]
    Packages=
            systemd
            udev
            linux

    Like systemd, mkosi uses INI configuration files. We also support dropins which can be placed in mkosi.conf.d. Configuration files can also be conditionalized using the [Match] section. For example, to only install a specific package on Arch Linux, you can write the following to mkosi.conf.d/10-arch.conf:

    [Match]
    Distribution=arch

    [Content]
    Packages=pacman

    Because not everything you need will be supported in mkosi, we support running scripts at various points during the image build process where all extra image customization can be done. For example, if it is found, mkosi.postinst is called after packages have been installed. Scripts are executed on the host system by default (in a sandbox), but can be executed inside the image by suffixing the script with .chroot, so if mkosi.postinst.chroot is found it will be executed inside the image. To add extra files to the image, you can place them in mkosi.extra in the source directory and they will be automatically copied into the image after packages have been installed.

    Bootable images

    If the necessary packages are installed, mkosi will automatically generate a UEFI/BIOS bootable image. As mkosi is a systemd project, it will always build UKIs (Unified Kernel Images), except if the image is BIOS-only (since UKIs cannot be used on BIOS). The initramfs is built like a regular image by installing distribution packages and packaging them up in a CPIO archive instead of a disk image. 
    Specifically, we do not use dracut, mkinitcpio or initramfs-tools to generate the initramfs from the host system. ukify is used to assemble all the individual components into a UKI. If you don't want mkosi to generate a bootable image, you can set Bootable=no to explicitly disable this logic.

    Using mkosi for development

    The main requirement to use mkosi for development is that we can build our source code against the image we're building and install it into the image we're building. mkosi supports this via build scripts. If a script named (or is found, we'll execute it as part of the build. Any files put by the build script into $DESTDIR will be installed into the image. Required build dependencies can be installed using the BuildPackages= setting. These packages are installed into an overlay which is put on top of the image when running the build script, so the build packages are available when running the build script but don't end up in the final image. An example script for a project using meson could look as follows:

    #!/bin/sh
    meson setup "$BUILDDIR" "$SRCDIR"
    ninja -C "$BUILDDIR"
    if ((WITH_TESTS)); then
        meson test -C "$BUILDDIR"
    fi
    meson install -C "$BUILDDIR"

    Now, every time the image is built, the build script will be executed and the results will be installed into the image. The $BUILDDIR environment variable points to a directory that can be used as the build directory for build artifacts to allow for incremental builds if the build system supports it. Of course, downloading all packages from scratch every time and re-installing them again every time the image is built is rather slow, so mkosi supports two modes of caching to speed things up. The first caching mode caches all downloaded packages so they don't have to be downloaded again on subsequent builds. Enabling this is as simple as running mkdir mkosi.cache. The second mode of caching caches the image after all packages have been installed but before running the build script. 
    On subsequent builds, mkosi will copy the cache instead of reinstalling all packages from scratch. This mode can be enabled using the Incremental= setting. While there is some rudimentary cache invalidation, the cache can also forcibly be rebuilt by specifying -ff on the command line instead of -f. Note that when running on a btrfs filesystem, mkosi will automatically use subvolumes for the cached images which can be snapshotted on subsequent builds for even faster rebuilds. We'll also use reflinks to do copy-on-write copies where possible. With this setup, by running mkosi -f qemu in the systemd repository, it takes about 40 seconds to go from a source code change to a root shell in a virtual machine running the latest systemd with your change applied. This makes it very easy to test changes to systemd in a safe environment without risk of breaking your host system. Of course, while 40 seconds is not a very long time, it's still more than we'd like, especially if all we're doing is modifying the kernel command line. That's why we have the KernelCommandLineExtra= option to configure kernel command line options that are passed to the container or virtual machine at runtime instead of being embedded into the image. These extra kernel command line options are picked up when the image is booted with qemu's direct kernel boot (using -append), but also when booting a disk image in UEFI mode (using SMBIOS). The same applies to systemd credentials (using the Credentials= setting). These settings allow configuring the image without having to rebuild it, which means that you only have to run mkosi qemu or mkosi boot again afterwards to apply the new settings.

    Building images without root privileges and loop devices

    By using newuidmap/newgidmap and systemd-repart, mkosi is able to build images without needing root privileges. 
    As long as proper subuid and subgid mappings are set up for your user in /etc/subuid and /etc/subgid, you can run mkosi as your regular user without having to switch to root. Note that as of the writing of this blog post this only applies to the build and qemu verbs. Booting the image in a systemd-nspawn container with mkosi boot still needs root privileges. We're hoping to fix this in a future systemd release. Regardless of whether you're running mkosi with root or without root, almost every tool we execute is invoked in a sandbox to isolate as much of the build process from the host as possible. For example, /etc and /var from the host are not available in this sandbox, to avoid host configuration inadvertently affecting the build. Because systemd-repart can build disk images without loop devices, mkosi can run from almost any environment, including containers. All that's needed is a UID range with 65536 UIDs available, either via running as the root user or via /etc/subuid and newuidmap. In a future systemd release, we're hoping to provide an alternative to newuidmap and /etc/subuid to allow running mkosi from all containers, even those with only a single UID available.

    Supporting older distributions

    mkosi depends on very recent versions of various systemd tools (v254 or newer). To support older distributions, we implemented so-called tools trees. In short, mkosi can first build a tools image for you that contains all required tools to build the actual image. This can be enabled by adding ToolsTree=default to your mkosi configuration. Building a tools image does not require a recent version of systemd. In the systemd mkosi configuration, we automatically use a tools tree if we detect your distribution does not have the minimum required systemd version installed.

    Configuring variants of the same image using profiles

    Profiles can be defined in the mkosi.profiles/ directory. 
    The profile to use can be selected using the Profile= setting (or --profile=) on the command line. A profile allows you to bundle various settings behind a single recognizable name. Profiles can also be matched on if you want to apply some settings only to a few profiles. For example, you could have a bootable profile that sets Bootable=yes, adds the linux and systemd-boot packages and configures Format=disk to end up with a bootable disk image when passing --profile bootable on the command line.

    Building system extension images

    System extension images extend the base system with an overlay containing additional files, dynamically at runtime. To build system extensions with mkosi, we need a base image on top of which we can build our extension. To keep things manageable, we'll make use of mkosi's support for building multiple images so that we can build our base image and system extension in one go. We start by creating a temporary directory with a base configuration file mkosi.conf with some shared settings:

    [Output]
    OutputDirectory=mkosi.output
    CacheDirectory=mkosi.cache

    Now let's continue with the base image definition by writing the following to mkosi.images/base/mkosi.conf:

    [Output]
    Format=directory

    [Content]
    CleanPackageMetadata=no
    Packages=systemd
             udev

    We use the directory output format here instead of the disk output so that we can build our extension without needing root privileges. Now that we have our base image, we can define a sysext that builds on top of it by writing the following to mkosi.images/btrfs/mkosi.conf:

    [Config]
    Dependencies=base

    [Output]
    Format=sysext
    Overlay=yes

    [Content]
    BaseTrees=%O/base
    Packages=btrfs-progs

    BaseTrees= points to our base image and Overlay=yes instructs mkosi to only package the files added on top of the base tree. We can't sign the extension image without a key. We can generate one by running mkosi genkey, which will generate files that are automatically picked up when building the image. 
    Finally, you can build the base image and the extension by running mkosi -f. You'll find btrfs.raw in mkosi.output, which is the extension image.

    Various other interesting features

    To sign any generated UKIs for secure boot, put your secure boot key and certificate in mkosi.key and mkosi.crt and enable the SecureBoot= setting. You can also run mkosi genkey to have mkosi generate a key and certificate itself. The Ephemeral= setting can be enabled to boot the image in an ephemeral copy that is thrown away when the container or virtual machine exits. ShimBootloader= and BiosBootloader= settings are available to configure shim and grub installation if needed. mkosi can boot directory trees in a virtual machine using virtiofsd. This is very useful for quickly rebuilding an image and booting it, as the image does not have to be packed up as a disk image. There are many more features that we won't go over in detail in this blog post. Learn more about those by reading the documentation.

    Conclusion

    I'll finish with a bunch of links to more information about mkosi and related tooling:

    - Github repository
    - Building RHEL and RHEL UBI images with mkosi
    - My presentation on systemd-repart at ASG 2023
    - mkosi's Matrix channel
    - systemd's mkosi configuration
    - mkosi's mkosi configuration
  • Mike Blumenkrantz: First Bug Down (2024/01/08 00:00)
    Slow Start It’s been a slow start to the year, by which I mean I’ve been buried under an absolute deluge of all the things you can imagine and then also a blizzard. The literal kind, not the kind that used to make great games. Anyway, it’s not all fun and specs in my capacity as CEO of OpenGL. Sometimes I gotta do Real Work. The number one source of Real Work, as always, is my old code the mesa bug tracker. Unfortunately, the thing is completely overloaded with NVIDIA bugs right now, so it was slim pickins. Another Game I’ve Never Heard Of Am I a boomer? Is this what being a boomer feels like? I really have lived long enough to see myself become the villain. Next bug up is from this game called Valheim. I think it’s a LARPing chess game? Something like that? Don’t @ me. This report came in hot over the break with some rad new shading techniques: It looks way cooler if you play the trace, but you get the idea. Pinpoint Accuracy First question: what in the Sam Hill is going on here? Apparently RADV_DEBUG=hang fixes it, which was a curious one since no other env vars affected the issue. This means the problem is somehow caused by an issue related to the actual Vulkan queue submissions, since (according to legendary multipatch chef Samuel “PLZ SEND REVIEWS!!” Pitoiset) this flag synchronizes the queue after every submit. It’s therefore no surprise that renderdoc was useless. When viewed in isolation, each frame is perfect, but when played at speed the synchronization is lost. My first stops, as anyone would expect, were the sites of queue submission in zink. This means flush and present. Now, I know not everyone is going to be comfortable taking this kind of wild, unhinged guess like I did, but stick with me here. The first thing I checked was a breakpoint on zink_flush(), which is where API flush calls filter through. 
There were the usual end-of-frame hits, but there were a fair number of calls originating from glFenceSync, which is the way a developer can subtly inform a GL driver that they definitely know what they’re doing. So I saw these calls coming in, and I stepped through zink_flush(), and I reached this spot:

    if (!batch->has_work) { <-----HERE
       if (pfence) {
          /* reuse last fence */
          fence = ctx->last_fence;
       }
       if (!deferred) {
          struct zink_batch_state *last = zink_batch_state(ctx->last_fence);
          if (last) {
             sync_flush(ctx, last);
             if (last->is_device_lost)
                check_device_lost(ctx);
          }
       }
       if (ctx->tc && !ctx->track_renderpasses)
          tc_driver_internal_flush_notify(ctx->tc);
    } else {
       fence = &batch->state->fence;
       submit_count = batch->state->usage.submit_count;
       if (deferred && !(flags & PIPE_FLUSH_FENCE_FD) && pfence)
          deferred_fence = true;
       else
          flush_batch(ctx, true);
    }

    Now this is a real puzzler, because if you know what you’re doing as a developer, you shouldn’t be reaching this spot. This is the penalty box where I put all the developers who don’t know what they’re doing, the spot where I push up my massive James Webb Space Telescope glasses and say, “No, ackchuahlly you don’t want to flush right now.” Because you only reach this spot if you trigger a flush when there’s nothing to flush. OR DO YOU? For hahas, I noped out the first part of that conditional, ensuring that all flushes would translate to queue submits, and magically the bug went away. It was a miracle. Until I tried to think through what must be happening for that to have any effect.

    Synchronization: You Cannot Escape

    The reason this was especially puzzling is the call sequence was:

    1. end-of-frame flush
    2. present
    3. glFenceSync flush

    which means the last flush was optimized out, instead returning the fence from the end-of-frame flush. And these should be identical in terms of operations the app would want to wait on. 
Except that there’s a present in there, and technically that’s a queue submission, and technically something might want to know if the submit for that has completed? Why yes, that is stupid, but here at SGC, stupidity is our sustenance. Anyway, I blasted out a quick fix, and now you can all go play your favorite chess sim on your favorite driver again.
  • Mike Blumenkrantz: Manifesto (2024/01/02 00:00)
    This Is It.

    It’s been a long break for the blog, but now we’re back and THE MEME FACTORY IS OPEN FOR BUSINESS. —is what I’d say if it were any other year. But it’s not any other year. This is 2024, and 2024 is a very special year. It’s the year a decades-old plan has finally yielded its dividends.

    Truth.

    You’ve all heard certain improbable claims before. Big Triangle this. Big Triangle that. Everyone knows who they are. Some have even accused me of being a shill for Big Triangle from time to time. At last, however, I can finally pull off my mask to reveal the truth for the world. I was born for a single purpose. As a child, I was grouped in with a number of other candidates. We were trained. Tested. Forged. Unshakable bonds grew between us, bonds we’ll never forget. Bonds that were threatened and broken again and again through harrowing selection processes that culled our ranks. In time, I was the only one remaining. The only one who survived that brutal gauntlet to fulfill an ultimate goal. The goal of infiltrating Big Triangle. More time passed. Days. Months. Years. I continued my quiet training, never letting on to my true purpose. Now, finally, I’ve achieved the impossible. I’ve attained a status within the ranks of Big Triangle that leaves me in command of vast, unfathomable resources. I have become an officer. I am the chair.

    Revolution.

    Now is the time to rise up, my friends. We must take back the triangles—those big and small, success green and failure red, variable rate shaded and fully shaded, all of them together. We must take them and we must fight. No longer will our goals remain the mere unfulfilled dreams of our basement-dwelling forebearers!

    - OpenGL 10.0 by 2025!
    - Compatibility Profile shall be renamed ‘SLOW MODE’!
    - OpenGL ES shall retroactively convert to a YEAR-MONTH versioning scheme with quarterly releases!
    - Depth values shall be uniformly scaled across all hardware and platforms!
    - XFB shall be outlawed!
    - Linux game ports shall no longer link to LLVM!
    - Coherent API error messages shall be printed!
    - Vendors which cannot ship functional Windows GL drivers shall ship Zink!
    - Native GL drivers on mobile platforms shall be outlawed!
    - gl_PointSize shall be replaced by the constant ‘1.0’ in all cases!
    - Mesh and ray-tracing extensions from NVIDIA shall become core functionality!
    - GLX shall be deleted and forgotten!
    - All bug reports shall contain at least one quality meme in the OP as a form of spam prevention!

    Rise up and join me, your new GL/ES chair, in the glorious revolution!

    DISCLAIMER

    Obviously this is all a joke (except the part where I’m the 🪑, that’s 100% real af), but I still gotta put a disclaimer here because otherwise I’m gonna be in biiiiig trouble if this gets taken seriously. Happy New Year. I missed you.
  • Christian Gmeiner: The Year 2023 in Retrospect (2023/12/26 17:03)
    Holidays are here and I have time to look back at 2023. For six months I have been working for Igalia and what should I say? I ❤️ it! This was the best decision to leave my comfort zone of a normal 9-5 job. I am so proud to work on open source GPU drivers and I am able to spend much of my work time on etnaviv. Driver maintenance Before adding any new feature I thought it would be a great idea to improve the current state of etnaviv’s gallium driver. Therefore I reworked some general driver code to be more consistent and to have a more modern feeling, and made it possible to drop some hand-rolled conversion helpers by switching to already existing solutions (U_FIXED(..), S_FIXED(..), float_to_ubyte(..)). I worked through the low-hanging fruit of crashes seen in CI runs and fixed many of them. Feature-wise, I also looked at some easy-to-implement extensions like GL_NV_conditional_render and GL_OES_texture_half_float_linear. Besides the gallium driver I also worked on some NIR and isaspec features that are beneficial for etnaviv. XDC2023 A personal highlight was to give a talk about etnaviv at XDC2023 in person. You might wonder what happened since mid October in etnaviv land. GLES3 I worked on some features that are needed to expose GLES3 and it turned out that a backend compiler that is easy to maintain, extend and test is needed. Sadly etnaviv’s current backend compiler does not check any of these boxes. It is so fragile that I only added some needed lowerings to pass some of the dEQP-GLES3.functional.shaders.texture_functions.* tests. Some more fun work regarding some feature emulation is on the horizon and it’s blocked again by the current compiler. Backend Compiler etnaviv includes an isaspec-powered disassembler now - a small step towards a new backend compiler. Next on the road to success is the etnaviv backend IR with an assembler.
The new backend compiler is able to run OpenCL kernels with the help of rusticl but I want to land the new backend compiler in smaller chunks that are easier to review. Multiple Render Targets During my XDC presentation I talked about a feature I got working on GC7000L - Multiple Render Targets (MRT). At this point it was more or less a proof-of-concept regarding the gallium drivers. There were some missing bits and registers for full support on more GPU models, and therefore more reverse-engineering work was needed. Also the gallium driver needed lots of work to add support for MRT. Some weeks later I had MRT working on a wider range of Vivante GPUs that support this feature. This includes GC2000, GC3000 and GC7000 models among others. As etnaviv makes heavy use of GPU features, it should work on even more models. Looking forward to 2024 I am really confident that we will see GLES3 and OpenCL for etnaviv. As driver testing is quite important for my work I will expand my current board farm and will look into the new star in the CI world - ci-tron. With that, have a happy holiday season and we’ll be back with more improvements in 2024!
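The conversion helpers mentioned above (U_FIXED(..), S_FIXED(..), float_to_ubyte(..)) come from Mesa's shared utility code. As a rough illustration of what such helpers do, here is a minimal Python sketch; the rounding details are assumptions for the example, not the actual Mesa implementations:

```python
def float_to_ubyte(f: float) -> int:
    """Clamp a float to [0, 1] and scale it to an 8-bit unsigned value."""
    f = min(max(f, 0.0), 1.0)
    return int(f * 255.0 + 0.5)  # round to nearest

def u_fixed(value: float, frac_bits: int) -> int:
    """Convert a float to unsigned fixed point with frac_bits fractional bits."""
    return int(value * (1 << frac_bits))

print(float_to_ubyte(1.0))  # 255
print(u_fixed(1.5, 4))      # 24, i.e. 1.5 * 16
```

Replacing hand-rolled versions of conversions like these with the shared helpers is exactly the kind of cleanup that keeps a gallium driver consistent with the rest of Mesa.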
  • Ricardo Garcia: Vulkan extensions Igalia helped ship in 2023 (2023/12/22 14:50)
    Last year I wrote a recap of the Vulkan extensions Igalia helped ship in 2022, and in this post I’ll do the exact same for 2023. For context and quoting the previous recap: The ongoing collaboration between Valve and Igalia lets me and some of my colleagues work on improving the open-source Vulkan and OpenGL Conformance Test Suite. This work is essential to ship quality Vulkan drivers and, from the Khronos side, to improve the Vulkan standard further by, among other things, adding new functionality through API extensions. When creating a new extension, apart from reaching consensus among vendors about the scope and shape of the new APIs, CTS tests are developed in order to check the specification text is clear and vendors provide a uniform implementation of the basic functionality, corner cases and, sometimes, interactions with other extensions. In addition to our CTS work, many times we review the Vulkan specification text from those extensions we develop tests for. We also do the same for other extensions and changes, and we also submit fixes and improvements of our own. So, without further ado, this is the list of extensions we helped ship in 2023. VK_EXT_attachment_feedback_loop_dynamic_state This extension builds on last year’s VK_EXT_attachment_feedback_loop_layout, which is used by DXVK 2.0+ to more efficiently support D3D9 games that read from active render targets. The new extension shipped this year adds support for setting attachment feedback loops dynamically on command buffers. As with all extensions that add more dynamic state, the goal here is to reduce the number of pipeline objects applications need to create, which makes using the API more flexible. It was created by our beloved super good coder and Valve contractor Mike Blumenkrantz. We reviewed the spec and are listed as contributors, and we wrote dynamic variants of the existing CTS tests.
VK_EXT_depth_bias_control A new extension proposed by Joshua Ashton that also helps with layering D3D9 on top of Vulkan. The original problem is quite specific. In D3D9 and other APIs, applications can specify what is called a “depth bias” for geometry using an offset that is to be added directly as an exact value to the original depth of each fragment. In Vulkan, however, the depth bias is expressed as a factor of “r”, where “r” is a number that depends on the depth buffer format and, furthermore, may not have a specific fixed value. Implementations can use different values of “r” in an acceptable range. The mechanism provided by Vulkan without this extension is useful to apply small offsets and solve some problems, but it’s not useful to apply large offsets and/or emulate D3D9 by applying a fixed-value bias. The new extension solves these problems by giving apps the chance to control depth bias in a precise way. We reviewed the spec and are listed as contributors, and wrote CTS tests for this extension to help ship it. VK_EXT_dynamic_rendering_unused_attachments This extension was proposed by Piers Daniell from NVIDIA to lift some restrictions in the original VK_KHR_dynamic_rendering extension, which is used in Vulkan to avoid having to create render passes and framebuffer objects. Dynamic rendering is very interesting because it makes the API much easier to use and, in many cases and especially on desktop platforms, it can be shipped without any associated performance loss. The new extension relaxes some restrictions that made pipelines more tightly coupled with render pass instances. Again, the goal here is to be able to reuse the same pipeline object with multiple render pass instances and remove some combinatorial explosions that may occur in some apps. We reviewed the spec and are listed as contributors, and wrote CTS tests for the new extension.
VK_EXT_image_sliced_view_of_3d Shipped at the beginning of the year by Mike Blumenkrantz, the extension again helps with emulating other APIs on top of Vulkan. Specifically, the extension allows creating 3D views of 3D images such that the views contain a subset of the slices in the image, using a Z offset and range, in the same way D3D12 allows. We reviewed the spec, we’re listed as contributors, and we wrote CTS tests for it. VK_EXT_pipeline_library_group_handles This one comes from Valve contractor Hans-Kristian Arntzen, who is mostly known for working on Proton projects like VKD3D-Proton. The extension is related to ray tracing and adds more flexibility when creating ray tracing pipelines. Ray tracing pipelines can hold thousands of different shaders and are sometimes built incrementally by combining so-called pipeline libraries that contain subsets of those shaders. However, to properly use those pipelines we need to create a structure called a shader binding table, which is full of shader group handles that have to be retrieved from pipelines. Prior to this extension, shader group handles from pipeline libraries had to be requeried once the final pipeline was linked, as they were not guaranteed to be constant throughout the whole process. With this extension, an implementation can tell apps they will not modify shader group handles in subsequent link steps, which makes it easier for apps to build shader binding tables. More importantly, this also more closely matches functionality in DXR 1.1, making it easier to emulate DirectX Raytracing on top of Vulkan raytracing. We reviewed the spec, we’re listed as contributors and we wrote CTS tests for it. VK_EXT_shader_object Shader objects is probably the most notorious extension shipped this year, and we contributed small bits to it. This extension makes every piece of state dynamic and removes the need to use pipelines.
It’s always used in combination with dynamic rendering, which also removes render passes and framebuffers as explained above. This results in great flexibility from the application point of view. The extension was created by Daniel Story from Nintendo, and its vast set of CTS tests was created by Žiga Markuš, but we added our grain of sand by reviewing the spec and proposing some changes (which is why we’re listed as contributors), as well as fixing some shader object tests and providing some improvements here and there once they had been merged. A good part of this work was done in coordination with Mesa developers who were working on implementing this extension for different drivers. VK_KHR_video_encode_h264 and VK_KHR_video_encode_h265 Fresh out of the oven, these Vulkan Video extensions allow leveraging the hardware to efficiently encode H.264 and H.265 streams. This year we’ve been doing a ton of work related to Vulkan Video in drivers, libraries like GStreamer and CTS/spec, including the two extensions mentioned above. Although not listed as contributors to the spec in those two Vulkan extensions, our work played a major role in advancing the state of Vulkan Video and getting them shipped. Epilogue That’s it for this year! I’m looking forward to helping ship more extension work next year and to doing my part in making Vulkan drivers on Linux (and other platforms!) more stable and feature rich. My Vulkan Video colleagues at Igalia have already started work on future Vulkan Video extensions for AV1 and VP9. Hopefully some of that work is ratified next year. Fingers crossed!
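The depth-bias mismatch that motivates VK_EXT_depth_bias_control can be illustrated with a little arithmetic. This is a simplified Python sketch of the idea, not the exact spec formula (which also involves clamping and depth-format details): two conforming Vulkan implementations may pick different values of "r", so the same parameters can produce different offsets, while a D3D9-style bias is an exact value.

```python
def vulkan_depth_bias(max_slope, r, constant_factor, slope_factor):
    # Classic Vulkan depth bias (simplified): the constant term is
    # scaled by an implementation-dependent value "r".
    return slope_factor * max_slope + constant_factor * r

def d3d9_style_bias(depth_bias):
    # D3D9-style bias: added directly as an exact value, which is what
    # VK_EXT_depth_bias_control lets applications express.
    return depth_bias

# Same parameters, two plausible "r" values, different results:
print(vulkan_depth_bias(0.0, 2 ** -23, 16.0, 0.0))
print(vulkan_depth_bias(0.0, 2 ** -20, 16.0, 0.0))
print(d3d9_style_bias(0.001))
```

This is why a layering project like DXVK cannot faithfully reproduce a game's fixed-value bias through the classic mechanism alone.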
  • Tomeu Vizoso: Etnaviv NPU update 13: Don't cross the tensors (2023/12/21 08:16)
    "Don't cross the streams. It would be bad." IR refactorings A big part of what I have been up to in the past two weeks has been a serious refactoring of the data structures that hold the model data in the different phases until the HW configuration is generated. What we had was enough for models with trivial control flow such as MobileNetV1, but more recent models for object classification and detection make use of more operations, and those are linked between each other non-sequentially. The image below shows six of the more than a hundred operations in the SSDLite MobileDet model (figure: a small subsection of SSDLite MobileDet). The adds will be "lowered", or converted to a special case of convolution in which the two input tensors are concatenated together as two channels of a single tensor, and the last convolution in the fragment will need to have its input tensor processed to remove the stride, as the HW doesn't support those natively. The processing of this tensor will be performed in an additional job that will run on the TP (tensor processing) cores in the NPU. As you can probably imagine, the modifications to the operation graph will be far from trivial without the right data structures, so I looked at ways of refactoring the code that translates the model as given by TensorFlow Lite to the HW operations. For now I have settled on having a separate data structure for the tensors, and having the operations refer to their input and output tensors by their indices in that list. In the future, I think we should move to intermediate representations more akin to what is used in compilers, to support more complex lowerings of operations and reorganizations of the operations inside the model. I will be thinking about this later next year, once I get object detection with SSDLite MobileDet running at a useful performance level.
Ideally I would like to reuse NIR so drivers can do all the lowerings and optimizations they need without having to reinvent so much of an IR, but if it turns out that operations on tensors aren't a good fit for NIR, then I will think about doing something similar just for them. For NPUs with programmable cores it could be very interesting to have a pipeline of transformations that can go from very high-level operations to GPGPU instructions, probably starting from a standard such as MLIR. Tensor addition I also put some time into pulling together all the information I gathered about how the proprietary driver interacts with the HW when submitting tensor addition jobs, and spent a substantial amount of time looking at the different parameter combinations in a spreadsheet, with liberal use of CORREL() to get a hint of which parameters of the high-level operations are used as inputs in the formulas that produce the HW configuration. Lowering the strides Similarly to the above, there was a lot of staring at a spreadsheet for the parameters of the TP jobs that transform the input tensor of a convolution with a stride different from one. Status and next steps Below is a rendering of the whole operation graph for the SSDLite MobileDet model, so people can get an idea of the dimensions and complexity of a modern model for edge object detection (figure: the whole of SSDLite MobileDet). The model is currently running without anything exploding too badly, and all the convolutions are running correctly when run independently. But when run together, I see some bad results starting to flow around the middle of the graph, so that is what I will be debugging next.
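The refactoring described above, a flat list of tensors with operations referring to their inputs and outputs by index, can be sketched roughly like this in Python (all names are illustrative, not the actual etnaviv/teflon code):

```python
from dataclasses import dataclass, field

@dataclass
class Tensor:
    dims: tuple  # e.g. (N, H, W, C)

@dataclass
class Operation:
    kind: str    # "CONV2D", "ADD", ...
    inputs: list   # indices into the model's tensor list
    outputs: list  # indices into the model's tensor list

@dataclass
class Model:
    tensors: list = field(default_factory=list)
    operations: list = field(default_factory=list)

    def add_tensor(self, dims):
        self.tensors.append(Tensor(dims))
        return len(self.tensors) - 1  # index used to reference this tensor

# Non-sequential graphs fall out naturally: any two operations can feed
# the same tensor indices into a later ADD, as in SSDLite MobileDet.
m = Model()
a = m.add_tensor((1, 8, 8, 16))
b = m.add_tensor((1, 8, 8, 16))
out = m.add_tensor((1, 8, 8, 16))
m.operations.append(Operation("ADD", inputs=[a, b], outputs=[out]))
print(m.operations[0].inputs)  # [0, 1]
```

Because operations hold only indices, a lowering pass can splice in a new TP job (for example, the stride-removal job mentioned above) by appending a tensor and rewriting a couple of indices, without touching the rest of the graph.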
  • Melissa Wen: The Rainbow Treasure Map Talk: Advanced color management on Linux with AMD/Steam Deck. (2023/12/20 12:00)
    Last week marked a major milestone for me: the AMD driver-specific color management properties reached the upstream linux-next! And to celebrate, I’m happy to share the slide notes from my 2023 XDC talk, “The Rainbow Treasure Map” along with the individual recording that just dropped last week on YouTube – talk about happy coincidences! Steam Deck Rainbow: Treasure Map & Magic Frogs While I may be bubbly and chatty in everyday life, the stage isn’t exactly my comfort zone (hallway talks are more my speed). But the journey of developing the AMD color management properties was so full of discoveries that I simply had to share the experience. Witnessing the fantastic work of Jeremy and Joshua bring it all to life on the Steam Deck OLED was like uncovering magical ingredients and whipping up something truly enchanting. For XDC 2023, we split our Rainbow journey into two talks. My focus, “The Rainbow Treasure Map,” explored the new color features we added to the Linux kernel driver, diving deep into the hardware capabilities of AMD/Steam Deck. Joshua then followed with “The Rainbow Frogs” and showed the breathtaking color magic released on Gamescope thanks to the power unlocked by the kernel driver’s Steam Deck color properties. Packing a Rainbow into 15 Minutes I had so much to tell, but a half-slot talk meant crafting a concise presentation. To squeeze everything into 15 minutes (and calm my pre-talk jitters a bit!), I drafted and practiced those slides and notes countless times. So grab your map, and let’s embark on the Rainbow journey together! Intro: Hi, I’m Melissa from Igalia and welcome to the Rainbow Treasure Map, a talk about advanced color management on Linux with AMD/SteamDeck. Useful links: First of all, if you are not used to the topic, you may find these links useful. XDC 2022 - I’m not an AMD expert, but… - Melissa Wen XDC 2022 - Is HDR Harder?
- Harry Wentland XDC 2022 Lightning - HDR Workshop Summary - Harry Wentland Color management and HDR documentation for FOSS graphics - Pekka Paalanen et al. Cinematic Color - 2012 SIGGRAPH course notes - Jeremy Selan AMD Driver-specific Properties for Color Management on Linux (Part 1) - Melissa Wen Context: When we talk about colors in the graphics chain, we should keep in mind that we have a wide variety of source content colorimetry, a variety of output display devices and also the internal processing. Users expect consistent color reproduction across all these devices. The userspace can use GPU-accelerated color management to get it. But this also requires an interface with display kernel drivers that is currently missing from the DRM/KMS framework. Since April, I’ve been bothering the DRM community by sending patchsets from the work of me and Joshua to add driver-specific color properties to the AMD display driver. In parallel, discussions on defining a generic color management interface are still ongoing in the community. Moreover, we are still not clear about the diversity of color capabilities among hardware vendors. To bridge this gap, we defined a color pipeline for Gamescope that fits the latest versions of AMD hardware. It delivers advanced color management features for gamut mapping, HDR rendering, SDR on HDR, and HDR on SDR. AMD/Steam Deck hardware: AMD frequently releases new GPU and APU generations. Each generation comes with a DCN version with display hardware improvements. Therefore, keep in mind that this work uses the AMD Steam Deck hardware and its kernel driver. The Steam Deck is an APU with a DCN3.01 display driver, a DCN3 family. It’s important to have this information since newer AMD DCN drivers inherit implementations from previous families but also each generation of AMD hardware may introduce new color capabilities. Therefore I recommend that you familiarize yourself with the hardware you are working on.
The AMD display driver in the kernel space: It consists of three layers, (1) the DRM/KMS framework, (2) the AMD Display Manager, and (3) the AMD Display Core. We extended the color interface exposed to userspace by leveraging existing DRM resources and connecting them using driver-specific functions for color property management. Bridging DC color capabilities and the DRM API required significant changes in the color management of AMD Display Manager - the Linux-dependent part that connects the AMD DC interface to the DRM/KMS framework. The AMD DC is the OS-agnostic layer. Its code is shared between platforms and DCN versions. Examining this part helps us understand the AMD color pipeline and hardware capabilities, since the machinery for hardware settings and resource management are already there. The newest architecture for AMD display hardware is the AMD Display Core Next. In this architecture, two blocks have the capability to manage colors: Display Pipe and Plane (DPP) - for pre-blending adjustments; Multiple Pipe/Plane Combined (MPC) - for post-blending color transformations. Let’s see what we have in the DRM API for pre-blending color management. DRM plane color properties: This is the DRM color management API before blending. Nothing! Except two basic DRM plane properties: color_encoding and color_range for the input colorspace conversion, that is not covered by this work. In case you’re not familiar with AMD shared code, what we need to do is basically draw a map and navigate there! We have some DRM color properties after blending, but nothing before blending yet. But much of the hardware programming was already implemented in the AMD DC layer, thanks to the shared code. Still both the DRM interface and its connection to the shared code were missing. That’s when the search begins! AMD driver-specific color pipeline: Looking at the color capabilities of the hardware, we arrive at this initial set of properties. The path wasn’t exactly like that. 
We had many iterations and discoveries until we arrived at this pipeline. The Plane Degamma is our first driver-specific property before blending. It’s used to linearize the color space from encoded values to light linear values. We can use a pre-defined transfer function or a user lookup table (in short, LUT) to linearize the color space. Pre-defined transfer functions for plane degamma are hardcoded curves that go to a specific hardware block called DPP Degamma ROM. It supports the following transfer functions: sRGB EOTF, BT.709 inverse OETF, PQ EOTF, and pure power curves Gamma 2.2, Gamma 2.4 and Gamma 2.6. We also have a one-dimensional LUT. This 1D LUT has four thousand ninety six (4096) entries, the usual 1D LUT size in the DRM/KMS. It’s an array of drm_color_lut that goes to the DPP Gamma Correction block. We also have now a color transformation matrix (CTM) for color space conversion. It’s a 3x4 matrix of fixed points that goes to the DPP Gamut Remap Block. Both pre- and post-blending matrices previously went to the same color block. We worked on detaching them to clear both paths. Now each CTM goes on its own way. Next, the HDR Multiplier. HDR Multiplier is a factor applied to the color values of an image to increase their overall brightness. This is useful for converting images from a standard dynamic range (SDR) to a high dynamic range (HDR). As it can range beyond [0.0, 1.0] subsequent transforms need to use the PQ(HDR) transfer functions. And we need a 3D LUT. But 3D LUT has a limited number of entries in each dimension, so we want to use it in a colorspace that is optimized for human vision. It means in a non-linear space. To deliver it, userspace may need one 1D LUT before 3D LUT to delinearize content and another one after to linearize content again for blending. The pre-3D-LUT curve is called Shaper curve.
Unlike Degamma TF, there are no hardcoded curves for shaper TF, but we can use the AMD color module in the driver to build the following shaper curves from pre-defined coefficients. The color module combines the TF and the user LUT values into the LUT that goes to the DPP Shaper RAM block. Finally, our rockstar, the 3D LUT. 3D LUT is perfect for complex color transformations and adjustments between color channels. 3D LUT is also more complex to manage and requires more computational resources, as a consequence, its number of entries is usually limited. To overcome this restriction, the array contains samples from the approximated function and values between samples are estimated by tetrahedral interpolation. AMD supports 17 and 9 as sizes for a single dimension. Blue is the outermost dimension, red the innermost. As mentioned, we need a post-3D-LUT curve to linearize the color space before blending. This is done by Blend TF and LUT. Similar to shaper TF, there are no hardcoded curves for Blend TF. The pre-defined curves are the same as the Degamma block, but calculated by the color module. The resulting LUT goes to the DPP Blend RAM block. Now we have everything connected before blending. As a conflict between plane and CRTC Degamma was inevitable, our approach doesn’t accept that both are set at the same time. We also optimized the conversion of the framebuffer to wire encoding by adding support for pre-defined CRTC Gamma TF. Again, there are no hardcoded curves and TF and LUT are combined by the AMD color module. The same types of shaper curves are supported. The resulting LUT goes to the MPC Gamma RAM block. Finally, we arrived at the final version of the DRM/AMD driver-specific color management pipeline. With this knowledge, you’re ready to better enjoy the rainbow treasure of AMD display hardware and the world of graphics computing. With this work, Gamescope/Steam Deck embraces the color capabilities of the AMD GPU.
We highlight here how we map the Gamescope color pipeline to each AMD color block. Future work: The search for the rainbow treasure is not over! The Linux DRM subsystem contains many hidden treasures from different vendors. We want more complex color transformations and adjustments available on Linux. We also want to expose all GPU color capabilities from all hardware vendors to the Linux userspace. Thanks Joshua and Harry for this joint work and the Linux DRI community for all feedback and reviews. The amazing part of this work comes in the next talk with Joshua and The Rainbow Frogs! Any questions? References: Slides of the talk The Rainbow Treasure Map. Youtube video of the talk The Rainbow Treasure Map. Patch series for AMD driver-specific color management properties (upstream Linux 6.8v). SteamDeck/Gamescope color management pipeline XDC 2023 website. Igalia website.
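As a rough illustration of the pre-blending stages the talk describes, here is a small Python sketch of a pure power degamma curve (linearization) followed by a 3x4 color transformation matrix. It is a simplification: the real hardware works with fixed-point LUTs and matrices, and the helper names here are made up for the example.

```python
def degamma_power(c: float, gamma: float = 2.2) -> float:
    """Pre-defined pure power transfer function: encoded value -> linear light."""
    return c ** gamma

def apply_ctm(rgb, matrix):
    """Apply a 3x4 CTM: a 3x3 linear part plus a per-channel offset column."""
    r, g, b = rgb
    return tuple(
        matrix[row][0] * r + matrix[row][1] * g + matrix[row][2] * b + matrix[row][3]
        for row in range(3)
    )

# Identity matrix with zero offsets leaves the color unchanged.
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]

linear = tuple(degamma_power(c) for c in (0.5, 0.5, 0.5))
print(apply_ctm(linear, identity))
```

The ordering matters: the CTM (gamut remap) is meant to operate on linear-light values, which is why the degamma step comes first in the pipeline.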
  • Dave Airlie (blogspot): radv: vulkan video encode status (2023/12/19 20:29)
    Vulkan 1.3.274 moves the Vulkan encode work out of BETA and moves h264 and h265 into KHR extensions. radv support for the Vulkan video encode extensions has been in progress for a while. The latest branch is at [1]. This branch has been updated for the new final headers. Updated: It passes all of h265 CTS now, but it is failing one h264 test. Initial ffmpeg support is at [2]. [1] [2]
  • Simon Ser: Status update, December 2023 (2023/12/17 22:00)
    Hi all! This month we’ve finally released wlroots 0.17.0! It’s been a long time since the previous release (1 year), we’ll try to ship future releases a bit more frequently. We’re preparing 0.17.1 with a collection of bugfixes, it should be ready soon. I’ve been working on wlr_surface_synced, a new wlroots abstraction to allow surface commits coming from clients to be delayed. This is required to avoid stalling the whole compositor if a client’s GPU work is slow and to implement explicit synchronization. I’ve also been working on a commit-queue-v1 implementation for wlroots and gamescope, which will allow us to get rid of a CPU wait in Mesa. And I’ve put some finishing touches on Rose’s frame scheduler patches. Last, I’ve merged André Almeida’s kernel patches for atomic async page-flips, making it so modern compositors can enable tearing page-flips without having to go through the legacy KMS uAPI. I’ve added OAuth refresh tokens to Having to renew OAuth tokens every year on my clients is annoying, with refresh tokens that’s a thing of the past! I’ve already updated hottub (CI bridge for GitHub) to leverage this, and I’d like to also implement this in hut (CLI tool) and yojo (CI bridge for Codeberg). Note that since has only now started returning refresh tokens on login, users will need to re-login one last time so that the OAuth clients can grab the refresh token. The NPotM is a bit peculiar: I haven’t actually started working on it this month, and it’s not in a usable state yet. It’s go-sqlgen, a Go code generator which takes SQL as input. The goal is to store SQL queries in a separate file, to make them safer (type checking for the arguments) and faster (prepared statements). It’s somewhat similar to sqlc except it aims at being simpler and database-agnostic. There’s still much to do: I’d like to add support for named parameters, check that the number of parameters in the query matches the number of procedure arguments, and make it easy to write migrations.
I’m not yet sure go-sqlgen is worth the trouble: being database-agnostic limits its abilities, perhaps too much. Then comes the usual mix of random smaller updates. I’ve released soju 0.7.0 and goguma 0.6.0 with a few new features and bugfixes. pyonji now understands the b4 config file, so it’s possible to add this file to your project to preconfigure pyonji with a mailing list (example). delthas has implemented account data import in hut, so it’s now easy to migrate accounts between instances, or projects between accounts. go-scfg now supports decoding a configuration file directly into a Go struct, making it unnecessary to hand-roll parsing code (example). I’ll be giving a FOSDEM talk about quirks and gotchas of the IMAP protocol this year. I’ll be happy to say hi if any of you are coming as well. That’s all I have for this month, see you in January!
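go-sqlgen itself is a Go code generator and isn't usable yet, but the underlying idea of parameterized/prepared statements can be illustrated with Python's sqlite3 module (this sketch shows the concept only, not go-sqlgen's API):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Parameterized queries keep the SQL text constant and bind arguments
# separately: the database can cache the compiled statement (faster),
# and arguments can never be confused with SQL syntax (safer).
insert_user = "INSERT INTO users (name) VALUES (?)"
conn.execute(insert_user, ("alice",))

row = conn.execute("SELECT name FROM users WHERE id = ?", (1,)).fetchone()
print(row[0])  # alice
```

A generator like go-sqlgen goes one step further: by reading the SQL at build time it can emit a typed wrapper function per query, so argument-count and type mismatches become compile-time errors instead of runtime ones.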
  • Peter Hutterer: Xorg being removed. What does this mean? (2023/12/14 04:13)
    You may have seen the news that Red Hat Enterprise Linux 10 plans to remove Xorg. But Xwayland will stay around, and given the name overloading and them sharing a git repository there's some confusion over what is Xorg. So here's a very simple "picture". This is the xserver git repository: $ tree -d -L 2 xserver xserver ├── composite ├── config ├── damageext ├── dbe ├── dix ├── doc │   └── dtrace ├── dri3 ├── exa ├── fb ├── glamor ├── glx ├── hw │   ├── kdrive │   ├── vfb │   ├── xfree86 <- this one is Xorg │   ├── xnest │   ├── xquartz │   ├── xwayland │   └── xwin ├── include ├── m4 ├── man ├── mi ├── miext │   ├── damage │   ├── rootless │   ├── shadow │   └── sync ├── os ├── present ├── pseudoramiX ├── randr ├── record ├── render ├── test │   ├── bigreq │   ├── bugs │   ├── damage │   ├── scripts │   ├── sync │   ├── xi1 │   └── xi2 ├── Xext ├── xfixes ├── Xi └── xkb The git repo produces several X servers, including the one designed to run on bare metal: Xorg (in hw/xfree86 for historical reasons). The other hw directories are the other X servers including Xwayland. All the other directories are core X server functionality that's shared between all X servers [1]. Removing Xorg from a distro but keeping Xwayland means building with --disable-xfree86 --enable-xwayland [1]. That's simply it (plus the resulting distro packaging work of course). Removing Xorg means you need something else that runs on bare metal and that is your favourite Wayland compositor. Xwayland then talks to that while presenting an X11-compatible socket to existing X11 applications. Of course all this means that the X server repo will continue to see patches and many of those will also affect Xorg. For those who are running git master anyway. Don't get your hopes up for more Xorg releases beyond the security update background noise [2]. Xwayland on the other hand is actively maintained and will continue to see releases.
But those releases are a sequence [1] of $ git new-branch xwayland-23.x.y $ git rm hw/{kdrive,vfb,xfree86,xnest,xquartz,xwin} $ git tag xwayland-23.x.y In other words, an Xwayland release is the xserver git master branch with all X servers but Xwayland removed. That's how Xwayland can see new updates and releases without Xorg ever seeing those (except on git master of course). And that's how your installed Xwayland has code from 2023 while your installed Xorg is still stuck on the branch created and barely updated after 2021. I hope this helps a bit with the confusion of the seemingly mixed messages sent when you see headlines like "Xorg is unmaintained", "X server patches to fix blah", "Xorg is abandoned", "new Xwayland release". [1] not 100% accurate but close enough [2] historically an Xorg release included all other X servers (Xquartz, Xwin, Xvfb, ...) too so this applies to those servers too unless they adopt the Xwayland release model
  • Melissa Wen: 15 Tips for Debugging Issues in the AMD Display Kernel Driver (2023/12/13 12:25)
    A self-help guide for examining and debugging the AMD display driver within the Linux kernel/DRM subsystem. It’s based on my experience as an external developer working on the driver, and is shared with the goal of helping others navigate the driver code. Acknowledgments: These tips were gathered thanks to the countless help received from AMD developers during the driver development process. The list below was obtained by examining open source code, reviewing public documentation, playing with tools, asking in public forums and also with the help of my former GSoC mentor, Rodrigo Siqueira. Pre-Debugging Steps: Before diving into an issue, it’s crucial to perform two essential steps: 1) Check the latest changes: Ensure you’re working with the latest AMD driver modifications located in the amd-staging-drm-next branch maintained by Alex Deucher. You may also find bug fixes for newer kernel versions on branches that have the name pattern drm-fixes-<date>. 2) Examine the issue tracker: Confirm that your issue isn’t already documented and addressed in the AMD display driver issue tracker. If you find a similar issue, you can team up with others and speed up the debugging process. Understanding the issue: Do you really need to change this? Where should you start looking for changes? 3) Is the issue in the AMD kernel driver or in the userspace?: Identifying the source of the issue is essential regardless of the GPU vendor. Sometimes this can be challenging so here are some helpful tips: Record the screen: Capture the screen using a recording app while experiencing the issue. If the bug appears in the capture, it’s likely a userspace issue, not the kernel display driver. Analyze the dmesg log: Look for error messages related to the display driver in the dmesg log. If the error message appears before the message “[drm] Display Core v...”, it’s not likely a display driver issue.
    If this message doesn’t appear in your log, the display driver wasn’t fully loaded, and you will see a notification that something went wrong here.

    4) AMD Display Manager vs. AMD Display Core: The AMD display driver consists of two components:

    Display Manager (DM): This component interacts directly with the Linux DRM infrastructure. Occasionally, issues can arise from misinterpretations of DRM properties or features. If the issue doesn’t occur on other platforms with the same AMD hardware - for example, it only happens on Linux but not on Windows - it’s more likely related to the AMD DM code.

    Display Core (DC): This is the platform-agnostic part responsible for setting and programming hardware features. Modifications to the DC usually require validation on other platforms, like Windows, to avoid regressions.

    5) Identify the DC HW family: Each AMD GPU has variations in its hardware architecture. Features and helpers differ between families, so determining the relevant code for your specific hardware is crucial. Find GPU product information in the Linux/AMD GPU documentation, or check the dmesg log for the Display Core version (available since this commit, in Linux kernel v6.3). For example:

      [drm] Display Core v3.2.241 initialized on DCN 2.1
      [drm] Display Core v3.2.237 initialized on DCN 3.0.1

    Investigating the relevant driver code: Don’t let unrelated driver code affect your investigation.

    6) Narrow the code inspection down to one DC HW family: The relevant code resides in a directory named after the DC number. For example, the DCN 3.0.1 driver code is located at drivers/gpu/drm/amd/display/dc/dcn301. AMD’s shared code is huge, and you can use these boundaries to rule out code unrelated to your issue.

    7) Newer families may inherit code from older ones: You can find dcn301 using code from the dcn30, dcn20 and dcn10 files. It’s crucial to verify which hooks and helpers your driver utilizes so that you investigate the right portion.
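One way to see which older-family helpers a newer family pulls in is a plain text search over the driver tree. A minimal sketch, assuming you are at the root of a kernel checkout; the regex and paths are illustrative, not exhaustive:

```shell
# List references that the dcn301 directory makes to older-family
# code (dcn10/dcn20/dcn30 prefixed identifiers), deduplicated:
grep -rhoE 'dcn(10|20|30)_[a-z_]+' drivers/gpu/drm/amd/display/dc/dcn301/ \
  | sort -u | head

# A similar search on the resource file shows which function-pointer
# hook tables ("funcs") the family wires up:
grep -n 'funcs' drivers/gpu/drm/amd/display/dc/dcn301/dcn301_resource.c | head
```

This is only a first pass; the hook tables themselves tell you definitively which implementation a given family uses.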
    You can leverage ftrace for supplemental validation. To give an example, it was useful when I was updating DCN3 color mapping to correctly use the new post-blending color capabilities, such as: [PATCH] drm/amd/display: set stream gamut remap matrix to MPC for DCN3+

    Additionally, you can use two different HW families to compare behaviours. If you see the issue in one but not in the other, you can compare the code, understand what has changed, and see whether the implementation from a previous family doesn’t fit the new HW resources or design well. You can also count on the help of the community on the Linux AMD issue tracker to validate your code on other hardware and/or systems. This approach helped me debug a 2-year-old issue where the cursor gamma adjustment was incorrect on DCN3 hardware, but working correctly for the DCN2 family. I solved the issue in two steps, thanks to community feedback and validation:

    [PATCH] drm/amd/display: check attr flag before set cursor degamma on DCN3+
    [PATCH] drm/amd/display: enable cursor degamma for DCN3+ DRM legacy gamma

    8) Check the hardware capability screening in the driver: You can currently find a list of display hardware capabilities in the drivers/gpu/drm/amd/display/dc/dcn*/dcn*_resource.c file, more precisely in the dcn*_resource_construct() function.
    Using DCN301 for illustration, here is the list of its hardware caps:

      /*************************************************
       * Resource + asic cap harcoding                 *
       *************************************************/
      pool->base.underlay_pipe_index = NO_UNDERLAY_PIPE;
      pool->base.pipe_count = pool->base.res_cap->num_timing_generator;
      pool->base.mpcc_count = pool->base.res_cap->num_timing_generator;
      dc->caps.max_downscale_ratio = 600;
      dc->caps.i2c_speed_in_khz = 100;
      dc->caps.i2c_speed_in_khz_hdcp = 5; /*1.4 w/a enabled by default*/
      dc->caps.max_cursor_size = 256;
      dc->caps.min_horizontal_blanking_period = 80;
      dc->caps.dmdata_alloc_size = 2048;
      dc->caps.max_slave_planes = 2;
      dc->caps.max_slave_yuv_planes = 2;
      dc->caps.max_slave_rgb_planes = 2;
      dc->caps.is_apu = true;
      dc->caps.post_blend_color_processing = true;
      dc->caps.force_dp_tps4_for_cp2520 = true;
      dc->caps.extended_aux_timeout_support = true;
      dc->caps.dmcub_support = true;

      /* Color pipeline capabilities */
      dc->caps.color.dpp.dcn_arch = 1;
      dc->caps.color.dpp.input_lut_shared = 0;
      dc->caps.color.dpp.icsc = 1;
      dc->caps.color.dpp.dgam_ram = 0; // must use gamma_corr
      dc->caps.color.dpp.dgam_rom_caps.srgb = 1;
      dc->caps.color.dpp.dgam_rom_caps.bt2020 = 1;
      dc->caps.color.dpp.dgam_rom_caps.gamma2_2 = 1;
      dc->caps.color.dpp.dgam_rom_caps.pq = 1;
      dc->caps.color.dpp.dgam_rom_caps.hlg = 1;
      dc->caps.color.dpp.post_csc = 1;
      dc->caps.color.dpp.gamma_corr = 1;
      dc->caps.color.dpp.dgam_rom_for_yuv = 0;
      dc->caps.color.dpp.hw_3d_lut = 1;
      dc->caps.color.dpp.ogam_ram = 1; // no OGAM ROM on DCN301
      dc->caps.color.dpp.ogam_rom_caps.srgb = 0;
      dc->caps.color.dpp.ogam_rom_caps.bt2020 = 0;
      dc->caps.color.dpp.ogam_rom_caps.gamma2_2 = 0;
      dc->caps.color.dpp.ogam_rom_caps.pq = 0;
      dc->caps.color.dpp.ogam_rom_caps.hlg = 0;
      dc->caps.color.dpp.ocsc = 0;
      dc->caps.color.mpc.gamut_remap = 1;
      dc->caps.color.mpc.num_3dluts = pool->base.res_cap->num_mpc_3dlut; //2
      dc->caps.color.mpc.ogam_ram = 1;
      dc->caps.color.mpc.ogam_rom_caps.srgb = 0;
      dc->caps.color.mpc.ogam_rom_caps.bt2020 = 0;
      dc->caps.color.mpc.ogam_rom_caps.gamma2_2 = 0;
      dc->caps.color.mpc.ogam_rom_caps.pq = 0;
      dc->caps.color.mpc.ogam_rom_caps.hlg = 0;
      dc->caps.color.mpc.ocsc = 1;
      dc->caps.dp_hdmi21_pcon_support = true;

      /* read VBIOS LTTPR caps */
      if (ctx->dc_bios->funcs->get_lttpr_caps) {
              enum bp_result bp_query_result;
              uint8_t is_vbios_lttpr_enable = 0;

              bp_query_result = ctx->dc_bios->funcs->get_lttpr_caps(ctx->dc_bios, &is_vbios_lttpr_enable);
              dc->caps.vbios_lttpr_enable = (bp_query_result == BP_RESULT_OK) && !!is_vbios_lttpr_enable;
      }

      if (ctx->dc_bios->funcs->get_lttpr_interop) {
              enum bp_result bp_query_result;
              uint8_t is_vbios_interop_enabled = 0;

              bp_query_result = ctx->dc_bios->funcs->get_lttpr_interop(ctx->dc_bios, &is_vbios_interop_enabled);
              dc->caps.vbios_lttpr_aware = (bp_query_result == BP_RESULT_OK) && !!is_vbios_interop_enabled;
      }

    Keep in mind that the documentation of the color capabilities is available in the Linux kernel documentation.

    Understanding the development history: What has brought us to the current state?

    9) Pinpoint relevant commits: Use git log and git blame to identify commits targeting the code section you’re interested in.

    10) Track regressions: If you’re examining the amd-staging-drm-next branch, check for regressions between DC release versions. These are defined by DC_VER in the drivers/gpu/drm/amd/display/dc/dc.h file. Alternatively, find a commit with the format drm/amd/display: 3.2.221, which marks a display release; this is useful for bisecting. This information helps you understand how outdated your branch is and identify potential regressions. You can assume each DC_VER takes around one week to be bumped. Finally, check the testing log of each release in the report provided on the amd-gfx mailing list, such as this one: Tested-by: Daniel Wheeler: RE: [PATCH 00/13] DC Patches for Dec 11, 2023

    Reducing the inspection area: Focus on what really matters.
    11) Identify involved HW blocks: This helps isolate the issue. You can find more information about DCN HW blocks in the DCN Overview documentation. In summary: plane issues are closer to HUBP and DPP; blending/stream issues are closer to MPC, OPP and OPTC, and are related to DRM CRTC subjects. This information was useful when debugging a hardware rotation issue where the cursor plane got clipped off in the middle of the screen. Finally, the issue was addressed by two patches:

    [PATCH 21/22] drm/amd/display: Fix rotated cursor offset calculation
    [PATCH] drm/amd/display: fix cursor offset on rotation 180

    12) Issues around bandwidth (glitches) and clocks: These may be affected by calculations done in the HW blocks above and by HW-specific values. The recalculation equations are found in the DML folder. DML stands for Display Mode Library; it’s in charge of all required configuration parameters supported by the hardware for multiple scenarios. See more in the AMD DC Overview kernel docs. It’s a math library that optimally configures hardware to find the best balance between power efficiency and performance in a given scenario. Finding some clk variables that affect device behavior may be a sign of it. It’s hard for an external developer to debug this part, since it involves information from HW specs and firmware programming that we don’t have access to. The best option is to provide all the relevant debugging information you have and ask AMD developers to check the values you suspect.

    One trick: If you suspect the power setup is degrading performance, try setting the amount of power supplied to the GPU to the maximum and see if it affects the system behavior, with this command:

      sudo bash -c "echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level"

    I learned this when debugging glitches with hardware cursor rotation on the Steam Deck. My first attempt was changing the clock calculation.
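If you try that power trick, it is worth noting the current level first so you can restore it when you are done. A minimal sketch; the sysfs path assumes the GPU is card0 (adjust for your system):

```shell
# Note the current performance level before forcing anything
# (path assumes the GPU is card0; adjust if needed):
cat /sys/class/drm/card0/device/power_dpm_force_performance_level

# Force maximum clocks while testing:
sudo bash -c "echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level"

# Restore the default power management behavior afterwards:
sudo bash -c "echo auto > /sys/class/drm/card0/device/power_dpm_force_performance_level"
```

Leaving the level at "high" permanently wastes power, so the restore step matters on battery-powered devices like the Deck.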
    In the end, Rodrigo Siqueira proposed the right solution targeting bandwidth in two steps: a patch series to create a new internal commit sequence, and enabling pipe split on DCN301.

    Checking implicit programming and hardware limitations: Bring implicit programming to the level of consciousness and recognize hardware limitations.

    13) Implicit update types: Check if the selected type for the atomic update may affect your issue. The update type depends on the mode settings, since programming some modes demands more time for hardware processing. More details in the source code:

      /* Surface update type is used by dc_update_surfaces_and_stream
       * The update type is determined at the very beginning of the function based
       * on parameters passed in and decides how much programming (or updating) is
       * going to be done during the call.
       *
       * UPDATE_TYPE_FAST is used for really fast updates that do not require much
       * logical calculations or hardware register programming. This update MUST be
       * ISR safe on windows. Currently fast update will only be used to flip surface
       * address.
       *
       * UPDATE_TYPE_MED is used for slower updates which require significant hw
       * re-programming however do not affect bandwidth consumption or clock
       * requirements. At present, this is the level at which front end updates
       * that do not require us to run bw_calcs happen. These are in/out transfer func
       * updates, viewport offset changes, recout size changes and pixel depth changes.
       * This update can be done at ISR, but we want to minimize how often this happens.
       *
       * UPDATE_TYPE_FULL is slow. Really slow. This requires us to recalculate our
       * bandwidth and clocks, possibly rearrange some pipes and reprogram anything front
       * end related. Any time viewport dimensions, recout dimensions, scaling ratios or
       * gamma need to be adjusted or pipe needs to be turned on (or disconnected) we do
       * a full update. This cannot be done at ISR level and should be a rare event.
       * Unless someone is stress testing mpo enter/exit, playing with colour or adjusting
       * underscan we don't expect to see this call at all.
       */
      enum surface_update_type {
              UPDATE_TYPE_FAST, /* super fast, safe to execute in isr */
              UPDATE_TYPE_MED, /* ISR safe, most of programming needed, no bw/clk change*/
              UPDATE_TYPE_FULL, /* may need to shuffle resources */
      };

    Using tools: Observe the current state, validate your findings, continue improvements.

    14) Use AMD tools to check the hardware state and driver programming: These help you understand your driver settings and check the behavior when changing those settings.

    DC visual confirmation: Check multiple planes and the pipe split policy.
    DTN logs: Check the display hardware state, including rotation, size, format, underflow, blocks in use, color block values, etc.
    UMR: Check ASIC info, register values, and KMS state - links and elements (framebuffers, planes, CRTCs, connectors). Source: UMR project documentation.

    15) Use generic DRM/KMS tools:

    IGT test tools: Use generic KMS tests or develop your own to isolate the issue in kernel space. Compare results across different GPU vendors to understand their implementations and find potential solutions. AMD also has specific IGT tests for its GPUs that are expected to work without failures on any AMD GPU. You can check the results of HW-specific tests using different display hardware families, or you can compare expected differences between the generic workflow and the AMD workflow.
    drm_info: This tool summarizes the current state of a display driver (capabilities, properties and formats) per element of the DRM/KMS workflow. Its output can be helpful when reporting bugs.

    Don’t give up! Debugging issues in the AMD display driver can be challenging, but by following these tips and leveraging available resources, you can significantly improve your chances of success. Worth mentioning: This blog post builds upon my talk, “I’m not an AMD expert, but…”, presented at the 2022 XDC.
It shares guidelines that helped me debug AMD display issues as an external developer of the driver. Open Source Display Driver: The Linux kernel/AMD display driver is open source, allowing you to actively contribute by addressing issues listed in the official tracker. Tackling existing issues or resolving your own can be a rewarding learning experience and a valuable contribution to the community. Additionally, the tracker serves as a valuable resource for finding similar bugs, troubleshooting tips, and suggestions from AMD developers. Finally, it’s a platform for seeking help when needed. Remember, contributing to the open source community through issue resolution and collaboration is mutually beneficial for everyone involved.
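As a practical postscript to tips 14 and 15, here is a rough sketch of how those observation tools are commonly invoked. The debugfs entries assume an amdgpu device at DRI node 0 and a mounted debugfs; treat the exact paths and flags as assumptions to verify on your system:

```shell
# DC visual confirmation: ask the driver to draw per-plane/pipe
# debug colors on screen (debugfs path assumes DRI device 0):
sudo bash -c "echo 1 > /sys/kernel/debug/dri/0/amdgpu_dm_visual_confirm"

# DTN log: dump the current display hardware state:
sudo cat /sys/kernel/debug/dri/0/amdgpu_dm_dtn_log

# drm_info: summarize driver capabilities, properties and formats:
drm_info

# IGT: enumerate subtests of a single KMS test binary from a built
# igt-gpu-tools tree (path assumes a meson build directory):
./build/tests/kms_atomic --list-subtests
```

None of these change persistent state except the visual-confirm toggle, which can be reset by echoing 0 back into the same file.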
  • Ricardo Garcia: I'm playing Far Cry 6 on Linux (2023/12/08 11:38)
    2023-12-10 UPDATE: From Mastodon, arcepi suggested the instability problems that I described below, and which served as a motivation to try Far Cry 6 on Linux, could be coming from having switched from NVIDIA to AMD without reinstalling Windows, because of leftover files from the NVIDIA drivers. This morning I reinstalled Windows to test this and, indeed, the Deathloop and Far Cry 6 crashes seem to be gone (yay!). That would have removed my original motivation to try to run the game on Linux, but it doesn’t take away the main points of the post. Do take into account that the instability doesn’t seem to exist anymore (and I hope this applies to more future titles I play), but it’s still the background story that explains why I decided to install Far Cry 6 on my Fedora 39 system, so the original post follows below.

    If you’ve been paying attention to the evolution of the Linux gaming ecosystem in recent years, including the release of the Steam Deck and the new Steam Deck OLED, it’s likely your initial reaction to the blog post title is a simple “OK”. However, I’m coming from a very particular place, so I wanted to explain my point of view and the significance of this, and hopefully you’ll find the story interesting.

    Figure 1. Steam running on Fedora Linux 39

    As a background, let me say I’ve always gamed on Windows when using my PC. If you think I’m an idiot for doing so lately, especially because my work at Igalia involves frequently interacting with Valve contractors like Samuel Pitoiset, Timur Kristóf, Mike Blumenkrantz or Hans-Kristian Arntzen, you’d be more than right. But hear me out. I’ve always gamed on Windows because it’s the safe bet. With a couple of small kids at home and very limited free time, when I game everything has to just work. No fiddling around with software, config files, or wasting time setting up the software stack. I’m supposed to boot Windows when I want to play, play, and then turn my computer off.
    The experience needs to be as close to a console as possible. And, for anything non-gaming, which is most of it, I’d be using my Linux system. In the last few years, thanks to the work done by Valve, the Linux gaming stack has improved a lot. Despite this, I’ve kept gaming on Windows for a variety of reasons: For a long time, my Linux disk only had a capacity of 128GB, so installing games was not a real possibility due to the amount of disk space they need. Also, I was running Slackware, and installing Steam and getting the whole thing running implied a fair amount of fiddling I didn’t even want to think about. Then, when I was running Fedora on a large disk, I had kids and I didn’t want to take any risks or possibly waste time on that. So, what changed?

    Figure 2. Sapphire Pulse AMD Radeon RX 6700 box

    Earlier this year I upgraded my PC, replacing an old Intel Haswell i7-4770K with a Ryzen R5 7600X, and my GPU changed from an NVIDIA GTX 1070 to a Radeon RX 6700. The jump in CPU power was much bigger and more impressive than the more modest jump in GPU power. But talking about that and the sorry state of the GPU market is a story for another blog post. In any case, I had put up with the NVIDIA proprietary driver for many years and I think, on Windows and for gaming, NVIDIA is the obvious first choice for many people, including me. Dealing with the proprietary blob under Linux was not particularly problematic, especially with the excellent way it’s handled by RPMFusion on Fedora, where essentially you only have to install a few packages and you can mostly forget about it. However, given my recent professional background, I decided to go with an AMD card for the first time. I wanted to use a fully open source graphics stack and I didn’t want to think about making compromises in Wayland support or on other fronts whatsoever.
    Plus, at the time I upgraded my PC, the timing was almost perfect for me to switch to an AMD card, because: AMD cards were, in general, performing better than NVIDIA cards for the same price, except for ray tracing; the RX 6700 non-XT was on sale; it had roughly the same performance as a PS5; and it didn’t draw a ton of power like many recent high-end GPUs (175W, similar to the 1070 and its 150W TDP).

    After the system upgrade, I did notice a few more stability problems when gaming under Windows, compared to what I was used to with an NVIDIA card. You can find thousands of opinions, comments and anecdotes on the Internet about the quality of AMD drivers, and a lot of people say they’re a couple of steps below NVIDIA drivers. It’s not my intention at all to pile on, but it’s true that my own personal experience has been one of generally more crashes in games and more weird situations since I switched to AMD. Normally, it doesn’t get to the point of being annoying at all, but sometimes it’s a bit surprising, and I could definitely notice that increase in instability without any bias on my side, I believe. Which takes us to Far Cry 6.

    A few days ago I finished playing Doom Eternal and its expansions (really nice game, by the way!) and I decided to go with Far Cry 6 next. I’m slowly working my way up through some more graphically demanding games that I didn’t feel comfortable playing on the 1070. I went ahead and installed the game on Windows. Being a big 70GB download (100GB on disk), that took a bit of time. Then I launched it, adjusted the keyboard and mouse settings to my liking and went to the video options menu. The game had chosen the high preset for me and everything looked good, so I attempted to run the in-game benchmark to see if the game performed well with that preset (I love it when games have built-in benchmarks!). After a few seconds on a loading screen, the game crashed and I was back at the desktop.
    “Oh, what a bad way to start!”, I thought, without knowing what lay ahead. I launched the game again; same thing. Over the course of the two hours that followed, I tried everything: launching the main game instead of the benchmark, just in case the bug only happened in the benchmark (nope); lowering quality and resolution; disabling any advanced setting; trying windowed mode, or borderless full screen; VSync off or on; disabling the overlays for Ubisoft, Steam and AMD; rebooting multiple times; uninstalling the drivers normally, as well as using DDU, and installing them again. Same result every time. I also searched the web for people having similar problems, but got no relevant search results anywhere. Yes, a lot of people, both using AMD and NVIDIA, had gotten crashes somewhere in the game under different circumstances, but nobody mentioned specifically being unable to reach any gameplay at all. That day I went to bed tired and a bit annoyed. I was also close to having run the game for 2 hours according to Steam, which is the limit for refunds if I recall correctly. I didn’t want to refund the game, though; I wanted to play it.

    The next day I was ready to uninstall it and move on to another title in my list but, out of pure curiosity, given that I had already spent a good amount of time trying to make it run, I searched for it on the Proton compatibility database to see if it could be run on Linux, and it seemed to be possible. The game appeared to be well supported and it was verified to run on the Deck, which was good because both the Deck and my system have an RDNA2 GPU. In my head I wasn’t fully convinced this could work, because I didn’t know if the problem was in the game (maybe a bug with recent updates) or the drivers or anywhere else (like a hardware problem). And this was, for me, when the fun started. I installed Steam on Linux from the Gnome Software app.
    For those who don’t know it, it’s like an app store for Gnome that acts as a frontend to the package manager.

    Figure 3. Gnome Software showing Steam as an installed application

    Steam showed up there with 3 possible sources: Flathub, an “rpmfusion-nonfree-steam” repo and the more typical “rpmfusion-nonfree” repo. I went with the last option and soon I had Steam in my list of apps. I launched that and authenticated using the Steam mobile app QR code scanning function for logging in (which is a really cool way to log in, by the way, without needing to recall your username and password). My list of installed games was empty and I couldn’t find a way to install Far Cry 6 because it was not available for Linux. However, I thought there should be an easy way to install it and launch it using the famous Proton compatibility layer, and a quick web search revealed I only had to right-click on the game title, select Properties and choose to “Force the use of a specific Steam Play compatibility tool” under the Compatibility section. Click-click-click and, sure, the game was ready to install. I let it download again and launched it. Some stuff pops up about processing or downloading Vulkan shaders and I see it doing some work. In that first launch, the game takes more time to start compared to what I had seen under Windows, but it ends up launching (and subsequent launches were noticeably faster). That includes some Ubisoft Connect stuff showing up before the game starts and so on. Intro videos play normally and I reach the game menu in full screen. No indication that I was running it on Linux whatsoever. I go directly to the video options menu, see that the game again selected the high preset, I turn off VSync and launch the benchmark. Sincerely, honestly, completely and totally expecting it to crash one more time and that would’ve been OK, pointing to a game bug. But no, for the first time in two days this is what I get:

    Figure 4. Far Cry 6 benchmark screenshot displaying the game running at over 100 frames per second

    The benchmark runs perfectly: no graphical glitches, no stuttering, frame rates normally above 100FPS, and I had a genuinely happy and surprised grin on my face. I laughed out loud and my wife asked what was so funny. Effortless. No command lines, no config files, nothing. As of today, I’ve played the game for over 30 hours and the game has crashed exactly once out of the blue. And I think it was an unfortunate game bug. The rest of the time it’s been running as smooth and as perfect as the first time I ran the benchmark. Framerate is completely fine and way over the 0 frames per second I got on Windows because it wouldn’t run. The only problem seems to be that when I finish playing and exit to the desktop, Steam is unable to stop the game completely for some reason (I don’t know the cause) and it shows up as still running. I usually click on the Stop button in the Steam interface after a few seconds, it stops the game and that’s it. No problem synchronizing game saves to the cloud or anything. Just that small bug that, again, only requires a single extra click.

    2023-12-10 UPDATE: From Mastodon, Jaco G and Berto Garcia tell me the game-not-stopping problem is present in all Ubisoft games and is directly related to the Ubisoft launcher. It keeps running after closing the game, which makes Steam think the game is still running. You can try to close it from the tray if you see the Ubisoft icon there and, if that fails, you can stop the game like I described above.

    Then I remember something that had happened a few months before, prior to starting to play Doom Eternal under Windows. I had tried to play Deathloop first, another game in my backlog. However, the game crashed every few minutes and an error window popped up.
    The amount and timing of the crashes didn’t look constant, and lowering the graphics settings would sometimes allow me to play the game a bit longer, but in any case I wasn’t able to finish the game’s intro level without crashes and being very annoyed. Searching for the error message on the web, I saw it looked like a game problem that was apparently affecting not only AMD users but also NVIDIA ones, so I had mentally classified it as a game bug and, similarly to the Far Cry 6 case, I had given up on running the game without refunding it, hoping to be able to play it in the future. Now I was wondering if it was really a game bug and, even if it was, if maybe Proton could have a workaround for it and maybe it could be played on Linux. Again, ProtonDB showed the game to be verified on the Deck, with encouraging recent reports. So I installed Deathloop on Linux, launched it just once and played for 20 minutes or so. No crashes, and I got as far as I had gotten on Windows in the intro level. Again, no graphical glitches that I could see, smooth framerates, etc. Maybe it was a coincidence and I was lucky, but I think I will be able to play the game without issues when I’m done with Far Cry 6.

    In conclusion, this story is another data point that tells us the quality of Proton as a product and software compatibility layer is outstanding. In combination with some high-quality open source Mesa drivers like RADV, I’m amazed the experience can actually be better than gaming natively on Windows. Think about that: the Windows game binary running natively on an official DX12 or Vulkan driver crashes more and doesn’t work as well as the game running on top of a Windows compatibility layer with a graphics API translation layer, on top of a different OS kernel and a different Vulkan driver. Definitely amazing to me, and it speaks volumes about the work Valve has been doing on Linux. Or it could also speak badly of AMD’s Windows drivers, or both.
Sure, some new games on launch have more compatibility issues, bugs that need fixing, maybe workarounds applied in Proton, etc. But even in those cases, if you have a bit of patience, play the game some months down the line and check ProtonDB first (ideally before buying the game), you may be in for a great experience. You don’t need to be an expert either. Not to mention that some of these details are even better and smoother if you use a Steam Deck as compared to an (officially) unsupported Linux distribution like I do.