Designing fair algorithms for data summarization

During my TNA visit to KTH Royal Institute of Technology, we pursued the task of designing efficient algorithms for combinatorial optimisation problems with algorithmic fairness in mind. It has been observed that algorithms designed solely to optimise cost or maximise profit can exhibit inherent bias in their decisions, particularly against certain groups in society such as women and people of colour [1]. Using software systems with inherent bias is questionable, and many jurisdictions prohibit deploying systems known to exhibit bias. As a result, many well-known combinatorial problems have been reformulated with fairness constraints to comply with societal and legal norms. In some instances, introducing fairness constraints makes these problems more challenging to solve, sometimes to the point of computational intractability, necessitating the development of new computational tools.

In this work, we focus on data summarisation tasks; a classical example is processing search queries. For instance, a search for the term “CEO” should return approximately 10 images of CEOs that accurately represent the demographics of the profession. This task can be modelled as a clustering problem, where the images are data points and the distance between two images measures their dissimilarity. The goal is to find a subset of points, called cluster centres, whose quality is measured by an objective function; in our case, we seek cluster centres that minimise the maximum distance from any data point to its closest centre. The cluster centres are then displayed as the search results.
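
To make the objective concrete, here is a minimal Python sketch with made-up two-dimensional points. It computes the k-centre objective and runs Gonzalez's classic farthest-first heuristic, a well-known 2-approximation for the unconstrained k-centre problem; this is illustrative background only, not the algorithm developed in this work.

    import math

    def kcentre_objective(points, centres):
        # Maximum over all points of the distance to the nearest centre.
        return max(min(math.dist(p, c) for c in centres) for p in points)

    def greedy_kcentre(points, k):
        # Gonzalez's farthest-first traversal: repeatedly open a new centre
        # at the point that is currently worst served.
        centres = [points[0]]  # arbitrary first centre
        while len(centres) < k:
            farthest = max(points,
                           key=lambda p: min(math.dist(p, c) for c in centres))
            centres.append(farthest)
        return centres

    points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (5, 5)]
    centres = greedy_kcentre(points, k=2)
    print(centres, kcentre_objective(points, centres))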

It has been observed that even though approximately 30% of CEOs are women, search results often fail to include even three women, and thus fail to reflect reality [2]. To address this, constraints are imposed to ensure that a minimum number of cluster centres from each demographic group is included. This leads to the fair k-centre problem [3], which aims to ensure representation across different demographic groups. While this problem has been well studied and many algorithmic results are known [3], less attention has been given to settings where some images may contain inappropriate or graphic content and must be excluded from summary results. This leads to the fair k-supplier problem, where cluster centres must be selected from a designated subset of data points, called facilities, while simultaneously ensuring fair representation across groups and minimising the same objective as in the k-centre problem.
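
The following sketch pins down this formulation by brute force (exponential time, purely illustrative; the group labels and quotas are hypothetical). Centres may only be chosen among the facilities, each facility carries a demographic label, and a size-k solution must contain at least a prescribed number of centres from each group.

    import math
    from itertools import combinations
    from collections import Counter

    def fair_k_supplier_brute_force(clients, facilities, groups, quotas, k):
        # Exhaustive search over size-k facility subsets meeting the
        # per-group quotas; returns the subset minimising the maximum
        # client-to-nearest-centre distance.
        best, best_cost = None, float('inf')
        for subset in combinations(range(len(facilities)), k):
            counts = Counter(groups[i] for i in subset)
            if any(counts[g] < q for g, q in quotas.items()):
                continue  # fairness constraint violated
            cost = max(min(math.dist(c, facilities[i]) for i in subset)
                       for c in clients)
            if cost < best_cost:
                best, best_cost = subset, cost
        return best, best_cost

    clients = [(0, 0), (4, 0), (0, 4), (4, 4)]
    facilities = [(1, 1), (3, 1), (1, 3), (3, 3)]
    groups = ['A', 'A', 'B', 'B']   # demographic label of each facility
    quotas = {'A': 1, 'B': 1}       # at least one centre per group
    print(fair_k_supplier_brute_force(clients, facilities, groups, quotas, k=2))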

The protagonist of our work is the fair k-supplier problem. The problem is NP-hard, so computing an optimal solution is not feasible in practice; instead, we design polynomial-time algorithms that return a solution within a small factor of the optimum, with smaller factors being more desirable. Taking up this challenge, we close an open problem by presenting a polynomial-time approximation algorithm whose approximation factor matches the known lower bound. Under the current understanding of computational complexity, this is the best approximation factor that can be computed efficiently for the fair k-supplier problem.

Beyond this immediate step, we see several possibilities to extend our framework. In our current setting, we assumed that demographic groups are disjoint, but this is rarely the case in practice. For instance, a person could be non-binary, belong to a minority ethnic group, and be part of a specific socio-economic class, forming intersecting groups. Our current formulation does not handle such scenarios; indeed, with intersecting groups the problem cannot be approximated to any reasonable factor in polynomial time. We are in the process of extending our methods to accommodate intersecting groups by designing fixed-parameter tractable algorithms, in which the exponential blow-up in running time is confined to certain input parameters; when these parameters are small, the algorithms remain practical. The sketch below illustrates how intersecting memberships change the fairness constraint.
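
As a minimal illustration (the representation is hypothetical, not taken from our formulation), with intersecting groups each centre carries a set of group labels, and a single centre may count towards several quotas at once, which is precisely what breaks the disjoint-group analysis.

    from collections import Counter

    def meets_quotas_intersecting(selected_groups, quotas):
        # Count each selected centre towards every group it belongs to.
        counts = Counter(g for memberships in selected_groups
                           for g in memberships)
        return all(counts[g] >= q for g, q in quotas.items())

    # One centre satisfies two quotas simultaneously.
    selected = [{'women', 'minority'}, {'women'}]
    print(meets_quotas_intersecting(selected, {'women': 2, 'minority': 1}))  # True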

On a personal note, this TNA visit has offered me a valuable opportunity to collaborate with and learn from many researchers at KTH Royal Institute of Technology. I was able to engage in discussions with many researchers and to establish a network for future collaborations. These interactions have also broadened my perspective on research and helped me identify the specific problems that I want to focus on in the future.

[1] Kasy, M., Abebe, R.: Fairness, equality, and power in algorithmic decision-making. In Proceedings of the International Conference on Fairness, Accountability, and Transparency, pp. 576–586 (2021). ACM.

[2] Kay, M., Matuszek, C., Munson, S.A.: Unequal representation and gender stereotypes in image search results for occupations. In Proceedings of the ACM Conference on Human Factors in Computing Systems, pp. 3819–3828 (2015). ACM.

[3] Kleindessner, M., Awasthi, P., Morgenstern, J.: Fair k-center clustering for data summarization. In Proceedings of the International Conference on Machine Learning, pp. 3448–3457 (2019). PMLR.