Summary

There’s an emerging field using AI to upgrade our democratic and collective intelligence processes. Given the state of the ones we have, this is clearly needed. Further, if AI presents us with a risk of ruin, a necessary response is harnessing it to improve society’s ability to coordinate on its use. However, every AI-driven collective intelligence system I’ve seen appears to be fundamentally flawed. I will argue that we need auto-structuring to correct this flaw and to generally improve the approach.

Background

One representative paper, Generative Social Choice, recently came out of Harvard. The authors select a small group demographically representative of a larger population and ask them some questions on a topic, recording numerical ratings and free-form justifications. They then find five clusters of replies and summarize the opinion of each cluster (both steps using GPT-4). The output is five concise statements representing the opinions of the larger population on the topic. This paper was funded by OpenAI’s Democratic Inputs to AI grants program, so expect similar projects to emerge soon.

Another project, Talk to the City, uses LLMs to synthesize public opinion as captured by Collective Response Systems (CRS) such as pol.is. Essentially, once the CRS has output a slate of statements judged to be representative, TttC clusters and summarizes the results into reports using an LLM. The approach described in Democratic Policy Development using Collective Dialogues about AI is similar.

There are others, and they all share a key insight: LLMs can create connective tissue between minds enabling deliberation and cognition previously impractical except in very small groups. They also share a shaky assumption: LLMs can faithfully summarize natural language input.

For example, in Generative Social Choice they provide GPT-4 with participants’ replies followed by a new statement the participants did not see, and ask it to predict how they would respond. The authors write:

In terms of reliability, we have seen that GPT-4 sometimes produces unpopular or imprecise statements. Our current implementation of the generative query increases robustness by generating multiple candidate statements with different approaches, and using the discriminative query to select the best among them.

TttC is quite aware of the problem, which is why a stated goal of the project is to investigate it. On this point, in a project update they report:

…one of the individuals interviewed in one of our studies was a formerly incarcerated individual who said “living a good life after prison is hard but possible” – but our AI broke this down in two atomic arguments, one saying “living a good life after prison is possible” (implicitly suggesting that it is easy) and the other saying “living a good life after prison is hard” (suggesting it might not be possible). While each statement is arguably accurate, they each fail to capture the intended message when read separately.

The credibility gap

We know LLMs produce coherent natural language replies to natural language questions, but does this mean we can trust them to perform large-scale quantitative analysis? What if they have biases which accumulate? Or what if they perform well in most situations but critically fail in a few? We don’t know how to characterize the distribution of such errors, so it’s unclear we can ever prove we’ve accounted for them, meaning the outputs of a democratic process which uses them simply aren’t credible.

Suppose a large CRS were run to collect the opinions of hundreds of thousands of citizens about a proposed tax change, and LLMs were used to summarize the clusters of opinions for politicians to act on. Should the citizens trust it? Perhaps the whole process was published: inputs, intermediate steps, outputs, the processing software, and the trained LLM. Let’s even assume the program and any statistical methods employed, such as k-means clustering, were written in provably correct code which an auditor could run to check its outputs. This rules out every source of error except the LLM’s language transformations. You might think to run another LLM to validate the first one’s outputs, but then how do you trust that second one? And what would validation even look like? The output of a given LLM transformation is a sequence of tokens, and the output of the validating LLM will be another sequence of tokens; determining whether they are appropriately similar is just as intractable as checking whether the first LLM’s outputs were appropriate. In short, trying to correct an LLM with another LLM leads to an infinite regress.

Scaling credibility

LLMs are being used to make CRSs scalable, but their results are not credible. The question is whether we can address the credibility issue while maintaining scalability. Any LLM language transformation has a potential for errors we don’t know how to quantify or correct for. To credibly use them in a democratic process, any transformation they perform should be human-validated when possible, or trivially inspectable by humans in post-hoc review. I believe there are designs which can do this while still operating at scale.

One solution would be to have LLM chat agents assist participants in translating their natural language statements into a formal language. When the participants approve and submit their statements, further processing can be done by provably correct code running comparatively simple algorithms. Here, no LLM work is performed which isn’t validated by a human. Unfortunately, it is unrealistic to expect humans to validate the meaning of statements in a formal language.

Auto-structuring

A better solution is to use LLM chat agents to guide participants into submitting semiformal but still human-readable statements. If the semiformal format (grammar) is carefully selected, any further processing by LLMs will be prone to fewer errors and easier to audit.

Consider a grammar which allows for a list of non-compound sentences. A natural language statement like this:

We should expand our city recycling program to accept compostable plastics. I know that the technology for processing these plastics isn’t fully developed and for a while we might simply be throwing some away, but these programs are a necessary first step to incentivize the investment in building them.

Could be represented like this:

  • I support Springfield’s recycling system adding a program to accept and separate biodegradable plastic.
  • The technology for processing biodegradable plastic is not cost-effective today.
  • The technology for processing biodegradable plastic will be developed if cities such as Springfield take the step to accept and separate them.

If instead we pick a grammar which allows non-compound statements to be arranged into a graph (where links indicate support), it could be represented like this:

  • I support Springfield’s recycling system adding a program to accept and separate biodegradable plastic.
    • I support working towards reducing Springfield’s waste stream.
    • Springfield will eventually produce less waste by implementing a biodegradable plastics program.
      • Collecting biodegradable plastic will generate more waste until cost-effective processing is deployed.
      • Cost-effective processing of biodegradable plastic will be deployed once sufficient material is collected.

To illustrate the grammar, I added more detail than the original statement contained. You could imagine it came from the chat agent asking clarifying questions.
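
Here is a minimal sketch of how such a support graph might be represented in code. This is only an illustration under my own assumptions; the Statement type and its supports field are hypothetical names, not part of any existing grammar.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Statement:
        """One non-compound sentence in the semiformal grammar."""
        text: str
        # Statements whose acceptance supports this one (links in the graph).
        supports: List["Statement"] = field(default_factory=list)

    # The recycling example from above, encoded as a support graph.
    recycling_claim = Statement(
        "I support Springfield's recycling system adding a program to accept "
        "and separate biodegradable plastic.",
        supports=[
            Statement("I support working towards reducing Springfield's waste stream."),
            Statement(
                "Springfield will eventually produce less waste by implementing "
                "a biodegradable plastics program.",
                supports=[
                    Statement("Collecting biodegradable plastic will generate more "
                              "waste until cost-effective processing is deployed."),
                    Statement("Cost-effective processing of biodegradable plastic "
                              "will be deployed once sufficient material is collected."),
                ],
            ),
        ],
    )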

Creating this kind of thing automatically is the hard part. This is what I’m calling auto-structuring. As far as I can tell, modern LLMs can’t do it.

Structure is all you need

To show that this is the only hard piece, let’s assume we have it and outline the rest. So far we’ve had chat agents use auto-structuring to rephrase participant submissions into smaller statements, and sought approval from the participants that these statements represent their views. Next we need to create associations between all statements (from all participants) to understand which agree or disagree with each other. We’ll need LLMs again for this part. To do this, first create text embeddings of every line of every statement. Next, compare every pair of statements, and if their positions in embedding space are at all similar, have an LLM score them on approval or disapproval. The result is a graph showing relationships between all statements and participants. This is the only use of the LLM which isn’t human-validated, but it’s still simple enough to be inspected after the fact, whether by curious participants, skeptical auditors, or process designers interested in how the system is behaving.
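
Here is a rough sketch of that pairwise step, assuming two hypothetical helpers: an embed function that returns a vector for a statement, and an llm_score_agreement function that asks an LLM to rate one statement’s agreement with another from -1 to +1. The similarity threshold is arbitrary; the point is that the only LLM call is a narrow, easily logged comparison.

    import itertools
    import numpy as np

    SIMILARITY_THRESHOLD = 0.5  # only ask the LLM about plausibly related pairs

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def build_agreement_graph(statements, embed, llm_score_agreement):
        """Return (i, j, score) edges between statement indices.

        `embed` and `llm_score_agreement` are placeholders for whatever
        embedding model and LLM the process actually uses.
        """
        vectors = [np.asarray(embed(s)) for s in statements]
        edges = []
        for i, j in itertools.combinations(range(len(statements)), 2):
            if cosine(vectors[i], vectors[j]) >= SIMILARITY_THRESHOLD:
                # Score in [-1, 1]: negative = disagreement, positive = agreement.
                edges.append((i, j, llm_score_agreement(statements[i], statements[j])))
        return edges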

From here, non-LLM algorithms can answer important questions.

  • Which clusters do participants fall into? Something like the Louvain method should work (see the sketch after this list).
  • Which statements are the most representative of each cluster? Run another clustering algorithm inside each cluster while considering the agreement weights, find the sub-clusters and surface the statement with the highest incoming agreement inside each.
  • Which statements are controversial inside each cluster? Similar to above but weight nodes based on simultaneous agreement and disagreement.
  • Which statements are non-controversial? Similar to above but weight on agreement minus disagreement.
  • Which statements divide two clusters? Surface nodes which have high agreement inside the first cluster but disagreement from the second.
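
As a sketch of the first two bullets, here is how the agreement edges from the previous snippet could feed the Louvain implementation shipped with recent versions of networkx. Treating the nodes as statements is my simplification; a production system would also track which participant authored each statement.

    import networkx as nx

    def clusters_and_representatives(edges):
        """Louvain clusters over positive agreement edges, plus the most
        agreed-with statement inside each cluster (bullets one and two above)."""
        G = nx.Graph()
        for i, j, score in edges:
            if score > 0:  # Louvain expects non-negative weights
                G.add_edge(i, j, weight=score)

        clusters = nx.community.louvain_communities(G, weight="weight", seed=0)

        representatives = []
        for cluster in clusters:
            # Highest total incoming agreement from within the same cluster.
            best = max(
                cluster,
                key=lambda n: sum(G[n][m]["weight"] for m in G[n] if m in cluster),
            )
            representatives.append(best)
        return clusters, representatives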

More streamlining and synthesis will surely be needed to communicate the results widely, but I would argue this should be done by analysts, pundits, or journalists (even if they lean on AI to do the work). We cannot remove the bias from such efforts, so we should instead correctly tag it and attribute it to an author.

Next steps

Again, as far as I can tell, sufficiently capable auto-structuring doesn’t exist yet. I’m still poring over the “Argument Mining” literature, as it seems like the nearest frontier of study. For example, Mirko Lenz published a paper in 2022 titled Comparing Unsupervised Algorithms to Construct Argument Graphs. My own attempts to use LLMs in this way have fallen flat. Having converted natural language into structure by hand, I have observed there is no formulaic way of doing it. Determining the structure of a paragraph is a fresh puzzle each time, full of branching paths and dead ends, and LLMs don’t solve these puzzles well in a single pass. An LLM can be run iteratively, but this is expensive. I imagine the pragmatic solution is NLP algorithms, a tuned LLM, and a constraint solver working together.

While I don’t have the solution yet, I know that to build it (and test it), we’re going to need a substantial data set containing pairs of natural language and structured statements. As the grammar we need to use doesn’t exist yet, this data set will have to be constructed from scratch, which will be expensive. One shortcut lies in the fact that it’s much easier to transform the semiformal structure into natural language than the reverse. This means if we have people write their beliefs directly in this semiformal grammar, we can then have LLMs transform each of these into natural language (perhaps in a hundred different styles per example), greatly expanding the number of training pairs we get per unit of human effort. This trick might also be useful in the event we do crack auto-structuring but find it very expensive (perhaps it takes 50 calls to a frontier model to reliably structure one page of natural language input), as we can leverage 100 examples into 10,000 and then use those to find more cost-effective strategies.
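
A minimal sketch of that augmentation loop, assuming a hypothetical call_llm(prompt) wrapper around whatever chat model is available:

    def generate_training_pairs(structured_examples, call_llm, styles_per_example=100):
        """Expand hand-written structured statements into (natural, structured) pairs.

        `call_llm` is a placeholder; any instruction-following model would do.
        """
        pairs = []
        for structured in structured_examples:
            for style in range(styles_per_example):
                prompt = (
                    "Rewrite the following structured statements as one natural "
                    f"paragraph. Use writing style variant #{style}.\n\n{structured}"
                )
                natural = call_llm(prompt)
                # We train on the reverse mapping: natural language in, structure out.
                pairs.append((natural, structured))
        return pairs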

Other benefits

Cracking this wouldn’t just bring credibility to deliberation and collective intelligence systems; it would seriously potentiate them.

Is / ought

Once statements are structured, it will be clear which are “is” factual claims and which are “ought” moral claims. In our recycling example, it would be very interesting to cluster on just the moral claims.

Crux identification

Notice which moral or factual statements split clusters. For moral cruxes, debate might help. For factual cruxes, studies tasked with finding more information should be funded.
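
Sticking with the hypothetical agreement graph from earlier, crux identification could be as simple as surfacing statements whose net agreement has opposite sign in two clusters. This is a sketch of one possible heuristic, not a definitive method:

    def find_cruxes(edges, cluster_a, cluster_b, margin=0.5):
        """Statements endorsed by one cluster and rejected by the other.

        `edges` are the (i, j, score) agreement edges; clusters are sets of
        statement indices. The margin is an arbitrary sensitivity knob.
        """
        def net_agreement(node, cluster):
            scores = [s for i, j, s in edges
                      if (i == node and j in cluster) or (j == node and i in cluster)]
            return sum(scores) / len(scores) if scores else 0.0

        cruxes = []
        for node in cluster_a | cluster_b:
            a = net_agreement(node, cluster_a)
            b = net_agreement(node, cluster_b)
            if abs(a - b) >= margin and a * b <= 0:
                cruxes.append(node)
        return cruxes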

Quality ranking

It’s possible to score a structured input on how logically consistent, complete, or factual it is. You could choose to throw away those below a threshold and surface the best ones. This could encourage people to work harder on their submissions, as their input could “count” more in certain schemes.

Merged cluster maps

It may be possible to merge all the arguments from a certain cluster of people into a larger argument map. This is more useful than a textual summary, and each node could be tagged with the percentage of people in the cluster who endorse it.

Representative agents

Instantiate a chat agent as an “analyst” for each merged cluster map. You could ask it to summarize the map, weigh in on related policies, or even debate another representative. For very large processes with tens of thousands of people in a cluster, this is the only way to get participant beliefs into the context window of the LLM. And it should be relatively bias-free compared to asking an agent to simply pretend to be a member of the cluster. (Consider: how much LLM training data contains documents showing support for recycling alongside concern about nuclear power? These associations from the base model will surface if we’re not careful.)

Conclusion

The appeal of LLMs as intermediaries in large-scale deliberative processes is clear, but before we can practically use them, we need to crack the auto-structuring problem. Doing so will not only allow us to make progress on collective intelligence systems, but will open new frontiers in related problems. I’m excited about the day when software can map and analyze the structured beliefs of every published philosopher. If you’ve read this and feel similarly, please say hello.