
Testing an LLM (ChatGPT)

This is a quick example of how to call ChatGPT from Python and then use it with data.


First, we import the openai library and assign our API key. You can obtain a key from your account on the OpenAI API site.

import os
import openai
# Read the key from an environment variable instead of hard-coding it in the notebook
openai.api_key = os.environ["OPENAI_API_KEY"]

Now, we could go to ChatGPT and ask something right away. However, because there are several parameters to pass each time we make a request, it is easier to create a function that will make this quicker in the future.

def get_chatgpt_response(prompt):
    """Send a single user prompt to the model and return the reply text."""
    try:
        response = openai.chat.completions.create(
            model="gpt-4o",  # Using GPT-4o
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt},
            ],
            max_tokens=150,  # Caps the reply length; adjust as needed
            temperature=0.7,
        )
        return response.choices[0].message.content
    except Exception as e:
        # On any API error, return the error message instead of raising
        return str(e)

We can then use this function like the example below.

print(get_chatgpt_response("write an email welcoming accounting, finances and computer science faculty to a data science workshop"))
Subject: Welcome to the Data Science Workshop!

Dear Faculty Members of the Accounting, Finance, and Computer Science Departments,

I hope this message finds you well. We are thrilled to welcome you to our upcoming Data Science Workshop, designed specifically for professionals like you who are keen to explore the intersection of data science with accounting, finance, and computer science.

**Date:** [Insert Date]  
**Time:** [Insert Time]  
**Location:** [Insert Location] or [Virtual Platform Information]

This workshop aims to provide you with valuable insights into how data science can enhance research, teaching, and practical applications within your fields. We have curated a series of sessions led by experts who will cover various topics, including:

- Introduction to Data Science and Its Applications
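The reply above stops partway through a list. A likely cause is the `max_tokens=150` cap set in `get_chatgpt_response`. The sketch below shows one way to check for this: each choice in a Chat Completions response carries a `finish_reason`, which is `"length"` when generation stopped at the token limit. The helper name `was_truncated` and the stand-in response object are assumptions made for illustration; they are not part of the original notebook.

```python
from types import SimpleNamespace


def was_truncated(response):
    """Return True if the first choice stopped because it hit the token limit."""
    return response.choices[0].finish_reason == "length"


# Stand-in object shaped like a chat completion response (an assumption for
# illustration, so the example runs without making an API call).
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            finish_reason="length",
            message=SimpleNamespace(
                content="- Introduction to Data Science and Its Applications"
            ),
        )
    ]
)

if was_truncated(fake_response):
    print("Reply was cut off; consider raising max_tokens.")
```

With a real response, the same check could be made inside the function before returning the text.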

Mixing LLM with Data.

Now, let’s try something with our data. As we saw earlier in the workshop, we can use pandas. Let’s import pandas and load a file.

import pandas as pd
data_source = "/Users/fiacobelli/Dropbox/teaching/datasets/Crimes.csv"
df = pd.read_csv(data_source)

Now we will look at one of its rows using pandas.

df.iloc[0]
Unnamed: 0                                                     4506608
ID                                                             9878952
Case Number                                                   HX529642
Date                                                   12/4/14 9:30 AM
Block                                                 010XX E 47TH ST
IUCR                                                               497
Primary Type                                                   BATTERY
Description             AGGRAVATED DOMESTIC BATTERY: OTHER DANG WEAPON
Location Description                                         APARTMENT
Arrest                                                           False
Domestic                                                          True
Beat                                                               222
District                                                             2
Ward                                                               4.0
Community Area                                                    39.0
FBI Code                                                           04B
X Coordinate                                                 1183896.0
Y Coordinate                                                 1874058.0
Year                                                              2014
Updated On                                               2/4/16 6:33
Latitude                                                     41.809597
Longitude                                                   -87.601016
Location                                      (41.809597, -87.601016)
Name: 0, dtype: object

However, we do not want the Unnamed: 0, ID, or Case Number fields, as they are irrelevant for data “finding”. Since these are the first three columns, we keep everything from the fourth column onward.

colnames = list(df.columns[3:])
crime = df.loc[0, colnames]
type(crime)
pandas.core.series.Series
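Selecting by position with `df.columns[3:]` works only while the unwanted columns stay in the first three slots. As a hedged alternative, the sketch below drops them by name instead. The one-row `sample` table is a stand-in built from values in the row shown above, and the names `unwanted`, `kept_columns`, and `row` are illustrative.

```python
import pandas as pd

# Tiny stand-in for the crimes table (values copied from the row shown above).
sample = pd.DataFrame(
    {
        "Unnamed: 0": [4506608],
        "ID": [9878952],
        "Case Number": ["HX529642"],
        "Primary Type": ["BATTERY"],
        "Arrest": [False],
    }
)

unwanted = ["Unnamed: 0", "ID", "Case Number"]

# Keep every column whose name is not in the unwanted list.
kept_columns = [c for c in sample.columns if c not in unwanted]

# Selecting a single row label with a list of columns returns a Series.
row = sample.loc[0, kept_columns]
print(list(row.index))
```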

Now that we have human-readable output (i.e., attribute-value pairs), we need to pass it to ChatGPT with a prompt. However, crime is a Series, and ChatGPT needs a string (i.e., plain text), so we convert it with str().

myprompt = '''summarize the crime with the following characteristics:'''+str(crime)
get_chatgpt_response(myprompt)
'On December 4, 2014, at 9:30 AM, an incident of aggravated domestic battery involving a dangerous weapon occurred in an apartment located on the 1000 block of E 47th St. This crime was classified under the primary type "Battery" with a specific description of "Aggravated Domestic Battery: Other Dangerous Weapon." The incident was not associated with an arrest. It took place within the jurisdiction of police beat 222, district 2, ward 4, and community area 39. The crime is recorded under the FBI code 04B, and its geographic coordinates are approximately 41.809597 latitude and -87.601016 longitude. The details were last updated on February 4, 201'
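Calling `str()` on a Series also includes the `Name: 0, dtype: object` footer, which carries no information about the crime. A minimal sketch of an alternative, assuming a small stand-in Series rather than the real row, builds one "attribute: value" line per field. The names `crime_example`, `crime_text`, and `example_prompt` are illustrative.

```python
import pandas as pd

# Stand-in for one crime row, with a few of the fields shown earlier.
crime_example = pd.Series(
    {
        "Primary Type": "BATTERY",
        "Location Description": "APARTMENT",
        "Arrest": False,
    }
)

# One "attribute: value" pair per line, without the Name/dtype footer.
crime_text = "\n".join(
    f"{name}: {value}" for name, value in crime_example.items()
)

example_prompt = (
    "summarize the crime with the following characteristics:\n" + crime_text
)
print(example_prompt)
```

The resulting string could be passed to `get_chatgpt_response` in place of `str(crime)`.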

Using the same ideas, we can summarize a set of crimes. We will first select several crimes and convert them into a string. Then we will pass them to GPT with a prompt.

import pprint
crimes = df[colnames[3:]]
myprompt = "Describe what is common and what is different in this set of 10 crimes with the characteristics in this table: \n"+str(crimes.loc[1:10])
pprint.pprint(get_chatgpt_response(myprompt))
('To analyze the commonalities and differences in the given set of 10 crimes, '
 "let's break down the data considering several key attributes:\n"
 '\n'
 '### Commonalities:\n'
 '1. **Location Type**: Many crimes occurred on public places like streets and '
 'sidewalks, indicating a trend of offenses happening in open areas.\n'
 '2. **Arrest Status**: A majority of the incidents (6 out of 10) resulted in '
 'an arrest.\n'
 '3. **Domestic**: Most crimes (9 out of 10) are not domestic-related, showing '
 'these incidents are likely between individuals not in a domestic '
 'relationship.\n'
 '4. **Year**: The crimes span over a range of years from 2001 to 2015, but '
 'several occurred in the early 2000')
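One thing to watch when passing a table this way: depending on display settings, `str()` on a wide DataFrame may abbreviate columns with `...`, so the model might not see every field. A hedged sketch of one workaround is to serialize the rows as CSV text with `to_csv`, which writes every column. The `crimes_example` table below uses made-up values for illustration, not rows from the dataset.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the crimes file.
crimes_example = pd.DataFrame(
    {
        "Primary Type": ["BATTERY", "THEFT"],
        "Arrest": [False, True],
        "Year": [2014, 2015],
    }
)

# With no path argument, to_csv returns the CSV text as a string.
table_text = crimes_example.to_csv(index=False)
print(table_text)
```

The same call on `crimes.loc[1:10]` would produce a compact table that could be appended to the prompt.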