This is a quick example of how to call ChatGPT from Python and then use it with data.

Testing an LLM (ChatGPT)

First, we import the openai library and assign our key. You can obtain a key from the OpenAI API platform.

import os
import openai
# Read the key from an environment variable rather than hardcoding it in the notebook
openai.api_key = os.environ["OPENAI_API_KEY"]

Now, we could send a request to ChatGPT right away. However, because each request takes several parameters, it is easier to wrap the call in a function that we can reuse.

def get_chatgpt_response(prompt):
    """Send a single prompt to the chat model and return the reply text."""
    try:
        response = openai.chat.completions.create(
            model="gpt-4o",  # Using GPT-4o
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt},
            ],
            max_tokens=150,  # Caps the reply length; adjust as needed
            temperature=0.7,  # Higher values give more varied output
        )
        return response.choices[0].message.content
    except Exception as e:
        # On failure, return the error text instead of raising
        return str(e)
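
Because max_tokens caps the reply at 150 tokens, a longer answer is cut off mid-sentence. The Chat Completions API reports this through the finish_reason field of each choice, which is "length" when the token limit was reached. The sketch below shows one way to check for it. Note that was_truncated is a hypothetical helper name, and the SimpleNamespace objects are only stand-ins with the same shape as a real response, so the check can be tried without calling the API.

```python
from types import SimpleNamespace


def was_truncated(response):
    """Return True if the first choice stopped because it hit the token limit."""
    return response.choices[0].finish_reason == "length"


# Stand-in objects that mimic the shape of a chat completion response.
# A real response would come from openai.chat.completions.create(...).
cut_off = SimpleNamespace(
    choices=[SimpleNamespace(finish_reason="length",
                             message=SimpleNamespace(content="The details were last"))]
)
complete = SimpleNamespace(
    choices=[SimpleNamespace(finish_reason="stop",
                             message=SimpleNamespace(content="Done."))]
)

print(was_truncated(cut_off))
print(was_truncated(complete))
```

If the check returns True, raising max_tokens or asking for a shorter answer in the prompt are the two obvious adjustments.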

We can then use this function like the example below.

print(get_chatgpt_response("write an email welcoming accounting, finances and computer science faculty to a data science workshop"))

Subject: Welcome to the Data Science Workshop!

Dear Faculty Members of the Accounting, Finance, and Computer Science Departments,

I hope this message finds you well. We are thrilled to welcome you to our upcoming Data Science Workshop, designed specifically for professionals like you who are keen to explore the intersection of data science with accounting, finance, and computer science.

**Date:** [Insert Date]  
**Time:** [Insert Time]  
**Location:** [Insert Location] or [Virtual Platform Information]

This workshop aims to provide you with valuable insights into how data science can enhance research, teaching, and practical applications within your fields. We have curated a series of sessions led by experts who will cover various topics, including:

- Introduction to Data Science and Its Applications

Mixing an LLM with Data

Now, let’s try something with our data. As we saw earlier in the workshop, we can use pandas. Let’s import pandas and load a file.

import pandas as pd
data_source = "/Users/fiacobelli/Dropbox/teaching/datasets/Crimes.csv"
df = pd.read_csv(data_source)

Now we will look at one of its rows using pandas.

df.iloc[0]

Unnamed: 0                                                     4506608
ID                                                             9878952
Case Number                                                   HX529642
Date                                                   12/4/14 9:30 AM
Block                                                  010XX E 47TH ST
IUCR                                                               497
Primary Type                                                   BATTERY
Description             AGGRAVATED DOMESTIC BATTERY: OTHER DANG WEAPON
Location Description                                         APARTMENT
Arrest                                                           False
Domestic                                                          True
Beat                                                               222
District                                                             2
Ward                                                               4.0
Community Area                                                    39.0
FBI Code                                                           04B
X Coordinate                                                 1183896.0
Y Coordinate                                                 1874058.0
Year                                                              2014
Updated On                                                 2/4/16 6:33
Latitude                                                     41.809597
Longitude                                                   -87.601016
Location                                       (41.809597, -87.601016)
Name: 0, dtype: object

We do not want the Unnamed: 0, ID, or Case Number fields, as they are irrelevant for data “finding”, so we keep only the columns from Date onward.

colnames = list(df.columns[3:])  # skip the first three columns
crime = df.loc[0, colnames]
type(crime)

pandas.core.series.Series
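
Selecting columns by position works here, but it silently picks the wrong columns if the column order of the file ever changes. A minimal sketch of dropping the unwanted columns by name instead, using a tiny made-up frame with a few of the column names shown above (the names sample, unwanted, trimmed, and colnames_by_name are for illustration only):

```python
import pandas as pd

# One-row stand-in for the crimes table, with only a few of its columns.
sample = pd.DataFrame({
    "Unnamed: 0": [4506608],
    "ID": [9878952],
    "Case Number": ["HX529642"],
    "Primary Type": ["BATTERY"],
    "Arrest": [False],
})

# Drop the identifier columns by name rather than by position.
unwanted = ["Unnamed: 0", "ID", "Case Number"]
trimmed = sample.drop(columns=unwanted)
colnames_by_name = list(trimmed.columns)
print(colnames_by_name)
```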

Now that we have human-readable output (i.e., attribute-value pairs), we need to pass it to ChatGPT with a prompt. However, crime is a Series, and ChatGPT needs a string (i.e., plain text), so we convert it with str().

myprompt = '''summarize the crime with the following characteristics:'''+str(crime)
get_chatgpt_response(myprompt)

'On December 4, 2014, at 9:30 AM, an incident of aggravated domestic battery involving a dangerous weapon occurred in an apartment located on the 1000 block of E 47th St. This crime was classified under the primary type "Battery" with a specific description of "Aggravated Domestic Battery: Other Dangerous Weapon." The incident was not associated with an arrest. It took place within the jurisdiction of police beat 222, district 2, ward 4, and community area 39. The crime is recorded under the FBI code 04B, and its geographic coordinates are approximately 41.809597 latitude and -87.601016 longitude. The details were last updated on February 4, 201'
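
Calling str(crime) sends the Series exactly as pandas prints it, including the alignment padding and the trailing "Name: 0, dtype: object" line. One alternative, sketched here on a shortened hand-built Series rather than the real row, is to build one "attribute: value" line per field. The variable names are illustrative.

```python
import pandas as pd

# Shortened stand-in for the row selected above.
crime_sample = pd.Series(
    {"Primary Type": "BATTERY", "Location Description": "APARTMENT", "Arrest": False},
    name=0,
)

# str() keeps the pandas footer ("Name: 0, dtype: object").
as_printed = str(crime_sample)

# Joining the items gives plain "attribute: value" lines with no footer.
as_pairs = "\n".join(f"{field}: {value}" for field, value in crime_sample.items())

prompt = "Summarize the crime with the following characteristics:\n" + as_pairs
print(prompt)
```

Either form can be passed to get_chatgpt_response. The second simply avoids sending text that carries no information about the crime.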

Using the same ideas, we can summarize a set of crimes. We will first select a set of crimes and convert them into a string. Then we will pass them to GPT with a prompt.

import pprint
crimes = df[colnames[3:]]  # also skips Date, Block, and IUCR
# .loc[1:10] is label-based and inclusive, so it selects 10 rows
myprompt = "Describe what is common and what is different in this set of 10 crimes with the characteristics in this table: \n"+str(crimes.loc[1:10])
pprint.pprint(get_chatgpt_response(myprompt))

('To analyze the commonalities and differences in the given set of 10 crimes, '
 "let's break down the data considering several key attributes:\n"
 '\n'
 '### Commonalities:\n'
 '1. **Location Type**: Many crimes occurred on public places like streets and '
 'sidewalks, indicating a trend of offenses happening in open areas.\n'
 '2. **Arrest Status**: A majority of the incidents (6 out of 10) resulted in '
 'an arrest.\n'
 '3. **Domestic**: Most crimes (9 out of 10) are not domestic-related, showing '
 'these incidents are likely between individuals not in a domestic '
 'relationship.\n'
 '4. **Year**: The crimes span over a range of years from 2001 to 2015, but '
 'several occurred in the early 2000')
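
Calling str() on a DataFrame follows the pandas display settings, so a wide table may be wrapped into several blocks or have columns elided before the model ever sees it. A minimal sketch of one alternative, assuming CSV text is an acceptable format for the prompt: to_csv(index=False) writes every column of the selected rows. The small frame here is made up for illustration.

```python
import pandas as pd

# Made-up stand-in for a slice of the crimes table.
crimes_sample = pd.DataFrame({
    "Primary Type": ["BATTERY", "THEFT"],
    "Arrest": [False, True],
    "Year": [2014, 2015],
})

# With no path argument, to_csv returns the CSV text as a string.
table_text = crimes_sample.to_csv(index=False)

prompt = "Describe what is common and what is different in these crimes:\n" + table_text
print(prompt)
```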
